Supporting 64-bit ARM systems
One might well wonder whether a 64-bit ARM processor is truly needed. 64-bit computing seems a bit rich even for the fanciest handsets or tablets, much less for the kind of embedded controllers where ARM processors predominate. But mobile devices are beginning to push the memory-addressing limits of 32-bit systems; even a 1GB system requires the use of high memory in most configurations. So, even if the heavily foreshadowed ARM server systems never materialize, there will be a need for 64-bit ARM processors just to be able to efficiently use the memory that future mobile devices will have. "Mobile" and "embedded" no longer mean "tiny."
Naturally, Linux support is an important precondition to a successful 64-bit ARM processor introduction, so ARM has been supporting work in that area for some time. The initial GCC patches were posted back in May, and the first set of kernel patches was posted by Catalin Marinas on July 6. All this code exists despite the fact that no 64-bit ARM hardware is yet available; it all has been developed on simulators. Once the hardware shows up, with luck, the software will work with a minimum of tweaking required.
64-bit ARM support involves the addition of thousands of lines of new code via a 36-part patch set. There are some novel features, such as the ability to run with a 64KB native memory page size, and a lot of important technical decisions to be reviewed. So the kernel developers did what one would expect: they started complaining about the name given to the architecture. That name ("AArch64") strikes many as simultaneously redundant (of course it is an architecture) and uninformative (what does "A" stand for?). Many would prefer either ARMv8 (which is the actual hardware architecture name—"AArch64" is ARMv8's 64-bit operating mode) or arm64.
Arguments in favor of the current name include the fact that it is already used to identify the architecture in the ELF triplet used in binaries; using the same name everywhere should help to reduce confusion. But, then, as Arnd Bergmann noted: "If everything else is aarch64, we should use that in the kernel directory too, but if everyone calls it arm64 anyway, we should probably use that name for as many things as possible."
Jon Masters added that, in classic contrarian style, he likes the name as it is; Fedora is planning to use "aarch64" as the name for its 64-bit ARM releases. Others, such as Ingo Molnar, argue in favor of changing the name now, when it is relatively easy to do. Catalin seems inclined to keep the current name but says he will think about it before posting the next version of the patch series.
An arguably more substantive question was raised by a number of developers: wouldn't it make more sense to unify the 32-bit and 64-bit ARM implementations from the outset? Several other architectures (x86, PowerPC, SPARC, and MIPS) started with separate implementations, but ended up merging them later on, usually with some significant pain involved. Rather than leave that pain for future ARM developers, it has been suggested that, perhaps, it would be better to start with a unified implementation.
There are a lot of reasons given for the separate 64-bit ARM architecture implementation. Much of the relevant thinking can be found in this note from Arnd. The 64-bit ARM instruction set is completely different from the 32-bit variety, to the point that there is no possibility of writing assembly code that works on both architectures. The system call interfaces also differ significantly, with the 64-bit version taking a more standard approach and leaving a lot of legacy code behind. The 64-bit implementation hopes to leave the entire 32-bit ARM "platform" concept behind as well; indeed, as Jon put it, there are hopes that it will be possible to have a single kernel binary that runs on all 64-bit ARM systems from the outset. In general, it is said, giving AArch64 a clean start in its own top-level hierarchy will make it possible to leave a lot of ARM baggage behind and will result in a better implementation overall.
Others were quick to point out that most of these arguments have been heard in the context of other architectures. x86_64 was also meant to be a clean start that dumped a lot of old i386 code. In the end, things turned out otherwise. It may be that things are different here; 32-bit ARM has rather more legacy baggage than other architectures did, and the processor differences seem to be larger. Some have said that the proper comparison is with x86 and ia64, though one gets the sense that the AArch64 developers don't want to be seen in the same light as ia64 in general.
This decision will come down to what the AArch64 developers want, in the end; it's up to them to produce a working implementation and to maintain it into the future. If they insist that it should be a separate top-level architecture, it is unlikely that others will block its merging for that reason alone. Of course, it will also be up to those developers to manage a merger of the two in the future, should that prove to be necessary. If nothing else, life as a separate top-level architecture will allow some experimentation without the fear of breaking older 32-bit systems; the result could be a better unified architecture some years from now, should things move in that direction.
Thus far, there has been little in the way of deeper technical criticism of the AArch64 patch set. Things may stay that way. The code has already been through a number of rounds of private review involving prominent developers, so the worst problems should already have been found and addressed. Few developers have the understanding of this new processor that would be necessary to truly understand much of the code. So it may go into the mainline kernel (perhaps as early as 3.7) without a whole lot of substantial changes. After that, all that will be needed is actual hardware; then things should get truly interesting.
Index entries for this article
Kernel: Architectures/Arm
Posted Jul 10, 2012 20:59 UTC (Tue)
by smoogen (subscriber, #97)
[Link] (35 responses)
[And yes I know it is a lot more work than say 6-8 instructions but the amount of work sounds less than the Aarch64 takes.]
Posted Jul 10, 2012 21:39 UTC (Tue)
by jzbiciak (guest, #5246)
[Link] (6 responses)
The original IA64 parts couldn't run x86 code worth a crap, so not only did you have to change everything to use the new 64-bit functionality, you had to change everything before the device was even useful. That's a much bigger hurdle to clear.
Assuming ARMv8 processors run ARMv7 code relatively well outside of AArch64 mode, you could deploy quite a lot of hardware with ARMv8 and only move to 64-bit when needed. I think that was much of the allure of x86-64. Many people ran 32-bit OSes on 64-bit devices for quite a while, until the 64-bit software matured.
Posted Jul 11, 2012 3:01 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (5 responses)
ARM64 doesn't have to run ARM32 code; it just has to be vaguely sane to port code for ARM32 over to ARM64.
Posted Jul 11, 2012 3:27 UTC (Wed)
by jzbiciak (guest, #5246)
[Link] (4 responses)
My understanding is that ARM generally has kept pretty strong source-level compatibility, and that includes at the assembly level (a few esoteric CPxx registers notwithstanding). So, if I've optimized some algorithm in ARMv5 assembly, let's say, I have a pretty good chance of running that code on ARMv7 with at most a recompile, but quite likely just a relink.
That's at least my understanding. If that weren't true, then it seems like ARM could have shed more baggage more often. After all, ARMv7 still "supports" Jazelle (albeit with a "null" implementation) for example. Why would it need to, unless there's the expectation of running some ARMv5J binaries?
If the ARMv8 AArch64 code is as different as everyone's saying, then it's not clear that you'll be compatible at the assembly level at all. You may be compatible at the C/C++ level, except in cases where pointer sizes break you. Porting may be vaguely sane, but I suspect still a bigger chore than moving among existing 32-bit ARM variants.
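As a minimal sketch of that pointer-size breakage (hypothetical code, not from this thread): C that is harmless on ILP32, where int and pointers are both 32 bits wide, silently truncates a pointer on an LP64 target.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int x = 42;
        int *p = &x;

        /* Old 32-bit habit: stash a pointer in an int.  On ILP32 this is
         * lossless; on LP64 the high 32 bits of the pointer are dropped. */
        int cookie = (int)(intptr_t)p;

        /* Round-tripping back through the int may no longer yield p. */
        int *q = (int *)(intptr_t)cookie;

        printf("sizeof(int)=%zu sizeof(void *)=%zu\n",
               sizeof(int), sizeof(void *));
        printf("p=%p q=%p%s\n", (void *)p, (void *)q,
               p == q ? "" : "  (truncated!)");
        return 0;
    }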
In any case, I imagine the first year or two of ARMv8 hardware will come with 32-bit OSes, or at least 32-bit userlands on 64-bit kernels if ARMv8 supports that, until the software stabilizes.
I guess it's time for me to start reading up on ARMv8, eh? *sigh* I just finished absorbing a couple thousand pages of docs on ARM Cortex-A15 and the current round of AXI/ACE protocols....
Posted Jul 11, 2012 3:55 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (1 responses)
I've only dealt with ARM in a few areas like mutex/locking asm helpers, where the best sequence to use is different for different sub-architectures.
I can see that the break of source compatibility would make v8/ARM64 a bigger challenge if perfect source-level ASM compatibility has been the norm up until now. It still sounds more like an ia32-to-x86_64 move than an ia32-to-itanic move, though.
Posted Jul 11, 2012 7:09 UTC (Wed)
by cmarinas (subscriber, #39468)
[Link]
Posted Jul 17, 2012 21:58 UTC (Tue)
by Tuna-Fish (guest, #61751)
[Link] (1 responses)
Posted Jul 17, 2012 22:40 UTC (Tue)
by jzbiciak (guest, #5246)
[Link]
But SP's not a GPR? Wow... It makes a certain amount of sense, really, since stack pointers move in rather particular ways. Treating it specially may help when speculating memory accesses. I'm going to guess they have dedicated ways of generating SP-relative addresses for local variables, along with frame pointers.
Otherwise, it sounds very MIPS-y, at least superficially.
Posted Jul 10, 2012 21:55 UTC (Tue)
by stevenb (guest, #11536)
[Link]
I just hope they're not going to do a 64-bit Thumb :-)
Posted Jul 11, 2012 2:45 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (6 responses)
They implemented ia32 emulation, but so slowly that they'd have been better off not doing so and requiring software emulation with a few helper instructions from the outset.
Because of the big-end-of-town focus, most of the commercial software for Itanium only came in special price-gouging editions, further inflating its already uncompetitive costs.
More importantly, they were really expensive parts to make and were based on a completely new very long instruction word design that turned out not to perform half as well in reality as it did in theory. It also relied on compilers to do a huge amount of work, but the compilers just weren't ready for it. They had huge caches (esp. for the time), low yields, and insanely expensive packaging with complex daughterboards. There was no such thing as a low-cost, entry-level Itanium system; the whole thing was a push-it-down-from-the-top big iron and big clusters concept.
'cos, y'know, that's exactly how ia32 came to dominance over SPARC, PPC, Alpha, etc, right?
It sounds like ARM64 is closer to "pure" x86_64 than Itanium/IA64; it's a major cleanup and extension of ARM32, but recognisably related.
I doubt ARM will screw this up, not least because they'll have learned from the Itanium fiasco. They're sure to have low-cost, entry-level parts available directly or via licensees from the beginning. They'll be focused on keeping costs low, yields high, and porting relatively easy. I'd be surprised if some of the changes in ARM64 weren't there to get rid of hard- or expensive-to-implement instructions and other simplifications for easier manufacturing.
ARM also have another advantage: there isn't much of a legacy base of binary software that people expect to be portable to new arches. It's the norm to have to rebuild binaries for different ARM sub-arches. They don't have Intel's problem of people expecting to be able to run their 1981 copy of QBase on their new Itanium server.
Posted Jul 11, 2012 2:59 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (1 responses)
For an example of insanely expensive packaging, check this image of an Itanium 2 CPU assembly out. Compare to this Intel Pentium III Coppermine CPU package. It isn't hard to see part of why Itanium struggled even before AMD came along with AMD64 and finished it off.

Have a look at the Intel price list for some of the Itanium family. Do you really think they're that much better than a Core i7 or the Xeon variants? I don't. Sure, those are current prices, not historical ones from the time when Itanium used to be almost relevant, but the prices then were at least as bad. Also, notice how the TDP ratings on those parts are absolutely nuts? Some of those parts dissipate 180W and cost nearly $4k in bulk! You aren't going to get many of those into a rack, not least because the CPU packages are so physically huge as well. Now there are further disincentives like the lack of AES-NI instructions in Itanium.
Posted Jul 11, 2012 8:23 UTC (Wed)
by jengelh (guest, #33263)
[Link]
Well, it was about the P4-Netburst time, so the TDP is... accurate :)

As for the prices, the enterprise segment is never understandable to mortal users: the $4k price tag does not even include a next-day replacement service.
Posted Jul 11, 2012 11:29 UTC (Wed)
by kskatrh (guest, #73410)
[Link] (2 responses)
The number I've heard bandied about was $15K for some of the first AArch64 boards/systems. Perhaps that was for one of the big Calxeda boxes, fully populated?
Contrast that with a $60 Gooseberry or a $45 Raspberry Pi (if my order ever gets filled).
I'll be much more interested when I can get something for <$200.
Posted Jul 11, 2012 13:55 UTC (Wed)
by khim (subscriber, #9252)
[Link]
I think you are comparing apples and oranges. I don't believe in a $15K AArch64 system for a nanosecond. Boards, sure; that is about what you can expect for a brand-new architecture. I remember how we developed software for a $5 cryptogadget (that's the retail price; bulk prices were about 2-3 times lower, and the CPU itself was well below $0.50). The development board for that CPU was $5K.
Posted Jul 11, 2012 16:35 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I fondly remember paying several thousand dollars for development boards, with retail prices of the end-user hardware below $200.
Posted Jul 18, 2012 0:10 UTC (Wed)
by plugwash (subscriber, #29694)
[Link]
Maybe that is true of the microcontroller variants of ARM and of kernel code, but userland code on Linux tends to keep running fine on newer versions of ARM. Debian armel will run happily on everything from v4t right through to v7.
Posted Jul 11, 2012 7:19 UTC (Wed)
by gidoca (subscriber, #62438)
[Link] (2 responses)
Posted Jul 11, 2012 14:12 UTC (Wed)
by mwsealey (subscriber, #71282)
[Link] (1 responses)
It's not like PPC, where the instruction encodings are identical to a high degree and 64-bit code barely has a handful of extra instructions for dealing with 64-bit values directly. Power Architecture was designed very, very well from the beginning: the mnemonics are identical and the generated machine code is identical barring the 64-bit specifics. You can't say that about ARM. ARM decided to reinvent the architecture to get rid of all that cruft (but they have to keep compatibility with at least ARMv7).
Most of the available userspaces are going to stay ARMv7-A for a long, long time, just as running a 64-bit kernel with a 32-bit userspace is common on x86 these days (and x32 just compounds it). Those that need 64-bit userspaces will have them.
Posted Jul 11, 2012 22:02 UTC (Wed)
by ncm (guest, #165)
[Link]
DEC provided a translator from x86 to Alpha, and the translated code ran lots faster on Alpha than the original code did on x86. IIRC, even interpreted x86 on Alpha ran as fast as on native x86. It didn't save them in the end. How the market will respond to a new architecture depends on details, and nobody knows which details will turn out to be important.
Compatibility can be a trap: IBM OS/2 ran windos programs unchanged. windos could not run OS/2 programs. Which target should you code for?
Posted Jul 11, 2012 14:39 UTC (Wed)
by Lumag (subscriber, #22579)
[Link] (7 responses)
x86_64 gained additional registers, and that is one of the main reasons why it is really faster than i386 (IA-32).
Posted Jul 11, 2012 17:26 UTC (Wed)
by gnb (subscriber, #5132)
[Link] (6 responses)
That's equally true of AArch64: you get 31 GP registers against 15 in 32-bit ARM mode.
Posted Jul 11, 2012 17:33 UTC (Wed)
by Lumag (subscriber, #22579)
[Link] (5 responses)
P.S. I'm not voting against AArch64. I'm just saying that "just extending registers to 64-bit" probably won't work as expected.
Posted Jul 11, 2012 18:20 UTC (Wed)
by gnb (subscriber, #5132)
[Link]
Posted Jul 12, 2012 22:16 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (3 responses)
For instance, the inner loop of SHA-512 needs 8 64-bit registers, plus an array of 16 64-bit values. On a 32-bit architecture with 15 registers, you will have to spill to the stack (each 64-bit register would be represented by a pair of 32-bit registers, so you would need 16 of them). On a 64-bit architecture with 31 registers, you could have the whole state (8+16 values) in the registers, and still have a few left for the intermediate calculations. The entire inner loop can run without having to touch the stack.
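As a structural sketch of that working set (the rotation amounts are SHA-512's real ones, but the 80-entry round-constant table is declared extern rather than reproduced here, so this illustrates register pressure rather than serving as a usable hash): eight 64-bit state variables plus a 16-word schedule stay live across every iteration.

    #include <stdint.h>

    #define ROTR(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

    extern const uint64_t K[80];   /* standard SHA-512 round constants */

    void sha512_rounds(uint64_t s[8], uint64_t w[16])
    {
        /* Eight live 64-bit values: 8 registers on a 64-bit machine,
         * 16 registers on 32-bit ARM -- more than the 15 available
         * before you even touch the 16-entry schedule below. */
        uint64_t a = s[0], b = s[1], c = s[2], d = s[3];
        uint64_t e = s[4], f = s[5], g = s[6], h = s[7];

        for (int t = 0; t < 80; t++) {
            if (t >= 16) {
                /* Extend the message schedule in place, mod 16. */
                uint64_t w15 = w[(t + 1) % 16], w2 = w[(t + 14) % 16];
                w[t % 16] += w[(t + 9) % 16]
                    + (ROTR(w15, 1) ^ ROTR(w15, 8) ^ (w15 >> 7))
                    + (ROTR(w2, 19) ^ ROTR(w2, 61) ^ (w2 >> 6));
            }
            uint64_t t1 = h + (ROTR(e, 14) ^ ROTR(e, 18) ^ ROTR(e, 41))
                            + ((e & f) ^ (~e & g)) + K[t] + w[t % 16];
            uint64_t t2 = (ROTR(a, 28) ^ ROTR(a, 34) ^ ROTR(a, 39))
                            + ((a & b) ^ (a & c) ^ (b & c));
            h = g; g = f; f = e; e = d + t1;
            d = c; c = b; b = a; a = t1 + t2;
        }
        s[0] += a; s[1] += b; s[2] += c; s[3] += d;
        s[4] += e; s[5] += f; s[6] += g; s[7] += h;
    }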
Posted Jul 14, 2012 22:40 UTC (Sat)
by jzbiciak (guest, #5246)
[Link] (2 responses)
In the specific case of ARMv7 vs. ARMv8, you need to consider NEON. On an ARMv7 with NEON, wouldn't you do the SHA-512 hash using the NEON registers?
(These guys show decent results for other crypto algorithms using NEON. They suggest SHA-512 would speed up well also, but say they "didn't bother" with it yet.)
I guess my point is, ARMv7 already offers a path to way more registers than the base 16x32-bit. I would imagine anyone springing for a Cortex-A15 would also include NEON. NEON is designed to absorb the heavy-duty bulk computation, leaving the 16 GPRs for the more general control stuff.
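To make that concrete, here is a hedged sketch of 64-bit arithmetic in NEON registers using the standard <arm_neon.h> intrinsics: SHA-512's Sigma1 function computed on two 64-bit lanes at once, keeping the values in Q registers instead of the general-purpose ones.

    #include <arm_neon.h>

    /* Rotate both 64-bit lanes right by n (n must be a constant). */
    #define VROTR64(x, n) \
        vorrq_u64(vshrq_n_u64((x), (n)), vshlq_n_u64((x), 64 - (n)))

    /* SHA-512's Sigma1, ROTR(e,14) ^ ROTR(e,18) ^ ROTR(e,41),
     * applied to two independent 64-bit values in one Q register. */
    uint64x2_t sigma1_pair(uint64x2_t e)
    {
        return veorq_u64(veorq_u64(VROTR64(e, 14), VROTR64(e, 18)),
                         VROTR64(e, 41));
    }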
Posted Jul 14, 2012 23:00 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
But if you are trying to build software for many different devices (like a distro needs to), then you can't count on optional features being there.
Posted Jul 15, 2012 4:42 UTC (Sun)
by jzbiciak (guest, #5246)
[Link]
Your point stands, though, for devices that would have to run the fallback version. Those would benefit from ARMv8 without having to use NEON. Plus, keeping the multiple versions around takes up more space, wrangling them is added complexity, etc. etc.
For something like a cell phone, where interpreters like Dalvik have a JIT, the compiled code can match the exact CPU in the phone. After all, that isn't exactly a user-serviceable part. ;-)
Posted Jul 11, 2012 15:52 UTC (Wed)
by iabervon (subscriber, #722)
[Link] (8 responses)
There's nothing terrible about designing an architecture that's just different from the other architecture the chip supports, so long as the new architecture isn't nuts. If arm64 is a different architecture from a company that knows a lot about a certain sweet spot from their 32-bit architecture, that's a whole lot better than IA64 being a different architecture from a company that'd been driven temporarily insane by their 32-bit architecture.
Posted Jul 11, 2012 19:39 UTC (Wed)
by zlynx (guest, #2285)
[Link] (7 responses)
The Itanium was nearly the complete OPPOSITE of a P4 design. In the Itanium design the compiler was responsible for figuring out what memory to preload, what branches to predict and what instructions to run in parallel. The Itanium CPU itself was a very RISCy design in its way without much special logic.
In a P4 and other IA32 designs, the CPU has big piles of logic dedicated to branch predictions, instruction decoding, speculative execution and parallel instruction dispatch with the associated timeline cleanup at the end to make it all appear sequential.
Itanium dropped quite a lot of that, which I think was a very good decision.
Posted Jul 11, 2012 20:46 UTC (Wed)
by khim (subscriber, #9252)
[Link] (6 responses)
P4 is a step in the same direction as Itanic, just not as big. That's why it was merely a "reputation disaster" instead of "billions down the drain". For example, for good performance it needed branch-taken/branch-not-taken hints from the compiler (they were added to x86 specifically for the P4). Yeah, good decision. For AMD, that is. Itanium is designed for a weird and quite exotic corner case: tasks where SMP is unusable yet the compiler is capable of making correct predictions WRT memory access and branch execution. Sure, such tasks do exist (most cryptoalgorithms, for example), but they are rare. We all know the result.
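For reference, compiler-supplied branch hints look like this from C on GCC-family compilers: __builtin_expect(), which the Linux kernel wraps as likely()/unlikely(). The function below is purely illustrative.

    #include <stddef.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum(const long *v, size_t n)
    {
        long total = 0;

        if (unlikely(v == NULL))   /* error path: hinted as not taken */
            return -1;

        for (size_t i = 0; i < n; i++)
            total += v[i];
        return total;
    }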
Posted Jul 11, 2012 20:58 UTC (Wed)
by zlynx (guest, #2285)
[Link] (5 responses)
Is there ANYTHING that CPU silicon can do with an instruction stream that software cannot do? No, not really.
It's not rare at all. Most software improves quite a bit on any CPU when rebuilt with performance feedback optimizations. After profiling, true data is available for branch and memory predictions. CPU silicon without hints can only look a few steps ahead and make guesses.
And I'm not sure what you mean by tasks where SMP is unusable. Most Itanium systems are SMP. IA64 SMP works even better than Xeon's because Intel fixed some of x86's short-sighted memory ordering rules.
Posted Jul 12, 2012 14:31 UTC (Thu)
by khim (subscriber, #9252)
[Link]
This is where you are wrong. Dead wrong, as the Itanic story showed. There is one thing software cannot do, and sadly (for the guys who dreamed up this still-born architecture) it is the only thing that matters: software cannot optimize your software for a CPU which does not even exist yet. This is a classic case of "in theory, there is no difference between theory and practice; but, in practice, there is". Theory: you can use PGO and produce nice, fast code. Itanium should kill all other CPUs! Practice: even if software is built in-house, it's usually built once and used on systems with different CPU generations (often from different vendors). COTS software is used for decades without recompilation. Itanic is a gigantic waste of resources.
As for IA64 SMP working even better than Xeon's: not really. Compare the obvious competitors: the Itanium® Processor 9350 and the Xeon® Processor E7-4860. Surely Itanium wins because of its better architecture? Nope: if your task is SMP-capable, then 10 Xeon cores can easily beat 4 Itanium cores. And this is not an aberration: Xeon systems usually had twice as many cores (often the difference was more than 2x) as identically priced Itanium systems. Even with all the fixes to the memory-ordering rules, Itanium was never able to win such a competition when the task could use all the cores of both systems.
Posted Jul 13, 2012 8:55 UTC (Fri)
by etienne (guest, #25256)
[Link] (3 responses)
It seems that CPU silicon can adapt to memory accesses taking different times (hit in level 1 cache, in level 2 cache...) and so re-order instructions better than the compiler can, mostly when the same code is executed at different times.
Posted Jul 14, 2012 22:58 UTC (Sat)
by jzbiciak (guest, #5246)
[Link] (2 responses)
For example, the compiler doesn't have sight of the other software on the system. Suppose process A is polluting the caches. When running unrelated process B, the CPU still could improve its performance through out-of-order scheduling and other tricks that hide the miss penalties. The compiler had no way of predicting those when it compiled process B's executable.
Statically scheduled architectures such as IA64 won't reorder the instruction stream, although they will (especially once they got to McKinley) aggressively try to reorder memory requests and allow multiple misses in flight. As a result, to address concerns such as the pollution issue above, the compiler needs to try to schedule loads as early as possible. This is why IA64 has "check loads": they let you issue a speculative load even earlier -- perhaps before you even know the pointer is valid -- and then issue a "check load" to copy the loaded value into an architectural register at the load's original execution point.
The "check load" is where all exceptions get anchored in the case of a fault. If the speculative load got invalidated for some reason (it doesn't have to be a fault -- the hardware can drop a speculative load for any reason at all), the check-load will re-issue it. It allows the compiler to mimic the hardware's speculation to a certain extent.
It's not problem free. Speculative load / check load pairs tie up more issue slots than straight loads that hardware might speculate. If the check load stalls, it stalls all the code that follows it (statically scheduled, remember?), whereas with hardware speculation, only the dependent instructions stall while others can proceed.
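C cannot express a deferred-fault load, so check loads have no direct source-level equivalent, but the schedule-the-load-early idea behind them has a portable cousin in __builtin_prefetch(), which, like a speculative load, never faults. A rough sketch (illustrative only, not IA64 code):

    #include <stddef.h>

    struct node {
        struct node *next;
        long val;
    };

    long sum_list(const struct node *n)
    {
        long total = 0;

        while (n != NULL) {
            /* Start fetching the next node's cache line before working
             * on this one, so the miss overlaps useful computation --
             * the software cousin of hoisting a load above dependent
             * code.  Prefetching a NULL next pointer is harmless. */
            __builtin_prefetch(n->next);
            total += n->val;
            n = n->next;
        }
        return total;
    }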
The original promise of the IA-64 architecture was to be able to ramp up the processor clock given the simplified instruction pipeline thanks to static scheduling, to overcome any losses associated with lack of hardware speculation. Furthermore, it was supposed to be more energy efficient since it wasn't spending hardware tracking dependencies in large instruction windows. In the end, it didn't seem to live up to any of that.
I don't think IA64's failure is an indictment of VLIW principles, though. I work with a VLIW processor regularly that doesn't seem to have any of the same problems. When measured against the promises, EPIC (the official name for IA64's brand of VLIW) was an EPIC failure, IMHO.
Posted Jul 27, 2012 21:06 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (1 responses)
Not just hardware but self-modifying software too. This is how "HotSpot" got its name. This looks like a more general compile-time versus run-time question rather than a hardware versus software one.
At this stage this thread could probably use an informed post about Transmeta?
Posted Jul 27, 2012 21:48 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
JITs can use dynamic profile information (both block and path frequency) and specific details of the run environment to tailor the code, but they can't respond at the granularity of, say, re-ordering instructions around cache misses. If you have a purely statically scheduled instruction set like EPIC, no amount of JIT will help reorder loads if the miss patterns are data dependent and dynamically changing. (Speculate/check loads can help, but only so much.)
Speaking of the ultimate JIT platform Transmeta: Even Transmeta has some hardware patents for hardware memory access speculation mechanisms. I recall one which was a speculative store buffer: The software would queue up stores for a predicted path, and then another instruction would either commit the queue or discard it. Or something like that... Ah, I think this might be the one: http://www.freepatentsonline.com/7225299.html
Posted Jul 11, 2012 12:15 UTC (Wed)
by etienne (guest, #25256)
[Link]
I just have 64 memory window base addresses to manage, plus things which are really windows into some other kind of memory.
Also, having a massive number of I/O-mapped control/status registers, I am no longer using readl()/writel() with #defined register addresses, contents, and MASKs, but arrays of structures containing arrays of structures and a lot of bitfields in there - the whole lot declared volatile.
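A sketch of that register-map style (all names and addresses here are made up for illustration; note that C bitfield layout is implementation-defined, which is one reason portable kernel code sticks with readl()/writel() and masks):

    #include <stdint.h>

    struct uart_regs {
        uint32_t data;               /* 0x00: TX/RX data          */
        uint32_t status;             /* 0x04: line status         */
        struct {
            uint32_t enable : 1;
            uint32_t parity : 2;
            uint32_t        : 29;    /* reserved                  */
        } ctrl;                      /* 0x08: control bits        */
    };

    /* Hypothetical base address of a memory-mapped UART. */
    #define UART0 ((volatile struct uart_regs *)0x10009000u)

    static inline void uart0_enable(void)
    {
        UART0->ctrl.enable = 1;              /* read-modify-write */
        while ((UART0->status & 0x01) == 0)  /* poll a ready bit  */
            ;
    }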
Posted Jul 12, 2012 23:08 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
Posted Jul 19, 2012 3:03 UTC (Thu)
by guillemj (subscriber, #49706)
[Link]
http://anonscm.debian.org/gitweb/?p=dpkg/dpkg.git;a=commi...
And I'd be happy to change the GNU triplet match, if that got renamed to something less unfortunate.