Supporting 64-bit ARM systems
One might well wonder whether a 64-bit ARM processor is truly needed. 64-bit computing seems a bit rich even for the fanciest handsets or tablets, much less for the kind of embedded controllers where ARM processors predominate. But mobile devices are beginning to push the memory-addressing limits of 32-bit systems; even a 1GB system requires the use of high memory in most configurations. So, even if the heavily foreshadowed ARM server systems never materialize, there will be a need for 64-bit ARM processors just to be able to efficiently use the memory that future mobile devices will have. "Mobile" and "embedded" no longer mean "tiny."
Naturally, Linux support is an important precondition to a successful 64-bit ARM processor introduction, so ARM has been supporting work in that area for some time. The initial GCC patches were posted back in May, and the first set of kernel patches was posted by Catalin Marinas on July 6. All this code exists despite the fact that no 64-bit ARM hardware is yet available; it all has been developed on simulators. Once the hardware shows up, with luck, the software will work with a minimum of tweaking required.
64-bit ARM support involves the addition of thousands of lines of new code via a 36-part patch set. There are some novel features, such as the ability to run with a 64KB native memory page size, and a lot of important technical decisions to be reviewed. So the kernel developers did what one would expect: they started complaining about the name given to the architecture. That name ("AArch64") strikes many as simultaneously redundant (of course it is an architecture) and uninformative (what does "A" stand for?). Many would prefer either ARMv8 (which is the actual hardware architecture name—"AArch64" is ARMv8's 64-bit operating mode) or arm64.
Arguments in favor of the current name include the fact that it is already used to identify the architecture in the ELF triplet used in binaries; using the same name everywhere should help to reduce confusion. But, then, as Arnd Bergmann noted: "If everything else is aarch64, we should use that in the kernel directory too, but if everyone calls it arm64 anyway, we should probably use that name for as many things as possible."
Jon Masters added that, in classic contrarian style, he likes the name as it is; Fedora is planning to use "aarch64" as the name for its 64-bit ARM releases. Others, such as Ingo Molnar, argue in favor of changing the name now, when it is relatively easy to do. Catalin seems inclined to keep the current name but says he will think about it before posting the next version of the patch series.
An arguably more substantive question was raised by a number of developers: wouldn't it make more sense to unify the 32-bit and 64-bit ARM implementations from the outset? Several other architectures (x86, PowerPC, SPARC, and MIPS) started with separate implementations, but ended up merging them later on, usually with some significant pain involved. Rather than leave that pain for future ARM developers, it has been suggested that, perhaps, it would be better to start with a unified implementation.
There are a lot of reasons given for the separate 64-bit ARM architecture implementation. Much of the relevant thinking can be found in this note from Arnd. The 64-bit ARM instruction set is completely different from the 32-bit variety, to the point that there is no possibility of writing assembly code that works on both architectures. The system call interfaces also differ significantly, with the 64-bit version taking a more standard approach and leaving a lot of legacy code behind. The 64-bit implementation hopes to leave the entire 32-bit ARM "platform" concept behind as well; indeed, as Jon put it, there are hopes that it will be possible to have a single kernel binary that runs on all 64-bit ARM systems from the outset. In general, it is said, giving AArch64 a clean start in its own top-level hierarchy will make it possible to leave a lot of ARM baggage behind and will result in a better implementation overall.
Others were quick to point out that most of these arguments have been heard in the context of other architectures. x86_64 was also meant to be a clean start that dumped a lot of old i386 code. In the end, things turned out otherwise. It may be that things are different here; 32-bit ARM has rather more legacy baggage than other architectures did, and the processor differences seem to be larger. Some have said that the proper comparison is with x86 and ia64, though one gets the sense that the AArch64 developers don't want to be seen in the same light as ia64 in general.
This decision will come down to what the AArch64 developers want, in the end; it's up to them to produce a working implementation and to maintain it into the future. If they insist that it should be a separate top-level architecture, it is unlikely that others will block its merging for that reason alone. Of course, it will also be up to those developers to manage a merger of the two in the future, should that prove to be necessary. If nothing else, life as a separate top-level architecture will allow some experimentation without the fear of breaking older 32-bit systems; the result could be a better unified architecture some years from now, should things move in that direction.
Thus far, there has been little in the way of deeper technical criticism of the AArch64 patch set. Things may stay that way. The code has already been through a number of rounds of private review involving prominent developers, so the worst problems should already have been found and addressed. Few developers have the understanding of this new processor that would be necessary to truly understand much of the code. So it may go into the mainline kernel (perhaps as early as 3.7) without a whole lot of substantial changes. After that, all that will be needed is actual hardware; then things should get truly interesting.
Index entries for this article
Kernel: Architectures/Arm
Posted Jul 10, 2012 20:59 UTC (Tue)
by smoogen (subscriber, #97)
[Link] (35 responses)
[And yes I know it is a lot more work than say 6-8 instructions but the amount of work sounds less than the Aarch64 takes.]
Posted Jul 10, 2012 21:39 UTC (Tue)
by jzbiciak (guest, #5246)
[Link] (6 responses)
The original IA64 parts couldn't run x86 code worth a crap, so not only did you have to change everything to use the new 64-bit functionality, you had to change everything before the device was even useful. That's a much bigger hurdle to clear.
Assuming ARMv8 processors run ARMv7 code relatively well outside of AArch64 mode, you could deploy quite a lot of hardware with ARMv8 and only move to 64-bit when needed. I think that was much of the allure of x86-64. Many people ran 32-bit OSes on 64-bit devices for quite a while, until the 64-bit software matured.
Posted Jul 11, 2012 3:01 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (5 responses)
ARM64 doesn't have to run ARM32 code; it just has to be vaguely sane to port code for ARM32 over to ARM64.
Posted Jul 11, 2012 3:27 UTC (Wed)
by jzbiciak (guest, #5246)
[Link] (4 responses)
My understanding is that ARM generally has kept pretty strong source-level compatibility, and that includes at the assembly level (a few esoteric CPxx registers notwithstanding). So, if I've optimized some algorithm in ARMv5 assembly, let's say, I have a pretty good chance of running that code on ARMv7 with at most a recompile, but quite likely just a relink.
That's at least my understanding. If that weren't true, then it seems like ARM could have shed more baggage more often. After all, ARMv7 still "supports" Jazelle (albeit with a "null" implementation) for example. Why would it need to, unless there's the expectation of running some ARMv5J binaries?
If the ARMv8 AArch64 code is as different as everyone's saying, then it's not clear that you'll be compatible at the assembly level at all. You may be compatible at the C/C++ level, except in cases where pointer sizes break you. Porting may be vaguely sane, but I suspect still a bigger chore than moving among existing 32-bit ARM variants.
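As a minimal sketch of that pointer-size breakage (hypothetical code, not from this thread): C that is harmless on ILP32, where int and pointers are both 32 bits wide, silently truncates a pointer on an LP64 target.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int x = 42;
        int *p = &x;

        /* Old 32-bit habit: stash a pointer in an int.  On ILP32 this is
         * lossless; on LP64 the high 32 bits of the pointer are dropped. */
        int cookie = (int)(intptr_t)p;

        /* Round-tripping back through the int may no longer yield p. */
        int *q = (int *)(intptr_t)cookie;

        printf("sizeof(int)=%zu sizeof(void *)=%zu\n",
               sizeof(int), sizeof(void *));
        printf("p=%p q=%p%s\n", (void *)p, (void *)q,
               p == q ? "" : "  (truncated!)");
        return 0;
    }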
In any case, I imagine the first year or two of ARMv8 hardware will come with 32-bit OSes, or at least 32-bit userlands on 64-bit kernels if ARMv8 supports that, until the software stabilizes.
I guess it's time for me to start reading up on ARMv8, eh? *sigh* I just finished absorbing a couple thousand pages of docs on ARM Cortex-A15 and the current round of AXI/ACE protocols....
Posted Jul 11, 2012 3:55 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (1 responses)
I've only dealt with ARM in a few areas like mutex/locking asm helpers, where the best sequence to use is different for different sub-architectures.
I can see that the break of source compatibility would make v8/ARM64 a bigger challenge if perfect source-level ASM compatibility has been the norm up until now. It still sounds more like an ia32-to-x86_64 move than an ia32-to-itanic move, though.
Posted Jul 11, 2012 7:09 UTC (Wed)
by cmarinas (subscriber, #39468)
[Link]
Posted Jul 17, 2012 21:58 UTC (Tue)
by Tuna-Fish (guest, #61751)
[Link] (1 responses)
Posted Jul 17, 2012 22:40 UTC (Tue)
by jzbiciak (guest, #5246)
[Link]
But SP's not a GPR? Wow... It makes a certain amount of sense, really, since stack pointers move in rather particular ways. Treating it specially may help when speculating memory accesses. I'm going to guess they have dedicated ways of generating SP-relative addresses for local variables, along with frame pointers.
Otherwise, it sounds very MIPS-y, at least superficially.
Posted Jul 10, 2012 21:55 UTC (Tue)
by stevenb (guest, #11536)
[Link]
I just hope they're not going to do a 64-bit Thumb :-)
Posted Jul 11, 2012 2:45 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (6 responses)
They implemented ia32 emulation, but so slowly that they'd have been better off not doing so and requiring software emulation with a few helper instructions from the outset.
Because of the big-end-of-town focus, most of the commercial software for Itanium only came in special price-gouging editions, further inflating its already uncompetitive costs.
More importantly, they were really expensive parts to make and were based on a completely new very long instruction word design that turned out not to perform half as well in reality as it did in theory. It also relied on compilers to do a huge amount of work, but the compilers just weren't ready for it. They had huge caches (esp. for the time), low yields, and insanely expensive packaging with complex daughterboards. There was no such thing as a low-cost, entry-level Itanium system; the whole thing was a push-it-down-from-the-top big iron and big clusters concept.
'cos, y'know, that's exactly how ia32 came to dominance over SPARC, PPC, Alpha, etc, right?
It sounds like ARM64 is closer to "pure" x86_64 than Itanium/IA64; it's a major cleanup and extension of ARM32, but recognisably related.
I doubt ARM will screw this up, not least because they'll have learned from the Itanium fiasco. They're sure to have low-cost, entry-level parts available directly or via licensees from the beginning. They'll be focused on keeping costs low, yields high, and porting relatively easy. I'd be surprised if some of the changes in ARM64 weren't there to get rid of hard- or expensive-to-implement instructions and other simplifications for easier manufacturing.
ARM also have another advantage: there isn't much of a legacy base of binary software that people expect to be portable to new arches. It's the norm to have to rebuild binaries for different ARM sub-arches. They don't have Intel's problem of people expecting to be able to run their 1981 copy of QBase on their new Itanium server.
Posted Jul 11, 2012 2:59 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (1 responses)
For an example of insanely expensive packaging, check this image of an Itanium 2 CPU assembly out. Compare to this Intel Pentium III Coppermine CPU package. It isn't hard to see part of why Itanium struggled even before AMD came along with AMD64 and finished it off.

Have a look at the Intel price list for some of the Itanium family. Do you really think they're that much better than a Core i7 or the Xeon variants? I don't. Sure, those are current prices, not historical ones from the time when Itanium used to be almost relevant, but the prices then were at least as bad. Also, notice how the TDP ratings on those parts are absolutely nuts? Some of those parts dissipate 180W and cost nearly $4k in bulk! You aren't going to get many of those into a rack, not least because the CPU packages are so physically huge as well. Now there are further disincentives like the lack of AES-NI instructions in Itanium.
Posted Jul 11, 2012 8:23 UTC (Wed)
by jengelh (guest, #33263)
[Link]
Well, it was about the P4-Netburst time, so the TDP is... accurate :)

As for the prices, the enterprise segment is never understandable to mortal users: the $4k price tag does not even include a next-day replacement service.
Posted Jul 11, 2012 11:29 UTC (Wed)
by kskatrh (guest, #73410)
[Link] (2 responses)
The number I've heard bandied about was $15K for some of the first AArch64 boards/systems. Perhaps that was for one of the big Calxeda boxes, fully populated?
Contrast that with a $60 Gooseberry or a $45 Raspberry Pi (if my order ever gets filled).
I'll be much more interested when I can get something for <$200.
Posted Jul 11, 2012 13:55 UTC (Wed)
by khim (subscriber, #9252)
[Link]
I think you are comparing apples and oranges. I don't believe in a $15K AArch64 system for a nanosecond. Boards, sure; that is about what you can expect for a brand-new architecture. I remember how we developed software for a $5 cryptogadget (that's the retail price; bulk prices were about 2-3 times lower, and the CPU itself was well below $0.50). The development board for that CPU was $5K.
Posted Jul 11, 2012 16:35 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I fondly remember paying several thousand dollars for development boards, with retail prices of the end-user hardware below $200.
Posted Jul 18, 2012 0:10 UTC (Wed)
by plugwash (subscriber, #29694)
[Link]
Maybe that is true of the microcontroller variants of ARM and of kernel code, but userland code on Linux tends to keep running fine on newer versions of ARM. Debian armel will run happily on everything from v4t right through to v7.
Posted Jul 11, 2012 7:19 UTC (Wed)
by gidoca (subscriber, #62438)
[Link] (2 responses)
Posted Jul 11, 2012 14:12 UTC (Wed)
by mwsealey (subscriber, #71282)
[Link] (1 responses)
It's not like PPC, where the instruction encodings are identical to a high degree and 64-bit code barely has a handful of extra instructions for dealing with 64-bit values directly. Power Architecture was designed very, very well from the beginning: the mnemonics are identical and the generated machine code is identical barring the 64-bit specifics. You can't say that about ARM. ARM decided to reinvent the architecture to get rid of all that cruft (but they have to keep compatibility with at least ARMv7).
Most of the available userspaces are going to stay ARMv7-A for a long, long time, just as running a 64-bit kernel with a 32-bit userspace is common on x86 these days (and x32 just compounds it). Those that need 64-bit userspaces will have them.
Posted Jul 11, 2012 22:02 UTC (Wed)
by ncm (guest, #165)
[Link]
DEC provided a translator from x86 to Alpha, and the translated code ran lots faster on Alpha than the original code did on x86. IIRC, even interpreted x86 on Alpha ran as fast as on native x86. It didn't save them in the end. How the market will respond to a new architecture depends on details, and nobody knows which details will turn out to be important.
Compatibility can be a trap: IBM OS/2 ran windos programs unchanged. windos could not run OS/2 programs. Which target should you code for?
Posted Jul 11, 2012 14:39 UTC (Wed)
by Lumag (subscriber, #22579)
[Link] (7 responses)
x86_64 gained additional registers, and that is one of the main reasons why it is really faster than i386 (IA-32).
Posted Jul 11, 2012 17:26 UTC (Wed)
by gnb (subscriber, #5132)
[Link] (6 responses)
That's equally true of AArch64: you get 31 GP registers against 15 in 32-bit ARM mode.
Posted Jul 11, 2012 17:33 UTC (Wed)
by Lumag (subscriber, #22579)
[Link] (5 responses)
P.S. I'm not voting against AArch64. I'm just saying that "just extending registers to 64-bit" probably won't work as expected.
Posted Jul 11, 2012 18:20 UTC (Wed)
by gnb (subscriber, #5132)
[Link]
Posted Jul 12, 2012 22:16 UTC (Thu)
by cesarb (subscriber, #6266)
[Link] (3 responses)
For instance, the inner loop of SHA-512 needs 8 64-bit registers, plus an array of 16 64-bit values. On a 32-bit architecture with 15 registers, you will have to spill to the stack (each 64-bit register would be represented by a pair of 32-bit registers, so you would need 16 of them). On a 64-bit architecture with 31 registers, you could have the whole state (8+16 values) in the registers, and still have a few left for the intermediate calculations. The entire inner loop can run without having to touch the stack.
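As a structural sketch of that working set (the rotation amounts are SHA-512's real ones, but the 80-entry round-constant table is declared extern rather than reproduced here, so this illustrates register pressure rather than serving as a usable hash): eight 64-bit state variables plus a 16-word schedule stay live across every iteration.

    #include <stdint.h>

    #define ROTR(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

    extern const uint64_t K[80];   /* standard SHA-512 round constants */

    void sha512_rounds(uint64_t s[8], uint64_t w[16])
    {
        /* Eight live 64-bit values: 8 registers on a 64-bit machine,
         * 16 registers on 32-bit ARM -- more than the 15 available
         * before you even touch the 16-entry schedule below. */
        uint64_t a = s[0], b = s[1], c = s[2], d = s[3];
        uint64_t e = s[4], f = s[5], g = s[6], h = s[7];

        for (int t = 0; t < 80; t++) {
            if (t >= 16) {
                /* Extend the message schedule in place, mod 16. */
                uint64_t w15 = w[(t + 1) % 16], w2 = w[(t + 14) % 16];
                w[t % 16] += w[(t + 9) % 16]
                    + (ROTR(w15, 1) ^ ROTR(w15, 8) ^ (w15 >> 7))
                    + (ROTR(w2, 19) ^ ROTR(w2, 61) ^ (w2 >> 6));
            }
            uint64_t t1 = h + (ROTR(e, 14) ^ ROTR(e, 18) ^ ROTR(e, 41))
                            + ((e & f) ^ (~e & g)) + K[t] + w[t % 16];
            uint64_t t2 = (ROTR(a, 28) ^ ROTR(a, 34) ^ ROTR(a, 39))
                            + ((a & b) ^ (a & c) ^ (b & c));
            h = g; g = f; f = e; e = d + t1;
            d = c; c = b; b = a; a = t1 + t2;
        }
        s[0] += a; s[1] += b; s[2] += c; s[3] += d;
        s[4] += e; s[5] += f; s[6] += g; s[7] += h;
    }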
Posted Jul 14, 2012 22:40 UTC (Sat)
by jzbiciak (guest, #5246)
[Link] (2 responses)
In the specific case of ARMv7 vs. ARMv8, you need to consider NEON. On an ARMv7 with NEON, wouldn't you do the SHA-512 hash using the NEON registers?
(These guys show decent results for other crypto algorithms using NEON. They suggest SHA-512 would speed up well also, but say they "didn't bother" with it yet.)
I guess my point is, ARMv7 already offers a path to way more registers than the base 16x32-bit. I would imagine anyone springing for a Cortex-A15 would also include NEON. NEON is designed to absorb the heavy-duty bulk computation, leaving the 16 GPRs for the more general control stuff.
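To make that concrete, here is a hedged sketch of 64-bit arithmetic in NEON registers using the standard <arm_neon.h> intrinsics: SHA-512's Sigma1 function computed on two 64-bit lanes at once, keeping the values in Q registers instead of the general-purpose ones.

    #include <arm_neon.h>

    /* Rotate both 64-bit lanes right by n (n must be a constant). */
    #define VROTR64(x, n) \
        vorrq_u64(vshrq_n_u64((x), (n)), vshlq_n_u64((x), 64 - (n)))

    /* SHA-512's Sigma1, ROTR(e,14) ^ ROTR(e,18) ^ ROTR(e,41),
     * applied to two independent 64-bit values in one Q register. */
    uint64x2_t sigma1_pair(uint64x2_t e)
    {
        return veorq_u64(veorq_u64(VROTR64(e, 14), VROTR64(e, 18)),
                         VROTR64(e, 41));
    }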
Posted Jul 14, 2012 23:00 UTC (Sat)
by dlang (guest, #313)
[Link] (1 responses)
But if you are trying to build software for many different devices (like a distro needs to), then you can't count on optional features being there.
Posted Jul 15, 2012 4:42 UTC (Sun)
by jzbiciak (guest, #5246)
[Link]
Your point stands, though, for devices that would have to run the fallback version. Those would benefit from ARMv8 without having to use NEON. Plus, keeping the multiple versions around takes up more space, wrangling them is added complexity, etc. etc.
For something like a cell phone, where interpreters like Dalvik have a JIT, the compiled code can match the exact CPU in the phone. After all, that isn't exactly a user-serviceable part. ;-)
Posted Jul 11, 2012 15:52 UTC (Wed)
by iabervon (subscriber, #722)
[Link] (8 responses)
There's nothing terrible about designing an architecture that's just different from the other architecture the chip supports, so long as the new architecture isn't nuts. If arm64 is a different architecture from a company that knows a lot about a certain sweet spot from their 32-bit architecture, that's a whole lot better than IA64 being a different architecture from a company that'd been driven temporarily insane by their 32-bit architecture.
Posted Jul 11, 2012 19:39 UTC (Wed)
by zlynx (guest, #2285)
[Link] (7 responses)
The Itanium was nearly the complete OPPOSITE of a P4 design. In the Itanium design the compiler was responsible for figuring out what memory to preload, what branches to predict and what instructions to run in parallel. The Itanium CPU itself was a very RISCy design in its way without much special logic.
In a P4 and other IA32 designs, the CPU has big piles of logic dedicated to branch predictions, instruction decoding, speculative execution and parallel instruction dispatch with the associated timeline cleanup at the end to make it all appear sequential.
Itanium dropped quite a lot of that, which I think was a very good decision.
Posted Jul 11, 2012 20:46 UTC (Wed)
by khim (subscriber, #9252)
[Link] (6 responses)
P4 is a step in the same direction as Itanic, just not as big. That's why it was merely a "reputation disaster" instead of "billions down the drain". For example, for good performance it needed branch-taken/branch-not-taken hints from the compiler (they were added to x86 specifically for the P4). Yeah, good decision. For AMD, that is. Itanium is designed for a weird and quite exotic corner case: tasks where SMP is unusable yet the compiler is capable of making correct predictions WRT memory access and branch execution. Sure, such tasks do exist (most cryptoalgorithms, for example), but they are rare. We all know the result.
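For reference, compiler-supplied branch hints look like this from C on GCC-family compilers: __builtin_expect(), which the Linux kernel wraps as likely()/unlikely(). The function below is purely illustrative.

    #include <stddef.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum(const long *v, size_t n)
    {
        long total = 0;

        if (unlikely(v == NULL))   /* error path: hinted as not taken */
            return -1;

        for (size_t i = 0; i < n; i++)
            total += v[i];
        return total;
    }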
Posted Jul 11, 2012 20:58 UTC (Wed)
by zlynx (guest, #2285)
[Link] (5 responses)
Is there ANYTHING that CPU silicon can do with an instruction stream that software cannot do? No, not really.
It's not rare at all. Most software improves quite a bit on any CPU when rebuilt with performance feedback optimizations. After profiling, true data is available for branch and memory predictions. CPU silicon without hints can only look a few steps ahead and make guesses.
And I'm not sure what you mean by tasks where SMP is unusable. Most Itanium systems are SMP. IA64 SMP works even better than Xeon's because Intel fixed some of x86's short-sighted memory ordering rules.
Posted Jul 12, 2012 14:31 UTC (Thu)
by khim (subscriber, #9252)
[Link]
This is where you are wrong. Dead wrong, as the Itanic story showed. There is one thing software cannot do, and sadly (for the guys who dreamed up this still-born architecture) it is the only thing that matters: software cannot optimize your software for a CPU which does not even exist yet. This is a classic case of "in theory, there is no difference between theory and practice; but, in practice, there is". Theory: you can use PGO and produce nice, fast code. Itanium should kill all other CPUs! Practice: even if software is built in-house, it's usually built once and used on systems with different CPU generations (often from different vendors). COTS software is used for decades without recompilation. Itanic is a gigantic waste of resources.
As for IA64 SMP working even better than Xeon's: not really. Compare the obvious competitors: the Itanium® Processor 9350 and the Xeon® Processor E7-4860. Surely Itanium wins because of its better architecture? Nope: if your task is SMP-capable, then 10 Xeon cores can easily beat 4 Itanium cores. And this is not an aberration: Xeon systems usually had twice as many cores (often the difference was more than 2x) as identically priced Itanium systems. Even with all the fixes to the memory-ordering rules, Itanium was never able to win such a competition when the task could use all the cores of both systems.
Posted Jul 13, 2012 8:55 UTC (Fri)
by etienne (guest, #25256)
[Link] (3 responses)
It seems that CPU silicon can adapt to memory accesses taking different times (hit in level 1 cache, in level 2 cache...) and so re-order instructions better than the compiler can, mostly when the same code is executed at different times.
Posted Jul 14, 2012 22:58 UTC (Sat)
by jzbiciak (guest, #5246)
[Link] (2 responses)
For example, the compiler doesn't have sight of the other software on the system. Suppose process A is polluting the caches. When running unrelated process B, the CPU still could improve its performance through out-of-order scheduling and other tricks that hide the miss penalties. The compiler had no way of predicting those when it compiled process B's executable.
Statically scheduled architectures such as IA64 won't reorder the instruction stream, although they will (especially once they got to McKinley) aggressively try to reorder memory requests and allow multiple misses in flight. As a result, to address concerns such as the pollution issue above, the compiler needs to try to schedule loads as early as possible. This is why IA64 has "check loads": they let you issue a speculative load even earlier -- perhaps before you even know the pointer is valid -- and then issue a "check load" to copy the loaded value into an architectural register at the load's original execution point.
The "check load" is where all exceptions get anchored in the case of a fault. If the speculative load got invalidated for some reason (it doesn't have to be a fault -- the hardware can drop a speculative load for any reason at all), the check-load will re-issue it. It allows the compiler to mimic the hardware's speculation to a certain extent.
It's not problem free. Speculative load / check load pairs tie up more issue slots than straight loads that hardware might speculate. If the check load stalls, it stalls all the code that follows it (statically scheduled, remember?), whereas with hardware speculation, only the dependent instructions stall while others can proceed.
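C cannot express a deferred-fault load, so check loads have no direct source-level equivalent, but the schedule-the-load-early idea behind them has a portable cousin in __builtin_prefetch(), which, like a speculative load, never faults. A rough sketch (illustrative only, not IA64 code):

    #include <stddef.h>

    struct node {
        struct node *next;
        long val;
    };

    long sum_list(const struct node *n)
    {
        long total = 0;

        while (n != NULL) {
            /* Start fetching the next node's cache line before working
             * on this one, so the miss overlaps useful computation --
             * the software cousin of hoisting a load above dependent
             * code.  Prefetching a NULL next pointer is harmless. */
            __builtin_prefetch(n->next);
            total += n->val;
            n = n->next;
        }
        return total;
    }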
The original promise of the IA-64 architecture was to be able to ramp up the processor clock given the simplified instruction pipeline thanks to static scheduling, to overcome any losses associated with lack of hardware speculation. Furthermore, it was supposed to be more energy efficient since it wasn't spending hardware tracking dependencies in large instruction windows. In the end, it didn't seem to live up to any of that.
I don't think IA64's failure is an indictment of VLIW principles, though. I work with a VLIW processor regularly that doesn't seem to have any of the same problems. When measured against the promises, EPIC (the official name for IA64's brand of VLIW) was an EPIC failure, IMHO.
Posted Jul 27, 2012 21:06 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (1 responses)
Not just hardware but self-modifying software too. This is how "HotSpot" got its name. This looks like a more general compile-time versus run-time question rather than a hardware versus software one.
At this stage this thread could probably use an informed post about Transmeta?
Posted Jul 27, 2012 21:48 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
JITs can use dynamic profile information (both block and path frequency) and specific details of the run environment to tailor the code, but they can't respond at the granularity of, say, re-ordering instructions around cache misses. If you have a purely statically scheduled instruction set like EPIC, no amount of JIT will help reorder loads if the miss patterns are data dependent and dynamically changing. (Speculate/check loads can help, but only so much.)
Speaking of the ultimate JIT platform Transmeta: Even Transmeta has some hardware patents for hardware memory access speculation mechanisms. I recall one which was a speculative store buffer: The software would queue up stores for a predicted path, and then another instruction would either commit the queue or discard it. Or something like that... Ah, I think this might be the one: http://www.freepatentsonline.com/7225299.html
Posted Jul 11, 2012 12:15 UTC (Wed)
by etienne (guest, #25256)
[Link]
I just have 64 memory window base addresses to manage, plus things which are really windows into some other kind of memory.
Also, having a massive number of I/O-mapped control/status registers, I am no longer using readl()/writel() with #defined register addresses, contents, and MASKs, but arrays of structures containing arrays of structures and a lot of bitfields in there - the whole lot declared volatile.
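A sketch of that register-map style (all names and addresses here are made up for illustration; note that C bitfield layout is implementation-defined, which is one reason portable kernel code sticks with readl()/writel() and masks):

    #include <stdint.h>

    struct uart_regs {
        uint32_t data;               /* 0x00: TX/RX data          */
        uint32_t status;             /* 0x04: line status         */
        struct {
            uint32_t enable : 1;
            uint32_t parity : 2;
            uint32_t        : 29;    /* reserved                  */
        } ctrl;                      /* 0x08: control bits        */
    };

    /* Hypothetical base address of a memory-mapped UART. */
    #define UART0 ((volatile struct uart_regs *)0x10009000u)

    static inline void uart0_enable(void)
    {
        UART0->ctrl.enable = 1;              /* read-modify-write */
        while ((UART0->status & 0x01) == 0)  /* poll a ready bit  */
            ;
    }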
Posted Jul 12, 2012 23:08 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
Posted Jul 19, 2012 3:03 UTC (Thu)
by guillemj (subscriber, #49706)
[Link]
http://anonscm.debian.org/gitweb/?p=dpkg/dpkg.git;a=commi...
And I'd be happy to change the GNU triplet match, if that got renamed to something less unfortunate.