Supporting 64-bit ARM systems
Posted Jul 11, 2012 15:52 UTC (Wed) by iabervon (subscriber, #722)
In reply to: Supporting 64-bit ARM systems by smoogen
Parent article: Supporting 64-bit ARM systems
There's nothing terrible about designing an architecture that's just different from the other architecture the chip supports, so long as the new architecture isn't nuts. If arm64 is a different architecture from a company that knows a lot about a certain sweet spot from their 32-bit architecture, that's a whole lot better than IA64 being a different architecture from a company that'd been driven temporarily insane by their 32-bit architecture.
Posted Jul 11, 2012 19:39 UTC (Wed) by zlynx (guest, #2285)
The Itanium was nearly the complete OPPOSITE of a P4 design. In the Itanium design the compiler was responsible for figuring out what memory to preload, what branches to predict and what instructions to run in parallel. The Itanium CPU itself was a very RISCy design in its way without much special logic.
In a P4 and other IA32 designs, the CPU has big piles of logic dedicated to branch predictions, instruction decoding, speculative execution and parallel instruction dispatch with the associated timeline cleanup at the end to make it all appear sequential.
Itanium dropped quite a lot of that, which I think was a very good decision.
Posted Jul 11, 2012 20:46 UTC (Wed) by khim (subscriber, #9252)
The P4 was a step in the same direction as the Itanic, just not as big a one. That's why it was merely a "reputation disaster" instead of "billions down the drain". For example, for good performance it needed branch-taken/branch-not-taken hints from the compiler (they were added to x86 specifically for the P4). Yeah, good decision. For AMD, that is. Itanium is designed for a weird and quite exotic corner case: tasks where SMP is unusable yet the compiler is capable of making correct predictions about memory access and branch execution. Sure, such tasks do exist (most crypto algorithms, for example), but they are rare. We all know the result.
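(To make the hint mechanism concrete, here is a minimal sketch at the source level using GCC's __builtin_expect, which is a real GCC/Clang builtin; whether a given P4-era compiler actually emitted the x86 branch-hint prefixes for it is an assumption worth checking against Intel's manuals.)

    /* Compiler-visible branch hints, assuming GCC or Clang. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int first_byte(const char *s)
    {
        if (unlikely(s == 0))   /* tell the compiler the error path is cold */
            return -1;
        return s[0];            /* hot path is laid out as the fall-through */
    }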
Posted Jul 11, 2012 20:58 UTC (Wed) by zlynx (guest, #2285)
Is there ANYTHING that CPU silicon can do with an instruction stream that software cannot do? No, not really.
It's not rare at all. Most software improves quite a bit on any CPU when rebuilt with performance-feedback (profile-guided) optimizations. After profiling, real data is available for branch and memory predictions. CPU silicon without hints can only look a few steps ahead and make guesses.
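(A minimal profile-guided build, assuming GCC; the flags are GCC's standard PGO flags, and the function is just a made-up branch the profile can learn about.)

    /* pgo_demo.c -- a branch whose bias only a profile run reveals.
     *
     * Build, train, and rebuild:
     *   gcc -O2 -fprofile-generate pgo_demo.c -o demo
     *   ./demo                                # writes *.gcda profile data
     *   gcc -O2 -fprofile-use pgo_demo.c -o demo
     */
    #include <stdio.h>

    static int mostly_positive(int x)
    {
        if (x >= 0)       /* the profile will show this is almost always taken */
            return x * 2;
        return -x;
    }

    int main(void)
    {
        long sum = 0;
        for (int i = -10; i < 1000; i++)   /* overwhelmingly non-negative */
            sum += mostly_positive(i);
        printf("%ld\n", sum);
        return 0;
    }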
And I'm not sure what you mean by tasks where SMP is unusable. Most Itanium systems are SMP. IA64 SMP works even better than Xeon's because Intel fixed some of x86's short-sighted memory ordering rules.
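(As a sketch of what explicit ordering buys: IA64 exposes acquire/release semantics directly as ld.acq/st.rel, and C11 atomics map onto them cleanly. The producer/consumer below is plain portable C11, not IA64-specific code.)

    #include <stdatomic.h>

    /* Release/acquire pair: on IA64 this compiles to st.rel / ld.acq,
     * paying only for the ordering actually needed. */
    static atomic_int flag = 0;
    static int data;

    void producer(void)
    {
        data = 42;
        atomic_store_explicit(&flag, 1, memory_order_release); /* st.rel */
    }

    int consumer(void)
    {
        while (!atomic_load_explicit(&flag, memory_order_acquire)) /* ld.acq */
            ;                  /* spin until the producer publishes */
        return data;           /* guaranteed to observe 42 */
    }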
Posted Jul 12, 2012 14:31 UTC (Thu) by khim (subscriber, #9252)
This is where you are wrong. Dead wrong, as the Itanic story showed. There is one thing software can not do, and sadly (for the guys who dreamed up this still-born architecture) it is the only thing that matters: software can not optimize your software for a CPU which does not even exist yet. This is a classic case of "In theory, there is no difference between theory and practice. But, in practice, there is."

Theory: you can use PGO and produce nice, fast code. Itanium should kill all other CPUs!

Practice: even if software is built in-house, it's usually built once and used on systems with different CPU generations (often from different vendors). COTS software is used for decades without recompilation. The Itanic was a gigantic waste of resources.

As for IA64 SMP working even better than Xeon's: not really. Compare the obvious competitors, the Itanium® Processor 9350 and the Xeon® Processor E7-4860. Surely the Itanium wins because of its better architecture? Nope: if your task is SMP-capable, then 10 Xeon cores can easily beat 4 Itanium cores. And this is not an aberration: identically priced Xeon systems usually had 2x more cores than Itanium systems (often the difference was more than 2x). Even with all the fixes in the memory-ordering rules, Itanium was never able to win such a comparison when the task could use all the cores of the Itanium and all the cores of the Xeon.
Posted Jul 13, 2012 8:55 UTC (Fri) by etienne (guest, #25256)
It seems that CPU silicon can adapt to memory accesses taking different times (a hit in the level 1 cache, in the level 2 cache...) and so reorder instructions better than the compiler can, especially when the same code is executed at different times.
Posted Jul 14, 2012 22:58 UTC (Sat) by jzbiciak (guest, #5246)
For example, the compiler doesn't have sight of the other software on the system. Suppose process A is polluting the caches. When running unrelated process B, the CPU can still improve B's performance through out-of-order scheduling and other tricks that hide the miss penalties. The compiler had no way of predicting those when it compiled process B's executable.
Statically scheduled architectures such as IA64 won't reorder the instruction stream, although they will (especially once they got to McKinley) aggressively try to reorder memory requests and allow multiple misses in flight. As a result, to address concerns such as the pollution issue above, the compiler needs to try to schedule loads as early as possible. This is why IA64 has "check loads": they let you issue a speculative load even earlier -- perhaps before you even know the pointer is valid -- and then issue a "check load" to copy the loaded value into an architectural register at the load's original execution point.
The "check load" is where all exceptions get anchored in the case of a fault. If the speculative load got invalidated for some reason (it doesn't have to be a fault -- the hardware can drop a speculative load for any reason at all), the check-load will re-issue it. It allows the compiler to mimic the hardware's speculation to a certain extent.
It's not problem-free. Speculative-load/check-load pairs tie up more issue slots than the straight loads that hardware might speculate. If the check load stalls, it stalls all the code that follows it (statically scheduled, remember?), whereas with hardware speculation only the dependent instructions stall while others can proceed.
The original promise of the IA-64 architecture was that static scheduling would simplify the instruction pipeline enough to ramp up the processor clock and overcome any losses from the lack of hardware speculation. Furthermore, it was supposed to be more energy efficient, since it wasn't spending hardware on tracking dependencies across large instruction windows. In the end, it didn't seem to live up to any of that.
I don't think IA64's failure is an indictment of VLIW principles, though. I regularly work with a VLIW processor that doesn't seem to have any of the same problems. When measured against the promises, EPIC (the official name for IA64's brand of VLIW) was an EPIC failure, IMHO.
Posted Jul 27, 2012 21:06 UTC (Fri) by marcH (subscriber, #57642)
Not just hardware, but self-modifying software too. This is how "HotSpot" got its name. This looks like a more general compile-time versus run-time question rather than a hardware versus software one.
At this stage this thread could probably use an informed post about Transmeta?
Posted Jul 27, 2012 21:48 UTC (Fri) by jzbiciak (guest, #5246)
JITs can use dynamic profile information (both block and path frequency) and specific details of the run environment to tailor the code, but they can't respond at the granularity of, say, re-ordering instructions around individual cache misses. If you have a purely statically scheduled instruction set like EPIC, no amount of JIT will help reorder loads if the miss patterns are data-dependent and dynamically changing. (Speculate/check loads can help, but only so much.)
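(A concrete example of "data-dependent and dynamically changing": in the classic pointer chase below -- purely an illustrative sketch -- the miss pattern is decided by where the nodes happen to sit in memory, which neither a static schedule nor a JIT recompile can know; an out-of-order core at least keeps independent work in flight while each miss is outstanding.)

    /* Serially dependent loads: whether each n->next hits or misses in the
     * cache depends entirely on the run-time layout of the list. */
    struct node { struct node *next; long payload; };

    long sum_list(const struct node *n)
    {
        long sum = 0;
        while (n) {
            sum += n->payload;  /* uses the pointer loaded just before */
            n = n->next;        /* latency is decided by the data, at run time */
        }
        return sum;
    }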
Speaking of the ultimate JIT platform, Transmeta: even Transmeta had some hardware patents for memory-access speculation mechanisms. I recall one which was a speculative store buffer: the software would queue up stores for a predicted path, and then another instruction would either commit the queue or discard it. Or something like that... Ah, I think this might be the one: http://www.freepatentsonline.com/7225299.html
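(A toy model of the gated store buffer as described above -- the API and names are hypothetical, since the patent covers a hardware mechanism: stores along a predicted path are queued, then either drained to memory or thrown away.)

    #include <stddef.h>

    #define SPEC_MAX 16

    /* Stores queued for the speculated path; memory is untouched until commit. */
    static struct { unsigned char *addr; unsigned char val; } spec_buf[SPEC_MAX];
    static size_t spec_n;

    static void spec_store(unsigned char *addr, unsigned char val)
    {
        if (spec_n == SPEC_MAX)     /* toy model: real hardware would
                                       presumably stall or commit early */
            return;
        spec_buf[spec_n].addr = addr;
        spec_buf[spec_n].val  = val;
        spec_n++;
    }

    static void spec_commit(void)   /* prediction held: publish the stores */
    {
        for (size_t i = 0; i < spec_n; i++)
            *spec_buf[i].addr = spec_buf[i].val;
        spec_n = 0;
    }

    static void spec_discard(void)  /* mispredicted path: forget them all */
    {
        spec_n = 0;
    }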