Gentoo Linux drops IA-64 (Itanium) support

[Posted August 16, 2024 by jzb]

The Gentoo Linux project has announced that it is dropping support for Itanium:

Following the removal of IA-64 (Itanium) support in the Linux kernel and glibc, and subsequent discussions on our mailing list, as well as a vote by the Gentoo Council, Gentoo will discontinue all ia64 profiles and keywords. The primary reason for this decision is the inability of the Gentoo IA-64 team to support this architecture without kernel support, glibc support, and a functional development box (or even a well-established emulator). In addition, there have been only very few users interested in this type of hardware.

Labor budgeting

Posted Aug 16, 2024 17:38 UTC (Fri) by NightMonkey (subscriber, #23051) [Link] (22 responses)

I think it very admirable that Gentoo's volunteers do their utmost to preserve architectures that have seen better days. But there's only so much time available, and only so much labor available. Cheers to them for making the hard decisions, but I do hope, for their sake, that they work to get better at balancing obligations to available labor. Volunteer labor is used up faster when volunteers are overwhelmed. In this case, I wonder how much effort has been spent chasing an arch that should have been retired/deprecated long ago?

Labor budgeting

Posted Aug 16, 2024 18:27 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (21 responses)

I do not know anything about the structure or function of the Gentoo Council, but I would presume that it is not empowered to voluntell people for specific tasks. Given that assumption, it is not the case that supporting additional arches "uses up" volunteer time and effort, because volunteer time and effort is non-fungible. You cannot assume that time spent on task A could instead be spent on task B.

Put another way: Gentoo's volunteers have the right to work on whatever they want to work on. At some point, they (presumably) wanted to work on ia-64, and now they have changed their mind. It is not necessarily the case that those same volunteers would otherwise have spent that time working on something else. They might instead have worked on ia-64 support in some distro other than Gentoo.

Labor budgeting

Posted Aug 16, 2024 19:13 UTC (Fri) by josh (subscriber, #17465) [Link] (20 responses)

It's still the case that having a policy of supporting a particular target architecture (for instance) uses up additional time from people who *don't* necessarily care about that architecture, if the level of support is such that people who don't care about it still have to do some amount of work to account for it.

While it's certainly true that the time they get back is not the project's to direct, it's still time used up that *could* have been used for something else that those volunteers would have preferred to work on.

This is one reason for being careful when setting technology support policies that force everyone involved in a project to take some particular usage into account (e.g. supported target architectures, OSes, init systems...). For such policies, it's a good idea to have multiple tiers of support, to distinguish between things everyone in a project must support versus things people can largely ignore unless they're one of the people working on them.

Labor budgeting

Posted Aug 17, 2024 0:09 UTC (Sat) by willy (subscriber, #9762) [Link] (18 responses)

Unfortunately, Itanium has already messed up the C standard, and that's never going to be fixed; we must all suffer until we can retire C.

For those not aware, HP got a change into C11 that made it UB to pass an uninitialized pointer to a function, even if that function doesn't dereference it.

The example in Linux is:

void *fsdata;
op->start(&fsdata);
op->end(fsdata);

If the FS doesn't initialise fsdata, the invocation of op->end() is now UB.

Labor budgeting

Posted Aug 17, 2024 0:25 UTC (Sat) by ware (subscriber, #83607) [Link] (14 responses)

We’re going to be able to retire C? Sometime? Anytime? Evidence supports C being a thing until the heat death of the universe.

Labor budgeting

Posted Aug 17, 2024 0:36 UTC (Sat) by willy (subscriber, #9762) [Link] (2 responses)

It took 50 years to come up with a programming language sufficiently good to replace C in the areas where C was our best option. It's like losing weight; we're not going to be able to expunge C overnight. But we can at least make progress. I wouldn't be surprised to still see C around in a hundred years, but I do think there will be less of it.

Labor budgeting

Posted Aug 19, 2024 8:06 UTC (Mon) by error27 (subscriber, #8346) [Link] (1 responses)

In the kernel, we consider passing uninitialized data to a function as a bug unless the function is inlined. In the C standard passing uninitialized data is undefined behavior but Linus thinks it's silly for inlined functions. The UBSan tool works the same way, when you pass uninitialized data to a function then that's counted as reading uninitialized data even though the data is not used, but inline functions are not a read.

Smatch doesn't differentiate between inline and not inline functions. When I'm reporting those bugs, I have to review them by hand. It's a bit tricky to handle this properly. I guess I could just silence the warnings for passing uninitialized parameters to __always_inline functions but it's possible that would hide bugs.

Also in the kernel, everyone sane automatically initializes stack variables to zero these days.

Labor budgeting

Posted Aug 20, 2024 13:54 UTC (Tue) by Baughn (subscriber, #124425) [Link]

He can think he’s silly, and he’s even right, but invoking UB means you’re begging for the compiler to delete some unrelated code or security checks. It’s really not a good idea.

Labor budgeting

Posted Aug 17, 2024 0:56 UTC (Sat) by atnot (subscriber, #124910) [Link]

There's a lot of types of "dead". Dead as in unsexy, dead as in nothing new gets written in it, dead as in nobody learns it, dead as in nobody maintains it, dead as in: nobody ports it to new systems, nobody runs it at all; dead as in people make jokes; nobody gets the joke anymore; it's only discussed in the past tense; it's no longer mentioned at all.

It's hard to imagine C ever going far down this curve as something like PL/1, but fortran or python2 within my working life doesn't seem too far fetched? I don't think those will disappear either any time soon. But as an ffi description language it will probably far outlive anything else around today.

Labor budgeting

Posted Aug 17, 2024 3:24 UTC (Sat) by atai (subscriber, #10977) [Link]

Long Live C!

Wait, no need to say that. C cannot even die.

Retiring C

Posted Aug 17, 2024 10:11 UTC (Sat) by farnz (subscriber, #17727) [Link] (7 responses)

We should, given the current set of interesting potential C and C++ replacements, be able to retire C in about 10 years time, in the same way that we retired COBOL and FORTRAN - a long gentle retirement where new things are no longer written in C, things that are getting new features have a mix of C and NewLang (whether it be Zig, Go, Rust or other), and there's a gradual reduction in the total amount of C code in active use.

See also the retirement of S/360 in favour of merchant silicon; there's still new z/Arch beasts that run old S/360 code, but they're no longer as dominant as they used to be, and new features tend to be implemented in a way that is at worst portable between merchant silicon and z/Arch, even if the legacy core remains S/360 or S/390 code.

Retiring C

Posted Aug 18, 2024 0:16 UTC (Sun) by aragilar (subscriber, #122569) [Link] (6 responses)

Fortran is still being written for new things, there's even a new package manager, and new compilers being developed.

Similarly, there are replacements for some of C's use cases, but there no one thing that replaces all of it.

Retiring C

Posted Aug 18, 2024 16:20 UTC (Sun) by adam820 (subscriber, #101353) [Link] (4 responses)

> Fortran is still being written for new things

I think a lot of people don't know this. It's used a lot in industries where you need to do a lot of fast, accurate high-precision mathematics (HPC, simulations, etc.). Intel [1], and even NVIDIA are selling compilers [2] for it, with support for GPU offloading and all the modern fun stuff people want. But your average new developer isn't really sitting down and cracking into a "FORTRAN for Fun and Profit" book.

[1]: https://www.intel.com/content/www/us/en/developer/tools/o...
[2]: https://developer.nvidia.com/hpc-compilers

Retiring C

Posted Aug 19, 2024 23:10 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (3 responses)

IIRC the main selling point of FORTRAN, for a long time, used to be the fact that C had no restrict, and more recently, the fact that C's restrict is off-by-default. Rust emits most pointers as (the LLVM-IR equivalent of) restrict (i.e. noalias), or at least it has the necessary semantics for doing so (but I keep hearing that they have to turn it off in some cases when bugs are found in LLVM). But I must admit that I don't know much about FORTRAN, so perhaps it has some other advantage over Rust for its use cases.

Retiring C

Posted Aug 20, 2024 6:14 UTC (Tue) by joib (subscriber, #8541) [Link]

Fortran (not FORTRAN unless you specifically mean FORTRAN 77 or older) is nice in that it has out of the box support for multidimensional arrays, and array syntax similar to MATLAB or numpy. Yet it's a compiled language. That combination is rather unusual, and is one reason why Fortran remains popular for numerical code.

Yes, you can do something similar with C++ and a template-heavy library like Eigen. And maybe Rust has something similar in some library as well, I don't know.

Retiring C

Posted Aug 20, 2024 8:33 UTC (Tue) by farnz (subscriber, #17727) [Link]

The main selling point of FORTRAN, back in 2003 or so, was that it retained the punch card layout that the FORTRAN-teaching lecturers at my university believed was vital to "real" programming. If you can't version-control your code as a set of punch card decks, it's not "real code".

The main selling point of Fortran 90 and later is (as you say) that it has noalias by default, combined with multidimensional arrays in a way that's relatively easy to transliterate from mathematical notation. It's also been friendly to vectorization, and experimented with things like FORALL as ways to make vectorization easier (by accessing array elements in unspecified order).

Retiring C

Posted Aug 20, 2024 9:25 UTC (Tue) by aragilar (subscriber, #122569) [Link]

That would be FORTRAN 77 level features, but proper multi-dimensional slicing (which almost makes Fortran IO worth dealing with), support for adjustable precision (c.f. Rust where not using f64 is a challenge, the first of many issues in this area), proper C ABI definition (so no more fighting to bridge Fortran and your scripting language of choice, and so you can ignore Fortran IO), and then a whole host of parallelisation options (from single-node to multi-node).

In terms of replacements, only C and C++ would come close (as likely someone's come up with a DSL using macros/ templates relevant to your area and so the code isn't the worst thing to interact with); Rust's standard library and builtins are worse than C here (though if you're willing to limit yourself to WASM+visualisation, then Rust is quite nice, and I'd like to use it more here), while Zig has a more complete standard library and builtin types.

Neither Python nor Julia are great replacements.

Retiring C

Posted Aug 18, 2024 16:34 UTC (Sun) by farnz (subscriber, #17727) [Link]

Fortran is, but FORTRAN is not; Fortran is Fortran 90 and later, using free-form source input, as opposed to FORTRAN where you had to lay things out to match punched card layouts. But the retirement of old FORTRAN code in favour of a mix of C, Fortran, Python, Julia and other modern languages has been slow.

Incidentally, there's also new COBOL being written, and new COBOL compilers being developed - it's not completely obsolete (the way PL/M is, for example), but rather retired and rarely used for new stuff, just like FORTRAN is rarely used for new stuff, although Fortran is still used.

Labor budgeting

Posted Aug 17, 2024 11:36 UTC (Sat) by Deleted user 129183 (guest, #129183) [Link]

> We’re going to be able to retire C? Sometime? Anytime?

Honestly, I’m afraid that in 2024 the current ways of doing things are too entrenched to do away with such languages as C completely. Perhaps there exists somewhere an alternative universe where Lisp machines were much more successful back in the 70’s, which led to a completely different computing world, but obviously that chance is totally lost to us.

Labor budgeting

Posted Aug 17, 2024 0:39 UTC (Sat) by josh (subscriber, #17465) [Link]

Maybe a while after IA-64 is gone, the C standard might be able to define that behavior. Wouldn't be the first time; long after all the architectures pushing back against it were gone, C finally managed to commit to two's complement.

Labor budgeting

Posted Aug 17, 2024 23:20 UTC (Sat) by ttuttle (subscriber, #51118) [Link]

Can you share some context on this? I'm curious what the history / reasoning / consequences are but I cant turn anything up.

Labor budgeting

Posted Aug 18, 2024 1:14 UTC (Sun) by hvd (guest, #128680) [Link]

You are misrepresenting what happened. Your example was already UB in C99, C99 already generally did not permit reading uninitialized objects. Uninitialized objects have an indeterminate value (6.2.4 "The initial value of the object is indeterminate") which may be a trap representation (3.17.2 "indeterminate value: either an unspecified value or a trap representation"). Reading such a value violates 6.2.6.1 Representations of types / General, which says "Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined."

Due to the way C99 defined trap representations, for some types (not pointers!) trap representations could not exist, or could be determined not to exist, and therefore it was well-defined to read uninitialized values of that type. But that had previously been undefined in C90: C90 defined "undefined behavior" as "Behavior, upon use of a nonponable or erroneous program construct of erroneous data, or of indeterminately valued objects. for which this International Standard imposes no requirements." C99 attempted to be more explicit by introducing the concept of trap representations, in the process made previously undefined code well-defined, and the impact of that change was not understood until after C99 was published, hence the correction in C11 to partially restore C90's behavior.

But that is irrelevant to your example which is undefined in all versions of standard C.

Labor budgeting

Posted Aug 18, 2024 4:53 UTC (Sun) by sam_c (subscriber, #139836) [Link]

As a Gentoo Council member, I'd say both of you are right to a degree.

On the one hand, no, we can't force people to do anything (nor would we want to).

On the other, people expect the Council, at least when there's not an active team handling the architecture that it wants it wound down (and even then we'd likely consider the matter unless it was a very niche port indeed), to give consent to removal and to intervene in getting it over the line if everybody knows it's dead but nobody is quite sure of its status.

In the case of ia64, the writing has been on the wall for quite some time, but deciding to actually get it over with so people can stop worrying about how to handle ia64-specific bugs, waiting on testing on the arch, and so on, is useful for the wider developer pool.

The nichest of niches

Posted Aug 18, 2024 4:55 UTC (Sun) by sam_c (subscriber, #139836) [Link] (19 responses)

One thing I'd note is that ia64 never seemed to have the same base of hobbyists as even other niche arches we support. ia64 was the nichest of niche. I don't see alpha or hppa support going away any time soon, for example.

I leave it to others to analyse why that might be.

The nichest of niches

Posted Aug 18, 2024 10:51 UTC (Sun) by vadim (subscriber, #35271) [Link] (18 responses)

Yup. Itanium seems to be the most boring niche architecture in existence.

It came out well after everyone was already on X86. So there's no weird buses, no quirky flavors, no weird designs, no special enterprise features you couldn't get anywhere else.

It's the same as any old X86 server: PCI, Ethernet, USB, DDR RAM, EFI. In a normal rackmount case. Software is almost exclusively ports of software made for X86. The only difference is that it's got a weird CPU so stuff is less available, slower and buggier than on X86.

The nichest of niches

Posted Aug 18, 2024 12:23 UTC (Sun) by atnot (subscriber, #124910) [Link] (16 responses)

I disagree on the first bit. It's anything but boring. It's probably the most forward looking CPU architecture ever productionized and features lots of interesting designs. It's the last hurrah for cpu designers actually designing novel architectures and you can really tell. That makes it very exciting to me because it's the branching point in history whereafter we have and will probably never see any innovation in ISA design again. It's fun to implement and program recreationally too. The only real fault of Itanium is not looking enough like any of the architectures that preceded it. We, the software world, failed it, have been paying the price ever since and it does feel nice keeping it's memory alive.

But unfortunately as you're still right otherwise. The architecture aside, there's just nothing that cute about old itanium server hardware that makes me want it in the same way I kind of want an SGI onyx or even a Sun Fire system.

The nichest of niches

Posted Aug 18, 2024 12:47 UTC (Sun) by pizza (subscriber, #46) [Link]

> The only real fault of Itanium is not looking enough like any of the architectures that preceded it.

No, the fault of Itanium is that its purported benefits were predicated on a fundamentally flawed premise -- ie that you can statically determine parallelism without knowing anything about the state of the rest of the system and the data being fed in.

> We, the software world, failed it

Not for the lack of trying. Intel (and its primary customers) spent many billions of dollars, but ultimately failed because one can do a far, far better job determining data dependencies at runtime. (At least for a given amount of silicon, as AMD handily demonstrated!)

It's not all bad though; all the compiler tricks that were invented for Itanium benefited "traditional" architectures as well.

(But if you consider that the purpose of Itanium was to kill off investment into non-Intel server/workstation architectures, Itanium was an unmitigated success)

The nichest of niches

Posted Aug 18, 2024 14:37 UTC (Sun) by farnz (subscriber, #17727) [Link] (14 responses)

It's fun to implement and program recreationally too. The only real fault of Itanium is not looking enough like any of the architectures that preceded it

Itanium had two very significant faults:

The assumption that memory bandwidth (throughout the hierarchy) would scale better than out-of-order execution logic, such that a large instruction word (Itanium's instructions are a little over 43 bits long) would be a better choice than pushing wide out-of-order execution.
The assumption that improvements to a compiler's ability to manage data dependencies would not significantly improve instruction scheduling for out-of-order execution. In other words, making the compiler better would benefit Itanium without helping PPro and later significantly.

Neither of these assumptions were true; the first was a matter of mispredicting the future, but several oral histories of Itanium and Pentium Pro make it clear that the second assumption was deliberate on HP and Intel's part, because it would have made the Itanium project non-viable if they'd allowed for the compiler improvements to also improve PPro, Pentium II etc performance, not just Itanium.

The nichest of niches

Posted Aug 18, 2024 20:31 UTC (Sun) by joib (subscriber, #8541) [Link] (13 responses)

Seems they were also quite pessimistic about how good branch prediction would become, considering the extensive support for predication and (compiler-driven) speculation in IA-64. To an extent this is a corollary to your point 1, in that without an accurate branch predictor an OoO engine with a deep register window is kind of pointless. And further, taking advantage of these features further bloated the code, making the memory BW issue worse.

Speculative execution attacks?

Posted Aug 19, 2024 2:50 UTC (Mon) by gmatht (subscriber, #58961) [Link] (12 responses)

I guess that at least means that IA64 is immune to speculative execution attacks. (Though I doubt it is the best in-order CPU available at the moment)

Speculative execution attacks?

Posted Aug 19, 2024 5:41 UTC (Mon) by joib (subscriber, #8541) [Link]

I've seen this point brought up before, but I'm not so sure it holds water.

- IA-64 cpu's still have, and use, branch predictors. So they could be vulnerable to branch-prediction based speculative execution attacks such as Spectre?
- I'm not sure IA-64 makes any particular guarantees that predicated instructions are constant time. So converting branches to predication ("if-conversion" in compiler literature) would still allow timing analysis.
- The compiler-directed speculation could still enable side-channel attacks. It wouldn't be a micro-architectural side channel attack like Spectre, since this speculation would be part of the architectural state, but I don't see why the same kind of timing information can't leak just because it's part of the architectural state.

I suspect the best defense IA-64 has against these types of attacks is that it's such a niche architecture that people haven't bothered developing attacks against it, or even just porting existing side-channel attacks to see if they work.

Speculative execution attacks?

Posted Aug 19, 2024 9:14 UTC (Mon) by farnz (subscriber, #17727) [Link]

Itanium depends very heavily on compiler-controlled speculative execution for performance (including compiler control of use of the branch prediction hardware). It's thus not vulnerable to microarchitectural speculative execution attacks, because the ISA spec doesn't permit them to exist without architectural code that makes them exploitable.

However, it's plausible that in practice, speculative execution vulnerabilities on Itanium (which has explicit speculation, and requires it for high performance) would be worse than on other CPUs (where speculation is an implementation detail of a non-speculative ISA); instead of being able to mitigate in systems software, and fix in future hardware, you'd need to replace all binaries with ones that don't have the speculation vulnerability. And because Itanium depends on compiler-implemented speculation for performance, you're likely to see more inlining of speculation across library boundaries.

The flip-side is that we'd have been likely to have less speculation across security boundaries, because speculation is explicit in Itanium, and it's more likely that people would have observed it and asked "why?" once the dangers of speculation were understood.

Speculative execution attacks?

Posted Aug 22, 2024 8:15 UTC (Thu) by anton (subscriber, #25547) [Link] (9 responses)

Concerning Spectre: IA-64 is vulnerable to Spectre, because 1) it has speculative execution (in the architecture, not microarchitecture), and 2) it has microarchitectural features like caches that persist microarchitectural state even when speculation turns out to be wrong, and that can reveal speculative state to an attacker. Because no existing implementation of IA-64 performs microarchitectural speculation, you can avoid Spectre on IA-64 by recompiling code to use no compiler speculation, but then you throw away much of the benefits of IA-64 over conventional architectures. You get similar performance to what you get when you use a conventional architecture without microarchitectural speculation (e.g., ARM Cortex-A55) or with microarchitectural speculation and automatic compiler-based mitigations for Spectre (e.g., speculative load hardening for Spectre v1): The compiler-based mitigations mostly suppress speculation, too.

BTW, hardware manufacturers should fix Spectre (I think that an "invisible speculation" variant is the best solution), but given that it has been 7 years since they were informed of Spectre and they have not released fixed hardware, it seems that they have decided to ignore the problem.

Concerning why IA-64 flopped and OoO CPUs won: A major reason is that the accuracy of hardware branch prediction improved far beyond what compiler branch prediction could achieve (even with profile feedback), making it possible to speculate profitably far deeper; e.g., recent Intel and AMD microarchitectures have reorder buffers with 512 (Intel) and 448 (AMD) instructions, which is far beyond what a compiler-based scheduler will reorder across in nearly all situations. And this increased instruction window made it possible to find more instruction-level parallelism. I.e., joib's point.

Another reason is that for those regular codes where IA-64 worked well, SIMD usually worked well, too, and was cheaper to implement.

Concerning farnz's points: 1) Memory bandwidth at all levels has scaled better than OoO execution. E.g., AMD's FX-62 (early 2000s) has 22bytes/cycle D-cache bandwidth, while the 9950X has about 128bytes/cycle, i.e., more than 5 times as much, while the maximal IPC has only increased by a factor 2.67. And actually for the programs that I have looked at, the problem of IA-64 is not that it cannot feed the 6 instructions/cycle to the functional units that they could process, but that the compiler does not find 6 useful instructions per cycle to process.

2) OoO CPUs, especially from the last decade, indeed are fast without "the compiler's ability to manage data dependencies", thanks to techniques like "speculative store bypass" (hardware can speculatively execute architecturally later loads before stores even though the store address is not yet known). If compilers really had made great advances in this area, this might have saved IA-64's ass, because IA-64 would have benefited much more from them. Auto-vectorization (i.e., SIMD) benefits from such advances, but auto-vectorization is hit-or-miss.

Speculative execution attacks?

Posted Aug 22, 2024 9:07 UTC (Thu) by farnz (subscriber, #17727) [Link] (8 responses)

Memory bandwidth has not scaled better than OoOE; while the constant factor for memory scaling has been better than for OoOE, the Itanium assumption was that memory bandwidth would have a better scaling curve than OoOE, with OoOE improvements being at best O(log N), while memory bandwidth scaling was at least O(N), if not better. That hasn't been the case - they've both scaled on about the same curve, so assumptions in IA-64 around code density not being a big problem, because memory bandwidth would scale fast enough to cover for it, have turned out false; for IA-64's assumptions to be correct, you'd need the 9950X to have more like 1024 bytes/cycle from D-cache, and you'd need D-cache to be much larger, too (remember that PA-RISC already had L1 caches measured in megabytes per core before IA-64 started, while 9950X doesn't have 100 KiB per core even today).

And the thing about "the compiler's ability to manage data dependencies" is that OoOE also benefits from a compiler that's capable of interleaving independent computations; not by as much as EPIC, but it still benefits. And Itanium was predicated on the idea that OoOE in 1995 was at its upper limit for finding ILP; you needed EPIC to allow the processor to see more ILP, because while compilers would find ILP for EPIC instructions, they would not be able to reorder and rearrange their code to expose more ILP to a deep OoOE engine. That is empirically not true; as compilers improved, both OoOE and EPIC have gained from having ILP exposed by a compiler that interleaves independent computations.

Speculative execution attacks?

Posted Aug 22, 2024 11:34 UTC (Thu) by anton (subscriber, #25547) [Link] (5 responses)

the Itanium assumption was that memory bandwidth would have a better scaling curve than OoOE, with OoOE improvements being at best O(log N), while memory bandwidth scaling was at least O(N), if not better.

Citation needed. And what is N supposed to be? Where is the 1024 bytes/cycle number coming from? A 6-wide IA-64 implementation (every implementation before Poulson (2012)) needs 32bytes/cycle from the I-cache and can perform at most 4 memory accesses per cycle; if they supported the 80-bit FP format with 128-bit loads and stores, they would need 4*16=64bytes/cycle of D-cache accesses, and that's a worst-case calculation.

PA-8200 has 2MB D-cache and 2MB I-cache at 240MHz, and then HPPA went for smaller (but still big) L1 caches in later, faster-clocked CPUs, and then HP switched to Itanium 2 with 16KB D-cache and 16KB I-cache, so obviously the smaller L1 caches were not a red flag for HP and most of their customers.

The vibes that I got at the time was not that OoO was at its upper limit, but that OoO would have lower clocks (and looking at the 21164 vs the Pentium Pro, one could easily believe that), that OoO would cost more transistors that EPIC would invest in more functional units (and indeed, P6, K7/K8 and Willamette were 3-wide while McKinley was 6-wide, but the clock rates of the OoO CPUs were faster), and that the compiler is able to make more intelligent reordering decisions; the latter did not need any then-new compiler capabilities, and to this day OoO schedulers simply issue the oldest ready instruction, which is not very intelligent, but with deep scheduling windows turns out to be superior to more intelligent scheduling of smaller windows. Moreover, compilers often suffer from scheduling barriers (e.g., calls/returns).

Speculative execution attacks?

Posted Aug 22, 2024 12:25 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

This is from discussions with people who worked at Intel while the EPIC concept was being pitched in about 1994/1995. One of the things that was described for EPIC was that it'd reach 4 wide (or wider) comfortably in the first iteration (to be released around end 1996 or early 1997), and that it'd be worthwhile widening it ever further into the future; by the time you get out 25 years from first chip, OoOE is going to struggle to effectively be 8 wide at the same clocks that EPIC uses to be 64 wide, and compiler technology advances would mean that, while current compilers would struggle to use a 4 wide EPIC efficiently, by the time you go 25 years out, compilers, languages and programmers would be good at using the width of a wide EPIC chip, since the top-performing chips would be EPIC chips, not OoOE.

HPPA didn't shrink caches outside of "low cost" chips until after Itanium was due to be originally released as its replacement. And that should have been a warning sign to Intel and HP that perhaps scaling wasn't going the way they anticipated, since Itanium had a code density problem as compared to SPARC or x86 that was expected to be overcome by larger I cache and faster memory interfaces (RDRAM, for example, with higher per-pin bandwidth).

And the problem the "compiler technology advances" side has always had is that a dumb OoOE scheduler with a smart compiler scheduler is, in practice, as effective at extracting static ILP as an in-order EPIC chip with a smart compiler scheduler; every decision the compiler makes to reorder instructions for explicit parallelism can also be left in the instruction stream as implicit parallelism for a dumb OoOE scheduler to extract. Part of the thesis behind EPIC was that there was simply no chance of an OoOE scheduler doing better without significantly more complexity, even if we improved compiler technology, and that's empirically false - if you take your explicit parallelism, and simply rearrange your code to maximize the gap between data dependencies, even a dumb OoOE scheduler like that of the P6 can find all the ILP that a comparable width EPIC design could exploit.

Speculative execution attacks?

Posted Aug 22, 2024 15:55 UTC (Thu) by excors (subscriber, #95769) [Link] (1 responses)

> by the time you go 25 years out, compilers, languages and programmers would be good at using the width of a wide EPIC chip, since the top-performing chips would be EPIC chips, not OoOE

I guess they were half right? What they missed was that those top-performing chips would be GPUs, not CPUs. Not many are VLIW (AMD TeraScale, ARM Midgard, then it fell out of favour), but none are OoOE; and they're all extremely wide, with novel languages and compilers, and programmers are willing to learn the new architecture and completely redesign their algorithms to fully utilise the width.

And I suppose offloading the highly parallelisable tasks to GPUs significantly reduced the demand for CPUs to be good at those tasks - CPUs are left with the stuff that's much harder to parallelise, either because it's an inherently serial task or because it's written in serial languages like C, so people want CPUs that are good at running that mostly-serial code, and the OoOE improvements over the years have been sufficient for that.

Alternative execution models

Posted Aug 22, 2024 18:05 UTC (Thu) by farnz (subscriber, #17727) [Link]

I only half-agree; while we've got models for working with extremely wide processors, they're only used by a small amount of specialized software (for lots of reasons); by and large, GPUs are still mostly used for graphics outside.

And I don't think that EPIC would have shifted this much, even if it had been a success; GPUs are a success because they throw away any chance of single threaded performance in favour of having huge parallelism. Where both OoOE and EPIC hide memory access latency by doing other work in parallel for the same thread, GPUs hide it by switching to another thread the moment you access memory.

Speculative execution attacks?

Posted Aug 22, 2024 17:16 UTC (Thu) by anton (subscriber, #25547) [Link] (1 responses)

Interestingly, yes, in the early 2020s we have been seeing 8-wide OoO microarchitectures, so (probably by pure luck) this part of the prediction became true. However, IA-64 implementations never reached clock parity with the faster-clocking contemporeneous OoO microarchitectures, and the widest IA-64 implementation was 12-wide (Poulson, 2012).

Concerning compilers, languages and programmers, they obviously suffered from wishful thinking. We can see with SIMD how that plays out: compilers try to auto-vectorize (which has to deal with many of the same problems as optimizing for EPIC), but usually fail, and when they succeed, they sometimes produce big slowdowns. Mainstream programming languages have not included features for manual vectorization, and consequently most programmers don't do anything about SIMD.

But assuming that the wishful thinking about vast compiler advances was true (it is not), no, these compiler advances cannot be applied to non-EPIC architectures, because they do not have the required architectural features, such as, e.g., architectural speculation support and a large number of architectural registers (if you reorder that much, lifetimes get longer). So no, the non-EPIC architectures would not benefit as much from these hypothetical compiler advances. However, they don't need to, because hardware does the reordering just fine. But these compiler advances are just hypothetical anyway; even Proebsting's Law is overly optimistic about compiler advances.

Moreover, looking at a case where compilers actually can make good use of EPIC features, namely software pipelined loops, for an OoO machine you just compile the loop straightforwardly, no modulo scheduling and no modulo variable renaming (or rotating register files) needed, and the OoO engine executes the loop at least as well as an in-order engine executes the software-pipelined loop.

Concerning the supposedly low-cost HPPA chips, I expect that most customers preferred the 440MHz PA-8500 with 1MB D and 0.5MB I cache to the 240MHz PA-8200 with 2MB D and 2MB I-cache, and that HP consequently charged more for it (at the same time). Larger memories (including larger caches) have longer latencies, and conversely, for higher clock rates the caches need to be smaller or their latency gets longer. Interestingly, for IA-64 the designers opted for particularly small L1 caches; I have read an explanation why they did that for the D-cache, but it's unclear to me why they also did it for the I-cache; but they obviously were not worried about the larger code size.

Speculative execution attacks?

Posted Aug 22, 2024 18:00 UTC (Thu) by farnz (subscriber, #17727) [Link]

PA-8500 was a stop-gap chip; everything after the PA-8200 was released to cover for the fact that Itanium wasn't ready yet, and to avoid HP falling further behind after Itanium was delayed - Merced was supposed to be shipping in volume by 1998. The low-cost chips I was referring to (that predate Itanium) were the PA-7100LC and PA-7300LC, where HP cut down the chip to reach a price point, and L1 cache size suffered to make it happen (these chips did have L2, though). You would not have bought the PA-7100LC if you could afford the PA-7100, PA-7200 or PA-7150, and you would not have bought the PA-7300LC if you could afford a PA-8000.

Looking at the released Itanium chips for caches is a bad move - the Itanium chips we got were all compromised compared to the intent of the project, because they were too damn big; Merced was supposed to come out on 250 nm, but was too big to fit within the reticle limit of 250 nm, and even after moving to 180 nm, they had to cut bits off the design in order to physically fabricate the thing. Had they not been limited by the reticle size, they'd have made different compromises, because the limiting factor would have been cost, not fabrication technology.

And no, the compiler advances would not have helped OoOE as much as they would help EPIC, but the assumption was that compiler advances that enabled the compiler to "see" more static ILP would only help EPIC, because OoOE of 1995 or so was extracting the maximum amount of ILP that could possibly be extracted from a program (based on hand-written EPIC code versus compiler output from commercial compilers for the P6). However, what we saw is that as compilers got better at extracting ILP for Itanium, they also got better at arranging the object code so that OoOE schedulers could extract more ILP, too - which meant that some of the baseline assumptions about how EPIC would do better than OoOE were proven false.

Just to be clear, a lot of the issues with Itanium weren't obvious on day one of designing it - but given what Intel should have known by 1998, it should have been clear that IA-64 was never going to meet the expectations set in 1993/1994, and should have reconsidered the entire project.

Speculative execution attacks?

Posted Aug 22, 2024 12:03 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

Hmm, but PA-RISC /only/ had 1 level of cache, for most of its CPUs. Last ones had L2 at 32 MiB of L2/core. Todays EPYCs have up to 48 MiB L3/core.

Speculative execution attacks?

Posted Aug 22, 2024 12:31 UTC (Thu) by farnz (subscriber, #17727) [Link]

The last few PA-RISCs were ones that were stop-gaps while HP waited for Itanium to come out. Had Itanium kept to schedule, the last PA-RISC would have been the 8200, since Itanium would have replaced PA-RISC for new machines after 1997. And yes, HP only had one level of large fast cache - that's one of the things that Itanium was expecting to carry over from PA-RISC.

The nichest of niches

Posted Aug 21, 2024 11:33 UTC (Wed) by arnd (subscriber, #8866) [Link]

There have been a lot of other niche architectures that we previously removed from the kernel, some with novel ideas and some clones of others: tile, c6x, blackfin, meta, nds32, cris, score and unicore32.

The one thing those all had in common is that there was a single company behind each one that produced the CPU and provided all the resources to upstream and maintain the port. When that company stopped funding the work, it was effectively dead.

ia64 had the Gelato project and some amount of hobbyists and third-party organizations, but once both Intel and HP stopped funding Linux development, it wasn't much better off than the others I listed above. In contrast, most other old architectures we still have all started as hobbyist projects and only later got corporate support, and those tend to have more people willing to support the port even after the funding stops.