Another round of speculative-execution vulnerabilities
The Downfall attack targets a critical weakness found in billions of modern processors used in personal and cloud computers. This vulnerability, identified as CVE-2022-40982, enables a user to access and steal data from other users who share the same computer. For instance, a malicious app obtained from an app store could use the Downfall attack to steal sensitive information like passwords, encryption keys, and private data such as banking details, personal emails, and messages. Similarly, in cloud computing environments, a malicious customer could exploit the Downfall vulnerability to steal data and credentials from other customers who share the same cloud computer.
A series of patches has landed in the mainline kernel, including one for the gather data sampling
mitigation and one to disable the AVX
extension on CPUs where a microcode mitigation is not available.
"This is a *big* hammer. It is known to break buggy userspace that
uses incomplete, buggy AVX enumeration.
"
Not to be left out, AMD processors suffer from a return-stack overflow
vulnerability, again exploitable via speculative execution; this patch, also just
merged, describes the problem and its mitigation.
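For anyone wanting to check a given system, the kernel reports the status of these mitigations under /sys/devices/system/cpu/vulnerabilities/; on kernels carrying the patches there are gather_data_sampling and spec_rstack_overflow entries. A minimal C sketch that simply reads those sysfs files:

    /* Print the kernel's reported mitigation status for the two new
     * vulnerabilities; the sysfs files only exist on kernels that carry
     * the corresponding patches. */
    #include <stdio.h>

    static void print_status(const char *name)
    {
        char path[256], line[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/vulnerabilities/%s", name);
        f = fopen(path, "r");
        if (!f) {
            printf("%s: no entry (kernel predates the mitigation?)\n", name);
            return;
        }
        if (fgets(line, sizeof(line), f))
            printf("%s: %s", name, line);   /* line already ends in '\n' */
        fclose(f);
    }

    int main(void)
    {
        print_status("gather_data_sampling");    /* Downfall / GDS */
        print_status("spec_rstack_overflow");    /* AMD Inception / SRSO */
        return 0;
    }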
Posted Aug 8, 2023 18:09 UTC (Tue)
by joib (subscriber, #8541)
[Link]
Posted Aug 8, 2023 23:50 UTC (Tue)
by motk (subscriber, #51120)
[Link] (93 responses)
Posted Aug 9, 2023 1:24 UTC (Wed)
by kazer (subscriber, #134462)
[Link] (3 responses)
Posted Aug 9, 2023 1:51 UTC (Wed)
by motk (subscriber, #51120)
[Link] (2 responses)
Posted Aug 9, 2023 18:42 UTC (Wed)
by dezgeg (subscriber, #92243)
[Link] (1 responses)
Posted Aug 10, 2023 0:43 UTC (Thu)
by motk (subscriber, #51120)
[Link]
Posted Aug 9, 2023 2:05 UTC (Wed)
by willy (subscriber, #9762)
[Link] (84 responses)
You wouldn't be happy with a non-speculative CPU in your phone, let alone your laptop, desktop or server.
Posted Aug 9, 2023 2:14 UTC (Wed)
by motk (subscriber, #51120)
[Link]
Posted Aug 9, 2023 5:34 UTC (Wed)
by flussence (guest, #85566)
[Link] (47 responses)
And maybe I'm not happy with having to wait a few seconds to switch apps, or the fact that Firefox no longer works on the phone because the CEO fired all the engineers to buy a 10th mansion, but I know I wouldn't be any happier buying into the "10 copies of chrome and they all want to infantilise you and pick your pocket" way of life. Everyone there seems to be miserable.
Posted Aug 9, 2023 8:07 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (45 responses)
From what I can make out, modern CPUs are "C language execution machines", and C is written to take advantage of all these features with optimisation code up the wazoo.
Get rid of all this optimisation code, get rid of all this speculative silicon, start from scratch with sane languages and chips, ...
Sorry to get on the database bandwagon again, but I would love to go back 20 years, when I worked with databases that had snappy response times on systems with the hard disk going nineteen to the dozen. Yes the programmer actually has to THINK about their database design, but the result is a database that can start spewing results the instant the programmer SENDS the query, and a database that can FINISH the query faster than an RDBMS can optimise it ...
Cheers,
Posted Aug 9, 2023 8:49 UTC (Wed)
by motk (subscriber, #51120)
[Link]
Posted Aug 9, 2023 9:05 UTC (Wed)
by eduperez (guest, #11232)
[Link] (15 responses)
However, those times have long since passed, and there is no use in trying to bring them back. Except for some very specific use cases, it is way cheaper to buy a faster machine than to spend hours upon hours optimizing the code; all that counts is the "return on investment".
You just cannot keep the optimization and attention-to-detail levels of the past with the development speed and costs required by the modern world.
Posted Aug 9, 2023 13:33 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (7 responses)
Posted Aug 9, 2023 13:59 UTC (Wed)
by yodermk (subscriber, #3803)
[Link] (5 responses)
Main drawback is upgrades would require at least a bit of downtime. But, done right, it would be quite brief. The in-process caches would need to warm, though. The other drawback is the absolute need to be sure that no part of the system can crash under any circumstances. But Rust goes a long way in helping you there.
I'm learning Axum (a backend framework for Rust) and hope to be able to implement something like this someday.
Posted Aug 24, 2023 6:13 UTC (Thu)
by ssmith32 (subscriber, #72404)
[Link] (4 responses)
But if you keep the services simple - why bother with all the abstraction? Give them full control, and make them fast.
The real troublemaker is not microservices or distributed systems - it's hosting providers wanting to resell the same time on the same hardware over and over again.
Posted Aug 24, 2023 22:32 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Aug 25, 2023 9:58 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (2 responses)
Couple of questions:
Posted Aug 25, 2023 19:27 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
> Fixed performance instances, in which CPU and memory resources are pre-allocated and dedicated to a virtualized instance for the lifetime of that instance on the host
FWIW, this design has been used from the very beginning. Even with the old Xen-based hypervisor, there was very little sharing of resources between customers. AWS engineers anticipated that the hardware might have issues allowing the state to be leaked between domains, so they tried to minimize the possible impact.
> How is the "each CPU core can only be used by one customer" enforced? Is it just relying on the kernel rarely migrating actively used vCPU threads between hardware threads, or is there scheduler affinity etc applied to enforce it?
CPUs are allocated completely statically to VMs. The current Nitro Hypervisor is extremely simplistic, and it is not capable of sharing CPUs between VMs.
Posted Aug 25, 2023 19:33 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
Thanks for the link - it answers my question in full, and makes it clear that this is something that's architected into AWS. And yes, AWS documentation is a mess - it looks like I didn't find it because I wasn't looking at AWS whitepapers, but at EC2 documentation.
I had hoped that it worked the way you describe, because nothing else would meet my assumptions about how security on this would work, but I have had enough experience to know that when security is involved, hoping that people make the same assumptions as I do is a bad idea - better to see my assumptions called out in documentation, because then there's a very high chance that Amazon trains new engineers to make this set of assumptions.
Posted Aug 11, 2023 0:04 UTC (Fri)
by khim (subscriber, #9252)
[Link]
That one is easy. There's just no one left who may care. Everyone is trying to solve their own tiny, insignificant task. And the fact that all these non-solutions to non-problems, when combined, create something awful… who would even notice that, let alone fix it? Testers? They are happy if they have time to look at the pile of pointless non-problems in the bugtracker! Users? They are not the ones who pay for the software nowadays. Advertisers do that and they couldn't care less about what users experience.
Posted Aug 9, 2023 16:09 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (4 responses)
Which language has the motto "if you make the right thing to do, the easy way to do it, then people will do the right thing by default".
Going back to one of my favourite war stories, where the consultants spent SIX MONTHS optimising an Oracle query so it could outperform the Pick system it was replacing. I'm prepared to bet that Pick query probably took about TEN MINUTES to write. (And the Oracle system, a twin Xeon 800, was probably 20 times more powerful than the Pentium 90 it was replacing!)
Pick "tables" are invariably 3rd or 4th normal form, because that's just the obvious, easy way to do it. Sure, you have to specify all your indices, but if you put an index on every foreign key, you've pretty much got everything of any importance - a simple rote rule that covers 99% of cases. (And no different from relational, you tell Pick it's (probably) a foreign key by indexing it, you tell an RDBMS to index it by telling it it's a foreign key. A distinction without a difference.)
Oh - and if the modern world requires horribly inflated development speeds and costs, that's their hard cheese. With your typical RDBMS project coming in massively over time and budget, surely going back to a system where the right thing is the obvious thing will massively improve those stats! Most of my time at work is spent debugging SQL scripts and Excel formulae - that's why I want to get Scarlet in there because, well, what's the quote? "Software is either so complex there are no obvious bugs, or so simple there are obviously no bugs, guess which is harder to write." Excel and Relational are in the former category, Pick is in the latter. More importantly, Pick actually makes the latter easy!
Cheers,
Posted Aug 10, 2023 5:58 UTC (Thu)
by fredrik (subscriber, #232)
[Link] (1 responses)
Ditto for Pick: what is it? Link? Thanks!
Posted Aug 10, 2023 8:07 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
https://github.com/geneb/ScarletDME
https://en.wikipedia.org/wiki/Pick_operating_system
Google groups openqm, scarletdme, mvdbms, u2-users, I guess there are more ...
Go to the linux raid wiki to get my email addy, and email me off line if you like ...
Pick/MV is like Relational/SQL - there are multiple similar implementations.
Cheers,
Posted Aug 11, 2023 0:17 UTC (Fri)
by khim (subscriber, #9252)
[Link] (1 responses)
How would that work? Let's consider the three most important stats (in increasing order of importance): But where is the money in all that?
Posted Aug 11, 2023 0:48 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Aug 10, 2023 15:55 UTC (Thu)
by skx (subscriber, #14652)
[Link] (1 responses)
I have a single-board Z80-based system on my desk, running CP/M, these days. I tinker with it - I even wrote a simple text-based adventure game in assembly and ported it to the Spectrum.
But you're right, those days are gone outside small niches. Having time and patience to enjoy the retro-things is fun. But it's amazing how quickly you start to want more. (More RAM, internet access, little additions that you take for granted these days like readline.)
Posted Aug 11, 2023 5:47 UTC (Fri)
by eduperez (guest, #11232)
[Link]
Yes, that was my first "computer", back when I was fourteen.
Posted Aug 9, 2023 10:02 UTC (Wed)
by roc (subscriber, #30627)
[Link] (1 responses)
Some people commenting here claim they'd be happy with much lower performance. That's fine, but most people find some Web sites and phone apps useful, and those need high single-thread performance.
Posted Aug 11, 2023 0:23 UTC (Fri)
by khim (subscriber, #9252)
[Link]
Nope. Not even close. Web sites would be equally sluggish no matter how much speculation your CPU does, simply because there is no one who cares to make them fast. If speculation had been outlawed 10 or 20 years ago and all we had were fully in-order 100MHz 80486s… they would have worked at precisely the same speed they work today on 5GHz CPUs. The trick is that it's easy to go from a sluggish website on a 100MHz 80486 device to a sluggish web site on a 5GHz device, but it's not clear how you can go back, or if that's even possible at all.
Posted Aug 9, 2023 18:55 UTC (Wed)
by bartoc (guest, #124262)
[Link] (19 responses)
It's not at all clear what the "better" alternative to C is either, without sacrificing a ton of usability. Sure rust is "better" than C, but ultimately it shares the same fundamental execution model. One could argue GLSL/WGSL/HLSL/etc, but the things that those languages lack from the C execution model (mutual recursion, an ABI, a stack to which registers can be spilled, etc) are seen as things holding them back, precisely because those things make shader languages less dynamic than C, and thus require absolute explosions of up-front code generation, with all the compile time and I$ issues that brings.
Posted Aug 9, 2023 20:51 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
The problem with C isn't that real workloads are unpredictable. The problem with C is that the language behaviour is undefined and unpredictable. If you're writing something simple, there's not much difference between languages. Except that few problems are simple, C gives you very little help to cope, and indeed it's full of landmines that will explode at every opportunity.
Writing bug-free code in C is much harder than most other languages ...
Cheers,
Posted Aug 10, 2023 9:40 UTC (Thu)
by anton (subscriber, #25547)
[Link] (17 responses)
I don't see that this has much to do with the programming language. Rust is as vulnerable to Spectre and Downfall as C is AFAICS. The only influence I see is that for a language like JavaScript that always bounds-checks array accesses, you have an easier time adding Spectre-V1 mitigations. But for Rust, which tries to optimize away the bounds-check overhead, you end up either putting in Spectre-V1 mitigation overhead (can this be done automatically?), slowing it down to be uncompetitive with C, or it is still Spectre-V1 vulnerable. Admittedly, adding mitigations cannot be done automatically in C, because the compiler has no way to know the bounds in all cases.
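To make the mitigation in question concrete, here is a minimal C sketch of the index-masking idiom (the same general idea as the kernel's array_index_nospec()); it assumes a power-of-two table size and is only meant to show how a data dependency, rather than a branch alone, keeps a misspeculated access in bounds:

    /* Minimal sketch of Spectre-V1 index masking: the branch may be
     * speculated past, but the AND ties the index to the bounds check
     * with a data dependency, so the speculative load stays in bounds.
     * Assumes TABLE_SIZE is a power of two. */
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 256   /* must be a power of two for this masking trick */

    static uint8_t table[TABLE_SIZE];

    uint8_t lookup(size_t untrusted_index)
    {
        if (untrusted_index >= TABLE_SIZE)
            return 0;
        /* Mask rather than trust the (possibly misspeculated) branch. */
        return table[untrusted_index & (TABLE_SIZE - 1)];
    }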
The way to go is that the hardware manufacturers must fix (not mitigate) Spectre. They know how to avoid misspeculated->permanent state transitions for architectural state (Zenbleed is the exception that proves the rule), now apply it to microarchitectural state!
Posted Aug 10, 2023 11:52 UTC (Thu)
by excors (subscriber, #95769)
[Link] (2 responses)
If memory latency is predictable then I think it's much easier for the compiler to statically schedule the instructions, and the CPU can be much simpler while maintaining decent performance. But that only seems practical with very small amounts of memory (e.g. microcontrollers with single-cycle latency to SRAM, but only hundreds of KBs) or very large numbers of threads (e.g. GPUs where each core runs 128 threads with round-robin scheduling, so each thread has 128 cycles between consecutive instructions in the best case, which can mask a lot of memory latency), not for general-purpose desktop-class CPUs.
Posted Aug 10, 2023 15:14 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
Even with PGO, control flow still has largely unpredictable regions, which depend upon the details of user input, and can only be predicted at compile time if the exact input the user will use is provided at compile time. This was one component of why Itanium's EPIC never lived up to its performance predictions; as compilers got better at exploiting compile-time known predictability, they also benefited OoO and speculative execution machines, which could exploit predictability that only appears at runtime.
For example, in a H.264 encoder or decoder, black bars are going to send your binary down a highly predictable codepath doing the same thing time and time again; your PGO compiled binary is not going to be set up on the assumption of black bars, because that's just one part of the sorts of input you might get. However, at runtime, the CPU will notice that you're going down the same codepath over and over again as you handle the black bars, and will effectively optimize further based on the behaviour right now. Once you get back to the main picture, it'll change the optimizations it's applying dynamically, because you're no longer going down that specific route through the code.
Posted Aug 10, 2023 16:03 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Concerning memory latency, I also see very good speedups of out-of-order over in-order for benchmarks like the LaTeX benchmark which rarely misses the L1 cache.
Also, the fact that the Itanium II designers chose small low-latency L1 caches, while OoO designers went for larger and longer-latency L1 caches (especially in recent years, with the load latency from D-cache growing to 5 cycles in Ice Lake ff.) shows that the unpredictability is a smaller problem for compiler scheduling than the latency.
The dream of static scheduling has led a number of companies (Intel and Transmeta being the most prominent) to spend substantial funds on it. Dynamic scheduling (out-of-order execution) has won for general-purpose computing, and the better accuracy of dynamic branch prediction played a big role in that.
With regard to Spectre and company, compiler-based speculation would exhibit Spectre vulnerabilities just as hardware-based speculation does. Ok, you can tell the compiler not to speculate, but that severely restricts your compiler scheduling, increasing the disadvantage over OoO. Better fix Spectre in the OoO CPU.
Posted Aug 11, 2023 13:26 UTC (Fri)
by atnot (subscriber, #124910)
[Link] (13 responses)
You're thinking much too narrowly here in terms of what "C" is in this context. It has far less to do with the specific syntax and more with the general model of computation that derives from the original PDP11, i.e.:
Programs are a series of commands, whose effects become visible in order from top to bottom. The sequence of these commands can be arbitrarily replaced using a specific command, called a "branch". There is a singular, uniform thing called "memory", which is numbered from zero to infinity and you can create a valid read-write reference to any of it by using that number. And so on.
None of this is true internally for any modern compute device. It isn't even true for C anymore. But it was true for the creators of C, and as a result these assumptions were baked very deeply into the language, then tooling like gcc and LLVM, then languages that use that tooling like Rust, OpenCL, CUDA, and then architectures that wanted to be easily targetable by those tools, like RISC-V and, most notably, AMD64 (as opposed to its Itanic cousin). It's so established that people don't even recognize these as specific design choices anymore, it's just "how computers work".
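As a toy illustration of that flat, byte-numbered memory assumption (my sketch, not part of the comment above): in the C model any object is just a run of addressable bytes that ordinary pointer arithmetic can walk, which is precisely what segmented or capability-based designs do not hand out so freely.

    /* Toy illustration of the "one flat, byte-numbered memory" model:
     * any object is just a run of addressable bytes that plain pointer
     * arithmetic can walk. */
    #include <stdio.h>
    #include <string.h>

    struct point { int x, y; };

    int main(void)
    {
        struct point p = { 3, 4 };
        unsigned char bytes[sizeof p];

        /* View the struct as raw bytes at consecutive addresses. */
        memcpy(bytes, &p, sizeof p);
        for (size_t i = 0; i < sizeof p; i++)
            printf("%02x ", bytes[i]);
        printf("\n");
        return 0;
    }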
Rust is definitely a step away from C, and one that has at least some potential to improve how chips are designed in the future, if the tooling allows for it. But it's not a very big step in the grand scheme of things.
Posted Aug 11, 2023 22:40 UTC (Fri)
by jschrod (subscriber, #1646)
[Link] (12 responses)
That you have statements that are executed from top to bottom, where preconditions, invariants, and postconditions exist, is basically the foundation of theoretical computer science. No proof about algorithmic semantics or correctness would work without that assumption.
If this, in your own words, isn't true any more - can you please point me to academic work that formalizes that new behavior and its semantics?
After all, I cannot believe that theoretical computer scientists have ignored this development. It is too good an opportunity to write new articles for archival journals.
If no computer science work is published on your claim, can you please explain why research is ignoring this development?
Posted Aug 12, 2023 11:50 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (11 responses)
I reject this outright. There is an absolute world of wonderful compute models in between a Turing machine and C or a PDP11. Many of them are von Neumann machines, even. This should be clear from the fact that the key problem of the C model is that it's hard to model formally (see e.g. the size of the C memory model specification), while the Turing machine was purpose-designed for formal modelling.
Let me just give some examples:
For a very soft start, we can look at something like the 6502, which is generally pretty boring apart from treating the first 256 bytes of memory specially. Largely because of this, it is not supported upstream in any of the big C compilers.
Then we can look at something like Itanium, in which bundles of instructions are executed in parallel and can not see the effects of each other, along with things like explicit branch prediction, speculation and software pipelining.
This is actually pretty similar to modern CPUs, except instead of having it explicitly encoded in the instruction stream, they try to re-derive that information at runtime, often by just guessing.
Then we have things like GPUs, which have multiple tiers of shared memory and primarily work with masks instead of branches. Although they are slowly becoming more C-like as people seek to target them with C and C++ code.
There's also a whole bunch of architectures like ARM CHERI and many with memory segmentation, where addresses and pointers are not the same thing.
We can also talk about various other things like lisp machines, Mill, Transmeta, EDGE and many more things I'm forgetting.
Then, even further afield, you can find things like FPGAs, which are programmed using a functional specification of behavior much like TLA+. (The current fad is, of course, trying to run C on them, to limited success.)
Now if you say "But most of these are all obscure architectures nobody uses", then yes that's the point. It's because they don't look enough like K&R's PDP11. Itanium is far from the only innovative architecture that C killed and as primarily a hardware person, that's deeply frustrating.
Posted Aug 12, 2023 15:05 UTC (Sat)
by anton (subscriber, #25547)
[Link] (6 responses)
Why should the zero page of the 6502 be a reason not to support the 6502 in "big" C compilers? They can use the zero page like compilers for other architectures use registers (which leave no trace in C's memory model, either). Besides, there are C compilers for the 6502, like cc65 and cc64, so there is obviously no fundamental incompatibility between C and the 6502. The difficulties are more practical, stemming from having zero 16-bit registers, three 8-bit registers, only 256 bytes of stack, no stack-pointer-relative addressing, etc.
Concerning IA-64 (Itanium), this certainly was designed with existing software (much of it written in C) in mind, and there are C compilers for it, I have used gcc on an Itanium II box, and it works. C has not killed IA-64, out-of-order (OoO) execution CPUs have outcompeted it. IA-64, Transmeta and the Mill are based on the EPIC assumption that the compiler can perform better scheduling than the hardware, and it turned out that this assumption is wrong, largely because hardware has better branch prediction, and can therefore perform deeper speculative execution.
And the fact that OoO won over EPIC shows that having an architecture where instructions are performed sequentially is a good interface between software (not just written in C, but also, e.g., Rust) and hardware, an interface that allows adding a lot of performance-enhancing features under the hood.
Concerning Lisp machines, they were outcompeted by RISCs, which could run Lisp faster; which shows that they are not just designed for C. There actually was work on architectural support for LISP in SPUR, and some of it made it into SPARC, but one Lisp implementor wrote that their Lisp actually did not use the architectural support in their SPARC port, because the cost/benefit ratio did not look good.
Concerning GPUs, according to your theory C should have killed them long ago, yet they thrive. They are useful for some computing tasks and not good for others. In particular, let me know when you have a Rust or, say, CUDA compiler or OS kernel (maybe one written in Rust or CUDA) running on a GPU.
Posted Aug 14, 2023 9:50 UTC (Mon)
by james (subscriber, #1325)
[Link] (5 responses)
Would the conservative and increasingly security-sensitive server world have adopted the position that OoO couldn't be trusted? (Once Itanium was released, Intel would almost certainly have made that part of their marketing message.) In 2018, when in this timeline Meltdown and Spectre were discovered, the consensus of the security community was that more such attacks would be discovered, and time has sadly proven that to be correct — but we now have no other realistic option but to live with it.
We had other options around 2000 — then-current in-order processors (from Sun, for example).
The triumph of OoO looks much more like an accident of history rather than something inherent to computer science to me.
Posted Aug 14, 2023 11:05 UTC (Mon)
by anton (subscriber, #25547)
[Link] (4 responses)
By contrast, neither Intel nor AMD (nor AFAIK ARM or Apple) has fixed Spectre in the more than 6 years since they were informed of it. This indicates that these CPU manufacturers don't believe that they can sell a lot of hardware by being the first to offer hardware with such a fix (and making it a part of their marketing message). So they think that few of their customers care about Spectre. But if they thought that many customers care about Spectre, they would design OoO hardware without Spectre.
As for IA-64, it has architectural features for speculative loads, and is therefore also vulnerable to Spectre. This vulnerability can probably be mitigated by recompiling the program without using speculative loads (if we assume that the hardware does not perform any speculative execution, it's good enough to perform the speculative load and then not use the loaded data until the speculation is confirmed; for security the speculatively loaded data should be cleared in case of a failed speculation). This mitigation would reduce the performance of Itanium CPUs to be close to the performance of architectures without these speculative features, i.e., even lower than the Itanium performance that we saw.
OoO certainly has other options wrt. Spectre than to live with it. Just fix it. All the OoO hardware designers (the Zen2 designers are the exception that proves the rule) are able to squash speculative architectural state on a misprediction; they now just need to apply the same discipline to speculative microarchitectural state. E.g., if they had squashed the speculative branch predictor state on a misprediction, there would be no Inception, and if they had squashed the speculative AVX load buffer state on a misprediction, there would be no Downfall.
Posted Aug 15, 2023 4:26 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link] (3 responses)
A branch predictor which isn't allowed to learn, wouldn't that just be a rather useless static branch predictor, like "always predict backwards" or "as hinted by the machine code"?
Posted Aug 15, 2023 11:01 UTC (Tue)
by anton (subscriber, #25547)
[Link] (2 responses)
A straightforward way is to learn from completed (i.e. architectural) branches, with the advantage that you learn from the ground truth rather than speculation.
If that approach updates the branch predictor too late in your opinion (and for the return predictor that's certainly an issue), a way to get speculative branch predictions is to have an additional predictor in the speculative state, and use that in combination with the non-speculative predictor. If a prediction turns out to be correct, you can turn the part of the branch predictor state that is based in that prediction from speculative to non-speculative (like you do for architectural state); if a prediction turns out to be wrong, revert the speculative branch predictor state to its state when the branch was speculated on (just like you do with speculative architectural state).
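A toy software model of that commit/rollback idea (my own simplification, not a hardware design): a single 2-bit counter with a speculative overlay that is promoted when the prediction resolves as correct and discarded on a misprediction.

    /* Toy model of a 2-bit branch predictor with a speculative overlay:
     * updates made while speculating go into spec_ctr and are only
     * promoted to the committed predictor when the speculation resolves
     * as correct; on a misprediction the overlay is simply discarded. */
    #include <stdbool.h>
    #include <stdio.h>

    struct predictor {
        unsigned committed_ctr;   /* 0..3, non-speculative (ground truth) state */
        unsigned spec_ctr;        /* 0..3, state including speculative updates */
    };

    static bool predict(const struct predictor *p)
    {
        return p->spec_ctr >= 2;          /* predict taken if counter is 2 or 3 */
    }

    static void train(unsigned *ctr, bool taken)
    {
        if (taken && *ctr < 3)
            (*ctr)++;
        else if (!taken && *ctr > 0)
            (*ctr)--;
    }

    /* Outcome observed while executing down a predicted (speculative) path. */
    static void update_speculative(struct predictor *p, bool taken)
    {
        train(&p->spec_ctr, taken);
    }

    /* The branch finally resolves: either promote or discard the overlay. */
    static void resolve(struct predictor *p, bool speculation_was_correct,
                        bool actual_taken)
    {
        if (speculation_was_correct) {
            p->committed_ctr = p->spec_ctr;          /* promote the overlay */
        } else {
            train(&p->committed_ctr, actual_taken);  /* learn from ground truth */
            p->spec_ctr = p->committed_ctr;          /* squash the overlay */
        }
    }

    int main(void)
    {
        struct predictor p = { .committed_ctr = 2, .spec_ctr = 2 };

        printf("prediction: %s\n", predict(&p) ? "taken" : "not taken");
        update_speculative(&p, true);     /* outcome seen while speculating */
        resolve(&p, false, false);        /* speculation was wrong: squash */
        printf("after squash: committed=%u spec=%u\n",
               p.committed_ctr, p.spec_ctr);
        return 0;
    }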
Posted Aug 15, 2023 14:26 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link] (1 responses)
Why wouldn't such a branch predictor always give the initial answer? If correct, it would be sensible to stick to it and if wrong, you want to ignore that and revert to the state of the last correct guess or the initial state.
Assuming you want to apply that to a binary taken/not-taken branch predictor and not only to branch target predictors?
Posted Aug 15, 2023 15:23 UTC (Tue)
by anton (subscriber, #25547)
[Link]
In more detail: If the mispredicted branch is non-speculative, you record it in the non-speculative predictor. If the mispredicted branch is still in the speculative part of execution (that would mean that you have a CPU that corrects mispredictions out-of-order; I don't know if real CPUs do that), you record it in the speculative part, and when this branch leaves the speculative realm, this record can also be propagated to the non-speculative predictor.
Posted Aug 12, 2023 16:51 UTC (Sat)
by farnz (subscriber, #17727)
[Link] (3 responses)
Itanium failed to outperform AMD64 on hand-coded assembly as well as on C code. It wasn't killed by the C model, it was killed by a failure to deliver performance greater than other CPUs. VLIW CPUs like Transmeta failed because VLIW code is inherently low-density in memory, and our current bottleneck for performance tends to be L1 cache size. Mill has never reached a point where hand-written code in simulation outperforms hand-written code for AMD64 given the same simulated resources as AMD64. EDGE is an ongoing research project, and may (or may not) prove worthwhile - there's certainly not been an effort to build a good EDGE CPU that can be compared to something "C-friendly" like RISC-V.
Similar failures apply to Lisp Machines. While they had dedicated hardware to make running Lisp code faster, they lost out because RISC CPUs like SPARC and MIPS were even faster at running Lisp code for a given energy input than Lisp Machines were. Again, not about programming model, but about the Lisp Machines being worse hardware for running Lisp than MIPS or SPARC.
In terms of competing models of computation that have actually made it to retail sale, FPGAs are a commercial success, but are not programmed like CPUs, because they're defined as a sea of interconnected logic gates, and you are better off exploiting that via a Hardware Description Language than via something like C, FORTRAN or COBOL. GPUs are a commercial success; individual threads on a GPU are similar to a CPU with SIMD, with many threads per core (8 on Intel, more on others), and a hardware thread scheduler that allows you to have a pool of cores sharing thousands or even hundreds of thousands of threads.
None of this is about the "C model"; underpinning all of the noise is that humans struggle to coordinate concurrent logic in their heads, and prefer to think about a small number of coordination points (locks, message channels, rendezvous points, whatever) with a single thread of execution between those points. OoOE with speculative execution is one of the two local minima we've found for such a mental model of programming, and supports the case where a single thread of logic is the bottleneck. The other model that works well is the workgroup model used by GPU programming, where something distributes a very large number of input values to a pool of workers, and lets the workers build a large number of output values. Between the input and output values, there's very little (if not no) coordination between workers.
And while the 6502 is not supported upstream in any of the big C compilers, nor are many other CPUs of the same vintage. The Z80 is not supported in any of the big C compilers, nor is the 6809, for example, and both of those were big selling CPUs at the time the 6502 was current; the Z80 is also a lot friendlier to C than the 6502, since the Z80 does not limit you to a single 256 byte stack at a fixed location in memory, whereas the 6502 has a 256 byte stack fixed in page 1. I've never personally programmed a 6809 system, but I believe that it's also a lot more C friendly than the 6502.
Fundamentally, the thing that has killed every alternative to date is that the surviving processor types are simply faster for commercially significant problems than any competitor was, even with alternative programming models. This applies to VLIW, and to EPIC, and to Lisp Machines.
Posted Aug 14, 2023 20:08 UTC (Mon)
by mtaht (subscriber, #11087)
[Link] (2 responses)
Weirdly enough I do not care about IPC, what I care about is really rapid context and priv switching, something that unwinding speculation on the TLB flush on spectre really impacted. I am tired of building processors that can only go fast in a straight line. And like everyone here, tired of all these vulnerabilities.
The Mill held promise of context or priv switching in 3 clocks. The implicit zero feature and byte level protections seemed like a win. But it has been a long 10+ years since that design was announced; have there been any updates?
Posted Aug 14, 2023 21:52 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 17, 2023 14:43 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
It's a while since I saw the information (around 10 years), so I don't have links to hand, and it was investor-targeted. They seemed to be making the same mistake as Itanium designers, though - they compared hand-optimized code on their Mill simulator to GCC output on a then current Intel chip (Haswell, IIRC), showing that simulated Mill was better than GCC output on Haswell. The claim was that compiler improvements needed for Mill would bring Mill's performance on compiled code ahead of Haswell's performance; but it failed to take into account that, with a lot of human effort, I could get better performance from Haswell with hand-optimized code than they got with GCC output, using GCC's output as a starting point.
I am inherently sceptical of "compiler improvement" claims that will benefit one architecture and not another; while I'll accept that the improvement is not evenly distributed, until Mill Computing can show that their architecture with their compiler can outperform Intel, AMD, Apple, ARM or other cores with a modern production-quality (e.g. GCC, LLVM) compiler for the same language, I will tend towards the assumption that anything that they improve in the compiler will also benefit other architectures.
This holds especially true for compiler improvements around scheduling, which is what Mill depends upon, and what Itanium partially needed to beat OoOE - improvements to scheduling of instructions benefit OoOE by making the static schedule closer to optimal, leaving the OoOE engine to deal with the dynamic issues only, and not statically predictable hazards.
Posted Aug 10, 2023 23:59 UTC (Thu)
by khim (subscriber, #9252)
[Link] (4 responses)
And that's the beginning and the end. Most people out there don't want to think. And once these people have taken over… the whole house of cards started unraveling. Today people don't want to think… about anything, really. They ignore as much as they can and concentrate on what's profitable. Only… you can't eat paper, and the zeros and ones in central banks' servers are even more useless. It would be interesting to see if we find a way to avoid the collapse of western civilisation, but the chances are not good: most people not only don't understand why it's collapsing, they don't even notice that the collapse has not merely started but is well underway.
Posted Aug 11, 2023 15:10 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (3 responses)
A couple of days ago we had an article about Drax in one of our daily newspapers - so we're talking maybe 20-30% of newspaper readers reading this article.
A major part of the story is about the power station shutting down and avoiding having to pay rebates to consumers - some government subsidy that had to be repaid if they were generating and selling electricity above a certain price. So they shut down and sold their fuel elsewhere instead.
That fuel being woodchip. So a second, large, part of the journalist's story was about how Drax was one of our biggest greenhouse gas emitters and polluters in the country! The eco-friendliness of shipping the wood from Canada is certainly up for debate, but burning wood? That's one of the greenest fuels we've got!
When journalists - who are supposed to inform the public! - get their facts so badly out of kilter, what hope do the public have?
Cheers,
Posted Aug 11, 2023 15:26 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Particularly if that wood is coming from old wood forests that are being cleared. I don't know the details of Canadian wood pulp, but IVR a lot of their wood is from clearing old woods.
A final issue is that commercial forestry (least in UK and Ireland) is from dense pine forestry plantations, which is kind of a disaster for the native ecosystem. Really, we need to reforest our denuded countries (UK and Ireland) with natural, long-life forests - really good carbon capture and storage!
Which means we need something else for power. Something that is a lot more space efficient than covering the country in dense commercial and largely dead pine forests (which probably still won't give us enough fuel). The answer is obvious, but greens have irrational dogma.
Posted Aug 11, 2023 16:14 UTC (Fri)
by joib (subscriber, #8541)
[Link] (1 responses)
Burning biomass is, in the end, a very inefficient way of turning sunlight into usable energy. There just isn't enough arable land on the planet to replace the energy we currently get from fossil fuels. There are other very low carbon energy production technologies that are much more area efficient, like wind, solar and nuclear energy.
Anyway, this isn't the correct forum to debate this. ;)
Posted Aug 14, 2023 8:35 UTC (Mon)
by paulj (subscriber, #341)
[Link]
I consider myself pretty green, but I abhor the common "green" stance on nuclear power. Which is completely at odds with having both a) a biodiverse and sustainable planet and b) a modern way of life ("modern" implies high energy use in many many ways, and only nuclear can reliably replace fossil fuels to provide this). If you make society choose between A and B, society will choose B. Sigh sigh sigh.
Posted Aug 11, 2023 16:32 UTC (Fri)
by DemiMarie (subscriber, #164188)
[Link]
Posted Aug 10, 2023 22:41 UTC (Thu)
by dfc (subscriber, #87081)
[Link]
Posted Aug 9, 2023 6:39 UTC (Wed)
by ibukanov (subscriber, #3942)
[Link] (3 responses)
Posted Aug 9, 2023 10:10 UTC (Wed)
by excors (subscriber, #95769)
[Link] (2 responses)
They're curves because power is a non-linear function of frequency. There's some overlap where a lower-end CPU near its max frequency has worse power than a higher-end CPU at equal performance.
Draw a straight line through the origin and tangential to the purple (middle) curve. That should represent the optimal power/performance ratio for the Cortex-A715. By my rough measurements on this questionably-precise graph, the Cortex-A510 curve shows better power/performance than that when it's about 20%-60% of its max performance.
So if you're trying to optimise power/performance, and you're happy with <15% of the Cortex-A715's max performance - maybe your task doesn't need to complete quickly, or maybe you've got an embarrassingly parallel problem and can spread it over 6x as many cores with no extra overhead - then the Cortex-A510 seems worthwhile. But if you need even slightly more than that, and would have to drive the Cortex-A510 at a higher frequency, you'll get better efficiency *and* 3x better performance by switching to the Cortex-A715 at half its max frequency.
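A small sketch of that reasoning with entirely made-up operating points (the numbers are illustrative only, not read off the graph): compute performance per watt at each point, and the tangent-through-the-origin optimum is simply the point on a curve with the best ratio.

    /* Illustrative only: compare performance-per-watt at a few made-up
     * operating points for a small and a big core. The best efficiency on
     * a curve corresponds to the point where a line through the origin is
     * tangent to it. */
    #include <stdio.h>

    struct op_point { double perf; double watts; };   /* arbitrary units */

    static void best_ratio(const char *name, const struct op_point *pts, int n)
    {
        double best = 0.0;
        int best_i = 0;

        for (int i = 0; i < n; i++) {
            double ratio = pts[i].perf / pts[i].watts;
            if (ratio > best) {
                best = ratio;
                best_i = i;
            }
        }
        printf("%s: best perf/W = %.2f at perf %.2f, %.2f W\n",
               name, best, pts[best_i].perf, pts[best_i].watts);
    }

    int main(void)
    {
        /* Hypothetical numbers: a small in-order core and a big OoO core. */
        const struct op_point small_core[] = {
            { 0.5, 0.10 }, { 1.0, 0.25 }, { 1.5, 0.60 }, { 2.0, 1.20 },
        };
        const struct op_point big_core[] = {
            { 2.0, 0.50 }, { 4.0, 1.20 }, { 6.0, 2.50 }, { 8.0, 5.00 },
        };

        best_ratio("small core", small_core, 4);
        best_ratio("big core",   big_core,   4);
        return 0;
    }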
Posted Aug 10, 2023 9:48 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Posted Aug 11, 2023 13:54 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link]
Posted Aug 9, 2023 9:48 UTC (Wed)
by epa (subscriber, #39769)
[Link] (5 responses)
Posted Aug 9, 2023 10:45 UTC (Wed)
by mb (subscriber, #50428)
[Link] (1 responses)
Yes, but then, if I click a button, I want today's massive software stack triggered by this action to run as fast as possible. Otherwise it becomes non-interactive.
Posted Aug 9, 2023 13:53 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted Aug 9, 2023 10:50 UTC (Wed)
by adobriyan (subscriber, #30858)
[Link] (2 responses)
> I might be happy with a slower, non-speculative CPU for most use.
> High-performance code for gaming or video decoding (or perhaps a kernel compile) can be explicitly tagged as less sensitive, and scheduled on a separate high-performance core.
A full x86_64 allmodconfig build takes about 3.5-4 hours on 1 core, and the kernel is not the slowest project to build.
Developers still need _many_ fast cores for parallel compilation.
Posted Aug 9, 2023 14:16 UTC (Wed)
by yodermk (subscriber, #3803)
[Link] (1 responses)
Posted Aug 9, 2023 21:45 UTC (Wed)
by DemiMarie (subscriber, #164188)
[Link]
Posted Aug 9, 2023 14:00 UTC (Wed)
by eru (subscriber, #2753)
[Link] (2 responses)
Posted Aug 9, 2023 14:30 UTC (Wed)
by excors (subscriber, #95769)
[Link]
Posted Aug 9, 2023 14:32 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
The last Intel x86 CPU with no speculative execution at all was the 80486. The Pentium (in 1993) had a very limited amount of speculative execution driven by a dynamic branch predictor, and it's just grown from then on.
Posted Aug 10, 2023 16:07 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link] (16 responses)
Posted Aug 10, 2023 16:34 UTC (Thu)
by malmedal (subscriber, #56172)
[Link] (7 responses)
In theory, but the people who have tried, e.g. Sun with Niagara and Intel with Larrabee have so far failed...
Posted Aug 11, 2023 9:58 UTC (Fri)
by paulj (subscriber, #341)
[Link] (6 responses)
Larrabee failed, but... Intel tried to make that into a GPU competitor. And the amount of RAM was limited.
Posted Aug 11, 2023 17:06 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
It never worked well; garbage collection was slow because even the "parallel" GC in the Sun JVM was not quite parallel, and the sequential parts were causing huge delays because the single-threaded execution was super-slow.
Later, we tried to use Tilera CPUs (massively parallel CPUs with 32 cores) for networking software, and it ALSO failed miserably. Turns out that occasional serialized code just overwhelms everything. I still have a MikroTik Tilera-based router from that experiment, I'm using it for my home network.
Posted Aug 11, 2023 19:55 UTC (Fri)
by malmedal (subscriber, #56172)
[Link]
Especially annoying since I was not making these calls directly, they were from third-party libraries so it was practically impossible to figure out what could be safely run in parallel.
Posted Aug 14, 2023 8:48 UTC (Mon)
by paulj (subscriber, #341)
[Link] (2 responses)
Tilera, worked on software on that too. The people who architected that software had actually done a pretty good job of making sure the packet processing "hot" paths could all run independently, and each thread (1:1 to CPU core) had its own copy of the data required to process packets. Other, non-forwarding-path "offline" code would then in the background take the per-CPU packet data, process it, figure out what needed to be updated, and update each per-CPU hot-path/packer-processing data state accordingly. That worked very well.
The issue the shop I worked at had with Tilera was that it was unreliable. The hardware had weird lock up bugs. I figured out ways to increase the MTBF of these hard lock ups, by taking more care in programming the broadcom Phys attached to the chip (I think they were on ASIC, and part of the Tilera design - can't quite remember). But... MAU programming via I2C controllers shouldn't really be causing catastrophic lockups of the whole chip. We still had hard lock ups though - never fully figured them all out or work-arounds.
It seemed a 'fragile' and sensitive chip.
Posted Aug 14, 2023 9:20 UTC (Mon)
by paulj (subscriber, #341)
[Link]
Posted Aug 14, 2023 15:38 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
We found some strange lockups in glibc, something to do with pthreads and signals. We "solved" it by porting musl libc, at that time it was easier to do than figuring out how to build and debug glibc.
But yeah, lockups also happened.
Posted Aug 17, 2023 11:00 UTC (Thu)
by davidgerard (guest, #100304)
[Link]
Posted Aug 10, 2023 17:16 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
Most systems have a compute device in them, called a GPU, which is designed that way. For certain workloads, such as graphics rendering and machine learning, this is an amazing model, because there's a huge amount of parallelism to exploit (so-called "embarrassingly parallel" problems). For others, such as running a TCP/IP stack, it's not great, because much of the problem is serial, and you're better off pushing the problem to a CPU which is designed to run a single thread exceedingly fast.
Posted Aug 11, 2023 9:59 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Posted Aug 11, 2023 9:52 UTC (Fri)
by paulj (subscriber, #341)
[Link] (4 responses)
That machine got considerably more throughput on highly parallel web workloads as a result (as long as you didn't run a web app in a language that indiscriminately used floating-point, like PHP, cause they gave it one FPU to share between all cores!).
See link in another comment to a blog post with more details and references to a couple of really good papers - old, but still good reading.
Posted Aug 11, 2023 11:19 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (3 responses)
At one point I tried explaining to people, complete with benchmarks, why the Niagaras were not a good fit for a specific PHP application. It is quite difficult to convince people that the new expensive system they just bought will never work as well as the existing several years old servers it was supposed to replace.
Since the servers were bought and paid for I tried to find something useful for them to do, but did not really succeed.
Posted Aug 11, 2023 12:03 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Posted Aug 14, 2023 10:21 UTC (Mon)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Aug 15, 2023 4:47 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link]
Not true for Perl, integers and doubles use native types [1].
[1]: https://github.com/Perl/perl5/blob/79c6bd015ed156a95e3480...
Posted Aug 13, 2023 20:16 UTC (Sun)
by kleptog (subscriber, #1183)
[Link]
You're working on a VM so there is some overhead, but the result is that your application can scale linearly with the number of cores. A 256-core machine will support twice as many requests per second as a 128-core machine. It was built for telephony exchanges, and it shows. For stuff like WhatsApp where you're managing millions of TCP connections and messages, it really shines.
It's a functional language though with no per-process shared mutable state. It avoids a lot of GC overhead because most threads die before the first GC pass is run. You simply toss all the objects associated with a thread when it exits without checking liveness.
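A rough C analogy of that per-thread heap (a sketch of the general idea, not how BEAM actually implements it) is a bump-pointer arena that is freed wholesale when the task ends, with no per-object liveness tracking.

    /* Rough analogy in C for a per-process heap that is discarded wholesale:
     * a bump-pointer arena, freed in one go when the task ends, with no
     * per-object liveness tracking. */
    #include <stdlib.h>
    #include <stdio.h>
    #include <stddef.h>

    struct arena {
        char  *base;
        size_t size;
        size_t used;
    };

    static int arena_init(struct arena *a, size_t size)
    {
        a->base = malloc(size);
        a->size = size;
        a->used = 0;
        return a->base ? 0 : -1;
    }

    static void *arena_alloc(struct arena *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;           /* keep allocations aligned */
        if (a->used + n > a->size)
            return NULL;                      /* a real system would grow here */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_destroy(struct arena *a)
    {
        free(a->base);                        /* everything dies at once */
        a->base = NULL;
    }

    int main(void)
    {
        struct arena task_heap;

        if (arena_init(&task_heap, 64 * 1024) != 0)
            return 1;
        int *counters = arena_alloc(&task_heap, 100 * sizeof(int));
        char *scratch = arena_alloc(&task_heap, 4096);
        if (counters && scratch)
            printf("allocated %zu bytes from the task heap\n", task_heap.used);
        arena_destroy(&task_heap);            /* no liveness checks needed */
        return 0;
    }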
There is absolutely no way you could make the existing mass of JavaScript or C/C++ run in such a way. Maybe one day we will have AI systems smart enough to reformulate code in this way for us.
Posted Aug 10, 2023 22:29 UTC (Thu)
by jmspeex (subscriber, #51639)
[Link] (4 responses)
Posted Aug 10, 2023 22:32 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (3 responses)
The 486 only had pipeline stalls for resolving conditional branches; the Pentium used a branch predictor and very limited speculative execution of the predicted outcome, and the amount of speculative execution in Intel processors went up from there onwards to today's CPUs.
Posted Aug 11, 2023 8:13 UTC (Fri)
by excors (subscriber, #95769)
[Link] (2 responses)
> PF: Prefetch
and says:
> In EX all u-pipe instructions and all v-pipe instructions except conditional branches are verified for correct branch prediction. [...] The final stage is Writeback (WB) where instructions are enabled to modify processor state and complete execution. In this stage, v-pipe conditional branches are verified for correct branch prediction.
(where the two pipes are: "The u-pipe can execute all integer and floating-point instructions. The v-pipe can execute simple integer instructions and the FXCH floating-point instruction.")
and:
> The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of branch instructions which minimizes pipeline stalls due to prefetch delays. The Pentium processor accesses the BTB with the address of the instruction in the D1 stage. [...]
Apart from the quirk with v-pipe conditional branches, that sounds like all branch predictions are resolved by the EX stage. If the prediction made in D1 was wrong, then it doesn't EX the mispredicted instruction, it flushes the pipeline and starts again. There is speculative fetch and decode, but no speculative execution. Am I misinterpreting that, or using a different meaning of "speculative execution" or something?
(Speculative fetching sounds mostly harmless in relation to Spectre - it can't reveal any microarchitectural state except the contents of the BTB, in contrast to proper speculative execution where potentially-sensitive register contents are processed by the EX stages of potentially-bogus instructions and may be exposed through many microarchitectural side channels.)
Posted Aug 11, 2023 12:25 UTC (Fri)
by tao (subscriber, #17563)
[Link] (1 responses)
Posted Aug 11, 2023 14:11 UTC (Fri)
by excors (subscriber, #95769)
[Link]
Posted Aug 9, 2023 8:01 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
Cheers,
Posted Aug 10, 2023 7:33 UTC (Thu)
by tomsi (subscriber, #2306)
[Link]
Posted Aug 10, 2023 10:09 UTC (Thu)
by Aissen (subscriber, #59976)
[Link] (1 responses)
Posted Aug 10, 2023 16:08 UTC (Thu)
by deater (subscriber, #11746)
[Link]
famously there are fewer transistors in a 6502 than there are pages in the x86 documentation. It's actually possible for one person to know what each transistor in the 6502 is doing and audit it.
Posted Aug 9, 2023 2:07 UTC (Wed)
by dxin (guest, #136611)
[Link] (39 responses)
Posted Aug 9, 2023 2:19 UTC (Wed)
by willy (subscriber, #9762)
[Link] (7 responses)
Posted Aug 9, 2023 5:35 UTC (Wed)
by ibukanov (subscriber, #3942)
[Link]
Posted Aug 9, 2023 9:16 UTC (Wed)
by paulj (subscriber, #341)
[Link] (5 responses)
There is a good argument to be made that the increasing transistor count budgets could be better spent on adding more, simple, compute elements ("cores") rather than adding ever more complex speculative execution logic to ever more complex compute elements. That this would be more efficient overall.
I.e., rather than trying to make 1 (or a very small) number of parallel paths of execution very fast with speculative execution, we should just provide many more paths of execution with simpler cores. The simpler cores might each have to stall more waiting on memory latency, but if you have many of them you can get more throughput - they will not waste cycles or energy on misplaced speculative execution.
These are not new ideas, they go back a long way, and we're slowly going down that path it seems. GPUs are kind of part of that vision, CPUs have gone many-core, but still with very complex speculative logic to fulfil desire for good single-thread benchmark results. Old blog of mine, but the references are still good to read: https://paul.jakma.org/2009/12/07/thread-level-parallelis...
Posted Aug 10, 2023 16:05 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link] (4 responses)
Posted Aug 10, 2023 17:19 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (2 responses)
You can't, easily. Much of the parallelism limit is inherent to the way we perceive the problem domain, and it's simply not possible to have more parallelism without radical new understandings of the problems we're trying to solve.
Some problems, such as graphics rendering and neural network modelling, do have a higher inherent parallelism, and we have an alternative type of processor, called a GPU for historical reasons, which is designed to be faster than a CPU on problems with lots of parallelism; it achieves this by sacrificing single thread performance in favour of running a large number of concurrent threads, complete with hardware support for launching a very large number of threads and multiplexing them onto a smaller number of executing threads.
Posted Aug 10, 2023 22:01 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link] (1 responses)
Posted Aug 10, 2023 22:08 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
They're no more terrible at non-uniform control flow than CPUs are - in the worst case, you just use one SIMD lane per GPU core, get a much lower throughput, but still have the large number of threads. It's just that we look at GPUs differently to CPUs, so we see the slowdown from using only one SIMD lane as a big deal on a GPU, but we don't see it as a big deal that we only use scalar instructions on CPU cores with the ability to process 8 (AVX2) or 16 (AVX-512) 32-bit values in parallel, despite the fact that this is the same class of slowdown.
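To make the unused-lanes comparison concrete, here is a small sketch using AVX2 intrinsics (it needs a CPU and compiler flags that support them, e.g. gcc -O2 -mavx2): the scalar loop uses one 32-bit lane's worth of the datapath per add, while the intrinsic version adds eight values at once.

    /* Scalar vs AVX2: the same 8-element add done one lane at a time and
     * eight lanes at a time. Build with e.g. gcc -O2 -mavx2. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        int b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        int scalar_out[8], simd_out[8];

        /* Scalar: one 32-bit add per instruction, the other lanes idle. */
        for (int i = 0; i < 8; i++)
            scalar_out[i] = a[i] + b[i];

        /* AVX2: all eight 32-bit lanes in one add. */
        __m256i va = _mm256_loadu_si256((const __m256i *)a);
        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
        __m256i vc = _mm256_add_epi32(va, vb);
        _mm256_storeu_si256((__m256i *)simd_out, vc);

        for (int i = 0; i < 8; i++)
            printf("%d %d\n", scalar_out[i], simd_out[i]);
        return 0;
    }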
Posted Aug 11, 2023 9:48 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Making efficient use of compute resources, in a world where the codes you want to run have limited parallelism? Run many different codes together on the same compute elements, and switch between them to keep memory bandwidth and compute occupied. No single code will run faster, but at least you maintain throughput in the aggregate.
This is kind of where computers have gone anyway. From your phone, to your desktop, to servers running containers running jobs in the cloud - they've all got many many dozens of jobs to run at any given time. If one stalls, switch to another.
Posted Aug 9, 2023 8:13 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (30 responses)
At roughly 1ft/ns, this means your typical ATX mobo cannot operate faster than 500MHz. Knock a nought off that, to give a 3cm chip, and you've stuck a nought on your chip speed, 5GHz. Careful placement of components will nudge that speed up, but if components need to communicate "across chip", you're stuffed ...
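The arithmetic behind those figures, taking the 1ft/ns free-space number at face value and assuming (as the numbers above imply) that a signal has to make a round trip within one clock cycle:

    /* Back-of-envelope version of the argument above: if a signal has to
     * cross the structure and back within one clock cycle,
     * f_max = speed / (2 * distance). Uses the ~1 ft/ns (~0.3 m/ns)
     * free-space figure; real on-chip wires are considerably slower. */
    #include <stdio.h>

    static double max_clock_ghz(double speed_m_per_ns, double distance_m)
    {
        /* Require a round trip (signal out and response back) per cycle. */
        return speed_m_per_ns / (2.0 * distance_m);   /* cycles per ns == GHz */
    }

    int main(void)
    {
        const double c_m_per_ns = 0.3;        /* roughly 1 foot per nanosecond */

        printf("ATX board (~0.3 m): %.1f GHz\n", max_clock_ghz(c_m_per_ns, 0.30));
        printf("3 cm chip         : %.1f GHz\n", max_clock_ghz(c_m_per_ns, 0.03));
        return 0;
    }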
Cheers,
Posted Aug 9, 2023 9:16 UTC (Wed)
by joib (subscriber, #8541)
[Link]
Crays were about the same size as other mainframe-sized computers of the day. Going much bigger wasn't really useful, because neither software nor hardware at the time was ready for massive parallelism. Today it is, and thus we have warehouse-sized supercomputers that can run (some, obviously not all) HPC-style problems utilizing all that parallelism.
> At roughly 1ft/ns, this means your typical ATX mobo cannot operate faster than 500MHz. Knock a nought off that, to give a 3cm chip, and you've stuck a nought on your chip speed, 5GHz. Careful placement of components will nudge that speed up, but if components need to communicate "across chip", you're stuffed ...
That matters insofar as you require everything to be synchronous, with a signal traversing from across a wire within one clock cycle. The existence of CPU cores within your CPU running at different frequencies, not to mention long distance high speed network transmission, suggests that it's possible to design things without such synchronicity requirements.
Posted Aug 9, 2023 9:57 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (6 responses)
No, chip designers face many problems, but that one is solved. They are making complicated networks, called clock-trees, or meshes, that ensure each clock edge arrives at all components in the clock-domain simultaneously.
The problems they can't seem to solve include power and the fact that wires are getting slower faster than they get shorter in recent processes.
Posted Aug 9, 2023 11:53 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (5 responses)
That's the wrong problem! Making sure all the signals arrive together is easy. The problem is that if your edges do not share a common "light cone", you're stuffed!
Increasing the frequency reduces the size of the light cone. If the light cone is to encompass the entire chip, then the upper limit on frequency is 5GHz. As others have said, if you're only interested in communicating internal to a single core, then LOCALLY you can increase the frequency further, because everything you're interested in fits into a smaller light cone.
You can have all the fancy clock-trees you like, but if your components are that far apart that you physically require faster-than-light information transfer, you're stuffed.
Cheers,
Posted Aug 9, 2023 12:34 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (3 responses)
The actual speed was about 1mm per nanosecond over ten years ago. It is probably less than 0.05mm/ns now.
Posted Aug 9, 2023 13:18 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
Cheers,
Posted Aug 9, 2023 13:36 UTC (Wed)
by malmedal (subscriber, #56172)
[Link]
Posted Aug 9, 2023 14:37 UTC (Wed)
by joib (subscriber, #8541)
[Link]
What changes then is that for these very small conductors you'll find on modern deep submicron integrated circuits, the relative capacitance of the conductor starts to rise (and the resistance doesn't go down as well as you'd like either). This leads to a phenomenon where when you apply a voltage on one end of the conductor, it takes longer until the voltage/current rises enough on the other end to be registered as a 0->1 flip. So in effect it appears as if the speed of signal propagation drops. I'm not sure how well the telegrapher's equations mentioned by malmedal in the sibling post applies to multi-GHz signals propagating in these very narrow conductors, but something like that is the gist of it. I don't think you need to apply quantum mechanics or study the behavior of individual electrons per se to understand this phenomena.
Posted Aug 9, 2023 22:41 UTC (Wed)
by magnus (subscriber, #34778)
[Link]
Posted Aug 9, 2023 19:01 UTC (Wed)
by flussence (guest, #85566)
[Link]
We're already seeing that in RAM speeds where there's a tug-of-war over having it on-CPU/off-CPU, each new DDR version has a huge jump in latency that needs to be papered over with more cache, DIMMs on large boards need buffer chips, DDR5 (iirc) now *requires* ECC to survive normal operation…
I wouldn't be surprised if a few years from now we start seeing hard NUMA become mainstream. Back to the days of having an empty slot close to the CPU because the manufacturer was too cheap to populate it!
Posted Aug 9, 2023 19:45 UTC (Wed)
by willy (subscriber, #9762)
[Link] (20 responses)
Posted Aug 9, 2023 20:58 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (19 responses)
Cheers,
Posted Aug 9, 2023 21:40 UTC (Wed)
by willy (subscriber, #9762)
[Link] (4 responses)
The first mistaken assumption is that things need to happen in a single cycle. An instruction that needs data from the L3 cache can and will stall for hundreds of cycles. During that time the CPU will execute some of the other dozens of instructions that it has ready. It's something like six clock ticks to retrieve data from L1. Data in registers is ready to operate on and incurs no delay.
The second is that the speed of communication between different parts of the CPU has anything to do with the speed of light. The speed of electrons in copper is much slower. That's the part other people are telling you that you have wrong.
(There are other problems with your argument, but those are the big two)
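For a sense of scale, a toy average-memory-access-time calculation along the lines of the latencies mentioned above. All hit rates and cycle counts are assumed, illustrative values, and a real out-of-order core overlaps much of this latency with independent instructions rather than simply waiting.

# Expected latency of one load, walking down an assumed cache hierarchy.
# Each level's latency is treated as the full cost when the access hits there.
levels = [  # (name, hit rate among accesses reaching this level, latency in cycles)
    ("L1", 0.95, 5),
    ("L2", 0.80, 15),
    ("L3", 0.70, 50),
    ("DRAM", 1.00, 300),
]

def amat_cycles(levels):
    total, reach_prob = 0.0, 1.0
    for name, hit, lat in levels:
        total += reach_prob * hit * lat   # cost if it hits at this level
        reach_prob *= (1.0 - hit)         # probability of going further down
    return total

print(f"Average load latency: {amat_cycles(levels):.1f} cycles")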
Posted Aug 9, 2023 21:59 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (3 responses)
Note that the speed of electrons is irrelevant; the voltage change that represents a change in state moves much faster than the electrons do, typically at around 60% to 70% of the speed of light in a copper conductor.
But the point about things not needing to happen in a single cycle is key; I can design my logic to account for propagation delays in the circuit, and have it work perfectly. This is what the timing diagrams that are part of any digital logic chip datasheet (and in every CPU datasheet since the 4004) are all about - how do I connect up the entire system's worth of logic such that the system's timing constraints are met?
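A minimal sketch of the kind of register-to-register timing check being described here, with made-up delay numbers; the function name and figures are illustrative only.

# A path meets timing if the clock period covers the launching flop's
# clock-to-Q delay, the combinational/wire propagation delay, and the setup
# time of the capturing flop (clock skew folded in as a single term).
def path_meets_timing(clk_period_ns, t_clk_to_q_ns, t_prop_ns, t_setup_ns,
                      clock_skew_ns=0.0):
    return t_clk_to_q_ns + t_prop_ns + t_setup_ns <= clk_period_ns + clock_skew_ns

clk_period = 1.0 / 3.0  # ~3 GHz clock, period in ns
print(path_meets_timing(clk_period, t_clk_to_q_ns=0.05, t_prop_ns=0.20,
                        t_setup_ns=0.04))   # True: short path fits in one cycle
print(path_meets_timing(clk_period, t_clk_to_q_ns=0.05, t_prop_ns=0.60,
                        t_setup_ns=0.04))   # False: needs an extra pipeline stage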
Posted Aug 10, 2023 7:59 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (2 responses)
Signals are carried by photons (or em waves, same(ish) thing) so the speed of light IS relevant, although from what others have said the telegraph effect is probably more important.
My argument has repeatedly been prefixed with "IF components need to communicate" so okay, I'm not necessarily talking about clock cycles, but a single communication cycle has that upper limit. I'm not always clear in what I say, I know that, but if you make no attempt to understand me, I can't understand you either. So IFF a communication cycle equals a clock cycle, 5GHz is the maximum clock possible between two random components in a chip. Of course, splitting a communication clock cycle into multiple clock cycles can speed OTHER stuff up, but it makes no difference to the speed at which a signal travels across a chip.
(And of course, without communication a chip can't work.)
Cheers,
Posted Aug 10, 2023 15:03 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
A corollary of your argument is that Starlink satellites (communication clock rate of around 230 kHz) can be no higher than 1.3 km above the receiver, and Sky TV satellites (communication clock rate of 22 MHz or above) can be no higher than 13 metres above the receiver.
Posted Aug 10, 2023 16:23 UTC (Thu)
by malmedal (subscriber, #56172)
[Link]
However, you seem to be unable to understand what people are saying; please read more carefully.
For instance, there is no "telegraph effect"; the "telegrapher's equations" are just Maxwell's equations applied to signals in a wire.
If you wish to be able to say anything intelligible about chips you need to understand what "pipelines" are in this context. This appears to be a major gap in your knowledge; you completely ignore it when people bring this up. It is not just a word, it is one of the fundamental concepts.
Already the 8088 had a pipeline; it is not a new concept.
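A tiny sketch of what pipelining buys in this context, under the assumption that a signal needs (say) 2ns to cross the die while the clock runs at 5GHz; the numbers are invented.

import math

# Cut a long path into register stages so that no single stage exceeds one
# clock period.  The crossing then takes several cycles of latency, but a new
# value can still be launched on every cycle (register overheads ignored).
def stages_needed(total_path_delay_ns: float, clk_period_ns: float) -> int:
    return max(1, math.ceil(total_path_delay_ns / clk_period_ns))

total_delay = 2.0        # ns to cross the die (assumed)
clk_period = 1.0 / 5.0   # 5 GHz clock
n = stages_needed(total_delay, clk_period)
print(f"{n} stages: {n} cycles of latency, still one transfer per cycle")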
Posted Aug 9, 2023 21:57 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (13 responses)
Posted Aug 9, 2023 22:04 UTC (Wed)
by willy (subscriber, #9762)
[Link] (11 responses)
Posted Aug 9, 2023 22:21 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
And another way to sanity check things: if the size of space between communication endpoints limited your processing rate, we'd probably still be waiting for the first (quality) images from various Mars rovers.
At least I think that's somewhat closer than what Wol has as a model.
Posted Aug 10, 2023 8:00 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
Cheers,
Posted Aug 10, 2023 20:28 UTC (Thu)
by rschroev (subscriber, #4164)
[Link]
Posted Aug 9, 2023 22:23 UTC (Wed)
by malmedal (subscriber, #56172)
[Link]
Posted Aug 10, 2023 10:49 UTC (Thu)
by james (subscriber, #1325)
[Link] (6 responses)
This is the key difference between PCIe 5.0 (which used NRZ, or one bit per cycle) and PCIe 6.0. Both run at 32 billion signals per second: it's just with PCIe 6.0 each signal conveys two bits.
Your main point is correct, though -- this isn't what limits the length of a PCIe 6.0 connection.
Posted Aug 10, 2023 15:39 UTC (Thu)
by kpfleming (subscriber, #23250)
[Link] (5 responses)
Posted Aug 10, 2023 16:44 UTC (Thu)
by malmedal (subscriber, #56172)
[Link] (4 responses)
No. Each lane separately transmits 64 gigabits per second.
Standard terminology is 64 GT/s and 32 GHz.
Posted Aug 23, 2023 5:28 UTC (Wed)
by JosephBao91 (subscriber, #157211)
[Link] (3 responses)
Posted Aug 23, 2023 11:34 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (2 responses)
Posted Aug 23, 2023 12:13 UTC (Wed)
by excors (subscriber, #95769)
[Link] (1 responses)
For example https://blog.samtec.com/post/why-did-pcie-6-0-adopt-pam4-... describes the Nyquist frequency of PCIe 5.0/6.0 as 16GHz. (The sampling rate is also the same in both, the difference is that in 6.0 each sample encodes 2 bits, so it's 16GHz Nyquist frequency with 32GHz sampling rate and 64GT/s data rate.)
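The same relationship as a small worked calculation (symbol rate, bits per symbol, data rate and Nyquist frequency), using the figures quoted in this subthread; NRZ is treated as 2 signal levels and PAM4 as 4.

import math

def link_numbers(symbol_rate_gbaud: float, levels: int):
    bits_per_symbol = math.log2(levels)
    data_rate_gt = symbol_rate_gbaud * bits_per_symbol  # GT/s per lane
    nyquist_ghz = symbol_rate_gbaud / 2.0               # highest fundamental tone
    return data_rate_gt, nyquist_ghz

for name, baud, levels in [("PCIe 5.0 (NRZ)", 32.0, 2),
                           ("PCIe 6.0 (PAM4)", 32.0, 4)]:
    gt, nyq = link_numbers(baud, levels)
    print(f"{name}: {gt:.0f} GT/s per lane, Nyquist ~{nyq:.0f} GHz")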
Posted Aug 23, 2023 13:14 UTC (Wed)
by malmedal (subscriber, #56172)
[Link]
Posted Aug 9, 2023 22:26 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
For a very clear example, a geostationary TV satellite is typically transmitting at 22 MHz or higher symbol rates; if the signal has to propagate all the way from the satellite to the receiver before the satellite can start the next symbol, then geostationary orbit has to be no higher than 14 meters above the satellite dish. In practice, everything is designed to handle this delay, and thus it's fine.
If you insist on two-way communication, Starlink's signal has been partially reverse engineered, and has a symbol time of 4.4 µs; this corresponds to a 1.3 km path length in free space. And yet, a Starlink satellite is around 550 km above the Earth's surface, for a propagation delay of around 1,800 µs - significantly more than the symbol time.
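The arithmetic behind those satellite numbers, as a quick sketch:

# If a link really had to wait for each symbol to reach the receiver before
# sending the next, the maximum path length would be one symbol time of light
# travel.  Compare that with the actual one-way delay from a 550 km orbit.
C = 299_792_458.0  # m/s

def one_symbol_path_m(symbol_rate_hz: float) -> float:
    return C / symbol_rate_hz

print(f"22 MHz TV symbol rate  : {one_symbol_path_m(22e6):.1f} m per symbol")
print(f"Starlink, 4.4 us symbol: {C * 4.4e-6 / 1e3:.2f} km per symbol")
print(f"550 km orbit           : {550e3 / C * 1e6:.0f} us one-way delay")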
Posted Aug 9, 2023 15:27 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (3 responses)
All of these vulnerabilities exist because we have shared state in the hardware between two threads with different access rights; how far away is a world where we can afford to let some CPU cores be near-idle so that threads with different access rights don't share state?
In theory, the CPU designers could fix this by tagging state so that different threads (identified to the CPU by PCID and privilege level) don't share state, and by making sure that the partitioning of state tables between different threads changes slowly.
And also in theory, we could fix this in software by hard-flushing all core state at the beginning and end of each context switch that changes access rights (including user mode to kernel mode and back). However, this sort of state flushing is expensive on modern CPUs, because of the sheer quantity of state (branch predictors, caches, store buffers, load queues, and more).
Which leaves just isolation as the fix for high-performance systems; with enough CPU cores, you can afford the expensive state flush when a core switches access rights, and you can use message passing (e.g. io_uring) to ask a different core to do operations on your behalf.
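As a very rough illustration of the partitioning idea (not of the state flushing, which only the kernel and hardware can provide), here is a Linux-only sketch that pins work from two trust domains onto disjoint cores. The core sets and helper names are made up for the example, and a real deployment would also have to account for SMT siblings and shared caches such as L2/L3.

import os

TRUSTED_CORES   = {0, 1}   # assumed core numbers, purely illustrative
UNTRUSTED_CORES = {2, 3}

def run_in_domain(cores, fn, *args):
    """Fork a child, restrict it to the given cores, run fn, return exit code."""
    pid = os.fork()
    if pid == 0:                           # child process
        os.sched_setaffinity(0, cores)     # only these cores may run us
        fn(*args)
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

if __name__ == "__main__":
    run_in_domain(TRUSTED_CORES,   print, "trusted work on cores", sorted(TRUSTED_CORES))
    run_in_domain(UNTRUSTED_CORES, print, "untrusted work on cores", sorted(UNTRUSTED_CORES))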
Posted Aug 9, 2023 16:18 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
Aren't we there already? If all your cores run at full power your chip will fry in seconds?
Just allocate one job per core and let the chip allocate power to whatever job is ready to run.
Cheers,
Posted Aug 9, 2023 17:39 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
My laptop has over 500 tasks running, many of them for only short periods before going to sleep. My phone has similarly large numbers of tasks.
We don't yet have thousands of cores, so we can't simply assign each task to a core; we thus need to work out how to avoid having (e.g.) kernel threads and user threads sharing the same state. And note that because some state is outside the core (L2 cache, for example), it's not just a case of "don't share cores - neither sequentially nor concurrently" - depending on the paranoia level, you might want to reduce shared state further than that.
Posted Aug 9, 2023 21:23 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link]
Posted Aug 9, 2023 17:01 UTC (Wed)
by eharris (guest, #144549)
[Link] (1 responses)
Posted Aug 9, 2023 17:28 UTC (Wed)
by mb (subscriber, #50428)
[Link]
And also pre-fetching so that the slow memory access can happen while the CPU is doing other calculations.
Posted Aug 10, 2023 3:25 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
Of course that was a lie or at least a misconception that would conflict with all hopes for optimizations later. Caches are incompatible with confidentiality, yet they're absolutely mandatory with nowadays CPU frequencies. Busses are too small for the large number of cores and cause arbitration allowing to infer other cores' activities. The wide execution units in our CPUs are mostly idle, making SMT really useful but disclosing even more fine-grained activities, to the point that no more progress is being made in that direction (what CPU vendor does 4-SMT or 8-SMT, maybe only IBM's Power ?).
Meanwhile, the vast majority of us are using a laptop that we don't share with anyone and we all run commands using "sudo", most of the time not even having to re-type a password, because it's *our* machine, and we don't care about the loss of confidentiality there. And the huge number of users of cloud-based hosting shows that tiny dedicated systems definitely have a use case, so full machines of different sizes could be sold to customers, with zero sharing on them either.
Browsers are the only enemies on local machines and they could be placed into an isolation sandbox that runs in real-time mode and flushes caches and TLBs before being switched in. They would not be that much slower nor heavier anyway; they're already the most horrible piece of software ever created by humanity: software that takes gigs of RAM to start and does not even print "hello world" by default, doing nothing at all until connected to a site, so we could definitely afford to see them even slower.
With such mostly dedicated hardware approach, we could get back to using our *own* hardware at full speed and the way we want. We've entered an era where computers are getting slower over time only due to all mitigations for conceptual security trouble that most of us do not care about and that result in sacrificing performance.
Posted Aug 10, 2023 5:19 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (2 responses)
Posted Aug 10, 2023 5:56 UTC (Thu)
by dxin (guest, #136611)
[Link] (1 responses)
Posted Aug 10, 2023 18:02 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link]
Actually speculative execution works so well because it turns out that a lot of execution is very predictable. It's so predictable that the branch predictor has an accuracy of ~99% (depending on the application). This means that the instruction fetcher can fetch ahead for hundreds of instructions, and the OoO execution engine can execute these instructions ahead in an order determined by the data dependencies, rather than by program order. This allows modern CPUs to complete (i.e., execute) several instructions per cycle.
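A toy illustration of why even simple dynamic prediction does so well: a single 2-bit saturating counter already reaches ~99% on a typical loop branch (real predictors use global/local history and many counters, and do considerably better than this).

# One 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
def simulate_two_bit(outcomes):
    state, correct = 3, 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += (predicted_taken == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken 99 times, then not-taken once, repeated.
trace = ([True] * 99 + [False]) * 100
print(f"prediction accuracy: {simulate_two_bit(trace):.1%}")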
Compilers (AOT and JIT) can have a go at predicting control flow using PGO
Profile-based static prediction has ~10% mispredictions, while modern history-based hardware branch prediction has about 1% mispredictions (for real numbers check the research literature, but the tendency is in that direction; and it's actually hard to compare the research, because static branch prediction research stopped about 30 years ago).
What "C memory model specification" do you mean?
C has not killed IA-64, out-of-order (OoO) execution CPUs have outcompeted it.
It's an interesting theoretical exercise to consider what would have happened if Meltdown and Spectre had been discovered sometime around 2000. Presumably the software workaround for Meltdown would have had to have looked like Red Hat's 4G/4G split, which could: "cause a typical measurable wall-clock overhead from 0% to 30%, for typical application workloads (DB workload, networking workload, etc.). Isolated microbenchmarks can show a bigger slowdown as well - due to the syscall latency increase."
That would have made a big difference to the perceived advantages of Itanium.
AFAIK a mitigation for Meltdown was indeed to not share the address space between kernel and user space, leading to TLB flushes on system calls. Intel fixed Meltdown relatively quickly in hardware, and AMD hardware has not been vulnerable to Meltdown AFAIK.
What makes you think that this would mean "isn't allowed to learn"? The fact that architectural state is not changed on a misprediction does not mean that architectural state is immutable, either.
If the prediction is wrong, you throw away the speculative nonsense (and thus avoid Inception), but you record that the prediction was wrong. I had not written that earlier, sorry.
A more revealing graph (but for the earlier A55 vs. A75 (vs. Exynos M4)) shows Perf/W. And it shows that the in-order A55 is better in Perf/W than the OoO A75 only at its very lowest performance. As soon as you need a little more, the A75 is more power-efficient.
> D1: Instruction Decode
> D2: Address Generate
> EX: Execute - ALU and Cache Access
> WB: Writeback
>
> A mispredicted branch (whether a BTB hit or miss) or a correctly predicted branch with the wrong target address will cause the pipelines to be flushed and the correct target to be fetched. Incorrectly predicted unconditional branches will incur an additional three clock delay, incorrectly predicted conditional branches in the u-pipe will incur an additional three clock delay, and incorrectly predicted conditional branches in the v-pipe will incur an additional four clock delay.
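A rough cost model for what such penalties mean for throughput; the branch frequency, base CPI and the "modern" line are assumptions for illustration only, not measured figures.

# Effective cycles-per-instruction given a branch misprediction rate and the
# pipeline-flush penalty paid on each misprediction.
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

base, branches = 1.0, 0.20   # assumed: 1.0 CPI baseline, 20% of instructions are branches
for label, rate, penalty in [("~10% mispredictions, 4-cycle flush ", 0.10, 4),
                             ("~1% mispredictions, 15-cycle flush ", 0.01, 15)]:
    print(f"{label}: CPI ~{effective_cpi(base, branches, rate, penalty):.2f}")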
Of course this does not matter on a single-user computer that does not run arbitrary untrusted code from the Internet.
But they actually guessed it right: things get really sketchy after 1000x.
We have scaled really well on the number of cores, but barely made it past 10x clock speed, and on the way to 10x IPC we shot ourselves in the foot at every single step.
To further emphasize this point, the speed of a PCIe gen 6 link is now 64GHz.
I'm pretty sure this isn't technically correct, at least when talking about how far the signal propagates before the next signal is generated. PCIe 6.0 uses PAM4 (Pulse Amplitude Modulation with 4 Levels) [...] a multilevel signal modulation format used to transmit data. [...] It packs two bits of information into the same amount of time on a serial channel. The utilization of PAM4 allows the PCIe 6.0 specification to reach 64 GT/s data rate and up to 256 GB/s bidirectional bandwidth via a x16 configuration.
It's basically the same concept as MLC versus SLC in flash.
PCIe Gen5 is 32 GT/s with a frequency of 16 GHz (it transfers data on both the rising and falling edges). Gen6 uses PAM4 instead of NRZ, so it transfers 2 bits per symbol; the frequency is still 16 GHz, but the speed is 64 GT/s.
And for hardware design, PAM4 at 16 GHz is more difficult than NRZ at 16 GHz.
...because I thought that modern multi-core chips need to talk to SHARED main memory housed across the bus in completely separate cards.....so maybe 10 centimeters away from the CPU chip.
*
Someone here can no doubt clear up my confusion.
Technically it's "fine-grained multithreading", like most GPUs. Switching threads each cycle in round-robin style makes the wall-clock time of each cycle much longer from an individual thread's point of view, so pipeline delays effectively don't exist.
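A toy model of that: with N hardware threads issued round-robin, a given thread only gets an issue slot every N cycles, so any instruction latency up to N cycles is invisible to it. The latency figure below is an assumption for illustration.

# A thread in a barrel-style design stalls only if its previous result is not
# ready by the time its next issue slot comes around.
def thread_sees_stall(latency_cycles: int, n_threads: int) -> bool:
    return latency_cycles > n_threads

for n in (1, 4, 8):
    seen = "stalls" if thread_sees_stall(5, n) else "no stall visible to the thread"
    print(f"{n} thread(s), 5-cycle latency: {seen}")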