Another round of speculative-execution vulnerabilities
The Downfall attack targets a critical weakness found in billions of modern processors used in personal and cloud computers. This vulnerability, identified as CVE-2022-40982, enables a user to access and steal data from other users who share the same computer. For instance, a malicious app obtained from an app store could use the Downfall attack to steal sensitive information like passwords, encryption keys, and private data such as banking details, personal emails, and messages. Similarly, in cloud computing environments, a malicious customer could exploit the Downfall vulnerability to steal data and credentials from other customers who share the same cloud computer.
A series of patches has landed in the mainline kernel, including one for the gather data sampling
mitigation and one to disable the AVX
extension on CPUs where a microcode mitigation is not available.
"This is a *big* hammer. It is known to break buggy userspace that
uses incomplete, buggy AVX enumeration.
"
Not to be left out, AMD processors suffer from a return-stack overflow
vulnerability, again exploitable via speculative execution; this patch, also just
merged, describes the problem and its mitigation.
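For anyone wanting to check a given system, the kernel reports the status of these mitigations under /sys/devices/system/cpu/vulnerabilities/; on kernels carrying the patches there are gather_data_sampling and spec_rstack_overflow entries. A minimal C sketch that simply reads those sysfs files:

    /* Print the kernel's reported mitigation status for the two new
     * vulnerabilities; the sysfs files only exist on kernels that carry
     * the corresponding patches. */
    #include <stdio.h>

    static void print_status(const char *name)
    {
        char path[256], line[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/vulnerabilities/%s", name);
        f = fopen(path, "r");
        if (!f) {
            printf("%s: no entry (kernel predates the mitigation?)\n", name);
            return;
        }
        if (fgets(line, sizeof(line), f))
            printf("%s: %s", name, line);   /* line already ends in '\n' */
        fclose(f);
    }

    int main(void)
    {
        print_status("gather_data_sampling");    /* Downfall / GDS */
        print_status("spec_rstack_overflow");    /* AMD Inception / SRSO */
        return 0;
    }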
Posted Aug 8, 2023 18:09 UTC (Tue)
by joib (subscriber, #8541)
[Link]
Posted Aug 8, 2023 23:50 UTC (Tue)
by motk (subscriber, #51120)
[Link] (93 responses)
Posted Aug 9, 2023 1:24 UTC (Wed)
by kazer (subscriber, #134462)
[Link] (3 responses)
Posted Aug 9, 2023 1:51 UTC (Wed)
by motk (subscriber, #51120)
[Link] (2 responses)
Posted Aug 9, 2023 18:42 UTC (Wed)
by dezgeg (subscriber, #92243)
[Link] (1 responses)
Posted Aug 10, 2023 0:43 UTC (Thu)
by motk (subscriber, #51120)
[Link]
Posted Aug 9, 2023 2:05 UTC (Wed)
by willy (subscriber, #9762)
[Link] (84 responses)
You wouldn't be happy with a non-speculative CPU in your phone, let alone your laptop, desktop or server.
Posted Aug 9, 2023 2:14 UTC (Wed)
by motk (subscriber, #51120)
[Link]
Posted Aug 9, 2023 5:34 UTC (Wed)
by flussence (guest, #85566)
[Link] (47 responses)
And maybe I'm not happy with having to wait a few seconds to switch apps, or the fact that Firefox no longer works on the phone because the CEO fired all the engineers to buy a 10th mansion, but I know I wouldn't be any happier buying into the "10 copies of chrome and they all want to infantilise you and pick your pocket" way of life. Everyone there seems to be miserable.
Posted Aug 9, 2023 8:07 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (45 responses)
From what I can make out, modern CPUs are "C language execution machines", and C is written to take advantage of all these features with optimisation code up the wazoo.
Get rid of all this optimisation code, get rid of all this speculative silicon, start from scratch with sane languages and chips, ...
Sorry to get on the database bandwagon again, but I would love to go back 20 years, when I worked with databases that had snappy response times on systems with the hard disk going nineteen to the dozen. Yes the programmer actually has to THINK about their database design, but the result is a database that can start spewing results the instant the programmer SENDS the query, and a database that can FINISH the query faster than an RDBMS can optimise it ...
Cheers,
Posted Aug 9, 2023 8:49 UTC (Wed)
by motk (subscriber, #51120)
[Link]
Posted Aug 9, 2023 9:05 UTC (Wed)
by eduperez (guest, #11232)
[Link] (15 responses)
However, those times have long since passed, and there is no use in trying to bring them back. Except for some very specific use cases, it is way cheaper to buy a faster machine than to spend hours upon hours optimizing the code; all that counts is the "return on investment".
You just cannot keep the optimization and attention-to-detail levels of the past with the development speed and costs required by the modern world.
Posted Aug 9, 2023 13:33 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (7 responses)
Posted Aug 9, 2023 13:59 UTC (Wed)
by yodermk (subscriber, #3803)
[Link] (5 responses)
Main drawback is upgrades would require at least a bit of downtime. But, done right, it would be quite brief. The in-process caches would need to warm, though. The other drawback is the absolute need to be sure that no part of the system can crash under any circumstances. But Rust goes a long way in helping you there.
I'm learning Axum (a backend framework for Rust) and hope to be able to implement something like this someday.
Posted Aug 24, 2023 6:13 UTC (Thu)
by ssmith32 (subscriber, #72404)
[Link] (4 responses)
But if you keep the services simple - why bother with all the abstraction? Give them full control, and make them fast.
The real troublemaker is not microservices or distributed systems - it's hosting providers wanting to resell the same time on the same hardware over and over again.
Posted Aug 24, 2023 22:32 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Aug 25, 2023 9:58 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (2 responses)
Couple of questions:
Posted Aug 25, 2023 19:27 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
> Fixed performance instances, in which CPU and memory resources are pre-allocated and dedicated to a virtualized instance for the lifetime of that instance on the host
FWIW, this design has been used from the very beginning. Even with the old Xen-based hypervisor, there was very little sharing of resources between customers. AWS engineers anticipated that the hardware might have issues allowing the state to be leaked between domains, so they tried to minimize the possible impact.
> How is the "each CPU core can only be used by one customer" enforced? Is it just relying on the kernel rarely migrating actively used vCPU threads between hardware threads, or is there scheduler affinity etc applied to enforce it?
CPUs are allocated completely statically to VMs. The current Nitro Hypervisor is extremely simplistic, and it is not capable of sharing CPUs between VMs.
Posted Aug 25, 2023 19:33 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
Thanks for the link - it answers my question in full, and makes it clear that this is something that's architected into AWS. And yes, AWS documentation is a mess - it looks like I didn't find it because I wasn't looking at AWS whitepapers, but at EC2 documentation.
I had hoped that it worked the way you describe, because nothing else would meet my assumptions about how security on this would work, but I have had enough experience to know that when security is involved, hoping that people make the same assumptions as I do is a bad idea - better to see my assumptions called out in documentation, because then there's a very high chance that Amazon trains new engineers to make this set of assumptions.
Posted Aug 11, 2023 0:04 UTC (Fri)
by khim (subscriber, #9252)
[Link]
That one is easy. There's just no one left who may care. Everyone is trying to solve their own tiny, insignificant task. And the fact that all these non-solutions to non-problems, when combined, create something awful… who would even notice that, let alone fix it? Testers? They are happy if they have time to look at the pile of pointless non-problems in the bugtracker! Users? They are not the ones who pay for the software nowadays. Advertisers do that and they couldn't care less about what users experience.
Posted Aug 9, 2023 16:09 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (4 responses)
Which language has the motto "if you make the right thing to do, the easy way to do it, then people will do the right thing by default".
Going back to one of my favourite war stories, where the consultants spent SIX MONTHS optimising an Oracle query so it could outperform the Pick system it was replacing. I'm prepared to bet that Pick query probably took about TEN MINUTES to write. (And the Oracle system, a twin Xeon 800, was probably 20 times more powerful than the Pentium 90 it was replacing!)
Pick "tables" are invariably 3rd or 4th normal form, because that's just the obvious, easy way to do it. Sure, you have to specify all your indices, but if you put an index on every foreign key, you've pretty much got everything of any importance - a simple rote rule that covers 99% of cases. (And no different from relational, you tell Pick it's (probably) a foreign key by indexing it, you tell an RDBMS to index it by telling it it's a foreign key. A distinction without a difference.)
Oh - and if the modern world requires horribly inflated development speeds and costs, that's their hard cheese. With your typical RDBMS project coming in massively over time and budget, surely going back to a system where the right thing is the obvious thing will massively improve those stats! Most of my time at work is spent debugging SQL scripts and Excel formulae - that's why I want to get Scarlet in there because, well, what's the quote? "Software is either so complex there are no obvious bugs, or so simple there are obviously no bugs, guess which is harder to write." Excel and Relational are in the former category, Pick is in the latter. More importantly, Pick actually makes the latter easy!
Cheers,
Posted Aug 10, 2023 5:58 UTC (Thu)
by fredrik (subscriber, #232)
[Link] (1 responses)
Ditto for Pick: what is it? Link? Thanks!
Posted Aug 10, 2023 8:07 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
https://github.com/geneb/ScarletDME
https://en.wikipedia.org/wiki/Pick_operating_system
Google groups openqm, scarletdme, mvdbms, u2-users, I guess there are more ...
Go to the linux raid wiki to get my email addy, and email me off line if you like ...
Pick/MV is like Relational/SQL - there are multiple similar implementations.
Cheers,
Posted Aug 11, 2023 0:17 UTC (Fri)
by khim (subscriber, #9252)
[Link] (1 responses)
How would that work? Let's consider the three most important stats (in increasing order of importance): But where is the money in all that?
Posted Aug 11, 2023 0:48 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Aug 10, 2023 15:55 UTC (Thu)
by skx (subscriber, #14652)
[Link] (1 responses)
I have a single-board Z80-based system on my desk, running CP/M, these days. I tinker with it - I even wrote a simple text-based adventure game in assembly and ported it to the Spectrum.
But you're right, those days are gone outside small niches. Having time and patience to enjoy the retro-things is fun. But it's amazing how quickly you start to want more. (More RAM, internet access, little additions that you take for granted these days like readline.)
Posted Aug 11, 2023 5:47 UTC (Fri)
by eduperez (guest, #11232)
[Link]
Yes, that was my first "computer", back when I was fourteen.
Posted Aug 9, 2023 10:02 UTC (Wed)
by roc (subscriber, #30627)
[Link] (1 responses)
Some people commenting here claim they'd be happy with much lower performance. That's fine, but most people find some Web sites and phone apps useful, and those need high single-thread performance.
Posted Aug 11, 2023 0:23 UTC (Fri)
by khim (subscriber, #9252)
[Link]
Nope. Not even close. Web sites would be equally sluggish no matter how much speculation your CPU does, simply because there is no one who cares to make them fast. If speculation had been outlawed 10 or 20 years ago and all we had were fully in-order 100MHz 80486s… they would have worked at precisely the same speed they work today on 5GHz CPUs. The trick is that it's easy to go from a sluggish website on a 100MHz 80486 device to a sluggish web site on a 5GHz device, but it's not clear how you can go back, or if that's even possible at all.
Posted Aug 9, 2023 18:55 UTC (Wed)
by bartoc (guest, #124262)
[Link] (19 responses)
It's not at all clear what the "better" alternative to C is either, without sacrificing a ton of usability. Sure rust is "better" than C, but ultimately it shares the same fundamental execution model. One could argue GLSL/WGSL/HLSL/etc, but the things that those languages lack from the C execution model (mutual recursion, an ABI, a stack to which registers can be spilled, etc) are seen as things holding them back, precisely because those things make shader languages less dynamic than C, and thus require absolute explosions of up-front code generation, with all the compile time and I$ issues that brings.
Posted Aug 9, 2023 20:51 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
The problem with C isn't that real workloads are unpredictable. The problem with C is that the language behaviour is undefined and unpredictable. If you're writing something simple, there's not much difference between languages. Except that few problems are simple, C gives you very little help to cope, and indeed it's full of landmines that will explode at every opportunity.
Writing bug-free code in C is much harder than most other languages ...
Cheers,
Posted Aug 10, 2023 9:40 UTC (Thu)
by anton (subscriber, #25547)
[Link] (17 responses)
I don't see that this has much to do with the programming language. Rust is as vulnerable to Spectre and Downfall as C is AFAICS. The only influence I see is that for a language like JavaScript that always bounds-checks array accesses, you have an easier time adding Spectre-V1 mitigations. But for Rust, which tries to optimize away the bounds-check overhead, you end up either putting in Spectre-V1 mitigation overhead (can this be done automatically?), slowing it down to be uncompetitive with C, or it is still Spectre-V1 vulnerable. Admittedly, adding mitigations cannot be done automatically in C, because the compiler has no way to know the bounds in all cases.
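To make the mitigation in question concrete, here is a minimal C sketch of the index-masking idiom (the same general idea as the kernel's array_index_nospec()); it assumes a power-of-two table size and is only meant to show how a data dependency, rather than a branch alone, keeps a misspeculated access in bounds:

    /* Minimal sketch of Spectre-V1 index masking: the branch may be
     * speculated past, but the AND ties the index to the bounds check
     * with a data dependency, so the speculative load stays in bounds.
     * Assumes TABLE_SIZE is a power of two. */
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 256   /* must be a power of two for this masking trick */

    static uint8_t table[TABLE_SIZE];

    uint8_t lookup(size_t untrusted_index)
    {
        if (untrusted_index >= TABLE_SIZE)
            return 0;
        /* Mask rather than trust the (possibly misspeculated) branch. */
        return table[untrusted_index & (TABLE_SIZE - 1)];
    }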
The way to go is that the hardware manufacturers must fix (not mitigate) Spectre. They know how to avoid misspeculated->permanent state transitions for architectural state (Zenbleed is the exception that proves the rule), now apply it to microarchitectural state!
Posted Aug 10, 2023 11:52 UTC (Thu)
by excors (subscriber, #95769)
[Link] (2 responses)
If memory latency is predictable then I think it's much easier for the compiler to statically schedule the instructions, and the CPU can be much simpler while maintaining decent performance. But that only seems practical with very small amounts of memory (e.g. microcontrollers with single-cycle latency to SRAM, but only hundreds of KBs) or very large numbers of threads (e.g. GPUs where each core runs 128 threads with round-robin scheduling, so each thread has 128 cycles between consecutive instructions in the best case, which can mask a lot of memory latency), not for general-purpose desktop-class CPUs.
Posted Aug 10, 2023 15:14 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
Even with PGO, control flow still has largely unpredictable regions, which depend upon the details of user input, and can only be predicted at compile time if the exact input the user will use is provided at compile time. This was one component of why Itanium's EPIC never lived up to its performance predictions; as compilers got better at exploiting compile-time known predictability, they also benefited OoO and speculative execution machines, which could exploit predictability that only appears at runtime.
For example, in a H.264 encoder or decoder, black bars are going to send your binary down a highly predictable codepath doing the same thing time and time again; your PGO compiled binary is not going to be set up on the assumption of black bars, because that's just one part of the sorts of input you might get. However, at runtime, the CPU will notice that you're going down the same codepath over and over again as you handle the black bars, and will effectively optimize further based on the behaviour right now. Once you get back to the main picture, it'll change the optimizations it's applying dynamically, because you're no longer going down that specific route through the code.
Posted Aug 10, 2023 16:03 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Concerning memory latency, I also see very good speedups of out-of-order over in-order for benchmarks like the LaTeX benchmark which rarely misses the L1 cache.
Also, the fact that the Itanium II designers chose small low-latency L1 caches, while OoO designers went for larger and longer-latency L1 caches (especially in recent years, with the load latency from D-cache growing to 5 cycles in Ice Lake ff.) shows that the unpredictability is a smaller problem for compiler scheduling than the latency.
The dream of static scheduling has led a number of companies (Intel and Transmeta being the most prominent) to spend substantial funds on it. Dynamic scheduling (out-of-order execution) has won for general-purpose computing, and the better accuracy of dynamic branch prediction played a big role in that.
With regard to Spectre and company, compiler-based speculation would exhibit Spectre vulnerabilities just as hardware-based speculation does. Ok, you can tell the compiler not to speculate, but that severely restricts your compiler scheduling, increasing the disadvantage over OoO. Better fix Spectre in the OoO CPU.
Posted Aug 11, 2023 13:26 UTC (Fri)
by atnot (subscriber, #124910)
[Link] (13 responses)
You're thinking much too narrowly here in terms of what "C" is in this context. It has far less to do with the specific syntax and more with the general model of computation that derives from the original PDP11, i.e.:
Programs are a series of commands, whose effects become visible in order from top to bottom. The sequence of these commands can be arbitrarily replaced using a specific command, called a "branch". There is a singular, uniform thing called "memory", which is numbered from zero to infinity and you can create a valid read-write reference to any of it by using that number. And so on.
None of this is true internally for any modern compute device. It isn't even true for C anymore. But it was true for the creators of C, and as a result these assumptions were baked very deeply into the language, then tooling like gcc and LLVM, then languages that use that tooling like Rust, OpenCL, CUDA, and then architectures that wanted to be easily targetable by those tools, like RISC-V and, most notably, AMD64 (as opposed to its Itanic cousin). It's so established that people don't even recognize these as specific design choices anymore, it's just "how computers work".
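As a toy illustration of that flat, byte-numbered memory assumption (my sketch, not part of the comment above): in the C model any object is just a run of addressable bytes that ordinary pointer arithmetic can walk, which is precisely what segmented or capability-based designs do not hand out so freely.

    /* Toy illustration of the "one flat, byte-numbered memory" model:
     * any object is just a run of addressable bytes that plain pointer
     * arithmetic can walk. */
    #include <stdio.h>
    #include <string.h>

    struct point { int x, y; };

    int main(void)
    {
        struct point p = { 3, 4 };
        unsigned char bytes[sizeof p];

        /* View the struct as raw bytes at consecutive addresses. */
        memcpy(bytes, &p, sizeof p);
        for (size_t i = 0; i < sizeof p; i++)
            printf("%02x ", bytes[i]);
        printf("\n");
        return 0;
    }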
Rust is definitely a step away from C, and one that has at least some potential to improve how chips are designed in the future, if the tooling allows for it. But it's not a very big step in the grand scheme of things.
Posted Aug 11, 2023 22:40 UTC (Fri)
by jschrod (subscriber, #1646)
[Link] (12 responses)
That you have statements that are executed from top to bottom, where preconditions, invariants, and postconditions exist, is basically the foundation of theoretical computer science. No proof about algorithmic semantics or correctness would work without that assumption.
If this, in your own words, isn't true any more - can you please point me to academic work that formalizes that new behavior and its semantics?
After all, I cannot believe that theoretical computer scientists have ignored this development. It is too good an opportunity to write new articles for archival journals.
If no computer science work is published on your claim, can you please explain why research is ignoring this development?
Posted Aug 12, 2023 11:50 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (11 responses)
I reject this outright. There is an absolute world of wonderful compute models in between a Turing machine and C or a PDP11. Many of them are von Neumann machines, even. This should be clear from the fact that the key problem of the C model is that it's hard to model formally (see e.g. the size of the C memory model specification), while the Turing machine was purpose-designed for formal modelling.
Let me just give some examples:
For a very soft start, we can look at something like the 6502, which is generally pretty boring apart from treating the first 256 bytes of memory specially. Largely because of this, it is not supported upstream in any of the big C compilers.
Then we can look at something like Itanium, in which bundles of instructions are executed in parallel and can not see the effects of each other, along with things like explicit branch prediction, speculation and software pipelining.
This is actually pretty similar to modern CPUs, except instead of having it explicitly encoded in the instruction stream, they try to re-derive that information at runtime, often by just guessing.
Then we have things like GPUs, which have multiple tiers of shared memory and primarily work with masks instead of branches. Although they are slowly becoming more C-like as people seek to target them with C and C++ code.
There's also a whole bunch of architectures like ARM CHERI and many with memory segmentation, where addresses and pointers are not the same thing.
We can also talk about various other things like lisp machines, Mill, Transmeta, EDGE and many more things I'm forgetting.
Then, even further afield, you can find things like FPGAs, which are programmed using a functional specification of behavior much like TLA+. (The current fad is, of course, trying to run C on them, to limited success.)
Now if you say "But most of these are all obscure architectures nobody uses", then yes that's the point. It's because they don't look enough like K&R's PDP11. Itanium is far from the only innovative architecture that C killed and as primarily a hardware person, that's deeply frustrating.
Posted Aug 12, 2023 15:05 UTC (Sat)
by anton (subscriber, #25547)
[Link] (6 responses)
Why should the zero page of the 6502 be a reason not to support the 6502 in "big" C compilers? They can use the zero page like compilers for other architectures use registers (which leave no trace in C's memory model, either). Besides, there are C compilers for the 6502, like cc65 and cc64, so there is obviously no fundamental incompatibility between C and the 6502. The difficulties are more practical, stemming from having zero 16-bit registers, three 8-bit registers, only 256 bytes of stack, no stack-pointer-relative addressing, etc.
Concerning IA-64 (Itanium), this certainly was designed with existing software (much of it written in C) in mind, and there are C compilers for it, I have used gcc on an Itanium II box, and it works. C has not killed IA-64, out-of-order (OoO) execution CPUs have outcompeted it. IA-64, Transmeta and the Mill are based on the EPIC assumption that the compiler can perform better scheduling than the hardware, and it turned out that this assumption is wrong, largely because hardware has better branch prediction, and can therefore perform deeper speculative execution.
And the fact that OoO won over EPIC shows that having an architecture where instructions are performed sequentially is a good interface between software (not just written in C, but also, e.g., Rust) and hardware, an interface that allows adding a lot of performance-enhancing features under the hood.
Concerning Lisp machines, they were outcompeted by RISCs, which could run Lisp faster; which shows that they are not just designed for C. There actually was work on architectural support for LISP in SPUR, and some of it made it into SPARC, but one Lisp implementor wrote that their Lisp actually did not use the architectural support in their SPARC port, because the cost/benefit ratio did not look good.
Concerning GPUs, according to your theory C should have killed them long ago, yet they thrive. They are useful for some computing tasks and not good for others. In particular, let me know when you have a Rust or, say, CUDA compiler or OS kernel (maybe one written in Rust or CUDA) running on a GPU.
Posted Aug 14, 2023 9:50 UTC (Mon)
by james (subscriber, #1325)
[Link] (5 responses)
Would the conservative and increasingly security-sensitive server world have adopted the position that OoO couldn't be trusted? (Once Itanium was released, Intel would almost certainly have made that part of their marketing message.) In 2018, when in this timeline Meltdown and Spectre were discovered, the consensus of the security community was that more such attacks would be discovered, and time has sadly proven that to be correct — but we now have no other realistic option but to live with it.
We had other options around 2000 — then-current in-order processors (from Sun, for example).
The triumph of OoO looks much more like an accident of history rather than something inherent to computer science to me.
Posted Aug 14, 2023 11:05 UTC (Mon)
by anton (subscriber, #25547)
[Link] (4 responses)
By contrast, neither Intel nor AMD (nor AFAIK ARM or Apple) has fixed Spectre in the more than 6 years since they were informed of it. This indicates that these CPU manufacturers don't believe that they can sell a lot of hardware by being the first to offer hardware with such a fix (and making it a part of their marketing message). So they think that few of their customers care about Spectre. But if they thought that many customers care about Spectre, they would design OoO hardware without Spectre.
As for IA-64, it has architectural features for speculative loads, and is therefore also vulnerable to Spectre. This vulnerability can probably be mitigated by recompiling the program without using speculative loads (if we assume that the hardware does not perform any speculative execution, it's good enough to perform the speculative load and then not use the loaded data until the speculation is confirmed; for security the speculatively loaded data should be cleared in case of a failed speculation). This mitigation would reduce the performance of Itanium CPUs to be close to the performance of architectures without these speculative features, i.e., even lower than the Itanium performance that we saw.
OoO certainly has other options wrt. Spectre than to live with it. Just fix it. All the OoO hardware designers (the Zen2 designers are the exception that proves the rule) are able to squash speculative architectural state on a misprediction; they now just need to apply the same discipline to speculative microarchitectural state. E.g., if they had squashed the speculative branch predictor state on a misprediction, there would be no Inception, and if they had squashed the speculative AVX load buffer state on a misprediction, there would be no Downfall.
Posted Aug 15, 2023 4:26 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link] (3 responses)
A branch predictor which isn't allowed to learn, wouldn't that just be a rather useless static branch predictor, like "always predict backwards" or "as hinted by the machine code"?
Posted Aug 15, 2023 11:01 UTC (Tue)
by anton (subscriber, #25547)
[Link] (2 responses)
A straightforward way is to learn from completed (i.e. architectural) branches, with the advantage that you learn from the ground truth rather than speculation.
If that approach updates the branch predictor too late in your opinion (and for the return predictor that's certainly an issue), a way to get speculative branch predictions is to have an additional predictor in the speculative state, and use that in combination with the non-speculative predictor. If a prediction turns out to be correct, you can turn the part of the branch predictor state that is based in that prediction from speculative to non-speculative (like you do for architectural state); if a prediction turns out to be wrong, revert the speculative branch predictor state to its state when the branch was speculated on (just like you do with speculative architectural state).
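A toy software model of that commit/rollback idea (my own simplification, not a hardware design): a single 2-bit counter with a speculative overlay that is promoted when the prediction resolves as correct and discarded on a misprediction.

    /* Toy model of a 2-bit branch predictor with a speculative overlay:
     * updates made while speculating go into spec_ctr and are only
     * promoted to the committed predictor when the speculation resolves
     * as correct; on a misprediction the overlay is simply discarded. */
    #include <stdbool.h>
    #include <stdio.h>

    struct predictor {
        unsigned committed_ctr;   /* 0..3, non-speculative (ground truth) state */
        unsigned spec_ctr;        /* 0..3, state including speculative updates */
    };

    static bool predict(const struct predictor *p)
    {
        return p->spec_ctr >= 2;          /* predict taken if counter is 2 or 3 */
    }

    static void train(unsigned *ctr, bool taken)
    {
        if (taken && *ctr < 3)
            (*ctr)++;
        else if (!taken && *ctr > 0)
            (*ctr)--;
    }

    /* Outcome observed while executing down a predicted (speculative) path. */
    static void update_speculative(struct predictor *p, bool taken)
    {
        train(&p->spec_ctr, taken);
    }

    /* The branch finally resolves: either promote or discard the overlay. */
    static void resolve(struct predictor *p, bool speculation_was_correct,
                        bool actual_taken)
    {
        if (speculation_was_correct) {
            p->committed_ctr = p->spec_ctr;          /* promote the overlay */
        } else {
            train(&p->committed_ctr, actual_taken);  /* learn from ground truth */
            p->spec_ctr = p->committed_ctr;          /* squash the overlay */
        }
    }

    int main(void)
    {
        struct predictor p = { .committed_ctr = 2, .spec_ctr = 2 };

        printf("prediction: %s\n", predict(&p) ? "taken" : "not taken");
        update_speculative(&p, true);     /* outcome seen while speculating */
        resolve(&p, false, false);        /* speculation was wrong: squash */
        printf("after squash: committed=%u spec=%u\n",
               p.committed_ctr, p.spec_ctr);
        return 0;
    }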
Posted Aug 15, 2023 14:26 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link] (1 responses)
Why wouldn't such a branch predictor always give the initial answer? If correct, it would be sensible to stick to it and if wrong, you want to ignore that and revert to the state of the last correct guess or the initial state.
Assuming you want to apply that to a binary taken/not-taken branch predictor and not only to branch target predictors?
Posted Aug 15, 2023 15:23 UTC (Tue)
by anton (subscriber, #25547)
[Link]
In more detail: If the mispredicted branch is non-speculative, you record it in the non-speculative predictor. If the mispredicted branch is still in the speculative part of execution (that would mean that you have a CPU that corrects mispredictions out-of-order; I don't know if real CPUs do that), you record it in the speculative part, and when this branch leaves the speculative realm, this record can also be propagated to the non-speculative predictor.
Posted Aug 12, 2023 16:51 UTC (Sat)
by farnz (subscriber, #17727)
[Link] (3 responses)
Itanium failed to outperform AMD64 on hand-coded assembly as well as on C code. It wasn't killed by the C model, it was killed by a failure to deliver performance greater than other CPUs. VLIW CPUs like Transmeta failed because VLIW code is inherently low-density in memory, and our current bottleneck for performance tends to be L1 cache size. Mill has never reached a point where hand-written code in simulation outperforms hand-written code for AMD64 given the same simulated resources as AMD64. EDGE is an ongoing research project, and may (or may not) prove worthwhile - there's certainly not been an effort to build a good EDGE CPU that can be compared to something "C-friendly" like RISC-V.
Similar failures apply to Lisp Machines. While they had dedicated hardware to make running Lisp code faster, they lost out because RISC CPUs like SPARC and MIPS were even faster at running Lisp code for a given energy input than Lisp Machines were. Again, not about programming model, but about the Lisp Machines being worse hardware for running Lisp than MIPS or SPARC.
In terms of competing models of computation that have actually made it to retail sale, FPGAs are a commercial success, but are not programmed like CPUs, because they're defined as a sea of interconnected logic gates, and you are better off exploiting that via a Hardware Description Language than via something like C, FORTRAN or COBOL. GPUs are a commercial success; individual threads on a GPU are similar to a CPU with SIMD, with many threads per core (8 on Intel, more on others), and a hardware thread scheduler that allows you to have a pool of cores sharing thousands or even hundreds of thousands of threads.
None of this is about the "C model"; underpinning all of the noise is that humans struggle to coordinate concurrent logic in their heads, and prefer to think about a small number of coordination points (locks, message channels, rendezvous points, whatever) with a single thread of execution between those points. OoOE with speculative execution is one of the two local minima we've found for such a mental model of programming, and supports the case where a single thread of logic is the bottleneck. The other model that works well is the workgroup model used by GPU programming, where something distributes a very large number of input values to a pool of workers, and lets the workers build a large number of output values. Between the input and output values, there's very little (if not no) coordination between workers.
And while the 6502 is not supported upstream in any of the big C compilers, nor are many other CPUs of the same vintage. The Z80 is not supported in any of the big C compilers, nor is the 6809, for example, and both of those were big selling CPUs at the time the 6502 was current; the Z80 is also a lot friendlier to C than the 6502, since the Z80 does not limit you to a single 256 byte stack at a fixed location in memory, whereas the 6502 has a 256 byte stack fixed in page 1. I've never personally programmed a 6809 system, but I believe that it's also a lot more C friendly than the 6502.
Fundamentally, the thing that has killed every alternative to date is that the surviving processor types are simply faster for commercially significant problems than any competitor was, even with alternative programming models. This applies to VLIW, and to EPIC, and to Lisp Machines.
Posted Aug 14, 2023 20:08 UTC (Mon)
by mtaht (subscriber, #11087)
[Link] (2 responses)
Weirdly enough I do not care about IPC, what I care about is really rapid context and priv switching, something that unwinding speculation on the TLB flush on spectre really impacted. I am tired of building processors that can only go fast in a straight line. And like everyone here, tired of all these vulnerabilities.
The Mill held promise of context or priv switching in 3 clocks. The implicit zero feature and byte level protections seemed like a win. But it has been a long 10+ years since that design was announced; have there been any updates?
Posted Aug 14, 2023 21:52 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 17, 2023 14:43 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
It's a while since I saw the information (around 10 years), so I don't have links to hand, and it was investor-targeted. They seemed to be making the same mistake as Itanium designers, though - they compared hand-optimized code on their Mill simulator to GCC output on a then current Intel chip (Haswell, IIRC), showing that simulated Mill was better than GCC output on Haswell. The claim was that compiler improvements needed for Mill would bring Mill's performance on compiled code ahead of Haswell's performance; but it failed to take into account that, with a lot of human effort, I could get better performance from Haswell with hand-optimized code than they got with GCC output, using GCC's output as a starting point.
I am inherently sceptical of "compiler improvement" claims that will benefit one architecture and not another; while I'll accept that the improvement is not evenly distributed, until Mill Computing can show that their architecture with their compiler can outperform Intel, AMD, Apple, ARM or other cores with a modern production-quality (e.g. GCC, LLVM) compiler for the same language, I will tend towards the assumption that anything that they improve in the compiler will also benefit other architectures.
This holds especially true for compiler improvements around scheduling, which is what Mill depends upon, and what Itanium partially needed to beat OoOE - improvements to scheduling of instructions benefit OoOE by making the static schedule closer to optimal, leaving the OoOE engine to deal with the dynamic issues only, and not statically predictable hazards.
Posted Aug 10, 2023 23:59 UTC (Thu)
by khim (subscriber, #9252)
[Link] (4 responses)
And that's the beginning and the end. Most people out there don't want to think. And once these people have taken over… the whole house of cards started unraveling. Today people don't want to think… about anything, really. They ignore as much as they can and concentrate on what's profitable. Only… you can't eat paper, and the zeros and ones in central banks' servers are even more useless. It would be interesting to see if we find a way to avoid the collapse of western civilisation, but the chances are not good: most people not only don't understand why it's collapsing, they don't even notice that the collapse has not merely started but is well underway.
Posted Aug 11, 2023 15:10 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (3 responses)
A couple of days ago we had an article about Drax in one of our daily newspapers - so we're talking maybe 20-30% of newspaper readers reading this article.
A major part of the story is about the power station shutting down and avoiding having to pay rebates to consumers - some government subsidy that had to be repaid if they were generating and selling electricity above a certain price. So they shut down and sold their fuel elsewhere instead.
That fuel being woodchip. So a second, large, part of the journalist's story was about how Drax was one of our biggest greenhouse gas emitters and polluters in the country! The eco-friendliness of shipping the wood from Canada is certainly up for debate, but burning wood? That's one of the greenest fuels we've got!
When journalists - who are supposed to inform the public! - get their facts so badly out of kilter, what hope do the public have?
Cheers,
Posted Aug 11, 2023 15:26 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Particularly if that wood is coming from old wood forests that are being cleared. I don't know the details of Canadian wood pulp, but IVR a lot of their wood is from clearing old woods.
A final issue is that commercial forestry (least in UK and Ireland) is from dense pine forestry plantations, which is kind of a disaster for the native ecosystem. Really, we need to reforest our denuded countries (UK and Ireland) with natural, long-life forests - really good carbon capture and storage!
Which means we need something else for power. Something that is a lot more space efficient than covering the country in dense commercial and largely dead pine forests (which probably still won't give us enough fuel). The answer is obvious, but greens have irrational dogma.
Posted Aug 11, 2023 16:14 UTC (Fri)
by joib (subscriber, #8541)
[Link] (1 responses)
Burning biomass is, in the end, a very inefficient way of turning sunlight into usable energy. There just isn't enough arable land on the planet to replace the energy we currently get from fossil fuels. There are other very low carbon energy production technologies that are much more area efficient, like wind, solar and nuclear energy.
Anyway, this isn't the correct forum to debate this. ;)
Posted Aug 14, 2023 8:35 UTC (Mon)
by paulj (subscriber, #341)
[Link]
I consider myself pretty green, but I abhor the common "green" stance on nuclear power. Which is completely at odds with having both a) a biodiverse and sustainable planet and b) a modern way of life ("modern" implies high energy use in many many ways, and only nuclear can reliably replace fossil fuels to provide this). If you make society choose between A and B, society will choose B. Sigh sigh sigh.
Posted Aug 11, 2023 16:32 UTC (Fri)
by DemiMarie (subscriber, #164188)
[Link]
Posted Aug 10, 2023 22:41 UTC (Thu)
by dfc (subscriber, #87081)
[Link]
Posted Aug 9, 2023 6:39 UTC (Wed)
by ibukanov (subscriber, #3942)
[Link] (3 responses)
Posted Aug 9, 2023 10:10 UTC (Wed)
by excors (subscriber, #95769)
[Link] (2 responses)
They're curves because power is a non-linear function of frequency. There's some overlap where a lower-end CPU near its max frequency has worse power than a higher-end CPU at equal performance.
Draw a straight line through the origin and tangential to the purple (middle) curve. That should represent the optimal power/performance ratio for the Cortex-A715. By my rough measurements on this questionably-precise graph, the Cortex-A510 curve shows better power/performance than that when it's about 20%-60% of its max performance.
So if you're trying to optimise power/performance, and you're happy with <15% of the Cortex-A715's max performance - maybe your task doesn't need to complete quickly, or maybe you've got an embarrassingly parallel problem and can spread it over 6x as many cores with no extra overhead - then the Cortex-A510 seems worthwhile. But if you need even slightly more than that, and would have to drive the Cortex-A510 at a higher frequency, you'll get better efficiency *and* 3x better performance by switching to the Cortex-A715 at half its max frequency.
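A small sketch of that reasoning with entirely made-up operating points (the numbers are illustrative only, not read off the graph): compute performance per watt at each point, and the tangent-through-the-origin optimum is simply the point on a curve with the best ratio.

    /* Illustrative only: compare performance-per-watt at a few made-up
     * operating points for a small and a big core. The best efficiency on
     * a curve corresponds to the point where a line through the origin is
     * tangent to it. */
    #include <stdio.h>

    struct op_point { double perf; double watts; };   /* arbitrary units */

    static void best_ratio(const char *name, const struct op_point *pts, int n)
    {
        double best = 0.0;
        int best_i = 0;

        for (int i = 0; i < n; i++) {
            double ratio = pts[i].perf / pts[i].watts;
            if (ratio > best) {
                best = ratio;
                best_i = i;
            }
        }
        printf("%s: best perf/W = %.2f at perf %.2f, %.2f W\n",
               name, best, pts[best_i].perf, pts[best_i].watts);
    }

    int main(void)
    {
        /* Hypothetical numbers: a small in-order core and a big OoO core. */
        const struct op_point small_core[] = {
            { 0.5, 0.10 }, { 1.0, 0.25 }, { 1.5, 0.60 }, { 2.0, 1.20 },
        };
        const struct op_point big_core[] = {
            { 2.0, 0.50 }, { 4.0, 1.20 }, { 6.0, 2.50 }, { 8.0, 5.00 },
        };

        best_ratio("small core", small_core, 4);
        best_ratio("big core",   big_core,   4);
        return 0;
    }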
Posted Aug 10, 2023 9:48 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Posted Aug 11, 2023 13:54 UTC (Fri)
by ibukanov (subscriber, #3942)
[Link]
Posted Aug 9, 2023 9:48 UTC (Wed)
by epa (subscriber, #39769)
[Link] (5 responses)
Posted Aug 9, 2023 10:45 UTC (Wed)
by mb (subscriber, #50428)
[Link] (1 responses)
Yes, but then, if I click a button, I want today's massive software stack triggered by this action to run as fast as possible. Otherwise it becomes non-interactive.
Posted Aug 9, 2023 13:53 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted Aug 9, 2023 10:50 UTC (Wed)
by adobriyan (subscriber, #30858)
[Link] (2 responses)
> I might be happy with a slower, non-speculative CPU for most use.
> High-performance code for gaming or video decoding (or perhaps a kernel compile) can be explicitly tagged as less sensitive, and scheduled on a separate high-performance core.
A full x86_64 allmodconfig build takes about 3.5-4 hours on 1 core, and the kernel is not the slowest project to build.
Developers still need _many_ fast cores for parallel compilation.
Posted Aug 9, 2023 14:16 UTC (Wed)
by yodermk (subscriber, #3803)
[Link] (1 responses)
Posted Aug 9, 2023 21:45 UTC (Wed)
by DemiMarie (subscriber, #164188)
[Link]
Posted Aug 9, 2023 14:00 UTC (Wed)
by eru (subscriber, #2753)
[Link] (2 responses)
Posted Aug 9, 2023 14:30 UTC (Wed)
by excors (subscriber, #95769)
[Link]
Posted Aug 9, 2023 14:32 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
The last Intel x86 CPU with no speculative execution at all was the 80486. The Pentium (in 1993) had a very limited amount of speculative execution driven by a dynamic branch predictor, and it's just grown from then on.
Posted Aug 10, 2023 16:07 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link] (16 responses)
Posted Aug 10, 2023 16:34 UTC (Thu)
by malmedal (subscriber, #56172)
[Link] (7 responses)
In theory, but the people who have tried, e.g. Sun with Niagara and Intel with Larrabee have so far failed...
Posted Aug 11, 2023 9:58 UTC (Fri)
by paulj (subscriber, #341)
[Link] (6 responses)
Larrabee failed, but... Intel tried to make that into a GPU competitor. And the amount of RAM was limited.
Posted Aug 11, 2023 17:06 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
It never worked well; garbage collection was slow because even the "parallel" GC in the Sun JVM was not quite parallel, and the sequential parts were causing huge delays because the single-threaded execution was super-slow.
Later, we tried to use Tilera CPUs (massively parallel CPUs with 32 cores) for networking software, and it ALSO failed miserably. Turns out that occasional serialized code just overwhelms everything. I still have a MikroTik Tilera-based router from that experiment, I'm using it for my home network.
Posted Aug 11, 2023 19:55 UTC (Fri)
by malmedal (subscriber, #56172)
[Link]
Especially annoying since I was not making these calls directly, they were from third-party libraries so it was practically impossible to figure out what could be safely run in parallel.
Posted Aug 14, 2023 8:48 UTC (Mon)
by paulj (subscriber, #341)
[Link] (2 responses)
Tilera, worked on software on that too. The people who architected that software had actually done a pretty good job of making sure the packet processing "hot" paths could all run independently, and each thread (1:1 to CPU core) had its own copy of the data required to process packets. Other, non-forwarding-path "offline" code would then in the background take the per-CPU packet data, process it, figure out what needed to be updated, and update each per-CPU hot-path/packer-processing data state accordingly. That worked very well.
The issue the shop I worked at had with Tilera was that it was unreliable. The hardware had weird lock up bugs. I figured out ways to increase the MTBF of these hard lock ups, by taking more care in programming the broadcom Phys attached to the chip (I think they were on ASIC, and part of the Tilera design - can't quite remember). But... MAU programming via I2C controllers shouldn't really be causing catastrophic lockups of the whole chip. We still had hard lock ups though - never fully figured them all out or work-arounds.
It seemed a 'fragile' and sensitive chip.
Posted Aug 14, 2023 9:20 UTC (Mon)
by paulj (subscriber, #341)
[Link]
Posted Aug 14, 2023 15:38 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
We found some strange lockups in glibc, something to do with pthreads and signals. We "solved" it by porting musl libc, at that time it was easier to do than figuring out how to build and debug glibc.
But yeah, lockups also happened.
Posted Aug 17, 2023 11:00 UTC (Thu)
by davidgerard (guest, #100304)
[Link]
Posted Aug 10, 2023 17:16 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
Most systems have a compute device in them, called a GPU, which is designed that way. For certain workloads, such as graphics rendering and machine learning, this is an amazing model, because there's a huge amount of parallelism to exploit (so-called "embarrassingly parallel" problems). For others, such as running a TCP/IP stack, it's not great, because much of the problem is serial, and you're better off pushing the problem to a CPU which is designed to run a single thread exceedingly fast.
Posted Aug 11, 2023 9:59 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Posted Aug 11, 2023 9:52 UTC (Fri)
by paulj (subscriber, #341)
[Link] (4 responses)
That machine got considerably more throughput on highly parallel web workloads as a result (as long as you didn't run a web app in a language that indiscriminately used floating-point, like PHP, cause they gave it one FPU to share between all cores!).
See link in another comment to a blog post with more details and references to a couple of really good papers - old, but still good reading.
Posted Aug 11, 2023 11:19 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (3 responses)
At one point I tried explaining to people, complete with benchmarks, why the Niagaras were not a good fit for a specific PHP application. It is quite difficult to convince people that the new expensive system they just bought will never work as well as the existing several years old servers it was supposed to replace.
Since the servers were bought and paid for I tried to find something useful for them to do, but did not really succeed.
Posted Aug 11, 2023 12:03 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
Posted Aug 14, 2023 10:21 UTC (Mon)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Aug 15, 2023 4:47 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link]
Not true for Perl, integers and doubles use native types [1].
[1]: https://github.com/Perl/perl5/blob/79c6bd015ed156a95e3480...
Posted Aug 13, 2023 20:16 UTC (Sun)
by kleptog (subscriber, #1183)
[Link]
You're working on a VM so there is some overhead, but the result is that your application can scale linearly with the number of cores. A 256-core machine will support twice as many requests per second as a 128-core machine. It was built for telephony exchanges, and it shows. For stuff like WhatsApp where you're managing millions of TCP connections and messages, it really shines.
It's a functional language though with no per-process shared mutable state. It avoids a lot of GC overhead because most threads die before the first GC pass is run. You simply toss all the objects associated with a thread when it exits without checking liveness.
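A rough C analogy of that per-thread heap (a sketch of the general idea, not how BEAM actually implements it) is a bump-pointer arena that is freed wholesale when the task ends, with no per-object liveness tracking.

    /* Rough analogy in C for a per-process heap that is discarded wholesale:
     * a bump-pointer arena, freed in one go when the task ends, with no
     * per-object liveness tracking. */
    #include <stdlib.h>
    #include <stdio.h>
    #include <stddef.h>

    struct arena {
        char  *base;
        size_t size;
        size_t used;
    };

    static int arena_init(struct arena *a, size_t size)
    {
        a->base = malloc(size);
        a->size = size;
        a->used = 0;
        return a->base ? 0 : -1;
    }

    static void *arena_alloc(struct arena *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;           /* keep allocations aligned */
        if (a->used + n > a->size)
            return NULL;                      /* a real system would grow here */
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_destroy(struct arena *a)
    {
        free(a->base);                        /* everything dies at once */
        a->base = NULL;
    }

    int main(void)
    {
        struct arena task_heap;

        if (arena_init(&task_heap, 64 * 1024) != 0)
            return 1;
        int *counters = arena_alloc(&task_heap, 100 * sizeof(int));
        char *scratch = arena_alloc(&task_heap, 4096);
        if (counters && scratch)
            printf("allocated %zu bytes from the task heap\n", task_heap.used);
        arena_destroy(&task_heap);            /* no liveness checks needed */
        return 0;
    }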
There is absolutely no way you could make the existing mass of JavaScript or C/C++ run in such a way. Maybe one day we will have AI systems smart enough to reformulate code in this way for us.
Posted Aug 10, 2023 22:29 UTC (Thu)
by jmspeex (subscriber, #51639)
[Link] (4 responses)
Posted Aug 10, 2023 22:32 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (3 responses)
The 486 only had pipeline stalls for resolving conditional branches; the Pentium used a branch predictor and very limited speculative execution of the predicted outcome, and the amount of speculative execution in Intel processors went up from there onwards to today's CPUs.
Posted Aug 11, 2023 8:13 UTC (Fri)
by excors (subscriber, #95769)
[Link] (2 responses)
> PF: Prefetch
and says:
> In EX all u-pipe instructions and all v-pipe instructions except conditional branches are verified for correct branch prediction. [...] The final stage is Writeback (WB) where instructions are enabled to modify processor state and complete execution. In this stage, v-pipe conditional branches are verified for correct branch prediction.
(where the two pipes are: "The u-pipe can execute all integer and floating-point instructions. The v-pipe can execute simple integer instructions and the FXCH floating-point instruction.")
and:
> The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of branch instructions which minimizes pipeline stalls due to prefetch delays. The Pentium processor accesses the BTB with the address of the instruction in the D1 stage. [...]
Apart from the quirk with v-pipe conditional branches, that sounds like all branch predictions are resolved by the EX stage. If the prediction made in D1 was wrong, then it doesn't EX the mispredicted instruction, it flushes the pipeline and starts again. There is speculative fetch and decode, but no speculative execution. Am I misinterpreting that, or using a different meaning of "speculative execution" or something?
(Speculative fetching sounds mostly harmless in relation to Spectre - it can't reveal any microarchitectural state except the contents of the BTB, in contrast to proper speculative execution where potentially-sensitive register contents are processed by the EX stages of potentially-bogus instructions and may be exposed through many microarchitectural side channels.)
Posted Aug 11, 2023 12:25 UTC (Fri)
by tao (subscriber, #17563)
[Link] (1 responses)
Posted Aug 11, 2023 14:11 UTC (Fri)
by excors (subscriber, #95769)
[Link]
Posted Aug 9, 2023 8:01 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
Cheers,
Posted Aug 10, 2023 7:33 UTC (Thu)
by tomsi (subscriber, #2306)
[Link]
Posted Aug 10, 2023 10:09 UTC (Thu)
by Aissen (subscriber, #59976)
[Link] (1 responses)
Posted Aug 10, 2023 16:08 UTC (Thu)
by deater (subscriber, #11746)
[Link]
famously there are fewer transistors in a 6502 than there are pages in the x86 documentation. It's actually possible for one person to know what each transistor in the 6502 is doing and audit it.
Posted Aug 9, 2023 2:07 UTC (Wed)
by dxin (guest, #136611)
[Link] (39 responses)
Posted Aug 9, 2023 2:19 UTC (Wed)
by willy (subscriber, #9762)
[Link] (7 responses)
Posted Aug 9, 2023 5:35 UTC (Wed)
by ibukanov (subscriber, #3942)
[Link]
Posted Aug 9, 2023 9:16 UTC (Wed)
by paulj (subscriber, #341)
[Link] (5 responses)
There is a good argument to be made that the increasing transistor count budgets could be better spent on adding more, simple, compute elements ("cores") rather than adding ever more complex speculative execution logic to ever more complex compute elements. That this would be more efficient overall.
I.e., rather than trying to make 1 (or a very small) number of parallel paths of execution very fast with speculative execution, we should just provide many more paths of execution with simpler cores. The simpler cores might each have to stall more waiting on memory latency, but if you have many of them you can get more throughput - they will not waste cycles or energy on misplaced speculative execution.
These are not new ideas, they go back a long way, and we're slowly going down that path it seems. GPUs are kind of part of that vision, CPUs have gone many-core, but still with very complex speculative logic to fulfil desire for good single-thread benchmark results. Old blog of mine, but the references are still good to read: https://paul.jakma.org/2009/12/07/thread-level-parallelis...
Posted Aug 10, 2023 16:05 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link] (4 responses)
Posted Aug 10, 2023 17:19 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (2 responses)
You can't, easily. Much of the parallelism limit is inherent to the way we perceive the problem domain, and it's simply not possible to have more parallelism without radical new understandings of the problems we're trying to solve.
Some problems, such as graphics rendering and neural network modelling, do have a higher inherent parallelism, and we have an alternative type of processor, called a GPU for historical reasons, which is designed to be faster than a CPU on problems with lots of parallelism; it achieves this by sacrificing single thread performance in favour of running a large number of concurrent threads, complete with hardware support for launching a very large number of threads and multiplexing them onto a smaller number of executing threads.
Posted Aug 10, 2023 22:01 UTC (Thu)
by DemiMarie (subscriber, #164188)
[Link] (1 responses)
Posted Aug 10, 2023 22:08 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
They're no more terrible at non-uniform control flow than CPUs are - in the worst case, you just use one SIMD lane per GPU core, get a much lower throughput, but still have the large number of threads. It's just that we look at GPUs differently to CPUs, so we see the slowdown from using only one SIMD lane as a big deal on a GPU, but we don't see it as a big deal that we only use scalar instructions on CPU cores with the ability to process 8 (AVX2) or 16 (AVX-512) 32-bit values in parallel, despite the fact that this is the same class of slowdown.
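To make the unused-lanes comparison concrete, here is a small sketch using AVX2 intrinsics (it needs a CPU and compiler flags that support them, e.g. gcc -O2 -mavx2): the scalar loop uses one 32-bit lane's worth of the datapath per add, while the intrinsic version adds eight values at once.

    /* Scalar vs AVX2: the same 8-element add done one lane at a time and
     * eight lanes at a time. Build with e.g. gcc -O2 -mavx2. */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        int b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        int scalar_out[8], simd_out[8];

        /* Scalar: one 32-bit add per instruction, the other lanes idle. */
        for (int i = 0; i < 8; i++)
            scalar_out[i] = a[i] + b[i];

        /* AVX2: all eight 32-bit lanes in one add. */
        __m256i va = _mm256_loadu_si256((const __m256i *)a);
        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
        __m256i vc = _mm256_add_epi32(va, vb);
        _mm256_storeu_si256((__m256i *)simd_out, vc);

        for (int i = 0; i < 8; i++)
            printf("%d %d\n", scalar_out[i], simd_out[i]);
        return 0;
    }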
Posted Aug 11, 2023 9:48 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Making efficient use of compute resources, in a world where the codes you want to run have limited parallelism? Run many different codes together on the same compute elements, and switch between them to keep memory bandwidth and compute occupied. No single code will run faster, but at least you maintain throughput in the aggregate.
This is kind of where computers have gone anyway. From your phone, to your desktop, to servers running containers running jobs in the cloud - they've all got many many dozens of jobs to run at any given time. If one stalls, switch to another.
Posted Aug 9, 2023 8:13 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (30 responses)
At roughly 1ft/ns, this means your typical ATX mobo cannot operate faster than 500MHz. Knock a nought off that, to give a 3cm chip, and you've stuck a nought on your chip speed, 5GHz. Careful placement of components will nudge that speed up, but if components need to communicate "across chip", you're stuffed ...
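The arithmetic behind those figures, taking the 1ft/ns free-space number at face value and assuming (as the numbers above imply) that a signal has to make a round trip within one clock cycle:

    /* Back-of-envelope version of the argument above: if a signal has to
     * cross the structure and back within one clock cycle,
     * f_max = speed / (2 * distance). Uses the ~1 ft/ns (~0.3 m/ns)
     * free-space figure; real on-chip wires are considerably slower. */
    #include <stdio.h>

    static double max_clock_ghz(double speed_m_per_ns, double distance_m)
    {
        /* Require a round trip (signal out and response back) per cycle. */
        return speed_m_per_ns / (2.0 * distance_m);   /* cycles per ns == GHz */
    }

    int main(void)
    {
        const double c_m_per_ns = 0.3;        /* roughly 1 foot per nanosecond */

        printf("ATX board (~0.3 m): %.1f GHz\n", max_clock_ghz(c_m_per_ns, 0.30));
        printf("3 cm chip         : %.1f GHz\n", max_clock_ghz(c_m_per_ns, 0.03));
        return 0;
    }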
Cheers,
Posted Aug 9, 2023 9:16 UTC (Wed)
by joib (subscriber, #8541)
[Link]
Crays were about the same size as other mainframe-sized computers of the day. Going much bigger wasn't really useful, because neither software nor hardware at the time was ready for massive parallelism. Today it is, and thus we have warehouse-sized supercomputers that can run (some, obviously not all) HPC-style problems utilizing all that parallelism.
> At roughly 1ft/ns, this means your typical ATX mobo cannot operate faster than 500MHz. Knock a nought off that, to give a 3cm chip, and you've stuck a nought on your chip speed, 5GHz. Careful placement of components will nudge that speed up, but if components need to communicate "across chip", you're stuffed ...
That matters insofar as you require everything to be synchronous, with a signal traversing from across a wire within one clock cycle. The existence of CPU cores within your CPU running at different frequencies, not to mention long distance high speed network transmission, suggests that it's possible to design things without such synchronicity requirements.
Posted Aug 9, 2023 9:57 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (6 responses)
No, chip designers face many problems, but that one is solved. They are making complicated networks, called clock-trees, or meshes, that ensure each clock edge arrives at all components in the clock-domain simultaneously.
The problems they can't seem to solve include power and the fact that wires are getting slower faster than they get shorter in recent processes.
Posted Aug 9, 2023 11:53 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (5 responses)
That's the wrong problem! Making sure all the signals arrive together is easy. The problem is that if your edges do not share a common "light cone", you're stuffed!
Increasing the frequency reduces the size of the light cone. If the light cone is to encompass the entire chip, then the upper limit on frequency is 5GHz. As others have said, if you're only interested in communicating internal to a single core, then LOCALLY you can increase the frequency further, because everything you're interested in fits into a smaller light cone.
You can have all the fancy clock-trees you like, but if your components are that far apart that you physically require faster-than-light information transfer, you're stuffed.
Cheers,
Posted Aug 9, 2023 12:34 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (3 responses)
The actual speed was about 1mm per nanosecond over ten years ago. It is probably less than 0.05mm/ns now.
Posted Aug 9, 2023 13:18 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
Cheers,
Posted Aug 9, 2023 13:36 UTC (Wed)
by malmedal (subscriber, #56172)
[Link]
Posted Aug 9, 2023 14:37 UTC (Wed)
by joib (subscriber, #8541)
[Link]
What changes then is that for these very small conductors you'll find on modern deep submicron integrated circuits, the relative capacitance of the conductor starts to rise (and the resistance doesn't go down as well as you'd like either). This leads to a phenomenon where when you apply a voltage on one end of the conductor, it takes longer until the voltage/current rises enough on the other end to be registered as a 0->1 flip. So in effect it appears as if the speed of signal propagation drops. I'm not sure how well the telegrapher's equations mentioned by malmedal in the sibling post applies to multi-GHz signals propagating in these very narrow conductors, but something like that is the gist of it. I don't think you need to apply quantum mechanics or study the behavior of individual electrons per se to understand this phenomena.
Posted Aug 9, 2023 22:41 UTC (Wed)
by magnus (subscriber, #34778)
[Link]
Posted Aug 9, 2023 19:01 UTC (Wed)
by flussence (guest, #85566)
[Link]
We're already seeing that in RAM speeds where there's a tug-of-war over having it on-CPU/off-CPU, each new DDR version has a huge jump in latency that needs to be papered over with more cache, DIMMs on large boards need buffer chips, DDR5 (iirc) now *requires* ECC to survive normal operation…
I wouldn't be surprised if a few years from now we start seeing hard NUMA become mainstream. Back to the days of having an empty slot close to the CPU because the manufacturer was too cheap to populate it!
Posted Aug 9, 2023 19:45 UTC (Wed)
by willy (subscriber, #9762)
[Link] (20 responses)
Posted Aug 9, 2023 20:58 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (19 responses)
Cheers,
Posted Aug 9, 2023 21:40 UTC (Wed)
by willy (subscriber, #9762)
[Link] (4 responses)
The first mistaken assumption is that things need to happen in a single cycle. An instruction that needs data from the L3 cache can and will stall for hundreds of cycles. During that time the CPU will execute some of the other dozens of instructions that it has ready. It's something like six clock ticks to retrieve data from L1. Data in registers is ready to operate on and incurs no delay.
The second is that the speed of communication between different parts of the CPU has anything to do with the speed of light. The speed of electrons in copper is much slower. That's the part other people are telling you that you have wrong.
(There are other problems with your argument, but those are the big two)
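For a sense of scale, a toy average-memory-access-time calculation along the lines of the latencies mentioned above. All hit rates and cycle counts are assumed, illustrative values, and a real out-of-order core overlaps much of this latency with independent instructions rather than simply waiting.

# Expected latency of one load, walking down an assumed cache hierarchy.
# Each level's latency is treated as the full cost when the access hits there.
levels = [  # (name, hit rate among accesses reaching this level, latency in cycles)
    ("L1", 0.95, 5),
    ("L2", 0.80, 15),
    ("L3", 0.70, 50),
    ("DRAM", 1.00, 300),
]

def amat_cycles(levels):
    total, reach_prob = 0.0, 1.0
    for name, hit, lat in levels:
        total += reach_prob * hit * lat   # cost if it hits at this level
        reach_prob *= (1.0 - hit)         # probability of going further down
    return total

print(f"Average load latency: {amat_cycles(levels):.1f} cycles")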
Posted Aug 9, 2023 21:59 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (3 responses)
Note that the speed of electrons is irrelevant; the voltage change that represents a change in state moves much faster than the electrons do, typically at around 60% to 70% of the speed of light in a copper conductor.
But the point about things not needing to happen in a single cycle is key; I can design my logic to account for propagation delays in the circuit, and have it work perfectly. This is what the timing diagrams that are part of any digital logic chip datasheet (and in every CPU datasheet since the 4004) are all about - how do I connect up the entire system's worth of logic such that the system's timing constraints are met?
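A minimal sketch of the kind of register-to-register timing check being described here, with made-up delay numbers; the function name and figures are illustrative only.

# A path meets timing if the clock period covers the launching flop's
# clock-to-Q delay, the combinational/wire propagation delay, and the setup
# time of the capturing flop (clock skew folded in as a single term).
def path_meets_timing(clk_period_ns, t_clk_to_q_ns, t_prop_ns, t_setup_ns,
                      clock_skew_ns=0.0):
    return t_clk_to_q_ns + t_prop_ns + t_setup_ns <= clk_period_ns + clock_skew_ns

clk_period = 1.0 / 3.0  # ~3 GHz clock, period in ns
print(path_meets_timing(clk_period, t_clk_to_q_ns=0.05, t_prop_ns=0.20,
                        t_setup_ns=0.04))   # True: short path fits in one cycle
print(path_meets_timing(clk_period, t_clk_to_q_ns=0.05, t_prop_ns=0.60,
                        t_setup_ns=0.04))   # False: needs an extra pipeline stage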
Posted Aug 10, 2023 7:59 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (2 responses)
Signals are carried by photons (or em waves, same(ish) thing) so the speed of light IS relevant, although from what others have said the telegraph effect is probably more important.
My argument has repeatedly been prefixed with "IF components need to communicate" so okay, I'm not necessarily talking about clock cycles, but a single communication cycle has that upper limit. I'm not always clear in what I say, I know that, but if you make no attempt to understand me, I can't understand you either. So IFF a communication cycle equals a clock cycle, 5GHz is the maximum clock possible between two random components in a chip. Of course, splitting a communication clock cycle into multiple clock cycles can speed OTHER stuff up, but it makes no difference to the speed at which a signal travels across a chip.
(And of course, without communication a chip can't work.)
Cheers,
Posted Aug 10, 2023 15:03 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
A corollary of your argument is that Starlink satellites (communication clock rate of around 230 kHz) can be no higher than 1.3 km above the receiver, and Sky TV satellites (communication clock rate of 22 MHz or above) can be no higher than 13 metres above the receiver.
Posted Aug 10, 2023 16:23 UTC (Thu)
by malmedal (subscriber, #56172)
[Link]
However, you seem to be unable to understand what people are saying; please read more carefully.
For instance, there is no "telegraph effect"; the "telegrapher's equations" are just Maxwell's equations applied to signals in a wire.
If you wish to be able to say anything intelligible about chips you need to understand what "pipelines" are in this context. This appears to be a major gap in your knowledge; you completely ignore it when people bring this up. It is not just a word, it is one of the fundamental concepts.
Already the 8088 had a pipeline; it is not a new concept.
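A tiny sketch of what pipelining buys in this context, under the assumption that a signal needs (say) 2ns to cross the die while the clock runs at 5GHz; the numbers are invented.

import math

# Cut a long path into register stages so that no single stage exceeds one
# clock period.  The crossing then takes several cycles of latency, but a new
# value can still be launched on every cycle (register overheads ignored).
def stages_needed(total_path_delay_ns: float, clk_period_ns: float) -> int:
    return max(1, math.ceil(total_path_delay_ns / clk_period_ns))

total_delay = 2.0        # ns to cross the die (assumed)
clk_period = 1.0 / 5.0   # 5 GHz clock
n = stages_needed(total_delay, clk_period)
print(f"{n} stages: {n} cycles of latency, still one transfer per cycle")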
Posted Aug 9, 2023 21:57 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (13 responses)
Posted Aug 9, 2023 22:04 UTC (Wed)
by willy (subscriber, #9762)
[Link] (11 responses)
Posted Aug 9, 2023 22:21 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
And another way to sanity check things: if the size of space between communication endpoints limited your processing rate, we'd probably still be waiting for the first (quality) images from various Mars rovers.
At least I think that's somewhat closer than what Wol has as a model.
Posted Aug 10, 2023 8:00 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
Cheers,
Posted Aug 10, 2023 20:28 UTC (Thu)
by rschroev (subscriber, #4164)
[Link]
Posted Aug 9, 2023 22:23 UTC (Wed)
by malmedal (subscriber, #56172)
[Link]
Posted Aug 10, 2023 10:49 UTC (Thu)
by james (subscriber, #1325)
[Link] (6 responses)
This is the key difference between PCIe 5.0 (which used NRZ, or one bit per cycle) and PCIe 6.0. Both run at 32 billion signals per second: it's just with PCIe 6.0 each signal conveys two bits.
Your main point is correct, though -- this isn't what limits the length of a PCIe 6.0 connection.
Posted Aug 10, 2023 15:39 UTC (Thu)
by kpfleming (subscriber, #23250)
[Link] (5 responses)
Posted Aug 10, 2023 16:44 UTC (Thu)
by malmedal (subscriber, #56172)
[Link] (4 responses)
No. Each lane separately transmits 64 gigabits per second.
Standard terminology is 64 GT/s and 32 GHz.
Posted Aug 23, 2023 5:28 UTC (Wed)
by JosephBao91 (subscriber, #157211)
[Link] (3 responses)
Posted Aug 23, 2023 11:34 UTC (Wed)
by malmedal (subscriber, #56172)
[Link] (2 responses)
Posted Aug 23, 2023 12:13 UTC (Wed)
by excors (subscriber, #95769)
[Link] (1 responses)
For example https://blog.samtec.com/post/why-did-pcie-6-0-adopt-pam4-... describes the Nyquist frequency of PCIe 5.0/6.0 as 16GHz. (The sampling rate is also the same in both, the difference is that in 6.0 each sample encodes 2 bits, so it's 16GHz Nyquist frequency with 32GHz sampling rate and 64GT/s data rate.)
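The same relationship as a small worked calculation (symbol rate, bits per symbol, data rate and Nyquist frequency), using the figures quoted in this subthread; NRZ is treated as 2 signal levels and PAM4 as 4.

import math

def link_numbers(symbol_rate_gbaud: float, levels: int):
    bits_per_symbol = math.log2(levels)
    data_rate_gt = symbol_rate_gbaud * bits_per_symbol  # GT/s per lane
    nyquist_ghz = symbol_rate_gbaud / 2.0               # highest fundamental tone
    return data_rate_gt, nyquist_ghz

for name, baud, levels in [("PCIe 5.0 (NRZ)", 32.0, 2),
                           ("PCIe 6.0 (PAM4)", 32.0, 4)]:
    gt, nyq = link_numbers(baud, levels)
    print(f"{name}: {gt:.0f} GT/s per lane, Nyquist ~{nyq:.0f} GHz")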
Posted Aug 23, 2023 13:14 UTC (Wed)
by malmedal (subscriber, #56172)
[Link]
Posted Aug 9, 2023 22:26 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
For a very clear example, a geostationary TV satellite is typically transmitting at 22 MHz or higher symbol rates; if the signal has to propagate all the way from the satellite to the receiver before the satellite can start the next symbol, then geostationary orbit has to be no higher than 14 meters above the satellite dish. In practice, everything is designed to handle this delay, and thus it's fine.
If you insist on two-way communication, Starlink's signal has been partially reverse engineered, and has a symbol time of 4.4 µs; this corresponds to a 1.3 km path length in free space. And yet, a Starlink satellite is around 550 km above the Earth's surface, for a propagation delay of around 1,800 µs - significantly more than the symbol time.
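The arithmetic behind those satellite numbers, as a quick sketch:

# If a link really had to wait for each symbol to reach the receiver before
# sending the next, the maximum path length would be one symbol time of light
# travel.  Compare that with the actual one-way delay from a 550 km orbit.
C = 299_792_458.0  # m/s

def one_symbol_path_m(symbol_rate_hz: float) -> float:
    return C / symbol_rate_hz

print(f"22 MHz TV symbol rate  : {one_symbol_path_m(22e6):.1f} m per symbol")
print(f"Starlink, 4.4 us symbol: {C * 4.4e-6 / 1e3:.2f} km per symbol")
print(f"550 km orbit           : {550e3 / C * 1e6:.0f} us one-way delay")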
Posted Aug 9, 2023 15:27 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (3 responses)
All of these vulnerabilities exist because we have shared state in the hardware between two threads with different access rights; how far away is a world where we can afford to let some CPU cores be near-idle so that threads with different access rights don't share state?
In theory, the CPU designers could fix this by tagging state so that different threads (identified to the CPU by PCID and privilege level) don't share state, and by making sure that the partitioning of state tables between different threads changes slowly.
And also in theory, we could fix this in software by hard-flushing all core state at the beginning and end of each context switch that changes access rights (including user mode to kernel mode and back). However, this sort of state flushing is expensive on modern CPUs, because of the sheer quantity of state (branch predictors, caches, store buffers, load queues, and more).
Which leaves just isolation as the fix for high-performance systems; with enough CPU cores, you can afford the expensive state flush when a core switches access rights, and you can use message passing (e.g. io_uring) to ask a different core to do operations on your behalf.
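As a very rough illustration of the partitioning idea (not of the state flushing, which only the kernel and hardware can provide), here is a Linux-only sketch that pins work from two trust domains onto disjoint cores. The core sets and helper names are made up for the example, and a real deployment would also have to account for SMT siblings and shared caches such as L2/L3.

import os

TRUSTED_CORES   = {0, 1}   # assumed core numbers, purely illustrative
UNTRUSTED_CORES = {2, 3}

def run_in_domain(cores, fn, *args):
    """Fork a child, restrict it to the given cores, run fn, return exit code."""
    pid = os.fork()
    if pid == 0:                           # child process
        os.sched_setaffinity(0, cores)     # only these cores may run us
        fn(*args)
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)

if __name__ == "__main__":
    run_in_domain(TRUSTED_CORES,   print, "trusted work on cores", sorted(TRUSTED_CORES))
    run_in_domain(UNTRUSTED_CORES, print, "untrusted work on cores", sorted(UNTRUSTED_CORES))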
Posted Aug 9, 2023 16:18 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
Aren't we there already? If all your cores run at full power your chip will fry in seconds?
Just allocate one job per core and let the chip allocate power to whatever job is ready to run.
Cheers,
Posted Aug 9, 2023 17:39 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
My laptop has over 500 tasks running, many of them for only short periods before going to sleep. My phone has similarly large numbers of tasks.
We don't yet have thousands of cores, so we can't simply assign each task to a core; we thus need to work out how to avoid having (e.g.) kernel threads and user threads sharing the same state. And note that because some state is outside the core (L2 cache, for example), it's not just a case of "don't share cores - neither sequentially nor concurrently" - depending on the paranoia level, you might want to reduce shared state further than that.
Posted Aug 9, 2023 21:23 UTC (Wed)
by andresfreund (subscriber, #69562)
[Link]
Posted Aug 9, 2023 17:01 UTC (Wed)
by eharris (guest, #144549)
[Link] (1 responses)
Posted Aug 9, 2023 17:28 UTC (Wed)
by mb (subscriber, #50428)
[Link]
And also pre-fetching so that the slow memory access can happen while the CPU is doing other calculations.
Posted Aug 10, 2023 3:25 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
Of course that was a lie or at least a misconception that would conflict with all hopes for optimizations later. Caches are incompatible with confidentiality, yet they're absolutely mandatory with nowadays CPU frequencies. Busses are too small for the large number of cores and cause arbitration allowing to infer other cores' activities. The wide execution units in our CPUs are mostly idle, making SMT really useful but disclosing even more fine-grained activities, to the point that no more progress is being made in that direction (what CPU vendor does 4-SMT or 8-SMT, maybe only IBM's Power ?).
Meanwhile, the vast majority of us are using a laptop that we don't share with anyone and we all run commands using "sudo", most of the time not even having to re-type a password, because it's *our* machine, and we don't care about the loss of confidentiality there. And the huge number of users of cloud-based hosting shows that tiny dedicated systems definitely have a use case, so full machines of different sizes could be sold to customers, with zero sharing on them either.
Browsers are the only enemies on local machines and they could be placed into an isolation sandbox that runs in real-time mode and flushes caches and TLBs before being switched in. They would not be that much slower nor heavier anyway; they're already the most horrible piece of software ever created by humanity: software that takes gigs of RAM to start and does not even print "hello world" by default, doing nothing at all until connected to a site, so we could definitely afford to see them even slower.
With such mostly dedicated hardware approach, we could get back to using our *own* hardware at full speed and the way we want. We've entered an era where computers are getting slower over time only due to all mitigations for conceptual security trouble that most of us do not care about and that result in sacrificing performance.
Posted Aug 10, 2023 5:19 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link] (2 responses)
Posted Aug 10, 2023 5:56 UTC (Thu)
by dxin (guest, #136611)
[Link] (1 responses)
Posted Aug 10, 2023 18:02 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link]
Actually speculative execution works so well because it turns out that a lot of execution is very predictable. It's so predictable that the branch predictor has an accuracy of ~99% (depending on the application). This means that the instruction fetcher can fetch ahead for hundreds of instructions, and the OoO execution engine can execute these instructions ahead in an order determined by the data dependencies, rather than by program order. This allows modern CPUs to complete (i.e., execute) several instructions per cycle.
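A toy illustration of why even simple dynamic prediction does so well: a single 2-bit saturating counter already reaches ~99% on a typical loop branch (real predictors use global/local history and many counters, and do considerably better than this).

# One 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
def simulate_two_bit(outcomes):
    state, correct = 3, 0
    for taken in outcomes:
        predicted_taken = state >= 2
        correct += (predicted_taken == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken 99 times, then not-taken once, repeated.
trace = ([True] * 99 + [False]) * 100
print(f"prediction accuracy: {simulate_two_bit(trace):.1%}")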
Compilers (AOT and JIT) can have a go at predicting control flow using PGO
Profile-based static prediction has ~10% mispredictions, while modern history-based hardware branch prediction has about 1% mispredictions (for real numbers check the research literature, but the tendency is in that direction; and it's actually hard to compare the research, because static branch prediction research stopped about 30 years ago).
What "C memory model specification" do you mean?
C has not killed IA-64, out-of-order (OoO) execution CPUs have outcompeted it.
It's an interesting theoretical exercise to consider what would have happened if Meltdown and Spectre had been discovered sometime around 2000. Presumably the software workaround for Meltdown would have had to have looked like Red Hat's 4G/4G split, which could: "cause a typical measurable wall-clock overhead from 0% to 30%, for typical application workloads (DB workload, networking workload, etc.). Isolated microbenchmarks can show a bigger slowdown as well - due to the syscall latency increase."
That would have made a big difference to the perceived advantages of Itanium.
AFAIK a mitigation for Meltdown was indeed to not share the address space between kernel and user space, leading to TLB flushes on system calls. Intel fixed Meltdown relatively quickly in hardware, and AMD hardware has not been vulnerable to Meltdown AFAIK.
What makes you think that this would mean "isn't allowed to learn"? The fact that architectural state is not changed on a misprediction does not mean that architectural state is immutable, either.
If the prediction is wrong, you throw away the speculative nonsense (and thus avoid Inception), but you record that the prediction was wrong. I had not written that earlier, sorry.
A more revealing graph (but for the earlier A55 vs. A75 (vs. Exynos M4)) shows Perf/W. And it shows that the in-order A55 is better in Perf/W than the OoO A75 only at its very lowest performance. As soon as you need a little more, the A75 is more power-efficient.
> D1: Instruction Decode
> D2: Address Generate
> EX: Execute - ALU and Cache Access
> WB: Writeback
>
> A mispredicted branch (whether a BTB hit or miss) or a correctly predicted branch with the wrong target address will cause the pipelines to be flushed and the correct target to be fetched. Incorrectly predicted unconditional branches will incur an additional three clock delay, incorrectly predicted conditional branches in the u-pipe will incur an additional three clock delay, and incorrectly predicted conditional branches in the v-pipe will incur an additional four clock delay.
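A rough cost model for what such penalties mean for throughput; the branch frequency, base CPI and the "modern" line are assumptions for illustration only, not measured figures.

# Effective cycles-per-instruction given a branch misprediction rate and the
# pipeline-flush penalty paid on each misprediction.
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

base, branches = 1.0, 0.20   # assumed: 1.0 CPI baseline, 20% of instructions are branches
for label, rate, penalty in [("~10% mispredictions, 4-cycle flush ", 0.10, 4),
                             ("~1% mispredictions, 15-cycle flush ", 0.01, 15)]:
    print(f"{label}: CPI ~{effective_cpi(base, branches, rate, penalty):.2f}")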
Of course this does not matter on a single-user computer that does not run arbitrary untrusted code from the Internet.
But they actually guessed it right: things get really sketchy after 1000x.
We have scaled really well on the number of cores, but barely made it past 10x clock speed, and on the way to 10x IPC we shot ourselves in the foot at every single step.
To further emphasize this point, the speed of a PCIe gen 6 link is now 64GHz.
I'm pretty sure this isn't technically correct, at least when talking about how far the signal propagates before the next signal is generated. PCIe 6.0 uses PAM4 (Pulse Amplitude Modulation with 4 Levels) [...] a multilevel signal modulation format used to transmit data. [...] It packs two bits of information into the same amount of time on a serial channel. The utilization of PAM4 allows the PCIe 6.0 specification to reach 64 GT/s data rate and up to 256 GB/s bidirectional bandwidth via a x16 configuration.
It's basically the same concept as MLC versus SLC in flash.
PCIe Gen5 is 32 GT/s with a frequency of 16 GHz (it transfers data on both the rising and falling edges). Gen6 uses PAM4 instead of NRZ, so it transfers 2 bits per symbol; the frequency is still 16 GHz, but the speed is 64 GT/s.
And for hardware design, PAM4 at 16 GHz is more difficult than NRZ at 16 GHz.
...because I thought that modern multi-core chips need to talk to SHARED main memory housed across the bus in completely separate cards.....so maybe 10 centimeters away from the CPU chip.
*
Someone here can no doubt clear up my confusion.
Technically it's "fine-grained multithreading", like most GPUs. Switching threads each cycle in round-robin style makes the wall-clock time of each cycle much longer from an individual thread's point of view, so pipeline delays effectively don't exist.
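A toy model of that: with N hardware threads issued round-robin, a given thread only gets an issue slot every N cycles, so any instruction latency up to N cycles is invisible to it. The latency figure below is an assumption for illustration.

# A thread in a barrel-style design stalls only if its previous result is not
# ready by the time its next issue slot comes around.
def thread_sees_stall(latency_cycles: int, n_threads: int) -> bool:
    return latency_cycles > n_threads

for n in (1, 4, 8):
    seen = "stalls" if thread_sees_stall(5, n) else "no stall visible to the thread"
    print(f"{n} thread(s), 5-cycle latency: {seen}")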