
C was a great low-level language - for the PDP-11

Posted Feb 15, 2021 12:47 UTC (Mon) by excors (subscriber, #95769)
In reply to: C was a great low-level language - for the PDP-11 by anton
Parent article: Python cryptography, Rust, and Gentoo

> Instead of caches, they wanted us to manage fast memory by software, with the most recent instance being the SPEs of the Cell Broadband Engine (used in the PlayStation 3). Instead of somewhat consistent shared memory, they would rather have given us distributed memory, with software managing the transfer of data from remote to local memory before processing (supercomputers still have this). All this would make general-purpose programming so much harder that the alternatives with more complex hardware won out.

On the other hand, GPGPU has risen in popularity, and it often does require the programmer to explicitly handle distributed memory. In OpenCL terminology you have host memory (the system RAM shared with the CPU), global memory (VRAM), local memory (shared by a large group of work-items), and private memory (basically the register file for a single work-item, though with some sharing between nearby work-items). You have to declare where all your data will live in that hierarchy, write code to copy it between levels, and partition your work-items into the same group/subgroup when they need to share data efficiently. Getting that right can have a massive effect on performance (maybe 1-2 orders of magnitude).
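To make the "copy it between levels" step concrete, here is a minimal host-side sketch in plain C. It is not OpenCL code and doesn't run on a GPU; it only models the pattern of staging a tile from a large "global" array into a small "local" buffer before working on it. The names `sum_tiled`, `GROUP_SIZE`, `local_tile` and `private_acc` are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

#define GROUP_SIZE 64  /* hypothetical work-group size, chosen arbitrarily */

/* Sum n floats from "global" memory, one group-sized tile at a time.
   A plain-C model of the staging pattern, not a real GPU kernel. */
float sum_tiled(const float *global_in, size_t n)
{
    float local_tile[GROUP_SIZE];  /* stands in for OpenCL local memory */
    float total = 0.0f;

    for (size_t base = 0; base < n; base += GROUP_SIZE) {
        size_t remaining = n - base;
        size_t count = remaining < GROUP_SIZE ? remaining : GROUP_SIZE;

        /* Explicit copy, global -> local: the step a CPU cache would
           perform implicitly, but which software does by hand here. */
        memcpy(local_tile, global_in + base, count * sizeof(float));

        /* Stands in for private memory (registers): a single
           accumulator walks the tile. On a GPU the work-items of one
           group would do this part in parallel. */
        float private_acc = 0.0f;
        for (size_t i = 0; i < count; i++)
            private_acc += local_tile[i];

        /* Fold the group's partial result into the running total. */
        total += private_acc;
    }
    return total;
}
```

In real OpenCL the same three roles would be expressed with address-space qualifiers on the kernel's pointers and variables, and the choice of tile size and grouping is where the large performance differences come from.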

For serious number-crunching, GPUs won out over CPUs. I suspect that is because their memory model is much more scalable than the CPU's illusion of consistent shared memory, *and* they have a programming model that makes it relatively easy to exploit that memory model. They run many thousands of parallel threads, so the programmer can usually ignore memory latency and branch latency: even if 90% of threads are stalled, there are enough runnable threads to keep all the ALUs busy or to saturate memory bandwidth. They also have just enough sharing between threads that the threads can coordinate on non-trivially-parallelisable problems.

As far as I can see, Cell was somewhere in the middle: it had GPU-like memory (8 SPEs, each with 256KB of local memory and 2KB of private memory, i.e. registers, split between 4-16 work-items, i.e. SIMD lanes) but it had a more traditional CPU-like programming model (just a single thread per SPE, running SIMD instructions, but even worse than regular CPUs at branches). The problem wasn't the distributed memory model; the problem was that it didn't commit hard enough in either direction, and so it was beaten by GPUs on one side and traditional CPUs on the other.



