Python cryptography, Rust, and Gentoo

Posted Feb 11, 2021 5:12 UTC (Thu) by marcH (subscriber, #57642)
In reply to: Python cryptography, Rust, and Gentoo by Paf
Parent article: Python cryptography, Rust, and Gentoo

> I love the simplicity and feeling of precision

Emphasis on "feeling"

https://queue.acm.org/detail.cfm?id=3212479



C was a great low-level language - for the PDP-11

Posted Feb 12, 2021 11:40 UTC (Fri) by sdalley (subscriber, #18550) [Link] (5 responses)

That was a *really* good article!

The increasingly mind-boggling and foot-shooting complexity of modern C compiler optimizations is the clearest evidence one could wish for that C is not "close to the metal" of any modern mainstream processor. Like a tree growing on top of a pile of buried scrap metal, modern architectures and compilers have had to distort and twist themselves to grow around the need to preserve the illusion that they have flat memory, fixed registers, pointer arithmetic and sequential operation.

What would a useful modern low-level language that treats vectors, co-processors, threads, segments, references and caches as first-class objects look like?

C was a great low-level language - for the PDP-11

Posted Feb 12, 2021 12:17 UTC (Fri) by pizza (subscriber, #46) [Link] (2 responses)

The reason compilers are so re-writingly complex is the same reason that modern CPUs are so re-writingly complex: squeezing every last drop of performance out of _existing_ code.

After the top-line price, raw performance is the only thing that folks actually care about.

(Granted, the tide has begun to shift slightly in favor of "security", but given the choice, folks will choose "faster" over "more secure"... every. single. time.)

C was a great low-level language - for the PDP-11

Posted Feb 12, 2021 17:35 UTC (Fri) by anselm (subscriber, #2796) [Link] (1 response)

> The reason compilers are so re-writingly complex is the same reason that modern CPUs are so re-writingly complex: squeezing every last drop of performance out of _existing_ code.

Also, humans have a better chance of writing working (let alone efficient) code if they don't need to think about “vectors, co-processors, threads, segments, references and caches as first-class objects”. We have compilers so we don't need to worry about all of those (the vast majority of us who aren't working on actual compilers, anyway).

C was a great low-level language - for the PDP-11

Posted Feb 12, 2021 22:08 UTC (Fri) by marcH (subscriber, #57642) [Link]

For the PDP-11, C provided an outstanding trade-off: user-friendly programming concepts that mapped really well to the hardware.

While these concepts don't map to the hardware anymore, they stayed familiar and their programmer-friendliness has indeed not regressed. But it hasn't progressed either.

It is a very sad vicious circle to see that programming concepts and hardware keep meeting in a place that does not exist any more. Something like "retpoline" is the absolute irony: still meeting the hardware in that old, fictional place BUT with the knowledge of what hardware really does behind the scenes AND the intention to defeat that! Multiple layers of masquerading; what a carnival.

It's fantastic to see that a new crop of programming languages is at least trying to evolve a bit.

http://worrydream.com/#!/TheFutureOfProgramming (Bret Victor)

C was a great low-level language - for the PDP-11

Posted Feb 15, 2021 9:46 UTC (Mon) by anton (subscriber, #25547) [Link] (1 responses)

The referenced article is not particularly good, just a hodgepodge of pet peeves.

As for the complexity of gcc and clang/LLVM, it is an indication that they have too much budget and want to produce good benchmark results (at the cost of worse usability) to justify that (admittedly they are also doing things that help usability, but they could do that without doing the other nonsense).

As for flat memory and caches (and, mentioned in the paper, cache coherency protocols), that is indeed hardware architecture for speeding up existing software written for a simple memory model, plus being able to run processes with large memory needs. Hardware architects needed a long time to get here, and tried to throw the complexity over to programmers the whole time (and are still doing it, with weak memory consistency): Instead of caches, they wanted us to manage fast memory by software, with the most recent instance being the SPEs of the Cell Broadband Engine (used in the PlayStation 3). Instead of somewhat consistent shared memory, they would rather have given us distributed memory, with software managing the transfer of data from remote to local memory before processing (supercomputers still have this). All this would make general-purpose programming so much harder that the alternatives with more complex hardware won out. So the architectures provide at least single-threaded programs with a "flat" memory model, and a language that reflects that memory model with, e.g., address arithmetic is a sensible low-level language for that (but note that C as understood by the gcc and clang maintainers is not such a language).

Segments are what I first thought of when you mentioned "flat memory". This has been pretty much eliminated as an architectural (mis)feature (and where it is present, it has not been used for a while); having it in an architecture costs extra hardware, and costs extra in software. As to how a low-level language would look that supports it, look at the C standard; it includes many restrictions that cater for these kinds of architectures; and these days the gcc and clang maintainers use these restrictions as justification for miscompiling programs on architectures with flat memory.

As for register renaming (vs. "fixed registers"), Intel has spent billions on IA-64 aka Itanium based on the idea that compilers could rename "fixed registers" and reorder instructions better than the hardware can. In the end it turned out that the hardware with register renaming performs better for most software. The IA-64 approach would also have required more complex compilers to perform well, and the Itanium CPUs are also quite power-hungry even without a register renamer.

Vectors as first-class objects: Look at APL, J, or FP, although I would not call these languages low-level. Still, Backus was not pleased with architecture and programming languages and proposed FP as an alternative programming model. But despite Backus' standing and his high-profile presentation of his critique and alternative, FP/FL have not seen mainstream success nor taken the functional programming community by storm.

On a completely different track, you can look at GNU C's vector extensions, which are pretty low-level.
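To give a flavour of those vector extensions, here is a minimal sketch, assuming a compiler that supports the `vector_size` attribute (GCC and Clang do). The type name `v4sf` and the function `saxpy4` are made up for illustration, and the function assumes the length is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* GNU C vector extension: four floats packed into one 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: y[i] = a * x[i] + y[i], four elements per step.
 * Assumes n is a multiple of 4. */
void saxpy4(float a, const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    v4sf va = {a, a, a, a};
    for (size_t i = 0; i < n; i += 4) {
        v4sf vx, vy;
        memcpy(&vx, x + i, sizeof vx);  /* load without alignment assumptions */
        memcpy(&vy, y + i, sizeof vy);
        vy = va * vx + vy;              /* ordinary operators work element-wise */
        memcpy(y + i, &vy, sizeof vy);  /* store the four results */
    }
}
```

The vector is just another value type with the usual arithmetic operators; the compiler is left to pick whatever SIMD instructions the target offers, or to fall back to scalar code.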

As for threads, we have seen SMT in mainstream CPUs since 2002 and multi-core CPUs in the mainstream since 2005. The low-level approaches to that have been pthreads and the C++ memory model, but they are hard to program with. By contrast, Unix pipes (a high-level concept) let me use multiple cores or hardware threads without particular effort (but typically only for rather limited amounts of parallelism).
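As a rough illustration of the pipe style, here is a sketch in C of what a two-stage shell pipeline does underneath, assuming a POSIX system. The function name `sum_via_pipe` is hypothetical. The producer and consumer are separate processes with private memory, so there are no locks and no memory-ordering questions in user code.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: a child process produces 1..n, the parent sums them.
 * Returns -1 on failure. */
long sum_via_pipe(int n)
{
    assert(n >= 0);
    int fd[2];
    if (pipe(fd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {                     /* producer: may run on another core */
        close(fd[0]);
        for (int32_t i = 1; i <= n; i++)
            if (write(fd[1], &i, sizeof i) != (ssize_t)sizeof i)
                _exit(1);
        close(fd[1]);
        _exit(0);
    }

    close(fd[1]);                       /* consumer */
    long sum = 0;
    int32_t v;
    /* With a single writer emitting whole 4-byte values, each read is
     * expected to return a whole value or end-of-file; a more defensive
     * version would loop on short reads. */
    while (read(fd[0], &v, sizeof v) == (ssize_t)sizeof v)
        sum += v;
    close(fd[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    return sum;
}
```

The parallelism is limited to the number of pipeline stages, which matches the caveat above.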

Occam is a programming language for programming distributed-memory multiprocessors (but even on shared-memory machines, each thread could get its private memory, limiting the memory ordering headaches to the implementation of communications). I think that one other thing that the transputers and Occam did right was to make thread creation, destruction and communications very cheap, so finding the right granularity of parallel processing was not as critical as on current mainstream stuff. Still, I don't see these aspects of Occam being picked up in the mainstream, so maybe they are not as important as I think.

Overall, the problem of making good use of many threads with little burden on the programmers is still unsolved, and that's why architectures with lots of slow threads have not found mainstream success.

C was a great low-level language - for the PDP-11

Posted Feb 15, 2021 12:47 UTC (Mon) by excors (subscriber, #95769) [Link]

> Instead of caches, they wanted us to manage fast memory by software, with the most recent instance being the SPEs of the Cell Broadband Engine (used in the PlayStation 3). Instead of somewhat consistent shared memory, they would rather have given us distributed memory, with software managing the transfer of data from remote to local memory before processing (supercomputers still have this). All this would make general-purpose programming so much harder that the alternatives with more complex hardware won out.

On the other hand GPGPU has risen in popularity, and that often does require the programmer to explicitly handle distributed memory. In OpenCL terminology you have host memory (the system RAM shared with the CPU), global memory (VRAM), local memory (shared by a large group of work-items), and private memory (basically the register file for a single work-item, though with some sharing between nearby work-items). You have to declare where all your data will live in that hierarchy, and write code to copy it between different levels, and partition your work-items to be in the same group/subgroup when they need to share data efficiently, and that can have a massive effect (maybe 1-2 orders of magnitude) on performance.
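The "copy it between levels yourself" pattern can be sketched in plain C. This is only a stand-in under stated assumptions, not OpenCL: `TILE`, `local` and `sum_squares_tiled` are hypothetical names, and on a CPU the small buffer simply ends up in cache, whereas on a GPU or a Cell SPE it would be a physically separate memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 256  /* assumed capacity of the fast "local" buffer, in elements */

/* Hypothetical sketch: stage data from "global" memory into a small
 * "local" buffer one tile at a time, and compute only on the local copy. */
long sum_squares_tiled(const int *global, size_t n)
{
    assert(global != NULL || n == 0);
    int local[TILE];                    /* stands in for local/scratchpad memory */
    long total = 0;
    for (size_t base = 0; base < n; base += TILE) {
        size_t len = n - base < TILE ? n - base : TILE;
        /* Explicit transfer: global -> local. */
        memcpy(local, global + base, len * sizeof local[0]);
        /* All arithmetic touches only the local tile. */
        for (size_t i = 0; i < len; i++)
            total += (long)local[i] * local[i];
    }
    return total;
}
```

Picking the tile size and deciding which data gets staged is exactly the kind of choice that the programmer, rather than a cache, has to make in this model.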

For serious number-crunching, GPUs won out over CPUs, which I suspect is because their memory model is much more scalable than the CPU's illusion of consistent shared memory, *and* they have a programming model that makes it relatively easy to exploit that memory model (by running many thousands of parallel threads so the programmer can usually ignore memory latency and branch latency - even if 90% of threads are stalled, there's enough runnable threads to keep all the ALUs busy or to saturate memory bandwidth - and by having just enough sharing between threads so they can coordinate on non-trivially-parallelisable problems).

As far as I can see, Cell was somewhere in the middle: it had GPU-like memory (8 SPEs with 256KB of local memory, and 2KB of private memory (/registers) split between 4-16 work-items (/SIMD lanes)) but it had a more traditional CPU-like programming model (just a single thread per SPE, running SIMD instructions, but even worse than regular CPUs at branches). The problem wasn't the distributed memory model, the problem was that it didn't commit hard enough in either direction and so it was beaten by GPUs on one side and traditional CPUs on the other side.

