LWN: Comments on "Kernel optimization with BOLT"

Cache sizes

raven667 — Fri, 08 Nov 2024 14:42:47 +0000

Some of what you describe is the fact that while software people sort of imagine the computer operates in a virtual realm of abstract logic, hardware is actually a physical electrical device, and you can't just abstractly put "more cache" on it the way you could refactor a software program, because of the physical reality of the electrical circuits and wiring that make up the computer.

Grope

paulj — Thu, 07 Nov 2024 15:15:34 +0000

The free software community in 1998 was largely young people - from late teen students to 30-somethings (Linus was 29). So they're now in their 40s to 60s.

I.e., the set of young people who enjoyed mildly vulgar/shocking-to-norms puns then, are mostly the same set of people as the older set who today find it juvenile.

Grope

cmkrnl — Wed, 06 Nov 2024 23:35:50 +0000

Would have sounded lighthearted and witty to a close-knit group of young people in 1998, but today it just sounds juvenile and immature.

Intriguing

rolandog — Wed, 06 Nov 2024 12:41:43 +0000

I'm also curious as to whether BOLT is smart enough to distinguish functions that need to run in constant time to prevent timing attacks. (Gotta watch the presentation, though... Maybe it's addressed there).

Cache sizes

anton — Tue, 05 Nov 2024 08:06:57 +0000

The L2 cache of many cores is dedicated to the core, too (e.g., on Intel's P-cores for over a decade and on AMD's Zen-Zen5 cores).

The reason for keeping the L1 cache small is latency. If the cache grows, the miss rate decreases, but the latency increases. You can see the longer latency nicely in the comparison of L2 sizes and latencies in this article.

One reason is that the wires get longer, which increases the time that signals take to travel.

You also want to use a virtually-indexed physically-tagged (VIPT) cache as L1 cache, which allows the TLB access and the cache access to be performed in parallel, i.e., with low latency. But that means that the size of a cache way is at most as large as a page. The number of ways is limited (you typically don't see more than 16-way set-associative caches, and a lower number of ways is common in L1 caches), and the page size is 4KB on AMD64, which limits the L1 cache size to 64KB (and 32KB or 48KB is more common). Apple's Firestorm (M1 P-core) has larger caches (192KB I-cache, 128KB D-cache), but also 16KB pages, which allows a VIPT cache implementation with a 12-way (I) or 8-way (D) set-associative cache.
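The arithmetic behind that limit can be checked with a small sketch. It only assumes the constraint stated above (one cache way is at most one page); the function names are made up for illustration and do not come from any real tool.

```python
KB = 1024

def max_vipt_cache_size(page_size, ways):
    """Upper bound on a VIPT cache size when each way is at most one page."""
    return page_size * ways

def min_ways_needed(cache_size, page_size):
    """Smallest number of ways that keeps every way within one page."""
    return -(-cache_size // page_size)  # ceiling division

# 4KB pages with at most 16 ways: the 64KB limit mentioned for AMD64.
print(max_vipt_cache_size(4 * KB, 16) // KB)   # 64

# 16KB pages: the Firestorm cache sizes need 12 and 8 ways respectively.
print(min_ways_needed(192 * KB, 16 * KB))      # 12
print(min_ways_needed(128 * KB, 16 * KB))      # 8
```

By the same rule, a 48KB cache with 4KB pages needs at least 12 ways, which is why the larger L1 sizes on AMD64 come with higher associativity.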

Cache sizes

himi — Mon, 04 Nov 2024 23:48:58 +0000

Out of curiosity, how much of that is because larger I and D caches didn't provide as much of a gain as increasing L3 caches? Particularly since the L1 caches are tightly coupled to each core rather than shared across the ever-increasing number of cores - devoting transistors to increasing L1 caches is going to have a very different cost/benefit mix than devoting them to more computational units or shared caches . . .

Cache sizes

paulj — Mon, 04 Nov 2024 10:45:09 +0000

Hell, the AMD *K6* had 32 KiB I and D cache!

Cache sizes

anton — Fri, 01 Nov 2024 18:46:52 +0000

The claims made in the article (maybe in the talk) about cache sizes are mostly wrong.

I-cache and D-cache typically have similar size, and if they differ, it's not always the D-cache that is larger. E.g., Zen4 has 32KB I-cache and 32KB D-cache, Zen5 and Raptor Cove have 32KB I-cache and 48KB D-cache, and Gracemont has 64KB I-cache and 32KB D-cache.

The sizes of L1 caches generally have not grown in the last 20 years; e.g., the 2003 Athlon 64 has 64KB I-cache and 64KB D-cache, and the 2003 Pentium M has 32KB I-cache and 32KB D-cache. Instead, they have added an L3 cache since that time. A number of cores have a microoperation cache in addition to the I-cache, but the sizes are hard to compare.

Other BOLT weirdness

anton — Fri, 01 Nov 2024 18:20:09 +0000

The P-cores of Alder Lake don't support AVX-512 (implemented but disabled), either, unless you are using some early firmware. It's a pity that Intel completely disabled that, even in Xeon-E24xx CPUs where the E-cores are disabled. But don't worry, buy an AMD CPU with a Zen4 or Zen5 core, and you will get AVX-512.

Grope

atnot — Tue, 29 Oct 2024 10:13:37 +0000

I think it is very amusing how fast I flip from being a filthy degenerate that must be kept away from society for their debauchery to a funless prude as soon as I impinge on people's Sacred ability to make rape "jokes" and non-consensually touch people at conferences. Oh well.

Grope

LtWorf — Mon, 28 Oct 2024 14:33:14 +0000

Moralists never have fun, so hold a grudge against others who have fun instead.

Other BOLT weirdness

intelfx — Sun, 27 Oct 2024 06:59:56 +0000

> AVX-512 is stupid. It doesn't work on efficiency cores on Alder Lake

Perhaps it rather means that Alder Lake is stupid?

Other BOLT weirdness

Cyberax — Sun, 27 Oct 2024 05:57:27 +0000

> You can use runtime cpuid feature bit detection and identify the running CPU supports AVX512

AVX-512 is stupid. It doesn't work on efficiency cores on Alder Lake. Even though P-cores support it.

Other BOLT weirdness

kmeyer — Sun, 27 Oct 2024 04:35:42 +0000

There is some other BOLT behavior that derives from its use for HHVM: it can replace functions that use AVX512 intrinsics with traps. This is probably not useful for anyone aside from HHVM.

https://github.com/llvm/llvm-project/blob/7b88e7530d4329f...

Also, this is ... kind of insane, for library source code? You can use the compile-time feature detection support and identify the compiler target supports AVX512 (-mavx512 or whatever). You can use runtime cpuid feature bit detection and identify the running CPU supports AVX512. But if your binaries have been through BOLT with the -trap-avx512 flag, your AVX-accelerated function will just trap with a ud2 instruction.

If you find yourself needing to detect, at runtime, this BOLT bastardization of the binary, this ugly hack seems to work: https://github.com/facebook/folly/blob/d5e10f9d076838374f...
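For reference, the runtime detection step mentioned above might look roughly like this on Linux. This is only a sketch: it reads the flags the kernel exports in /proc/cpuinfo rather than executing cpuid itself, it assumes the x86 layout of that file (a "flags" line containing "avx512f"), and the helper names and the sample text are made up for illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def supports_avx512f(cpuinfo_text):
    """True if the AVX-512 Foundation flag is listed."""
    return "avx512f" in cpu_flags(cpuinfo_text)

def running_cpu_supports_avx512f(path="/proc/cpuinfo"):
    """Check the running CPU; returns False if the file cannot be read."""
    try:
        with open(path) as f:
            return supports_avx512f(f.read())
    except OSError:
        return False

# Hypothetical, heavily shortened sample of /proc/cpuinfo content.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f\n"
print(supports_avx512f(sample))
```

Note that a check like this only says what the CPU supports. It cannot tell whether a function in the binary has been replaced with a trap by -trap-avx512, which is exactly the problem described above.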

Intriguing

jd — Sat, 26 Oct 2024 12:35:24 +0000

It would seem like there is now quite an array of tools for optimising in various ways.

But one optimisation can potentially interact with another optimisation, and optimal binary reordering may be affected by compiler optimisation which may in turn be potentially affected by optimal binary reordering.

I'm trying to figure out from this article how, exactly, you get the most out of this.

I'd also be intrigued to know if this technique could be used effectively with the Verified Software Toolchain. VST is fine for producing provably correct binaries, but there are obvious drawbacks to this - there's not a whole lot of optimising you can do and still be certain the binaries are correct.

If you can greatly accelerate VST-produced binaries without impacting the proof of correctness in any way, I could imagine scenarios where this could actually be useful.

Grope

intelfx — Sat, 26 Oct 2024 10:01:57 +0000

There is nothing in the quote or in the original text that would suggest that any kind of _actual_ harassment, "physical assault", or, worse, "sexual assault" has taken place.

> I'd like to please stay as far away from you as possible.

The feeling is thus mutual.

Grope

atnot — Sat, 26 Oct 2024 09:22:24 +0000

Sorry, if your idea of "lighthearted fun" is getting "harassed" (direct quote) and physically assaulted into watching a presentation about how proud the author is about his sexual assault joke, I'd like to please stay as far away from you as possible.

Grope

intelfx — Sat, 26 Oct 2024 09:06:17 +0000

And the problem with these “uncomfortable” “yikes” “whatever this is”, besides people having apparently light-hearted fun, is exactly… what?

Grope

atnot — Sat, 26 Oct 2024 08:08:31 +0000

> Alan Cox has grabbed Miguel and forced him to sit down. The two of them are heading to the front. Apparently the harassment in the hallway had reached too high a level. No! He's escaped!
> "rope" is a pun on "cord" but then creates a great word combined with GNU

yeah this whole thing is just one yikes after another. jesus christ. As uncomfortable as I still am visiting events like this today, at least it's no longer... whatever this is.

Grope

willy — Sat, 26 Oct 2024 00:33:23 +0000

It only took 25 years to replace

https://lwn.net/1998/1029/als/rope.html

(Not sure why it never got released ...)

And, damn, that name and the "jokes" being made ... I think we're a bit better now.