
GNU C Library version 2.39

By Daroc Alden
February 6, 2024

The GNU C Library (glibc) released version 2.39 on January 31, including several new features. Notable highlights include new functions for spawning child processes, support for shadow stacks on x86_64, new security features, and the removal of libcrypt. The glibc maintainers had also hoped to include improvements to qsort(), which ended up not making it into this release. Glibc releases are made every six months.

Pidfd changes

The Linux kernel added support for pidfds in the 5.1 development cycle in 2019, with more support added in subsequent releases, but the facility still required some user-space support to be generally useful. Process IDs (PIDs) can be reused by different processes, so there can be races between the time a PID is obtained and the time it is used; if the PID is reused in that window, the wrong process will be operated on. Other objects on Unix-like systems are traditionally represented using files; pidfds extend this to processes, as file descriptors that refer to a specific process even if another process later reuses the same PID. The new release of glibc includes several additional functions for working with pidfds; LWN covered the creation of two new helper functions, pidfd_spawn() and pidfd_spawnp(), when the work was being introduced. These functions create a process and directly return its pidfd, avoiding a race condition in which a process could exit and be replaced by a different process before its parent can open a pidfd for it. These functions are only available on systems with the clone3() system call.

Those aren't the only new functions for working with pidfds, however. This release also adds posix_spawnattr_setcgroup_np() and posix_spawnattr_getcgroup_np(), which can be used to specify the control group that a spawned process will be placed in before the process is started. Setting a process's control group ahead of time avoids a race condition between starting the process and applying resource limits to it. These functions join the set of functions that modify the posix_spawnattr_t structure, which the posix_spawn() family uses to specify implementation-defined attributes of the process to be spawned.

    int posix_spawnattr_getcgroup_np(const posix_spawnattr_t *restrict attr,
                                     int *restrict cgroup);

    int posix_spawnattr_setcgroup_np(posix_spawnattr_t *attr,
                                     int cgroup);

Finally, programs that use the new pidfd interface, but that still need to determine the child process's PID for whatever reason, can use pidfd_getpid():

    pid_t pidfd_getpid(int fd);

Shadow stacks

Another new kernel feature that this release allows programmers to take advantage of is support for x86's control-flow enforcement technology (CET). CET includes two mechanisms: shadow stacks and indirect branch tracking. Both are aimed at reducing or eliminating return-oriented programming (ROP) attacks. Glibc already supported programs compiled with indirect branch tracking; this update focuses on shadow stacks, since setting up a shadow stack also requires the cooperation of the dynamic linker. Shadow stacks work by keeping a second copy of function return addresses in specially protected memory; a return address overwritten by an attack on the ordinary stack will no longer match the shadow copy, so the attacker cannot redirect the program's control flow.

The Linux kernel has allowed programs to opt into using shadow stacks since version 6.6, but there hasn't been a good way for them to take advantage of that capability without making direct system calls. This release of glibc includes an --enable-cet configure option that enables support for shadow stacks. When built with this flag, glibc will ask the kernel to allocate memory for the shadow stack and enable the protection during program startup.

Programs also need to be compiled with shadow-stack support to take advantage of the feature, however. The GNU Compiler Collection (GCC) and Clang both support compiling programs with shadow-stack support via the -fcf-protection flag. Some distributions have already enabled this flag by default (including Ubuntu and Fedora). Glibc's --enable-cet configure option changes the behavior of loading shared objects. When a process starts, glibc checks whether the binary supports CET. If it does, loading a shared object that was not compiled with CET support becomes an error. Glibc also supports --enable-cet=permissive, which silently disables the shadow stack when opening a non-shadow-stack-aware shared object.

PLT rewrite

One last proactive security feature in this release is the addition of support for rewriting the Procedure Linkage Table (PLT) on startup. A dynamically linked program cannot tell at compile time where any shared objects will be located in memory at run time. Therefore, when the program makes a call to a function from another shared object, that call goes via the PLT. On the first invocation of a given function in the PLT, the code there jumps to the dynamic linker, which looks up the actual address of the function and then rewrites the PLT entry to point to the function directly; subsequent calls do not need to involve the dynamic linker.

This release of glibc adds a "tunable" called glibc.cpu.plt_rewrite. When this option is enabled, the dynamic linker rewrites the PLT to use absolute jumps to the correct location on startup, instead of waiting for a particular function to be invoked. Once the PLT is rewritten, it is then remapped as read-only memory. This can slightly increase the startup time of the program, but it removes a potential avenue for attacks that seek to control the program: overwriting the PLT with maliciously chosen jump destinations. Glibc tunables can be set using the GLIBC_TUNABLES environment variable as follows: GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1.

qsort()

One change that did not make it into the release was a proposed improvement to qsort(). In October, Adhemerval Zanella authored a patch that swapped qsort()'s venerable merge sort implementation for one based on introsort instead. He noted that merge sort performs poorly on already-sorted or nearly-sorted arrays, a problem that does not appear in introsort. He also noted that the "mergesort implementation has some issues", going on to point out that merge sort requires auxiliary memory, meaning that qsort() can call malloc(). This demand for additional memory means that calling qsort() might introduce arbitrary delays while the system conjures enough memory for merge sort's auxiliary array.

This potential delay also makes the function not async-cancel safe. POSIX requires that programs in certain contexts (such as signal handlers, or after a new thread is created) call only async-cancel-safe functions, or risk undefined behavior. POSIX requires only a handful of functions to be async-cancel safe, but glibc maintains a list of which library functions are async-cancel safe in practice, to aid developers wishing to rely on details of glibc's implementation. In an earlier discussion on the mailing list, Zanella noted that he sees making glibc functions async-cancel safe where possible as a quality-of-life improvement for users.

Zanella ended up reverting the change at the beginning of January, saying that the removal of merge sort "had the side-effect of making sorting nonstable. Although neither POSIX nor C standard specify that qsort should be stable, it seems that it has become an instance of Hyrum's law where multiple programs expect it". Stable sorting algorithms such as merge sort keep two inputs that compare equal in the same order in the output, while unstable sorting algorithms make no such guarantee. Zanella went on to note that the existing implementation does fall back to heap sort if it is unable to allocate the needed additional memory. Heap sort is not a stable sort, so there is still a potential bug lurking in programs that expect qsort() to be stable, although it cannot be triggered on the vast majority of Linux systems. That is because nearly all Linux systems enable memory overcommit, a kernel feature that allows programs to allocate more anonymous memory than the system can actually back, on the theory that many programs don't use all the anonymous memory they request; the exception is some embedded systems that specifically turn it off. Since the heap-sort fallback is so rarely taken, finding the error in any programs that do trigger the problem may be difficult.
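
Any comparison sort, stable or not, can be made to behave stably by recording each element's original position and using it to break ties, which is essentially what glibc's merge sort gave callers for free. A minimal sketch of the trick, using qsort() itself and a hypothetical struct item type:

```c
#include <stdlib.h>

/* An element with a sort key plus a field recording its original
 * position, which serves as the tie-breaker. */
struct item {
    int key;
    int seq;
};

static int cmp_stable(const void *a, const void *b)
{
    const struct item *x = a, *y = b;

    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    /* Equal keys: fall back to the original position, so any
     * comparison sort produces a stable result. */
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Record original positions, then sort; equal keys keep their
 * relative order regardless of the algorithm qsort() uses. */
void sort_stably(struct item *arr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        arr[i].seq = (int)i;
    qsort(arr, n, sizeof *arr, cmp_stable);
}
```

The cost is the extra field per element (or a separate index table), which is exactly the kind of auxiliary memory that the qsort() rework was trying to avoid allocating internally; here the caller pays for it explicitly.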

Qualys reported a security vulnerability in the qsort() implementation in January, but Florian Weimer had actually already fixed it during some unrelated work the month before. The vulnerability affects all versions of glibc present in the project's Git repository, going back to at least 1989. Qualys's announcement only claims that the vulnerability was present from 1992 on, but there were no changes to the relevant code between the initial commits recorded in the Git history and the 1.04 release in 1992. The project's security team was quoted in Qualys's report as saying that the behavior required to trigger the vulnerability is "undefined according to POSIX and ISO C standards", but that they nonetheless "acknowledge that this is a quality of implementation issue and we fixed this in a recent refactor of qsort".

libcrypt

Another thing that this release does not contain is libcrypt, the library that provided the crypt() password-hashing function. The default build configuration of glibc already omitted libcrypt as of version 2.38, but users could opt in using the --enable-crypt configure flag, which has now been removed. Users of libcrypt should consider switching to libxcrypt, a replacement offering the same interface under a compatible license, but maintained separately from glibc. Libcrypt was the last remaining use of Mozilla's Network Security Services (NSS) cryptography library in the C library, allowing the project to drop that dependency. The removal of libcrypt is the end of a process that began in 2017, when Zack Weinberg suggested that libcrypt might best be moved to "a separate project that could move faster" in response to a request to add an OpenBSD-compatible bcrypt() function.

Conclusion

Version 2.39 does not include any groundbreaking new developments, which is to be expected from a stable project like the GNU C library. This release includes access to new kernel features and several security improvements, alongside the expected maintenance of the locale system and bug fixes. 68 contributors helped produce the new version, with 21 reviewing patches. Version 2.40 is expected in another six months, around the end of July.




GNU C Library version 2.39

Posted Feb 6, 2024 22:58 UTC (Tue) by bluca (subscriber, #118303) [Link]

> pidfd_spawn() and pidfd_spawnp()

Great stuff, already switched systemd to use this, finally we can fix some long-standing issues with cgroup placement when spawning services

https://github.com/systemd/systemd/commit/2e106312e2e8e8f...

PLT Rewriting

Posted Feb 6, 2024 23:03 UTC (Tue) by jrtc27 (subscriber, #107748) [Link] (10 responses)

Firstly, one key piece of information is missing from the PLT explanation (for most architectures in common use): the PLT entry isn't rewritten, the PLT *GOT* entry is. The PLT entry (code) loads the corresponding pointer from the PLT GOT (data) and performs an indirect jump to it (which, for x86, is one instruction, an indirect jump taking a memory operand). When using lazy binding, the run-time linker sets things up such that this indirect jump eventually ends up inside the run-time linker itself, which looks up the function that the binary was trying to call and replaces the PLT GOT entry with its address such that future calls of the PLT don't need to go into the run-time linker, and finally itself jumps to the function in question for this call. Exceptions to this include SPARC and some PowerPC ABIs, but x86 follows this standard model.

Now, as far as I'm aware, PLT rewriting isn't really a security feature, it's a performance one. Lazy binding can already be disabled in two ways:

1. Set the LD_BIND_NOW environment variable
2. Link with -z now

Both methods instruct the run-time linker to fully initialise the PLT GOT when loading the binary rather than deferring most of it until the first call. However, option 2 is the better one as, when combined with RELRO, it will make the PLT GOT read-only once the run-time linker has loaded the binary. Option 1, to the best of my knowledge, will incur the startup cost of the initialisation (but gain more deterministic function calls rather than penalising first use), without being able to make the PLT GOT read-only, because the region of memory in the program that can be marked as read-only once loaded is dictated by the static linker, though one could imagine an implementation that adds the PLT GOT to that list when LD_BIND_NOW is set, with the caveat that it likely won't be page sized and aligned, so the ragged ends would remain read-write.

What PLT rewriting aims to do is reduce the cost of using the PLT by getting rid of the load and indirect branch. With eager binding enabled you still have an indirect branch, but never exploit its dynamic potential, since the pointer loaded from the PLT GOT is initialised before the code is run and never changes. PLT rewriting replaces that load and indirect branch with a direct branch to the actual function. Removing the load reduces the memory footprint of the binary. Replacing the indirect branch also removes a hard-to-predict instruction, and one that is dependent on the load to have executed. This comes at the cost of not being able to share the PLT code pages of the executable or library in question with other processes that have the same object loaded, though, and does not work on systems that forbid mapping code as writable. Depending on the ISA it may also be difficult, or impossible, to do this in the general case, due to the range of operands available for direct branches. On x86 it's easy as you can have the full 64-bit absolute address in the instruction, but most architectures can't do that. You might still want to rewrite it to an indirect branch where the address is constructed directly by an instruction sequence rather than loaded from memory, which avoids the PLT GOT use, but that's rather more limited in benefit and likely quickly stops being an optimisation.

PLT Rewriting

Posted Feb 7, 2024 0:33 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (2 responses)

Wasn't the 64-bit absolute jump just announced as part of APX? I think glibc must rely on the PLT being within +/-2GiB of the GOT for the direct jump to work?

PLT Rewriting

Posted Feb 7, 2024 9:01 UTC (Wed) by maskray (subscriber, #112608) [Link] (1 responses)

I have some notes at https://maskray.me/blog/2021-09-19-all-about-procedure-li... (I frequently update articles after I've learned new things).

> otherwise, when GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2 is specified on APX processors, the indirect jump instruction is rewritten to jmpabs $target (64-bit absolute jump).

PLT Rewriting

Posted Feb 7, 2024 18:21 UTC (Wed) by jrtc27 (subscriber, #107748) [Link]

Ah, I had missed that detail and made an assumption there (knowing that MOV r64, imm64 exists, but forgetting that it's a special case). Thanks both for the correction and added information.

PLT Rewriting

Posted Feb 11, 2024 17:45 UTC (Sun) by anton (subscriber, #25547) [Link] (6 responses)

Replacing the indirect branch also removes a hard-to-predict instruction, and one that is dependent on the load to have executed.
Actually an indirect branch that always branches to the same target is easy to predict perfectly: predict that it jumps to the same target as the last time. Even the 1993-vintage Pentium has a branch target buffer that is up to that task. On a core with out-of-order execution (common on smartphones, universal on laptops, PCs and servers), a correctly predicted branch means that the load is just needed for verification and does not add to the latency of other operations.

I have seen slow indirect branches on the PPC970 and Itanium 2 (both from 2002), but I doubt that many users will use glibc-2.39 and PLT rewriting on these CPUs. And I think that these CPUs have no direct jumps that have enough reach for rewriting the PLT jumps as direct jumps, so even if PLT rewriting is used for these CPUs, the slow indirect branch will still strike.

PLT Rewriting

Posted Feb 12, 2024 12:50 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (3 responses)

Doesn't it take up space in the branch predictor table though that might be better spent on other branches? (Not saying that it isn't worth it, just trying to understand the costs better.)

PLT Rewriting

Posted Feb 12, 2024 13:54 UTC (Mon) by anton (subscriber, #25547) [Link] (2 responses)

I have measured indirect (and direct) branch/call speed, so what I wrote earlier is based on direct evidence. I have not made measurements on branch predictor capacity, so what I write below is just based on what I have read:

The hardware not only predicts indirect branches, it predicts unconditional direct branches as well. That's because, even with an uop cache that caches decoded instructions, you don't want to wait until the direct branch is fetched from the uop cache and processed by the branch unit in order to decide what instructions or uops to fetch next.

Instead, the hardware puts the direct branch into the BTB, too, where it will be predicted based on the address of the instruction, not the contents. So at least as far as the BTB is concerned, there is no difference between a direct and an indirect branch.

Modern cores also use an additional indirect branch predictor that keeps more history. It's possible that a perfectly predictable indirect branch occupies capacity there, or the hardware people may have optimized this relatively common case (PLTs, practically monomorphic method calls) for better utilization of the sophisticated branch predictor (some of the articles I read on chipsncheese gave me that idea).

In any case, if I got a penny for every unsubstantiated claim of a performance advantage from a compiler maintainer or compiler supremacy advocate, ... And even when the original claim is debunked, often someone feels the urge to bring up another possible, but equally unsubstantiated performance advantage. This is not about you or jrtc27 in particular, I just have been in too many of these kinds of discussions.

PLT Rewriting

Posted Feb 12, 2024 22:19 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (1 responses)

> I have measured indirect (and direct) branch/call speed, so what I wrote earlier is based on direct evidence. I have not made measurements on branch predictor capacity, so what I write below is just based on what I have read:

It's pretty easy to end up being bottlenecked by something other than what you're trying to measure if you're not careful.

> In any case, if I got a penny for every unsubstantiated claim of a performance advantage from a compiler maintainer or compiler supremacy advocate, ... And even when the original claim is debunked, often someone feels the urge to bring up another possible, but equally unsubstantiated performance advantage. This is not about you or jrtc27 in particular, I just have been in too many of these kinds of discussions.

FWIW, I've seen pretty clear production performance benefits from using -fno-plt, even though that "just" eliminates a single direct call by moving the load from the GOT inline into the callers.

PLT Rewriting

Posted Feb 14, 2024 14:06 UTC (Wed) by anton (subscriber, #25547) [Link]

Concerning the supposedly hard-to-predict and therefore slow indirect jump that always jumps to the same place, I would like to see how "bottlenecked by something other" explains the original threaded-code microbenchmark results on CPUs with BTBs (e.g., the Pentium, compare with the BTB-less 486) or the v2 results on recent CPUs, where, e.g., direct-threaded code achieves 1 indirect branch per cycle on Tiger Lake (and that despite the fact that only half of the executed indirect branches always jump to the same target in V2, while the other half varies between 5 targets; modern history-based indirect branch predictors predict that perfectly, too).

My feeling is that -fno-plt has better performance benefits than PLT rewriting, because the latter just replaces a perfectly predictable indirect jump with a direct jump (which is often equally expensive). But of course my feelings are just in the "unsubstantiated claims" department, and we need proper and reproducible measurements to get something better.

PLT Rewriting

Posted Feb 12, 2024 22:09 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (1 responses)

> On a core with out-of-order execution (common on smartphones, universal on laptops, PCs and servers), a correctly predicted branch means that the load is just needed for verification and does not add to the latency of other operations.

The branch predictor has limited size, the cold cache case is still important. And even if perfectly predicted, an indirect call still requires more uops, which can be a limiting factor (due to limit of the total number of uops being tracked and due to execution ports being busy).

According to https://uops.info/table.html on Alder-Lake-P a 32bit offset JMP has a throughput of ~2, whereas a jmp to a memory location has a throughput of ~1. And the indirect jump needs 2 execution units. It's similar on other architectures.

Would be interesting to measure the performance benefits of PLT rewriting (1 direct call, 1 direct jmp) vs the performance benefits of -fno-plt (1 indirect jump).

PLT Rewriting

Posted Feb 14, 2024 13:11 UTC (Wed) by anton (subscriber, #25547) [Link]

Is the cold case important? If it is important, why is it cold? I doubt that the cold case is important. Do you have any data to support your claim of the importance of cold cases?

Anyway, in the cold case, both variants will see two branch mispredictions (one for the cold call to the PLT, one for the cold jump in the PLT), but indeed, the memory-indirect jump will also have to load the memory value (probably incurring a D-cache miss) in order to be able to know where the target address is and the misprediction penalty will be larger. Both variants probably also incur I-cache misses for both the PLT cache line and for the target. The -fno-plt variant will have only one misprediction (with the likely D-cache miss and longer misprediction penalty) and only one I-cache miss. I would not be surprised if the -fno-plt variant performed better.

Concerning the throughput of JMP (Rel32) on Alder-Lake-P, the TP column actually contains the inverse throughput, so the 2.08 says that they measured 1 JMP every 2.08 cycles (compare with, e.g., ADD_01 (R64,R64) with an inverse throughput of 0.2, i.e., 5 register-register adds every cycle). The JMP number may be due to the way they measured it (only a limited number of taken jumps per I-cache or uop-cache line have BTB predictions and are fast, see the TP number of JMP Rel8, or compare the loop and unrolled inverse throughputs of JMP M64). In any case most microarchitectures support at most one taken branch/jump per cycle.

Concerning the additional load uop, please present a real-world example where this makes a difference.

GNU C Library version 2.39

Posted Feb 7, 2024 5:15 UTC (Wed) by dankamongmen (subscriber, #35141) [Link] (4 responses)

that qsort(3) news is horrifying and depressing, much like when i realized glibc sprintf(3) was taking internal locks twenty years ago (i believe this to no longer be the case).

GNU C Library version 2.39

Posted Feb 7, 2024 7:23 UTC (Wed) by smurf (subscriber, #17840) [Link] (3 responses)

It's quite easy to teach any sorting algorithm to be stable. Just allocate + populate an index table, then use that to disambiguate otherwise-equal comparisons.

The problem, of course, is the "allocate a table" part.

GNU C Library version 2.39

Posted Feb 7, 2024 7:35 UTC (Wed) by epa (subscriber, #39769) [Link] (2 responses)

What’s the best available stable sorting algorithm that has good worst-case performance and doesn’t need to allocate memory? Or is it proven that you have to “pick any two”?

GNU C Library version 2.39

Posted Feb 7, 2024 10:57 UTC (Wed) by Wol (subscriber, #4433) [Link]

I thought the shell sort was intrinsically stable, but thinking about it I can't see why it should be. It's a good general purpose sort though.

Cheers,
Wol

GNU C Library version 2.39

Posted Feb 7, 2024 18:27 UTC (Wed) by cytochrome (subscriber, #58718) [Link]

Wikipedia (https://en.wikipedia.org/wiki/Sorting_algorithm) has a good rundown of the computational complexity of various sorting algorithms. To my naïve eye, block sort (https://en.wikipedia.org/wiki/Block_sort) seems to satisfy many of the more apparent requirements. However, I have little knowledge of the variety of use cases for sorting via the GNU C Library and the potential issues with implementing and testing a new sorting algorithm within the library.

Who exactly are expecting a stable qsort?

Posted Feb 7, 2024 8:24 UTC (Wed) by makendo (guest, #168314) [Link] (4 responses)

> the removal of merge sort "had the side-effect of making sorting
> nonstable. Although neither POSIX nor C standard specify that qsort
> should be stable, it seems that it has become an instance of Hyrum's
> law where multiple programs expect it".

I am curious. Exactly which programs are expecting qsort to be stable?
musl, from in-code comments, uses smoothsort; a Stack Overflow answer
and an unreliably cited Wikipedia statement state that it isn't a
stable sort. Expecting qsort to be stable pretty much locks code
to glibc or gnulib, if anything.

Who exactly are expecting a stable qsort?

Posted Feb 7, 2024 9:30 UTC (Wed) by atnot (subscriber, #124910) [Link]

You would be surprised how many tools break with anything besides glibc, relying on stable sorts is really a minor issue in the grand scheme of things there.

Who exactly are expecting a stable qsort?

Posted Feb 7, 2024 11:53 UTC (Wed) by error27 (subscriber, #8346) [Link] (2 responses)

It's annoying when sorted lists are different. I sometimes do:

sort foo > one
sort bar > two
diff -u one two

And the diff doesn't show the parts which have changed but instead just random differences from sort.

Who exactly are expecting a stable qsort?

Posted Feb 7, 2024 21:00 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (1 responses)

Do your foo and bar files often have differences so subtle that while you care about them sort thinks they're equivalent?

In software it's not uncommon that we're sorting things whose identity matters, but I don't see that in sorting lines of text - one identical line of text is the same as another.

Who exactly are expecting a stable qsort?

Posted Feb 12, 2024 17:34 UTC (Mon) by draco (subscriber, #1792) [Link]

Case insensitive sort would be the most common example

GNU C Library version 2.39

Posted Feb 7, 2024 11:25 UTC (Wed) by meven-collabora (subscriber, #168883) [Link] (11 responses)

A feature that I am glad landed is cachestat https://www.phoronix.com/news/Linux-6.5-MM-cachestat that glibc 2.39 now exposes, https://sourceware.org/git/?p=glibc.git;a=commit;h=72511f...

Strangely not even included in the release notes.

In particular, for a GUI file-copy utility writing to a USB stick, we want to report the actual write to the media rather than just the writing into kernel dirty pages.

This allows fixing https://bugs.kde.org/show_bug.cgi?id=281270 and other similar potential data loss for users.

Other solutions, such as using the O_DIRECT | O_SYNC flags, couldn't realistically be used.

If the same info could be provided through io_uring that'd be great for our use case, allowing to use less syscall and more asynchronous code, letting the kernel do its magic.

GNU C Library version 2.39

Posted Feb 7, 2024 11:41 UTC (Wed) by paulj (subscriber, #341) [Link] (10 responses)

The reference to "Workload-aware writeback pacing" as a use-case is interesting. In the network world we call this "congestion control". Workload-pacing for storage writes, thankfully, shouldn't have to worry about lost writes though!

GNU C Library version 2.39

Posted Feb 7, 2024 12:34 UTC (Wed) by farnz (subscriber, #17727) [Link] (9 responses)

There's a related issue that distros don't currently set per-bdi writeback limits "sensibly" (and indeed, "sensible" is in the eyes of the beholder here). If the user expects to remove a device, then the per-bdi writeback limits should be set so that the kernel won't buffer more than a second or so worth of writeback, accepting that the consequence is that all operations on the device slow down as soon as there's a small amount of dirty data to write, while for devices intended to be permanently connected, the writeback limits should be large so that the kernel can delay writes for longer, and only pays the penalty of delaying if you have a large amount of data pending writeback at shutdown time.

The kernel can't do this itself, because the policy about "permanent" or "removeable" isn't known to the kernel; if you have a USB SSD attached as the main drive for a Raspberry Pi, that's "permanent", and a large writeback limit is reasonable. If you have a USB SSD plugged into that same RPi so that you can copy data to it and then unplug it to move to another location, that's "removeable", and you want the writeback limit to be small.

GNU C Library version 2.39

Posted Feb 7, 2024 13:04 UTC (Wed) by paulj (subscriber, #341) [Link] (6 responses)

This sounds mildly similar to the problem in the networking world of "buffer bloat". Network devices, and the transmitting host particularly, buffer up writes to network (a.k.a packets) with the expectation writes will be bursty. The buffering allows the bursts to be smoothed out and the links kept busy even when the sender isn't actively writing. The problem of course is when the sender is /not/ bursty, and has a large amount of data to send; there is now an intrinsic delay that need not be there. Plus the problem that this buffering makes it harder for senders to accurately measure the latency - necessary for good workload-pacing / congestion control.

Interesting parallels. ;)

GNU C Library version 2.39

Posted Feb 7, 2024 14:00 UTC (Wed) by farnz (subscriber, #17727) [Link] (5 responses)

The difference is that a networking host doesn't have any way to determine the path capacity - indeed, it can change significantly over time. A storage host does usually have a way to determine the device capacity; we can make good guesses at the number of IOPS the device can do, and at how large each IOP can be before it reduces the number of IOPS we can handle.

Also, we have the weirdness that for some devices, we want the buffer to be bloated, and for others, we don't; /, /home and other internal filesystems on my laptop can have a very bloated buffer, since there's no practical cost to it, but there is a potential gain from a huge buffer (turning lots of small operations into a smaller number of big operations).

GNU C Library version 2.39

Posted Feb 7, 2024 15:30 UTC (Wed) by paulj (subscriber, #341) [Link] (2 responses)

True, though, the analogous entity in the storage case is more the user process - not the host and its system software. The user process doesn't have good knowledge of the storage throughput either. Just as in the networking case, it has to write, time to some event, and measure.

One difference you're raising there is that in the storage case, you have what the networking world would call "content addressable networking". I.e., the process specifies /what/ content to read and write, thus allowing the system (a tiny distributed system, in a sense) to offer caching (inc. write caching) at various levels. This is something the networking world generally lacks, sadly (?). In networking the reads/writes are generally intimately tied to the location of the data. Caching is thus minimal, and we have to build very complicated systems to virtualise the location of the data /somewhat/ (within the scope of that complicated system).

Of course, the single-system storage model morphs into that same problem once it exceeds the capacity of the highly-cohesive, coherent single-system model, as the answer will involve introducing much less coherent technologies, i.e. networking. ;)

GNU C Library version 2.39

Posted Feb 7, 2024 16:32 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

The other important difference is that in the storage case, we rarely care about the effects of congestion on shared links. Either we can afford to wait when the in-memory kernel cache flushes out to the device (the internal drive case), or we want to keep the cache small compared to the speed of the device so that it's quick to flush when needed (the removable drive case), and we very rarely have links slower than the devices they're connecting (even under congestion).

GNU C Library version 2.39

Posted Feb 7, 2024 16:55 UTC (Wed) by paulj (subscriber, #341) [Link]

Well, you have reliable links, so you don't have to worry about over-loading the links and causing loss in transmission. However, you do still want some system that is able to pace each distinct user's traffic, and keep things fair between those users. In networking - at least in the classic "dumb packet switching network" that the likes of TCP/IP notionally run on - that ends up being a distributed problem. In the coherent single system, a central scheduler can arbitrate and enforce.

But that was what my first comment was pointing at: the noted use-case of /end process/ workload pacing would start to introduce some of that functionality into the user process (which is equivalent to the "end host"). ;) Who knows where that leads in the future. ;)

Maybe at some point the coherent single-system becomes more of an explicit distributed system. (It already is a distributed system, but hides it very well; HyperTransport, PCIe, etc., are all at least packet based, but non-blocking and [very near] perfectly reliable - making the presentation of a very coherent system much easier than with networking).

GNU C Library version 2.39

Posted Feb 15, 2024 11:32 UTC (Thu) by tanriol (guest, #131322) [Link] (1 responses)

> The difference is that a networking host doesn't have any way to determine the path capacity - indeed, it can change significantly over time. A storage host does usually have a way to determine the device capacity; we can make good guesses at the number of IOPS the device can do, and at how large each IOP can be before it reduces the number of IOPS we can handle.

And then the RAM cache of the SSD fills up and the available bandwidth drops. And then the SLC cache area of the SSD fills up and it drops again.

GNU C Library version 2.39

Posted Feb 15, 2024 11:55 UTC (Thu) by Wol (subscriber, #4433) [Link]

And then you discover it's not an SSD, it's a shingled rotating rust drive ... :-)

Cheers,
Wol

GNU C Library version 2.39

Posted Feb 8, 2024 4:40 UTC (Thu) by intelfx (subscriber, #130118) [Link] (1 responses)

Last time I checked, per-bdi writeback limits were totally nonfunctional even when set manually. In the kernel, I think there even were some references to auto-tuning these limits, but that behavior is also nowhere to be seen.

Could you please clarify how I make them work? :-)

GNU C Library version 2.39

Posted Feb 8, 2024 10:28 UTC (Thu) by farnz (subscriber, #17727) [Link]

Tested on 6.6.6 and 6.7.3 kernels, on a Fedora 39 system. I have a slow USB device as /dev/sda, which is therefore bdi 8:0. To restrict it to 1 second of dirty data, I need to run two commands as root:

  1. echo 1 > /sys/class/bdi/8:0/strict_limit
  2. echo $((4 * 1024 * 1024)) > /sys/class/bdi/8:0/max_bytes

Per the documentation for bdi limits, the first command tells the kernel that this device's limits must be respected even if the global background dirty limit isn't reached; my system has a global background dirty limit around 8 GiB, so without this, any limit I set below 8 GiB is ignored.

The second sets the actual limit - in this case, 1 second of writes to the device, which is 4 MiB of data. You can see why strict limits matter here, though - without strict limits, the global background dirty limit would push me to 8 GiB of data before the per-bdi limits were even checked, unless I had a lot of writes in progress to other devices. And I want relatively large global limits, since my laptop has a big battery, and when I build a Yocto image, I've often got many readers in parallel that need time on the NVMe drive, along with writes that can be delayed, and I'd prefer the writes to wait so that I'm blocking on CPU, not on I/O.
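The two commands above can be wrapped in a small sketch that derives the byte limit from an assumed device throughput; the bdi number, throughput figure, and tolerated delay here are assumptions to adapt, not universal values:

```shell
#!/bin/sh
# Sketch: compute and apply a per-bdi dirty-data limit of roughly one
# second of writes. Assumptions: device is bdi 8:0 and sustains ~4 MiB/s.
BDI=8:0                 # the device's major:minor (assumption - adapt)
THROUGHPUT_MIB=4        # measured sustained throughput in MiB/s (assumption)
DELAY_SECS=1            # how many seconds of dirty data to tolerate
LIMIT=$((THROUGHPUT_MIB * 1024 * 1024 * DELAY_SECS))
echo "$LIMIT"           # the byte limit to apply: 4194304
# As root, apply it via the kernel's per-bdi sysfs knobs:
# echo 1        > "/sys/class/bdi/$BDI/strict_limit"
# echo "$LIMIT" > "/sys/class/bdi/$BDI/max_bytes"
```

The sysfs writes are left commented out since they need root and a real slow device; the arithmetic is the only part that runs unprivileged.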

Interposers and rewriting the Procedure Linkage Table (PLT) o

Posted Feb 8, 2024 0:40 UTC (Thu) by davecb (subscriber, #1574) [Link] (3 responses)

In a previous life, I merrily interposed my functions between callers and libraries with LD_PRELOAD. For example, I built a library profiler that way (https://github.com/davecb/libprof/blob/master/libprof.1)

I'm rather hoping that the rewrite of the PLT to use absolute jumps to the correct location on startup honours LD_PRELOAD=./libprof.so.1 <program>, which is available on startup.

Interposers and rewriting the Procedure Linkage Table (PLT) o

Posted Feb 8, 2024 1:26 UTC (Thu) by jrtc27 (subscriber, #107748) [Link] (2 responses)

It does not change symbol preemption rules, only what happens once it's figured out what the symbol is.

Interposers and rewriting the Procedure Linkage Table (PLT) o

Posted Feb 8, 2024 9:57 UTC (Thu) by Wol (subscriber, #4433) [Link]

Sounds a bit like the Pr1me linker from the 80s! I don't remember the details, but when a subroutine was called for the first time it went through a handler that replaced the call site with a direct jump. That feels like I'm describing Unix ld as well ...

I feel like I'm stating the obvious here - probably missed something important :-)

Cheers,
Wol

Interposers and rewriting the Procedure Linkage Table (PLT) o

Posted Feb 8, 2024 12:59 UTC (Thu) by davecb (subscriber, #1574) [Link]

Super, thanks!


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds