Leading items
Welcome to the LWN.net Weekly Edition for February 8, 2024
This edition contains the following feature content:
- GNU C Library version 2.39: the most significant new features in this glibc release.
- The hard life of a virtual-filesystem developer: writing a filesystem is never easy, but creating a virtual filesystem in Linux is perhaps harder than it needs to be.
- The end of tasklets: there may finally be a plan to remove an old and unloved kernel deferred-work mechanism.
- Zig 2024 roadmap: what's coming in the Zig programming language.
- So you think you understand IP fragmentation?: predicting when IP packets will be fragmented in transit is not as easy as it seems.
- Please welcome Joe Brockmeier to LWN: another writer/editor has been brought into the LWN fold.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
GNU C Library version 2.39
The GNU C Library (glibc) released version 2.39 on January 31, including several new features. Notable highlights include new functions for spawning child processes, support for shadow stacks on x86_64, new security features, and the removal of libcrypt. The glibc maintainers had also hoped to include improvements to qsort(), which ended up not making it into this release. Glibc releases are made every six months.
Pidfd changes
The Linux kernel added support for pidfds in the 5.1 development cycle in 2019, with more support following later that year; the facility still required user-space support to be generally useful. Process IDs (PIDs) can be reused by different processes, so there can be races between the time a PID is obtained and the time it is used; if the PID is reused in that window, the wrong process will be operated on. Pidfds follow the Unix tradition of representing objects as files: a pidfd is a file descriptor that refers to one specific process, even if another process later reuses the same PID. The new release of glibc includes several additional functions for working with pidfds; LWN covered the creation of two new helper functions, pidfd_spawn() and pidfd_spawnp(), when the work was being introduced. These functions create a process and directly return its pidfd, avoiding a potential race condition where a process could exit and be replaced by a different process before its parent can open a pidfd for it. These functions are only available on systems with the clone3() system call.
Those aren't the only new functions for working with pidfds, however. This release also adds posix_spawnattr_setcgroup_np() and posix_spawnattr_getcgroup_np(), which can be used to set which control group a spawned process will be part of before the process is started. Setting a process's control group ahead of time avoids a race condition between starting the process and applying resource limits to it. These functions join the existing set of functions that modify the posix_spawnattr_t structure, which the posix_spawn() family of functions uses to specify implementation-defined attributes of the process to be spawned.
    int posix_spawnattr_getcgroup_np(const posix_spawnattr_t *restrict attr,
                                     int *restrict cgroup);
    int posix_spawnattr_setcgroup_np(posix_spawnattr_t *attr,
                                     int cgroup);
Finally, programs that use the new pidfd interface, but that still need to determine the child process's PID for whatever reason, can use pidfd_getpid():
pid_t pidfd_getpid(int fd);
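Putting those pieces together, here is a minimal sketch (untested, and assuming a glibc 2.39 system with clone3() support) of spawning a child into a control group and waiting for it through its pidfd. The cgroup path is illustrative, and the use of the POSIX_SPAWN_SETCGROUP flag reflects my reading of the release notes:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <poll.h>
    #include <spawn.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/pidfd.h>   /* pidfd_getpid(), assumed to live here */
    #include <sys/wait.h>

    extern char **environ;

    int main(void)
    {
        posix_spawnattr_t attr;
        posix_spawnattr_init(&attr);

        /* Place the child in a cgroup before it starts running,
         * closing the race between spawning it and applying
         * resource limits. The path here is only an example. */
        int cgfd = open("/sys/fs/cgroup/mygroup", O_RDONLY | O_DIRECTORY);
        if (cgfd >= 0) {
            posix_spawnattr_setcgroup_np(&attr, cgfd);
            /* Assumption: the flag must be set for the attribute
             * to take effect, as with other posix_spawn attributes. */
            posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETCGROUP);
        }

        char *argv[] = { "sleep", "2", NULL };
        int pidfd;
        if (pidfd_spawnp(&pidfd, "sleep", NULL, &attr, argv, environ)) {
            perror("pidfd_spawnp");
            return EXIT_FAILURE;
        }

        /* A pidfd polls readable once the child has exited, so it
         * can sit in a poll()/epoll loop with other descriptors. */
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        poll(&pfd, 1, -1);

        /* waitid() accepts a pidfd via the P_PIDFD idtype. */
        siginfo_t info;
        waitid(P_PIDFD, pidfd, &info, WEXITED);
        printf("child %d exited with status %d\n",
               (int) pidfd_getpid(pidfd), info.si_status);
        return EXIT_SUCCESS;
    }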
Shadow stacks
Another new kernel feature that this release allows programmers to take advantage of is support for x86's control-flow enforcement technology (CET). There are two technologies included in CET: shadow stacks and indirect branch tracking. Both are aimed at reducing or eliminating return-oriented programming (ROP) attacks. Glibc already supported programs compiled with support for indirect branch tracking. This glibc update is focused on adding support for shadow stacks, since setting up a shadow stack also requires the cooperation of the dynamic linker. Shadow stacks work by keeping a second copy of function return addresses in specially protected memory, so that an attack that overwrites the stack cannot take control of the program.
The Linux kernel has allowed programs to opt into using shadow stacks since version 6.6, but there hasn't been a good way for them to take advantage of that capability without making direct system calls. This release of glibc includes an --enable-cet configure option that enables support for shadow stacks. When built with this flag, glibc will ask the kernel to allocate memory for the shadow stack and enable the protection during program startup.
Programs also need to be compiled with shadow-stack support to take advantage of the feature, however. The GNU Compiler Collection (GCC) and Clang both support compiling programs with shadow-stack support via the -fcf-protection flag. Some distributions have already enabled this flag by default (including Ubuntu and Fedora). Glibc's --enable-cet configure option changes the behavior of loading shared objects. When a process starts, glibc checks whether the binary supports CET. If it does, loading a shared object that was not compiled with CET support becomes an error. Glibc also supports --enable-cet=permissive, which silently disables the shadow stack when opening a non-shadow-stack-aware shared object.
PLT rewrite
One last proactive security feature in this release is the addition of support for rewriting the Procedure Linkage Table (PLT) on startup. A dynamically linked program cannot tell at compile time what the location of any shared objects in memory will be at run time. Therefore, when the program makes a call to a function from another shared object, that call goes via the PLT. On the first invocation of a given function in the PLT, the code there jumps to the dynamic linker, which searches to find the right offset for the function and then rewrites the PLT entry to point to the function directly; subsequent calls do not need to involve the dynamic linker.
This release of glibc adds a "tunable" called glibc.cpu.plt_rewrite. When this option is enabled, the dynamic linker rewrites the PLT to use absolute jumps to the correct location on startup, instead of waiting for a particular function to be invoked. Once the PLT is rewritten, it is then remapped as read-only memory. This can slightly increase the startup time of the program, but it removes a potential avenue for attacks that seek to control the program: overwriting the PLT with maliciously chosen jump destinations. Glibc tunables can be set using the GLIBC_TUNABLES environment variable as follows: GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1.
qsort()
One change that did not make it into the release was a proposed improvement to qsort(). In October, Adhemerval Zanella authored a patch that swapped qsort()'s venerable merge sort implementation for one based on introsort instead. He noted that merge sort performs poorly on already-sorted or nearly-sorted arrays, a problem that does not appear in introsort. He also noted that the "mergesort implementation has some issues", going on to point out that merge sort requires auxiliary memory, meaning that qsort() can call malloc(). This demand for additional memory means that calling qsort() might introduce arbitrary delays while the system conjures enough memory for merge sort's auxiliary array.
This potential delay also makes the function not async-cancel safe. POSIX requires that programs in certain contexts (such as signal handlers, or after a new thread is created) call only async-cancel-safe functions, or risk undefined behavior. POSIX only requires a handful of functions be async-cancel safe, but glibc maintains a list of which library functions are async-cancel safe in practice, to aid developers wishing to rely on details of glibc's implementation. In an earlier discussion on the mailing list, Zanella noted that he sees making glibc functions async-cancel safe where possible as a quality-of-life improvement for users.
Zanella ended up reverting the change at the beginning of January, saying that the removal of merge sort "had the side-effect of making sorting nonstable. Although neither POSIX nor C standard specify that qsort should be stable, it seems that it has become an instance of Hyrum's law where multiple programs expect it". Stable sorting algorithms such as merge sort keep two inputs that compare equal in the same order in the output, while unstable sorting algorithms don't make that guarantee.
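As a contrived illustration of what stability means in practice, consider this short C program; qsort() makes no promise about the relative order of the two records with equal keys, while a stable sort would have to preserve it:

    #include <stdio.h>
    #include <stdlib.h>

    struct rec { int key; const char *tag; };

    /* Compare only on key, so the two key==1 records are "equal". */
    static int cmp(const void *a, const void *b)
    {
        const struct rec *ra = a, *rb = b;
        return (ra->key > rb->key) - (ra->key < rb->key);
    }

    int main(void)
    {
        struct rec recs[] = {
            { 1, "first" }, { 0, "zero" }, { 1, "second" },
        };
        qsort(recs, 3, sizeof(recs[0]), cmp);
        /* A stable sort must print "first" before "second"; an
         * unstable algorithm such as heap sort may swap them. */
        for (int i = 0; i < 3; i++)
            printf("%d %s\n", recs[i].key, recs[i].tag);
        return 0;
    }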
Zanella went on to note that the existing implementation does fall back to a heap-sort algorithm if it is unable to allocate the needed additional memory. Heap sort is not a stable sort, so there is still a potential bug lurking in programs that expect qsort() to be stable, although it can't be triggered on the vast majority of Linux systems. Most Linux systems use memory overcommit: a feature of the kernel which allows programs to allocate more anonymous memory than the system can support in total, on the theory that many programs don't use all the anonymous memory they request. With overcommit enabled, the allocation in qsort() essentially never fails, so the unstable heap-sort fallback is almost never exercised; the exception is some embedded systems that specifically turn overcommit off. That rarity may make finding the error in any programs that do trigger this problem somewhat difficult.
Qualys reported a security vulnerability in the qsort() implementation in January, but Florian Weimer had actually already fixed it during some unrelated work the month before. The vulnerability affects all versions of glibc present in the project's Git repository, going back to at least 1989. In the announcement, Qualys only claims that the vulnerability is present as of 1992, but there were no changes to the relevant code between the initial commits recorded in the Git history and the 1.04 release in 1992. The project's security team was quoted in Qualys's report as saying that the behavior required to trigger the vulnerability is "undefined according to POSIX and ISO C standards", and that they nonetheless "acknowledge that this is a quality of implementation issue and we fixed this in a recent refactor of qsort".
libcrypt
Another thing that this release does not contain is libcrypt, the library that
provided the
crypt() password-hashing function.
The default build configuration of glibc already
omitted libcrypt as of version 2.38, but users could opt-in using the
--enable-crypt configure flag, which is now removed.
Users of libcrypt should consider switching to
libxcrypt, a replacement
offering the same interface under a compatible license, but maintained
separately from glibc.
Libcrypt was the last remaining use of Mozilla's
Network Security Services (NSS)
cryptography library in the C library, allowing the project to drop that
dependency.
The removal of libcrypt is the end of a process that began in 2017, when Zack
Weinberg
suggested that libcrypt might best be moved to "a separate project that
could move faster
" in response to a request to add an OpenBSD-compatible
bcrypt() function.
Conclusion
Version 2.39 does not include any groundbreaking new developments, which is to be expected from a stable project like the GNU C library. This release includes access to new kernel features and several security improvements, alongside the expected maintenance of the locale system and bug fixes. 68 contributors helped produce the new version, with 21 reviewing patches. Version 2.40 is expected in another six months, around the end of July.
The hard life of a virtual-filesystem developer
Filesystem development is not an easy task; the performance demands are typically high, and the consequences for mistakes usually involve lost data and irate users. The implementation of a virtual (or "pseudo") filesystem — a filesystem implemented within the kernel and lacking a normal backing store — can also be challenging, but for different reasons. A series of conversations around the eventfs virtual filesystem has turned a spotlight on the difficulty of creating a virtual filesystem for Linux.
The longstanding "tracefs" virtual filesystem provides access to the ftrace tracing system; among other things, it implements a directory with a set of control files for every tracepoint known to the system. A quick look on a 6.6 kernel shows over 2,800 directories containing over 16,000 files. Until the 6.6 release, the kernel had to maintain directory entry ("dentry") and inode structures for each of those directories and files; all of those structures consumed quite a bit of memory.
For added fun, multiple instances of the tracepoint hierarchy can be mounted, with each one causing the kernel to duplicate all of that memory overhead. Even with a single tracepoint hierarchy, the chances are that almost none of the files contained within it will be accessed over the life of the system, so the memory is simply wasted.
Eventfs was merged in 6.6 as a way of eliminating this waste. It is a reimplementation of the portion of tracefs that represents the actual tracepoints, but optimized so that dentries and inodes are only allocated when a file is actually accessed. Vast amounts of memory were returned to the system for better use, and there was widespread rejoicing.
That rejoicing would have been more enthusiastic, though, had not a series of bugs, some with security implications, turned up in eventfs. This filesystem has required a long series of fixes — a process that is ongoing as of this writing. As all this has unfolded, there has been an extensive series of long threads between tracing maintainer Steven Rostedt, Linus Torvalds, and others. Among other things, there have been discussions on the size reported for virtual files, whether those files should have unique inode numbers, and many conversations on the details of interfacing with the kernel's virtual filesystem (VFS) layer. In the end, Torvalds created a patch series addressing a number of problems in eventfs.
For those wanting rather more sensational coverage than is LWN's habit, now might be a good time to search out the articles published elsewhere. The focus here will be on two points that came out in the discussions.
One of those is that documentation for would-be filesystem developers is lacking. Some, including VFS maintainer Christian Brauner, would disagree with that claim. There is, indeed, a fair amount of VFS documentation, including detailed descriptions of the VFS locking rules, which are some of the most complex in the kernel. But the number of things that Rostedt, who is not an inexperienced kernel developer, stumbled over during the course of this work makes it clear that many things remain undocumented. That is perhaps especially true for a developer wanting to implement a virtual filesystem, which tends to be a one-time project entered into by a developer whose focus is on another part of the kernel.
Consider, for example, the subsystem known as "kernfs". It is a framework designed to ease the implementation of virtual filesystems; it is currently used to implement control groups and the resctrl filesystem. It seems like exactly what a developer of a virtual filesystem would need, except for one little problem: it is meticulously undocumented. No attempt has been made to describe its use; as a result, when Rostedt considered it, he concluded: "kernfs doesn't look trivial and I can't find any documentation on how to use it" and passed it by.
Perhaps, had kernfs been more accessible when eventfs was developed, it would have been found suitable to the task and would have helped to prevent the long series of mistakes that plagued eventfs. Perhaps, if the missing documentation were to be provided, the next virtual filesystem project could have an easier time of it.
There is another problem, though, that was nicely spelled out by Torvalds: the VFS layer is oriented toward the needs of "real" filesystems, those that are charged with the task of persistently storing data in a hierarchical directory structure. As a result, it has a lot of performance-driven quirks that are not only unhelpful for virtual filesystems, they also complicate the task of implementing those filesystems. To take it even further, though, the whole filesystem concept is a bit of an awkward fit for virtual filesystems:
And realize that [they] aren't really designed to be filesystems per se - they are literally designed to be something entirely different, and the filesystem interface is then only a secondary thing - it's a window into a strange non-filesystem world where normal filesystem operations don't even exist, even if sometimes there can be some kind of convoluted transformation for them.
That results in pathologies like even simple filesystem operations (stat() on a /proc file, for example) not working properly in virtual filesystems. In a normal filesystem, the lifetime of the files themselves is directly tied to filesystem operations. The objects represented in a virtual filesystem, instead, have unrelated lifetimes of their own. The combination of two separate worlds, Torvalds said, is "why virtual filesystems are generally a complete mess".
So how does one improve on this situation? One approach would be to abandon the idea of a virtual filesystem entirely, saying that the filesystem abstraction is simply not suitable for this kind of kernel ABI. Arguably, that is what the networking subsystem (along with some others) has done by adopting netlink for complex interfaces. Netlink works well for many things, but it is not a universally popular interface. An older variant of this approach, of course, is to simply provide a set of ioctl() calls. Use of ioctl() is somewhat discouraged, though; it tends to produce widely varying interfaces that see little review before being merged into the kernel. Yet another approach is the addition of new system calls, as was done by the VFS layer itself with the listmount() and statmount() system calls that were merged for the 6.8 release.
In the end, though, there is value to a filesystem-oriented interface. It is familiar to users, scriptable, and relatively well defined. If everything is a file, then utilities written to work with files can be brought to bear. That is why virtual filesystems have proliferated over the years; it suggests that there would be value in making it easier for developers to correctly implement virtual filesystems. That, in turn, indicates that putting some effort into APIs like kernfs and, crucially, documenting them could do a lot to make life less difficult for the next developer who takes on a virtual filesystem project.
The end of tasklets
A common problem in kernel development is controlling when a specific task should be done. Kernel code often executes in contexts where some actions (sleeping, for example, or calling into filesystems) are not possible. Other actions, while possible, may prevent the kernel from taking care of a more important task in a timely manner. The kernel community has developed a number of deferred-execution mechanisms designed to ensure that every task is handled at the right time. One of those mechanisms, tasklets, has been eyed for removal for years; that removal might just happen in the near future.
One context where deferred execution is often needed is interrupt handlers. An interrupt diverts a CPU from whatever it was doing at the time and must be handled as quickly as possible; sleeping in an interrupt handler is not even remotely an option. So interrupt handlers typically just make a note of what needs to be done, then arrange for the actual work to be done in a more forgiving context. There are several options for this deferral:
- Threaded interrupt handlers. This mechanism, which originated in the realtime tree, was merged into the 2.6.30 release in 2009; it causes the bulk of a driver's interrupt handler to be run in a separate kernel thread. Threaded handlers, since they are running in process context, are allowed to sleep; the system administrator can also adjust their priority if need be.
- Workqueues were first added during the 2.5 development series and have been extensively enhanced since then. A driver can create a work item, containing a function pointer and some data, and submit it to a workqueue. At some future time, that function will be called with the provided data; again, this call will happen in process context. There are various types of workqueues with different performance characteristics, and subsystems can create their own private workqueues if need be.
- Software interrupts (or "bottom halves"). This mechanism is among the oldest in the kernel; it takes its inspiration from earlier Unix systems. A software interrupt is a dedicated handler that runs, usually immediately after completion of a hardware interrupt or before a return to user space, in atomic context. There has been a desire to remove this mechanism for years, since it can create surprising latencies in the kernel, but it persists; adding a new (direct) user of software interrupts would encounter significant opposition. See this article for more information on software interrupts.
- Tasklets. Like workqueues, tasklets are a way to arrange for a function to be called at a future time. In this case, though, the tasklet function is called from a software interrupt, and it runs in atomic context. Tasklets have been around since the 2.3 development series; they, too, have been on the chopping block for many years, but no such effort has succeeded to date.
Threaded interrupt handlers and workqueues are seen as the preferred mechanisms for deferred work in modern kernel code, but the other APIs have proved hard to phase out. Tasklets, in particular, remain because they offer lower latency than workqueues, which, since they must go through the CPU scheduler, can take longer to execute a deferred-work item.
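For readers who have not used the workqueue API, here is a minimal sketch (with hypothetical driver names) of the usual pattern: the interrupt handler notes that work is pending and defers the rest to process context:

    #include <linux/interrupt.h>
    #include <linux/workqueue.h>

    static void my_deferred_fn(struct work_struct *work)
    {
        /* Process context: sleeping, allocation, and filesystem
         * access are all allowed here. */
    }
    static DECLARE_WORK(my_deferred, my_deferred_fn);

    static irqreturn_t my_irq_handler(int irq, void *dev)
    {
        /* Note what happened, then hand the real work off to the
         * system workqueue. */
        schedule_work(&my_deferred);
        return IRQ_HANDLED;
    }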
Mikulas Patocka recently encountered a problem with the tasklet API. A tasklet is defined by struct tasklet_struct, which contains the address of the callback function and related information. The tasklet subsystem needs to be able to manipulate that structure, and may do so after the tasklet function has completed its execution and returned. This can be a problem if the tasklet function itself wants to free that structure, as might happen for a one-shot tasklet that will not be called again. The tasklet subsystem could end up writing to a structure that has been freed and allocated for another use, with predictably unpleasant consequences.
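To make the problem concrete, here is a sketch (hypothetical names, not Patocka's actual code) of the pattern that gets into trouble:

    #include <linux/interrupt.h>
    #include <linux/slab.h>

    struct my_ctx {
        struct tasklet_struct tasklet;
        /* ... device state ... */
    };

    static void oneshot_fn(struct tasklet_struct *t)
    {
        struct my_ctx *ctx = from_tasklet(ctx, t, tasklet);

        /* ... perform the one-shot deferred work ... */

        kfree(ctx);     /* Unsafe: the tasklet core may still touch
                         * ctx->tasklet after this callback returns. */
    }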
Patocka sought to fix this problem by adding a new "one-shot" tasklet variant, where the tasklet subsystem would promise to not touch the tasklet_struct structure after the tasklet itself runs. Linus Torvalds, though, did not like that patch; he said that tasklets just should not be used in that way. Workqueues are better designed, he said, and are better for that use case — except for the extra latency they can impose. So, he suggested, the right approach might be a new type of workqueue:
I think if we introduced a workqueue that worked more like a tasklet - in that it's run in softirq context - but doesn't have the interface mistakes of tasklets, a number of existing workqueue users might decide that that is exactly what they want.
Tejun Heo, the workqueue maintainer, ran with that idea; the result was this patch series adding a new workqueue type, WQ_BH, with the semantics that Torvalds described. A work item submitted to a WQ_BH workqueue will be run quickly, in atomic context, on the same CPU.
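Based on the interface described in the patch series (details could still change before merging), using the new type looks just like ordinary workqueue code, with the WQ_BH flag supplied at allocation time:

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *my_bh_wq;
    static struct work_struct my_work;

    static void my_work_fn(struct work_struct *work)
    {
        /* Runs in softirq (atomic) context: no sleeping, but far
         * lower latency than a scheduled workqueue item. */
    }

    static int my_setup(void)
    {
        my_bh_wq = alloc_workqueue("my_bh_wq", WQ_BH, 0);
        if (!my_bh_wq)
            return -ENOMEM;
        INIT_WORK(&my_work, my_work_fn);
        queue_work(my_bh_wq, &my_work);
        return 0;
    }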
Interestingly, these work items are run out of a tasklet — for now. Fearing priority-inversion problems between WQ_BH work items and existing tasklets, Heo chose to leave the tasklet subsystem in control. The patch series converts a number of tasklet users over to the new workqueue type, though, and the plan is clearly to convert the rest over time. That may take a while; there are well over 500 tasklet users in the kernel. Once that conversion is complete, though, it will be possible to run WQ_BH workqueues directly from a software interrupt and remove the tasklet API entirely.
This implementation, of course, still leaves software interrupts in place; removing that subsystem will be a job for another day. Using software interrupts led to a complaint from Sebastian Andrzej Siewior, who would rather see tasklet users moved to threaded interrupt handlers or regular workqueues. But, as Heo answered, that doesn't help the cases where the shortest latency is required. It seems there may always be a place for a deferred-work mechanism that does not require scheduling, as much as the realtime developers would like to avoid it.
Heo has the patch series marked as targeted at the 6.9 kernel release, meaning that it would need to be ready for the merge window in mid-March. That is relatively quick for a significant new feature like this, but it is using a well-established kernel API to edge out a subsystem that developers have wanted to get rid of for years. So there is a reasonable chance that this particular work may not be deferred past the next kernel cycle.
Zig 2024 roadmap
The Zig language 2024 roadmap was presented in a talk last week on Zig Showtime (a show covering Zig news). Andrew Kelley, the benevolent dictator for life of the Zig project, presented his goals for the language, largely focusing on compiler performance and continuing progress toward stabilization for the language. He discussed details of his plan for incremental compilation, and addressed the sustainability of the project in terms of both code contributions and financial support.
Priorities for Zig
Kelley opened by discussing the upcoming 0.12 release, confirming that the GitHub issues tagged with that milestone constitute an accurate list of what he expects to land in the release, and that there are no large disruptive features planned for 0.12. Kelley said that he expected 0.12 to be fairly stable, and that many users would prefer it because it mostly contains bug fixes to 0.11, only a select handful of which are going to be backported to 0.11.1. He then talked about the process for deciding on the most important problems to fix as part of aiming for stability, and invited anyone who had a big, real-world Zig project to add their problems with the language to the tracking document.
Kelley then segued into discussing the number of open issues with the compiler, pointing out that if he fixed one bug a day, it would take years to resolve all of them. So, naturally, the Zig project needs some way to solve open issues faster. Kelley's solution to this problem is to work on making the compiler more performant — if contributors need to spend less time compiling, they can spend more time actually working on contributions. Currently, the Zig compiler takes ten minutes to build from scratch on my laptop, but the tests take more than 600 CPU-minutes. A large portion of that time is spent on code generation using LLVM, which is often slow (despite recent improvements by the LLVM project).
LWN previously reported on the Zig project's attempts to make LLVM an optional dependency. Kelley said that great progress has been made on that goal, and that the native x86 backend is complete enough that "you can start playing with it and seeing if it works for your use case", although it is not passing all of the behavior tests quite yet. He expects those changes to land in 0.12, although the LLVM backend will remain the default for at least one more release. Kelley hopes that the native backend will be useful for debug builds, where producing code quickly is more important than producing optimal code.
He was also excited about progress on incremental builds, saying: "This feature is unique to Zig because you just can't do it on accident. You need to design for it from the beginning". While other languages support incremental compilation, the Zig project has plans to go further than other languages do, including directly patching changed functions into on-disk binaries. In support of incremental compilation, Kelley has been working on re-designing the compiler's internal data structures to make incremental compilation as smooth as possible.
He has converted the compiler's internal representation to avoid the use of pointers, instead storing information as flat arrays accessed by index. This representation permits the compiler's internal knowledge of the program to be saved directly to disk and then loaded again without parsing or fixups, so that incremental compilation can just mmap() the previous analysis and access it directly. Kelley went on to say that incremental compilation "is actually Zig's killer feature, and we haven't even launched it yet". Eventually, he wants to make incremental rebuilds of a project so fast as to appear instantaneous.
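As a rough illustration of the general technique (this is not Zig's actual data structure), an index-based representation in C might look like the following; because nodes refer to each other by array index rather than by pointer, the whole array can be written to disk and mapped back in without any pointer fixups:

    #include <stdint.h>

    /* One node of a syntax tree, stored in a flat array. */
    struct node {
        uint32_t kind;   /* what sort of node this is */
        uint32_t lhs;    /* index of a child in the same array */
        uint32_t rhs;    /* index of another child, or an operand */
    };

    /* The whole tree is one contiguous allocation, so it can be
     * saved with a single write() and restored with mmap(). */
    struct tree {
        struct node *nodes;
        uint32_t     len;
    };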
Later, Kelley claimed that the four main priorities for Zig before tagging version 1.0 were performance, improvements to the language itself, bringing the standard library up to a consistent level of quality, and writing a formal language specification. He presented each of these as being a necessary prerequisite of the later steps, with work on performance informing what language changes are necessary, which themselves inform what goes into the specification. He addressed concerns about how long it would take for Zig to reach 1.0 by remarking that "things are pretty unstable right now, but maybe 1.0 is not really what you're waiting for. Maybe you're just waiting for, you know, the draft of the language specification to come out. I think that's going to be a stability turning point".
During the Q&A after the main talk, one person asked how they could pitch the use of Zig to their company, given that the language isn't stable yet. Kelley's response was that people should "ask for forgiveness, not permission". Loris Cro, the host of Zig Showtime, elaborated by explaining that focusing too much on whether Zig has reached version 1.0 is missing the point. Cro said that it's more important to have a good mental model of where Zig stands compared to the alternatives, and whether using the language would actually be better for your use case. Kelley agreed with that assessment, saying that one should introduce Zig at one's company only if it would actually make some piece of software better, not just for the sake of introducing a new language.
Another question concerned whether the Zig project intends to focus on external tooling, such as a better language server. Kelley responded that he intends to get incremental compilation working first, before integrating an officially supported language server, because it could reuse many of the same internal components. He did assert that anyone can influence the path and timeline of Zig by submitting a pull request, and encouraged anyone who was particularly interested in a specific tool or feature to try contributing to the project.
A different audience member asked whether there had been any fix for the issue highlighted in Martin Wickham's 2023 talk at the Software You Can Love conference where a function argument that is implicitly passed by reference can cause problems when the same memory is also aliased through a mutable pointer. Kelley responded that this had not yet been addressed, and remarked that while this problem has drawn a lot of attention in tech news circles, he does not consider it very important. He noted that TigerBeetle, a financial database written in Zig that cares a lot about correctness, listed several other problems with the language as being more pressing. Despite that, he assured the audience that the problem would be fixed eventually, it just hasn't been a priority yet.
One potential contributor said that they were excited to write some compiler optimization passes for Zig, and wondered when the compiler will have settled down enough that they could look at contributing those. Kelley responded that there was still some additional research work to be done around efficient internal representations, and that he had been looking in particular at the Sea of Nodes representation. Kelley continued to say that high-level optimizations would need to wait until that had been explored.
Another question had to do with when support for async-await would return to the language. Kelley responded that the asynchronous I/O support in previous versions of Zig had been experimental, and that they had learned a lot about what problems come up when trying to do something like that in a systems language. He said that he really liked single-threaded asynchronous event loops, because of how they let you be very explicit about what order I/O operations ought to occur in, and he wanted to bring it back to the language, but the prerequisites weren't in place. He spoke about the difficulties with debugging asynchronous programs (including lack of support in DWARF for asynchronous functions), and the problems with LLVM's coroutine code generation. Ultimately, he said that asynchronous programming would need to wait for the native backend, which can exercise greater control over how the code for asynchronous functions is generated.
Finances
Kelley also spent a portion of the talk discussing the Zig Software Foundation's finances. The foundation recently published a report as part of its 2024 fundraiser explaining how it used donations in 2023. While many volunteers contribute time to the Zig project directly, nearly two thirds of the foundation's expenditures went to paying contractors to work on the core language. The remaining third was split between paying Kelley for his time working on the language and paying for necessary infrastructure and legal expenses.
Donations diminished in the last portion of 2023, but Zig is more popular than ever. There were a record number of GitHub issues filed and pull requests merged in the past year. In 2024, the Zig Software Foundation is hoping that a large portion of its revenue can once again come from individual donations, which accounted for a third of its revenue in 2023. It is also experimenting with new funding sources, including "donor bounties". The foundation has previously expressed discomfort with feature bounties, saying "bounties are a poor form of sponsorship when it comes to software development", because they discourage cooperation and maintainable code.
Donor bounties are a proposed alternative where the bounty is paid to the foundation, instead of directly to contributors. In this model, bounties can still contribute to the development of Zig, without setting contributors at odds with each other or encouraging the development of shoddy code. Kelley was quick to emphasize that donor bounties are an experiment, and "are not a new thing we want everyone to do". No-strings-attached donations remain the foundation's preferred method of contribution. But the few donors who have specific, urgent requests for particular features can reach out to the Zig Software Foundation to fund a bounty.
Conclusion
While the Zig project is continuing to make progress on its ambitious goals for the language, there are numerous difficult technical challenges ahead. Kelley's vision for the language in 2024 focuses mainly on performance, as a way to increase development velocity and unblock other work. There is lots of additional work needed in the long term on the standard library, documentation, and language specification. Despite that, the Zig project is healthy, with 354 contributors in the past year and a growing pool of serious real-world users. Kelley noted that he doesn't know how long it will take to tag 1.0; like any free software project, Zig's progress depends on people's willingness to help. Despite this reality, Kelley asserted that the project would get there with time.
So you think you understand IP fragmentation?
What is IP fragmentation, why is it important, and do people understand it? The answer to that last question is "not as well as they think". This article will also answer the rest of those questions and introduce fragquiz, a game that I wrote to allow players to guess how IP packets will behave when they are too large for the network. As evidence that IP fragmentation is not well-understood, a room full of networking experts played fragquiz and got a score that was nowhere close to perfect. In addition, I will describe a new algorithm for fragmentation avoidance, which some colleagues and I developed, that helped motivate development of fragquiz.
Why care?
IP fragmentation is when an IP (Internet Protocol) packet is split into smaller pieces before it is sent to another computer. TCP and UDP, along with a lot of other network protocols, are implemented on top of IP. Many networking experts think they know when IP fragmentation will happen, and I thought I did too—until I had to implement an algorithm for a VPN. That's when I learned that, like me, a lot of other networking experts are quite bad at predicting when a packet would be split into pieces. To explain why, we start with what IP fragmentation is.
An IP packet is a building block of the internet: a little chunk of application data with a header describing what it contains, where to send it, and what intermediate routers are allowed to do to it, among other things. Each router on the path between the source and destination host reads the IP header, changes it slightly, consults the routing tables, and (hopefully) sends the packet on to the next router in the path.
Each network link has a maximum size of IP packet that can be sent over it: the Maximum Transmission Unit (MTU). The path MTU (PMTU) is the minimum of all of the MTUs on the path between two hosts. The path can change over time, however, based on congestion, outages, and other network changes.
IP fragmentation happens when IP packets get split up into smaller IP packets, each with their own header, so that they can fit into the MTU of the network path. In IPv4 and IPv6, fragmentation can occur at the source, the computer where the packet is coming from. In IPv4, packets can also be fragmented by any router on the path between the source and the destination.
Generally speaking, IP fragmentation is bad for performance in just about every dimension: throughput, latency, CPU usage, memory usage, and network congestion. To see why, imagine a typical IPv4 packet of 20 bytes of IP metadata and 1480 bytes of data that has been fragmented into packets that each contain only eight bytes or fewer of data for a total of 1480/8 = 185 packets. (This is possible but unlikely to ever happen in reality; usually packets are only split into two pieces.)
To send 1480 bytes of data in eight-byte fragments, the source must send 185*20 = 3700 bytes of metadata instead of just 20 bytes in the unfragmented case. Processing the packet header costs a certain amount of CPU time, which will happen 185 times at every host in the path. The destination can't pass the data up the networking stack until it receives all of the fragments, so the latency is that of the slowest of the 185 packets. The destination must also reserve memory for assembling the fragmented packet, which it will throw away if it does not receive even one of the fragments after waiting for a reasonable time.
Worse, fragments are more likely to be lost. Many routers and firewalls treat fragments as a security risk because they don't include the information from higher-level protocols like TCP or UDP and can't be filtered based on port, so they drop all IP fragments. Also, load-balancing systems might route fragments to different hosts, where they can never be reassembled.
Even when an IP packet is only split into two pieces, it usually causes a noticeable degradation of connection performance due to the doubling of the per-packet overhead. Sometimes IP fragmentation results in a network "black hole" if a router is configured to drop fragments. The small packets that initiate a connection get through, but the larger packets containing the data are fragmented, so they are all dropped. This is why network programmers really really want to prevent IP fragmentation.
Prevention
IP fragmentation is prevented by only sending packets that are equal to or smaller than the path MTU between two hosts. But how do we find the path MTU? This is called path MTU discovery (PMTUD) and there are a variety of methods to do this, depending on the networking protocol and the characteristics of the network. One reliable way to find the path MTU is to send IP packets of a known size that are not allowed to be fragmented. If the source gets confirmation that a packet arrived at the destination, then the path MTU is at least as large as that packet.
So, to prevent IP fragmentation, you must understand IP fragmentation well enough to predict two things: the size of the IP packet as sent by the source host, and whether any intermediate routers are permitted to fragment the packet into smaller pieces. This depends on, among other things:
- the MTU of the local interface
- the IP version (IPv4 or IPv6)
- the options in the IP packet header
- the protocol (TCP/UDP/ICMP/etc.)
- the socket options
- any system-wide PMTUD-related settings
- any relevant PMTU-cache entries
If the sender tries to send an IP packet that is bigger than the MTU on any part of the path to the receiving host, there are three possibilities: the send() system call returns EMSGSIZE, the packet is fragmented, and/or the packet is dropped. (The last two may happen on either the source host or an intermediate router, depending on the packet type and options.) When I say that someone "understands IP fragmentation", I mean that they can predict which of those things might happen to a given packet.
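To make the first of those possibilities concrete, here is a minimal Linux-only sketch (the function name is mine) of detecting EMSGSIZE and asking the kernel for its current path-MTU estimate via the IP_MTU socket option, which is valid on a connected socket:

    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Try to send on a connected socket; on EMSGSIZE, report the
     * kernel's current path-MTU estimate for the destination. */
    ssize_t send_or_report_mtu(int fd, const void *buf, size_t len)
    {
        ssize_t n = send(fd, buf, len, 0);
        if (n < 0 && errno == EMSGSIZE) {
            int mtu;
            socklen_t sl = sizeof(mtu);
            if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &sl) == 0)
                fprintf(stderr, "send: EMSGSIZE, path MTU %d\n", mtu);
        }
        return n;
    }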
Well-understood?
If you'd asked me a year ago if most networking experts could predict the size and fragmentation status of an IP packet, I would have confidently said "yes". Then I had to implement DPLPMTUD for a VPN. (Yes, that's a real acronym, for real software, from a real RFC. It stands for Datagram Packetization Layer Path Maximum Transmission Unit Discovery.)
Initially, it seemed like it would be easy. My colleagues were networking experts with a lot of experience working on the application, which is a WireGuard-based VPN using IPv4 and IPv6. Together, we came up with a fast, simple path MTU discovery algorithm. They were confident that the software already only sent packets that couldn't be fragmented, so all we had to do was send the right size of probe packets, using a built-in ping feature, and record the response. Imagine our surprise when the packet captures turned out to be full of fragmented packets.
As I searched for ways to disable IP fragmentation, I found a lot of misleading and unhelpful answers on Stack Overflow. Sometimes the best answer would be down-voted. The official documentation either didn't exist (macOS) or was hard to understand (Linux). We all thought the probe packets should be sent on a socket with IP_PMTUDISC_DO set on Linux, but it took a few weeks to realize that we actually wanted IP_PMTUDISC_PROBE. Eventually I figured out all the correct settings for Linux and macOS, but it took much longer than it should have.
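For the curious, here is a minimal sketch (not our production code) of creating an IPv4 UDP socket on Linux with the option we eventually settled on; IP_PMTUDISC_DO sets the don't-fragment bit but limits send() to the cached path MTU, while IP_PMTUDISC_PROBE sets the bit and ignores the cache, which is what probing requires:

    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* Create a UDP socket whose packets carry the DF bit and are not
     * clamped to the kernel's cached path MTU (Linux-specific). */
    int make_probe_socket(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;

        int val = IP_PMTUDISC_PROBE;
        if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
                       &val, sizeof(val)) < 0) {
            perror("setsockopt");
            return -1;
        }
        return fd;
    }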
I wanted to share what I learned with other people, but now I faced an even harder problem: How do you teach people something they think they already know? People were confidently wrong about IP fragmentation everywhere I looked, including in the mirror. Also, let's face it, IP fragmentation is kind of boring.
Introducing fragquiz
I decided to write a game to help people learn IP fragmentation. The program would send packets that were larger than the MTU of the local network connection (the gateway interface), while changing the IP version (IPv6 or IPv4), the transport-layer protocol (TCP or UDP), and the socket-fragmentation options (do/don't fragment on macOS, four different PMTUD options from the ip(7) man page for Linux). It would then report whether the packet was sent, what the packet's fragmentation setting was, and if it was fragmented en route—but first it would make the user guess what would happen. At the end, it would tell them their score and encourage them to send their score and a link to the program to someone else, Wordle-style.
I had a few requirements:
- Works on macOS and Linux
- Easy to run (no superuser, no separate server, no configuration)
- No virtualization, tunnels, or loopback interface since they often have bugs related to MTU
- No host packet tracing because fragmentation/reassembly often happens on the network interface
I decided to use a traceroute-style solution. The default mode in traceroute sends packets with a small time-to-live (TTL), or hop limit for IPv6. When a router receives a packet, it subtracts one from the TTL; if the TTL is now zero and the packet isn't for the router itself, it will throw away the packet and send an ICMP Time Exceeded message back to the source. Traceroute then reads the IP address of the sending router from the Time Exceeded message and prints that out. It continues sending packets with increasing TTLs to find the IP address of routers that are increasingly close to the destination.
Fragquiz uses the same TTL technique, sending each packet with a small TTL, and reading the ICMP Time Exceeded packet sent by the router. The Time Exceeded message includes the header of the packet that triggered the message, which includes the packet size and fragmentation status. On macOS and Linux, an unprivileged user can read (and send) a restricted subset of ICMP messages using the unprivileged ICMP socket type.
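In sketch form (illustrative, not fragquiz's actual code), the sending half of that technique looks like this:

    #include <netinet/in.h>
    #include <stddef.h>
    #include <sys/socket.h>

    /* Send an oversized probe with a small TTL so that a nearby
     * router, rather than the destination, reports back with an
     * ICMP Time Exceeded message quoting the expired packet's
     * header, including its size and fragmentation status. */
    void send_probe(int fd, const void *buf, size_t len,
                    const struct sockaddr_in *dst)
    {
        int ttl = 2;    /* expires at the second router */
        setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
        sendto(fd, buf, len, 0,
               (const struct sockaddr *)dst, sizeof(*dst));
    }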
It worked, but there were a few surprises along the way.
Initially, I assumed that routers would not reassemble a packet with a TTL of one, since they would just decrement the TTL and throw it away as soon as they finished. But the first router that I tested, my home WiFi access point, did exactly that. I added code to automatically probe the network with larger-than-MTU packets with increasing TTL values until it received a Time Exceeded message for a fragment instead of the whole packet, signaling that the packets reached a router that did not reassemble the packet before sending a Time Exceeded message. Then I used that value for the TTL for the packets testing IP fragmentation. Usually the necessary TTL is one or two; the largest TTL I've seen in practice is six, meaning that routers one through five all reassembled fragments before sending Time Exceeded responses.
Some networks simply don't send Time Exceeded messages. I found this out the hard way when fragquiz suddenly stopped working and I spent a few minutes frantically trying to figure out how I'd broken the code. Then I realized that turning off my VPN made fragquiz work again. In my experience, it is rare for a network to correctly generate Time Exceeded messages for both IPv4 and IPv6.
While the unprivileged ICMP listener allowed non-superusers to read Time Exceeded messages on macOS, that code only worked for the superuser on Linux. According to the initial commit message for ICMP sockets, an ICMP Time Exceeded message can only be read by an unprivileged user using IP_RECVERR on the sending socket. I didn't implement that, so currently the Linux version only works as root.
Both macOS and Linux keep a cache of path MTUs discovered by the operating system. The cached path MTU will affect the IP-fragmentation behavior in some cases, which made testing a pain since I had to wait for the cached path MTU to expire. I would like to add an option to clear the path MTU cache in the future. Also, note that the prototype version has a kind of placeholder license, for now, but I plan to release a version with an open license in the future.
I presented fragquiz at RIPE 87, which is a conference for network operators and internet service providers. At the end of the talk, I had the audience play fragquiz by voting with raised hands. Almost every question had people voting both "yes" and "no". Collectively, their score was just below 80%. That means an audience full of professional network engineers and researchers working together didn't even get a "B" on the assignment. I think we can safely conclude that understanding IP fragmentation is hard.
A novel (?) algorithm
Finally, I promised to explain the new algorithm, which I co-created with Salman Aljammaz and James Tucker.
Most path MTU discovery algorithms test one path MTU at a time. They send a packet of a certain size and see whether it got through, then decide what to do next: send a bigger packet, send a smaller packet, or decide that the current estimate of the path MTU is good enough and terminate the search algorithm. This can take several round trips to find the best path MTU.
Our first insight was that, as shown by Custura, et al., in the real world there are a small number of likely packet sizes, fewer than ten. We aren't the first to realize that; in fact, RFC 8899 says: "Implementations could optimize the search procedure by selecting step sizes from a table of common PMTU sizes."
What we did differently is this: we sent ALL of the possible packet sizes at the same time. So if the local MTU is 9000 bytes, then we send packets with sizes of 1280, 1400, 1500, 8000, and 9000 bytes all at the same time. The other end sends an acknowledgment for every packet it sees. Then we set the path MTU to the largest packet size that was acknowledged. It's okay if it's off by a few bytes; most PMTU search algorithms stop probing when they get "close enough."
Every ten minutes, we reprobe the path MTU by sending a packet that is the next MTU size up. If we get an acknowledgment for the larger MTU size, then we know the path MTU has changed, and we reprobe with all of the packet sizes larger than that and smaller than the local MTU. Otherwise we use the current path MTU for another ten minutes. If we start losing packets for any reason, including the path MTU shrinking, we renegotiate the connection from scratch.
This algorithm has a latency of one RTT (round trip time) and is extremely simple: one timer, one static table, and one variable to hold the current path MTU. The downside is that it might use more bandwidth than other path MTU search algorithms if they can find the path MTU with fewer packets.
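A sketch of the idea (send_probe_of_size() and largest_acked_size() are hypothetical helpers standing in for our transport code):

    #include <stddef.h>

    /* Common real-world MTU sizes; RFC 8899 suggests selecting step
     * sizes from a table of common PMTU values like this one. */
    static const int probe_sizes[] = { 1280, 1400, 1500, 8000, 9000 };

    /* Hypothetical helpers: transmit an acknowledged probe packet of
     * a given size, and report the largest size acknowledged within
     * one round-trip time. */
    extern void send_probe_of_size(int size);
    extern int largest_acked_size(void);

    int discover_pmtu(int local_mtu)
    {
        /* Send every candidate size that fits the local MTU at once,
         * rather than probing one size per round trip. */
        for (size_t i = 0;
             i < sizeof(probe_sizes) / sizeof(probe_sizes[0]); i++)
            if (probe_sizes[i] <= local_mtu)
                send_probe_of_size(probe_sizes[i]);

        /* One RTT later, the largest acknowledged size is the new
         * path-MTU estimate. */
        return largest_acked_size();
    }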
Reader challenge
I hope you can now state proudly that you also don't understand IP fragmentation. If you're still not sure, here's a fun closing challenge: Download fragquiz and run the following on either Linux or macOS with the standard configuration. (If you've made a TCP connection to bing.com in the last 10 minutes, replace it with a domain you haven't connected to recently. If you're on macOS, you don't need the sudo.)
$ sudo ./fragquiz -p udp4 -f default -a bing.com:80
$ sudo ./fragquiz -p tcp4 -f default -a bing.com:80
$ sudo ./fragquiz -p udp4 -f default -a bing.com:80
Do you get the same answer on the first and third command? Why or why not? Hint: consult the Linux ip(7) man page linked above.
[ Valerie Aurora is a software consultant who enjoys writing ridiculous hacks and solving difficult systems problems. ]
Please welcome Joe Brockmeier to LWN
At the beginning of November, we let it be known that we were looking to hire a writer/editor to augment the LWN team. In past attempts, we have found it difficult to attract writers who could produce the kind of content that LWN readers expect. This time around, as we have said before, was different; we had a number of candidates who could have filled the bill and were forced to make some difficult choices.
While "hire them all" was an attractive idea, it was not one that our budget would support. We did conclude, however, that we could stretch to a second hire, and the opportunity to bring Joe Brockmeier on board was too good to pass up — so we didn't. You will start to see his work return to LWN within the next few days.
We say "return" because Joe has a long history with the Linux community, and with LWN in particular. Our archives include an extensive series of articles that he contributed between 2003 and 2011 while working as a freelance author and editor.
Joe has a long history of working with (and writing about) open source. He's a member of the Apache Software Foundation (ASF) and has been involved in the Fedora Project, Apache CloudStack, and many other projects. He lives in North Carolina with his family and a menagerie of cats and dogs.
By bringing on Joe and Daroc Alden, we have firmed up the foundation on which LWN is built and opened opportunities for an even better LWN in the future. As always, the biggest thanks are to you, LWN's subscribers, who have made LWN possible for all these years.
Page editor: Jonathan Corbet
