
Leading items

Welcome to the LWN.net Weekly Edition for February 17, 2022

This edition contains the following feature content:

  • Uniting the Linux random-number devices: the distinction between /dev/random and /dev/urandom may finally go away.
  • The long road to a fix for CVE-2021-20316: closing a Samba symbolic-link vulnerability required rewriting the project's VFS layer.
  • Going big with TCP packets: the BIG TCP patch set enables IPv6 packets larger than 64KB.
  • Remote per-CPU page list draining: reclaiming per-CPU free-page lists without disturbing the CPUs that own them.
  • Debian reconsiders NEW review: the costs and benefits of manually reviewing packages entering the distribution.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Uniting the Linux random-number devices

By Jake Edge
February 16, 2022

Blocking in the kernel's random-number generator (RNG)—causing a process to wait for "enough" entropy to generate strong random numbers—has always been controversial. It has also led to various kinds of problems over the years, from timeouts and delays caused by misuse in user-space programs to deadlocks and other problems in the boot process. That behavior has undergone a number of changes over the last few years and it looks possible that the last vestige of the difference between merely "good" and "cryptographic-strength" random numbers may go away in some upcoming kernel version.

Random history

The history of the kernel RNG is long and somewhat twisty; there are two random-number devices in the kernel, /dev/random and /dev/urandom, that can be read to obtain random data. /dev/urandom was always meant as the device for nearly everything to use, as it does not block; it simply provides the best random numbers that the kernel can provide at the time it is read. /dev/random, on the other hand, blocks whenever it does not have sufficient entropy to provide cryptographic-strength random numbers. That entropy comes from sources like interrupt timing for various kinds of devices (e.g. disk, keyboard, network) and hardware RNGs if they are available. /dev/urandom will log a warning message (once) if it is read before its pool has been initialized, which happens from the random pool once enough entropy has been gathered, but it will still provide output from its pseudorandom-number generator (PRNG) rather than blocking.

In 2014, for Linux 3.17, the getrandom() system call was added to provide a reliable way for user-space applications to request random numbers even in the face of file-descriptor exhaustion or lack of access to the random devices (as might happen for an application running in a container). getrandom() was designed to use the urandom pool, but only after it has been fully initialized from the random pool. So, while reads to /dev/urandom do not block, calls to getrandom() would until the requisite entropy is gathered. getrandom() callers can choose to use the random pool via a flag, which makes the call subject to the full entropy requirements for data coming from that pool.
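
As a rough illustration of the interface (an example written for this article, not taken from any particular program), a user-space caller might look like the following; note that the GRND_INSECURE flag discussed below requires Linux 5.6 or later:

    #include <stdio.h>
    #include <stdint.h>
    #include <sys/random.h>

    #ifndef GRND_INSECURE
    #define GRND_INSECURE 0x0004   /* value from the kernel UAPI; older
                                      glibc headers do not define it */
    #endif

    int main(void)
    {
        uint8_t key[32];

        /* Flags of 0: block only until the kernel's pool is initialized,
         * then return cryptographically strong bytes. */
        if (getrandom(key, sizeof(key), 0) < 0) {
            perror("getrandom");
            return 1;
        }

        /* GRND_RANDOM draws from the random pool instead, with its
         * historically stricter entropy accounting; GRND_INSECURE (5.6+)
         * returns best-effort bytes without waiting for initialization. */
        if (getrandom(key, sizeof(key), GRND_INSECURE) < 0)
            perror("getrandom(GRND_INSECURE)");

        printf("first byte: %02x\n", key[0]);
        return 0;
    }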

In 2019, an unrelated change to the ext4 filesystem led to systems that would not boot because it reduced the number of interrupts being generated, so the urandom pool did not get initialized and calls to getrandom() blocked. Since those calls were made early in the boot process, the system never came up to a point where enough entropy could be gathered, because the boot process was waiting for getrandom() to return—thus a deadlock resulted. The ext4 change was temporarily reverted for the 5.3 kernel and a more permanent solution was added by Linus Torvalds for 5.4. It used CPU execution time jitter as a source of entropy to ensure that the random pool initialized within a second or so. That technique is somewhat controversial, even Torvalds is somewhat skeptical of it, but it has been in place, and working as far as anyone can tell, for several years now.

In 2020, the blocking nature of /dev/random was changed to behave like getrandom(), in that it would only block until it is initialized, once, and then would provide cryptographic-strength random numbers thereafter. Andy Lutomirski, who contributed the patches for that change, said: "Linux's CRNG generates output that is good enough to use even for key generation. The blocking pool is not stronger in any material way, and keeping it around requires a lot of infrastructure of dubious value." Those patches also added a GRND_INSECURE flag for getrandom() that would return "best effort" random numbers even if the pool was not yet initialized.

As can be seen, the lines between the two devices have become rather blurry over time. More of the history of the kernel RNG, going even further back in time, can be found in this LWN kernel index entry. Given that the two devices have grown together, it is perhaps no surprise that a new proposal, to effectively eliminate the distinction, has been raised.

No random blocking (for long)

Jason A. Donenfeld, who stepped up as a co-maintainer of the kernel's RNG subsystem a few months back, has been rather active in doing cleanups and making other changes to that code of late. On February 11, he posted an RFC—perhaps a "request for grumbles" in truth—patch proposing the removal of the ability for /dev/urandom to return data before the pool is initialized. It would mean that the kernel RNG subsystem would always block waiting to initialize, but always return cryptographic-strength random numbers thereafter (unless the GRND_INSECURE flag to getrandom() is used). Because of the changes made by Torvalds in 5.4, which Donenfeld calls the "Linus Jitter Dance", the maximum wait for initialization is minimal, so Donenfeld suggested the change:

So, given that the kernel has grown this mechanism for seeding itself from nothing, and that this procedure happens pretty fast, maybe there's no point any longer in having /dev/urandom give insecure bytes. In the past we didn't want the boot process to deadlock, which was understandable. But now, in the worst case, a second goes by, and the problem is resolved. It seems like maybe we're finally at a point when we can get rid of the infamous "urandom read hole".

There are some potential hurdles to doing so, however. The jitter entropy technique relies on differences in timing when running the same code, which requires both a high-resolution CPU cycle counter and a CPU that appears to be nondeterministic (due to caching, instruction reordering, speculation, and so on). There are some architectures that do not provide that, however, so no entropy can be gathered that way. Donenfeld noted that non-Amiga m68k systems, two MIPS models (R6000 and R6000A), and, possibly, RISC-V would be affected; he wondered if there were other similarly affected architectures out there. He believes that the RISC-V code is not truly a problem, however, and no one has yet spoken up to dispute that. Meanwhile, setting those others aside might be the right approach:
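
The following toy user-space program, written for this article, gives a flavor of the jitter idea: time the same small computation over and over and mix the run-to-run differences into an accumulator. The kernel's real jitter-entropy code is far more careful about how it conditions and credits this data; on a machine without a high-resolution counter, the deltas below would collapse to a constant and yield nothing useful:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    }

    int main(void)
    {
        volatile uint64_t sink = 0;
        uint64_t pool = 0, prev = now_ns();

        for (int i = 0; i < 1024; i++) {
            for (int j = 0; j < 100; j++)   /* the "same code", every time */
                sink += (uint64_t)j;
            uint64_t t = now_ns();
            pool = (pool << 7) ^ (pool >> 57) ^ (t - prev);  /* mix the delta */
            prev = t;
        }
        printf("mixed jitter value: %016llx\n", (unsigned long long)pool);
        return 0;
    }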

If my general analysis is correct, are these ancient platforms really worth holding this back? I halfway expect to receive a few thrown tomatoes, an angry fist, and a "get off my lawn!", and if that's _all_ I hear, I'll take a hint and we can forget I ever proposed this. As mentioned, I do not intend to merge this unless there's broad consensus about it. But on the off chance that people feel differently, perhaps the Linus Jitter Dance is finally the solution to years of /dev/urandom kvetching.

The proposed patch was fairly small; it simply eliminated the file_operations struct for /dev/urandom and reused the one for /dev/random in its place, thus making the two devices behave identically. It also shorted out the behavior of the GRND_INSECURE flag, but he later said that was something of a distraction. The main intent of his proposal was to do the following:

Right now, we have:
/dev/random = getrandom(0)
/dev/urandom = getrandom(GRND_INSECURE)
This proposal is to make that:
/dev/random = getrandom(0)
/dev/urandom = getrandom(0)

Torvalds had a positive response to the RFC. He said that the patch makes sense for architectures that have a cycle counter; the jitter entropy change has been active for two-and-a-half years without much complaint, so "I think we can call that thing a success". There may have been a few complaints about it, but: "Honestly, I think all the complaints would have been from the theoretical posers that don't have any practical suggestions anyway". Torvalds is known to have little patience for theoretical concerns about cryptography (or theoretical concerns about anything else, in truth).

He did object to removing GRND_INSECURE for architectures that cannot do the jitter dance, since it is a way for user space to work around the lack of boot-time entropy, even if it is not at all secure:

Those systems are arguably broken from a randomness standpoint - what the h*ll are you supposed to do if there's nothing generating entropy - but broken or not, I suspect they still exists. Those horrendous MIPS things were quite common in embedded networking (routers, access points - places that *should* care)

[...] And almost nobody tests those broken platforms: even people who build new kernels for those embedded networking things probably end up using said kernels with an existing user space setup - where people have some existing saved source of pseudo-entropy. So they might not ever even trigger the "first boot problem" that tends to be the worst case.

But, he said, he would be willing to apply the patch: "at some point 'worry about broken platforms' ends up being too weak an excuse not to just apply it". According to Joshua Kinard, the two MIPS models in question were from the 1980s, not ever used in systems, and the kernel test for them in the random code "was probably added as a mental exercise following a processor manual or such". Maciej W. Rozycki said that there may have been a few systems using those models, but no Linux port was ever made for them. That might mean that the only problem systems are "some m68k museum pieces", Donenfeld said.

As Geert Uytterhoeven pointed out, though, the cycle-counter code for the Linux generic architecture, which is the default and starting point for new architectures, is hardwired to return zero. "Several architectures do not implement get_cycles(), or implement it with a variant that's very similar or identical to the generic version." David Laight added a few examples (old x86, nios2) of architectures where that is the case.

But what about my NetHack machine?

Lutomirski had a more prosaic complaint:

I dislike this patch for a reason that has nothing to do with security. Somewhere there’s a Linux machine that boots straight to Nethack in a glorious 50ms. If Nethack gets 256 bits of amazing entropy from /dev/urandom, then the machine’s owner has to play for real. If it repeats the same game on occasion, the owner can be disappointed or amused. If it gets a weak seed that can be brute forced, then the owner can have fun brute forcing it.

If, on the other hand, it waits 750ms for enough jitter entropy to be perfect, it’s a complete fail. No one wants to wait 750ms to play Nethack.

More seriously, he was concerned about devices like backup cameras or light bulbs that need to boot "immediately", and where the quality of the random numbers may not truly be a problem. The GRND_INSECURE escape hatch is there for just that reason. In a similar vein, Lennart Poettering was worried that systemd would have to wait one second to get a seed for its hash tables, when it already has a mechanism to reseed the tables:

So, systemd uses (potentially half-initialized) /dev/urandom for seeding its hash tables. For that its kinda OK if the random values have low entropy initially, as we'll automatically reseed when too many hash collisions happen, and then use a newer (and thus hopefully better) seed, again acquired through /dev/urandom. i.e. if the seeds are initially not good enough to thwart hash collision attacks, once the hash table are actually attacked we'll replace the seeds with [something] better. For that all we need is that the random pool eventually gets better, that's all.

It turns out that systemd is already using GRND_INSECURE on systems where it is available, so not changing that behavior, as was originally proposed, would neatly fix Poettering's concern. Donenfeld was completely amenable to pulling the disabling of GRND_INSECURE from his patch; it is not really his primary focus with the proposal, as noted.

Based on Torvalds's response, it would seem there are no huge barriers to removing the final distinction between /dev/random and /dev/urandom—other than the names, of course. If there are more architectures that cannot use the jitter technique, though, that distinction may live on, since Torvalds also thought there might be value in keeping "the stupid stuff around as a 'doesn't hurt good platforms, might help broken ones'". The code removal would not be huge, so it does not really provide much of a code simplification, Donenfeld said; it is more a matter of being able to eliminate the endless debate about which source of randomness to use on Linux. To that end, it seems like a worthwhile goal.

Comments (16 posted)

The long road to a fix for CVE-2021-20316

By Jonathan Corbet
February 10, 2022
Well-maintained free-software projects usually make a point of quickly fixing known security problems, and the Samba project, which provides interoperability between Windows and Unix systems, is no exception. So it is natural to wonder why the fix for CVE-2021-20316, a symbolic-link vulnerability, was well over two years in coming. Sometimes, a security bug can be fixed with a simple tweak to the code. Other times, the fix requires a massive rewrite of much of a project's internal code. This particular vulnerability fell firmly into the latter category, necessitating a public rewrite of Samba's virtual filesystem (VFS) layer to address an undisclosed vulnerability.

The story starts with a bug report from Michael Hanselmann in May 2019. When an SMB client instructs the server to create a new directory, the server must carry out a number of checks to ensure that the client is entitled to do that. Among other things, the server makes sure that the requested directory actually lies within the exported SMB share rather than being at some arbitrary location elsewhere in the server's filesystem. Unfortunately, there is inevitably a window between when the server performs the check and when it actually creates the directory. If a malicious user is able to replace a component in the path for the new directory with a symbolic link during that window, Samba will happily follow the link and make the directory in the wrong place, with results that are generally seen as distasteful by anybody but an attacker.

This is a classic time-of-check/time-of-use (TOCTOU) vulnerability, of the sort that symbolic links have become notorious for. It is also a hard one to fix, especially for a system like Samba, where portability is an important concern. There is no easy, cross-platform way to query the attributes of a path in the filesystem and safely act on the result, secure in the knowledge that a malicious actor cannot change things in the middle. Still, something clearly needed to be done, so Samba developer Jeremy Allison jumped in to write a fix. The CVE number CVE-2019-10151 was duly assigned to this problem.
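
A hypothetical fragment (invented for this article, not Samba code) shows the shape of the problem:

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Check that the parent directory resolves to a location inside the
     * share, then create the new directory by path name.  The gap between
     * the two steps is the race window. */
    int create_in_share(const char *share_root, const char *parent,
                        const char *name)
    {
        char path[PATH_MAX], canon[PATH_MAX], full[PATH_MAX];

        snprintf(path, sizeof(path), "%s/%s", share_root, parent);

        /* Time of check: canonicalize the parent (following symlinks) and
         * make sure the result is still under the share root. */
        if (realpath(path, canon) == NULL ||
            strncmp(canon, share_root, strlen(share_root)) != 0)
            return -1;

        /* Time of use: if an attacker replaces "parent" with a symbolic
         * link right now, mkdir() will follow it and create the directory
         * somewhere else entirely. */
        snprintf(full, sizeof(full), "%s/%s", path, name);
        return mkdir(full, 0755);
    }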

The real problem

The hope was to come up with a quick solution but, from the outset, Allison identified the real problem: the use of path names to interact with the server-side filesystem. Every time that a path is passed to the kernel, the process of walking through it must be carried out anew; any user who can make that process arrive in different places at different times (through carefully timed use of symbolic links, for example) could use that ability to confuse the server. The good news is that there is another way that doesn't rely on unchanging paths.

Over the years, the kernel has gained a set of system calls that operate on file handles (open file descriptors) rather than path names. A carefully written server can, for example, use openat2() to create a file descriptor for a directory of interest, do its checks to ensure that the directory is what is expected, then use mkdirat() to create a subdirectory that cannot be redirected to the wrong place. Properly used, these system calls remove the TOCTOU race from this kind of operation, but they only work if they are being used — and Samba's use of them in 2019 was limited. At the time, Allison remarked: "Ultimately we need to modify the VFS to use the syscallAT() variants of all the system calls, but that's a VFS rewrite we will have to schedule for another day".
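
A sketch of how the handle-based calls close that window might look like the following; this is hypothetical code rather than Samba's, it requires Linux 5.6 or later for openat2(), and, since glibc provides no wrapper for that system call, it is invoked directly with syscall():

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/openat2.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Open a directory relative to the share root, refusing to follow any
     * symbolic links and refusing to escape the share. */
    static int open_dir_safely(int share_fd, const char *relpath)
    {
        struct open_how how;

        memset(&how, 0, sizeof(how));
        how.flags = O_DIRECTORY | O_RDONLY;
        how.resolve = RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS;
        return (int)syscall(SYS_openat2, share_fd, relpath, &how, sizeof(how));
    }

    int main(void)
    {
        /* Hypothetical paths, standing in for an exported share. */
        int share_fd = open("/srv/share", O_DIRECTORY | O_RDONLY);
        if (share_fd < 0) { perror("open share"); return 1; }

        int dir_fd = open_dir_safely(share_fd, "clients/alice");
        if (dir_fd < 0) { perror("openat2"); return 1; }

        /* The new directory is created relative to dir_fd; swapping a
         * symlink into the path now cannot redirect it elsewhere. */
        if (mkdirat(dir_fd, "newdir", 0755) < 0) { perror("mkdirat"); return 1; }

        close(dir_fd);
        close(share_fd);
        return 0;
    }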

Over a month or so of attempts to close the vulnerability (and other symbolic-link issues that arose once people started looking for them), it became increasingly clear that "another day" was coming sooner than anybody had thought. By mid-July 2019, Allison seemed resigned to the big rewrite: "This is going to be a long slog, re-writing the pathname processing in the server". There was a complication, though: while this slog was underway, the vulnerability would remain unfixed and undisclosed. So how would all of this work be explained to anybody else who was watching the Samba project's work? According to Allison:

We need to rewrite the fileserver to [make] arbitrary symlink race change safe on all pathname operations. This is too large to do in private, so I'm doing this in public under the guise of "modernising the VFS to use handle-based operations" (without saying explicitly *why* I'm doing so).

There were a few other aspects of this project — beyond the need to hide its real purpose — that made it hard. One of those is that version 1 of the SMB protocol (SMB1) is path-name-based at its core, making it almost impossible for the server to use anything else. The abrupt deprecation of SMB1 in the Samba 4.11 release in September 2019 was partially driven by this problem.

The SMB2 protocol, instead, is based heavily on file handles, which should make it easier for the server implementation to work the same way. But Samba is an old program with a lot of history, so many of the internal interfaces were still using path names, even when a file handle was readily available. This includes the VFS interface that is used to talk with modules for specific host filesystems, add functionality like virus scanning, and more. Changing all of those internal APIs was a large job that would touch much of the code in the Samba server.

Thousands of changes

During the next two years, Allison would contribute 1,638 commits to the Samba repository — 17% of the total over that period. Not all of those were aimed at the VFS rewrite, but most of them were. And Allison was not alone; Ralph Böhme (1,261 commits), Noel Power (438) and Samuel Cabrero (251) also contributed heavily to this project. "Modernizing" the Samba VFS took up much of the project's attention while remaining mostly under the radar for anybody who was unaware of the real problem.

Böhme presented this work at the 2021 SambaXP event (video, slides) without ever mentioning the (still urgent) security problem that was driving it. The talk gets into a lot of the details of what needed to be done and how various problems were solved on Linux; it is recommended viewing for anybody who wants to dig deeper. There is also some information on the Samba wiki.

In July 2021, Allison declared victory:

With the master commit e168a95c1bb1928cf206baf6d2db851c85f65fa9, I believe all race conditions on meta-data are now fixed in the default paths. The async DOS attribute read still uses path-based getxattr, and some of the VFS modules are not symlink safe, but out of the box Samba I believe will no longer be vulnerable to this in 4.15.0.

Since then, the remaining path-based extended-attribute calls have been fixed as well. Of course, there were a few details to deal with yet, including the fact that the original CVE number had expired due to not being updated for too long. That necessitated the assignment of a new number, which is why this vulnerability is known as CVE-2021-20316. The work did show up as expected in the Samba 4.15.0 release in September 2021 — more than two years after the initial vulnerability report.

In the vulnerability disclosure, the Samba project described the situation this way:

A two and a half year effort was undertaken to completely re-write the Samba VFS layer to stop use of pathname-based calls in all cases involving reading and writing of metadata returned to the client. This work has finally been completed in Samba 4.15.0. [...]

As all operations are now done on an open handle we believe that any further symlink race conditions have been completely eliminated in Samba 4.15.0 and all future versions of Samba.

The disclosure also notes that, due to the massive nature of the rewrite, it will not be possible to fix this vulnerability in earlier Samba releases.

In the end, the Samba project was able to get this vulnerability fixed before word of the problem spread, and before any known exploits took place. But it was a bit of a gamble; attackers tend to keep an eye on the repositories of interesting projects in the hope of noticing patches addressing undisclosed vulnerabilities. It is hard not to draw comparisons with the events leading up to the disclosure of Meltdown and Spectre, both of which also required massive changes to address an undisclosed vulnerability. But, unlike the developers working on Spectre, the Samba developers found a way to do their work in public, ensuring that all patches were properly reviewed and minimizing the problems that had to be addressed after disclosure of the problem.

The gamble appears to have paid off, though things could have gone differently. Since then, Allison has been making the point that symbolic links are dangerous in general, and that other projects almost certainly have similar problems. He has a talk planned for SambaXP later this year that, presumably, will be more forthcoming than Böhme's 2021 presentation. Samba users (those who have updated, at least) are hopefully immune to symbolic-link attacks, but that is probably not true of many other systems that we depend on.

(Thanks to Jeremy Allison for answering questions and performing technical review on a draft of this article.)

Comments (60 posted)

Going big with TCP packets

By Jonathan Corbet
February 14, 2022
Like most components in the computing landscape, networking hardware has grown steadily faster over time. Indeed, today's high-end network interfaces can often move data more quickly than the systems they are attached to can handle. The networking developers have been working for years to increase the scalability of their subsystem; one of the current projects is the BIG TCP patch set from Eric Dumazet and Coco Li. BIG TCP isn't for everybody, but it has the potential to significantly improve networking performance in some settings.

Imagine, for a second, that you are trying to keep up with a 100Gb/s network adapter. As networking developer Jesper Brouer described back in 2015, if one is using the longstanding maximum packet size of 1,538 bytes, running the interface at full speed means coping with over eight million packets per second. At that rate, the CPU has all of about 120ns to do whatever is required to handle each packet, which is not a lot of time; a single cache miss can ruin the entire processing-time budget.

The situation gets better, though, if the number of packets is reduced, and that can be achieved by making packets larger. So it is unsurprising that high-performance networking installations, especially local-area networks where everything is managed as a single unit, use larger packet sizes. With proper configuration, packet sizes up to 64KB can be used, improving the situation considerably. But, in settings where data is being moved in units of megabytes or gigabytes (or more — cat videos are getting larger all the time), that still leaves the system with a lot of packets to handle.

Packet counts hurt in a number of ways. There is a significant fixed overhead associated with every packet transiting a system. Each packet must find its way through the network stack, from the upper protocol layers down to the device driver for the interface (or back). More packets means more interrupts from the network adapter. The sk_buff structure ("SKB") used to represent packets within the kernel is a large beast, since it must be able to support just about any networking feature that may be in use; that leads to significant per-packet memory use and memory-management costs. So there are good reasons to wish for the ability to move data in fewer, larger packets, at least for some types of applications.

The length of an IP packet is stored in the IP header; for both IPv4 and IPv6, that length lives in a 16-bit field, limiting the maximum packet size to 64KB. At the time these protocols were designed, a 64KB packet could take multiple seconds to transmit on the backbone Internet links that were available, so it must have seemed like a wildly large number; surely 64KB would be more than anybody would ever rationally want to put into a single packet. But times change, and 64KB can now seem like a cripplingly low limit.

Awareness of this problem is not especially recent: there is a solution (for IPv6, at least) to be found in RFC 2675, which was adopted in 1999. The IPv6 specification allows the placement of "hop-by-hop" headers with additional information; as the name suggests, a hop-by-hop header is used to communicate options between two directly connected systems. RFC 2675 enables larger packets with a couple of tweaks to the protocol. To send a "jumbo" packet, a system must set the (16-bit) IP payload length field to zero and add a hop-by-hop header containing the real payload length. The length field in that header is 32 bits, meaning that jumbo packets can contain up to 4GB of data; that should surely be enough for everybody.
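
The resulting eight-byte extension header is simple enough to show directly. The structure below follows the layout given in RFC 2675; the type and field names are invented for this example:

    #include <arpa/inet.h>   /* htonl() */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* The hop-by-hop extension header carrying the Jumbo Payload option,
     * as laid out in RFC 2675 (struct written for illustration). */
    struct hbh_jumbo {
        uint8_t  nexthdr;    /* header that follows, e.g. 6 for TCP */
        uint8_t  hdrlen;     /* extension-header length in 8-byte units,
                                not counting the first 8 bytes: 0 here */
        uint8_t  opt_type;   /* 0xC2: Jumbo Payload */
        uint8_t  opt_len;    /* option data length: 4 */
        uint32_t jumbo_len;  /* real payload length, network byte order */
    } __attribute__((packed));

    int main(void)
    {
        struct hbh_jumbo hbh;

        /* The 16-bit payload-length field in the IPv6 header itself is set
         * to zero; the real length (185,000 bytes here, as in the BIG TCP
         * benchmarks) lives in the option instead. */
        memset(&hbh, 0, sizeof(hbh));
        hbh.nexthdr   = 6;        /* TCP */
        hbh.opt_type  = 0xC2;
        hbh.opt_len   = 4;
        hbh.jumbo_len = htonl(185000);

        printf("hop-by-hop jumbo header is %zu bytes\n", sizeof(hbh));
        return 0;
    }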

The BIG TCP patch set adds the logic necessary to generate and accept jumbo packets when the maximum transmission unit (MTU) of a connection is set sufficiently high. Unsurprisingly, there were a number of details to manage to make this actually work. One of the more significant issues is that packets of any size are rarely stored in physically contiguous memory, which tends to be hard to come by in general. For zero-copy operations, where the buffers live in user space, packets are guaranteed to be scattered through physical memory. So packets are represented as a set of "fragments", which can be as short as one (4KB) page each; network interfaces handle the task of assembling packets from fragments on transmission (or fragmenting them on receipt).

Current kernels limit the number of fragments stored in an SKB to 17, which is sufficient to store a 64KB packet in single-page chunks. That limit will clearly interfere with the creation of larger packets, so the patch set raises the maximum number of fragments (to 45). But, as Alexander Duyck pointed out, many interface drivers encode assumptions about the maximum number of fragments that a packet may be split into. Increasing that limit without fixing the drivers could lead to performance regressions or even locked-up hardware, he said.

After some discussion, Dumazet proposed working around the problem by adding a configuration option controlling the maximum number of allowed fragments for any given packet. That is fine for sites that build their own kernels, which prospective users of this feature are relatively likely to do. It offers little help for distributors, though, who must pick a value for this option for all of their users.

In any case, many drivers will need to be updated to handle jumbo packets. Modern network interfaces perform segmentation offloading, meaning that much of the work of creating individual packets is done within the interface itself. Making segmentation offloading work with jumbo packets tends to involve a small number of tweaks; a few drivers are updated in the patch set.

One other minor problem has to do with the placement of the RFC 2675 hop-by-hop header. These headers, per the IPv6 standard, are placed immediately after the IP header; that can confuse software that "knows" that the TCP header can be found immediately after the IP header in a packet. The tcpdump utility has some problems in this regard; it also seems that there are a fair number of BPF programs in the wild that contain this assumption. For this reason, jumbo-packet handling is disabled by default, even if the underlying hardware and link could handle those packets.
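
The difference is easy to see in a small parsing sketch (illustrative only; a real parser must validate lengths and handle the other extension headers as well):

    #include <netinet/in.h>    /* IPPROTO_HOPOPTS, IPPROTO_TCP */
    #include <netinet/ip6.h>   /* struct ip6_hdr */
    #include <stddef.h>
    #include <stdint.h>

    /* The naive approach: assume the TCP header starts right after the
     * fixed 40-byte IPv6 header.  Wrong whenever a hop-by-hop header (as
     * used by RFC 2675 jumbo packets) has been inserted. */
    const uint8_t *find_l4_naive(const uint8_t *pkt)
    {
        return pkt + sizeof(struct ip6_hdr);
    }

    /* Walk the next-header chain instead (only the hop-by-hop case is
     * handled here). */
    const uint8_t *find_l4(const uint8_t *pkt)
    {
        const struct ip6_hdr *ip6 = (const struct ip6_hdr *)pkt;
        const uint8_t *p = pkt + sizeof(*ip6);
        uint8_t next = ip6->ip6_nxt;

        while (next == IPPROTO_HOPOPTS) {
            next = p[0];             /* next-header field */
            p += (p[1] + 1) * 8;     /* header length in 8-byte units */
        }
        return (next == IPPROTO_TCP) ? p : NULL;
    }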

Dumazet included some brief benchmark results with the patch posting. Enabling a packet size of 185,000 bytes increased network throughput by nearly 50% while also reducing round-trip latency significantly. So BIG TCP seems like an option worth having, at least in the sort of environments (data centers, for example) that use high-speed links and can reliably deliver large packets. If tomorrow's cat videos arrive a little more quickly, BIG TCP may be part of the reason.

See Dumazet's 2021 Netdev talk on this topic for more details.

Comments (31 posted)

Remote per-CPU page list draining

By Jonathan Corbet
February 15, 2022
Sometimes, a kernel-patch series comes with an exciting, sexy title. Other times, the mailing lists are full of patches with titles like "remote per-cpu lists drain support". For many, the patches associated with that title will be just as dry as the title itself. But, for those who are interested in such things — a group that includes many LWN readers — this short patch series from Nicolas Saenz Julienne gives some insight into just what is required to make the kernel's page allocator as fast — and as robust — as developers can make it.

Per-CPU page lists

As its name would suggest, the page allocator is charged with managing the system's memory in units of whole pages; that distinguishes it from the slab allocators, which usually deal in smaller chunks of memory. Allocation and freeing of memory happens frequently within the kernel, making the page allocator a part of many hot paths. A single system call or device interrupt can result in numerous calls into the page allocator, for example, so that code needs to be fast. At times, memory management has been identified as the bottleneck limiting the performance of other parts of the kernel, despite the efforts that have gone into optimizing it.

At a high level, the page allocator is based on a "buddy allocator", which deals with memory in power-of-two-sized blocks of pages. Among other things, the buddy allocator is good at coalescing adjacent pages into larger blocks when possible. This abstraction begins to be problematic, though, when faced with the needs of contemporary systems, where even a phone handset can have numerous CPUs in it. Maintaining a global buddy structure means a lot of concurrent access to its data; that, in turn, implies locking and cache misses, both of which can wreck performance.

One of the best ways to mitigate performance problems resulting from concurrent access to shared data is to stop performing concurrent access to shared data. To the extent that each CPU can work within its own private sandbox, without contending with other CPUs, performance will be improved. The page allocator, like many parts of the kernel, uses this approach by keeping per-CPU lists of free pages.

Specifically, the memory-management subsystem keeps a per-CPU list of free pages in the zone structure used to describe a memory-management zone. While the reality is (of course) a little more complicated, this structure can indeed be thought of as a simple array of lists of pages, one list for each CPU in the system. Whenever a given CPU needs to allocate a page, it looks first in its per-CPU list and grabs a page from there if one is available. When that CPU frees a page, it puts it back into the per-CPU list. In this way, many page-allocator requests can be satisfied without write access to any global data structures at all, speeding things considerably. Rapid reuse of pages that are cache-hot on the local CPU also helps.

That only works, of course, if the per-CPU lists have a reasonable number of pages in them. If a CPU needs a page and finds its per-CPU list empty, it will have to take the slower path to obtain memory from the global free list, possibly contending with other CPUs in the process. If, instead, the per-CPU list grows too long, it could tie up memory that is needed elsewhere and some of those pages will need to be given back to the global allocator. Much of the time, though, each CPU can work with its own lists and everybody is happy.

There is another situation that arises, though, when the system as a whole comes under memory pressure. If the memory-management subsystem reaches a point where it is scrounging for pages anywhere it can find them, it will soon turn an eye to the per-CPU lists, which may contain memory that is sitting idle and ripe for the taking. Unfortunately, even ripe memory cannot just be taken haphazardly; the per-CPU lists only work as long as each CPU has exclusive access to its own lists. If some other CPU pokes its fingers in, the whole system could go up in flames.

So what is to be done when the system needs to pillage the per-CPU lists for free pages? What happens now is that the kernel asks each CPU to free ("drain") pages out of its own per-CPU lists; this is done with a special, per-CPU workqueue, which is used to queue a callback to free the pages. The workqueue entry will run as soon as each target CPU gets around to scheduling it, which should normally happen fairly quickly.

This solution is not perfect, though. At best, it causes context switches on each target CPU to run the list-draining callback. But if a target CPU is running in the tickless mode, or if it is running a high-priority realtime task, then the workqueue entry may not run at all for a long time. So any free pages on that CPU will remain locked up in its local lists and, as luck would have it, that's probably where most of the free pages have ended up.

Draining the lists remotely

The patch set is designed to mitigate this problem by making it possible to remotely (i.e. from a different CPU) take pages from a CPU's local lists. A previous attempt added spinlocks to control access to the per-CPU lists, essentially taking away much of their per-CPUness; this solution worked, but it added just the sort of overhead that the per-CPU lists were created to avoid. So those patches did not make it into the kernel.

The current series, instead, falls back on one of the kernel community's favorite tools for dealing with scalability problems: read-copy-update (RCU). That, in turn, requires the original trick of computer science: adding a layer of indirection. With this patch series, each CPU now has two sets of lists to hold free pages, one of which is in use at any given time, while the other is kept in reserve (and is empty). A new pointer added to the zone structure points to the set of lists that is currently in use; whenever a CPU needs to access its local lists, it must follow that pointer to get to them.

When the time comes to raid a CPU's lists, the raiding CPU will use an atomic compare-and-swap operation to switch the target CPU's pointer to the second (empty) set of lists. The target CPU might still be working with the previous set of lists, though, even after the switch is done, so the raiding CPU must wait for an RCU grace period to pass before actually accessing the old lists. Since this is a per-CPU data structure, the target CPU cannot still hold a reference to the old list once the grace period has passed; at that point, the old lists are fair game and can be emptied out. The target CPU, meanwhile, continues merrily along, though without its local stash of free pages, without ever having been interrupted.
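
In kernel-style C, the scheme might look something like the sketch below. The structure and function names are invented for illustration (the real series operates on the per-CPU page lists within struct zone), return_pages_to_buddy() is a hypothetical stand-in, and the actual patches use a compare-and-swap so that concurrent drainers do not trip over each other:

    /* Kernel-context sketch; not a standalone program. */
    #include <linux/list.h>
    #include <linux/mm_types.h>
    #include <linux/rcupdate.h>

    struct pcp_lists {
        struct list_head free_list;
        int count;
    };

    struct pcp_slot {
        struct pcp_lists sets[2];      /* one set in use, one empty spare */
        struct pcp_lists __rcu *curr;  /* the set this CPU currently uses */
    };

    static void return_pages_to_buddy(struct pcp_lists *lists);  /* hypothetical helper */

    /* Fast path on the owning CPU: follow the pointer under RCU.  (The
     * real code also disables interrupts/preemption around this.) */
    static struct page *pcp_alloc_local(struct pcp_slot *slot)
    {
        struct pcp_lists *l;
        struct page *page = NULL;

        rcu_read_lock();
        l = rcu_dereference(slot->curr);
        if (l->count) {
            page = list_first_entry(&l->free_list, struct page, lru);
            list_del(&page->lru);
            l->count--;
        }
        rcu_read_unlock();
        return page;
    }

    /* Remote drain: point the target CPU at the spare (empty) set, wait
     * out a grace period so the owner cannot still be using the old set,
     * then hand the old set's pages back to the buddy allocator. */
    static void pcp_drain_remote(struct pcp_slot *slot)
    {
        struct pcp_lists *old = rcu_dereference_protected(slot->curr, true);
        struct pcp_lists *spare = (old == &slot->sets[0]) ? &slot->sets[1]
                                                          : &slot->sets[0];

        rcu_assign_pointer(slot->curr, spare);
        synchronize_rcu();
        return_pages_to_buddy(old);
    }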

This approach is not entirely free of performance costs either; adding an extra pointer dereference into the memory-allocation hot paths will add some overhead. Various benchmark results show little difference in most cases, and a 1-3% performance loss in some; the cover letter describes this cost as being acceptable.

Whether other memory-management developers will agree with that assessment remains to be seen. Kernel developers will work long and hard for a 1% performance increase; they may not be happy to give up that much performance for the benefit of a subset of use cases. In the end, though, the problem being solved is real, and it is not clear that a better solution is on offer. Exciting or not, remote per-CPU list draining may be a feature of future kernels.

Comments (7 posted)

Debian reconsiders NEW review

By Jonathan Corbet
February 11, 2022
The Debian project is known for its commitment to free software, the effort that it puts into ensuring that its distribution is compliant with the licenses of the software it ships, and the energy it puts into discussions around that work. A recent (and ongoing) discussion started with a query about a relatively obscure aspect of the process by which new packages enter the distribution, but ended up questioning the project's approach toward licensing and copyright issues. While no real conclusions were reached, it seems likely that the themes heard in this discussion, which relate to Debian's role in the free-software community in general, will play a prominent part in future debates.

Some background

The Debian project does not hand out the right to place packages into the distribution lightly. Prospective packagers must first become Debian developers via a lengthy process that involves working with a mentor and convincing that person that the candidate firmly understands Debian's philosophy and policies. A considerable amount of time may elapse between the initial application and the eventual invitation to throw their key into the keyring and become a proper Debian developer.

Even then, though, there is an obstacle to overcome in the form of the "NEW queue". Any new package added to the distribution — a package for a program that Debian has not previously distributed, for example — will be placed in the NEW queue for manual review prior to being accepted into the Debian repository. The review process checks that the package complies with Debian's policies in general, that it plays well with existing packages in the repository, and that it is something that Debian can legally distribute. It is, in a sense, the final quality-control step imposed by Debian before a new package can enter the repository.

FTP??
Younger readers may not be familiar with the file transfer protocol (FTP), which was once how we downloaded Linux distributions before writing them to a pile of diskettes. That is the source of the "ftpmasters" name; it's a reminder of how much worse things used to be.

The task of reviewing packages in the NEW queue falls to the "ftpmaster team" (or just "FTP team"). This team has a number of responsibilities, including actually running the Debian repository, moving packages around under the direction of the release managers, maintaining some of the project's packaging tools, and reviewing packages in the NEW queue. This team, thus, plays a crucial role in the overall operation of the Debian project.

The FTP team is also small; it currently contains five developers, helped by five assistants. Working on the FTP team is not a paid role, so its members are volunteers who are busy people with claims on their attention beyond what the team itself requires. But they are the only people who are empowered to review packages in the NEW queue and either accept or reject them. Given the overall size of Debian, it is easy to predict that the result will be a NEW queue that is longer than one might like.

As of this writing, the NEW queue contains 208 packages, some of which have been there for as long as 11 months. That number seems high, but it is down from nearly 300 earlier this week and a high of over 500 in November; see these plots for the history of the NEW queue's backlog. Putting a package into the NEW queue is a nondeterministic experience for a Debian developer; there is no way to know how long it will take to get a response. So it is not surprising that a lot of Debian developers dread their encounters with this queue.

NEW and SONAME

Given this context, it is not hard to understand where the initial query from Andreas Tille came from. If every package had to go through NEW, Debian would never get a distribution release out; there simply are not enough reviewers to keep up with what the developers are doing. So updates for existing packages bypass the NEW queue and go directly into the repository. It is only new packages that must pass through the review process.

That said, one has to understand what the definition of a "new package" is. "New", in this case, refers to the binary package(s) created from a source package, regardless of how long the source package has been part of Debian. So if, for example, the "fooutils" package gains a new program called "metafoo", and that program is put into its own "metafoo" binary package, a new trip through the NEW queue will be mandated, even if fooutils has been shipped by Debian for decades.

This policy has an interaction with shared-library packages that is at the core of Tille's question. Like many (or most) distributions, Debian names packages containing shared libraries using the "SONAME" associated with the library itself. A change of SONAME generally indicates that an updated package includes incompatible ABI changes, so the new library must be seen as being distinct from the previous version. If an application is linked against libfoo-5, trying to run it with libfoo-6 is unlikely to lead to joy. An SONAME change prevents that from happening, but it will also cause a name change for the resulting binary package.

That name change will cause the new package to be subjected to the rigors of the NEW queue. As noted above, this can be a painful experience, especially for maintainers of libraries that indulge in frequent SONAME changes. Updated libraries stuck in the NEW queue can, in turn, delay updates to other packages and generally lead to unhappiness; it is fair to say that, for developers in this situation, the NEW queue experience gets rather old. Tille's question, in short, was whether the policy could be interpreted to allow this kind of update to bypass NEW review and make developers' lives easier.

M. Zhou quickly responded that, while going through NEW on an SONAME change is painful, it can also be seen as a sort of necessary periodic check to ensure that the package has not gone out of compliance. As a way of making life easier while preserving that check, Zhou suggested a "lottery NEW queue" that would allow the checks to be bypassed some of the time.

Vincent Bernat, instead, described the NEW queue as a "hindrance" and suggested adopting a reputation-based system that would allow proven packagers to bypass it. Adam Borowski replied that the NEW queue is the only review that packages get and cannot be skipped: "Otherwise, we'd fall to the level of NPM. And there's ample examples what that would mean." Steve Langasek also defended the NEW queue in general, but said that forcing NEW review for an SONAME change is "a misapplication of resources" that hurts the project as a whole.

The question of licensing review

Jonas Smedegaard also expressed dislike for the NEW queue, but highlighted the copyright and licensing review part of the process as being important: "I just don't think the solution is to ignore copyright or licensing statements". Scott Kitterman, the only member of the FTP team who participated significantly in this discussion, made it clear that the licensing checks are not the only controls that happen during NEW queue review. But Russ Allbery shifted the discussion in his response to Smedegaard by saying that ignoring licensing isn't the objective of those who would like to change the rules for the NEW queue:

That's not the goal. The question, which keeps being raised in part because I don't think it's gotten a good answer, is what the basis is for treating copyright and licensing bugs differently than any other bug in Debian?

The NEW queue, he said, creates a lot of friction for the project as a whole; perhaps that friction could be reduced by treating licensing bugs like ordinary bugs — problems to be fixed when they are discovered. NEW review does not try to ensure that the package is bug-free; that is the developer's responsibility, and the project fixes bugs as they are found later on. Licensing issues, he suggested, could be treated the same way. He later added an observation that, as one might expect, licensing problems do occasionally slip through the review process and enter into the repository, but that there has been little in the way of consequences from these mistakes, suggesting that letting a few more problems get through would not pose a significant increase in risk for the project.

One suggestion that was raised a few times over the course of the discussion was opening up the review process beyond the small FTP team and encouraging more peer review of packages. As Alec Leamas pointed out, that is how Fedora reviews packages. Fedora developers with languishing packages often engage in "review swaps" to motivate reviews; Leamas suggested this leads to "a more transparent process". Philip Hands called the NEW process "a bit of a black box" and said that, if there were a way for developers to help get packages through NEW review, volunteers would be forthcoming.

Kitterman, though, was not attracted by the peer review idea:

I have zero sense that there's any real interest in improving the quality of the archive in this regard from people not on the FTP Team. If people think reviews by the broader community can replace New, I would invite you to get started on the work and demonstrate that there is sufficient interest in doing the work. There are plenty of in-archive issues to be found and fixed.

He was also unimpressed with the idea of doing less license review, saying that he "would certainly not support the notion that we have too few licensing documentation bugs in the archive". Hands raised another potential problem: packages that enter into the Debian repository are widely mirrored across the Internet and incorporated into downstream distributions. If Debian makes changes that increase the probability that inappropriate packages could enter the repository, it could be imposing risks on all of those redistributors as well.

Allbery persisted with his line of argument, though, pointing out that the cost of the NEW queue and the checks required there is high. Debian developers, he said, are sometimes willing to let bugs persist in published packages rather than deal with the NEW queue; "this indicates that we may have our priorities out of alignment". He went on to suggest that licensing review is not an all-or-nothing choice; packages seen as posing a higher risk could continue to be reviewed while others are exempt, for example. He reiterated a common suggestion from the discussion: if nothing else, NEW queue review could be skipped for new binary packages created from source packages that are already present in the distribution.

In the conclusion to that last note, Allbery touched on a core, but otherwise unspoken aspect to this discussion. The Debian project is well known for its adherence to free-software principles and its focus on getting licensing right. The project serves, in some ways, as the conscience for distributors in general; others may not always agree with Debian's decisions (its past disagreements with Mozilla, for example), but they always pay attention. Perhaps, he suggested, the project should reconsider whether it wants to continue as "the copyright notice and license catalog review system for the entire free software ecosystem". If Debian were to abandon that role, it could well leave an important gap in our community as a whole. But performing this service imposes significant costs on the project and, he suggested, perhaps the time has come to reconsider whether Debian wants to continue to do that work.

That said, while this discussion aired a lot of ideas, it did not result in anything resembling a policy change. For now, the NEW queue remains an obstacle for many packages entering the distribution, though this discussion might have spurred the effort that has reduced the size of the queue in recent days. Nothing may have been resolved, but history suggests that Debian will revisit the topic in the future; stay tuned.

Comments (43 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds