Leading items
Welcome to the LWN.net Weekly Edition for March 17, 2022
This edition contains the following feature content:
- Python finally offloads some batteries: the multi-year effort to remove some unloved standard-library modules approaches a conclusion.
- Removing SHA-1 for signatures in Fedora: what would it take to wean the Fedora distribution away from SHA-1 signatures?
- Toward a better list iterator for the kernel: working through a linked list is trickier than it appears.
- Random numbers and virtual-machine forks: duplicating a series of "random" numbers is bad news.
- Triggering huge-page collapse from user space: letting processes control where huge pages should be used.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Python finally offloads some batteries
Python has often been touted as a "batteries included" language because of its rich standard library that provides access to numerous utility modules and is distributed with the language itself. But those libraries need maintenance, of course, and that is provided by the Python core development team. Over the years, it has become clear that some of the modules are not really being maintained any longer and they probably are not really needed by most Python users—either because better alternatives exist or because they address extremely niche use cases. A long-running project to start the removal of those modules has recently been approved.
A 2018 Python Language Summit session was the start of Christian Heimes's quest to unplug some of the old batteries in the standard library. That discussion led to the first draft of PEP 594 ("Removing dead batteries from the standard library") in May 2019. It listed more than two dozen modules, scattered across the standard library, to consider for removal.
The PEP has been floating around in Python space since that time; in general, core developers have been favorably inclined toward the idea, though deciding exactly which modules would be removed was always difficult. The process of removing a module from the standard library starts with deprecation for two release cycles, followed by the actual removal. But the project has struggled with how to handle deprecations in the language over the last few years, as our Python article index entry shows.
Revival
Discussion of the PEP occurred sporadically in a thread on the Python discussion forum since Heimes first posted the PEP there in 2019. In early February, the PEP was revived by new co-author Brett Cannon in a new forum post. Cannon said that he expected to propose it for a decision by the steering council (of which he is a member) soon, "barring any major objections that come up in this topic". As can be seen in the update history section of the PEP, the list of modules to be removed has evolved over time.
The current version removes 21 separate modules that are described in the PEP abstract as: "mostly historic data formats (e.g. Commodore and SUN file formats), APIs and operating systems that have been superseded a long time ago (e.g. Mac OS 9), or modules that have security implications and better alternatives (e.g. password and login)."
The full list of modules that would be removed in the PEP is as follows:
Type            Modules
Data encoding   uu (and the associated uu codec) and xdrlib
Multimedia      aifc, audioop, chunk, imghdr, ossaudiodev, sndhdr, and sunau
Networking      asynchat, asyncore, cgi, cgitb, smtpd, nntplib, and telnetlib
OS interface    crypt, nis, and spwd
Miscellaneous   msilib and pipes
Comparing that table with the one in our article on the introduction of the PEP shows that the broad strokes are the same, but the details have changed somewhat. The removals were meant to be largely non-controversial, so if good reasons to keep a module were raised—and the maintenance burden was low—it was retained. The list is also different because some of the modules have already been removed. Modules that were considered for removal along the way, but retained (at least for now), were also described in the PEP along with the reasons behind keeping them. One of the more amusing reasons for retaining a module is for wave, which handles the WAV sound-file format:
According to David Beazley the wave module is easy to teach to kids and can make crazy sounds. Making a computer generate sounds is a powerful and highly motivating exercise for a nine-year-old aspiring developer. It’s a fun battery to keep.
The wave module also provides an example of the kinds of work that remains to be done if the modules are removed; wave relies on the audioop module that is being removed:
The module uses one simple function from the audioop module to perform byte swapping between little and big endian formats. Before 24 bit WAV support was added, byte swap used to be implemented with the array module. To remove wave’s dependency on audioop, the byte swap function could be either be moved to another module (e.g. operator) or the array module could gain support for 24-bit (3-byte) arrays.
A few of the to-be-removed modules were actually deprecated long ago—even more modules had been proposed for deprecation in two now-inactive PEPs—while the bulk of the modules would be deprecated for the upcoming Python 3.11 (due in October) and potentially removed in Python 3.13 (due in October 2024). Three modules, asynchat, asyncore, and smtpd, would be removed in Python 3.12 in 2023. This would be the biggest upheaval in the standard library for quite a long time, if not in its history.
The discussion thread on the PEP revival had relatively few comments, mostly corrections or clarifications, but there were a few muted complaints about some of the choices. The PEP does not specify what will happen to the modules after they are removed; the code will obviously still be available, so interested users could create Python Package Index (PyPI) modules or incorporate parts into other projects. "Vendoring" some pieces, by copying the code directly into an affected project (e.g. into the Roundup Issue Tracker) is another possibility.
On February 16, Cannon submitted the PEP to the steering council and on March 11, Gregory P. Smith accepted the PEP on behalf of the council. There were a few suggestions from the council as part of the acceptance, starting with backporting the deprecation notices into the module documentation for earlier—but still active—versions of the language, so that more developers will be aware of the upcoming removals.
In addition, the council asked that care be taken during the alpha and beta parts of the release cycle to ensure that there were no problems being caused. "If it turns out the removal of a module proves to be a problem in practice despite the clear deprecation, deferring the removal of that module should be considered to avoid disruption." We saw just that kind of deferral back in February when deprecated portions of two modules were causing problems for Fedora.
While Smith said that the council expects this kind of mass-removal event to be a one-time thing, it does mean that more ongoing attention should be paid to the contents of the standard library:
Doing a “mass cleanup” of long obsolete modules is a sign that we as a project have been ignoring rather than maintaining parts of the standard library, or not doing so with the diligence being in the standard library implies they deserve. Resolving ongoing discussions around how we define the stdlib for the long term does not block this PEP. It seems worthwhile for us to conduct regular reviews of the contents of the stdlib every few releases so we can avoid accumulating such a large pile of dead batteries, but this is outside the scope of this particular PEP.
urllib too?
At roughly the same time Cannon revived PEP 594, Victor Stinner was proposing the deprecation (and eventual removal) of the urllib module on the python-dev mailing list. As its name would imply, urllib is for handling URLs, but it does quite a bit more than just that. In his lengthy message, Stinner described a number of problems that he sees with the module, including a complicated API, many better alternatives, no support for HTTP/2 or HTTP/3, and a lack of maintenance, with lots of open issues, including some security issues.
There are four different sub-modules for urllib, with urllib.request, for opening URLs, and urllib.parse, for parsing URLs, as the most prominent and widely used of them. Stinner proposed deprecating all four, but, as he recognized, deprecating all, or even parts, of urllib is going to be controversial. It is likely going to be an uphill battle (and require a PEP of its own) as the discussion showed.
There were a number of objections raised, including Dong-hee Na's concern about the use of urllib in the pip PyPI-package installer. While Stinner thought that perhaps retaining urllib.parse would be sufficient for pip, Damian Shaw disagreed, noting that pip is dependent on parts of urllib.request as well. Beyond that, Shaw said that some of the alternative libraries mentioned by Stinner rely on parts of urllib too.
Paul Moore was strongly against the proposal; he said that while use of urllib.request was not a best practice, it is "still extremely useful for simple situations". Beyond that, the standard library itself uses parts of urllib heavily and dependencies on modules outside the standard library are unsuitable in some domains; he was concerned about pip as well:
[...] pip relies pretty heavily on urllib (parse and request), and pip has a bootstrapping issue, so using 3rd party libraries is non-trivial. Also, of pip's existing vendored dependencies, webencodings, urllib3, requests, pkg_resources, packaging, html5lib, distlib and cachecontrol all import urllib. So this would be *hugely* disruptive to the whole packaging ecosystem (which is under-resourced at the best of times, so this would put a lot of strain on us).
No one really directly refuted Stinner's contentions about the problems with urllib; the complaints and concerns were about removing it without adequate replacement in the standard library itself. Heimes generally agreed that the problems are real:
The urllib package -- and to some degree also the http package -- are constant source of security bugs. The code is old and the parsers for HTTP and URLs don't handle edge cases well. Python core lacks a true maintainer of the code. To be honest, we have to admit defeat and be up front that urllib is not up to the task for this decade. It was designed [and] written during a more friendly, less scary time on the internet.
He also said that if he "had the power and time", he would replace urllib with a simpler HTTP client that used the services provided by the underlying operating systems. For more complex uses, there are other options available in PyPI. Another possibility would be to "reduce the feature set of urllib to core HTTP (no ftp, proxy, HTTP auth)" coupled with a partial rewrite to make the other remaining pieces more standards-compliant and simpler.
Several were in favor of either of those options, though Smith felt that even the simplification options would cause "disruption to the world and loss of trust in Python". There are, it seems, lots of good reasons to keep urllib around, but the question of maintenance remains. No one volunteered to take on urllib and address some of its obvious problems, though Senthil Kumaran said that urllib.parse "is semi-maintained".
Given all of that, users of urllib can be pretty confident it will be around for quite a bit longer, perhaps forever. But the maintenance problem needs to be addressed somehow, given that urllib interacts with the internet and all of the inherent messiness and danger that comes along for the ride. With luck, the regular re-evaluation of the standard library's contents that the steering council recommended will also find ways to identify and address its maintenance holes. If not, it would seem that there are some ticking time bombs lurking there.
Removing SHA-1 for signatures in Fedora
Disruptive changes are not much fun for anyone involved, though they may be necessary at times. Moving away from the SHA-1 hash function, at least for cryptographic purposes, is probably one of those necessary disruptive changes. There are better alternatives to SHA-1, which has been "broken" from a cryptographic perspective for quite some time now, and most of the software components that make up a distribution can be convinced to use other hash functions. But there are still numerous hurdles to overcome in making that kind of a switch as a recent discussion on the Fedora devel mailing list shows.
How to?
On March 8, Alexander Sosedkin posted a lengthy message asking for help in "planning a disruptive change" to switch Fedora away from SHA-1. Currently, SHA-1 is disabled for TLS, he said, but SHA-1 use is more widespread than that; cryptographic signatures for packages may use SHA-1, and other libraries, such as OpenSSL, still use the older hash algorithm in some cases. But Fedora does not need to be the first to make the switch:
Good news is, RHEL-9 is gonna lead the way and thus will take a lot of the hits first. Fedora doesn't have to pioneer it. Bad news is, Fedora has to follow suit someday anyway, and this brings me to how does one land such a change.
He does not think this kind of change can be done in a single Fedora release cycle, in part because finding and fixing all of the uses of SHA-1 is going to require breaking things: "the only realistic way to weed out its reliance on SHA-1 signatures from all of its numerous dark corners is to break them. Make creation and verification fail in default configuration." But because Fedora has fairly short release cycles (six months or so between releases), even doing that will not give developers enough time to fix the problems—if they are even brought to light.
Maintainers need time to get bugs, look into them, think, analyze, react and test --- and that's just if it fails correctly! Unfortunately, it's not just that the error paths are as dusty as they get because the program counter has never set foot on them before. Some maintainers might even find that picking a different hash function renders their code non-interoperable, or even that protocols they implement have SHA-1 hardcoded in the spec. Or that everything is ready, but real world deployments need another decade. Or that on-disk formats are just hard to change and migrate.
He suggested several possible approaches, but did not think that simply announcing the change would be enough to cause all of the changes needed, likening it to the bypass announcements in The Hitchhiker's Guide to the Galaxy. Breaking the creation and use of SHA-1 signatures in, say, Fedora 37 Rawhide, backing it out for the release, then doing so again for Fedora 38 (without the back-out), might be one way. Another might be to break it in Rawhide and leave it that way, but unbreak it for the first release. "But these are all rather... crude?" He would rather find a smoother process, preferably one that had been used before.
Breaking things
Zbigniew Jędrzejewski-Szmek was not in favor of breaking things in order to flush out the problem areas. "We should make newer hashes the default, and we can make tools print a warning if sha1 is used where it shouldn't, but please don't break things on purpose." He also noted that SHA-1 is perfectly reasonable to use in non-cryptographic settings. Beyond that, existing signatures that use SHA-1 need to be verifiable "essentially forever" and signatures using it will be provided going forward—Fedora cannot dictate signature policies to others—so being able to check them will always be important. The situation is similar to self-signed TLS certificates, where disallowing them completely "will only result in users jumping to different tools".
The focus of the effort is only on blocking cryptographic signatures using SHA-1, Simo Sorce said; that means "Certificates, TLS session setup, DNSSEC (although a lot of signatures are still SHA-1 based there ...), VPNs session establishment, PGP, etc..." While he did recognize that verifying SHA-1 signatures would still need to be supported by Fedora, he said that by default it should be blocked. Verifying signatures on older content should be reserved for local data, not things coming over the network:
This is only reasonable for stuff like emails, where you may have a reasonable expectation that the archived messages have not been tampered with after the fact. Allowing verification of signatures with SHA-1 for any "online" communication would be pointless.
Neal Gompa was concerned that "breaking" SHA-1 verification might also "break Fedora's *own* GPG keys and the GPG keys of preferred third party repositories". He encountered a problem of that sort when a similar change was made for CentOS, but Demi Marie Obenour said that all of the official RPM packages back to at least Fedora 25 have SHA-256 signatures.
Gompa listed several third-party repositories (RPM Fusion, Google, Copr, and Microsoft) that might be affected; it turns out that Fedora's Copr repository was affected by the CentOS 9 change that Gompa mentioned, as reported more fully by Gary Buhrmaster. There is a bug report tracking the switch to use SHA-256 signatures for Copr going forward.
Sosedkin disagreed with Jędrzejewski-Szmek's idea that warnings would suffice; "Cryptographic libraries aren't in position to do any meaningful logging, and even if they did, it'd be gleefully ignored until we break things." He was skeptical that there was any other way to make this kind of change, though backward compatibility is important: "Yes, you can still make a modern Fedora do RC4 or DES, with just a little bit of extra configuration. No, we must phase things out of defaults, there's no way around this."
The configuration he is referring to is to change the setting in /etc/crypto-policies/config and then run update-crypto-policies as described in a feature added for Fedora 21. There are three pre-defined settings available: LEGACY, DEFAULT, and FUTURE. Those will reconfigure various libraries to different levels of cryptographic strength. Since the policy option was added, support for more libraries has been added, and the policies themselves updated for Fedora 28 and for Fedora 33.
RHEL 9
Daniel P. Berrangé asked what has been learned from the under-development RHEL 9, which will help lead the way for Fedora. He asked about the number and percentage of RHEL 9 packages that were affected by the switch in order to get a better idea what the impact will be for Fedora. "Assuming RHEL-9 has dealt with the problems, Fedora should inherit those fixes, which gives us a good base for the most commonly used / important packages in Fedora." He suggested that perhaps there might only be "a few important remaining packages" that need to be evaluated, leaving the others to potentially fail, given that there is the LEGACY fallback.
Sosedkin seemed skeptical of that approach. He noted that RHEL 9 is not yet widely available, so the number of broken packages for it is still unknown. Furthermore, RHEL has roughly 10% of the packages that Fedora carries, so "leaving the 90% of the packages you've labelled 'less important' to be 'fixed after the fact' is gonna be a disaster." He reiterated that he did not think a single cycle would be sufficient time to handle the problems.
While the 90% of Fedora packages will not have been tested as part of RHEL 9, Berrangé said, that does not mean they are likely to be affected by this change. "Only a relatively small subset will do anything crypto related, and most of that just be HTTPS and completely outsourced to a crypto library." He thought it possible that only a few of those packages will be using cryptographic signatures with SHA-1, and even fewer "will be considered important enough to be blockers for this change".
An explicit example of the kinds of problems that will be encountered was pointed out by Alexander Bokovoy. The Kerberos network-authentication protocol requires the use of SHA-1 "due to interoperability issues and also due to compliance with RFCs". He noted a March 4 bugzilla entry that reported the problem; the "fix" for RHEL 9 will be to override any restrictions on SHA-1 in OpenSSL for Kerberos, while other options "will be discussed upstream". Fedora will pick up that fix, of course, but it does show that these kinds of problems will crop up unexpectedly.
Warnings
Berrangé also brought up the idea of using "a scary warning message" on stderr or in the system logs. That could be turned on for a release cycle to see which bugs get reported. Sosedkin said that warnings to the log files would result in zero bugs being filed, while stderr reports would simply result in bugs about "'$CRYPTOLIB has no business messing up my stderr/stdout', which we'll promptly close by reverting the changes".
Fedora project leader Matthew Miller agreed with the thought, though it was "a little more cynically-phrased than I'd put it". He suggested logging the problem and having a tool that specifically looked for those log messages; "we could do something like a Test Day where people could send in reports". While Sosedkin still thought that logging from cryptographic libraries was "risky territory", it was a better approach than logging to stderr; "Especially if it accompanies a drawn-out multicycle change, it could be a noticeable impact dampener."
The problem with logging, however, is that it may not be available to the library, depending on the execution environment of the program, Berrangé said:
Security policies applied to [processes] (SELinux, seccomp, namespaces or containerization in general) can easily prevent ability to use journald, syslog, or opening of log files, leading to messages not being visible. At worst, with seccomp an attempt to use the blocked feature may lead to an abort of the application.
Sosedkin was concerned that applications might be reopening the stderr file descriptor in strange ways, resulting in "writing our scary messages who knows where", but Berrangé said that is fairly unlikely precisely because various libraries already print warnings and errors to stderr when needed. "Even glibc will print to stderr when it sees malloc corruption, and stuff in stderr will end up in the systemd journal for any system services."
SSH?
Older SSH protocols use SHA-1, so Richard W.M. Jones wondered if the change would cause problems using SSH. "This broke before, requiring us to advise users to set the global policy for the machine to LEGACY (thus ironically weakening crypto for everything)." He also noted that he has some "ancient network equipment" that requires the LEGACY setting on Fedora in order to be able to connect to it. He asked if the changes Sosedkin wants to make will further impact SSH.
SSH uses SHA-1 as a hash-based message authentication code (HMAC), Sosedkin said, "which doesn't rely on collision resistance", so no change needs to be made. Obenour agreed that using SHA-1 as an HMAC is reasonable, though there are alternatives that perform better; "There are no known attacks on HMAC-SHA-1".
Goals
Along the way, Sosedkin listed his objectives in raising the issue:
I want to know, ultimately, how can I break it for devs but not the users so that
- the need for it is communicated early,
- all the relevant places are timely identified,
- maintainers have a plenty of time to resolve it for good or opt out,
- and the users are affected as late and as smoothly as possible.
Hope these are all reasonable things to wish for.
As yet, it does not seem that there is a clear plan that would meet those goals. It is a difficult problem, but one that seems likely to recur over time. Cryptographic algorithms and protocols change over time, generally because weaknesses are found in them; once they become entrenched in a distribution, and the internet itself, they are hard to excise. Finding a way to remove SHA-1 smoothly—as smoothly as possible, at least—will be a useful thing to figure out. It seems likely that more discussion will be required to get there.
Toward a better list iterator for the kernel
Linked lists are conceptually straightforward; they tend to be taught toward the beginning of entry-level data-structures classes. It might thus be surprising that the kernel community is concerned about its longstanding linked-list implementation and is not only looking for ways to solve some problems, but has been struggling to find that solution. It now appears that some improvements might be at hand: after more than 30 years, the kernel developers may have found a better way to safely iterate through a linked list.
Kernel linked lists
C, of course, makes the creation of linked lists relatively easy. What it does not do, though, is help in the creation of generic linked lists that can contain any type of structure. By its nature, C lends itself to the creation of ad hoc linked lists in every situation where they are needed, resulting in boilerplate code and duplicated definitions. Every linked-list implementation must be reviewed for correctness. It would be far nicer to have a single implementation that was known to work so that kernel developers could more profitably use their time introducing bugs into code that is truly unique to their problem area.
The kernel, naturally, has a few solutions for linked lists, but the most commonly used is struct list_head:
    struct list_head {
        struct list_head *next, *prev;
    };
This structure can be employed in the obvious way to create doubly linked lists; each element's next pointer refers to the following element and its prev pointer to the preceding one, with the whole list linked into a circle.
struct list_head can represent a linked list nicely, but has one significant disadvantage: it cannot hold any other information. Usually, this kind of data structure is needed to link some other data of interest; the list structure by itself isn't the point. C does not make it easy to create a linked list with an arbitrary payload, but it is easy to embed struct list_head inside the structure that the developer actually wants to organize into a list:
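For illustration, a minimal sketch of such an embedding; struct foo and its fields are invented here rather than taken from the kernel:

    #include <linux/list.h>

    struct foo {
        int payload;                /* the data the developer actually cares about */
        struct list_head list;      /* links this foo into a list of other foos */
    };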
This is how linked lists are typically constructed in the kernel. Macros like container_of() can be used to turn a pointer to a list_head structure into a pointer to the containing structure. Code that works with linked lists will almost always use this macro (often indirectly) to gain access to the larger payload.
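Continuing the invented struct foo above, a sketch of that recovery step: given a pointer to the embedded list_head, container_of() subtracts the member's offset to arrive at the enclosing structure.

    static struct foo *foo_from_list(struct list_head *lh)
    {
        /* lh must point to the 'list' member of some struct foo */
        return container_of(lh, struct foo, list);
    }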
One final detail that is worthy of note is that the actual head of the list tends to be a list_head structure that is not embedded within the structure type of interest:
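A sketch of such an anchor, again using the invented struct foo; the kernel's LIST_HEAD() macro declares and initializes an empty, standalone list_head:

    static LIST_HEAD(foo_list);     /* the anchor; not embedded in any struct foo */

    static void foo_add(struct foo *f)
    {
        list_add_tail(&f->list, &foo_list);     /* append f to the list */
    }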
For a real-world example of how this infrastructure is used, consider struct inode, which represents a file within a filesystem. Inodes can be on a lot of lists simultaneously, so struct inode contains no less than five separate list_head structures; unfortunately, your editor's meager artistic skills are not up to the task of showing what the resulting data structure looks like. One of those list_head structures, i_sb_list, is used to associate the inode with the superblock of the filesystem it belongs to. The list_head structure that anchors this list is the s_inodes field of struct super_block. That is the one list_head structure in this particular list that is not embedded within an instance of struct inode.
Traversal of a linked list will typically begin at the anchor and follow the next pointers until the head is found again. One can, of course, open-code this traversal, but the kernel also provides a long list of functions and macros for this purpose. One of those is list_for_each_entry(), which will go through the entire list, providing a pointer to the containing structure at each node. Typical code using this macro looks like this:
    struct inode *inode;
    /* ... */
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
        /* Process each inode here */
    }
    /* Should not use the iterator here */
Within the loop, the macro uses container_of() to point inode at the containing inode structure for each list entry. The problem is: what is the value of inode on exit from the loop? If code exited the loop with a break statement, inode will point to the element under consideration at that time. If, however, execution passes through the entire list, inode will be the result of using container_of() on the separate list head, which is not contained within an inode structure. That puts the kernel deeply into undefined-behavior territory and could lead to any of a number of bad things.
For this reason, the rule for macros like list_for_each_entry() is that the iterator variable should not be used outside of the loop. If a value needs to be accessed after the loop, it should be saved in a separate variable for that purpose. It's an implicit rule, though; nobody felt the need to actually note this restriction in the documentation for the macros themselves. Unsurprisingly, this rule is thus more of a guideline at best; the kernel is full of code that does, indeed, use the iterator variable after the loop.
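The conventional workaround, sketched here with an invented predicate and helper, is to copy the entry of interest into a separately declared pointer before breaking out of the loop, and to use only that pointer afterward:

    struct inode *inode, *found = NULL;

    list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
        if (inode_is_interesting(inode)) {      /* hypothetical predicate */
            found = inode;
            break;
        }
    }
    if (found)
        process_inode(found);                   /* hypothetical helper; found is NULL or a real inode */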
The search for a safer iterator
When we last looked at this issue, Jakob Koschel had posted patches fixing some of these sites; he continued this project afterward. Linus Torvalds, however, thought that this approach was inadequate because it did nothing to prevent future problems from being introduced:
So if we have the basic rule being "don't use the loop iterator after the loop has finished, because it can cause all kinds of subtle issues", then in _addition_ to fixing the existing code paths that have this issue, I really would want to (a) get a compiler warning for future cases and (b) make it not actually _work_ for future cases. Because otherwise it will just happen again.
Along the way, the developers came to the realization that moving to a newer version of the C standard might help, since it would allow the declaration of the iterator variable within the loop itself (thus making it invisible outside of the loop). Torvalds made an initial attempt at a solution that looked like this:
    #define list_for_each_entry(pos, head, member) \
        for (typeof(pos) __iter = list_first_entry(head, typeof(*pos), member); \
             !list_entry_is_head(__iter, head, member) && (((pos)=__iter),1); \
             __iter = list_next_entry(__iter, member))
This version of the macro still accepts the iterator variable as an argument, keeping the same prototype as before; this is important, since there are thousands of instances of this macro in the kernel source. But it declares a new variable to do the actual iteration, and only sets the passed-in iterator within the loop itself. Since the loop itself may never be executed (if the list is empty), the possibility exists that it will not set the iterator, so it could be uninitialized afterward.
This version was quickly followed by a second attempt, described as "a work of art":
    #define list_for_each_entry(pos, head, member) \
        for (typeof(pos) pos = list_first_entry(head, typeof(*pos), member); \
             !list_entry_is_head(pos, head, member); \
             pos = list_next_entry(pos, member))
Now the loop-scope iteration variable is declared with the same name as the outer variable, shadowing it. With this version, the iterator variable declared in the outer scope will never be used within the loop at all.
Torvalds's hope with both of these attempts was that this would cause the compiler to generate warnings if the (outer) iterator was used outside the loop, since it will no longer have been initialized by the loop itself. That did not work, though; there are places in the code that explicitly initialize the iterator now and, in any case, the "use of uninitialized variable" warning is disabled in the kernel due to excessive false positives.
James Bottomley suggested a different approach:
    #define list_for_each_entry(pos, head, member) \
        for (pos = list_first_entry(head, typeof(*pos), member); \
             !list_entry_is_head(pos, head, member) && ((pos = NULL) == NULL); \
             pos = list_next_entry(pos, member))
This version would explicitly set the iterator variable to NULL on exit from the loop, causing any code that uses it to (presumably) fail. Torvalds pointed out the obvious problem with this attempt: it changes the semantics of a macro that is widely used throughout the kernel and would likely introduce bugs. It would also make life difficult for developers backporting patches to stable kernels that didn't have the newer semantics.
Yet another approach was proposed by Xiaomeng Tong:
    #define list_for_each_entry_inside(pos, type, head, member) \
        for (type * pos = list_first_entry(head, type, member); \
             !list_entry_is_head(pos, head, member); \
             pos = list_next_entry(pos, member))
Tong's patch set created a new set of macros, with new names, with the idea that existing code could be converted over one usage at a time. There would be no externally declared iterator at all; instead, the name and type of the iterator are passed as arguments, and the iterator is declared within the scope of the loop itself. Torvalds, however, disliked this approach as well. Its use leads to long, difficult-to-read lines of code in almost every use and, he said, puts the pain in the wrong place: "We should strive for the *bad* cases to have to do extra work, and even there we should really strive for legibility".
A solution at last?
After having rejected various solutions, Torvalds went off to think about what a good solution might look like. Part of the problem, he concluded, is that the type of the containing structure is separate from the list_head structure, making the writing of iterator macros harder. If those two types could be joined somehow, things would be easier. Shortly thereafter, he came up with a solution that implements this idea. It starts with a new declaration macro:
    #define list_traversal_head(type,name,target_member) \
        union { struct list_head name; type *name##_traversal_type; }
This macro would be used to declare the real head of the list — not the list_head entries contained within other structures. Specifically, it declares a variable of this new union type containing a list_head structure called name, and a pointer to the containing structure type called name_traversal_type. The pointer is never used as such; it is just a way of tying the type of the containing structure to the list_head variable.
Then, there is a new iterator:
    #define list_traverse(pos, head, member) \
        for (typeof(*head##_traversal_type) pos = list_first_entry(head, typeof(*pos), member); \
             !list_entry_is_head(pos, head, member); \
             pos = list_next_entry(pos, member))
Code can walk through a list by using list_traverse() instead of list_for_each_entry(). The iterator variable will be pos; it will only exist within the loop itself. The anchor of the list is passed as head, while member is the name of the list_head structure within the containing structure. The patch includes a couple of conversions to show what the usage would look like.
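As a rough sketch of what a converted user might look like, based on the inode example above (the actual conversions in Torvalds's patch may differ in detail):

    struct super_block {
        /* ... */
        list_traversal_head(struct inode, s_inodes, i_sb_list);
        /* ... */
    };

    list_traverse(inode, &sb->s_inodes, i_sb_list) {
        /* "inode" is declared by the macro and exists only inside this loop */
    }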
This, Torvalds thinks, is "the way forward". Making this change is probably a years-long project; there are over 15,000 uses of list_for_each_entry() (and variants) within the kernel. Each of those will eventually need to be changed, and the declaration of the list anchor must also change at the same time. So it is not a quick fix, but it could lead to a safer linked-list implementation in the kernel in the long run.
One might argue that all of this is self-inflicted pain caused by the continued use of C in the kernel. That may be true, but better alternatives are in somewhat short supply. For example, since the Rust language, for all of its merits, does not make life easy for anybody wanting to implement a linked list, a switch to that language would not automatically solve the problem. So kernel developers seem likely to have to get by with this kind of tricky infrastructure for some time yet.
Random numbers and virtual-machine forks
One of the key characteristics of a random-number generator (RNG) is its unpredictability; by definition, it should not be possible to know what the next number to be produced will be. System security depends on this unpredictability at many levels. An attacker who knows an RNG's future output may be able to eavesdrop on (or interfere with) network conversations, compromise cryptographic keys, and more. So it is a bit disconcerting to know that there is a common event that can cause RNG predictability: the forking or duplication of a virtual machine. Linux RNG maintainer Jason Donenfeld is working on a solution to this problem.
The kernel's RNG maintains an "entropy pool" from which random numbers are derived. As randomness from the environment is harvested, it is mixed into the pool, keeping the level of entropy up. Every running system has its own pool, naturally, with its own internal state. If two systems were to somehow end up with their entropy pools containing the same data, they would produce the same sequence of random numbers, for a while at least. That is something that should never happen.
But, as Donenfeld pointed out in a patch set first released in February, there is a way that two systems can end up with the same entropy-pool content. If a running virtual machine is somehow duplicated, the entropy pool will be duplicated with it. This can happen if a machine is checkpointed and restored, or if it forks for any reason. Once the restored or forked machine starts running, it will reproduce the sequence of random data created by the previous instance until the addition of new entropy perturbs the pool.
Microsoft, it seems, has already addressed this concern in Windows; the solution takes the form of a "virtual-machine ID" made available via the ACPI firmware. When a virtual machine forks or is restarted, the ID is changed. The kernel can watch this value and, on noticing that it has changed, conclude that some sort of virtual-machine fork has occurred and that action is necessary to keep the random-data stream unique. Some virtualization systems, including QEMU, have implemented this functionality, so it makes sense for Linux to make use of it as well.
The patch set thus adds a new "vmgenid" driver that makes the virtual-machine ID available to the kernel. When this driver is notified (by the ACPI firmware) of a change, it checks the ID and, if that has changed, calls a new function (crng_vm_fork_inject()) to inform the RNG that it needs to muddy up the entropy pool. This is done by mixing in that same virtual-machine ID. It is not claimed to be the ultimate in security, but it does address the immediate problem; Donenfeld intends to merge this work for 5.18.
This project does not stop there, though; in a later email, Donenfeld described the changes he would like to make next. He started by complaining about the design of Microsoft's solution, which has a race condition designed into it. The kernel cannot respond to a virtual-machine fork until it notices that the generation ID has changed; the new virtual machine could run for some time before that happens, and it will generate duplicate random numbers during that time. It would have been better, he said, to provide a simple "generation counter" that could be quickly polled by the CPU every time random data is requested; that would allow a change to be caught immediately. "But that's not what we have, because Microsoft didn't collaborate with anybody on this, and now it's implemented in several hypervisors".
Having gotten that off his chest, he proceeded to the real task at hand: propagating the news about a virtual-machine fork to other interested kernel subsystems. He originally envisioned creating a new power-management event to serve as a sort of notifier, but that was seen as an awkward fit; a virtual-machine fork isn't really related to power management in the end. So Donenfeld posted a new solution creating a separate notifier (using the kernel's existing notifier mechanism) to inform subsystems. The first user is, unsurprisingly, the WireGuard VPN, which needs to know about such events:
When a virtual machine forks, it's important that WireGuard clear existing sessions so that different plaintext is not transmitted using the same key+nonce, which can result in catastrophic cryptographic failure.
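In-kernel consumers like WireGuard would presumably hook into this through the kernel's standard notifier API. A minimal sketch, assuming a registration helper named register_random_vmfork_notifier(); the exact interface in the patch set may differ:

    #include <linux/notifier.h>
    #include <linux/random.h>

    static int wg_vmfork_notify(struct notifier_block *nb,
                                unsigned long action, void *data)
    {
        /* clear sessions and ephemeral keys here */
        return NOTIFY_DONE;
    }

    static struct notifier_block wg_vmfork_nb = {
        .notifier_call = wg_vmfork_notify,
    };

    static int __init wg_vmfork_init(void)
    {
        /* assumed helper for registering on the VM-fork notifier chain */
        return register_random_vmfork_notifier(&wg_vmfork_nb);
    }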
User-space code may benefit from knowing about virtual-machine forks as well; for example, the Samba server may want to reset sessions in that situation. For user space, Donenfeld proposes adding a new virtual file that programs can poll for VM-fork notifications. This file is currently located in /proc/sys, even though it is not a true sysctl knob in that it cannot be written to as a way of tuning system behavior.
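From user space, waiting for such an event would presumably look like polling any other sysctl-style file; a sketch, with the file's path left as a parameter since the final location is not given here:

    #include <fcntl.h>
    #include <poll.h>

    /* Block until the kernel signals a VM fork on the given /proc/sys file. */
    static int wait_for_vmfork(const char *path)
    {
        struct pollfd pfd = { .events = POLLPRI };  /* pollable proc files raise POLLPRI/POLLERR */

        pfd.fd = open(path, O_RDONLY);
        if (pfd.fd < 0)
            return -1;
        return poll(&pfd, 1, -1);
    }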
Response to this work has been positive, overall; kernel developers tend to have little appetite for catastrophic cryptographic failure. That said, Greg Kroah-Hartman did observe: "It seems crazy that the 'we just were spawned as a new vm' notifier is based in the random driver, but sure, put it here for now!" So this work, too, seems destined for merging for the 5.18 kernel release. That should help to close a vulnerability that many of us may not have ever been aware existed.
Triggering huge-page collapse from user space
When the kernel first gained support for huge pages, most of the work was left to user space. System administrators had to set aside memory in the special hugetlbfs filesystem for huge pages, and programs had to explicitly map memory from there. Over time, the transparent huge pages mechanism automated the task of using huge pages. That mechanism is not perfect, though, and some users feel that they have better knowledge of when huge-page use makes sense for a given process. Thus, huge pages are now coming full circle with this patch set from Zach O'Keefe returning huge pages to user-space control.
Huge pages, of course, are the result of larger page sizes implemented by the CPU; the specific page sizes available depend on the processor model and its page-table layout. An x86 processor will normally, for example, support a "base" page size of 4KB, and huge pages of 2MB and 1GB. Huge pages dispense with the bottom layer (or layers) of the page-table hierarchy, speeding the address-translation process slightly. The biggest performance advantage that comes from huge pages, though, results from the reduced pressure on the processor's scarce translation lookaside buffer (TLB) slots. One 2MB huge page takes one TLB slot; when that memory is accessed as base pages, instead, 512 slots are needed. For some types of applications the speedup can be significant, so there is value in using huge pages when possible.
That said, there are also costs associated with huge pages, starting with the fact that they are huge. Processes do not always need large, virtually contiguous memory ranges, so placing all process memory in huge pages would end up wasting a lot of memory. The transparent huge pages mechanism tries to find a balance by scanning process memory and finding the places where huge pages might make sense; when such a place is found, a range of base pages is "collapsed" into a single huge page without the owning process being aware that anything has changed.
There are costs to transparent huge pages too, though. The scanning itself takes CPU time, so there are limits to how much memory the khugepaged kernel thread is allowed to scan each second. The limit keeps the cost of khugepaged within reason, but also reduces the rate at which huge pages are used, causing processes that could benefit from them to run in a more inefficient mode for longer.
The idea behind O'Keefe's patch set is to allow user space to induce huge-page collapse to happen quickly in places where it is known (or hoped) that use of huge pages will be beneficial. The idea was first suggested by David Rientjes in early 2021, and eventually implemented by O'Keefe. Beyond allowing huge-page collapse to happen sooner, O'Keefe says, this work causes the CPU time necessary for huge-page collapse to be charged to the process that requests it, increasing fairness.
It also allows the process to control when that work is done. Data stored in base pages will be scattered throughout physical memory; collapsing those pages into a huge page requires copying the data into a single, physically contiguous, huge page. This, in turn, requires blocking changes to those pages during the copy and uses CPU time, all of which can increase latency, so there is value in being able to control when that work happens.
A process can request huge-page collapse for a range of memory with a new madvise() request:
int madvise(void *begin, size_t length, MADV_COLLAPSE);
This call will attempt to collapse length bytes of memory beginning at begin into huge pages. There does not appear to be any specific alignment requirement for those parameters, even though huge pages do have alignment requirements. If begin points to a base page in the middle of the address range that the huge page containing it will cover, then pages before begin will become part of the result. In other words, begin will be aligned backward to the proper beginning address for the containing huge page. The same is true for length, which will be increased if necessary to encompass a full huge page.
There are, of course, no guarantees that this call will succeed in creating huge pages; that depends on a number of things, including the availability of free huge pages in the system. Even if the operation is successful, a vindictive kernel could split the huge page(s) apart again before the call returns. If at least some success was had, the return code will be zero; otherwise an error code will be returned. A lack of available huge pages, in particular, will yield an EAGAIN error code.
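A minimal sketch of how a process might use the new request; MADV_COLLAPSE is not defined in released user-space headers, so the numeric value below is a placeholder assumption:

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25    /* placeholder value; not yet in mainline uapi headers */
    #endif

    /* Ask the kernel to collapse an already-populated, huge-page-sized region. */
    static int collapse_region(void *buf, size_t len)
    {
        if (madvise(buf, len, MADV_COLLAPSE) != 0) {
            perror("madvise(MADV_COLLAPSE)");   /* EAGAIN: no huge pages available */
            return -1;
        }
        return 0;
    }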
Support for MADV_COLLAPSE is also added to process_madvise(), allowing one process to induce huge-page collapse in another. In this case, there are a couple of flags that are available (these would be the first use of the flags argument to process_madvise()):
- MADV_F_COLLAPSE_LIMITS controls whether this operation should be bound by the limits on huge-page collapse that khugepaged follows; these are set via sysctl knobs in existing kernels. If the calling process lacks the CAP_SYS_ADMIN capability, then the presence of this flag is mandatory. It is arguably a bit strange to require an explicit flag to request the default behavior, but that's the way of it.
- MADV_F_COLLAPSE_DEFRAG, if present, allows the operation to force page compaction to create free huge pages, even if the system configuration would otherwise not allow that. This flag does not require any additional capabilities, perhaps because the cost of compaction would be borne by the affected process itself.
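For the process_madvise() variant, a sketch might look like the following; the flag name comes from the patch set rather than any released kernel, its numeric value is a placeholder, and MADV_COLLAPSE is assumed to be defined as in the previous sketch:

    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef MADV_F_COLLAPSE_LIMITS
    #define MADV_F_COLLAPSE_LIMITS 0x1  /* placeholder value; flag taken from the patch set */
    #endif

    /* Collapse a range in another process (identified by pidfd), honoring the
       khugepaged limits, as required for callers without CAP_SYS_ADMIN. */
    static long collapse_in_process(int pidfd, void *addr, size_t len)
    {
        struct iovec iov = { .iov_base = addr, .iov_len = len };

        return syscall(SYS_process_madvise, pidfd, &iov, 1,
                       MADV_COLLAPSE, MADV_F_COLLAPSE_LIMITS);
    }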
The end result, O'Keefe says, is a mechanism that allows user space to take control of the use of huge pages, perhaps to the point that the kernel need no longer be involved:
Though not required to justify this series, hugepage management could be offloaded entirely to a sufficiently informed userspace agent, supplanting the need for khugepaged in the kernel.
First, though, this work would need to make it into the mainline kernel.
Most of the review comments thus far are focused on details, but David Hildenbrand did take exception to one aspect of this new operation's behavior. In the current patch series, huge pages will be created for any virtual memory area, even those that have been explicitly marked to not use huge pages with an madvise(MADV_NOHUGEPAGE) call. That, he said, "would break KVM horribly" on the s390 architecture. This behavior will thus need to change.
The current patch set only works with anonymous pages; the plan is to add support for file-backed pages at a later time. Since one of the stated justifications for this patch is to be able to quickly enable huge pages for executable text, support for file-backed pages seems important, and the developers are likely to want to see it before giving this work the go-ahead. The feature looks like it will be useful for some use cases, though, so it seems likely to find its way into the mainline sooner or later.
Page editor: Jonathan Corbet