
Leading items

Welcome to the LWN.net Weekly Edition for August 3, 2023

This edition contains the following feature content:

  • GIL removal and the Faster CPython project: the Python Steering Council signals its intent to accept PEP 703.
  • Flags for fchmodat(): a new system call will finally bring in-kernel AT_SYMLINK_NOFOLLOW support.
  • Unmaintained filesystems as a threat vector: the risks of automounting orphaned filesystems like HFS.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

GIL removal and the Faster CPython project

By Jake Edge
August 2, 2023

The Python global interpreter lock (GIL) has long been a barrier to increasing the performance of programs by using multiple threads—the GIL serializes access to the interpreter's virtual machine such that only one thread can be executing Python code at any given time. There are other mechanisms to provide concurrency for the language, but the specter of the GIL—and its reality as well—have often been cited as a major negative for Python. Back in October 2021, Sam Gross introduced a proof-of-concept, no-GIL version of the language. It was met with a lot of excitement at the time, but seemed to languish to a certain extent for more than a year; now, the Python Steering Council has announced its intent to accept the no-GIL feature. It will still be some time before it lands in a released Python version—and there is the possibility that it all has to be rolled back at some point—but there are several companies backing the effort, which gives it all a good chance to succeed.
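As a concrete illustration (not from the article), a CPU-bound workload spread across threads gains nothing under the GIL: all of the threads run to completion, but only one of them executes Python bytecode at any instant, so the wall-clock time is roughly that of a sequential run. A minimal sketch:

```python
import threading

def count_primes(limit):
    """CPU-bound work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

results = []
threads = [
    threading.Thread(target=lambda: results.append(count_primes(5000)))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four threads produce the same answer, but under the GIL they
# cannot run the trial division in parallel; four cores do not make
# this finish any faster than one.
print(results)
```

Multiprocessing or C extensions that release the GIL are the traditional workarounds; the no-GIL work aims to make plain threads like these actually scale.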

After its introduction in 2021, and the discussion around that, the next public appearance for the feature was at the 2022 Python Language Summit in April. Gross gave a talk about his no-GIL fork in the hopes of getting some tacit agreement on proceeding with the work. That agreement was not forthcoming, in part because the full details and implications of a no-GIL interpreter were not really known. Meanwhile, the Faster CPython project, which came about in mid-2021, had been working along on its plan to increase the single-threaded performance of the interpreter. Mark Shannon reported on the status of that effort at the 2022 summit as well. He also authored PEP 659 ("Specialized Adaptive Interpreter") that describes the kinds of changes being made, some of which have found their way into Python 3.10 and 3.11.

At this year's PyCon, two of the Faster CPython team gave talks describing the techniques they have been using to improve the performance of the interpreter: Brandt Bucher looked at adaptive instructions, while Shannon described memory layout improvements and other optimizations. Given the GIL, nearly all existing Python programs are single threaded, so improving the performance of those programs will effectively speed up the entire Python world. One of the concerns that has been heard about no-GIL Python is what its impact on single-threaded programs would be.

PEP 703

In January 2023, Łukasz Langa posted the first version of PEP 703 ("Making the Global Interpreter Lock Optional in CPython") that is authored by Gross; Langa is sponsoring the PEP as a core developer. As might be guessed, that set off a lengthy thread, with, once again, a lot of excitement. There were also some concerns expressed with regard to the implications of not having a GIL, especially for Python extensions written in C; since the GIL protects that code from many concurrency problems, removing it might well lead to bugs.

One thing that everyone wants to avoid is another "flag day" transition like that of Python 2 to 3. The huge and unfortunate impact of Python 3 being incompatible with its predecessor was not foreseen—the core developers vastly underestimated the growing popularity of the language, for one thing—but that mistake will not be repeated. Any switch to remove the GIL will need to smoothly work with code that is not (yet) ready for it.

There was a question from Shannon about "what people think is an acceptable slowdown for single-threaded code". To a large extent, that question went unanswered in the thread, but he had estimated an impact "in the 15-20% range, but it could be more, depending on the impact on PEP 659".

Another Faster CPython team member, Eric Snow, posted a lengthy analysis with a bunch of questions, which he summarized as: "tl;dr I'm really excited by this proposal but have significant concerns, which I genuinely hope the PEP can address." He noted that he was the author of a "competing" concurrency option in PEP 684 ("A Per-Interpreter GIL"), along with the related PEP 683 ("Immortal Objects, Using a Fixed Refcount"), though he does not truly see multiple sub-interpreters, each with their own GIL, as being incompatible with the no-GIL work. Much of his concern was focused on the impacts on the C extensions (which is also a problem for PEP 684, though to a lesser extent), but single-threaded performance was also mentioned. Gross replied that the impact on the extensions was not completely negative:

There are also substantial benefits to extension module maintainers. The PEP includes quotes from a number of maintainers of widely used C API extensions who suffer the complexity of working around the GIL. For example, Zachary DeVito, PyTorch core developer, wrote "On three separate occasions in the past couple of months… I spent an order-of-magnitude more time figuring out how to work around GIL limitations than actually solving the particular problem."

Updated PEP

The thread had mostly run its course by the end of January. In early May, Gross posted an updated version of PEP 703, along with an implementation based on the in-progress Python 3.12. There was just one response early on (which Gross replied to). On May 12, Gross asked the steering council to decide on the PEP. As it turned out, there was still a lot more discussion to go before any decision would be made.

On June 2, Shannon posted a performance assessment of the PEP with some pretty eye-opening numbers (that were disputed) on the impact of the changes; his estimates of the impact ranged from 11 to 30%. He also noted that removing the GIL had some negative impacts on the existing and planned Faster CPython work:

The adaptive specializing interpreter relies on the GIL; it is not thread-friendly. If NoGIL is accepted, then some redesign of our optimization strategy will be necessary. While this is perfectly possible, it does have a cost. The effort spent on this redesign and resulting implementation is not being spent on actual optimizations.

Shannon has noted that he is not a fan of the free-threading, shared-memory concurrency model; his assessment ends with a suggestion that sub-interpreters provide a better concurrency solution with fewer of the performance and other concerns that no-GIL brings. Others, including steering council member Gregory P. Smith, found that analysis to be somewhat oversimplified. Langa posted benchmark numbers that showed considerably less impact than Shannon's estimates. Langa followed that up with some additional results that correspond closely with what Gross had reported in the PEP.

Guido van Rossum, who heads up the Faster CPython team, wanted to ensure that everyone learned from the mistakes made in the past:

If there's one lesson we've learned from the Python 2 to 3 transition, it's that it would have been very beneficial if Python 2 and 3 code could coexist in the same Python interpreter. We blew it that time, and it set us back by about a decade.

Let's not blow it this time. If we're going forward with nogil (and I'm not saying we are, but I can't exclude it), let's make sure there is a way to be able to import extensions requiring the GIL in a nogil interpreter without any additional shenanigans – neither the application code nor the extension module should have to be modified in any way [...]

Meanwhile, Smith replied to Gross's steering-council request (and copied it to the forum thread):

The steering council is going to take its time on this. A huge thank you for working to keep it up to date! We're not ready to simply pronounce on 703 as it has a HUGE blast radius.

[...] That does not mean "no" to this. There is demand for it. (personally, I've wanted this since forever!) It's just that it won't be easy and we'll need to consider the entire ecosystem and how to smoothly allow such a change to happen without breaking the world.

I'm glad to see the continued discussion thread with faster-cpython folks in particular piping up. The intersection between this work and ongoing single threaded performance improvements will always be high and we don't want to hamper that in the near term.

Gross largely disagreed with Shannon's assessment and, in particular, with his characterization of threading. He was also, seemingly, somewhat unhappy with Smith's reply:

You wrote that the Steering Council's decision does not mean "no," but the steering council has not set a bar for acceptance, stated what evidence is actually needed, nor said when a final decision will be made. Given the expressed demand for PEP 703, it makes sense to me for the steering committee to develop a timeline for identifying the factors it may need to consider and for determining the steps that would be required for the change to happen smoothly.

Without these timelines and milestones in place, I would like to explain that the effect of the Steering Council's answer is a "no" in practice. I have been funded to work on this for the past few years with the milestone of submitting the PEP along with a comprehensive implementation to convince the Python community. Without specific concerns or a clear bar for acceptance, I (and my funding organization) will have to treat the current decision-in-limbo as a "no" and will be unable to pursue the PEP further.

That obviously put pressure on the council, as did the users who were clamoring for a no-GIL Python, but the decision is clearly not a simple one. On June 14, more pressure was applied from the Faster CPython team. Van Rossum described some of the costs of no-GIL, but also expressed concern about waiting for a decision:

We've had a group discussion about how our work would be affected by free threading. Our key conclusion is that merging nogil will set back our current results by a significant amount of time, and in addition will reduce our velocity in the future. We don't see this as a reason to reject nogil – it's just a new set of problems we would have to overcome, and we expect that our ultimate design would be quite different as a result. But there is a significant cost, and it's not a one-time cost. We could use help from someone (Sam?) who has experience thinking about the problems posed by the new environment.

[...] In the meantime we're treading water, unsure whether to put our efforts in continuing with the current plan, or in designing a new, thread-safe optimization architecture.

Fast, free threading

The next day, Shannon started a new thread (titled: "A fast, free threading Python") that described three possible options for a way forward. It started with a lengthy description of the tradeoffs for optimization of a dynamic language like Python. Of the three aspects that he thinks need to be considered, single-threaded performance, parallelism, and mutability, the last has mostly been glossed over in earlier discussions, "but it is key":

It isn't quite:
Performance, parallelism, mutability: pick two.
but more like:
Performance, parallelism, mutability: pick one to restrict.

He also cautioned that there are some unknowns:

Performing the optimizations necessary to make Python fast in a free-threading environment will need some original research. That makes it more costly and a lot more risky.

The options for the steering council amount to choosing a fast single-threaded interpreter as currently planned, a no-GIL free-threading interpreter with an unknown (but non-zero) impact on single-threaded performance, or both at the same time. His preference is for both, but he is concerned that the council might choose no-GIL without also committing to the rest of the work needed:

Please don't choose option 2 [no-GIL] hoping that we will get option 3 [both], because "someone will sort out the performance issue". They won't, unless the resources are there.

If we must choose option 1 [current Faster CPython plans] or 2, then I think it has to be option 1. It gives us a speedup at much lower cost in CPUs and energy, by doing things more efficiently rather than just throwing lots of cores at the problem.

Marc-André Lemburg asked about a phased approach, where, effectively, GIL or no-GIL were chosen at the command line; over time, the two could slowly be merged. "Or would this not be feasible because the 'slow merge' would actually require redesigning the whole specialization approach?" Smith replied that he thinks that is more or less what PEP 703 is proposing; even though Shannon basically recommended against it, Smith thinks pursuing both at once is possible:

I'd more or less expect work on specialization to proceed in parallel without worrying if those benefits cannot yet be available in a free threaded build for a few releases. Turning it mostly into an additional code maintenance and test matrix burden on the CPython core dev side to keep both our still-primary single threaded GIL based interpreter and the experimental free threaded build working.

I figure this is basically exactly what Mark claims not to want. Presumably due to the interim added build and maintenance complexity. But also seems like the most likely way to get to his "both" option 3 that I suspect we all magically wish would just happen.

Smith followed that up by noting that free threading will need to be addressed at some point; even if the Faster CPython plans work out and Python 3.15 is five times faster than Python 3.10, nobody will "be satisfied at 'just 5x' in the end". Van Rossum agreed, but was also concerned that the council "might be betting on hope as a strategy" by choosing no-GIL and hoping for the best.

Like Mark, I hope that you're choosing (3) – like Mark says, it's clearly the best option. But we will need to be honest about it, and accept that we need more resources to improve single-threaded performance. (And, as I believe someone already pointed out, it will also be harder to do future maintenance on CPython's C code, since so much of it is now exposed to potential race conditions. This is a problem for a language that's for a large part maintained by volunteers.)

The talk of "more resources" led Itamar Oren to wonder what that means: "It's not clear to me to what extent the SC [steering council] is in a position to tie PEP acceptance or rejection to allocation of funding." Van Rossum replied that Microsoft was committed to continue funding the team and that "our charter is not limited to single-threaded performance work", but that there is extra work to do in a no-GIL world:

Meanwhile, we can start adapting the specialization and optimization work to a no-GIL world, with the goal of obtaining Mark's Option 3 (free threading and faster per-thread performance). Ideally we would reach a state where we can make no-GIL the one and only build mode without a drop in single-threaded performance (important for apps that haven't been re-architected, e.g. apps that currently use multi-processing, or algorithms that are hard to parallelize).

It is this latter step (getting to Option 3) that requires extra resources – for example, it would be great if Meta or another tech company could spare some engineers with established CPython internals experience to help the core dev team with this work.

Finally, I want to re-emphasize that while Microsoft has a team using the Faster CPython moniker, we don't intend to own CPython performance – we believe in good citizenship and want to contribute in a way that puts our skills and experience to the best possible use for the Python community.

Van Rossum did not just choose Meta out of a hat, here; Gross works for the company, which presumably funded his no-GIL work, and the Cinder CPython fork is maintained by a team at Meta. Carl Meyer said that he expected the Cinder team to work on no-GIL Python. In fact, on July 7, Meyer announced that Meta would fund work on the no-GIL interpreter:

If PEP 703 is accepted, Meta can commit to support in the form of three engineer-years (from engineers experienced working in CPython internals) between the acceptance of PEP 703 and the end of 2025, to collaborate with the core dev team on landing the PEP 703 implementation smoothly in CPython and on ongoing improvements to the compatibility and performance of nogil CPython.

On July 19, Anaconda followed suit. Stan Seibert said that the company would fund work on the "packaging challenges that will be associated with adopting PEP 703, including any work on pip, cibuildwheel, and conda-forge that will be needed to get nogil-compatible packages into the hands of the Python community". Some of that funding commitment likely helped the council reach a verdict, but the results of a core-developer poll on no-GIL also pushed the council in the direction of accepting the PEP. That poll showed 87% of 46 voters thought that free-threaded Python should be actively pursued and 63% of 38 voters said that they were willing to help support and maintain a no-GIL Python based on PEP 703.

Steering council decision

On July 28, council member Thomas Wouters announced that the council would be accepting PEP 703, though it was "still working on the acceptance details". The idea would be to introduce the no-GIL version of the interpreter in order to give everyone a chance to figure out what pieces are missing, so that they can be filled in before no-GIL becomes the default and, eventually, the only version of Python. The time frame for that transition is estimated to be around five years, but there will be no repeat of earlier mistakes:

We do not want another Python 3 situation, so any changes in third-party code needed to accommodate no-GIL builds should just work in with-GIL builds (although backward compatibility with older Python versions will still need to be addressed). This is not Python 4. We are still considering the requirements we want to place on ABI compatibility and other details for the two builds and the effect on backward compatibility.

As was noted in the various discussions, there is more to removing the GIL than simply adopting a PEP. Wouters made it clear that the core developers will need to gain experience with no-GIL Python so that they can lead the rest of the community:

We will probably need to figure out new C APIs and Python APIs as we sort out thread safety in existing code. We also need to bring along the rest of the Python community as we gain those insights and make sure the changes we want to make, and the changes we want them to make, are palatable.

If the Python community finds that the switch is "just going to be too disruptive for too little gain", the council wants to be able to change its mind anytime before declaring no-GIL as the default mode for the language. He outlined the steps that the council sees, starting with a short-term (perhaps for Python 3.13, which is due in October 2024) experimental no-GIL build of the interpreter that core developers and others can try out. In the medium term, no-GIL would be a supported option, but not the default; when that happens depends a lot on how quickly the community adopts and supports the no-GIL build. In the long term, no-GIL would be the default build and the GIL would be completely excised ("without unnecessarily breaking backward compatibility"). Along the way, periodic reviews will be needed:

Throughout the process we (the core devs, not just the SC) will need to re-evaluate the progress and the suggested timelines. We don't want this to turn into another ten year backward compatibility struggle, and we want to be able to call off PEP 703 and find another solution if it looks to become problematic, and so we need to regularly check that the continued work is worth it.

As might be guessed, that spawned multiple congratulatory and excited-for-the-future responses, though there are a few who think that keeping the GIL would be a better choice for the language. The announcement presumably also sent the Faster CPython folks back to their drawing boards; though there were some accusations of turf wars in the discussions, that did not really seem to be the case. The Faster CPython team simply wanted to ensure that all of the costs were taken into consideration; overall, the team seems quite excited to work on surmounting the challenges of producing a no-GIL interpreter, with minimal (or, ideally, no) performance impact on single-threaded code.

It is quite a turning point in the history of the language, but the work is (obviously) not done yet. There is a huge amount of researching, coding, testing, experimenting, documenting, and so on between here and a no-GIL-only version of the language in, say, Python 3.17 in October 2028. One guesses that the work will not be done, then, either—there will be more optimizations to be found and applied if there is still funding available to do so. Meanwhile, we have yet to dig into the details of the PEP itself; that will come soon. We will be keeping an eye on the no-GIL development process as it plays out over the coming years as well.

Comments (23 posted)

Flags for fchmodat()

By Jonathan Corbet
July 27, 2023
The fchmodat() system call on Linux hides a little secret: it does not actually implement all of the functionality that the man page claims (and that POSIX calls for). As a result, C libraries have to do a bit of a complicated workaround to provide the API that applications expect. That situation looks likely to change with the 6.6 kernel, though, as the result of this patch series posted by Alexey Gladkov.

The prototype for fchmodat() is defined as:

    int fchmodat(int fd, const char *path, mode_t mode, int flag);

Its purpose is to change the permissions of the file identified by path to the given mode. In the style of all the *at() system calls, fd can be an open file descriptor referring to a directory; if path is relative, the lookup process will start at the directory indicated by fd rather than the current working directory. The flag argument can be either zero or AT_SYMLINK_NOFOLLOW.
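For illustration (this example is not from the article), Python's os.chmod() exposes the same *at()-style semantics: passing dir_fd makes a relative path resolve against that directory descriptor, mirroring fchmodat(). A short sketch, using throwaway names:

```python
import os
import stat
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, "data"), "w").close()

# A descriptor for the directory, playing the role of fchmodat()'s
# fd argument.
dfd = os.open(d, os.O_RDONLY)
try:
    # The relative path is resolved against dfd, not the current
    # working directory, just like fchmodat(dfd, "data", 0o640, 0).
    os.chmod("data", 0o640, dir_fd=dfd)
finally:
    os.close(dfd)

print(oct(stat.S_IMODE(os.stat(os.path.join(d, "data")).st_mode)))
```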

Support for fchmodat() was added to the Linux kernel for the 2.6.16 release in 2006 as part of a series from Ulrich Drepper adding a number of the *at() calls. That version of fchmodat(), though, did not include the flag argument, a situation that continues to the present. As a result, the kernel's fchmodat() implementation is not compliant with the specification, and is not what application developers will expect. That, in itself, is not entirely unusual; applications do not (usually) invoke system calls directly. Instead, they use wrappers in a low-level library, usually the C library, which do what is needed to provide the expected API. That is what happens here, but the result is not ideal.

The POSIX specification defines the behavior of the AT_SYMLINK_NOFOLLOW flag as: "If path names a symbolic link, then the mode of the symbolic link is changed". That behavior differs from the default, where the mode of the file pointed to by that link will be changed instead. There are two reasons why one might want a flag like this: to actually change permissions on a symbolic link, and, more importantly, to prevent the changing of permissions on a real file by way of a symbolic link. Attackers have been known to use symbolic links to confuse a privileged program into changing file modes that should not be changed; using this flag will prevent such an outcome.

If one looks at the (functionally identical) fchmodat() implementations in the GNU C library and musl libc, two things jump out: implementing AT_SYMLINK_NOFOLLOW in user space is inelegant at best and, due to limitations in Linux itself, neither library is able to implement exactly what the specification says (but they are able to provide the important part).

The C-library implementations start by opening the file indicated by the fd and path arguments to fchmodat() as an O_PATH file descriptor. Such a descriptor allows metadata operations, but cannot be used to read or write the file; thus, it does not require read or write permission on the file to open. That open() call also uses the O_NOFOLLOW flag; if the path ends with a symbolic link, that will cause the link itself to be opened, rather than the file pointed to.

At this point, the C libraries do an fstatat64() call to determine what kind of file has just been opened; if the new file descriptor turns out to be a symbolic link, an EOPNOTSUPP failure status will be returned to the caller. The Linux kernel does not support changing the permission bits on a symbolic link in general (those bits have no real meaning anyway), so neither C-library implementation even tries.

If the target is not a symbolic link, the library could just issue a normal fchmodat() call with the given parameters and no flag. That, however, could open the door to a time-of-check-to-time-of-use vulnerability, where an attacker would replace the file with a symbolic link between the check and the mode change. So, instead, the library must change the mode bits on the file that it actually opened in the first step, without using the path name again. Unfortunately, the obvious way (using fchmod()) won't work, because that system call cannot operate on O_PATH file descriptors in many filesystems. So, instead, the C library generates the path for the open file descriptor under /proc/self/fd, then passes that to chmod() to effect the mode change.
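The sequence described above can be sketched in Python using the os module; this is a Linux-only illustration of the technique (it needs /proc), not the actual C-library code, and the function name is made up:

```python
import errno
import os
import stat

def fchmodat_nofollow(dirfd, path, mode):
    """Emulate fchmodat(dirfd, path, mode, AT_SYMLINK_NOFOLLOW) in
    user space, the way glibc and musl do (sketch; Linux-only)."""
    # Open the target as an O_PATH descriptor: metadata-only, so no
    # read or write permission on the file is required.  O_NOFOLLOW
    # makes a trailing symbolic link open the link itself rather
    # than whatever it points to.
    fd = os.open(path, os.O_PATH | os.O_NOFOLLOW | os.O_CLOEXEC,
                 dir_fd=dirfd)
    try:
        # If we ended up with a symbolic link, give up: Linux cannot
        # change the mode bits of a symlink, so the C libraries
        # return EOPNOTSUPP here.
        if stat.S_ISLNK(os.fstat(fd).st_mode):
            raise OSError(errno.EOPNOTSUPP,
                          os.strerror(errno.EOPNOTSUPP), path)
        # Change the mode through /proc/self/fd so that the path
        # name is never resolved a second time, closing the
        # time-of-check-to-time-of-use window.
        os.chmod(f"/proc/self/fd/{fd}", mode)
    finally:
        os.close(fd)
```

A caller sees either the mode change on the real file or an EOPNOTSUPP error for a symlink; at no point can a swapped-in symlink redirect the chmod() to another file.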

This sequence seems unlikely to be the most efficient way to prevent the following of a symbolic link for an fchmodat() call. It also will fail to work in settings where /proc is not available. A much nicer solution would be to just implement the AT_SYMLINK_NOFOLLOW flag in the kernel, which already has the needed machinery to do so in an atomic and efficient manner.

That is what Gladkov's patch series does: it creates a new fchmodat2() system call that implements the AT_SYMLINK_NOFOLLOW flag. Once this system call is available in released kernels, the C-library implementations can use it for their implementation of fchmodat(), bypassing the current workarounds. The result should be a faster and more robust implementation. Chances are that change will happen soon; VFS maintainer Christian Brauner has applied the series and routed it into linux-next, meaning that it should be pushed during the 6.6 merge window.

Interestingly, this is not the first attempt to add an fchmodat2() implementation; there were patches posted by Rich Felker in 2020 and Greg Kurz in 2017. It is not entirely clear why the patches were not accepted at that time; it may be simply because VFS patches have occasionally tended to fall through the cracks over the years. The previous failure may be part of why Felker responded rather negatively to a suggestion from David Howells that, perhaps, it would be better to add a new set_file_attrs() system call, with a number of new features, rather than completing fchmodat(). That suggestion has not gained much support, so Gladkov's attempt appears to be the one that will actually succeed; after 17 years in the kernel, fchmodat() should finally get in-kernel AT_SYMLINK_NOFOLLOW support.

Comments (13 posted)

Unmaintained filesystems as a threat vector

By Jonathan Corbet
July 28, 2023
One of the longstanding strengths of Linux, and a key to its early success, is its ability to interoperate with other systems. That interoperability includes filesystems; Linux supports a wide range of filesystem types, allowing it to mount filesystems created by many other operating systems. Some of those filesystem implementations, though, are better maintained than others; developers at both the kernel and distribution levels are currently considering, again, how to minimize the security risks presented by the others.

HFS (and HFS+) in the kernel

Back in January, the syzbot fuzzing system reported a crash with the HFS filesystem. For those who are not familiar with HFS, it is the native filesystem used, once upon a time, by Apple Macintosh computers. Its kernel configuration help text promises that users "will be able to mount Macintosh-formatted floppy disks and hard drive partitions with full read-write access". It seems that, in 2023, there is little demand for this capability, so the number of users of this filesystem is relatively low.

The amount of maintenance it receives is also low; it was marked as orphaned in 2011, at which point it had already seen some years of neglect. So it is not all that surprising that the syzbot-reported problem was not fixed or, even, given much attention. At the end of the brief discussion in January, Viacheslav Dubeyko, who occasionally looks in on HFS (and the somewhat more modern HFS+ filesystem as well), said that there was nothing to be done in the case where a filesystem has been deliberately corrupted.

On July 20, Dmitry Vyukov (who runs syzbot) restarted the discussion by pointing out that the consequences of a bug in HFS can extend beyond the small community of users of that filesystem: "Most popular distros will happily auto-mount HFS/HFS+ from anything inserted into USB (e.g. what one may think is a charger). This creates interesting security consequences for most Linux users". There is an important point in that message that is worth repeating: users may not be aware that the device they are plugging into their computer contains a filesystem at all. One often sees warnings about plugging random USB sticks into a computer, but any device — or even a charging cable — can present a block device with a filesystem on it. If the computer mounts that filesystem automatically, "interesting security consequences" may indeed follow.

The new round of discussion still has not resulted in the problem being fixed. Instead, some developers called for the removal of the HFS and HFS+ filesystems entirely. Matthew Wilcox said: "They're orphaned in MAINTAINERS and if distros are going to do such a damnfool thing, then we must stop them". Dave Chinner argued that the kernel community needs to be more aggressive about removing unmaintained filesystems in general:

We need to be much more proactive about dropping support for unmaintained filesystems that nobody is ever fixing despite the constant stream of corruption- and deadlock-related bugs reported against them.

Linus Torvalds, though, was unimpressed, saying that, instead, distributors should just fix the behavior of their systems. The lack of a maintainer, he added, is not a reason to remove a filesystem that people are using; "we have not suddenly started saying 'users don't matter'". That brought the discussion to an end, once again, with no fix for the reported bug in sight.

Distribution changes

As the conversation was reaching an end on the linux-kernel list, it picked up on debian-devel. There, Marco d'Itri asked the kernel developers to simply blacklist HFS and HFS+ from being used for automounting filesystems. Matthew Garrett, though, pointed out that the kernel, which cannot completely block automounting without disabling the filesystem type entirely, was probably the wrong place to solve the problem. Instead, he suggested, a udev rule could be used to prevent those filesystems from being automounted, while keeping the capability available for users who manually mount HFS or HFS+ filesystems.
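For illustration, such a rule might look like the following hypothetical sketch, which assumes a udisks2-based desktop (udisks honors the UDISKS_AUTO device hint); the file name is made up:

```
# /etc/udev/rules.d/90-hfs-no-automount.rules (hypothetical)
# Ask udisks not to automount HFS and HFS+ volumes; explicit,
# user-initiated mounts keep working.
SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="hfs", ENV{UDISKS_AUTO}="0"
SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="hfsplus", ENV{UDISKS_AUTO}="0"
```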

Shortly thereafter, Garrett raised the issue on the Fedora development list as well, suggesting the addition of a udev rule once again. There, some participants saw that rule as perhaps improving the situation, but others, including Zbigniew Jędrzejewski-Szmek and Michael Catanzaro, pointed out that, if a user wants to see the files contained within a filesystem image, they will do what is needed to mount it, even if that mounting does not happen automatically. Solomon Peachy suggested that adopting this policy would only result in an addition to the various "things to fix after installing Fedora" lists telling users how to turn automounting back on.

Nobody mentioned the possibility that the user was not expecting a given device to have a filesystem at all. Requiring such a filesystem to be mounted manually would presumably address that problem, since most users would not go to the trouble of mounting a filesystem that they did not expect to be there in the first place. But, as Demi Marie Obenour pointed out, a malicious filesystem image could be employed willingly by a user to take control of a locked-down system:

Unfortunately, this original threat model is out of date. kernel_lockdown(7) explicitly aims to prevent root from compromising the kernel, which means that malformed filesystem images are now in scope, for all filesystems. If a filesystem driver is not secure against malicious filesystem images, then using it should fail if the kernel is locked down, just like loading an unsigned kernel module does.

In that case, it seems, disabling automounting would not be a sufficient fix; the vulnerable filesystem type would need to be disabled entirely.

There is an aspect of the problem that has not received as much attention as it might warrant, though Eric Sandeen did touch on it: the number of filesystem implementations in Linux that are robust in the face of a maliciously corrupted image is quite close to zero. Many filesystems can deal with corruption resulting from media errors and the like; checksums attached to data and metadata will catch such problems. Malicious corruption, instead, will have correct checksums, entirely bypassing that line of defense. Filesystem developers who have thought about this problem are mostly unanimous in saying that it cannot readily be solved — the space for possible attacks is simply too large.

So, while unmaintained filesystems like HFS may provide a sort of low-hanging fruit for attackers, they are not the sole cause of the problem. Intensively maintained filesystems, including ext4, Btrfs, and XFS, are also susceptible to malicious filesystem images. So even removing support entirely for the older, unmaintained filesystem types would not solve the problem.

In the Debian discussion, Garrett suggested that risky filesystems could be mounted as FUSE filesystems in user space, thus making it much easier to contain any ill effects — "but even though this has been talked about a bunch I haven't seen anyone try to implement it". On the Fedora side, Richard W. M. Jones suggested that libguestfs, which mounts filesystems within a virtual machine, could be used. Once again, that would contain the results of any sort of exploitation attempt.

If the objective is truly to make it safe for users to mount untrusted filesystems, some sort of isolation will almost certainly prove to be necessary. Making most filesystem implementations robust against malicious filesystem images just does not seem to be an attainable goal in the near future — even if resources were being put toward that goal, which is not happening to any great extent. It is not a simple solution, and the result will have a performance cost, but security often imposes such costs.

Comments (50 posted)

A virtual filesystem locking surprise

By Jonathan Corbet
July 31, 2023
It is well understood that concurrency makes programming problems harder; the high level of concurrency inherent in kernel development is one of the reasons why kernel work can be challenging. Things can get even worse, though, if concurrent access happens in places where the code is not expecting it. The long story accompanying this short patch from Christian Brauner is illustrative of the kind of problem that can arise when assumptions about concurrency prove to be incorrect.

Within the kernel, struct file is used to represent an open file. It contains the information needed to work with that file, including an extensive operations vector, a reference count, a pointer to the associated inode, the current read/write position, and more. Since there can be multiple references to an open file, there must be a way to serialize access to this structure. The f_lock spinlock is used in most cases, but there is also a mutex called f_pos_lock that is used for access to the file position.

Acquiring and releasing locks has a cost of its own. Many I/O operations affect the file position, so an I/O-intensive workload can end up repeatedly taking and releasing f_pos_lock, increasing the overhead imposed by the kernel. As it happens, though, having multiple references to an open file is a relatively rare occurrence. If there is only a single reference to a given file, concurrent access to the file position cannot happen and that lock overhead is wasted. To avoid this waste, the function that acquires f_pos_lock (__fdget_pos()) contains an optimization:

    if (file_count(file) > 1)
        mutex_lock(&file->f_pos_lock);

(The code has been simplified slightly to highlight the relevant part). The idea here is simple enough: if there is only a single reference to the file, concurrent access cannot happen and there is no point in taking the lock, so the mutex_lock() call is skipped.
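The hazard hiding in that optimization can be sketched with a user-space analogue. This is illustrative pthreads code, not the kernel's implementation; the names file_like and fdget_pos_like are invented for the example.

```c
#include <pthread.h>
#include <stdatomic.h>

/* User-space analogue of the __fdget_pos() optimization; all names
 * here are invented for illustration, not taken from the kernel. */
struct file_like {
    atomic_int count;         /* plays the role of f_count */
    pthread_mutex_t pos_lock; /* plays the role of f_pos_lock */
    long pos;                 /* plays the role of f_pos */
};

/* Returns nonzero if the lock was taken; the caller must then unlock. */
static int fdget_pos_like(struct file_like *f)
{
    if (atomic_load(&f->count) > 1) {
        pthread_mutex_lock(&f->pos_lock);
        return 1;
    }
    /* Single reference: the lock is elided.  This is only safe if the
     * count cannot concurrently grow past one, which is exactly the
     * assumption that io_uring fixed files and pidfd_getfd() break. */
    return 0;
}
```

If a second reference appears after the count was sampled at one, two callers can both update pos with no lock held between them, which is the scenario Brauner ran into.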

The io_uring subsystem has been under intensive development since its introduction in 2019; it is rapidly becoming an independent interface to much kernel functionality. There are currently efforts underway to add io_uring operations corresponding to waitid(), futexes, and getdents(). That last patch, making the getdents() system call available in io_uring, is relevant here because getdents() relies heavily on the file position (and, possibly, state kept by the filesystem implementation) to allow a process to read through a long directory in multiple calls.

The "fixed files" feature of io_uring is also relevant here; it lets a file be used numerous times in io_uring operations without the per-call overhead required with regular system calls. That overhead, which includes acquiring a reference to the file and validating the process's access to it, can be significant in I/O-heavy applications; fixing a file makes it possible to pay that cost only once, improving performance. When a file is fixed into io_uring, a new reference is created, so the reference count will increase. The process can, however, close its own file descriptor after fixing it in io_uring, leaving the fixed-file reference as the only one. The reference count will, as a result, drop back to one. It will also stay there while I/O operations on the file are underway in io_uring; the whole point of fixing the file is to avoid the cost of repeatedly gaining and releasing references.

Brauner pointed out a problem in the getdents() patch: if a file has been fixed in io_uring, and its reference count is one, it will be possible to run multiple getdents() operations concurrently within io_uring, each of which will access f_pos without taking the lock. The results of this concurrency are highly unlikely to be what the developer was hoping for. One might argue that this is a "then don't do that" sort of situation but, as Brauner described in his patch addressing the problem, io_uring is not the only way to run into trouble.

In 2020, the kernel acquired an interesting system call named pidfd_getfd(), which allows a suitably privileged process to extract an open file descriptor from a running process. This operation can be useful for, among other things, enabling a privileged supervisor process to perform operations that another process cannot perform on its own; opening a file outside of a container might be one example. For this to work, the file descriptor created by pidfd_getfd() must refer to the same open file structure as the descriptor in the target process. It creates a second reference to that structure, and the reference count is duly incremented to reflect that.

A problem arises, though, if the target process has a getdents() call underway when its file descriptor is grabbed by pidfd_getfd(). Since, when getdents() was called, the file's reference count was one, the target process will not have acquired f_pos_lock. If the process that obtained the file descriptor with pidfd_getfd() also passes it to getdents(), things can go wrong. The second call will see the elevated reference count and acquire f_pos_lock but, since the first call did not acquire that lock, that acquisition will succeed immediately and the two getdents() calls will run concurrently, once again with something other than the intended results.

The fix is easy enough: simply remove the check on f_count and acquire f_pos_lock unconditionally. That will impose a performance cost, but nobody seems to have been worried enough about it to actually measure it. Linus Torvalds applied the patch for the 6.5-rc4 release after editing the changelog (which he described as "*way* too much", but which your editor found most useful). He also complained about how pidfd_getfd() shares the file structure, saying it would have been better to simply reopen the file (creating a new file structure); that would defeat the purpose for pidfd_getfd(), though, since the new file descriptor would no longer be usable to perform actions on the other process's behalf.

Torvalds remains grumpy about the shared access to struct file created by pidfd_getfd(), but it seems like it is here to stay. In any case, this problem has been fixed, clearing the way for the (eventual) use of getdents() on fixed files in io_uring. But it provides an example of how subtle assumptions regarding concurrency can go wrong in surprising ways.

Comments (10 posted)

Challenges for KernelCI

By Jake Edge
August 1, 2023

EOSS

Kernel testing is a perennial topic at Linux-related conferences and the KernelCI project is one of the larger testing players. It does its own testing but also coordinates with various other testing systems and aggregates their results. At the 2023 Embedded Open Source Summit (EOSS), KernelCI developer Nikolai Kondrashov gave a presentation on the testing framework, its database, and how others can get involved in the project. He also had some thoughts on where KernelCI is falling short of its goals and potential, along with some ideas of ways to improve it.

Kondrashov works for Red Hat on its Continuous Kernel Integration (CKI) project, which is an internal continuous-integration (CI) system for the kernel that also aims to run tests for kernel maintainers who are interested in participating. CKI works with KernelCI by contributing data to its KCIDB database, which is the part of KernelCI that he works on. He noted that he was giving the talk from the perspective of someone developing a CI system and participating in KernelCI, rather than as a KernelCI maintainer or developer. His hobbies include embedded development, which is part of why he was speaking at EOSS, he said.

[Nikolai Kondrashov]

There are a number of different kernel-testing efforts going on, including the Intel 0-day testing, syzbot, Linux Kernel Functional Testing (LKFT), CKI, KernelCI, and more. Each system has its own report email format and its own dashboard for viewing results. KernelCI has its own set of results from tests run in the labs it works with directly, while those results and the results from other testing efforts flow directly into KCIDB. Having a single report format and dashboard for the myriad of testing results is one of the things that the KCIDB project is working on. "Conceptually it is very simple", he said; various submitters send JSON results from their tests, the failures get reported by email to those who have subscribed to receive them, the results also get put into a database, which then can be displayed in the dashboard.
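To make the flow concrete, a simplified submission might look something like the following. The field names here are loosely modeled on the KCIDB I/O schema rather than copied from it; treat the specific keys and values as assumptions for illustration only.

```
{
  "version": { "major": 4, "minor": 0 },
  "builds": [
    { "id": "myci:build-1", "origin": "myci",
      "checkout_id": "myci:checkout-1" }
  ],
  "tests": [
    { "id": "myci:test-1", "build_id": "myci:build-1",
      "origin": "myci", "path": "ltp.syscalls", "status": "FAIL" }
  ]
}
```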

Currently, KCIDB gets around 300K test results per day, from roughly 10K different builds. He briefly put up a screen shot of the Grafana-based KCIDB dashboard (which can be seen in his slides or the YouTube video of the presentation). He also showed an example of the customized report email that developers and maintainers can get; the one he displayed aggregated the results from four different CI systems, with links to the dashboard for more information.

CI metrics

He gave a somewhat simplified definition of CI for the purposes of his talk: test every change to a code base, or as many changes as possible, and provide feedback to the developers of that code. Given that, there are four metrics that can be defined for a CI system, starting with its coverage, which is how much functionality is being tested. The latency metric is based on how quickly the feedback is being generated, while the reliability measure is about "how much can you trust the results"—does a reported failure correspond to a real failure and does a "pass" really mean that? The last metric is accessibility, which is "how easy it is to understand the feedback"; that is, whether the reports provide enough information of the right kind to allow developers to easily track down a problem and fix it.

Using those metrics, the ideal CI system would cover everything, provide instant feedback as changes are created, the feedback "is always true", and the report "just says what's broken, so you don't have to figure it out". On the flipside, the worst CI system "is not covering anything useful, takes forever, and never tells you the right thing, and you cannot understand what it is saying". That CI system "is worse than no CI", Kondrashov said.

KernelCI obviously falls between those two extremes, so he wanted to look at the project in terms of those metrics. For coverage, "nobody really seems to know" how much of the kernel is being tested by the various systems that report to KernelCI. Each testing project has its own set of tests that it runs and there is no entity coordinating all of the testing. He has heard that some of the testing projects have done some measurements, but those results are not really available. He put up an LCOV coverage report that was generated from the CKI tests for Arm64. It showed an overall coverage of 12%, but that was only for the most important subset of the kernel tree.

In an unscientific sampling of the mailing lists, he sees latencies of several hours after a change is posted, "which is quite good", up to "a few weeks"; the latencies are typically faster for changes that are actually merged into a public branch. Pre-merge testing is rare overall.

The results are not particularly reliable, however. Many people who run CI systems have to do manual reviews of the test results before sending them to developers, "because things go wrong quite often". The accessibility of the results "is quite good in places" as some CI systems make an effort for their results to be understandable and actionable.

There are certain "hard limits" on what can actually be accomplished. In terms of coverage, the amount of hardware that is available to be used for testing is a hard limit; the kernel is an abstraction over hardware, so it needs to be tested on all sorts of different hardware. The latency of feedback is also limited by hardware availability; more hardware equates to more tests running in parallel, which reduces the time for producing feedback.

The reliability of the tests is governed by the reliability of the hardware and of the kernel, "but tests contribute to improving kernel reliability, so that's good". The reliability of the tests themselves would also seem to be a big factor here, though Kondrashov did not mention it. The limits of accessibility are partially governed by hardware availability, yet again, because it is difficult to fix a bug that is reported on hardware that the developer does not have access to. The complexity of the kernel also plays a role in limiting the accessibility of the results.

Challenges

He thinks that there are a lot of people who want to write tests, a lot of tests already in the wild, and a lot of companies that have test suites, all of which could lead to more coverage, but integrating those new tests is being held back by other problems.

Doing CI on code that has not yet been merged is dangerous. Anybody can post to the mailing list, so picking up those patches to test can cause problems: "you don't want them to start mining Bitcoin and you don't want them to wreck your hardware". The need for "slow human reviews" of the results also contributes to the latency problems.

He thinks that a big reason why the tests can be unreliable is because they get out of sync with the kernel being tested. Kernels change, as do the tests, but the lack of synchronization means that a test may not be looking for the proper kernel behavior. That leads to tests that repeatedly fail until the two get back in sync; meanwhile, the maintainers do not want to hear about the repeated failures that are not actually related to real bugs in their code. Nobody wants to "waste their time investigating a problem that they had nothing to do with".

The main challenge he sees for accessibility is the proliferation of report formats and dashboards, which makes it difficult for developers. That is something that he thinks KCIDB can improve.

The challenges also compound: low reliability and accessibility lead to low developer trust in what the CI system produces. If a developer knows that the tests often fail due to problems completely outside of their control, "their trust and interest for these results plummets". Likewise, if the reports are hard to understand because the developer does not have access to the hardware where it broke, or the reports leave out important information, they will be ignored. That means the results will not be used for gating patches into the kernel. Since the results are generally ignored, the test developers do not get feedback about the tests, so the tests do not improve, and any actual bugs that the tests do find are not acted upon; the whole improvement feedback loop breaks down.

High latency also leads to a lack of gating; you cannot wait a week for test results to decide whether a patch is sensible to be merged. That leads to bugs getting into the kernel that would have been caught in a lower-latency system. That all leads to greater latency as time is wasted on finding and fixing bugs that could have been detected; the extra time spent cannot go into improving the tests. "It's a vicious loop", he said.

He summarized his takeaway from the challenges with a meme: "Feedback latency is too damn high!" After that, though, he wanted to move on to what can be done to fix the problems: "that's enough gloom".

What to do?

First up was a look at what cannot be done, however. The kernel community is not a single team, working for a single employer; that is also the case with most other open-source projects. It all means that open-source developers cannot be forced to look at test results. In a company, you can bootstrap the testing into the development process by getting the tests just good enough to start gating merges on them; after that, the tests start improving and the positive feedback loop initiates. "After a bit of fighting and stalling, it starts up." In an open-source project, though, the tests need to be in good shape in order to gain developer trust; "without developer trust, it's not going to work".

Turning to things that can be done, he started with coverage. Companies have the most hardware, so attracting more companies into the testing fold will lead to more hardware, more tests, and more results, thus more coverage. Companies that have their own CI system and want to contribute to KernelCI can send their results to KCIDB. Another way to contribute is by setting up a LAVA lab and connecting it with KernelCI; developers will be able to submit trees and tests to be run on the hardware. The right place to get started with either of those is the KernelCI mailing list.

Kondrashov said that he thinks more pre-merge testing is needed to try to head off bugs before they get into public code and to shorten the feedback loop for developers. There are multiple approaches to doing pre-merge testing; some are using Patchwork to pick up patches from the mailing list for testing, which is working well. There is still a problem with authentication, however, since anyone can send a patch to the list; some patches could be malicious.

There are around 50 entries in the kernel MAINTAINERS file that refer to a GitHub or GitLab repository. Those systems provide a way to authenticate the patches that are submitted and connect them to a CI system. Something that KernelCI is exploring is to add integration with those "Git forges" so that, for example, there could be a GitHub Action that submits a patch to KernelCI and gets back a success or failure indication. The benefit is that those patches can be tested on real hardware as part of the pre-merge workflow.
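No such integration exists yet; as a purely hypothetical sketch, a workflow along these lines could hand a pull request off to KernelCI for testing. The kci-submit tool and its options are invented placeholders, not part of any published KernelCI interface.

```
# Hypothetical GitHub Actions workflow; "kci-submit" is a placeholder,
# not an existing KernelCI tool.
name: kernelci-premerge
on: pull_request
jobs:
  submit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: kci-submit --tree "$GITHUB_REPOSITORY" --rev "$GITHUB_SHA"
```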

If that all can be made to work, he would like to encourage more maintainers to use the forges. "I know this is controversial, it's been discussed to death in the community." But a few kernel trees are already using the pull-request-based workflow; he thinks more could benefit from doing so. The "selling point" is the CI integration and early testing of pull requests.

In order to get the process going, "CI systems have to start talking to maintainers". A CI system can offer to test a staging branch from the maintainer's repository; the maintainer's merge of a patch into their branch provides the authentication. That is not pre-merge testing, but is a starting point to help prove that the CI system and its tests are reliable and useful. To start with, a few of the most stable tests can be chosen. The KCIDB subscription feature will allow developers to get reports of other, related test results; users can filter the reports that they get on a variety of criteria, such as Git branch, architecture, compiler, tester, and so on.

There are so many tests and so many failures that manually reviewing all of them is inefficient. CI systems are starting to set up automatic triaging to analyze the results in order to more efficiently find real problems. KCIDB is working on such a system, but other CI efforts, such as the Intel GFX CI (for Intel graphics), CKI, and syzbot, already have working versions of this triage. The best triaging is currently done by syzbot, which analyzes the kernel log output of a crash in order to avoid emitting multiple reports for the same underlying bug.

Another controversial suggestion that he has is to avoid the synchronization problem between the kernel and its tests by moving more tests into the kernel tree. That allows fixes or changes to the kernel functionality to come with the corresponding changes to the tests. He suggested starting with popular, well-developed tests, such as those from the Linux Test Project (LTP). In order to make that work, though, it needs to be integrated into the kernel documentation and best practices, so that the tests become a "more official" part of the kernel workflow.

Currently, LTP is being run on mainline, stable, and other kernels, so it has to be able to handle all of the test differences among them. If those tests got integrated into the kernel tree, that would no longer be needed; the tests in the tree would (or should) simply work for that branch. If a fix that gets backported to a stable branch changes a test somehow, the test change would be backported with it; that would greatly simplify the tests. In order to keep those in-tree tests functioning well, they would need to be prioritized in the CI systems, so that the feedback loop for the tests themselves is shortened as well.

Accessibility can be improved by standardizing the report formats and dashboards. The KCIDB project is working on some of that, but needs feedback from maintainers and developers. He also encouraged people to get involved with the development of KCIDB to help make it better.

In the Q&A after the talk, several attendees agreed with Kondrashov's analysis and suggestions. There were also invitations to work with other testing efforts, such as for the media subsystem. Finding a way to allow regular developers to test their code on a diverse set of hardware was also deemed important, but depends on being able to authenticate the requester, Kondrashov reiterated; the Git forges provide much or all of the functionality needed for that. He closed by noting that there are few who are working on KCIDB right now, largely just Kondrashov—who is busy with other Red Hat work—and an intern, so there is a real need for more contributors; he has lots of ideas and plans, but needs help to get there.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for assisting with my travel to Prague.]

Comments (7 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>


Copyright © 2023, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds