Leading items
Welcome to the LWN.net Weekly Edition for October 12, 2023
This edition contains the following feature content:
- Remote execution in the GNOME tracker: a series of failures leads to a remotely exploitable desktop vulnerability.
- Progress on no-GIL CPython: wherein the project of removing the CPython global interpreter lock runs into one of the hardest problems in computer science.
- GCC features to help harden the kernel: compiler features that can help developers avoid bugs and vulnerabilities.
- The challenge of compiling for verified architectures: BPF is not like any other target for the compiler.
- Rethinking multi-grain timestamps: this 6.6 feature was reverted after causing user-space regressions; now developers are trying to find a better approach.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Remote execution in the GNOME tracker
While the bug itself is pretty run-of-the-mill, the recently disclosed GNOME vulnerability has a number of interesting facets. The problem lies in a library that reads files in a fairly obscure format, but it turns out that files in that format are routinely—automatically—processed by GNOME if they are downloaded to the local system. That turns a vulnerability in a largely unknown library into a one-click remote-code-execution flaw for the GNOME desktop.
libcue vulnerability
The bug was found by Kevin Backhouse, who reported it in an admirably detailed disclosure on the GitHub blog; Backhouse is a security researcher at GitHub. The cue sheet format, which stores information about the tracks on a compact disc (CD), is at the heart of the problem. The actual flaw was found and fixed in the libcue parsing library for the format. Its bison-based parser uses the venerable (and problematic) atoi() function to convert a string to an integer—without any kind of integer overflow check.
That means that an entry in the cue file, which could be coming from an untrusted source, can produce a negative number in an unexpected place. In and of itself that is not a security problem, but the value is used (as i) in the track_set_index() function without a proper sanity check:
    if (i > MAXINDEX) {
        fprintf(stderr, "too many indexes\n");
        return;
    }
    track->index[i] = ind;

A negative i will obviously cause a write outside of the array bounds at an attacker-controlled location. In addition, the value of ind comes from the cue file, so an attacker can use it to write a value of their choosing to a location they control. That's a recipe for compromise, of course. The fix is simple: also check for a negative i in the test—or use an unsigned value.
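The underlying atoi() problem can be seen in a small, standalone C example; this is purely illustrative and is not libcue's actual code or its fix:

    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* atoi() has no error reporting, so a huge value in a cue sheet can
     * silently become a negative int.  A checked conversion rejects it. */
    static int parse_index(const char *s)
    {
        char *end;
        long val;

        errno = 0;
        val = strtol(s, &end, 10);
        if (errno || end == s || val < 0 || val > INT_MAX)
            return -1;                      /* caller treats -1 as invalid */
        return (int)val;
    }

    int main(void)
    {
        /* atoi() on an out-of-range value is undefined; in practice it
         * often yields a wrapped (possibly negative) number. */
        printf("atoi:   %d\n", atoi("99999999999"));
        printf("strtol: %d\n", parse_index("99999999999"));  /* -1, rejected */
        return 0;
    }

Rejecting negative (and over-large) values at parse time, or using an unsigned index as suggested above, closes off the out-of-bounds write.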
Normally, though, one would not really expect to use either the file format or libcue except in applications that are dealing with CDs in various ways: rippers, music players, and the like. If those applications used libcue and processed a malicious cue file, they were vulnerable, but the scope of the problem would have been far more limited. There is another user of the library that operates in a less-obvious fashion, however: the GNOME desktop tracker search engine automatically parses files that get stored in certain directories.
Tracker makes it easier for users to search for things on their desktop, but it also exposes the parsing process to potentially malicious content. Clicking a link in a browser that results in downloading a cue file will result in the tracker miner running a tracker-extract process for the cue format, which uses libcue. Beyond just that format, though, there may be vulnerabilities lurking in parsers and libraries for other formats that are automatically parsed by tracker.
But, as reported by Backhouse, the GNOME tracker developers were at first surprised that his exploit (which he shows in a video, but the code is not yet released) was able to escape the sandbox that is being used for the tracker miners. He had encountered the sandbox while working on his exploit, but did not recognize what it was; he hit an error message about a disallowed system call, but tried another path that avoided the call, which worked. The biggest problem he needed to solve was how to route around the address-space layout randomization (ASLR) applied to the tracker-extract program.
Sandbox weakness
It turns out that the sandbox used by tracker-extract had a known weakness, as tracker developer Carlos Garnacho reported on his blog. It is based on seccomp() filters and is quite restrictive in what it allows:
Tracker took the most restrictive approach, making every system call fail with SIGSYS by default, and carving the holes necessary for correct operation of the many dependencies within the operational parameters established by yours truly (no outside network access, no filesystem write access, no execution of subprocesses). The list of rules is quite readable in the source code, every syscall there has been individually and painstakingly collected, evaluated, and pondered for inclusion.
But those policies were not applied throughout, unfortunately; there were things that the tracker needed to be able to do that could not easily be allowed while still maintaining tight restrictions on the rest. So the tracker developers took a shortcut and only applied the seccomp() restrictions to the thread that did the actual parsing and not to the main dispatcher thread. That way, the main thread could still perform the other system calls, but, of course, all of the threads share the same address space. So, the vulnerability in libcue overwrote memory in a way that caused the main thread to misbehave. As Garnacho noted, rather apologetically, that led to this highly visible vulnerability:
While not perfectly airtight, the sandbox in this case is already a defense in depth measure, only applicable after other vulnerabilities (most likely, in 3rd party libraries) were already successfully exploited, all that can be done at that point is to deter and limit the extent of the damage, so the sandbox role is to make it not immediately obvious to make something harmful without needle precision. While the "1-click exploit in GNOME" headline is surely juicy, it had not happened for these 7 years the sandbox existed, despite our uncontrollable hose of dependencies (hi, GStreamer), or our other dependencies with a history of CVEs (hi Poppler, libXML et al), so the sandbox has at least seemed to fulfill the deterrence role.

But there is that, I knew about the potential of a case like this and didn't act in time. I personally am not sure this is a much better defense than "I didn't know better!".
Garnacho has since made a series of changes to include the main thread in the seccomp()-protected sandbox. He recognizes that there will be fallout from that change: "one does not go willy-nilly adding paranoid sandbox restrictions to a behemoth like GStreamer and call it an early evening". He is hopeful that, after a release or two, the reports of problems will "settle back to relatively boring"; he also has some ideas for future security enhancements and suggests that the "rewrite it in Rust" folks consider working on some of the tracker-extract dependencies.
While this is a serious vulnerability that has put a lot of systems at risk, it seems to have been handled well by the libcue and tracker developers; the bug was also managed well by Backhouse, according to Garnacho. There are no reports of it being exploited in the wild and fixes should be available "everywhere" at this point, so the community has largely dodged a bullet. The reports by Backhouse and Garnacho are useful for understanding the bugs, and for hopefully avoiding some of these kinds of problems in the future. An upcoming post from Backhouse, which will include the proof-of-concept exploit and some additional information about his efforts, should prove instructive as well.
There is a lot of noise about rewriting the world in memory-safe languages, which is certainly a noble goal. But, in the meantime, we are all running lots of code from non-memory-safe languages on our systems. In order for that to change, these rewrites are going to have to integrate well with the existing infrastructure—or we are going to have to wait for all of that infrastructure to be rewritten as well. Finding ways to work in memory-safe replacements for existing components in the interim may well prove to be a larger challenge than the already-huge job of doing the rewrites.
Something that should be considered, perhaps, is scrutinizing all of these "convenient" things that our desktops do behind the scenes. Those actions may be buying security headaches that are worse than the convenience being gained—at least for some users. Desktops that automatically mount USB devices, potentially leading to kernel crashes or compromises, are a case in point. Even with proper sandboxes, or replacement with memory-safe equivalents, there are probably more surprises of an unpleasant variety awaiting processes that auto-parse untrusted input. There is a balance to be struck between security and convenience, for sure, but one wonders at times if the pendulum has moved too far in the convenience direction.
Progress on no-GIL CPython
Back at the end of July, the Python steering council announced its intention to approve the proposal to make the global interpreter lock (GIL) optional over the next few Python releases. The details of that acceptance are still being decided on, but work on the feature is proceeding—in discussion form at least. Beyond that, though, there are efforts underway to solve that hardest of problems in computer science, naming, for the no-GIL version.
ABI concerns
In mid-September, Sam Gross, who authored PEP 703 ("Making the Global Interpreter Lock Optional in CPython"), posted a message to the Python discussion forum about the interaction of the PEP and the CPython stable ABI. In part, the stable ABI is used by extensions so that their binary wheels work with multiple CPython versions, avoiding the need for rebuilds each time a new CPython version is released. The PEP envisions a path toward an eventual single CPython version without the GIL, but, in the meantime, there will be a build of the interpreter (using ‑‑disable‑gil) that can be used to test no-GIL operation.
Gross noted that extensions built for the stable ABI will not actually work with a no-GIL CPython 3.13 (which is due in October 2024), but he is proposing some changes so that extensions will work with both types of CPython builds from that version onward. Extensions that only call into the "limited" C API result in binaries that use the stable ABI, so Gross suggested a few additions and changes to the API in order to facilitate extension binaries that can work with both interpreter types (GIL and no-GIL). In part, it adopts the existing plan to switch some macros to function calls, in particular for incrementing and decrementing the object reference counts that are used for garbage collection.
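As a reminder of what building against the limited API looks like, here is a minimal, hypothetical extension module; with Py_LIMITED_API defined, operations such as Py_INCREF() are already routed through opaque function calls in recent limited-API versions rather than touching the object's reference-count field directly, and the changes Gross proposes build on that same kind of indirection. The module name and version value below are arbitrary.

    #define Py_LIMITED_API 0x030c0000   /* limited API version 3.12, arbitrary */
    #include <Python.h>

    static PyObject *hello(PyObject *self, PyObject *args)
    {
        return PyUnicode_FromString("hello from a stable-ABI module");
    }

    static PyMethodDef methods[] = {
        {"hello", hello, METH_NOARGS, "Return a greeting."},
        {NULL, NULL, 0, NULL},
    };

    static struct PyModuleDef demo_module = {
        PyModuleDef_HEAD_INIT, "demo", NULL, -1, methods,
    };

    /* The resulting binary is tagged abi3 and works across CPython
     * releases -- the property at stake in this discussion. */
    PyMODINIT_FUNC PyInit_demo(void)
    {
        return PyModule_Create(&demo_module);
    }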
Victor Stinner, who has done a lot of recent work on the stable ABI, replied that he thought there should be a simple solution for extensions that can work on both interpreter types; he is concerned that the no-GIL experiment will fail otherwise. Because extensions that are built for the stable ABI on CPython 3.12 and earlier will not be compatible with no-GIL builds from 3.13 onward, he thinks it may make sense to create a new ABI version to differentiate the cases. Currently, the stable ABI is "abi3", but that could be bumped to abi4—even without moving to a Python 4. Many people seem to believe the ABI number and the CPython major version are linked, but that is not the case.
Gross is less concerned about the need for extensions that want to support no-GIL mode having to build two separate binary wheel versions as he described in his original message. He believes that building two wheels is a manageable price to pay and is more worried about tying the no-GIL project to fixes or upgrades to the C API and stable ABI. Alex Gaynor agreed about the price; he has multiple packages with abi3 wheels (and "is very excited about the glorious nogil future") that would be affected, but creating two wheels is not overly burdensome as a one-time thing. He does want to ensure that existing and future versions of pip do the right thing in choosing between them, however.
Brett Cannon said that existing and older versions of pip would not do the right thing, unless a change to abi4 was made; the logic used in pip today would not be able to distinguish between the versions. Gross suggested that supporting older pip versions may not really be needed, since it is not an actual change in behavior:
I don't think we should worry too much about whether old versions of pip work properly for the experimental, --disable-gil build of CPython 3.13. As it is, old versions of pip frequently do not work with new versions of Python. For example, pip==23.1.1 and older (from just 5 months ago) will break if installed in CPython 3.13 (missing pkgutil.ImpImporter).
pip maintainer Paul Moore pointed out a difference, though: breakage is different from silently doing the wrong thing:
I think it's fine if an older version breaks for newer Python. But I'm less sure that silently installing the wrong package is acceptable. People do use older versions of pip, and loud breakages aren't the same as silent errors.
He noted that pip has a policy that users should always upgrade to the latest version, but that the project has no specific policy on silent errors of this sort. He is concerned that those who want to experiment with the no-GIL (or "free-threaded") builds will be put off if they end up "having to debug ABI compatibility issues like this". Gaynor agreed, noting that "pip silently doing the wrong thing creates a flood of issues" for the packages affected.
Barry Warsaw asked about whether there were plans for allowing installation of both interpreter types on the same system. Gross said that the situation was the same as having two different versions of Python installed. Warsaw thought that was reasonable and that it is not too difficult to handle parallel installations. In the message linked above, Cannon said that one solution could be a "fat" wheel that had both binaries, as long as the names of those binaries in the wheel were distinct.
Naming
But the naming discussion got spun off into its own thread—once it clearly headed in that direction, anyway. Moore said that it was important to be able to install both interpreters so that people can easily test the no-GIL mode; if that is not straightforward to do on Windows, macOS, Linux, and other platforms, it will negatively impact the no-GIL project:
This links back to the packaging question, because how easy it is for users to try out nogil builds, and how easy it is for a user to select between nogil and gil, will strongly affect how much demand there is for nogil builds, and hence how much pressure there will be on package maintainers to provide nogil-compatible wheels in the first place.
Barry Scott wondered what names would be used, noting that the "shebang line" (i.e. "#!/usr/bin/python" or similar at the top of a script) needs to indicate which interpreter to invoke, which should be the same across all of the platforms if possible:
The gil version of python executable is "python3". What is the name of the nogil executable? "python-nogil3", "python-nogil3.13" etc?
But the no-GIL build for 3.13 is clearly meant as an experimental feature, Gregory P. Smith said, which means that distributions should not be putting that build on the default $PATH, at least in his opinion. Smith is a member of the steering council, but noted that he was using his "personal reply hat". Having a lengthy name for the interpreter that lingers for long periods afterward because it remains in shebang lines is not desirable, so he suggested waiting:
In effect this defers any potential install naming decision until 3.14 or later once we've got some practical knowledge of how things are working out - we'll presumably know better at that time if there's a need for parallel installs. It's okay for some things like this to be decided later, in effect that is why PEP-703 has an Open Issues section not fully specifying this area today.
But Fedora developer Petr Viktorin pointed out that distributions are likely to want to package the no-GIL interpreter for their users to experiment with. Moore agreed that users are going to want that:
I would like to be able to write scripts with #!/usr/bin/env python3.13-nogil (or something like that) in order to get the free threaded build without needing to hard-code a long and probably non-intuitive path (which on Windows is also user-specific).
In another thread that was started by Steve Dower, who creates the official Python packages for Windows, Smith noted that the steering council has agreed that it wants to avoid the name "nogil" to describe the build. With his council hat on, he said that there were two reasons behind this decision: it "does not communicate well to most non-core devs, people need not know what a GIL is" and it contains a negative. There is a suggested alternative:
A more appropriate term for this experimental in 3.13 build if you have a need to provide builds is "free-threading". We realize "free-threading" doesn't roll off the tongue like two-syllable "nogil" does, but it should be more understandable to people-who-are-not-us.
That set off some predictable bikeshedding on naming, with several in the thread noting that they think "nogil" is the better choice—or at least that they are perfectly happy with that name. Gross thought that the suggested alternative was confusing: "While I understand the objections to 'nogil', I don't think 'free-threading' is likely to be understood by people-who-are-not-us." He noted that the term is not widely used elsewhere. Most participants generally just wanted to use something short—"nogil" was the clear "winner" in that department—but none of the suggestions made there seemed to resonate. The one concrete change was to switch the ABI tag for the no-GIL builds from "n" to "t" (for "threading"). Excising "nogil" at this point is going to be something of an uphill climb, it would seem.
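To make that tag change concrete, the wheels for the two 3.13 builds would then be told apart by their ABI tags, with filenames along these lines (the project name and platform tag here are invented for illustration):

    demo-1.0-cp313-cp313-manylinux_2_17_x86_64.whl      regular (GIL) build
    demo-1.0-cp313-cp313t-manylinux_2_17_x86_64.whl     free-threaded build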
Proposing abi4
Back in the original thread, there was some discussion, mostly between Viktorin and Gross, about problem areas in the API changes that Gross proposed. That led to a new proposal that incorporated the feedback and adopted the idea of creating a new ABI, abi4. Gross has developed a prototype of the new ABI. Viktorin is generally in agreement with the approach, though some details still need to be worked out.
One of those details is that a PEP for abi4 is needed, as Stinner pointed out. Viktorin agreed ("This is a pre-PEP discussion.") and it would seem that the mid-October core developer sprint will be the venue for some face-to-face discussions on the topic.

In particular, there is confusion about the compatibility guarantees provided by the combinations of various limited API versions and abi3, which will impact what will happen with abi4. Research on those issues is ongoing.
So, work on the no-GIL (or free-threaded) version of CPython is proceeding apace, but the final acceptance of the PEP is still pending. It is somewhat surprising that there has been such a lengthy delay, but PEP 703 and its knock-on effects are likely to dominate CPython development—and ecosystem—over the next five or more years, so the steering council wants to make its acceptance criteria clear. As council member Thomas Wouters put it: "we're ironing out the exact acceptance text (we want to clarify a lot of our decisions)". Some of that work may happen at the sprint as well.
GCC features to help harden the kernel
Hardening the Linux kernel is an endless task, with work required on multiple fronts. Sometimes, that work is not done in the kernel itself; other tools, including compilers, can have a significant role to play. At the 2023 GNU Tools Cauldron, Qing Zhao covered some of the work that has been done in the GCC compiler to help with the hardening of the kernel — along with work that still needs to be done.

The Kernel self-protection project is the home for much of the kernel-hardening work, she began. Hardening can be done in a number of ways, starting with the fixing of known security bugs, which may be found by static checkers, fuzzers, or code inspection. Fixing bugs is a never-ending task, though; it is far better, when possible, to eliminate whole classes of bugs entirely. Thus, much of the work in the kernel has focused on getting rid of problems like stack and heap overflows, integer overflows, format-string injection, pointer leaks, use of uninitialized variables, use-after-free bugs, and more. Effort is also going into blocking methods of exploitation, including the ability to overwrite kernel text or function pointers.
The GCC 11 release (April 2021), she said, included the ability to zero the registers used by a function on return from that function; that can help prevent the leakage of information. It is now on by default. GCC 12 (May 2022), instead, added the automatic initialization of stack variables; that, too, has been turned on by default in kernel builds. GCC 13 (April 2023) added more strict treatment of flexible-array members.
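For reference, the compiler options behind those features look roughly like this; kernel builds enable them through Kconfig options rather than by passing flags directly, and the exact modes used vary by configuration:

    # GCC 11: clear call-used registers on function return
    # GCC 12: zero-initialize automatic (stack) variables
    # GCC 13: only treat true flexible arrays as flexible
    gcc -fzero-call-used-regs=used-gpr \
        -ftrivial-auto-var-init=zero \
        -fstrict-flex-arrays=3 \
        -c file.c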
Zhao briefly mentioned some of the features that the kernel community would like to see in future compiler releases. These include better support for flexible-array checking, a reduction in false-positive warnings with the ‑Warray‑bounds option, better integer-overflow checking, support for control-flow integrity checking, and more.
Returning to flexible arrays, Zhao pointed out that out-of-bounds array accesses are a major source of vulnerabilities in the kernel. These can be prevented by bounds checking — if the size of the array in question is known. For fixed arrays, the size is known at compile time, so the checking of array accesses can be done, either at compile time (if possible) or at run time. For dynamically sized arrays, though, the problem is harder. In C, these arrays take two forms: variable-length arrays and flexible-array members in structures; only the latter are used in the kernel at this point.
A flexible-array member is an array embedded within a structure as the final element. It is often declared as having a dimension of either zero or one (though the latter tends to be a frequent source of bugs), or just as array[]. When space for an instance of the structure is allocated, it must be sized large enough to hold the actual array, which will vary in length from one instance to the next.
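A hypothetical example of the pattern she described:

    #include <stdlib.h>

    struct packet {
        size_t len;
        unsigned char data[];    /* flexible-array member */
    };

    /* The allocation must include room for the actual array contents. */
    struct packet *packet_alloc(size_t n)
    {
        struct packet *p = malloc(sizeof(*p) + n);

        if (p)
            p->len = n;
        return p;
    }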
In GCC 12, all arrays that are defined as the final member of a structure are considered to be flexible, regardless of the declared size of the array. So even the array here:
    struct foo {
        int int_field;
        int array[10];
    };
would be deemed by the compiler to be flexible in size even though that was (probably) not the developer's intent; as a result, no bounds checking is performed on accesses to those arrays. In GCC 13, the ‑fstrict‑flex‑arrays option gives control over which arrays are considered to be flexible arrays; this article gives an overview of how it works. The result is that bounds checking can be more easily applied to arrays that were never meant to vary in size.
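For reference, the option takes a level argument controlling which trailing-array declarations are still treated as flexible (paraphrasing the GCC 13 documentation):

    -fstrict-flex-arrays=0    all trailing arrays are treated as flexible (the default)
    -fstrict-flex-arrays=1    only [], [0], and [1] are treated as flexible
    -fstrict-flex-arrays=2    only [] and [0] are treated as flexible
    -fstrict-flex-arrays=3    only [] is treated as flexible (the strictest level)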
There are still some problems, though; Zhao mentioned the case of a structure containing a flexible-array member that is, in turn, nested into another structure type:
    struct s1 {
        int flex_array[0];
    };

    struct s2 {
        type_t some_field;
        struct s1 flex_struct;
    };
Even if the flexible structure is the final member of the containing structure (s2 above), versions of GCC less than 14 will incorrectly treat the array as fixed. Zhao has contributed a fix for that particular problem. A separate problem arises when the flexible structure is not the final field of the containing structure. In this case, it's not clear what the compiler should do, but GCC has accepted such structures. The new ‑Wflex‑array‑member‑not‑at‑end option will warn about such code.
Flexible-array members in unions are yet another problem; GCC will accept such members when declared as array[0], but the (legal) array[] form is not accepted. That makes it impossible to create unions that will compile under the strictest ‑fstrict‑flex‑array mode. Unions containing only flexible array members raise a different issue: they could end up being a zero-length object, which is not something the C standard allows. Adding a fixed-length member resolves that issue for now; there may be an attempt to allow fully flexible unions as a future GCC extension.
Use of flexible arrays currently defeats bounds checking, but the actual length of any given array is (or at least should be) known to the code as it is running. If that size can be communicated to the compiler, bounds checking can be added. There are two potential ways of declaring that information; one would be to add a new syntax to embed it within the type itself:
    struct foo {
        size_t len;
        char array[.len*4];
    };
This syntax allows the use of expressions (in this case, "four times the value of the len field"). It is the cleaner option, she said, but it has the potential to break ABIs for existing code by changing the dimension of the array. That makes it harder to adopt, as does the syntax change, which is sure to require a lot of discussion before it would find acceptance.
An alternative is to add an attribute to the flexible-array member instead. That preserves the existing ABI, is easier to adopt, and can also be extended to other types (pointers, for example). On the other hand, it is harder to extend to more complex expressions. The counted_by() attribute has been added for GCC 14 without expression support; it can only refer to another field in the same structure for now.
    struct foo {
        size_t len;
        char array[] __counted_by(len);
    };
In this case, the len field can only be used to dimension array directly, no expressions allowed. This attribute only works for the size of the flexible array itself for now; future work may get it to the point where, for example, it can warn when the allocation size for the structure is not sufficient to hold the array.
There is some talk of extending this checking to pointer values as well; Apple has a proposal for a more elaborate ‑fbounds‑safety flag (for LLVM) implementing this idea. It is a superset of the existing counted_by() behavior; it would be more effort to implement and adopt, but will be considered later if it takes off.
Bounds checking is only useful if the checks are correct, so the existence of false-positive warnings is a problem. Specifically, code that is optimized with jump threading can create false positives. One aspect of this problem has been fixed in GCC 13, while another is still open. This issue is preventing ‑Warray‑bounds from being enabled by default in kernel builds. There are some ideas circulating for how to mark code where jump threading has been used and suppress the resulting warnings.
A separate issue entirely is integer-overflow detection. In the C standard, overflow is defined for unsigned integer values, but undefined for signed values and pointers. For the undefined case, GCC provides options to either define the expected behavior or to detect the overflow. There is no option, though, for unsigned overflow, since the behavior is well defined. But unsigned overflow is often unintentional and would be good to detect. Perhaps, she said, there needs to be a new option to allow for detection in this case.
Back to signed overflow, she noted that the ‑fwrapv option makes the behavior defined; the variable will wrap around when it overflows. But, while the kernel needs to have overflows trap most of the time, there are occasional spots where it should be allowed. Florian Weimer pointed out that there is a built-in mechanism now that can be used to disable checking for specific operations; Zhao said that she would look into it.
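The report above does not say which mechanism Weimer meant; a likely candidate is GCC's __builtin_*_overflow() family, which performs the operation with defined wrapping semantics and reports whether overflow occurred. Because the overflow is anticipated and handled explicitly, such spots are natural places for overflow instrumentation to stay quiet. A minimal sketch, with saturation chosen purely as an example policy:

    #include <limits.h>
    #include <stdio.h>

    /* Overflow here is expected and handled explicitly; the builtin
     * has defined behavior, so overflow checkers need not flag it. */
    static int add_clamped(int a, int b)
    {
        int sum;

        if (__builtin_add_overflow(a, b, &sum))
            return (a > 0) ? INT_MAX : INT_MIN;   /* saturate on overflow */
        return sum;
    }

    int main(void)
    {
        printf("%d\n", add_clamped(INT_MAX, 1));  /* prints 2147483647 */
        return 0;
    }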
At this point time ran out, and Zhao was unable to get into the discussion of control-flow integrity options. The picture that came out of the session was clear, though. Quite a bit of work has gone into improving GCC so that it can help in the hardening of the kernel (and other programs too, of course). But, like so many other jobs, the task of defending the kernel against attackers never seems to end. There will be plenty for developers, on both the compiler and kernel sides, to do for the foreseeable future.
The challenge of compiling for verified architectures
On its surface, the BPF virtual machine resembles many other computer architectures; it has registers and instructions to perform the usual operations. But there is a key difference: BPF programs must pass the kernel's verifier before they can be run. The verifier imposes a long list of additional restrictions so that it can prove to itself that any given program is safe to run; getting past those checks can be a source of frustration for BPF developers. At the 2023 GNU Tools Cauldron, José Marchesi looked at the problem of compiling for verified architectures and how the compiler can generate code that will pass verification.

Marchesi, who has been working on a BPF backend for GCC, started by saying that the problem is inherently unsolvable. The BPF verifier is trying to do something that, in the theoretical sense, cannot be done; to make the problem tractable it imposes a number of restrictions on BPF programs, with the result that many correct programs still fail to pass verification. The verifier is a moving target as it evolves with each kernel release, and it is not the only verifier out there (the Windows BPF implementation has its own verifier that imposes a different set of restrictions). It is a challenging task but, perhaps, something can be done in practice to at least improve the situation.
The "world domination plan" for BPF support in GCC has several stages,
Marchesi said. The initial goal was to achieve parity with LLVM, which is
still used for all BPF programs currently. The LLVM BPF backend, he noted,
was created by the same developers who were working on the BPF virtual
machine, at the same time; each side could be changed to accommodate the
other. The GCC developers, instead, have to deal with the virtual machine
as it exists. So that task took a few years, but it has been largely
achieved.
The next step is to be able to compile and run all of the kernel's BPF tests; that is what is being worked on now. Compiling the tests is mostly done, but running them is trickier for the reasons described above: although the programs are known to be correct, the code generated by GCC runs afoul of the verifier. Once that problem has been overcome, Marchesi said, the next object will be to compile and run all existing BPF programs.
A verified target like the BPF virtual machine is not the same as a sandbox, he said. A sandbox involves putting a jail around code to block any attempted bad behavior. The verifier, instead, proves to itself that a program is not harmful, then runs it in a privileged environment. That verification imposes a number of "peculiarities" that a compiler must cope with. For example, BPF programs have disjoint stacks, where each frame is stored separately from the others. But the real problem is generating verifiable code. As an example here, he noted that the verifier will complain about backward jumps in BPF code; developers can work to avoid them, but optimizing compilers can do interesting things, and it's not always clear where a backward jump might have been introduced. There are many other ways to fail verification, and the list changes from one kernel release to the next.
Verification problems, he said, come in two broad categories. One is that certain types of source constructs are simply not verifiable; these include computed gotos and loops without known bounds. The other is transformations from optimization; a developer can avoid problematic constructs and generally write with verification in mind, but an optimizing compiler can mangle the code and introduce problems. To be useful, a compiler for the BPF target must take extra pains to produce verification-friendly code — not a simple objective.
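As an illustration of the loop-bound problem, consider a fragment of BPF C along the following lines; this is a sketch only, the program type and details are invented, and it assumes libbpf's helper headers:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #define MAX_BYTES 64

    SEC("xdp")
    int count_bytes(struct xdp_md *ctx)
    {
        int len = ctx->data_end - ctx->data;   /* known only at run time */
        int total = 0;

        /* "for (i = 0; i < len; i++)" alone gives the verifier no provable
         * upper bound, so the program is rejected; adding a compile-time
         * constant bound keeps the loop verifiable. */
        for (int i = 0; i < MAX_BYTES && i < len; i++)
            total += i;

        return total > 0 ? XDP_PASS : XDP_DROP;
    }

    char LICENSE[] SEC("license") = "GPL";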
Paths toward a solution
Marchesi then presented a long list of approaches that could be taken to this problem, starting with "approach zero": "do nothing". That, he said, is the current strategy with the GCC work. It is good enough to be able to compile the DTrace BPF support routines, but far from sufficient in the general case. About 90% of the BPF self-tests, he said, do not currently get past the verifier when compiled with GCC. So this approach clearly is not good enough even for existing programs, much less those that might appear in the future.
Approach one is to disable optimizations entirely. This comes at a cost to both performance and program size; size is important, he said, because of the size limit imposed by the verifier. Another problem, it seems, is that some programs simply will not work without -O2 optimization; they rely on the constant folding and propagation that the optimizer provides. So this does not appear to be a viable option for a real-world compiler.
Approach two is to disable optimizations more selectively, finding the ones that result in unverifiable code and disabling them. This is a task that would have to be handled automatically, he said. It could be addressed by formalizing the constraints imposed by the verifier, then testing the code produced after each optimization pass to ensure that the constraints are satisfied. If a problem is introduced, the changes from that pass could be discarded. Some passes, though, can introduce a lot of different optimizations; throwing them out would lose the good changes along with the bad one. Some problems, he said, are also caused by combinations of passes.
A different way, approach three, would be "antipasses" that would explicitly undo code transformations that create problems. This is the current approach used by LLVM; it has the advantage of only undoing problematic changes, with no effect on the rest of the compilation process. But antipasses are fragile and easily broken by changes elsewhere in the compiler; that leads to "maintenance hell". There was, he said, frustration expressed at this year's LSFMM+BPF Summit by developers who find that they can only successfully build their programs with specific LLVM versions. As a result, the LLVM developers are adding new features, but users are sticking with older releases that actually work and are not benefiting from those features.
Approach four would be target-driven pass tailoring, or the disabling of specific transformations by hooking directly into the optimization passes. LLVM is trying to move in this direction, he said, but other compiler developers are resisting that approach. It leads to entirely legal (from the language sense) transformations becoming illegal and unusable. This is, though, a more robust approach to the problem.
A variant would be approach five: generic pass tailoring. A new option (-Overifiable, perhaps) could be added to the compiler alongside existing options like -Os or -Ofast. With this option, optimizations would be restricted to those that make relatively unsurprising transformations to the code. This option would not be restricted to any one backend and could be useful in other contexts. Some companies, he said, have security teams that run static analyzers on binary code. A more predictable code stream created by an option like this (he even suggested that -Opredictable might be a better option name) would be easier for the analyzers to work with.
Approach six would be language-level support; this could take the form of, for example, #pragma statements providing loop bounds. It was quickly pointed out, though, that the verifier cannot trust source-level declarations of this kind, so this approach was quickly discarded. Approach seven, adding support to the assembler, was also quickly set aside as unworkable.
Marchesi concluded the talk by saying that, in the end, a combination of the above approaches will probably be needed to get GCC to the point where it routinely creates verifiable BPF code. He would like to build a wider consensus on the details of what is to be done before proceeding much further. To that end, he is working with the BPF standardization process, and would like to get other GCC developers involved as well; he is also coordinating with the LLVM developers. The problem may be unsolvable in the general sense, but it should still be possible to make things better for BPF developers.
Rethinking multi-grain timestamps
One of the significant features added to the mainline kernel during the 6.6 merge window was multi-grain timestamps, which allow the kernel to selectively store file modification times with higher resolution without hurting performance. Unfortunately, this feature also caused some surprising regressions, and was quickly ushered back out of the kernel as a result. It is instructive to look at how this feature went wrong, and how the developers involved plan to move forward from here.

Filesystems maintain a number of timestamps to record when each file was modified, had its metadata changed, or was accessed (though access-time updates are often turned off for performance reasons). The resolution of these timestamps is relatively coarse, measured in milliseconds; that is usually good enough for users of that information. In certain cases, though, higher resolution is needed; a prominent case is serving files via NFS. Modern NFS protocols can cache file contents aggressively for performance, but those caches must be discarded when the underlying file is modified. One way of informing clients of modifications is through the modification timestamp, but that only works if the resolution of the timestamp is sufficient to reflect frequent changes.
In theory, recording timestamps at higher resolutions is straightforward, as long as filesystems have space for the extra data. That higher resolution is also a problem, though: a low-resolution timestamp changes relatively infrequently, while a high-resolution timestamp changes with nearly every modification and must be written back to the filesystem that much more often. That can increase I/O rates, especially for filesystems that perform journaling, where each metadata update must go through the journal as well. The cost of increased resolution is significant, which is especially problematic since the higher-resolution data will almost never be used.
The solution was multi-grain timestamps, where higher-resolution timestamps for a file are only recorded if somebody is actually paying attention. Normally, timestamp data is only stored at the current, relatively low resolution, meaning that a lot of metadata updates can be skipped for a file that is being written to frequently. If somebody (a process or the kernel NFS server, for example) queries the modification time for a specific file, though, a normally unused bit in the timestamp field will be set to record the fact that the query took place. The next timestamp update will then be done at high resolution on the theory that the modification times for that file are of active interest. As long as somebody keeps querying the modification time for that file, the kernel will continue to update that time in high resolution.
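The mechanism can be modeled in a few lines of user-space C; this is a toy illustration of the algorithm as described above (a spare bit in the nanoseconds field serving as the "queried" flag), not the kernel's actual code, and all of the names are invented:

    #include <stdbool.h>
    #include <time.h>

    #define TS_QUERIED  (1L << 30)    /* tv_nsec stays below 10^9, so bit 30 is free */
    #define COARSE_NSEC 1000000L      /* pretend "coarse" means 1ms */

    struct mg_mtime { struct timespec ts; };

    /* Querying the timestamp flags it as being watched. */
    static struct timespec mg_get(struct mg_mtime *m)
    {
        struct timespec out = m->ts;

        m->ts.tv_nsec |= TS_QUERIED;
        out.tv_nsec &= ~TS_QUERIED;
        return out;
    }

    /* Updates pay for fine granularity only when somebody is watching. */
    static void mg_update(struct mg_mtime *m)
    {
        bool watched = m->ts.tv_nsec & TS_QUERIED;
        struct timespec now;

        timespec_get(&now, TIME_UTC);
        if (!watched)
            now.tv_nsec -= now.tv_nsec % COARSE_NSEC;   /* truncate */
        m->ts = now;
    }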
That is the functionality that was merged for 6.6. The problem is that this algorithm can give misleading results regarding the relative modification times of two files. Imagine the following sequence of events:
1. file1 is written to.
2. The modification time for file2 is queried.
3. file2 is written to.
4. The modification time for file2 is queried (again).
5. file1 is written again.
6. The modification time for file1 is queried.
After this sequence, the modification time for file1, obtained in step 6 above, should be later than that for file2 — it was the last file written to, after all. But, since its modification time had not been queried before the write in step 5, the modification timestamp will be stored at low resolution. Meanwhile, since there had been queries regarding file2 (step 2 in particular), its modification timestamp (set in step 3 and queried in step 4) will use the higher resolution. That can cause file2 to appear to have been changed after file1, contrary to what actually happened. And that, in turn, can confuse programs, like make, that are interested in the relative modification times of files.
Once it became clear that this problem existed, it also became clear that multi-grain timestamps could not be shipped in 6.6 in their initial form. Various options were considered, including hiding the feature behind a mount option or just disabling it for now. In the end, though, as described by Christian Brauner, the decision was made to simply revert the feature entirely:
While discussing various fixes the decision was to go back to the drawing board and ultimately to explore a solution that involves only exposing such fine-grained timestamps to nfs internally and never to userspace.

As there are multiple solutions discussed the honest thing to do here is not to fix this up or disable it but to cleanly revert.
The feature was duly reverted from the mainline for the 6.6-rc3 release.
The shape of what comes next might be seen in this series from Jeff Layton, the author of the multi-grain timestamp work. It begins by adding the underlying machinery back to the kernel so that high-resolution timestamps can be selectively stored as before. Timestamps are carefully truncated before being reported to user space, though, so that the higher resolution is not visible outside of the virtual filesystem layer. That should prevent problems like the one described above.
The series also contains a change to the XFS filesystem, which is the one that benefits most from higher-resolution timestamps when used in conjunction with NFS (other popular filesystems have implemented "change cookie" support to provide the information that NFS clients need to know when to discard caches). With this change, XFS will use the timestamp information to create its own change cookies for NFS; the higher resolution will ensure that the cookies change when the file contents do.
Layton indicated that he would like to see these changes merged for the 6.7 release. They have been applied to the virtual filesystem tree, and are currently showing up in linux-next, so chances seem good that it will happen that way. If so, high-resolution timestamps will not be as widely available as originally thought, but there is no real indication that there is a need for that resolution in user space in any case; Linus Torvalds was somewhat critical of the idea that this resolution would be necessary or useful. But the most pressing problem — accurate change information for NFS — will hopefully have been solved at last.
Page editor: Jonathan Corbet