
Leading items

Welcome to the LWN.net Weekly Edition for June 20, 2024

This edition contains the following feature content:

  • Adding a JIT compiler to CPython: Brandt Bucher's PyCon talk on the copy-and-patch JIT coming in Python 3.13.
  • How free software hijacked Philip Hazel's life: the Exim and PCRE author looks back, and looks for a PCRE2 successor.
  • Aeon: openSUSE for lazy developers: a look at the Aeon Desktop release candidate.
  • Nested bottom-half locking for realtime kernels: replacing a big lock with fine-grained locking.
  • Simplifying the BPF verifier: two proposals to make the verifier's value tracking easier to maintain.
  • Static keys for BPF: run-time branch patching for BPF programs.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Adding a JIT compiler to CPython

By Jake Edge
June 18, 2024

PyCon

One of the big-ticket items for the upcoming Python 3.13 release is an experimental just-in-time (JIT) compiler for the language; the other is, of course, the removal of the global interpreter lock (GIL), which is also an experiment. Brandt Bucher is a member of the Faster CPython project, which is working on making the reference implementation of the language faster via a variety of techniques. Last year at PyCon, he gave a talk about the specializing adaptive interpreter; at PyCon 2024 in Pittsburgh, he described the work he and others have been doing to add a copy-and-patch JIT compiler to CPython.

Background and history

He began with a little bit about himself; "I started using Python about seven years ago and everything since then has been kind of in a whirlwind". He started contributing code to Python, then joined the bug-triage team, and became a core developer after he worked on the structural-pattern-matching feature. He currently works on the Microsoft Faster CPython team ("it's my literal 'dream job'"); he helped to make CPython 3.11 25% faster for most workloads. More recently, he implemented the JIT compiler in order to continue improving the performance of the language.

[Brandt Bucher]

A bit of recent history was next up, he said, to explain why the team thought there was a need for a JIT compiler now, along with describing some of the steps taken to get to this point. The specializing adaptive interpreter (from PEP 659) was added for Python 3.11; it "sort of listens to your program as it runs" in order to type-specialize the code and to cache important values. That meant that the interpreter was gathering valuable data about the program as it ran; "the exact sort of data that we would care about if we wanted to [...] compile it later".

In Python 3.12, there was a "quality of life improvement" for maintainers. The specializing adaptive interpreter had added complexity to what was otherwise a fairly simple interpreter, so the team took some time to add infrastructure that allowed automatically generating the interpreter from a specification in a domain-specific language (DSL). This simplified the code by generating much of the boilerplate for reference-count maintenance, error handling, and more. It also meant that the specification could be used to generate multiple interpreters or, even, things that are not interpreters, Bucher said.

For Python 3.13, which had been feature-frozen just before PyCon and is due in October, a new micro-op interpreter was added. It is an "entirely separate execution pipeline" that analyzes the running code to determine where the hot paths are; those paths are then lowered (i.e. translated into a lower-level format) into new bytecode operations, called micro-ops, which are optimized and executed in another dedicated interpreter. All of this is only possible, Bucher said, because of the steps taken in 3.11 and 3.12.

An example

He put up an iterative Fibonacci function (from his slides) that he would be using to demonstrate how the micro-op interpreter works:

    def fibonacci(n):
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

Since the interpreter would be spending most of its time in the for loop, that is where he wanted to focus. For those who might not be fully up to speed on Python, the _ is often used as a placeholder for a value that is not needed, and the range() function returns an iterator that produces n successive values starting at zero (in this case). The body of the loop uses tuple syntax to do two assignments: a=b and b=a+b. The bytecode operations for the loop look like the following (from his slides):

    for _ in range(n):      FOR_ITER
        a, b = b, a + b     STORE_FAST
                            LOAD_FAST_LOAD_FAST
                            LOAD_FAST
                            BINARY_OP
                            STORE_FAST_STORE_FAST
                            JUMP_BACKWARD

The Python bytecode instructions are described as part of the documentation for the dis module, which can be used to disassemble compiled bytecode. Some of the newer specialized bytecode instructions are not yet documented, however.

Python has a stack-based virtual machine, Bucher reminded attendees; when the bytecode above starts executing, there is already an iterator (for range()) on the stack. The FOR_ITER bytecode is responsible for getting the next item from the iterator and pushing it onto the stack. But getting the next item for an iterator in Python is a surprisingly complex operation (looking up the object type, calling its __next__() member function, and so on) due to Python's object model. If it is known that the iterator is using range(), much of that can be simplified. So the interpreter will recognize that it is and will specialize the bytecode after a few iterations to FOR_ITER_RANGE, which has a fast path to generate the needed numbers directly. The STORE_FAST instruction will store the number generated, which it pops from the stack, into the _ variable. The LOAD_FAST_LOAD_FAST instruction just does two LOAD_FAST instructions (which push the value of a local variable onto the stack), pushing b, then a; the next LOAD_FAST then pushes b again.

The BINARY_OP instruction pops two values from the stack, performs a binary operation ("+" in this case), and pushes the result back onto the stack. Performing a binary operation in Python is complicated, though, because the types could be anything that the operator is defined for, so there are lots of steps the virtual machine has to take to execute a generic binary operation. With a chuckle, Bucher said that you literally could define an object where the addition operator "formats your hard drive"; "it is a very overpowered mechanism, especially when you are just adding two integers together". So the BINARY_OP will be specialized to a BINARY_OP_ADD_INT instruction that dispenses with all of the rest of the generic handling and just adds the integers.

Finally, the STORE_FAST_STORE_FAST instruction does two STORE_FAST instructions, storing a+b to b and (the original value of) b to a. JUMP_BACKWARD then simply jumps back to the beginning of the loop, as might be expected.

So after the specialization phase, "we actually have some decent profiling information about your code", he said; the original bytecode did not say much about the data types or the operations being performed, but the specialized instructions do. The next step is to do "micro-op traces", which is a phase that breaks up the bytecode instructions into smaller, more easily optimized, parts, called micro-ops.

Micro-op traces

Once the Python code gets warmed up ("for some definition of 'warm'"), the interpreter starts to translate bytecodes into micro-ops. That translation step is automatic, he said, because of the DSL; the larger instructions can be defined in terms of their constituent micro-ops in the DSL description. So the translation just becomes a table lookup, rather than hand-written code.
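
The DSL-driven translation he described can be pictured with a small sketch; the table layout and names below are hypothetical, not CPython's actual generated code, but they show why expanding a bytecode instruction into its micro-ops is just an indexed load:

    #include <stddef.h>

    /* Micro-op opcodes, mirroring names used in the talk (illustrative only). */
    enum uop { _CHECK_VALIDITY_AND_SET_IP, _ITER_CHECK_RANGE, _GUARD_BOTH_INT /* ... */ };

    /* The expansion of one bytecode instruction into its micro-ops. */
    struct uop_expansion {
        const enum uop *uops;
        size_t count;
    };

    /* One entry per bytecode opcode, generated from the DSL description;
     * translating an instruction is then just expansions[opcode]. */
    extern const struct uop_expansion expansions[256];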

He showed the micro-ops that make up each of the bytecode instructions in the example loop. Each bytecode instruction starts with the same micro-op: _CHECK_VALIDITY_AND_SET_IP, which is important to ensure the correctness of the generated code; it is a mechanism to check that "the world hasn't changed too drastically". User code "can do all sorts of crazy things": it can raise exceptions, change the values of local variables in a debugger, monkey-patch the builtins, and more. So that micro-op is important in the general case, but is not needed here, he said, because of the information that has been gathered; it is known that it is a simple range iteration that adds two integers. That means that seven micro-ops are not needed (one per bytecode instruction of the loop).

Next, the FOR_ITER_RANGE translation has an _ITER_CHECK_RANGE as its first micro-op (after _CHECK_VALIDITY_AND_SET_IP was dropped). It ensures that the object being iterated over is still a range, but in this specific case, the code in the loop makes it clear that it will always be a range, so that guard micro-op can be dropped as well.

Similarly, the micro-ops for BINARY_OP_ADD_INT (now) start with a _GUARD_BOTH_INT micro-op, which guards against the types of the operands changing and chooses the regular, slow path if that happens. Here, though, the operand types cannot change, so the guard can be removed. In the end, the 19 micro-ops generated for the seven bytecode instructions were reduced to ten.

That list of ten micro-ops is, effectively, the minimal amount of work needed to "faithfully execute the Python semantics for this for loop". It is all statically typed at this point with no unneeded overhead; all of the dynamic-type handling has been removed, "which is perfect if we want to further lower it". It turns out that further lowering is needed, because the micro-op instruction format, despite its name, is more complex than bytecode instructions; the micro-op format has a lot of information about cached values, de-optimization options, and things of that nature.

Because of that complexity, the decoder in the micro-op interpreter is a lot more expensive than the bytecode interpreter, he said. Even though each micro-op is doing less work, there are more of them than there are bytecode instructions (ten versus seven), so the dispatch overhead is higher. Turning on the micro-op traces, doing the optimization as he described, and running the result in the micro-op interpreter is actually around 20% slower than the bytecode interpreter. That is "super disappointing", because all of this extra work resulted in slower execution, but that is where JIT compilation comes into play.

JIT

The hope is that the JIT compiler can overcome the additional interpreter overhead and gain even more performance by compiling the optimized traces directly to machine code. Three mechanisms are being used to reduce the indirection that causes some performance loss.

The first is to "burn in" constants, caches, and arguments, so they are encoded directly in the machine code without having to be decoded from individual instructions in the interpreter. The second technique is to move data out of stack frames and into registers. The Python interpreter is stack-based, but that means each push and pop operation goes out to memory, so redirecting them to registers provides a large savings. The third mechanism is to move the hot code paths to inline code; "rather than having a generic interpreter that needs to handle every possible micro-op", Python will "create a straight-line sequence of the exact code that we need to execute to reduce that dispatch overhead".

All of this work needs to be weighed against Python's deployment goals, Bucher said. For example, these changes need to have broad platform support; "Python runs a lot of places" and he would not consider the project to be a success if the JIT only runs on one or two operating systems, or only on one architecture. Similarly, he wants to ensure that there are few runtime dependencies for the JIT; some deployments cannot also install a full compiler toolchain—or even just a C compiler. It should remain true that users can just download a binary from python.org and "everything just works".

The most important goal to him is that the JIT have a low implementation complexity. Python is maintained almost exclusively by volunteer effort. The project would not be a success if "our new JIT compiler was so big and so complex that nobody could reason about it, no one could maintain it, no one could fix bugs in it, or improve it, or extend on it". The team is willing to sacrifice some performance in order to ensure that the Python project remains healthy.

Historically, JIT compilers are "very very complex things", which means that the technique is in opposition to the deployment goals. But, it turns out that copy-and-patch compilation "solves a lot of these problems in a very satisfying way". It is a technique for "generating template JIT compilers", which is "really really cool". The paper at the link is quite good, Bucher said, but there is also a blog post by its main author where he implements a Lua JIT from scratch using the technique. "If you don't know how JIT compilers work and you read that blog post, you'll know how JIT compilers work."

The paper shows good performance increases, he said. Compared to the Liftoff WebAssembly baseline compiler, the paper reported code generation that was five times faster and generated code that ran 50% faster. Compared to a traditional JIT toolchain, such as LLVM with a low optimization level (-O0), the code was generated 100 times more quickly and it ran 15% faster using the copy-and-patch technique. As sort of an upper bound on what can be expected, comparing it to an optimizing JIT with hand-written assembly (LuaJIT) showed that copy-and-patch executed more quickly on around one-third of the LuaJIT benchmark tests and was only 35% slower overall.

Thought experiment

He led attendees through a thought experiment on how you might go about implementing a program to compile a sequence of "bytecode instructions to machine code as quickly as possible". The program could step through the bytecode instructions; for each one, it could copy some static, pre-compiled machine code into executable memory, then the static code could be patched for anything that needed changing in the template. Patches would include things like jump targets, constants, and cached values.

If you look at those steps, he said, they look like something that we have already had for a long time: relocatable object files. The patching in that case is for the information from relocation records for external symbols and the like. The way copy-and-patch works is analogous to the functioning of the linker and loader for object files.

He used the code for a single micro-op (_LOAD_FAST) as an example of how the JIT works. The code for the micro-op is fairly straightforward:

    case _LOAD_FAST:
        PyObject *value = frame->localsplus[oparg];
        Py_INCREF(value);
        *stack_pointer++ = value;
        break;

It simply gets an object (value) from the local-variables array (localsplus) of the frame, increments its reference count, and pushes it onto the stack (pointed to by stack_pointer).

At CPython build time, the body of the micro-op is extracted into its own function so it can be compiled separately. Of course, the code will not compile because it is missing frame and stack_pointer, but those can be passed as parameters to the function. It is also missing the value of oparg, but it is treated somewhat differently.

When the function is compiled to machine code, the value of oparg will not be known, but when the JITted code is placed into memory for a given usage of _LOAD_FAST, that value will be known. It will be the same each time, so there needs to be a way to patch that value into the compiled code when the final output for the micro-op is constructed. Since "we don't really know how to do that yet", he suggested that people think of oparg being replaced with MAGICALLY_INSERT_OPARG, but he would come back to that.

Similarly, there will be more than just a single micro-op to be JITted, so there needs to be some way to transfer control to the next instruction. He would just "magically hand-wave it away" by inserting a line at the end of the function:

    return MAGICALLY_RUN_NEXT_MICRO_OP(frame, stack_pointer);

The copy-and-patch technique has a "really elegant solution" for handling the two MAGICALLY_* placeholders by using extern declarations. This will cause the C compiler to leave holes in the generated code along with instructions on where and how to patch those holes; those instructions are intended for use by the linker, but CPython can use them for its purposes. In the end, the code for the _LOAD_FAST micro-op looks like the following:

    extern int MAGICALLY_INSERT_OPARG;
    extern int MAGICALLY_RUN_NEXT_MICRO_OP(_PyInterpreterFrame *, PyObject **);

    int
    _LOAD_FAST(_PyInterpreterFrame *frame, PyObject **stack_pointer)
    {
        PyObject *value = frame->localsplus[MAGICALLY_INSERT_OPARG];
        Py_INCREF(value);
        *stack_pointer++ = value;
        return MAGICALLY_RUN_NEXT_MICRO_OP(frame, stack_pointer);
    }

At CPython build time, that code gets compiled at a high-optimization level; there is no need to worry about the amount of time that takes since it is not done at run time. That generates an ELF object with the machine code and the relocation information for the two magical symbols. That code can actually be further optimized by CPython because it knows more than the C compiler about the domain of oparg, which always fits in 32 bits, so it can use a different instruction with a 32-bit immediate value. CPython also knows how the code will be used; since the JITted code for the micro-ops will be placed one right after another in memory, the jump emitted for MAGICALLY_RUN_NEXT_MICRO_OP would effectively be a jump of zero bytes, so it can be eliminated.

The byte values for the instructions can be extracted from the ELF, then put into a function that looks like the following:

    static void
    emit__LOAD_FAST(unsigned char *code, _PyUOpInstruction *uop)
    {
        const unsigned char code_body[] = {
              0xb8, 0x00, 0x00, 0x00, 0x00, 0x48, 0x8b, 0x44,
              0xc7, 0x48, 0x8b, 0x08, 0xff, 0xc1, 0x74, 0x02,
              0x89, 0x08, 0x48, 0x89, 0x06, 0x48, 0x83, 0xc6,
              0x08,
        };
        memcpy(code, code_body, sizeof(code_body));
        memcpy(code + 1, &uop->oparg, 4);
    }

The code_body is the sequence of bytes emitted by the C compiler, with the four zero values as the hole to be filled with oparg, which is what the second memcpy() call does. This technique makes the JIT run extremely quickly. "It actually takes almost twice as long to allocate the memory as it does to JIT the memory; that's not because allocation is slow, that's because copy-and-patch is so fast."

Fitting into CPython

Currently, the JIT build of Python 3.13 uses around 1000 lines of complex Python and 100 lines of complex C code for the copy-and-patch implementation. That is the code that builds the C templates, extracts the generated code from the ELF object, fixes up some individual micro-ops, and then writes out the generated C code. There is a dependency on LLVM, but that is only at build time; users of CPython do not need LLVM at run time.

The run-time code for the JIT compiler is around 400 lines of simple-ish hand-written C. The core of it is a loop over a micro-op trace calling emit_*() functions for each. There are also around 9000 lines of simple generated C code, which is "truly just as simple as what I showed you on the previous slide" (emit__LOAD_FAST() above). There are no run-time dependencies for users.
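
That loop might look something like the following sketch; emit__LOAD_FAST() and _PyUOpInstruction come from the example above, while the opcode field, the emitter tables, and the function name are assumptions made here for illustration:

    #include <stddef.h>

    typedef void (*emitter)(unsigned char *code, _PyUOpInstruction *uop);

    /* One emit_*() function and one template size per micro-op opcode. */
    extern const emitter emitters[];
    extern const size_t template_sizes[];

    static unsigned char *
    jit_trace(unsigned char *code, _PyUOpInstruction *trace, size_t length)
    {
        for (size_t i = 0; i < length; i++) {
            _PyUOpInstruction *uop = &trace[i];
            emitters[uop->opcode](code, uop);       /* copy template, patch holes */
            code += template_sizes[uop->opcode];    /* templates laid end-to-end  */
        }
        return code;    /* next free byte in the executable buffer */
    }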

Because the CPython JIT uses LLVM, "we have incredible platform support right out of the box". It runs on all of the most popular operating systems (Linux, macOS, and Windows) for x86_64, as well as 32-bit Windows on x86; it also supports the three OSes on Arm 64-bit processors. It is mind-blowing to him that the project was able to support so many different platforms, so quickly, and with little additional effort.

As he noted earlier, turning on the micro-op interpreter (without enabling the JIT) is about 20% slower. That mode is mostly useful for developers on the project, because it is much easier to debug interpreted code than JITted code. It also uses around 1% more memory to store the micro-op traces.

The goal for 3.13 was to get a basic, working JIT compiler into CPython so that people could try it out, test deploying it, and allow the project to shake out more bugs—"something that doesn't crash and doesn't perform horribly". The JIT mode is experimental, but if it is configured into the binary, it is basically 0% slower; it has regained all of what was lost to the dispatch overhead in the micro-op interpreter. Meanwhile, most of the optimizations that he described in the talk have not been implemented yet, so the plan is to use the 3.14 cycle to improve on that; "I am confident that we will be able to".

The JIT compiler uses around 10% more memory, which is kind of an upper bound, Bucher said, because it is based on benchmarks that have a high ratio of code to data. Typically, if you are worried about memory usage it is because you are processing a lot of data, so the amount of JITted code will be a small portion of the memory usage. In addition, the JIT developers have not spent any time on reducing the memory used, so that figure is likely to be a high-water mark.

For further information, he recommended his previous talk and PEP 659, both linked above, to get the background on the specializing adaptive interpreter. He recently wrote PEP 744 ("JIT Compilation"), which covers many of the same things as were in the talk, but also gives more background, describes some of the design decisions, and looks to the future.

Those who want to build the JIT into their Python can simply pass a flag to either configure (--enable-experimental-jit) or build.bat (--experimental-jit). That should work for cross-compilation as well. Since the JIT is part of CPython, Bucher asked that any bugs found be filed as GitHub issues in the main CPython repository. With that, his half-hour session ran out of time.

[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Pittsburgh for PyCon.]

Comments (2 posted)

How free software hijacked Philip Hazel's life

By Joe Brockmeier
June 19, 2024

Philip Hazel was 51 when he began the Exim message transfer agent (MTA) project in 1995, which led to the Perl-Compatible Regular Expressions (PCRE) project in 1998. At 80, he's maintained PCRE, and its successor PCRE2, for more than 27 years. For those doing the math, that's a year longer than LWN has been in publication. Exim maintenance was handed off around the time of his retirement in 2007. Now, he is ready to hand off PCRE2 as well, if a successor can be found.

Punch cards to flat screens

Hazel's tenure as a free-software developer is exceptional, if not record-breaking in its length. Linus Torvalds began working on Linux in 1991 as a college student and is still leading its development 33 years later with no signs of slowing. However, as Hazel wrote in his technical memoir From Punched Cards To Flat Screens, he began contributing to free software "nearer the end than the start" of his career.

At the start of his career, Hazel was introduced to computers as an undergraduate when the University of Cape Town (UCT) received its first computer: an International Computers and Tabulators (ICT) 1301, with a dazzling capacity to read as many as 600 punched cards per minute. He attended introductory programming lectures, read the Manchester Autocode programming manual from cover to cover, and began writing test programs encoded onto paper punch cards:

If you were lucky and your program worked, the output appeared on the printer; more often than not (at least at the start) all you got was an error message, which meant that you had to fix your program, return to the back of the queue, and hope to get another run before the end of the session. The more attempts it took to get a program to work, the more useful scrap paper one accumulated.

He later moved on to the University of Cambridge as a PhD student. Cambridge was blessed with four computers when Hazel arrived in 1967: a Titan, IBM 1130, IBM 360, and PDP-7. Because Hazel knew the Manchester Autocode language, which is similar to Titan Autocode, he was given an account on the Titan "and that was the start of the slippery slope down which I have been sliding ever since".

That slope, as well-detailed in Hazel's memoir (and mercilessly abbreviated here), ultimately led to his joining the Cambridge Computing Service as a software developer in 1971. During his tenure at Cambridge, and at home, Hazel had the opportunity to work with a number of interesting systems including the PDP-11, Ultrix, Sinclair ZX-81, and BBC Micro. In 1990 the Computing Service decided to set up a cluster of Unix systems for the use of staff and graduate students, and he began working with Sun boxes running SunOS. This coincided with the transition from the X.25 wide-area network standard to the Internet Protocol (IP).

Hazel was responsible for the email service, and found it difficult to configure Sendmail to choose between IP and X.25 to deliver email. A colleague suggested that he try the free-software Smail MTA instead. Hazel took him up on the suggestion and added two features to Smail to manage the DNS lookups and decide whether to send mail via SMTP or X.25. While managing Smail for Cambridge, he submitted additional features to cope with messages that had bad senders:

If such a message could not be delivered, the bad sender meant that an error report could also not be delivered, and the postmaster (me) had to intervene to sort things out. By rejecting messages whose sender address could not be verified, I pushed the onus of dealing with this problem out to the sending MTA. This kind of checking is now standard MTA practice (along with many other checks for spam and viruses).

This was a small start that led Hazel to thinking about more checking and verification that would be needed for MTAs. He considered, but decided against, trying to modify Smail. It was written in pre-standard C and carried "a lot of UUCP baggage". Hazel wanted to write an MTA for modern (at the time) operating systems with standard C compilers and runtimes, and permanent connections to a TCP/IP network. He began working on the Experimental Internet Mailer (Exim) in March 1995. By November it was "just about able to send and receive emails".

Exim

Hazel had informed a colleague, Piete Brooks, that he was working on Exim. Brooks wanted to try it out, but Hazel demurred because he had not written any documentation. (This may be the first known case of a programmer refusing to distribute undocumented code...) Brooks's response was "I don't want the documentation, I want the code." Even so, Hazel insisted on writing a first cut of the Exim manual before packaging up the code and sending it off to meet its fate.

Brooks put the code into service right away, and started telling others about it. Hazel put Exim releases on a public FTP site, and gave a talk on Exim at a "Campus Mail Day" in Aberdeen, which led to more usage. A request from Richard Stallman persuaded him to switch from a homegrown license for Exim to the GPL. It began being ported to other operating systems, and eventually found its way to Linux in 1997 or 1998, "which by then had become an operating system that could be used in production environments". Over time, Exim would become (and still is) a popular MTA used all over the planet. Its adoption far exceeded Hazel's expectations:

I never expected commercial sites to get involved, nor for it to become the default MTA in any operating systems. In short, I did not foresee that it would grow into the fully-fledged open source development project that it has.

PCRE

Regular expressions play a rather large part in managing mail, and Hazel wanted to have more flexible regular expressions for Exim than were present in Smail. He chose a regular-expression library written by Henry Spencer for early versions of Exim, but found it limiting compared to Perl's extended pattern-matching features. He thought about lifting Perl's implementation, but it was too integrated to be easily adapted to Exim. "In the end, I solved the problem the way programmers generally solve their problems: by writing something myself." That, of course, became PCRE.

Hazel bundled PCRE with Exim and also released it as a standalone library. Like Exim, it filled a need that he did not even realize existed. PCRE was adopted by Apache HTTPD, and Hazel was "particularly gratified" to find that it had been included with the Postfix MTA. After a while, PCRE looked beyond Perl to include features such as recursion inside a regular expression, and named subpatterns (borrowed from Python), as well as other ideas taken from other regular-expression implementations.

In 2014, Hazel decided that PCRE's API was "not capable of much further extension" and began working on a new, incompatible, version of the code. The first PCRE2 release (starting at version 10.00 to avoid confusion with then-current PCRE 8.x releases) was announced in January 2015. Hazel continued supporting PCRE until its final release in 2021. PCRE2 moved to GitHub in 2022. The most recent release of PCRE2, 10.44, came out on June 7, 2024.

Hazel wrote that PCRE may now be even more important than Exim was "because of its widespread use in many different applications and operating systems". Indeed, just looking to see the installed software that depends on the PCRE2 library on Fedora 40 turns up use by Git, Grep, MariaDB, nmap, SELinux's command-line tools, Sway, Wget 2, and quite a few others.

In his memoir, Hazel had written that it felt right to pass maintenance of Exim on to others after so long. "More than a decade on one project is long enough." In his 2017 update to the memoir, he noted that the sentence had come back to bite him: He was still working on PCRE 19 years after he started to write it. Little did he suspect he would still be maintaining it in 2024. He would like to hand it off while he can help with the transition.

Passing the torch

I emailed Hazel with a few questions about his thoughts on long-term free-software maintenance, changes to the industry, and what he might do once PCRE was handed off. Hazel wrote that he does not have any post-PCRE plans. All of his projects, with the exception of the base to presentation form (B2PF) library for converting Unicode characters for printing, were started while he was still working. "Since I retired I haven't felt a lack of anything enough to spur me into writing something new". He added that he would continue to maintain his non-PCRE2 projects "if necessary" and would help with PCRE2 if needed.

When asked how new generations of tools, and developers, had changed his work habits, Hazel replied that he has changed little. "I am sufficiently old that I am well stuck in my ways, and haven't changed how I do things since the 90s" with the exception of moving development to GitHub. He admitted that he was not good at anticipating or keeping up with trends in development. For example, he said he was not aware of Unicode when he began PCRE development, so it was written to support ASCII and with a limit of 64K for compiled patterns. Others had to persuade him to extend PCRE with Unicode support, and the option for larger patterns. Contributors, he said, have not changed. The world has, though:

When I was born, there were no digital computers, though their predecessors, for example the Colossus at Bletchley Park, were around. The world has gone from nothing to where we are now in just one lifetime. Incredible! What particularly amazes me more than the CPU power is the amount of storage that I carry in my pocket.

Hazel did offer some advice for those starting new software projects, though he noted he was not the first to make this point:

It's worth remembering that the effort needed to maintain a piece of successful software over its lifetime far outweighs the effort of writing it in the first place. In the case of PCRE there have been several major re-designs and re-writes of the internals as well as continual extensions to the user-visible features.

He also suggested that developers think about how software would be tested as it is designed: "Effort put into building test harnesses is never wasted."

I asked Hazel, given the recent XZ backdoor attempt, how he intended to vet any prospective PCRE2 maintainers. He replied that it was a good question "to which I have no answer. I will have to see who (if anyone) makes an offer". To date, he said he had received "no communications whatsoever" about taking over the project. Perhaps, once the word gets out more widely, a qualified maintainer will step forward to take PCRE2 into the future.

Comments (16 posted)

Aeon: openSUSE for lazy developers

By Joe Brockmeier
June 14, 2024

The openSUSE project recently announced the second release candidate (RC2) of its Aeon Desktop, formerly known as MicroOS Desktop GNOME. Aside from the new coat of naming paint, Aeon breaks ground in a few other ways by dabbling with technologies not found in other openSUSE releases. The goal for Aeon is to provide automated system updates using snapshots that can be applied atomically, removing the burden of system maintenance for "lazy developers" who want to focus on their work rather than desktop administration. System-tinkerers need not apply.

The idea behind Aeon, as with other immutable (or image-based) Linux distributions, is to provide the core of the distribution as a read-only image or filesystem that is updated atomically and can be rolled back if needed. Google's ChromeOS was the first popular Linux-based desktop operating system to follow this model. Since the release of ChromeOS a number of interesting immutable implementations have cropped up, such as Fedora Silverblue, Project Bluefin (covered here in December 2023), openSUSE's MicroOS (covered here in March 2023), and Ubuntu Core.

What makes up the core software and how the immutable bits are composed, deployed, and managed varies quite a bit between distributions. Aeon uses a utility called transactional-update (with openSUSE's Zypper package manager under the hood) and Btrfs subvolumes to create and update system snapshots. Basically, instead of installing and updating a system with individual packages while it is running, the updates are applied in the background to a separate Btrfs snapshot and then the system is rebooted into that snapshot. The /home and /var directories are, of course, writable Btrfs volumes; /etc uses overlayfs to apply local changes on top of the default configuration files.

User-installed software for these distributions is separated from the rest of the system software on a mutable filesystem using some type of application-containerization technology like Flatpak, Podman, or Snap. In Aeon's case, user applications are generally installed via Flatpak, or using a traditional package manager inside a containerized Linux distribution of the user's choice that is managed with Distrobox. (On Aeon, Distrobox uses Podman to run containers.)

Installation and updates

The openSUSE YaST team has been working on a new installer called Agama, but Aeon has its own installer, the Transactional Installation Kit (tik). The tik installer is designed to deploy operating system images to UEFI hardware, which means that users with older hardware are (at least currently) out of luck when it comes to installing Aeon. It can also make use of the Ignition and Combustion configuration tools to create users, enable services, install SSH keys, and more, automatically at install time.

As with the rest of the distribution, the philosophy for tik is "minimal". Tik does not ask the user to make any choices about software, how disk partitions should be laid out, or much else, except to confirm that the installation should proceed. When a system has multiple disks, users are prompted to choose which one to use. If the target system has an existing MicroOS installation, tik will offer to back up (and restore) existing users and data as long as the USB stick with the installer has more free space than the data stored in the home directory.

After installation, there is a first-run wizard to perform system configuration. It asks for the usual information: the language and keyboard layout to use, wireless network connection information, time zone, and user information. Aeon does not configure a root user, so the first system user is set up as an administrator and can use sudo to perform any administrative tasks.

Truly minimal desktop

Aeon is composed from packages in the openSUSE Tumbleweed and MicroOS repositories, so Aeon RC2 provides users with an up-to-date GNOME 46 desktop using Wayland. (GNOME 46.2, to be exact.) It also includes Linux 6.9.3, systemd v255, and glibc 2.39.

Many Linux distributions offer "minimal" desktop package selections, but openSUSE Aeon takes this farther than most. Aside from some basic applications one would expect with GNOME—such as the file manager, settings application, and other utilities—Aeon comes with almost no software installed as part of the base system. Firefox and GNOME's text editor, calculator, and terminal application are all installed as Flatpaks as a second step after the desktop user logs in the first time. If a user wants a media player, office suite, email application, or even image and PDF viewers, they have to be installed separately. That may be taking minimalism a bit too far, but it does give users a lot of control over what is installed on their system.

[openSUSE Aeon minimal install]

Aeon is touted as a distribution for users who do not want to hassle with system administration, but its sparse selection of software ensures that some up-front work is required to reach the payoff. For example, aside from having to install expected utilities like a PDF viewer, the man command and man pages are not part of the default software. Probably the best way to deal with that is to create a Distrobox container to do one's work in, though having the Distrobox man pages available would be helpful when doing so. (Distrobox documentation is available online, of course, but I've always reached for man pages first.)

One attractive, and seemingly unique, feature that Aeon offers is automatic updates for the operating system, installed Flatpaks, and any distroboxes that one may have set up. It's not unusual to offer system and Flatpak updates, but updating distroboxes is usually a task left to the user. It can become unwieldy quickly if one has several distroboxes set up that need to be updated manually.

According to the RC2 announcement, there are a few features targeting better performance. Aeon is the first openSUSE edition to use the zram kernel module for system swap, which is supposed to improve system performance by avoiding swapping to disk. Aeon's transactional-update system is also supposed to automatically choose packages compiled with x86-64-v3 optimizations, if such packages are available and the CPU supports Advanced Vector Extensions 2 (AVX2).

Overall, Aeon looks like a good choice for users who want a minimal, immutable openSUSE GNOME desktop system. It would be especially attractive for users who want a distraction-free system with just a few applications, like an IDE and web browser for development work. It is probably not a great choice for openSUSE users who are happy with Tumbleweed or Leap and installing software with Zypper.

Readers interested in trying openSUSE Aeon should check out the install guide and overview. The distribution is still in a release-candidate stage, so rough edges are to be expected—but it is solid enough to test drive and report bugs if they're found.

Comments (9 posted)

Nested bottom-half locking for realtime kernels

By Jonathan Corbet
June 17, 2024
Software-interrupt handlers (also called "bottom halves") have a long history in the Linux kernel; for much of that history, developers have wished that they could go away. One of their unfortunate characteristics is that they can add unexpected latency to the execution of unrelated processes; this problem is felt especially acutely in the realtime-preemption community. The solution adopted there has created problems of its own, though; in response, Sebastian Andrzej Siewior is proposing a new locking mechanism for realtime builds of the kernel that may have benefits for non-realtime users as well.

In normal kernel builds, a software-interrupt handler will run, if needed, at the earliest opportunity that the kernel finds; usually, that is immediately after the completion of a hardware-interrupt handler or on return from the kernel to user space. Either way, software-interrupt handling can delay the execution of a process that may have nothing to do with the creation of that interrupt. For most systems, that delay is not usually a problem, but realtime kernels are all about response time; a badly timed software-interrupt handler has the potential to cause a realtime task to miss its deadline.

It turns out that the realtime developers are firmly of the opinion that they have not worked on that project for over two decades just to be thwarted by a software-interrupt handler. So those handlers have been made preemptible like nearly everything else in realtime kernels. That change only addresses part of the problem, though. The kernel makes heavy use of per-CPU variables as a way of avoiding contention between processors; as long as no other CPU can access a memory location, there will be no contention for it, and no need for locking to ensure mutual exclusion. Except, of course, if a software-interrupt handler runs and tries to access the same data.

To avoid such problems, kernel code can call local_bh_disable(), which blocks the execution of software-interrupt handlers until local_bh_enable() is called. A call to local_bh_disable() will also disable preemption and migration for the running task, ensuring that it has sole access to the CPU during its critical section. That solves the problem of racing with software-interrupt handlers (or any other kernel code) on the same CPU, but creates another latency problem for realtime kernels; as long as preemption is disabled, a higher-priority process cannot run on that CPU, once again threatening to increase latency for that higher-priority process.
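
As a concrete illustration of that pattern, here is a minimal sketch (the per-CPU structure and function are hypothetical) of kernel code relying on local_bh_disable() to keep software-interrupt handlers away from per-CPU data:

    #include <linux/bottom_half.h>
    #include <linux/percpu.h>
    #include <linux/types.h>

    struct my_stats {               /* hypothetical per-CPU data */
        u64 packets;
        u64 bytes;
    };
    static DEFINE_PER_CPU(struct my_stats, my_stats);

    static void my_stats_add(unsigned int len)
    {
        struct my_stats *stats;

        local_bh_disable();         /* no softirq handler can run here */
        stats = this_cpu_ptr(&my_stats);
        stats->packets++;
        stats->bytes += len;
        local_bh_enable();
    }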

The solution to that problem in the realtime tree is to make tasks preemptible while software-interrupt handlers are disabled. But, since a task may be depending on local_bh_disable() to keep other tasks from accessing its per-CPU data, local_bh_disable() takes a per-CPU lock on realtime kernels. As a result, only one task can be running with software interrupts disabled on any given CPU at a time.

But, it almost goes without saying, there is another problem. If a low-priority process enters a local_bh_disable() section, it can be preempted within that section and prevented from executing (and restoring software interrupts) for a long time. That could block a higher-priority process from completing a local_bh_disable() call of its own. It is, in other words, a classic priority-inversion situation. Here, the problem is worsened by the fact that, in all likelihood, there is no actual contention between the two tasks; they are probably calling local_bh_disable() to protect entirely different data.

This situation highlights a problem with disabling software interrupt handling: it is essentially a big lock that provides no indication of what data it is actually protecting. That, in turn, points to a potential solution: replace the big lock with fine-grained locking that protects a limited and well-defined set of data. That is the approach taken by Siewior's patch set. Specifically, it adds a pair of new macros:

    local_lock_nested_bh(local_lock_t *lock);
    local_unlock_nested_bh(local_lock_t *lock);

Using this mechanism requires auditing each local_bh_disable() section, figuring out which data is protected therein, and adding a local_lock_t (a specialized lock that only prevents access from the same CPU) to that data structure. That lock can then be passed to local_lock_nested_bh() to protect only that structure while not blocking concurrent execution by unrelated code.
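
Applied to the hypothetical structure from the earlier sketch, the result of such an audit might look like the following; local_lock_t and INIT_LOCAL_LOCK() are existing kernel interfaces, while the nested-bh calls are the macros from the patch set described above:

    #include <linux/bottom_half.h>
    #include <linux/local_lock.h>
    #include <linux/percpu.h>
    #include <linux/types.h>

    struct my_stats {
        local_lock_t lock;          /* documents exactly what is protected */
        u64 packets;
        u64 bytes;
    };
    static DEFINE_PER_CPU(struct my_stats, my_stats) = {
        .lock = INIT_LOCAL_LOCK(lock),
    };

    static void my_stats_add(unsigned int len)
    {
        struct my_stats *stats;

        local_bh_disable();         /* still blocks softirqs and migration */
        stats = this_cpu_ptr(&my_stats);
        local_lock_nested_bh(&stats->lock);
        stats->packets++;
        stats->bytes += len;
        local_unlock_nested_bh(&stats->lock);
        local_bh_enable();
    }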

Code using this approach must still call local_bh_disable() to prevent access by software-interrupt handlers and to prevent migration to another CPU. But, once all of the local_bh_disable() sections have been audited and adjusted (a job that is reminiscent of the long effort to remove the Big Kernel Lock), it will be possible to remove the lock that realtime kernels take in local_bh_disable(), eliminating a significant source of contention and latency. Benchmark results posted with the patch series show a significant improvement (14.5%) for a networking workload once that lock is removed.

On non-realtime kernels, by contrast, local_lock_nested_bh() is essentially a no-op, though it does provide information to the locking checker for debugging purposes. Local locks have no effect on non-realtime kernels, and do not require any storage. Thus, this patch satisfies one of the rules that has constrained realtime development from the start: realtime-specific features must not have a performance impact on non-realtime kernels.

This work will have a significant benefit for the rest of the kernel, though, even if it doesn't change the generated code in most cases. With the current local_bh_disable() pattern, there is no indication of what data is being protected. That makes it hard to reason about concurrent access, and makes the introduction of bugs more likely. Once this work is done, the locking rules for the affected data structures will be documented; in many cases, that may make it possible to stop disabling software interrupts entirely in favor of a more focused locking scheme.

This patch set is in its sixth revision. Previous postings have resulted in some significant changes, mostly in how some networking subsystems were changed to use the new mechanism; the core concept has remained mostly the same. A few developers have indicated their acceptance of this work, so chances are good that it will find its way upstream before too long.

Comments (8 posted)

Simplifying the BPF verifier

By Daroc Alden
June 13, 2024

LSFMM+BPF

The BPF verifier is a complex program. This has the unfortunate effect of making it simultaneously more difficult for contributors to work on and more likely to harbor unknown bugs. At the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Shung-Hsi Yu presented two concrete proposals for simplifying the verifier to make it easier to maintain: changing how it tracks partially known values, and cleaning up the interface to hide the details of the value-tracker's internal representation.

[Shung-Hsi Yu]

One of the core functions of the verifier is value tracking — inferring the set of possible values that a variable can hold, in order to ensure that accesses remain within bounds. Since any value could potentially be used to compute an array index or other quantity the verifier is interested in knowing, the value tracker needs to follow every value in the program. The verifier stores information on possible values in the bpf_reg_state structure, which tracks two related kinds of information. The first is "known bits", which uses a mask to indicate when individual bits of the value are known exactly:

    struct tnum {
        u64 value;
        u64 mask;
    };

The second is the valid range of the value, tracked as both signed and unsigned 32- and 64-bit quantities:

    struct bpf_reg_state {
        ...
        struct tnum var_off;
        s64 smin_value; /* minimum possible (s64)value */
        s64 smax_value; /* maximum possible (s64)value */
        u64 umin_value; /* minimum possible (u64)value */
        u64 umax_value; /* maximum possible (u64)value */
        s32 s32_min_value; /* minimum possible (s32)value */
        s32 s32_max_value; /* maximum possible (s32)value */
        u32 u32_min_value; /* minimum possible (u32)value */
        u32 u32_max_value; /* maximum possible (u32)value */
    };

This choice of what information to track represents a tradeoff between accuracy and efficiency. If computers had overabundant memory, the verifier could just track the set of possible values directly using a generic set data structure. The downside of that approach would be the significantly increased memory overhead compared to the bytes required to store a bpf_reg_state. The downside of the more efficient approach is that it can't represent all possible sets of values, so sometimes the code needs to make a conservative over-approximation, which can snowball and make the verifier fail to figure out bounds that it theoretically could have. For example, the verifier can't currently handle a disjoint range, like a value that must be between one and four or eight and ten. Instead, it would track the range as just one to ten.

In practice, tracking both known bits and possible ranges provides a good tradeoff. Either one alone would fail to capture important properties that the verifier cares about, but together they aren't too large or complex to work with. They can represent possible sets of values such as "a multiple of eight between zero and 64", which is a good fit for tracking the alignment and bounds of an array access.
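
As a worked example (using the structures shown above, with values chosen here for illustration), "a multiple of eight between zero and 64" could be encoded like this:

    /* Known bits: the low three bits are known to be zero, bits 3-6 are
     * unknown, so the tnum alone allows any multiple of 8 from 0 to 120. */
    struct tnum var_off = {
        .value = 0x00,
        .mask  = 0x78,
    };
    /* The range then clamps that set to multiples of 8 no larger than 64,
     * i.e. { 0, 8, 16, 24, 32, 40, 48, 56, 64 }. */
    u64 umin_value = 0;
    u64 umax_value = 64;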

Track fewer bounds

Yu has a proposal that could simplify the actual implementation of bpf_reg_state significantly, while still keeping the same precision: stop tracking the signed versions of the ranges separately. Right now, whenever the verifier updates one range (such as inferring a new smin_value from a conditional branch), it needs to perform a complex synchronization to make sure the change is reflected in every range. That synchronization involves propagating information in 20 different directions, Yu said. It is necessary because the code doesn't track which fields have been updated, so synchronizing the bounds after processing a chunk of code involves sharing information from each of the five tracked constraints (four ranges and a tnum) with each of the other four.

Instead of tracking ranges in the current way, Yu proposes tracking ranges using a variant of the approach he discussed in October 2023. Essentially, the maximum would be allowed to be lower than the minimum. The range represented this way always starts at the minimum and ends at the maximum, but it might wrap around part way through. This means that the range (minimum: 0xFFFFFFFC, maximum: 4) represents the signed range (-4, 4) while simultaneously representing the unsigned ranges (0xFFFFFFFC, UINT32_MAX) and (0, 4). The existing code doesn't handle disjoint ranges like that, so Yu plans to add some conversion functions that convert from the new representation for use by the old code.
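
A minimal sketch of how membership in such a possibly-wrapped range can be tested (the helper below is illustrative, not code from the patches):

    #include <stdbool.h>
    #include <stdint.h>

    /* When max < min, the range wraps through the top of the unsigned space;
     * (0xFFFFFFFC, 4) therefore covers -4..4 when read as signed values. */
    static bool in_wrapped_range(uint32_t v, uint32_t min, uint32_t max)
    {
        if (min <= max)
            return v >= min && v <= max;    /* ordinary range */
        return v >= min || v <= max;        /* wrapped range  */
    }

Equivalently, the check can be written as a single unsigned comparison, (uint32_t)(v - min) <= (uint32_t)(max - min), relying on wraparound arithmetic.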

Storing ranges this way has a few benefits. The biggest one is that there is no need to synchronize knowledge of signed and unsigned bounds — they are automatically synchronized, simply by virtue of the representation. That also cuts down on the amount of information the verifier needs to propagate between the known-bits representation and the range representations, bringing the code down to only six directions of information flow (from each of the three bounds to the two others). Yu hopes that this will make the verifier code that handles value tracking much easier to work with, and also to formally verify.

Yu plans to get the changes into upstream in a few steps; initially, there will be conversion functions and the main verifier code will remain largely unchanged. Then, he plans to change the most critical parts of the value-tracking code to use the new representation natively, followed by adapting the kernel selftests. Finally, the last uses can be removed along with the conversion functions.

Yu's second proposal for simplifying the value tracker is to introduce a more abstract interface for working with tnum and range values. The proposals can be implemented independently, but they would certainly complement one another. Right now, working on the verifier code requires knowledge of the internal details of tnums and ranges; but the most common operations to perform on these values are just intersections and inclusion checks, Yu said. If those were pulled out into their own functions, a lot of the actual value-tracking code could be substantially simplified.
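
What such an interface might look like is sketched below; none of these names exist in the kernel, they simply illustrate the intersection and inclusion operations Yu mentioned:

    #include <stdbool.h>

    struct bpf_reg_state;           /* as in the verifier */

    /* Narrow *dst to the values possible in both *dst and *src. */
    void reg_bounds_intersect(struct bpf_reg_state *dst,
                              const struct bpf_reg_state *src);

    /* Return true if every value possible in *inner is also possible in *outer. */
    bool reg_bounds_include(const struct bpf_reg_state *outer,
                            const struct bpf_reg_state *inner);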

Those aren't the only possible ways to ease maintenance of the verifier, however. The session ended with a discussion of how to improve the documentation, what aspects of the verifier could potentially be standardized, and how these proposals would impact formal verification. The verifier has certainly earned its reputation as a tricky piece of code to maintain, but it seems like the kernel's BPF developers have a plan to start changing that.

Comments (3 posted)

Static keys for BPF

By Daroc Alden
June 17, 2024

LSFMM+BPF

The kernel has a lot of code paths that are normally disabled: debugging print statements, tracepoints, etc. To support these efficiently, there is a common mechanism called static keys that provides a way to enable or disable a code path at run time, with effectively no overhead for disabled branches. BPF programs have not been able to take advantage of static keys so far, because they aren't compiled into the kernel. Now, it looks like BPF may be getting support for a similar mechanism — and the design could also provide one of the components needed to support jump tables, another missing feature. Anton Protopopov presented his plans to add static keys to BPF at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit.

[Anton Protopopov]

Static keys in normal kernel code work by using self-modifying code. A static_branch_unlikely() directive in the source is compiled to a no-op instruction of the appropriate size to hold a jump instruction instead (five bytes on x86_64). At run time, when the kernel wants to enable the branch, it overwrites the no-op instruction with a jump instruction to the code for the branch. The same technique doesn't work for BPF for several reasons. For one, the existing static-key support is, as the name would suggest, static. It is configured at build time, which won't work for BPF programs, since they aren't linked into the kernel. For another, there are security concerns with allowing self-modifying code, meaning that the modification should not be implemented in BPF itself.
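
For comparison, this is roughly what the existing in-kernel mechanism looks like from C; the macros are the real static-key API, while the surrounding functions and the key name are hypothetical:

    #include <linux/jump_label.h>
    #include <linux/printk.h>

    static DEFINE_STATIC_KEY_FALSE(my_debug_key);

    void my_hot_path(void)
    {
        /* Compiled to a no-op wide enough for a jump; patched at run time. */
        if (static_branch_unlikely(&my_debug_key))
            pr_debug("debug path enabled\n");
    }

    void my_enable_debugging(void)
    {
        static_branch_enable(&my_debug_key);    /* rewrites the no-op to a jump */
    }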

Protopopov's proposal sidesteps both of those issues. He would like to add two new BPF instructions: goto_or_nop and nop_or_goto, depending on whether the chosen branch is enabled or disabled by default. New instructions are needed because the verifier will need to be taught to still consider both branches, even though the instructions themselves choose one branch or the other unconditionally. Then, the places where the new instructions occur will be associated with some static key. Since BPF represents many things with maps, Protopopov suggested using a new map type. When a map of that type is updated, the relevant instructions will be patched to the other variant.

There are a few complications with this approach; one is the fact that the just-in-time (JIT) compiler doesn't know in advance how long the code it generates will be, so the offset that needs to be patched can't be calculated ahead of time. Protopopov described his plans for how this would be reflected in the new map type. Before the program is loaded, the map would hold BPF instruction offsets. On load, the map becomes read-only to user space, and the locations in the map are updated as relocations and JIT compilation are performed. The verifier will need some way to access the map as well, to check that the modifiable locations only point to the new instructions.

Protopopov finished by describing how user space could make a bpf() syscall to update the state of a static key. David Vernet remarked that the mechanism sounded like it would be useful for sched_ext, especially if the ability to update a static key were made available as a kfunc, noting that other schedulers use static keys a lot. Protopopov replied that the patching code has to take a mutex, so not all code could necessarily use it. Alexei Starovoitov asked about how user space would indicate which static key it was interested in. Protopopov indicated that the instruction map could hold multiple separate keys, so user space would just indicate which one it wanted to update.

Joe Stringer asked whether, if a BPF program were running while the static key was being updated, the code could see it as being in an inconsistent state. Protopopov said that could happen, but that this didn't present too much of a problem for use cases like debugging. For other uses, the programmer will have to be careful not to rely on the value of a switch remaining the same for an entire BPF function.

The discussion then turned to how to make the map available to the verifier. The verifier already receives several file descriptors with information it needs, so it would seem simple to add another. Unfortunately, the data passed to it is a bit messy and makes this more complicated than it perhaps should be, Protopopov explained.

A member of the audience questioned whether the proposal really needed new instructions, noting that the instruction encoding is getting crowded and that the verifier could tell which instructions were special by reading the map. Protopopov pointed out that this would make disassembly pretty confusing, since in that case any goto or nop instruction could potentially be a hidden branch. Starovoitov said that two more instructions would not be a problem for the moment. Another member of the audience asked why two new instructions were necessary; Protopopov said that they were for code that was patched in or patched out by default, respectively.

A security-minded participant said that they worked on a hypervisor that tests static-key updates in the kernel, to check that the values being written make sense. As part of that process, it hooks into the existing kernel mechanism, suggesting that there is already plenty of information available. They asked why BPF needs a separate mechanism to store locations to be patched. Protopopov replied that the existing kernel support was "too static", since it only applies to code that is actually linked into the kernel.

Jump tables

Sometimes, one part of a program can transfer control flow to many other parts. Switch statements are possibly the most common example. Small switch statements are usually compiled to normal conditional branches, but larger switch statements are sometimes compiled to a table of code pointers. The code uses the value being switched on as an index into the table, and then jumps to the resulting location. This technique — called a jump table — can be much faster and more compact, although it is also harder on the branch predictor. BPF doesn't support jump tables, however, because it lacks generalized indirect branches.
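
In C terms, the difference is roughly the one between the two dispatch routines below (the handler functions are invented for illustration); a compiler may lower the first into something resembling the second:

    static int handle_read(void)  { return 0; }
    static int handle_write(void) { return 1; }
    static int handle_open(void)  { return 2; }
    static int handle_close(void) { return 3; }

    /* A switch that a compiler may compile to a jump table. */
    int dispatch_switch(unsigned int op)
    {
        switch (op) {
        case 0: return handle_read();
        case 1: return handle_write();
        case 2: return handle_open();
        case 3: return handle_close();
        default: return -1;
        }
    }

    /* Roughly what the jump-table form amounts to: a single bounds check
     * followed by an indirect transfer through a table. */
    static int (*const handlers[])(void) = {
        handle_read, handle_write, handle_open, handle_close,
    };

    int dispatch_table(unsigned int op)
    {
        if (op >= sizeof(handlers) / sizeof(handlers[0]))
            return -1;
        return handlers[op]();
    }

It is the second, table-based form that BPF currently has no way to express safely, since the verifier cannot reason about an indirect jump through an arbitrary register.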

Since Protopopov's proposal would add a new type of map with all of the infrastructure for tracking where BPF instructions end up in memory, it could provide a necessary stepping stone toward supporting jump tables in BPF. One possible design would be to add a "goto register" instruction, where the verifier ensures that the register value is loaded from a map of the right type.

Currently, BPF programs that need something like a dispatch table must emulate it with a long chain of if statements. This is inefficient, since many conditions need to be tested to find the right alternative. It also presents a problem for the verifier, which restricts the number of branches that it is willing to consider in order to avoid spending an excessive amount of time on verification. Jump tables would make that limit much less restrictive, because an indirect jump through a table would count as only a single branch.

Jump tables are frequently used to implement switch statements, especially in bytecode interpreters, Protopopov said. Although he has a plan for the BPF side, he's not sure how complicated implementing support for this style of jump table in GCC or LLVM will be. Starovoitov said that he doesn't expect the change to be difficult, since LLVM already represents switch statements in pretty much this way. All that will be needed is having LLVM create a map for all the known targets of a switch statement. Protopopov asked whether adding BTF debugging information to it would be difficult; Starovoitov didn't think that sounded difficult either.

Support for jump tables was one of the features Starovoitov had called out in his earlier session as necessary to the continued growth of BPF. While there were some questions about the design of Protopopov's proposal, it seems likely that something like this will be implemented for BPF.

Comments (none posted)

BPF tracing performance

By Daroc Alden
June 18, 2024

LSFMM+BPF

On the final day of the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, the BPF track opened with a series of sessions on improving the performance and flexibility of probes and other performance-monitoring tools, in the kernel and in user space. Jiri Olsa led two sessions about different aspects of probes: making the API for BPF programs attached to a probe more flexible, and making user-space probes more efficient.

Olsa introduced an improvement to kprobes; he posted a new way to attach a single program to the entry and exit hooks for a function. This is already technically possible, but it's a pain to work with, because the BPF program has no way to match entries and exits up with one another. Olsa's new API, called kprobe_multi, will give the BPF program a cookie to match calls to the entry and exit hooks with each other, as well as allowing the entry hook to request that the exit hook be skipped if the event is not something the BPF program is interested in.
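
A rough sketch of what such a combined entry/exit program can look like is shown below; the section and kfunc names (kprobe.session, bpf_session_cookie(), and bpf_session_is_return()) follow the session support in recent kernels and libbpf and should be treated as assumptions here rather than details given in the talk:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    /* Assumed kfuncs from the kprobe-session support. */
    extern __u64 *bpf_session_cookie(void) __ksym;
    extern bool bpf_session_is_return(void) __ksym;

    /* The traced function is just an example target. */
    SEC("kprobe.session/do_sys_openat2")
    int handle_open(struct pt_regs *ctx)
    {
        __u64 *cookie = bpf_session_cookie();

        if (!cookie)
            return 0;

        if (!bpf_session_is_return()) {
            /* Entry: stash a timestamp in the per-call cookie. Returning
             * nonzero here would ask the kernel to skip the exit hook. */
            *cookie = bpf_ktime_get_ns();
            return 0;
        }

        /* Exit: the cookie matches this return with its entry. */
        bpf_printk("openat2 took %llu ns", bpf_ktime_get_ns() - *cookie);
        return 0;
    }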

The API is already merged into the kernel and supported for kprobes and fprobes. User-space probes are a harder problem, but Olsa is working on getting a version of his new API for those upstream as well. There are also some improvements he is planning for the kernel side; he would like to reimplement fprobes on top of the function-graph (fgraph) mechanism. Since kprobe_multi is built on top of fprobe, this would also change how that API is implemented internally.

The reason to refactor the design like this is to consolidate the different mechanisms in the kernel that trace returns from function calls, Olsa said. Currently, there are essentially two parallel implementations: fprobe and fgraph. The fgraph code maintains a shadow stack for each kernel task that it uses to keep track of up to 16 trace functions. The shadow stack serves as an alternative to the rethook mechanism that fprobes currently use. It would require one extra page per process, allocated when the process is first traced, but it would mean kprobe_multi programs, fprobes, and fgraph code could all share the same mechanism.

One audience member asked whether the limit of 16 simultaneous traced functions in fgraph meant that there would only be up to 16 kprobe_multi programs. Olsa clarified that no, the fprobe instances for multiple kprobe_multi programs could share an fgraph slot. One of the changes he has made is to introduce a hash table that is used to make that mapping many-to-one.

Another person questioned whether implementing kprobe_multi using fprobes, which are themselves implemented with fgraph, which is in turn built on ftrace, was really necessary. Olsa explained that there had been a lot of grumbling about having two separate tracing implementations in the kernel, and that consolidating them actually made things simpler. The audience member then asked whether having so many layers added too much overhead from indirect calls. Another person suggested doing what tracepoints already do to solve excessive indirection — patch the traced code to use a static call when there is only one relevant tracepoint.

Overall, there were some doubts about whether the refactoring Olsa proposed was actually warranted, but not much doubt that the kprobe_multi interface was a useful addition.

Faster uprobes

Olsa also led the next session, this time focused on making user-space probes faster. To set a uprobe, the user specifies an inode and an offset. When the kernel loads that file into memory (or has it already loaded), it patches the given offset with a breakpoint instruction that triggers the probe. The exact details, however, are specific to a given architecture, so Olsa focused only on x86.

After executing the BPF program attached to the probe, the kernel then needs to execute the original instruction that was replaced. Some instructions can be emulated in the kernel, such as those that move values between registers, perform arithmetic, and so on. Other instructions need to be executed in their original context, so the kernel will restore the original instruction, set the CPU's debug flag to single-step instructions, return to the user-space program, and then put the breakpoint instruction back and disable the debug flag.

Some probes are handled a bit differently. Return probes use a return trampoline: the return address on the stack is overwritten with the address of a trampoline that executes a breakpoint instruction and then returns to the original address. Both approaches add a certain amount of overhead to uprobes, overhead that Olsa would like to minimize.

One audience member asked whether these techniques were compatible with user-space applications that do non-standard things like perform their own tracing using a shadow stack. Olsa answered that this would cause problems. For example, programs written in Go have a shadow stack that would notice return probes being inserted and crash the application.

Olsa then explained the benchmarks that he was using to measure the overhead of uprobes. He focused on three different instructions, since those are all common cases that use slightly different implementations: a nop instruction, a push instruction, and a ret instruction. For each one, he set a uprobe on the instruction, then measured the time it took to execute the uprobe. To speed return probes up, Olsa would like to replace the breakpoint instruction with a new uretprobe() system call. On x86, system calls are generally faster than triggering a breakpoint — about a 31% speedup on Intel CPUs and a 10% speedup on AMD CPUs in this case, he noted. One audience member wondered whether the difference was because of the speculative-execution side-channel mitigations required by each platform, since Intel needs mitigations for both Spectre and Meltdown, but AMD is only affected by Spectre.

Other probes will require a different approach. Unlike the return-probe work, which has been implemented and measured, Olsa only has prototypes and ideas for the other probes. One idea is to replace the use of a breakpoint instruction with a jump to a trampoline that makes a system call. If it works, it should provide a roughly equivalent speedup. Unfortunately, there are a lot of complications with the idea.

The shortest usable jump instruction on x86 is five bytes, compared to one byte for the breakpoint instruction, meaning that patching in the jump instruction could actually overwrite multiple instructions. Additionally, five-byte updates aren't atomic; if another thread were in the middle of executing the patched code while the kernel was editing it to deal with instructions that require single-stepping, there could be unexpected application crashes. Finally, the five-byte jump instruction only takes a 32-bit relative offset — so using that approach would potentially require multiple trampolines placed throughout a process's address space.

Even with so many potential problems, there is a strong motivation to solve them. Olsa noted that if you "ignore those issues and just try", the result can be a 250% speedup. With that context, he wanted to discuss whether anyone had ideas about how to make the approach work. He opened the discussion by suggesting a hybrid approach: write a breakpoint opcode, write the jump offset, and then write the jmp opcode. An audience member thought that may not be safe, pointing out that writing a breakpoint opcode is the only way of editing code while it is being executed that Intel claims is actually supported. He also noted another problem with using a jump instruction: since it is more than one byte, it's entirely possible that code could jump into the middle of it.

Andrii Nakryiko noted that if they had to deal with synchronization problems anyway, it could make sense to use a longer jump instruction so that multiple trampolines would not be needed. Shung-Hsi Yu asked whether changing the size of the no-op instructions used would require changing the headers used by the user-space tracing infrastructure. Olsa said that it might, but that something needed to change for performance to improve.

Comments (none posted)

Capturing stack traces asynchronously with BPF

By Daroc Alden
June 19, 2024

LSFMM+BPF

Andrii Nakryiko led a session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit giving a look into the APIs for capturing stack traces using BPF, and how the APIs could be made more useful. BPF programs can capture the current stack trace of a running process, including the portion in the kernel during execution of a system call, which can be useful for diagnosing performance problems, among other things. But there are substantial problems with the existing API.

The existing way to get stack traces in BPF is to create a BPF map to hold the captured traces, and then to call bpf_get_stackid() from the BPF side, which returns a unique ID identifying the captured trace within the map, Nakryiko explained. Then, in user space, the program can do a normal map lookup to retrieve the stack trace. The kernel also captures and stores some related information, such as the ELF build ID and the file offset, which helps identify what program the stack trace corresponds to for offline analysis. This API sounds fairly simple, but unfortunately it has a few quirks, he said.
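
Before getting to the quirks, here is a minimal sketch of that existing interface; the map size and attach point are arbitrary choices for illustration:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    #define MAX_STACK_DEPTH 127    /* PERF_MAX_STACK_DEPTH */

    struct {
        __uint(type, BPF_MAP_TYPE_STACK_TRACE);
        __uint(max_entries, 16384);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, MAX_STACK_DEPTH * sizeof(__u64));
    } stack_traces SEC(".maps");

    SEC("kprobe/submit_bio")
    int capture(struct pt_regs *ctx)
    {
        /* Returns an ID identifying the captured trace within the map;
         * the flags select a user-space or a kernel stack, not both. */
        long id = bpf_get_stackid(ctx, &stack_traces, BPF_F_USER_STACK);

        if (id >= 0)
            bpf_printk("captured stack id %ld", id);
        return 0;
    }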

It can capture either a user-space stack trace or a kernel stack trace, but not both. The kernel supports capturing both, the BPF API just doesn't have a way to indicate that you want both, Nakryiko said. There is also no way to know how large the stack that was captured actually is; programmers just need to hope their stack-unwinding code stops at the correct point. These are both flaws, but they can't be too much of a problem, because nobody has complained.

The final quirk is that the kernel performs automatic stack deduplication. If the captured stack trace matches an existing one, the kernel returns the existing ID. This behavior sounds great in theory, since deduplication can save space, but the map used to store stack traces does not have any way to deal with hash collisions. Stack traces are hashed and placed into the corresponding slot in the map, but each slot can only hold one stack trace. Accordingly, hash collisions (where the hashes are the same, but the traces are not) are both frequent and unavoidable when capturing many stack traces. The API does let the programmer specify whether they wish to retain the old or new stack trace on collision, but that just leaves them with two bad options: lose data or corrupt data. The hash-based stack deduplication also makes clearing out entries from the map inherently racy.

Nakryiko's proposed new API fixes these problems by letting the BPF program handle memory management: it provides a buffer, into which the kernel captures the stack trace, and then the BPF program is free to share that with user space in whatever way makes most sense for the application. All but one use case at Meta has switched to the new API, he noted. There are still some potential improvements to be made, however. Right now, stacks are captured synchronously. This is a problem, because the API can be called from anywhere, including a context where page faults are not permitted.

If part of the program being inspected is paged out, that means the information stored on those pages could be left out of the capture. This only affects user-space stack traces, since the kernel always remains paged in. This is a particular problem for programs that use DWARF-based stack unwinding, since the DWARF debugging information is unlikely to be paged in when the capture is taken.
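
Assuming that the buffer-based interface in question resembles the existing bpf_get_stack() helper (which fills a caller-supplied buffer and returns the number of bytes captured), the newer style looks something like the following sketch; the ring buffer and structure layout are arbitrary illustrative choices:

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    char LICENSE[] SEC("license") = "GPL";

    #define MAX_DEPTH 64

    struct stack_event {
        __u32 pid;
        __s64 size;                 /* bytes of stack data actually captured */
        __u64 addrs[MAX_DEPTH];
    };

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } events SEC(".maps");

    SEC("kprobe/submit_bio")
    int capture(struct pt_regs *ctx)
    {
        struct stack_event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);

        if (!e)
            return 0;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        /* The program owns the buffer, so it decides how to share it. */
        e->size = bpf_get_stack(ctx, e->addrs, sizeof(e->addrs),
                                BPF_F_USER_STACK);
        bpf_ringbuf_submit(e, 0);
        return 0;
    }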

Nakryiko would like the new API to be asynchronous, so that it can wait for the necessary information (for user-space captures) to be paged in. That doesn't work for kernel stack traces, however, because the kernel can't be paused to wait the way that user space can be. On the other hand, capturing a user-space stack trace can be postponed until just before the kernel returns to user space without changing the information returned, since the process's user-space state is frozen until then. "Return to user space" is a nice context, he said, since the kernel can wait for memory to be paged in, etc.

All of these separate constraints come together in the proposed API design. Nakryiko proposes having a function that returns a unique ID for a stack trace. The ID acts like a reservation — it is stable, and can be recorded, passed to user space, etc. Once the ID is received, a stack trace is captured into the reserved buffer. Kernel traces are captured synchronously, but user-space traces are captured on return to user space. Looking the stack capture up in the corresponding map returns EAGAIN until the capture is ready. The kernel doesn't perform deduplication, making deleting elements work in a sensible way. One audience member asked whether this meant that there could be identical stack traces with different IDs, and Nakryiko confirmed that this was the case. Daniel Borkmann noted that if the operations were asynchronous, it would also be possible to grow the map itself, which is not true of the existing API.

Nakryiko did say that one part of the API was not yet thought out: how to let the user know when a stack trace is ready. The trivial solution, he said, was to not send a notification and force the users to poll. Slightly more complicated would be a map-wide epoll notification whenever any trace is ready. Alternatively, each slot in the map could have separate epoll notifications — but that would be a wasteful use of file descriptors. Finally, the most efficient approach would be to set up a BPF ring buffer into which IDs are put as they become ready, permitting efficient notification and consumption.

One audience member pointed out that Nakryiko had pretty much just described the mechanism behind io_uring, and suggested that might be an appropriate mechanism. Nakryiko wasn't too sure about whether io_uring would be suitable, but acknowledged the possibility. Another audience member asked whether they could put a growable ring buffer in a BPF arena. Nakryiko thought that would not be helpful, since there are already many ring buffers in the BPF subsystem, and reimplementing one in an arena wouldn't really help. He did note that if they switched to putting the whole stack trace into a ring buffer, they could potentially fit drop notifications in the ring buffer as well.

Yu pointed out that the API could use a callback instead of a notification: run another BPF program on completion, and let the user decide how to notify user space. This prompted an extended discussion about different mechanisms, and how they traded off between flexibility and simplicity. Nakryiko did say that he was against a mechanism that tried to move too much functionality out of the kernel, since the kernel already has good built-in support for stack traces. One example is how user-space return probes can corrupt the stack trace — the kernel has all the necessary information to fix that up, whereas a more flexible mechanism would leave it up to the user.

The discussion didn't reach a solid conclusion before the session time ended, but there are already many mechanisms in BPF and elsewhere in the kernel for notifying user space when an operation completes, so it seems unlikely for that to remain a sticking point of API design. BPF already has fairly good debugging support, given the tight integration with the BPF Type Format (BTF) and support for tracepoints in both the kernel and user space; it looks like some of the work in progress could help extend that support even more.

Comments (3 posted)

Improving control-flow integrity for Linux on RISC-V

June 13, 2024

This article was contributed by Carlos Bilbao


LSS NA

Redirecting execution flow is a common malware technique that can be used to compromise operating systems. To protect from such attacks, the chip makers of leading architectures like x86 and arm64 have implemented control-flow-integrity (CFI) extensions, though they need system software support to function. At the Linux Security Summit North America, RISC-V kernel developer Deepak Gupta described the CFI protections for that architecture and invited community input on the kernel support for them.

RISC-V is an instruction set architecture (ISA) that follows the philosophy of open-source hardware. Its attractiveness lies in its lack of ISA licensing fees and its extensibility, which allows designers to integrate custom instructions and features tailored to their specific needs. Notably, startups have chosen to adopt RISC-V, and the European Union has made substantial investments toward developing its own chips based on the architecture.

Gupta's talk was entitled "Control Flow Integrity on RISC-V Linux"; it is available as a YouTube video for those interested. He began by explaining that CFI techniques address vulnerabilities originating from memory-safety issues in C and C++ code bases; these vulnerabilities can be exploited using the usual suspects (memory attacks, such as stack buffer overflow) to divert the proper execution flow and redirect it somewhere else within the address space.

[Forward vs. backward]

An important thing to remember regarding control flow is the distinction between forward and backward directions of the function-call process. Forward control-flow integrity protects the flow into functions (the green arrow in the diagram). Indirect function calls, such as by calling a function via a pointer, present opportunities for attackers to alter the pointer value to jump to an address of their choosing, thus they need CFI protection. Backward control-flow integrity is concerned with returns from functions (the red arrow in the diagram); the return addresses stored on the stack or in a register can also be altered by attackers. Gupta discussed the protection options for both directions for the RISC-V architecture.
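
As a deliberately simplified illustration (not taken from the talk), the two edges correspond to the indirect call and the return in a C fragment like this one:

    /* Simplified illustration of the two control-flow edges. */
    int process(int (*handler)(int), int value)
    {
        /* Forward edge: an indirect call through a pointer; an attacker
         * who can corrupt 'handler' could redirect it anywhere. */
        int result = handler(value);

        /* Backward edge: the return uses an address saved at call time,
         * which stack corruption could overwrite. */
        return result;
    }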

Forward control flow

Previous efforts to protect forward control flow mainly involved keeping track of indirect calls and jumps using the CPU and/or system-software extensions. A few years ago, a popular approach involved validating indirect function pointers in some way to ensure that the flow remained as expected. This was done by checking the function prototype, including the type of the return value and all of the argument types. If the forward flow was hijacked, the only place an attacker could jump to was the start of a function in the kernel that matched the required prototype.

However, this approach had limited success and coarse granularity, since it left many kernel functions that met the requirements as valid targets. A more robust approach is to modify the functions themselves. In the case of the x86 architecture, this is done through indirect branch tracking (IBT), which introduced new instructions, endbr64 and endbr32, where indirect branches must land. With IBT, the CPU implements a state machine to track indirect jump and call instructions, marking valid target addresses. If a violation is detected, it triggers a Control Protection exception (#CP).

[Indirect branch]

Gupta introduced the Zicfilp CPU extension, designed to protect RISC-V systems against forward-control-flow threats. He had sent patches to support this feature earlier in the year. Zicfilp ensures that all indirect branches (shown in the diagram above) land on the new lpad instruction, which acts as a landing pad. lpad includes a label that must match a corresponding value stored into register t2 by the caller. Enforcing this matching label ensures control-flow integrity: if the CPU fails to find a match, it raises a new software-check exception, indicating a detected error. For example, consider the following snippet from Gupta's talk:

    lui t2, 0x1       # Set up the label value in t2
    jalr a5           # Indirect call through register a5
    (...)
    foo_lpad_loc:
        lpad 0x1      # Landing pad with label 0x1

In this code, the lui instruction loads a specific value (in this case, 0x1) into register t2, setting up the label value required to validate the integrity of the flow. The subsequent jalr instruction jumps to the target location indirectly, meaning that it obtains the target address at run time from a register (a5). The execution flow then reaches the designated landing pad at function foo_lpad_loc, where the lpad instruction with the embedded label (0x1) resides and serves as a reference point to ensure proper flow. In this example, the combination of the jalr instruction's indirect jump and the lpad instruction's label validation mechanism ensures the integrity of the forward flow.

Backward control flow

In the case of backward control flow, the most common attacks involve return-oriented programming (ROP). ROP attacks exploit backward indirect control transfers: because function return addresses are determined dynamically at run time, an attacker who can corrupt them can redirect execution to sequences of existing code fragments.

In RISC-V, as in arm64, the function return address is stored in a register (x1, according to the RISC-V calling convention), which protects that address from stack-based attacks. However, subsequent calls require pushing return addresses onto the stack. Such addresses need to be protected with something better than regular store operations, either in software, with the compiler adding additional checks (for example, LLVM's CFI and SafeStack features insert extra security checks into programs), or with hardware-based solutions. For the latter, x86 Control-flow Enforcement Technology (CET) relies on a shadow stack, and arm64, similarly, uses its guarded control stack (GCS).

As with the case of forward control flow, a #CP exception may be issued in scenarios where shadow stacks are enabled. The return addresses get pushed onto both the regular and the shadow stack on a function call. Then, when returning from the function call, both stacks are popped. If the addresses retrieved from the two stacks don't match, indicating potential corruption, a #CP exception is raised.
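
Conceptually, the check can be modeled with a few lines of toy (decidedly non-hardware) C: each call records the return address on both stacks, and each return compares the two copies before using the address:

    #include <stdio.h>
    #include <stdlib.h>

    /* A toy model, not real code, of how a shadow stack detects
     * return-address corruption. */
    static unsigned long regular_stack[64], shadow_stack[64];
    static int depth;

    static void model_call(unsigned long ret_addr)
    {
        regular_stack[depth] = ret_addr;
        shadow_stack[depth] = ret_addr;  /* only the CPU can write this on real hardware */
        depth++;
    }

    static unsigned long model_return(void)
    {
        depth--;
        if (regular_stack[depth] != shadow_stack[depth]) {
            fprintf(stderr, "control-protection fault\n");
            abort();
        }
        return regular_stack[depth];
    }

    int main(void)
    {
        model_call(0x1000);
        regular_stack[0] = 0xdead;   /* simulated stack-smashing attack */
        model_return();              /* mismatch triggers the fault */
        return 0;
    }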

This is the same approach followed by Zicfiss, which is used on RISC-V to protect backward control flow, Gupta said. With Zicfiss, the ISA is extended to include a shadow stack and instructions to safely manipulate the stack (sspush for pushing and sspopchk for popping and checking). The sspush instruction executes a store operation similar to existing store instructions, with the distinction that the base is implicitly the new shadow-stack pointer (ssp). There is also a new instruction, ssrdp, to read the ssp; its width is determined by the word length of the architecture. The virtual address stored in ssp needs a shadow-stack attribute (for more information on this, refer to RISC-V's Shadow Stack Memory Protection documentation). The shadow stack can be located anywhere in the address space and can even be discontiguous (this reduces the memory footprint when multiple threads share an address space). To ensure security, the CPU is informed that a region is a shadow stack through a new page-table bit added by the ISA extension, which ensures that the memory is writable only with the new specialized instructions.

Additional extensions

RISC-V flow control extensions also provide the new instruction ssamoswap for atomically swapping a value on the stack, designed for asynchronous operations within the kernel — similar to the RISC-V amoswap atomic-swap instruction but tailored to the shadow stack. Gupta mentioned that it is needed for signal handling in Linux, where the execution flow can be hijacked, and the stack needs to be switched.

Concurrent operation of multiple threads on the same shadow stack can be problematic and a source of security concerns; for example, an adversarial thread can reuse return addresses placed by the primary thread or inject return addresses into the shadow stack. To switch securely when the shadow stack is enabled, certain tokens are used to verify stack switching at run time: ssamoswap makes this possible by atomically storing these tokens, thereby mitigating any concurrency issues during validation (and raising software check exceptions if needed). The ssamoswap instruction is the RISC-V counterpart to x86's rstorssp and saveprevssp, as well as Arm v9's gcsss1 and gcsss2, providing analogous functionality for asynchronous shadow-stack operations.

Gupta recently submitted patches to test Zicfiss with straightforward kselftests. In these tests, a child process attempts to write to the memory location of a shadow stack (which should result in the process being killed with a SIGSEGV). The parent process then checks if the memory contents have been changed.

The patches for improving RISC-V kernel defenses against control-flow attacks are still in the works. It's exciting to see the community getting behind the effort to make open-source hardware more secure. Moreover, it's worth noting that the kernel work led by Gupta and others shares similarities and builds upon previous methods for control-flow integrity — a characteristic generally desirable in open-source development.

Comments (none posted)

How kernel CVE numbers are assigned

June 19, 2024

This article was contributed by Lee Jones

It has been four months since Greg Kroah-Hartman and MITRE announced that the Linux kernel project had become its own CVE Numbering Authority (CNA). Since then, the Linux CNA Team has developed workflows and mechanisms to help manage the various tasks associated with this challenge. There does, however, appear to be a lack of understanding among community members of the processes and rules the team has been working within. The principal aim of this article, written by a member of the Linux kernel CNA team, is to clarify how the team works and how kernel CVE numbers are assigned.

Some early CVE announcements raised questions both on the mailing lists and off. The Linux CNA Team has received messages of firm support, particularly from those dedicating significant time to Linux security. Other messages, largely received from distributors and teams that look after enterprise platforms and attempt to remain stable yet secure by taking the fewest changes possible, have reflected some concern. Some of the stronger points raised were about how the rise in the number of CVEs would increase workload and overwhelm security teams attempting to review them all. Others have suggested that consumers of CVEs at the distribution and enterprise level, particularly those charging for this service, should have been reviewing all stable commits for fixes to relevant security flaws all along. One independent, security-related maintainer was particularly taken aback that paid-for distributions were not reviewing additional stable fixes beyond those identified as CVE candidates as they should have been.

Whichever side of the fence contributors sit on, one thing is almost universally agreed upon: for a plethora of reasons, the old CVE process wasn't working well. LWN listed many of the major points in this recent article. An additional point that deserves attention is that many downstream maintainers (myself included to a point, although Android did have the additional safety net of regular merges from the long-term-support stable kernels) were content with the strategy of cherry-picking all relevant CVEs raised and calling it good in terms of ongoing security updates. This practice, of course, would lead to a false sense of security, since it misses hundreds of security-related fixes and ultimately results in less-secure kernels.

The new process is more exhaustive and aims to identify every commit that fixes a potential security issue. Some people have mentioned that they consider this strategy to be a little overzealous; however, since we started this endeavor back in February, it has resulted in only 863 allocations out of the 16,514 commits between v6.7.1 and v6.8.9. That's a mere 5% hit rate.

Negative opinions have been exacerbated by historical thoughts shared by Kroah-Hartman and others, and by a misunderstanding of the current literature. In an article about a 2019 Kernel Recipes talk, Kroah-Hartman is paraphrased as saying: "The next option, 'burn them down', could be brought about by requesting a CVE number for every patch applied to the kernel." In truth, the plan was never to create a flood of CVE numbers to overwhelm the current system such that it would eventually be abandoned, nor has that come close to becoming a reality. The Linux CNA Team is careful to keep CVE assignments to a minimum, only assigning numbers to flaws that pose a possible risk to security.

Unfortunately, some of the phrases in the documentation haven't helped much to quell these fears. For instance, this section is often quoted:

Because of this, the CVE assignment team is overly cautious and assign CVE numbers to any bugfix that they identify. This explains the seemingly large number of CVEs that are issued by the Linux kernel team.

This section is often misunderstood or taken too literally. The part concerning assigning CVE numbers to "any bugfix" should be expanded to say "any bugfix relevant to the kernel's security posture". For instance, a fix repairing a broken LED driver would never be sensibly considered for assignment. That said, prescribing the exact types of issues that are considered is a slippery slope. A recent attempt at doing so was submitted to security@kernel.org for pre-review and was promptly rejected by the group. However, a non-exhaustive list of the kinds of issues I look for includes buffer overflows, data corruption, crashes (BUGs, oopses, and panics, including those affected by panic_on_warn), use-after-free (UAF) bugs, lock-ups, double frees, denial of service (DoS), data leaks, and so on.

One question that crops up from time to time can be summarized as: "why are we so overzealous and why can't we only create CVEs for severe security fixes?"; the answer is that quality assessment is an impossible task. Since the kernel is infinitely configurable and adaptable, it's not possible to know all the ways in which it could be deployed and utilized. Evaluating potential vulnerabilities and associating generic levels of bug reachability, severity, and impact is infeasible. An unreachable vulnerability on one platform may be trivial to exploit on another. Even if a particular issue could be proven to be universally low-risk, it might still be used as a stepping stone in a more involved, chained attack. For all these reasons and more, we find the most sensible approach is to assume that "security bugs are just bugs" and assign any issue with possible security-related ramifications.

The Linux CNA Team does not take the process of allocating CVE numbers lightly and the process is not automated. Over the first full release (6.7.y), the process consisted of all three members of the team (Greg Kroah-Hartman, Sasha Levin, and myself) manually reviewing every single patch hitting stable and voting on each. If a commit obtained three positive votes, it was allocated a CVE number. Commits with two votes were subjected to a second review, followed by discussion.

The team members review candidates in various ways. One utilizes Mutt in exactly the same way as they would review mainline patch submissions, another is in the process of training a machine-learning (ML) model to identify hard-to-spot issues, and I prefer to use Git output piped through a helper tool that highlights telltale words and phrases for easy "yes" votes. "No" votes take more time to review. The current thinking is that, by using different tools and methods, our positive results would be more robust; at least that's the theory.

Once CVEs are created, they are submitted to the linux-cve-announce mailing list, where interested parties are able to review them at their leisure. The engineers at SUSE deserve a lot of credit here. They have been instrumental in highlighting allocations that ended up being duplicate CVEs raised by previous CNAs in a non-searchable way, or ones that did not merit a CVE assignment. Their input helped shape the way the team now conducts analysis. Anyone is free and even encouraged to review the linux-cve-announce list and respond to CVEs they consider invalid. If the team agrees with the evaluation, the CVE assignment will be promptly rejected. Since the start of this endeavor, 65 such instances have occurred.

Hopefully this helps to clear up some of the current misconceptions in terms of the methods used to review, identify, and process CVE candidates and allays some of those fears people have been communicating to me recently. Specifically, CVE numbers are not assigned in an automated manner, and they are only assigned to bugs that might reasonably be believed to have security implications. The team remains open to constructive feedback and genuine suggestions for improvements; it is committed to its responsibilities and exercises care and due diligence during all phases of the process. If you have any questions or suggestions for us, then you can use the contact details located in the kernel's Documentation/process/cve.rst file. We'd be happy to hear from you.

Comments (71 posted)

Page editor: Jonathan Corbet


Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds