
LWN.net Weekly Edition for August 22, 2024

Welcome to the LWN.net Weekly Edition for August 22, 2024

This edition contains the following feature content:

  • Python subinterpreters and free-threading: choosing among Python's mechanisms for parallel execution.
  • Custom string formatters in Python: PEP 750 proposes "tag strings" for building domain-specific languages.
  • FreeBSD considers Rust in the base system: another round of discussion about allowing Rust into FreeBSD's base.
  • Memory-management: tiered memory, huge pages, and EROFS: a look at recent memory-management work.
  • Per-call-site slab caches for heap-spraying protection: another layer of defense against heap-spraying attacks.
  • Modernizing openSUSE installation with Agama: a new installer intended to take over YaST's installation duties.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Python subinterpreters and free-threading

By Jake Edge
August 20, 2024

PyCon

At PyCon 2024 in Pittsburgh, Pennsylvania, Anthony Shaw looked at the various kinds of parallelism available to Python programs. There have been two major developments on the parallel-execution front over the last few years, with the effort to provide subinterpreters, each with its own global interpreter lock (GIL), along with the work to remove the GIL entirely. In the talk, he explored the two approaches to try to give attendees a sense of how to make the right choice for their applications.

Shaw began by pointing attendees to his talk notes, which have a list of prerequisites that should be met or "you need to leave immediately". That list included multiple talks to review, some of which have been covered here (Eric Snow on subinterpreters and Brandt Bucher on a Python JIT compiler), and one that was taking place at the same time as Shaw's talk "that you need to have watched", he said to laughter. He also recommended the chapter on parallelism from his CPython Internals book and his master's thesis "if you're really, really bored".

Parallel mechanisms

Parallelism comes down to the ability to segment a task into pieces that can be worked on at the same time, he said, but not all tasks can be. If it takes him an hour to solve a Rubik's Cube, he asked, how much time would it take four people to solve one Rubik's Cube? Since four (or even two) people cannot sensibly work on the same puzzle at once, adding more participants does not really help.

[Anthony Shaw]

There is another aspect, which he tried to demonstrate in a physical way. He got three volunteers to draw a picture of a dog with him, but before they could do so, he needed to provide them with the work to do (by balling up a piece of paper to throw to them). That is akin to the serialization of data that a computer might need to do (via, say, pickle) to parcel out a task to multiple cores. While he (as "core zero") is sending out that data, he is not drawing a dog, so he can only start drawing after that is complete. Meanwhile, the other "cores" have to unpack the pickle, draw the dog, repack the pickle and send it back. So, if it takes each core ten seconds to draw a dog, drawing four dogs in parallel takes a bit more than that because of the coordination required.

He said that there are four different types of parallel execution available in Python: threads, coroutines, multiprocessing, and subinterpreters. Three of them execute in parallel (or will with the GIL-removal work), which means they can run at the same time on separate cores, while coroutines execute concurrently, so they cooperatively share the same core. Each has a different amount of start-up time required, Shaw said, with coroutines having the smallest and multiprocessing the largest.

He showed a simple benchmark that created a NumPy array of 100,000 random integers from zero to 100, then looped over the array calculating the distance of each entry from 50 (i.e. abs(x-50)). Shaw said that "if I split that in half and executed between two threads in Python 3.12, it would take twice as long". That is mainly because of the GIL.
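
Shaw did not show the benchmark code itself; a minimal sketch along the lines he described (the array size and the abs(x - 50) calculation come from the talk, everything else is an assumption) might look like this:

    import threading

    import numpy as np

    # The array described in the talk: 100,000 random integers from zero to 100.
    data = np.random.randint(0, 101, size=100_000)

    def distance_from_50(chunk):
        # Loop in Python rather than vectorizing, so that the interpreter
        # (and therefore the GIL) is doing the work being measured.
        return [abs(int(x) - 50) for x in chunk]

    # Split the array in half and hand each half to a thread.  With the GIL,
    # only one thread can execute Python bytecode at a time, so this runs no
    # faster than a single thread and, with switching overhead, can be slower.
    threads = [threading.Thread(target=distance_from_50, args=(half,))
               for half in np.array_split(data, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()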

In Python 3.11, there was one GIL per Python process, which is why multiprocessing is used for "fully parallel code execution". In Python 3.12, subinterpreters got their own GIL, so there was a GIL per interpreter. That means that two things can be running at once in a single Python process by having multiple interpreters each running in its own thread. The operating system will schedule the execution on different cores to provide the parallelism between the interpreters.
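
There is no official high-level Python API for subinterpreters yet (PEP 734 proposes one); in Python 3.12 they can be reached from Python code through the provisional _xxsubinterpreters module, which was renamed for 3.13 and whose interface may still change. Assuming that module, a minimal sketch of the one-interpreter-per-thread arrangement might look like:

    import threading
    import _xxsubinterpreters as interpreters  # provisional, internal module in 3.12

    # Each subinterpreter has its own GIL (PEP 684), so code running in
    # different interpreters can execute on separate cores at the same time.
    CODE = "total = sum(i * i for i in range(1_000_000))"

    def run_in_own_interpreter():
        interp = interpreters.create()
        try:
            interpreters.run_string(interp, CODE)
        finally:
            interpreters.destroy(interp)

    threads = [threading.Thread(target=run_in_own_interpreter) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()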

He presented some numbers to show, generally, that the workload size has a major effect on the speed of operations. When using threads (with the GIL), calculating pi to 200 digits takes 0.04s, which is roughly the same as with subinterpreters (0.05s). It turns out that calculating pi to that precision is quite fast, so what is being measured is the overhead in starting up threads versus subinterpreters. When 2000 digits are calculated, the thread (plus GIL) test goes to 2.37s, while subinterpreters take only 0.63s. Meanwhile, multiprocessing has much higher overhead (0.3s for 200 digits) but is closer to subinterpreters for 2000 (0.89s). The quick conclusion that can be drawn is that "subinterpreters are really like multiprocessing with less overhead and shared memory", he said.

Moving on to the free-threading interpreter (which is the preferred term for the no-GIL version), he noted that Python 3.13 beta 1, which had been released in early May, a week before the talk, allows the GIL to be disabled. His earlier example of splitting a task up into multiple threads—which ran slower because of the GIL—will now run faster. Meanwhile, Python 3.13 has a bunch of optimizations to make single-threaded code run faster.

The GIL exists to provide thread-safety, which is important to keep in mind. The free-threading version introduces a new ABI for Python, so you need a special build of the CPython interpreter in order to use it; in addition, extension modules that use the C API will need to be built for the free-threaded CPython. Those are important things to consider because they affect compatibility with the existing Python ecosystem.
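
For those who want to experiment with the betas, the free-threaded build installs as a separate interpreter (conventionally named python3.13t), and the GIL can be turned back on at run time with the PYTHON_GIL environment variable or the -X gil option. A small sketch for checking what a given interpreter is doing (sys._is_gil_enabled() exists only in 3.13, so the code guards for it) might be:

    import sys
    import sysconfig

    # Py_GIL_DISABLED is 1 in the build configuration of a free-threaded build.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # Even on a free-threaded build, the GIL can be re-enabled at run time
    # (PYTHON_GIL=1 or -X gil=1); sys._is_gil_enabled() reports the actual state.
    gil_enabled = sys._is_gil_enabled() if hasattr(sys, "_is_gil_enabled") else True

    print(f"free-threaded build: {free_threaded_build}, GIL enabled: {gil_enabled}")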

The steps needed to remove the GIL "are really quite straightforward and I'm not sure why it's taken so long to do so", he said wryly—and to laughter. Step one is to remove the GIL; step two is to change the memory allocator and the garbage collector, which is still a work-in-progress. The final step is to fix the C code that assumes that the GIL exists, which means that it is essentially not thread-safe. That final step "is probably going to drag on for a while".

The compatibility with the C-based extensions is the problem area, both for free-threading and subinterpreters, but also for any other attempt at a fully parallel Python over the years. From what Shaw has seen, pure Python code works fine with either mechanism, "because the core Python team have got sort of full control over what the interpreter is doing". The irony is that one of the main reasons for the existence of all of the C extensions is to optimize Python programs because they could not be run in parallel.

Faster?

He reused the benchmark of pi to 2000 digits on the newly released interpreter, both with and without the GIL. As would be expected, removing the GIL reduced the threaded numbers; on a four-core system, they went from 2.41s with the GIL to 0.75s without. The subinterpreters and multiprocessing versions were slower, however. With the GIL, subinterpreters took 0.76s and multiprocessing took 1.2s; without it, subinterpreters took 0.99s and multiprocessing 1.57s. The difference is that regular Python code runs slower without the GIL, Shaw said; there are a number of specializations and optimizations that are available when the GIL is enabled, but are not yet thread-safe, so they are disabled when the GIL is turned off. He expects the performance of the threaded and subinterpreter versions to "actually be really close in the future".

"Hands up if your production workload includes calculating the number of digits in pi", he said with a chuckle. Benchmarks like the one he showed are not representative of real Python code because they are CPU bound. The performance of web and data-science applications has a major I/O component, so the improvements are not as clear-cut as they are in calculating pi. He thinks a "more realistic example is to look at the way that web applications are architected in Python".

For example, web applications that use Gunicorn typically run in a multi-worker, multithreaded mode, which is managed by Gunicorn. There are multiple Python processes, each of which has its own GIL, of course; typically there is one process per core on the system. Meanwhile, there are generally at least two threads per process, which actually helps because the GIL is dropped for many I/O operations, including reading and writing from network sockets. "So it does actually make sense to have multiple threads, because they can be running at the same time, even though the GIL is there".
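
As a rough illustration of that architecture (the values below are assumptions for the example, not recommendations), a Gunicorn configuration file along these lines gives one worker process per core with a couple of threads in each:

    # gunicorn.conf.py -- illustrative only
    import multiprocessing

    # One worker process per core; each process has its own interpreter and GIL.
    workers = multiprocessing.cpu_count()

    # A couple of threads per worker: the GIL is released during socket I/O,
    # so threads in the same process can make progress at the same time.
    threads = 2

    bind = "0.0.0.0:8000"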

He wondered if it made sense to shift that model to have a single process, but with multiple interpreters, each of which had multiple threads. He showed a moderately complicated demo that did just that, though it was somewhat difficult to follow. Interested readers can watch the YouTube video of the talk to see the demo in action. Links to the code can be found in the talk notes.

The upshot of the demo is that the new features make multiple different architectures possible. One question that often comes up is whether it makes sense to switch to a single process with a single interpreter and multiple threads now that a free-threading interpreter is available. The answer is that, as with most architectural questions, it depends on the problem being solved. In addition, "code doesn't use threading out of the box", so the application code will need to be tailored to fit.

Another question that comes up often in light of these changes is which to choose, subinterpreters or free-threading. It is not one or the other, Shaw said, as the two can coexist; each has its own strengths. There are applications or parts of them that can benefit from the isolation provided by subinterpreters, while CPU-intensive workloads can be parallelized using threads.

Something to note is that asyncio does not use threads, he said. That is something that is often misunderstood, but the coroutines that are used by asyncio are not assigned to threads, so the GIL is not interfering with them. He talked with someone who tried benchmarking an asyncio application with the GIL disabled; they were surprised there was no change, but should not have been.
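
That is easy to demonstrate: every coroutine runs in whichever thread is running the event loop, so disabling the GIL changes nothing for such code. A small sketch:

    import asyncio
    import threading

    async def worker(name):
        # Every coroutine reports the same thread: the one running the event loop.
        print(name, "running in", threading.current_thread().name)
        await asyncio.sleep(0.1)

    async def main():
        # These run concurrently, but all on one thread, so removing the GIL
        # does not speed this code up at all.
        await asyncio.gather(*(worker(f"task-{i}") for i in range(3)))

    asyncio.run(main())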

In conclusion, Shaw said, he thinks that "subinterpreters are a better solution for multiprocessing" using long-running workers; "I think for a web app, this is a perfect solution." Free-threading will be a better choice for CPU-heavy tasks that can be easily segmented and only need to do a "quick data exchange between them".

But he is concerned that the Python ecosystem has been implicitly relying on the GIL for thread safety. "Removing it is going to break things in unpredictable ways for some time to come." Python cannot keep putting off that change, though, so he thinks it needs to be done. His conclusions came with a lengthy list of "terms and conditions" (which can be seen in the talk notes as well) that enumerated some of the remaining challenges for both subinterpreters and free-threading.

Questions

One attendee asked about the datetime module, which Shaw had listed as not being thread-safe in the list of terms and conditions; because of that, the Django web framework cannot be used with subinterpreters. Shaw pointed the attendee at the link from the talk notes to a GitHub issue (which was worked on and fixed during the sprints the week following PyCon). In answer to another question, he said that his list would be changing and that many of the problem areas would likely be fixed before Python 3.13 is released in October.

Another attendee noted that the vast majority of C extensions simply make synchronous calls into a C library; they do not use any Python objects, reference counting, or any of the other potential problem areas. He wondered if there had been any thought about migration tools to help the authors of those types of extensions. Shaw said that there has been discussion of that and it is an area that needs work, though people are aware of it; "the GIL only got disabled like last week". He suggested the sprints would be a good place to work on that and that the CPython team would be appreciative; that was met with somewhat scattered applause, likely from members of said team.

Will there always be two Python versions, with and without the GIL, an attendee asked. Shaw suggested that the person standing behind them in line at the microphone would be better able to address that. Guido van Rossum, creator of Python who now works on the Faster CPython project, said that he could not speak for the steering council but that his understanding is that free-threading was accepted experimentally; if it gets fully accepted, "eventually that's all you get". In his question, Van Rossum wondered if the optimal model for using asyncio and subinterpreters is to have a single thread running the asyncio event loop in each subinterpreter and to have one subinterpreter per core; Shaw agreed and noted that it was part of what he was showing in his demo.

The last question that was asked had to do with data transfer between subinterpreters; what are the plans there? Shaw said that "the goal is to provide interfaces to share information between subinterpreters without pickling", which is "extremely slow". You can share immutable types right now, as well as share memory using memory views. If you need continuous data exchange, though, threading pools may be a better approach.

[I would like to thank the Linux Foundation, LWN's travel sponsor, for travel assistance to Pittsburgh for PyCon.]

Custom string formatters in Python

By Daroc Alden
August 16, 2024

Python has had formatted string literals (f-strings), a syntactic shorthand for building strings, since 2015. Recently, Jim Baker, Guido van Rossum, and Paul Everitt have proposed PEP 750 ("Tag Strings For Writing Domain-Specific Languages") which would generalize and expand that mechanism to provide Python library writers with additional flexibility. Reactions to the proposed change were somewhat positive, although there was a good deal of discussion of (and opposition to) the PEP's inclusion of lazy evaluation of template parameters.

The proposal

In Python (since version 3.6), programmers can write f-strings to easily interpolate values into strings:

    name = "world"
    print(f"Hello, {name}") # Prints "Hello, world"

This is an improvement on the previous methods for string interpolation, because it makes it clear exactly what is being inserted into the string in each location. F-strings do still have some drawbacks, though. In particular, since the expressions inside braces are evaluated when the string is evaluated, they're not suitable for more complex templating. They also make it easier to write some kinds of security bugs — it's tempting to use them to make SQL queries, even though doing so can make code more susceptible to SQL-injection attacks. The PEP aims to fix both of these, by allowing people to use arbitrary functions as string "tags", taking the place of the "f" in f-strings. For example, it would be possible to write a safe sql() function that could be invoked like this:

    name = "O'Henry"
    # Calls sql().
    # The function inserts 'name' into the query, properly escaped.
    query = sql"SELECT * FROM Users WHERE name = {name}"
    print(query)
    # Prints "SELECT * FROM Users WHERE name = 'O''Henry'".

Other examples of potential uses include automatically escaping strings in the correct way for other formats (shell commands, URLs, regular expressions, etc.), building custom DSLs that can be embedded in Python programs, or partially replacing the use of templating libraries like Jinja.

The proposed syntax works by calling any function used as a string tag with a sequence of arguments representing fragments of the string and interpolation sites. These are values that implement new Decoded and Interpolation protocols for string components and interpolations, respectively. In the example above, sql() would be called with two arguments: one for the first part of the string, and then a second for the interpolation itself. The function is then free to interpret these values in whatever way it likes, and return an arbitrary object (such as, for example, a compiled regular expression). In particular, the expressions inside braces in the affected string aren't evaluated ahead of time, but are instead evaluated when the function calls the .getvalue() method of an Interpolation object — which the function only needs to call if it needs the value. Interpolation objects also include the original expression as a string, and the optional conversion function or format specification if they were provided.

    def example(*args):
        # This example assumes it will receive a tag string
        # that starts with a string and contains exactly one
        # interpolation
        string_part = args[0] # "Value: "
        interpolation = args[1] # 2 + 3

        # It can reference the original text, and ask for the value
        # to be computed.
        return f"{string_part}{interpolation.expr} = {interpolation.getvalue()}"

    print(example"Value: {2 + 3}")
    # Prints "Value: 2 + 3 = 5"

This does, however, lead to some surprising outcomes, such as making it possible to write code that depends on the assignment of a variable after the place where the tag string is defined. This example shows how that could work, as well as demonstrating the PEP's recommended method to deal with a sequence of Decoded and Interpolation values with a match statement:

    class Delayed:
        def __init__(self, *args):
            self.args = args
        def __str__(self):
            result = ""
            for a in self.args:
                match a:
                    case Decoded() as decoded:
                        result += decoded
                    case Interpolation() as interpolation:
                        result += str(interpolation.getvalue())
            return result

    name = 'Alice'
    fstring = f'My name is {name}' # Always Alice
    delayed = Delayed'My name is {name}'
    for name in ['Bob', 'Charlie']:
        print(delayed)
        # Prints 'My name is Bob' and 'My name is Charlie'

The PEP describes this behavior as lazy evaluation — although unlike true lazy evaluation in languages like Haskell, the library author does need to explicitly ask for the value to be calculated. Despite the potentially unintuitive consequences, lazy evaluation is a feature that the PEP's authors definitely want to see included in the final version, because of the additional flexibility it allows. Specifically, since a tag function could call .getvalue() zero times, one time, or multiple times, a library author could come up with clever uses that aren't possible with an f-string.
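
As an illustration of that flexibility, here is a sketch of a logging-style tag written in the proposed (and still hypothetical) tag-string syntax, so it will not run on any released Python; it never calls .getvalue() when the message would be discarded, which means the interpolated expressions are never evaluated:

    DEBUG_ENABLED = False

    def debug(*args):
        # Only pay for the interpolations if the message will be emitted;
        # Decoded and Interpolation are the protocols proposed by the PEP.
        if not DEBUG_ENABLED:
            return
        message = ""
        for a in args:
            match a:
                case Decoded() as decoded:
                    message += decoded
                case Interpolation() as interpolation:
                    message += str(interpolation.getvalue())
        print(message)

    def expensive_summary():
        ...  # imagine something costly here

    # With DEBUG_ENABLED false, expensive_summary() is never called, because
    # debug() returns before ever calling .getvalue() on the interpolation.
    debug"state: {expensive_summary()}"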

Discussing naming

The discussion of the change started with Steve Dower expressing concern that this might cause people to fill up the namespace with short function names, which would be a problem for readability. Dower suggested that perhaps the syntax should be changed to have a single generic string tag that converts a format string into a piece of structured data that could be operated on by a normal function. "It's not quite as syntactic sugar, but it's more extensible and safer to use."

Everitt clarified that the PEP does not propose adding any tag functions to the built-ins or standard library, but acknowledged the point. Baker pointed out that there was no syntactical reason that tag names could not contain dots, which would help with the namespace-pollution problem, though that might be undesirable for other reasons.

If dotted names are going to be allowed, "we should just allow any expression", Brandt Bucher said, noting that the same evolution has already occurred for decorators. Unfortunately, that's not possible, Pablo Galindo Salgado explained, because the Python lexer would have no way to know whether it should switch into the f-string parsing mode (which would be used for all tag strings) without information from the parser.

The current syntax does invite some appealing syntactic sugar, however. Paul Moore gave this example, of a "decimal literal" (for the arbitrary-precision arithmetic decimal library), a frequently requested feature:

    from decimal import Decimal as D
    dec_num = D"2.71828"

Josh Cannon pointed out that the syntax means Python will never get another string prefix, implying that it would make any such changes backward-incompatible. Matt Wozniski agreed, and stated the drawback explicitly. Wozniski later said that Python's existing string prefixes cannot all be implemented in terms of the proposed tag strings.

Discussing semantics

Eric Traut was the first to raise concerns with the lazy evaluation of expressions being interpolated, pointing out that it could potentially lead to surprising and unintuitive consequences, such as the behavior of the Delayed class shown above. Traut suggested asking programmers to explicitly write lambda: when they want lazy evaluation:

    tag"Evaluated immediately: {2 + 3}; Not evaluated immediately: {lambda: name}"

Charlie Marsh agreed with Traut, that introducing lazy evaluation would be unintuitive, given that it's a change from how f-strings work. Several people objected to the idea of requiring users to opt into lazy evaluation, though, especially since the suggestion of just using lambda expressions would be fairly verbose. Cornelius Krupp pointed out that the intended use case was to define domain-specific languages — so the expectation is that specific uses will vary, and it's up to callers of a particular tag function to read its documentation, just like with any library function. Moore went so far as to say:

I'll be honest, I haven't really gone through the proposal trying to come up with real world use cases, and as a result I don't yet have a feel for how much value lazy evaluation adds, but I am convinced that no-one will ever use lambda as "explicit lazy evaluation".

Another user asked how lazy evaluation interacts with other Python features such as await and yield — which surprisingly worked inside f-strings without a problem when I tried them. Nobody had a compelling answer, although Dower pointed out another particularly pernicious problem for lazy evaluation: how should it interact with context managers (such as locks)? Allowing unmarked lazy evaluation could make for difficult-to-debug problems. Finally, interactions with type checking are a concern, since lazy evaluation interacts with scoping.

David Lord, a Jinja maintainer, questioned the entire justification for lazy evaluation. Several times during the discussion, people asserted that lazy evaluation was necessary for implementing template languages like Jinja, but Lord didn't think tag strings were actually sufficient for templating. Jinja templates are often rendered multiple times with different values, which lazy evaluation doesn't make easy — the variables at the definition site of the template would need to be changed. Plus, Jinja already has to parse the limited subset of Python that can be used in its templates; if it has to keep parsing the templates in order to handle variable substitutions, lazy evaluation doesn't help with that. Moore agreed that the proposal would not be a suitable replacement for Jinja as it stands, and that he wished "the PEP had some better examples of actual use cases, though."

Everitt shared that the PEP authors chose to keep the more complex examples out of the PEP for simplicity, but "I think we chose wrong." He provided a link to a lengthy tutorial on building an HTML template system using the proposed functionality. Baker added that being able to start sending the start of a template before fully evaluating its latter sections was an important use case for web-server gateway interface (WSGI) or asynchronous-server gateway interface (ASGI) applications. He also said that delayed evaluation let logging libraries easily do the right thing by only taking the time to evaluate the interpolations if the message would actually be logged.

Discussing backticks

Trey Hunner saw how useful the change could be, but, as someone who teaches Python to new users, he was worried that the syntax would be hard to explain or search for.

If backticks or another symbol were used instead of quotes, I could imagine beginners searching for 'Python backtick syntax'. But the current syntax doesn't have a simple answer to 'what should I type into Google/DDG/etc. to look this up'.

The suggestion to use backticks was interesting, Everitt said, noting that JavaScript uses backticks for similar features. Baker said that they had considered backticks, but thought that matching the existing f-string syntax would be more straightforward. Simon Saint-André also said that backticks can be hard to type on non-English keyboards. Dower noted that backticks have traditionally been banned from Python feature proposals due to the difficulty in distinguishing ` from '.

Discussing history

Alyssa Coghlan shared a detailed comparison between this proposal and her work on PEP 501 ("General purpose template string literals"). In short, Coghlan is not convinced that tag strings offer significantly more flexibility than template string literals as proposed in PEP 501, but she does acknowledge that the PEP 750 variety has a more lightweight syntax. Other participants in the discussion were not convinced that simpler syntax was worth the tradeoff. Coghlan has started collecting ideas from the discussion that might apply equally well to PEP 501.

Barry Warsaw, who worked on early Python internationalization efforts, likewise compared the new PEP to existing practice. He pointed out a few ways that tag strings were unsuitable for internationalization, before saying:

I have to also admit to cringing at the syntax, and worry deeply about the cognitive load this will add to readability. I think it will also be difficult to reason about such code. I don't think the PEP addresses this sufficiently in the "How to Teach This" section. And I think history bears out my suspicion that you won't be able to ignore this new syntax until you need it, because if it's possible to do, a lot of code will find clever new ways to use it and the syntax will creep into a lot of code we all end up having to read.

Baker summarized the most important feedback into a set of possible changes to the proposal, including adopting some ideas from PEP 501, and using decorators to make it easier to write tag functions that typecheckers can understand. Coghlan approved of the changes, although at time of writing other commenters have not weighed in. It remains to be seen how the PEP will evolve from here.

FreeBSD considers Rust in the base system

By Joe Brockmeier
August 19, 2024

The FreeBSD Project is, for the second time this year, engaging in a long-running discussion about the possibility of including Rust in its base system. The sequel to the first discussion included some work by Alan Somers to show what it might look like to use Rust code in the base tree. Support for Rust code does not appear much closer to being included in FreeBSD's base system, but the conversation has been enlightening.

Base versus ports

Unlike Linux, the FreeBSD operating system kernel and user space are developed together as the base system, maintained in the FreeBSD source tree (often referred to as "src"). This means, for the purposes of discussing using Rust as a language for the FreeBSD kernel or other programs/utilities in the base system, the Rust toolchain would need to be present in base as well. Currently, the languages supported for FreeBSD's base system are assembly, C, C++, Lua, and shell scripts written for sh. In the distant past, Perl was also part of the base system, but was removed in 2002 prior to FreeBSD 5.0.

FreeBSD also has a ports collection for third-party software that is not maintained as part of FreeBSD itself. This includes everything from the Apache HTTP Server to Xwayland. Rust is already present in the ports system, as are many applications written in the Rust language. A search on FreshPorts, which lists new packages in the ports collection, turns up more than 500 packages in the ports system that are written in Rust.

A dream of Rust

But Rust is not allowed in the base system. Somers noted this fact with disappointment in a discussion about a commit to fusefs tests in January. Enji Cooper asked Somers why he had not used smart pointers in the change, to which Somers said "it's because I'm not real great with C++". He added that he had stopped trying to improve his C++ skills in 2016, and had focused on Rust instead:

Even when I wrote these tests in 2019, I strongly considered using Rust instead of C++. In the end, the only thing that forced me to use C++ is because I wanted them to live in the base system, rather than in ports.

Somers said that he dreamed about the day when Rust is allowed in the base system, and mentioned several projects he would have done differently if it were allowed. Warner Losh replied that it would require some visible success stories for the FreeBSD community to consider inclusion in base. Rust, he said, "has a lot of logistical issues since it isn't quite supported in llvm out of the box". (By default, Rust uses its own fork of LLVM.) He suggested adding test cases in base that could be run by installing Rust from ports prior to building "to show it's possible and to raise awareness of rust's viability and to shake out the inevitable growing pains that this will necessarily [cause]".

Brooks Davis also suggested bringing in Rust code that would use an external toolchain rather than having the toolchain in base. "There are a bunch of bootstrapping and reproducibility issues to consider, but I think the fixes mostly lie outside the src tree."

The case for (and against) Rust

On January 20, Somers made his case to the freebsd-hackers mailing list on the costs and benefits of including Rust code in FreeBSD's base system, which opened the first lengthy discussion around Rust. He summarized the cost as "it would double our build times", but the benefit would be that "some tools would become easier to write, or even become possible". Losh reiterated his suggestion to start with adding better tests in Rust. That would allow the project to get an external toolchain working to learn if Rust "actually fits in and if we can keep up the infrastructure" or not.

Dimitry Andric wrote that it might be possible to build Rust using FreeBSD's base version of LLVM, but he said that the discussion was going in the wrong direction. The upstream build systems for LLVM and Rust require too many dependencies. Trying to support such tools "is getting more and more complicated all the time". He wanted to know why the project was spending time trying to build toolchain components in the base system at all. He argued, instead, that the focus should be on removing toolchains from the base system. "Let new toolchain components for base live in ports, please." In other words, software written in Rust could live in base, but its toolchain would stay in ports. That sentiment was shared by a number of other participants in the discussion.

Alexander Leidinger asked what kind of impact using Rust would have on backward compatibility. Would it be possible to compile FreeBSD release x.0 in two years on version x.2, for example, as it is with C/C++ code in base? Somers said the short answer is yes. The longer answer, he said, was that the Rust language has editions, similar to C++ editions, that are released every three years. Compiler releases are capable of building the latest edition, and all previous editions:

So if we were to bring Rust code into the base system, we would probably want to settle on a single edition per stable branch. Then we would be able to compile it forever.

Some participants were not convinced that code written in Rust should be allowed in base, even if its toolchain lived outside base. Cy Schubert complained that the language was still evolving quickly. Poul-Henning Kamp said that the pro-Rust argument simply boiled down to "'all the cool kids do it'". He argued that FreeBSD developers should "quietly and gradually look more and more to C++ for our 'advanced needs'":

I also propose, that next time somebody advocates for importing some "all the cool kids are doing it language" or other, we refuse to even look at their proposal, until they have proven their skill in, and dedication to, the language, by faithfully reimplementing cvsup in it, and documented how and why it is a better language for that, than Modula-3 was.

Bertrand Petit agreed with Kamp, and said that adding Rust to base should be avoided at all costs. However, he suggested that if using Rust needs something in base "such as kernel interfaces, linking facilities, etc." to work properly in the ports system, it should be provided in base.

After the discussion had simmered a bit, Somers replied with answers to some of the questions about Rust in base. He said that the comparisons of Rust to Perl were missing the mark. The crucial difference is that Rust is suitable for systems programming, while the others were not. "Rust isn't quite as low-level as C, but it's in about the same position as C++." To Kamp's assertion that developers should just use C++, Somers said that he was far more productive in Rust and his code had fewer bugs, too. He said he had used C++ professionally for 11 years, but was more skilled in Rust after six months. The problem, he said, was C++. "In general, it feels like C++ has a cumbersome mix of low-level and high-level features."

Out of the blue

Ultimately, the discussion trailed off in early February without any concrete plan of adopting Rust. The thread was re-awakened by Shawn Webb on July 31. Webb replied to Somers' original email with the news that the Defense Advanced Research Projects Agency (DARPA) is investigating a program to automate rewriting C code to Rust, called Translating All C to Rust (TRACTOR).

Losh said that he was still waiting for someone to take him up on the suggestion to do build-system integration for Rust tests. "Since the Rust advocates can't get even this basic step done for review, it's going to be impossible to have Rust in the base." Webb replied that he would be willing to find time in September to work on build system integration, if Losh was willing to mentor him, which Losh agreed to do.

Konstantin Belousov replied that it would be better to focus on what useful things could be implemented in Rust, rather than how to integrate code into the build. That caused Kamp to interject with a history lesson about the failures of importing Perl into FreeBSD's base system.

The choice to bring Perl into the base system, he said, was based on arguments identical to those being made for Rust. The project overlooked the fact that Perl was more than a programming language; it was an ecosystem with a "rapidly exploding number of Perl Modules". The goals of rewriting things in Perl went unfulfilled, he said, once developers realized that FreeBSD base only offered Perl the language and not Perl the ecosystem:

Having Perl in the tree was a net loss, and a big loss, because it created a version gap between "real Perl" and "freebsd Perl", a gap which kept growing larger over time as [enthusiasm] for Perl in the tree evaporated.

Adding Rust to FreeBSD will be the /exact/ same thing!

That left two options, Kamp said. The first is a FreeBSD Rust only intended for use in base without benefit of the Rust ecosystem, the second would be to find a way to allow FreeBSD Rust to "play nice with both Rust in ports and the Rust ecosystem outside ports". He was pessimistic about the second option being possible at all, and even if it was "it is almost guaranteed to be a waste of our time and energy" and would revert to the first option in a few years.

A third option, he said, would be to work on a distribution of FreeBSD based on packages instead of the base/ports system it has now. That could allow FreeBSD to have the benefit of the Rust ecosystem, or Python, or C++, etc.

A demo

On August 4, Somers posted a link to a repository forked from the FreeBSD src tree with examples of new programs written from scratch, old programs rewritten in Rust with new features, and libraries written in Rust. Somers also noted several features that his demo did not include, such as kernel modules ("those are too hard"), integrating the Rust build system with Make, or cdylib libraries (those are Rust libraries intended to be linked into C/C++ programs). He invited anyone with questions about what it would look like to include Rust in base to examine his demo branch.

Kamp replied that Somers's demo was awesome, but asked if it was worth the effort when compared to the idea of a package-based version of FreeBSD where Rust things could be built without all the extra effort. That sparked a back-and-forth about the difficulties of maintaining tests separately from fast-moving features in the kernel. Ultimately, Kamp said that he understood the problem: "I've been there myself with code I have maintained for customers." The problem of maintaining Rust code separately from kernel code only impacts a few dozen developers. "Adding Rust to src would inconvenience /everybody/, every time they do a 'make buildworld'." The solution, he said, is not to add Rust to src, but to "get rid of the 'Src is the holy ivory tower, everything else is barbarians' mentality" that has caused FreeBSD trouble over the years.

Into the black

Once again, the discussion trailed off without any firm resolution. No doubt the topic will come up again, perhaps later this year if Webb and Losh dig into Rust build-system integration. Rust may yet find its way into FreeBSD's base system, unless Kamp's vision of a package-based FreeBSD comes to pass and makes the distinction irrelevant.

Memory-management: tiered memory, huge pages, and EROFS

By Jonathan Corbet
August 15, 2024
The kernel's memory-management developers have been busy in recent times; it can be hard to keep up with all that has been happening in this core area. In an attempt to catch up, here is a look at recent work affecting tiered-memory systems, underutilized huge pages, and duplicated file data in the Enhanced Read-Only Filesystem (EROFS).

Promotion and demotion

Tiered-memory systems are built with multiple types of memory that have different performance characteristics; the upper tiers are usually faster, while lower tiers are slower but more voluminous. To make the best use of these memory tiers, the system must be able to optimally place each page. Heavily used pages should normally go into the fast tiers, while memory that is only occasionally used is better placed in the slower tiers. Since usage patterns change over time, the optimal placement of memory will also change; the system must be able to move pages between tiers based on current usage. Promoting and demoting pages in this way is one of the biggest challenges in tiered-memory support.

Promotion is usually the easier side of the problem; it is not hard for the system to detect when memory is being accessed and move it to a faster tier. In current kernels, though, this migration only works for memory that has been mapped into a process's address space; the machinery requires that memory be referred to by a virtual memory area (VMA) to function. As a result, heavily used memory that is not currently mapped will not be promoted.

This situation comes about for page-cache pages that are being accessed by way of system calls (such as read() and write()), but which are not mapped into any address space. Memory-access speed can be just as important for such pages, though, so this inability to promote them can hurt performance.

This patch series from Gregory Price is an attempt to address that problem. The migration code in current kernels (migrate_misplaced_folio_prepare() in particular) needs to consult the VMA that maps a given folio (set of pages) prior to migration; if that folio is both shared and mapped with execute permission, then the migration will not happen. Pages that are not mapped at all, though, cannot meet that condition, so the absence of a VMA just means that this check need not be performed. With that change (and a couple of other adjustments) in place, it is simply a matter of adding an appropriate call in the swap code to migrate folios from a lower to a higher tier when they are referenced.

A kernel that is trying to appropriately place memory will always be running a bit behind the game; it cannot detect a changed access pattern without first watching the new pattern play out. Sometimes, though, an application will know that it will be shifting its attention from one range of memory to another. Informing the kernel of that fact might help the system ensure that memory is in the best location before it is needed; at least, that is the intent behind this patch from "BiscuitOS Broiler".

Quite simply, this patch adds two new operations to the madvise() system call. They are called MADV_DEMOTE and MADV_PROMOTE; they do exactly what one would expect. An application can use these operations to explicitly request the movement of memory between tiers in cases where it knows that the access pattern is about to change.

There is nothing technically challenging about this work, but it is also not clear that it is necessary. The kernel already provides a system call, migrate_pages(), that can be used to move pages between tiers; David Hildenbrand asked why migrate_pages() is not sufficient in this case. The answer seems to be that madvise() is found in the C library, but the wrapper for migrate_pages() is in the extra libnuma library instead. As Hildenbrand answered, that is not a huge impediment to its use. So, while making this feature available via madvise() might be convenient for some users, that convenience seems unlikely to be enough to justify adding this new feature to the kernel.

Reclaiming underutilized huge pages

The use of huge pages can improve application performance, by reducing both the usage of the system's translation lookaside buffer (TLB) and memory-management overhead in the kernel. But huge pages can suffer from internal fragmentation; if only a small part of the memory within a huge page is actually used, the resulting waste can be significant. The corresponding increase in memory use has inhibited the adoption of huge pages in many settings that would otherwise benefit from them.

One way to get the best of both worlds might be to actively detect huge pages that are not fully used, split them apart into base pages, then reclaim the unused base pages; that is the objective of this patch series from Usama Arif. It makes two core changes to the memory-management subsystem aimed at recovering memory that is currently wasted due to internal fragmentation.

The first of those changes takes effect whenever a huge page is split apart and mapped at the base-page level, as often happens even in current kernels. As things stand now, splitting a huge page will leave the full set of base pages in its wake, meaning that the amount of memory in use does not change. But, if the huge page is an anonymous (user-space data) page, any base pages within it that have not been used will only contain zeroes. Those base pages can be replaced in the owning process's page tables with references to the shared zero page, freeing that memory. Arif's patch set makes that happen by checking the contents of base pages during the splitting process and freeing any pages found to hold only zeroes.

That will free underutilized memory when a page is being split, which is a start. It would work even better, though, if the kernel could actively find underutilized huge pages and split them when memory is tight; that is the objective of the second change in Arif's patch set.

A huge page, as represented by a folio within the kernel, can at times be partially mapped, meaning that not all of the base pages within the huge page have been mapped in the owning process's page tables. When a fully mapped folio is partially unmapped for any reason, the folio is added to the "deferred split list". If, at some later point, the kernel needs to find some free memory, it will attempt to split the folios on the deferred list, then work to reclaim the base pages within each of them.

Arif's patch set causes the kernel to add all huge pages to the deferred list whenever they are either faulted in or created from base pages by the khugepaged thread. When memory gets tight and the deferred list is processed, these huge pages (which are probably still fully mapped) will be checked for zero-filled base pages; if the number of such pages exceeds a configurable threshold, the huge page will be split and all of those zero-filled base pages will be immediately freed. If the threshold is not met, instead, the page will be considered to be fully used and removed from the deferred list.

It is worth noting that the threshold is an absolute number; for the tests mentioned in the cover letter it was set to 409, which is roughly 80% of a 512-page huge page. This mechanism means that, while this feature can split underutilized PMD-sized huge pages implemented by the processor, it will not be able to operate on smaller, multi-size huge pages implemented in software. On systems using PMD-sized huge pages, though, the results reported in the cover letter show that this change can provide the performance benefits that come from enabling transparent huge pages while clawing back most of the extra memory that would otherwise be wasted.

Page-cache deduplication for EROFS

Surprisingly often, a system's memory will contain multiple pages containing the same data. When this happens with anonymous pages, the kernel samepage merging feature can perform deduplication, recovering some memory (albeit with some security concerns). The situation with file-backed pages is harder, though. Filesystems that can cause a single file to appear with multiple names and inodes (as can happen with Btrfs snapshots or in filesystems that provide a "reflink" feature) are one case in point; if more than one name is used, multiple copies of a file's data can appear in the page cache. This can also happen in the mundane cases where files contain the same data; container images can duplicate data in this way.

The problem with deduplicating such pages is that each page in the page cache must refer back to the file from which it came; there is no concept in the kernel of a page coming from multiple files. If a page is written to, or if a file changes by some other means, the kernel has to do the right thing at all levels. So those duplicate pages remain duplicated.

Hongzhen Luo has come up with a solution for the EROFS filesystem, though — at the file level, at least. EROFS is a read-only filesystem, so the problems that come from possible changes to its files do not arise here.

An EROFS filesystem can be created with a special extended attribute, called trusted.erofs.fingerprint, attached to each file; the content of that attribute is a hash of the file's contents. When a file in the filesystem is opened for reading, the hash will be stored in an XArray-based data structure, associated with the file's inode. Anytime another file is opened, its hash is looked up in that data structure; if there is a match, the open is rerouted to the inode of the file that was opened first.

This mechanism can result in a number of processes holding file descriptors to different files on disk that all refer to a single file within the kernel. Since the files have the same contents, though, this difference is not visible to user space (though an fstat() call might return a surprising inode number). Within the kernel, redirecting file descriptors for multiple identical files to a single file means that only one copy of that file's contents needs to be stored in the page cache.

The benchmark results included with the series show a significant reduction in memory use for a number of different applications. Since this feature is contained entirely within the EROFS filesystem, it seems unlikely to run into the sorts of challenges that often await core memory-management patches. Deduplication of file-backed data in the page cache remains a hard problem in the general case, but it appears to have been at least partially solved for this one narrow case.

Per-call-site slab caches for heap-spraying protection

By Jonathan Corbet
August 20, 2024
One tactic often used by attackers set on compromising a system is heap spraying; in short, the attacker fills as much of the heap as possible with crafted data in the hope of getting the target system to use that data in a bad way. If heap spraying can be blocked, attackers will lose an important tool. The kernel has some heap-spraying defenses now, including the dedicated bucket allocator merged for the upcoming 6.11 release, but its author, Kees Cook, thinks that more can be done.

A heap-spraying attack can be carried out by allocating as many objects as possible and filling each with data of the attacker's choosing. If the kernel can be convinced to use that data, perhaps as the address of a function to call, then the attacker can gain control. Heap spraying is not a vulnerability itself, but it can ease the exploitation of an actual vulnerability, such as a use-after-free bug or the ability to overwrite a pointer. The kernel's kmalloc() function (along with several variants) allocates memory from the heap. Since kmalloc() is used heavily throughout the kernel, any call site that can be used for heap spraying can potentially be used to exploit a vulnerability in a distant, unrelated part of the kernel. That makes the kmalloc() heap a tempting target for attackers.

kmalloc() makes its allocations from a set of "buckets" of fixed-sized objects; most (but not all) of those sizes are powers of two. So, for example, a 48-byte allocation request will result in the allocation of a 64-byte object. The structure behind kmalloc() is, in a sense, an array of heaps, each of which is used for allocations of a given size range. This separation can make heap spraying attacks easier, since it is not necessary to overwrite the entire heap to target an object of a given size.

The dedicated bucket allocator creates a separate set of buckets for allocation sites that are deemed to present an especially high heap-spraying risk. For example, any allocation that can be instigated from user space and filled with user-supplied data would be a candidate for a dedicated set of buckets. Then, even if the attacker manages to thoroughly spray that heap, it will not affect any other allocations; the attacker's carefully selected data cannot be used to attack any other part of the kernel.

The way to get the most complete protection from heap spraying would be to create a set of dedicated buckets for every kmalloc() call site. That would be expensive, though; each set of buckets occupies a fair amount of memory. Inefficiency at that level is the sort of tradeoff that kernel developers tend to view with extreme skepticism; creating a set of buckets for every call site simply is not going to happen.

This new patch series from Cook is built around one of those observations that is obvious in retrospect: most kmalloc() call sites request objects of a fixed size that will never change. Often that size (the size of a specific structure, for example) is known at compile time. In such cases, providing the call site with a single dedicated slab for the size that is needed would give an equivalent level of protection against heap-spraying attacks. There is no need to provide buckets for all of the other sizes; they would never be used.

The only problem with that idea is that there are thousands of kmalloc() call sites in the kernel. Going through and examining each one would be a tedious and possibly error-prone task that would result in a lot of code churn. But the compiler knows whether the size parameter passed to any given kmalloc() call is a compile-time constant or not; all that is needed is a way to communicate that information to the call itself. If that information were accompanied by something that identified the call site, the slab allocator could set up dedicated slabs for the call sites where it makes sense.

So the problem comes down to getting that information to kmalloc() in an efficient way. Cook's approach is an interesting adaptation of the code-tagging framework that was merged for the 6.10 release. Code tagging is part of the memory-allocation profiling subsystem, which is meant to help find allocation-related bugs; it ties allocations to the call site that requested them, so developers can find, for example, the source of a memory leak.

Code tagging was not really meant as a kernel-hardening technology, but it does provide the call-site information needed here. Cook's series starts by augmenting the tag information stored for each call site with an indicator of whether the allocation size is constant and, if so, what that size is. That information will be available to the slab allocator when the kmalloc() call is made.

If a given allocation request is at the GFP_ATOMIC level, it will be handled in the usual way to avoid adding any extra allocations to that path. Otherwise, though, the allocator will check whether that call site uses a constant size; if so, a dedicated slab will be created for that site and used to satisfy the allocation request (and all that follow). If the size is not constant, then a full set of buckets will be created instead. Either way, the decision will be stored in the code tag to speed future calls. It is worth noting that this setup is not done for any given call site until the first call is made, meaning that it is not performed for the many kmalloc() call sites that will never execute in any given kernel.

If this series is merged, the kernel will have three levels of defense against heap-spraying attacks. The randomized slab option, merged for 6.6, creates 16 sets of slab buckets, then assigns each call site to one set randomly. Its memory overhead is relatively low, but the protection is probabilistic — it reduces the chance that an attacker can spray the target heap, but does not eliminate it. The dedicated-buckets option provides stronger protection, but is limited by the need to explicitly identify risky call sites and isolate them manually. This new option, instead, provides strong protection against heap spraying, but it will inevitably increase the memory overhead of the slab allocator.

The amount of that overhead will depend on the workload being run. For an unspecified distribution kernel, Cook reported that the number of slabs reported in /proc/slabinfo grew by a factor of five or so. Should the series land in the mainline, it will be up to distributors to decide whether to enable this option or not. When a kernel is going to run on a system that is at high risk of heap-spraying attacks, though, that may prove to be an easy decision to make.

Modernizing openSUSE installation with Agama

By Joe Brockmeier
August 21, 2024

Linux installers receive a disproportionate amount of attention compared to the amount of time that most users spend with them. Ideally, a user spends only a few minutes using the installer, versus years using the distribution after it is installed. Yet, the installer sets the first impression, and if it fails to do its job, little else matters. Installers also have to continually evolve to keep pace with new hardware, changes in distribution packaging (such as image-based Linux distributions), and so forth. Along those lines, the SUSE team that maintains the venerable YaST installer has decided it's time to start (almost) fresh with a new installer, called Agama, for its newer projects. YaST is not going away as an administration tool, but it is likely to be relieved of installer duties at some point.

Brief history

YaST has served SUSE and openSUSE well for nearly 30 years as an installer and configuration tool. Its name is short for "Yet another Setup Tool", which was a nod to the proliferation of installers and administration tools that were already available when it made its debut in 1996.

Initially, YaST was proprietary, but it was released under the GPL in 2004 in the hopes that it would become widely used for Linux system management. That never really came to pass (despite attempts to get traction with projects like YaST4Debian), but YaST has remained the way to install and manage SUSE-based distributions. Over the years, YaST has undergone a number of revisions and rewrites in attempts to modernize it. But, after more than two decades of service, the YaST team has decided that the tool is too complex and burdened with too much technical debt to meet the needs of installing an increasing number of SUSE and openSUSE variants.

Agama is an attempt to overcome YaST's limitations as an installer and make a more suitable installer for new projects such as SUSE Linux Framework One, which was formerly known as the Adaptable Linux Platform (ALP). Note that the project is not a full replacement for YaST: it is only an installer, and there are no plans to use it for systems management post-installation. The project is also not starting entirely from scratch, as it reuses some of YaST's libraries rather than trying to reimplement storage management or software-package handling (for example).

D-Installer to Agama

Work began in 2022 under the provisional name D-Installer. The goals for the new installer included shortening the installation process, decoupling the user interface from YaST's internals, serving as an installer for multiple versions of SUSE/openSUSE, and adding a web-based interface to allow remote installation. Agama uses Ruby for components that interact with YaST, as well as Rust for some of its other services. Its web-based front end is a React-based application.

The first, quasi-functional, release of D-Installer was announced in late March 2022, with support for x86_64, Arm64, and s390. According to the announcement, it could install openSUSE Tumbleweed "in simple scenarios", but users could not modify the software selection or customize disk partitioning. The project was renamed Agama in March 2023, after the insectivorous iguana-like lizard. The Agama 3 release added PowerPC images and better support for custom disk partitioning, and gave users a choice of ISO images for installing Tumbleweed or prototypes of Framework One.

In February of this year, the Agama team published a 2024 roadmap with plans to make changes to the project's architecture over two major milestones. Initially, Agama was built on top of the Cockpit project (LWN covered Cockpit in March). However, the Agama developers decided that the functionality it offered was not enough to offset its dependencies—especially after the Cockpit project rewrote a major portion of its codebase in Python, creating an additional dependency that the team did not want to carry.

The first milestone, Agama 8, dropped Cockpit and switched from D-Bus to HTTP as the communication protocol between Agama's client and server components. (Agama still uses D-Bus to communicate with the YaST service.) This moved the project backward in some ways: for example, it lost support for direct-access storage devices (DASD) and zfcp, as well as Cockpit's integrated terminal. It did, however, gain a better interface for setting up storage, which made it possible to customize boot options, add disk encryption, and configure complex disk-partitioning schemes.

Agama 9 release

Agama 9, the most recent release, was announced on June 28. It includes a new web-based user interface and support for automated installation, either using Agama's own YAML-based profiles or, with partial support, "legacy" AutoYaST automated-deployment profiles. While not quite ready to supplant YaST for installation purposes, the project is making good progress and can be used today to install versions of openSUSE Leap, MicroOS, and Tumbleweed. Testing images are available for Arm64, s390, PPC64, and (of course) x86_64.

In its current iteration, Agama is composed of several server-side and client-side components. The core is the Agama server, which implements some of the installation logic and communicates over D-Bus with the Agama YaST service, which in turn uses YaST libraries for things like software handling and storage configuration. Its client-side components include a command-line interface (agama-cli), a web-based interface, and a systemd service that manages automated installations. The default is to use the web-based interface, even when installing locally: the live-CD installer uses a stripped-down Firefox profile to render it.

[Agama web-based interface]

Though Agama is still considered experimental, it was stable and usable in testing. As expected, it is not as full-featured as YaST in some respects. For example, it does not support setting video modes from the boot menu, or selecting source media for the installation, as YaST does. However, it seems to cover most of the bases for users wanting to set up systems with one of the supported versions of openSUSE.

Agama defers some choices that users could make from the YaST installer to post-installation. For instance, the network-configuration options in YaST support setting up VLANs and bridge devices, configuring DNS and routing, and much more. Agama treats advanced networking configuration as out of scope for the installation process, and focuses only on the network support necessary for installation.

The installation process using Agama is distilled into choosing the version of openSUSE to be installed, and then confirming or modifying the installer defaults for things like localization (language, keyboard layout, and time zone), storage options, software selection, and setting up networking (if not automatically configured). Generally, the only thing a user is required to do manually is to set up the first user or a root user.

The YaST installer requires users to step through the installation process in a specific order, and only allows navigation forward or backward one screen at a time. If a user gets to the installation overview screen and realizes that they have forgotten to, for example, create a separate partition for /home, YaST requires them to click back through several screens to make changes, then forward again through the rest of the installer options. Agama, on the other hand, allows users to select any of the configuration screens at any time before beginning the installation.

Remote installation via the browser is also fully functional. Agama generates a random root password, visible on the console, that is needed to log in to the installer remotely from a browser. Alternatively, users can set a password on the boot command line with live.password=<password> or add a default password to the ISO image; see the documentation for instructions on how to do that.

Since Agama is in heavy development, users and developers might want to work from a branch that has not yet been delivered as a testing image. Agama can be patched while running from the live medium, using the yupdate script. The documentation recommends a minimum of 4GB of RAM to do this, since patching the installer requires downloading a number of RPM and NPM packages that are stored in memory.

Not out to YaSTure yet

Agama has come a long way in 2024, but it is not ready to step into YaST's installer shoes just yet. It is unclear when the next Agama release will be, but, according to the openSUSE release-engineering meeting minutes from August 21, it will be the default installer for the upcoming Leap 16 alpha. Lubos Kocman of the openSUSE release team also recently wrote that his long-term plan is to transition to Agama by default for Tumbleweed, Leap, and SUSE Linux Enterprise Server 16 by the second half of 2025, in conjunction with the Leap 16 final release. For now, interested developers and users can follow progress via the daily builds; if all goes well, Agama should be taking on installer duties before YaST celebrates its 30th anniversary in 2026.

Comments (4 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Kernel Rust docs; Dual-boot problems; Gentoo releases; uv 0.3.0; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds