Leading items
Welcome to the LWN.net Weekly Edition for January 4, 2024
This edition contains the following feature content:
- LWN's guide to 2024: in which we continue our foolish tradition of trying to predict what the new year will bring.
- Smuggling email inside of email: SMTP smuggling is a new way of spoofing email.
- Data-type profiling for perf: a new perf option giving visibility into which data is frequently accessed.
- The trouble with MAX_ORDER: a simple change to an integer macro leads to problems.
- The Linux graphics stack in a nutshell, part 2: the conclusion of our look at how the Linux graphics stack is put together.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
LWN's guide to 2024
The calendar has flipped over into 2024 — another year has begun. Here at LWN, we do not have a better idea of what this year will bring than anybody else does, but that doesn't keep us from going out on a shaky limb and making predictions anyway. Here, for the curious, are a few things that we think may be in store for 2024.

The kernel community will begin to move away from email as the core of its development process. Progress will be slow, and many kernel developers will prove strongly resistant to any alternative workflow, often for good reasons, but there will be at least a few developers who are able to get work reviewed, updated, and merged without having to touch a mail client. In a world where even Linus Torvalds is saying that the time has come to make a change, the unthinkable stands a good chance of actually coming to pass.
The next long-term stable kernel will be 6.12, which will be released on December 1, 2024 (unless Linus Torvalds declines to make a release immediately after the US Thanksgiving holiday, in which case it will be one week later).
The first user-visible Rust code will be merged into the kernel, perhaps as soon as the 6.8 release (which will happen in March). That code may not be used on many systems initially, but it still marks an important transition: once Rust is used for user-visible features, the kernel community will no longer have the option of easily dropping support for the language. Merging user-visible Rust code into the kernel will, in other words, be a declaration that the Rust experiment is a success.
As Rust becomes necessary to build a Linux kernel, the lack of a GCC-based Rust compiler will become a bigger problem. The gccrs project is working to fill that void, but the task is large, the target is moving quickly, and the project has progressed slowly with relatively little support. Somebody is going to have to put some resources into that project if it is to succeed; it is not clear where those resources might come from in 2024, though.
The enterprise distribution market will be shaken up in 2024. Last year, vendors working in this space all apparently came to the conclusion that there was little to be done other than offering clones of Red Hat Enterprise Linux (RHEL), essentially ceding control over that part of the market to one company. As Red Hat makes it harder to compete in that space, though, both vendors and users will increasingly ask themselves why they are bothering. Stable Linux does not need to be a copy of RHEL, and there are a number of interesting approaches that could draw parts of the market away from RHEL in the future.
Firefox will reverse its longtime decrease in browser share after gaining a strong impetus from Google's plan to switch to "Manifest V3" in the Chrome browser. That transition will make ad blockers harder, if not impossible, to implement in Chrome. Ad blockers, once called "the biggest boycott in world history" by Doc Searls, are the only thing that makes the World Wide Web tolerable for many users. Switching browsers is no harder than installing the ad blocker in the first place; many people will find the motivation to make that switch when the alternative is a constant stream of advertisements, trackers, and malware.
Of course, this will all go better if Mozilla positions itself well to be the defender of a friendlier web. The recent statement from Mozilla CEO Mitchell Baker suggests that the organization sees AI as a more interesting focus for its efforts than saving the world (again) from a browser monopoly. If Firefox languishes while Mozilla pursues shinier projects, an opportunity — perhaps the final one — for Firefox to regain relevance may be lost.
Speaking of AI, open-source generative AI will see a lot of attention this year. Partly that is because some of the open-source projects have proved to be surprisingly competitive with the proprietary efforts, but there is more going on than that. Those proprietary platforms are going to spend the year tangled up in high-profile copyright lawsuits and could run into serious trouble. A system based on free software, running on one's own servers, may continue to be useful even if the large, proprietary solutions find themselves restricted or shut down. We will also certainly see signs of open-source models being used in ways (such as the generation of hate speech) that the commercial models, often with good reason, prohibit. The results seem unlikely to be pleasant.
It will be a big year for BPF, which is not surprising since the last few years have been as well. Projects like the extensible scheduler class seem unlikely to go away; increasing pressure from users may eventually cause its rejection to be reconsidered. Meanwhile, the recently announced acquisition of Isovalent by Cisco may bring new resources to BPF development — or it may, in the way of many corporate acquisitions, succeed in messing up an important BPF-development group.
The first release of a "free threading" Python (without the global interpreter lock or GIL) in October will be a qualified success. There will be rough spots and bumpy patches due to the need for two different binary distributions of the language (with and without the GIL) and its extensions, but the overall direction will be looking good. The free-threaded version will not be the default in 2024, but progress in that direction will be visible by the end of the year.
Goodhart's law, paraphrased, says "when a measure becomes a target, it ceases to be a good measure". Abuse of metrics will become a bigger problem in the coming year, continuing a trend that has been underway for some time. Whether the metric is CVE numbers obtained, regression reports filed, commit counts, patches "reviewed", toots boosted, discussion-forum badges obtained, or any of a number of other things, the pursuit of "more" will lead to trouble.
Consider, for example, this post from Daniel Stenberg on problems with AI-generated security-bug reports in the curl project. As it becomes easier to crank out such reports (or "Tested-by" tags, etc.), and as people perceive value in increasing their personal scores, the amount of abuse will certainly increase. We will have to develop better defenses to avoid being overwhelmed by this type of spam.
Finally, the ongoing maintainer crisis will intensify in 2024. There are many projects in the free-software community that are heavily and widely depended upon, but which receive little support. Those projects, as a result, tend to exhibit slow progress, large amounts of technical debt, security problems, and more. This problem is not new; it is also not hidden to anybody who has been paying attention.
Free software has been a major boon to the corporate world. Any company out there can take advantage of massive amounts of free software with no obligations beyond adhering to the license — and many companies do not even bother with that. Companies can adapt the software to their needs, and they can pool their efforts to minimize the need to reimplement functionality that has been created elsewhere.
This freedom is a good thing; it has created a great deal of benefit for almost everybody involved. But free software does not develop itself; it needs care and support. Often, companies have provided those resources, with the result that we all have access to far more free software than we might have once thought possible. We have all benefited from this massive contribution of resources to our community from the commercial realm.
But companies can be severely short-sighted. It is easy for a manager to justify contributing a driver to the kernel to enable their company's hardware. It is harder to justify contributing resources to the framework within the kernel that makes the driver easy to write, to the rest of the kernel, or to user-space support for that hardware. And it is nearly impossible to get resources to support hardware that the company is no longer selling — even though many users still depend on that hardware.
So companies tend to contribute in ways that create other problems and fail to support the platform as a whole. They hire maintainers in the hope of gaining both skills and some influence over a project, then do not give those maintainers time to do their maintenance work. Companies will often ignore pressing needs in parts of the free-software ecosystem, seeing them as somebody else's problem. As a result, much of the work of keeping our software healthy falls onto the shoulders of the relatively few developers who are able to put some time (perhaps at significant personal cost) into it.
By all appearances, 2024 does not look like a year in which companies have decided to improve support for the software that they depend so heavily upon. That needs to change; free software is not some magical, infinite resource that companies can use to reduce their own development costs without a care for its future. It is a way to share maintenance costs; if it is used as a way to shed those costs entirely, the results will continue to be unpleasant, for both the software and the people who work so hard to keep it going.
Hopefully that vision is too dark, and we will see some progress toward improving the situation. Regardless of how it goes, LWN will be there throughout 2024 to keep you apprised of the state of our community. We hope it is a great year for all of our readers, and thank you for supporting us into our 26th year of operation. It has been a fascinating trip, and we are not done yet.
Smuggling email inside of email
Normally, when a new vulnerability is discovered and releases are coordinated with those affected, the announcement is done at a convenient time—not generally right before the end-of-year holidays, for example. The SMTP Smuggling vulnerability has taken a different path, however, with its announcement landing on December 18. That may well have been unpleasant for some administrators that had not yet updated, but it was particularly problematic for some projects that had not been made aware of the vulnerability at all—though it was known to affect several open-source mailers.
Discovery and disclosure
The vulnerability was discovered by Timo Longin of SEC Consult back in June; the company contacted three affected vendors, GMX, Microsoft, and Cisco, in July. GMX fixed the issue in August, and Microsoft did so in October, but Cisco responded that the "identified vulnerability is just a feature of the software and not a bug/vulnerability". That led SEC Consult to contact the CERT Coordination Center (CERT/CC) for further assistance, using the Vulnerability Information and Coordination Environment (VINCE) tool:
There, we submitted all our research details and explained the case to the vendors involved. We received feedback from Cisco that our identified research is not a vulnerability, but a feature and that they will not change the default configuration nor inform their customers. Other vendors did not respond in VINCE but were contacted by CERT/CC.

Based on this feedback and as multiple other vendors were included in this discussion through the CERT/CC VINCE platform without objecting, we wrongly assessed the broader impact of the SMTP smuggling research. Because of this assumption, we asked CERT/CC end of November regarding publication of the details and received confirmation to proceed.
A talk about the vulnerability was accepted for the 37th Chaos Communication Congress (37C3), which was held at the end of December. So the vulnerability needed to be announced before that happened. The talk acceptance occurred on December 3 and the announcement came out roughly two weeks later—just before the holidays.
The "wrongly assessed
" wording in the quote above seems to indicate
that SEC Consult recognizes that it made a mistake here. In addition, the
talk is said
to have contained "a decent apology
".
One of the other "vendors" mentioned in the (admirably detailed) blog post is Sendmail, but Postfix is mentioned elsewhere as well. It is clear that SEC Consult was fully aware that those two mailers were vulnerable, but the company apparently relied on CERT/CC to involve additional affected projects, which just as clearly did not happen. For example, Postfix creator and maintainer Wietse Venema called the vulnerability announcement "part of a non-responsible disclosure process"; he softened that stance somewhat in a Postfix SMTP Smuggling page, but still noted that "[critical] information provided by the researcher was not passed on to Postfix maintainers before publication of the attack". The result was "a presumably unintended zero-day attack", he said.
Vulnerability
The Simple Mail Transfer Protocol (SMTP), which is described in RFC 5321, provides a text-based protocol for submitting and exchanging email on the net. At some level, this flaw is another indication that the "robustness principle" (also known as "Postel's law") was not actually all that sound from a security standpoint; being liberal in what is accepted over-the-wire has often led to problems. In this case, the handling of line endings in conjunction with the SMTP end-of-data indication can lead to situations where email can be successfully spoofed—leading to rogue email that passes various checks for authenticity.
The SMTP DATA command is used for the actual text that will appear in an email, including headers and the like; the so-called "envelope", which describes the sender and receiver, is another set of SMTP commands (EHLO, MAIL FROM, RCPT TO) that precede the DATA. The blog post announcing the vulnerability has lots of diagrams and explanations for those who want all the gory details. The DATA command is ended with a line that is blank except for a single period ("."); the line endings for SMTP are defined to be carriage-return (CR or "\r") followed by line-feed (LF or "\n"), so end-of-data should be signaled with "\r\n.\r\n".
It turns out that some mailers will accept just a line-feed as the line terminator, but others will not; there is a difference in interpretation of "\n.\n" that can be used to smuggle an email message inside another:
    EHLO ...
    MAIL FROM: ...
    RCPT TO: ...
    DATA
    From: ...
    To: ...
    Subject: ...

    innocuous email text
    \n.\n
    MAIL FROM: <admin@...>
    RCPT TO: <victim@...>
    DATA
    From: Administrator <admin@...>
    To: J. Random Victim <victim@...>
    Subject: Beware of phishing scams

    Like this one
    \r\n.\r\n

The second set of commands is in the text of the email, if the SMTP server receiving it does not consider "\n.\n" as the termination of the DATA command. That email will be sent to another SMTP server, however, which may see things rather differently. If the SMTP server for the destination sees that line as terminating the data, it may start processing what comes next as entirely new email. It is a vulnerability that is analogous to HTTP request smuggling, which is where the name comes from.
There are also variations on the line endings (e.g. "\n.\r\n") that can be used to fool various mailers. The core of the idea is to have the outbound mail server ignore the "extra stuff" as part of the initial email message and send the mail on to a server that sees the single message as more than one—and acts accordingly. SPF checks can be made to pass and even DKIM can be spoofed by using an attacker-controlled DKIM key in the smuggled headers. DMARC can be used to thwart the smuggling, but common configurations of it are still vulnerable. In addition, because mail servers, especially those of the larger email providers, often handle multiple domains, there is an opportunity to smuggle email that purports to come from different domains inside innocuous-seeming messages to users of those services.
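The fixes that mailers have since shipped generally amount to rejecting or normalizing bare line feeds at the inbound server. As a rough illustration only (this sketch is not drawn from the code of any of the affected mailers), such a check might look like:

    /* Hypothetical sketch of the kind of check a patched inbound
     * server can apply: flag any LF that is not preceded by CR, so
     * that "\n.\n" can never be mistaken for an end-of-data
     * indicator downstream. */
    #include <stdbool.h>
    #include <stddef.h>

    static bool has_bare_newline(const char *data, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (data[i] == '\n' && (i == 0 || data[i - 1] != '\r'))
                return true;    /* bare LF found; reject or normalize */
        }
        return false;
    }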
The blog post mostly concentrates on the three vendors identified, but clearly notes that "Postfix and Sendmail [fulfill] the requirements, are affected and can be smuggled to". It provides several lists of domains that can be spoofed via GMX, Microsoft Exchange Online, or Cisco Secure Email Cloud Gateway. In fact, SEC Consult uses the Cisco product itself, so the company has changed its settings away from the default to avoid the problem, which Cisco does not acknowledge as any kind of bug.
Postfix and beyond
Meanwhile, over in open-source land, folks seemed rather astonished that the vulnerability dropped with no warning. Marcus Meissner posted about the vulnerability to the oss-security mailing list: "As if we did not have sufficient protocol vulnerability work short[ly] before Christmas break this year, here is one more". He is likely referring to the Terrapin vulnerability in the SSH protocol, which was announced on December 18—well after coordinating with many different SSH implementation projects. The SMTP Smuggling vulnerability followed a rather different path, as Stuart Henderson pointed out:
I'm a little confused by sec-consult's process here. They identify a problem affecting various pieces of software including some very widely deployed open source software, go to the trouble of doing a coordinated disclosure, but only do that with...looking at their timeline... gmx, microsoft and cisco?
Meissner noted that SUSE was not alerted to the problem via VINCE and that the Postfix timeline (in Venema's post) only started after the announcement. Erik Auerswald speculated that SEC Consult expected CERT/CC to alert other affected projects; instead it would seem that CERT/CC gave the go-ahead to release the findings. After Rodrigo Freire wondered why the problem was considered a vulnerability, Auerswald explained:
Any user of an affected outbound server can spoof email from any user of the same outbound server despite SPF and DKIM (DMARC+DKIM can prevent this in some cases, also more senders can be spoofed in specific cases, for details see the blog post). But for this to work, the inbound server must act as a confused deputy. Both outbound and inbound servers need to be differently vulnerable to enable the attack. This specific attack can be prevented unilaterally on either the outbound or the inbound server.

[...] For email server open source projects, relevant for the oss-security list, the primary vulnerability is to act as a confused deputy inbound server, because users of such email servers usually have a much smaller number of accounts than the big freemail providers. But, in general, they could also possibly act as a vulnerable outbound server, e.g., after a [legitimate] user account has been compromised.
CVEs for Sendmail, Postfix, and Exim have been assigned. Postfix has released updates with a new smtpd_forbid_bare_newline option (that defaults to off, but will default to on in the upcoming Postfix 3.9 release); those who do not want to upgrade can work around most of the problems using existing options. Exim has an advisory, bug report, and bug-fix release (4.97.1) for the problem. Sendmail has a snapshot release (8.18.0.2) with a new option to address the flaw as well. However, other than Postfix updates from SUSE and Slackware, most distributions have not yet issued updates for this problem, which leaves a lot of Linux users vulnerable.
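For administrators running an updated Postfix, turning on the new protection is a small main.cf change. This is a sketch based on the option names above; the exact syntax and defaults should be checked against the documentation for the installed release:

    # /etc/postfix/main.cf: reject SMTP commands that contain bare
    # newlines (defaults to "no" before Postfix 3.9)
    smtpd_forbid_bare_newline = yes

    # Optionally exempt trusted clients that are known to send bare
    # newlines (assumption: $mynetworks is a suitable exclusion list)
    smtpd_forbid_bare_newline_exclusions = $mynetworks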
So the major open-source mailer projects scrambled to fix the vulnerability over the holidays, but the process has obviously failed here. We have not (yet?) heard CERT/CC's side of the story, but either it or SEC Consult should definitely have contacted those projects in the multiple months that have gone by since the discovery. Given that SEC Consult knew about Sendmail and Postfix, though, it is a little hard to understand why a simple heads-up message to the security contacts for those projects was not sent.
It seems likely that there are smaller mailers and other tools that are still affected by the flaw—though they may only have just heard of it. A month or two of advance notice, especially for small projects, could have made a lot of difference. One can only hope that lessons are being learned here and that any coordination will be better ... coordinated (and communicated) down the road.
Data-type profiling for perf
Tooling for profiling the effects of memory usage and layout has always lagged behind that for profiling processor activity, so Namhyung Kim's patch set for data-type profiling in perf is a welcome addition. It provides aggregated breakdowns of memory accesses by data type that can inform structure layout and access pattern changes. Existing tools have either, like heaptrack, focused on profiling allocations, or, like perf mem, on accounting memory accesses only at the address level. This new work builds on the latter, using DWARF debugging information to correlate memory operations with their source-level types.
Recent kernel history is full of examples of commits that reorder structures, pad fields, or pack them to improve performance. But how does one discover structures in need of optimization and characterize access to them to make such decisions? Pahole gives a static view of how data structures span cache lines and where padding exists, but can't reveal anything about access patterns. perf c2c is a powerful tool for identifying cache-line contention, but won't reveal anything useful for single-threaded access. To understand the access behavior of a running program, a broader picture of accesses to data structures is needed. This is where Kim's data type profiling work comes in.
Take, for example, this recent change to perf from Ian Rogers, who described it tersely as: "Avoid 6 byte hole for padding. Place more frequently used fields first in an attempt to use just 1 cache line in the common case."
This is a classic structure-reordering optimization. Rogers quotes
pahole's output for the structure in question before the optimization:
    struct callchain_list {
            u64                        ip;                   /*     0     8 */
            struct map_symbol          ms;                   /*     8    24 */
            struct {
                    _Bool              unfolded;             /*    32     1 */
                    _Bool              has_children;         /*    33     1 */
            };                                               /*    32     2 */

            /* XXX 6 bytes hole, try to pack */

            u64                        branch_count;         /*    40     8 */
            u64                        from_count;           /*    48     8 */
            u64                        predicted_count;      /*    56     8 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            u64                        abort_count;          /*    64     8 */
            u64                        cycles_count;         /*    72     8 */
            u64                        iter_count;           /*    80     8 */
            u64                        iter_cycles;          /*    88     8 */
            struct branch_type_stat *  brtype_stat;          /*    96     8 */
            const char  *              srcline;              /*   104     8 */
            struct list_head           list;                 /*   112    16 */

            /* size: 128, cachelines: 2, members: 13 */
            /* sum members: 122, holes: 1, sum holes: 6 */
    };
We can see that there is a hole, and that the whole structure spans two cache lines, but not much more than that. Rogers's patch moves the list_head structure up to fill the reported hole and, at the same time, puts a heavily accessed structure into the same cache line as the other frequently used data. Making a change like that, though, requires knowledge of which fields are most often accessed. This is where perf's new data type profiling comes in.
To use it, one starts by sampling memory operations with:
perf mem record
Intel, AMD, and Arm each have some support for recording precise memory events on their contemporary processors, but this support varies in how comprehensive it is. On processors that support separating load and store profiling (such as Arm SPE or Intel PEBS), a command like:
perf mem record -t store
can be used to find fields that are heavily written. Here, we'll use it on perf report itself with a reasonably sized call chain to evaluate the change.
Once a run has been done with the above command, it is time to use the resulting data to do the data-type profile. Kim's changes add a new command:
perf annotate --data-type
that prints structures with samples per field; it can be narrowed to a single type by providing an argument. This is what the output from:
perf annotate --data-type=callchain_list
looks like before Rogers's patch (note the sample counts on the ip, ms, and list fields):
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (218 samples):
    ============================================================================
        samples  offset  size  field
            218       0   128  struct callchain_list {
             18       0     8      u64 ip;
            157       8    24      struct map_symbol ms {
              0       8     8          struct maps* maps;
             60      16     8          struct map* map;
             97      24     8          struct symbol* sym;
                                   };
              0      32     2      struct {
              0      32     1          _Bool unfolded;
              0      33     1          _Bool has_children;
                                   };
              0      40     8      u64 branch_count;
              0      48     8      u64 from_count;
              0      56     8      u64 predicted_count;
              0      64     8      u64 abort_count;
              0      72     8      u64 cycles_count;
              0      80     8      u64 iter_count;
              0      88     8      u64 iter_cycles;
              0      96     8      struct branch_type_stat* brtype_stat;
              0     104     8      char* srcline;
             43     112    16      struct list_head list {
             43     112     8          struct list_head* next;
              0     120     8          struct list_head* prev;
                                   };
                                };
This makes the point of the patch clear. We can see that list is the only field on the second cache line that is accessed as part of this workload. If that field could be moved to the first cache line, the cache behavior of the application should improve. Data-type profiling lets us verify that assumption; its output after the patch looks like:
    Annotate type: 'struct callchain_list' in [...]/tools/perf/perf (154 samples):
    ============================================================================
        samples  offset  size  field
            154       0   128  struct callchain_list {
             28       0    16      struct list_head list {
             28       0     8          struct list_head* next;
              0       8     8          struct list_head* prev;
                                   };
              9      16     8      u64 ip;
            116      24    24      struct map_symbol ms {
              1      24     8          struct maps* maps;
             60      32     8          struct map* map;
             55      40     8          struct symbol* sym;
                                   };
              1      48     8      char* srcline;
              0      56     8      u64 branch_count;
              0      64     8      u64 from_count;
              0      72     8      u64 cycles_count;
              0      80     8      u64 iter_count;
              0      88     8      u64 iter_cycles;
              0      96     8      struct branch_type_stat* brtype_stat;
              0     104     8      u64 predicted_count;
              0     112     8      u64 abort_count;
              0     120     2      struct {
              0     120     1          _Bool unfolded;
              0     121     1          _Bool has_children;
                                   };
                                };
For this workload, at least, the access patterns are as advertised. Some quick perf stat benchmarking revealed that the instructions-per-cycle count had increased and the time elapsed had decreased as a consequence of the change.
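A hedged way to reproduce that kind of comparison (the exact workload used for the measurements is an assumption here) is to run the command under perf stat before and after the change:

    perf stat -- perf report --stdio > /dev/null

perf stat's default output includes both the instructions-per-cycle ratio and the elapsed time, so no extra event selection is needed.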
Anyone who has spent a lot of time scrutinizing pahole output, trying to shuffle structure members to balance size, cache-line access, false sharing, and so on, is likely to find this useful. (Readers who have not yet delved into this rabbit hole might want to start with Ulrich Drepper's series on LWN, "What every programmer should know about memory", specifically part 5, "What programmers can do".)
Data-type profiling obviously needs information about the program it is looking at to be able to do its job; specifically, identifying the data type associated with a load or store requires that there is DWARF debugging information for locations, variables, and types. Any language supported by perf should work. The author verified that, aside from C, Rust and Go programs produce reasonable output, though not always idiomatic for the language involved.
After sampling memory accesses, data-type aggregation correlates sampled instruction arguments with locations in the associated DWARF information, and then with their type. As is often the case in profiling, compiler optimizations can impede this search. This unfortunately means that there are cases where perf won't associate a memory event with a type because the DWARF information either wasn't thorough enough, or was too complex for perf to interpret.
Kim spoke about this work at the 2023 Linux Plumbers Conference (video), and noted situations involving chains of pointers as a common case that isn't supported well currently. While he has a workaround for this problem, he also pointed out that there is a proposal for inverted location lists in DWARF that would be a more general solution.
For any given program address (usually the current program counter (PC)), location lists in DWARF [large PDF] allow a debugging tool to look up how a symbol is currently stored; it can be a location description, which may indicate the symbol is currently stored in a register, or an address. What tools like perf would rather have is a mapping from an address or register to a symbol. This is effectively an inversion of location lists, but computing this inversion is much less expensive for the compiler emitting the debugging information in the first place. This has been a sore spot for perf in the past, judging from the discussion between Arnaldo Carvalho de Melo and Peter Zijlstra during the former's Linux Plumbers Conference 2022 talk (video) on profiling data structures.
As of this article, Kim's work is unmerged but, since the changes are only in user space, it's possible to try them out easily by building perf from Kim's perf/data-profile-v3 branch. Given the enthusiastic reactions to the v1 patch set from perf tools maintainer Arnaldo Carvalho de Melo, Peter Zijlstra, and Ingo Molnar, it seems likely that it won't remain unmerged for long.
The trouble with MAX_ORDER
One might not think that much could be said about a simple macro defining a constant integer value. But the kernel is special, it seems. A change to the definition of MAX_ORDER has had a number of follow-on effects, and the task of cleaning up after this change is not done yet. So perhaps a look at MAX_ORDER is in order.

Like everything else in the earliest releases of the Linux kernel, memory management was simple; it was mostly concerned with the tracking and management of individual pages. Around 1994, though, as memory sizes increased and the kernel became more complex, it became clear that the ability to deal with pages in larger chunks would be useful. The 1.1.0 release in April 1994 included the beginnings of what was to become the kernel's "buddy allocator", which tracks and allocates pages in blocks sized in powers of two.
With this change came the concept of an "order", which is just the base-two logarithm of the number of pages in a block. An order-0 block is a single page, an order-1 block holds two pages, and so on. Blocks of different orders were kept in separate lists in 1.1.0, which included a macro (NR_MEM_LISTS) describing how many of those lists there were. It was given a value of six initially, meaning that the largest block that could be managed was of order five — 32 pages. The 2.1.23 release (January 1997) saw a special case added for the AP1000 CPU, which needed to be able to allocate 8MB (order 11) chunks of memory; NR_MEM_LISTS was set to 12 if the kernel was being built for that architecture.
In 2.3.27 (November 1999), NR_MEM_LISTS got a new name — MAX_ORDER — and was set to ten for all but the AP1000. Support for AP1000 was eventually removed in 2.3.42 in February 2000, but the practice of adjusting MAX_ORDER for specific architectures has only grown more prevalent. The default value for MAX_ORDER can now vary widely between architectures; it defaults to 11 on most systems, though. As a result, on such systems, pages can be allocated in blocks of order zero through ten.
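To make the arithmetic concrete: a block of order n spans 2^n pages. A small user-space illustration (the 4KB page size is an assumption; it varies by architecture) prints the block sizes for the default range of orders:

    /* Illustration only: buddy-allocator block size as a function of
     * allocation order, for orders zero through ten. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long page_size = 4096;   /* assumed 4KB pages */

        for (int order = 0; order <= 10; order++)
            printf("order %2d: %4lu pages, %8lu KB\n", order,
                   1UL << order, (page_size << order) / 1024);
        return 0;
    }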
One interesting aspect of MAX_ORDER, as described so far, is that it is not actually the maximum order; if MAX_ORDER is 11, then the maximum order that can be allocated is ten. In March 2023, Kirill Shutemov decided to address this inconsistency after noticing some bugs in the code resulting from a misunderstanding of MAX_ORDER; the result was this patch set that fixed such bugs in nine different places, including a longstanding bug in the floppy driver.
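The shape of such bugs is easy to sketch; the following hypothetical check (not drawn from any of the actual fixes) shows how the old semantics invited an off-by-one:

    /* Hypothetical example: under the old meaning, MAX_ORDER was the
     * number of free lists, so valid orders ran from zero through
     * MAX_ORDER - 1 and ">=" was the correct bound check. */
    #include <stdio.h>

    #define MAX_ORDER 11

    static int check_order_buggy(int order)
    {
        return (order > MAX_ORDER) ? -1 : 0;   /* wrongly accepts 11 */
    }

    static int check_order_correct(int order)
    {
        return (order >= MAX_ORDER) ? -1 : 0;  /* rejects 11 */
    }

    int main(void)
    {
        printf("order 11: buggy=%d correct=%d\n",
               check_order_buggy(11), check_order_correct(11));
        return 0;
    }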
The work did not end there, though; at the end of this series, Shutemov redefined MAX_ORDER to mean what its name suggests: the maximum order an allocation can be. MAX_ORDER thus often defaults to ten now, with the kernel (still) maintaining 11 lists of free blocks. At the time, the reception for the change was mostly positive, though Mel Gorman did worry that this change could cause problems for stable backports. David Laight suggested that changing the name of MAX_ORDER would help prevent such problems, but that suggestion was ignored and Shutemov's patch series was merged for the 6.4 release in June.
At the time, it seemed that the problem was solved. At the end of September, though, Paolo Bonzini pointed out three changes that, having been developed in parallel with the MAX_ORDER change, were still using the older meaning. The changes were correct when written, but had been broken by the MAX_ORDER change, and nobody had noticed. Nearly two months later, Mike Snitzer sent a pull request containing a fix for one of those bugs, introduced by a change to the dm-crypt target merged for 6.5. It turns out that out-of-tree code, like backports, can be silently broken by this change — a concern that nobody had raised when the MAX_ORDER change was being considered.
That bug caused Linus Torvalds to question whether the change should have been made at all. He suggested that he might revert it in the absence of a compelling reason to keep it around; he later added:
Bah. I looked at the commits leading up to it, and I really think that "if that's the fallout from 24 years of history, then it wasn't a huge deal".

Compared to the inevitable problems that the semantic change will cause for backports (and already caused for this merge window), I still think this was all likely a mistake.
He also said that bugs related to MAX_ORDER tend not to be found by testing because it is rare to allocate blocks of the maximum size.
Shutemov answered that the bugs he had fixed in that series were only those that made it into the mainline and had never been found; "I personally spend some time debugging due MAX_ORDER-1 thing".
The real mistake, Shutemov continued, was (as Laight had pointed out) in preserving the MAX_ORDER name while changing its semantics: "Just renaming it to MAX_PAGE_ORDER or something would give a heads up to not-yet-upstream code". He followed up with another patch adding a new symbol, NR_ORDERS, defined as MAX_ORDER+1. Torvalds agreed with that change, but insisted that the MAX_ORDER name needed to change in order to avoid future problems.
At the end of December, Shutemov posted a pair of new patches making the changes. The first introduces the new name (now NR_PAGE_ORDERS) that describes the number of allocation orders supported by the page allocator. The second then renames MAX_ORDER to MAX_PAGE_ORDER, with the result that any out-of-tree code that uses MAX_ORDER will now fail to compile. That change should prevent a repeat of the dm-crypt bug; it should also keep MAX_ORDER bugs from being introduced into the stable kernels.
These patches, seemingly likely to be applied once the 6.8 merge window opens, should bring an end to the long history of MAX_ORDER in the kernel. The moral of the story is that names are indeed important; the wrong name can lead to the creation of incorrect code. Just as important, though, is the point that the meaning of a name should not be changed without raising a flag in potentially affected code. If a name is not right, it should probably be left behind entirely.
The Linux graphics stack in a nutshell, part 2
Displaying an application's graphical output onto the screen requires compositing and mode setting that are correctly synchronized among the various pieces, with low overhead. In this second and final article in the series, we will look at those pieces of the Linux graphics stack. In the first installment, we followed the path of graphics from the application, through Mesa, while using the memory-management features of the kernel's Direct Rendering Manager (DRM) subsystem. We ended up with an application's graphics data stored in an output buffer, so now it's time to display the image to the user.
Compositing
User-space applications almost never display their output by themselves, but instruct a screen compositor to do so. The compositor is a system service that receives each application's output buffers and draws them to an on-screen image. The layout is up to the compositor's implementation, but stacking and tiling are the most common. The compositor is also responsible for gathering user input and forwarding it to the application in focus.
Compositing, as well as almost everything else in the graphics stack, used to be provided by the X Window System, which implements a network protocol for displaying graphics on the screen. Since the everything-else part includes drawing, mode setting, screen sharing, and even printing, X suffers from software bloat and is hard to adapt to changes in the graphics hardware and Linux system; a lightweight replacement was needed. Its modern successor is Wayland, which is another client-server design where each application acts as a client to the display service provided by the compositor. Wayland's reference compositor is Weston, but GNOME's Mutter or KDE's KWin are more commonly used.
There's no drawing or printing in Wayland; the protocol only provides the functionality required for compositing. A Wayland surface represents an application window; it is the application's handle to display its output and to receive input events from the compositor. Attached to the surface is a Wayland buffer that contains the displayable pixel data plus color-format and size information. The pixel data is in the output buffer that the client application has rendered to. Changing a surface's attached buffer object or its content results in a Wayland-protocol surface-damage message from the application to the compositor, which updates the on-screen content, possibly with the contents of a new buffer object. The application's output buffer becomes an input buffer for the Wayland compositor.
Rendering in the compositor works exactly as described for applications in the first installment. The compositor maintains a list of all of the Wayland surfaces that represent application windows. Those windows and the compositor's interface elements form yet another scene graph. The background contains a wallpaper image, background pattern, or color. On top of the background, the compositor draws the application windows. The easiest way to implement this is by drawing a rectangle for each window and using the application-provided buffer object as a texture image.
On top of the application windows, the compositor draws its own user interface, such as a taskbar where the user can interact with the compositor itself. Finally, the topmost layer is an indicator of what the user is currently interacting with, typically a mouse pointer on desktop systems. Like applications, the compositor renders with the regular user-space interfaces, such as Mesa's OpenGL or Vulkan.
The final building block to make all of this possible is the transfer mechanism for buffer objects. In contrast to X, Wayland applications always run on the same host as their compositor. Implementations are thus free to optimize for this case: there's no network encoding, buffer compression, and so on involved.
For transferring a buffer object that resides in system memory, the application creates a file descriptor that refers to the buffer's memory, sends it over the connection's stream socket (in a single, low-cost message), and lets the compositor map the file descriptor's memory pages into its address space. Both the application and the compositor have now established a low-overhead channel for exchanging pixel data. The application draws into the shared-memory region and the compositor renders from it. In practice it's also common to use multiple buffer objects for double buffering. Wayland's surface-damage messages serve as a synchronization method with low overhead.
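A minimal sketch of the client side of that handoff, using memfd_create() directly rather than a Wayland utility library and leaving out the socket transfer itself, might look like:

    /* Sketch: create a shareable pixel buffer. The anonymous file is
     * sized and mapped; its descriptor can then be passed to the
     * compositor over the Unix socket (via SCM_RIGHTS), which maps
     * the same pages into its own address space. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    static int create_buffer(size_t size, void **pixels)
    {
        int fd = memfd_create("pixel-buffer", MFD_CLOEXEC);

        if (fd < 0 || ftruncate(fd, size) < 0)
            return -1;
        *pixels = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (*pixels == MAP_FAILED) {
            close(fd);
            return -1;
        }
        return fd;
    }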
Transferring data via shared memory is good enough for software rendering but, for high-performance hardware rendering, it is insufficient. The application would have to render on the graphics hardware and read back the result over the slow hardware bus into the region of shared memory.
To avoid that penalty, the graphics buffer has to remain in graphics memory. Wayland provides a protocol extension to share buffer objects via a Linux dma-buf, which represents a memory buffer that is shareable among hardware devices, drivers, and user-space programs. An application renders its scene graph via the Mesa interfaces using hardware acceleration as described in part 1, but, instead of transferring a reference to shared memory, the application sends a dma-buf object that references the buffer object while it is still located in graphics memory. The Wayland compositor uses the stored pixel data without ever reading it over the hardware bus.
Hardware-accelerated rendering is inherently asynchronous and therefore requires synchronization. After the application has sent the current frame's final rendering command to Mesa, it is not guaranteed that the hardware has finished rendering. This is intentional and required for high performance. But having the compositor display the content of a buffer object before the hardware has completed rendering results in distorted output. To prevent this from happening, the hardware signals when it has completed rendering. This is called fencing and the associated data structure is called a fence. The fence is attached to the dma-buf object that the application transfers to the compositor. The compositor waits for the fence to signal completion before it uses the resulting data for generating its own output.
Pixels to the monitor
After rendering the on-screen image, the compositor has to display it to the user. DRM's mode-setting code controls all aspects of reading pixel data from graphics memory and sending it to an output device. To do so, each driver sets up a pipeline that models the pixel data's flow through the graphics hardware. Each pipeline stage represents a piece of hardware functionality that processes pixel data on its way to the monitor.
The minimum stages necessary are the framebuffer, plane, CRTC, encoder, and connector, each of which is described below. For a working display output, there has to be at least one active instance of each. But most hardware provides more than the minimum functionality and allows for enabling and disabling pipeline stages at will. The DRM framework comes with software abstractions for each stage upon which drivers can build.
The pipeline's first stage is the DRM framebuffer. It is the buffer object that stores the compositor's on-screen image, plus information about the image's color format and size. Each DRM driver programs the hardware with this information and points the hardware to the first byte of the buffer object, so that the hardware knows where to find the pixel data.
Fetching the pixel data is called scanout, and the pixel data's buffer object is called the scanout buffer. The number of scanout buffers per framebuffer depends on the framebuffer's color format. Many formats, such as the common RGB-based ones, store all pixel data in a single buffer. With other formats, such as YUV-based ones, the pixel data might need to be split up into multiple buffers.
Depending on the hardware's capabilities, the framebuffer can be larger or smaller than the output's display mode. For example, if the monitor is set to 1920x1080 pixels, it might only show a section of a much larger framebuffer. Or, if the framebuffer is smaller than the display mode, it might only cover a small area of the monitor, leaving some areas blank. Hence, the pipeline's next stage locates the scanout buffer within the overall screen. In DRM terminology, this is called a plane. It sets the scanout buffer's position, orientation, and scaling factors. Depending on the hardware, there can be multiple active planes using different framebuffers. All active planes feed their pixel output into the pipeline's third stage, which is called the cathode-ray tube controller (CRTC) for historical reasons.
The CRTC controls everything related to display-mode settings. The DRM driver programs the CRTC hardware with a display mode and connects it with all of its active planes and outputs. There can also be multiple CRTCs with different settings programmed to them. The exact configuration is only limited by hardware features.
Planes are stacked, so they can overlap each other or cover different parts of the output. According to the programmed display mode and each plane's location, the CRTC hardware fetches pixel data from the planes, blends overlapping planes where necessary, and forwards the result to its outputs.
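With the legacy (non-atomic) user-space API, positioning a plane is a single libdrm call. In this hedged sketch, the various IDs are placeholders assumed to have been looked up beforehand, and the source coordinates are in the 16.16 fixed-point format the interface requires:

    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Place a 640x480 region of a framebuffer at (100, 100) on the
     * CRTC; plane_id, crtc_id, and fb_id are assumptions here. */
    static int show_plane(int fd, uint32_t plane_id, uint32_t crtc_id,
                          uint32_t fb_id)
    {
        return drmModeSetPlane(fd, plane_id, crtc_id, fb_id, 0,
                               /* destination rectangle on the CRTC: */
                               100, 100, 640, 480,
                               /* source rectangle, 16.16 fixed point: */
                               0, 0, 640 << 16, 480 << 16);
    }

A modern compositor would set the equivalent plane properties through the atomic interface described below.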
Outputs are represented by encoders and connectors. As its name suggests, the encoder is the hardware component that encodes pixel data for an output. An encoder is associated with a specific connector, which represents the physical connection to an output device, such as HDMI or VGA ports with a connected monitor. The connector also provides information on the output device's supported display modes, physical resolution, color space, and the like. Outputs on the same CRTC mirror the CRTC's screen on different output devices.
The image below shows a simple mode-setting pipeline with an additional plane for the mouse pointer, plus the buffer objects that act as scanout buffers. Arrows indicate the logical flow of pixel data from buffer objects to a VGA connector. This is a typical mode-setting pipeline for an older, discrete graphics card.
Pipeline setup
Deciding on policies for connecting and configuring the individual stages of the mode-setting pipeline is not the DRM driver's job. This is left to user-space programs, which brings us back to the compositor. As part of its initial setup, the compositor opens the device file under /dev/dri, such as /dev/dri/card1, and invokes the respective ioctl() calls to program the display pipeline. It also fetches the available display modes from a connector and picks a suitable one.
After the compositor has finished rendering the first on-screen image, it programs the mode-setting pipeline for the first time. To do so, it creates a framebuffer for the on-screen image's buffer object and attaches the framebuffer to a plane. It then sets the display mode for its on-screen buffer on the CRTC, connects all of the pipeline stages, from framebuffer to connector, and enables the display.
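A condensed sketch of that first mode set, using the legacy libdrm API for brevity (the atomic interfaces covered below are what modern compositors actually use), might look like the following; the device path, the choice of the first connector and CRTC, the depth/bpp values, and the omission of error handling are all simplifying assumptions:

    #include <fcntl.h>
    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    static int set_initial_mode(uint32_t bo_handle, uint32_t pitch,
                                uint32_t width, uint32_t height)
    {
        int fd = open("/dev/dri/card1", O_RDWR | O_CLOEXEC);
        drmModeRes *res = drmModeGetResources(fd);
        drmModeConnector *conn =
            drmModeGetConnector(fd, res->connectors[0]);
        drmModeModeInfo *mode = &conn->modes[0]; /* often the preferred mode */
        uint32_t fb_id;

        /* Wrap the buffer object in a framebuffer ... */
        drmModeAddFB(fd, width, height, 24, 32, pitch, bo_handle, &fb_id);
        /* ... then connect framebuffer -> CRTC -> connector and
         * enable the display. */
        return drmModeSetCrtc(fd, res->crtcs[0], fb_id, 0, 0,
                              &conn->connector_id, 1, mode);
    }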
To change the displayed image in the next frame, no full mode setting is required. The compositor only has to replace the current framebuffer with a new one. This is called page flipping.
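Continuing the hypothetical sketch above, a page flip is a single call; completion is reported asynchronously as a DRM event on the device file descriptor:

    /* Queue a flip of the CRTC to a new framebuffer; fd, crtc_id, and
     * new_fb_id are assumed to come from the earlier setup. */
    drmModePageFlip(fd, crtc_id, new_fb_id,
                    DRM_MODE_PAGE_FLIP_EVENT, NULL);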
The individual stages of the mode-setting pipeline can be connected in a variety of ways. A CRTC might mirror to multiple encoders, or a framebuffer might be scanned out by multiple CRTCs. While this offers flexibility, it also means that not all combinations of pipeline stages are supported.
A naive implementation would apply each stage's settings individually. It would first program the display mode in the CRTC, then upload all buffer objects into graphics memory, then set up the framebuffers and planes for scanout, and finally enable the encoders and connectors. If anything fails during this procedure, the screen remains off or (even worse) in a distorted state. For example, with limited device memory, it might not be possible to store the framebuffers for more than one plane at a time. Switching modes, or even simple page flips, might fail. Failing display updates have been a common problem of graphics stacks ever since.
DRM's atomic mode setting solves this problem to some extent. The mode-setting code tracks the complete state of all elements of the pipeline in a compound data structure called drm_atomic_state, plus a sub-state for each stage in the pipeline. This mode setting is atomic in the sense that it either applies the full compound state of all pipeline stages, or none of it. To make this work, mode-setting involves two phases: first a check of the complete new atomic state and second, if successfully checked, a commit of the same.
For checking, the DRM core, its helpers, and the DRM driver test the proposed state against the limitations and constraints of the available graphics hardware. For example, a plane has to verify that the attached framebuffer is of a compatible color format and the CRTC has to verify that the given display resolution is supported by the hardware. If checking succeeds, the DRM driver programs the new state to hardware during the commit phase. If state checking fails for one or more of the stages, DRM stops the mode-setting operations and returns an error to the user-space program.
So, when our compositor intends to program a display mode, it sets the atomic state of all pipeline stages and applies them all at once. If successful, the display output changes accordingly. For successive page flipping operations, the compositor duplicates the current state, changes the framebuffers to the new ones, and applies the new state. Applying the page flip again results in an atomic-check/atomic-commit sequence within the kernel's DRM code, but with less overhead than a full mode-setting operation.
DRM's state-checking phase is independent of the hardware's state and does not modify it. If checking an atomic state fails, the compositor receives an error code, but the display output remains unchanged. It is also possible for the compositor to verify atomic states without committing them. This allows building a list of supported configurations beforehand.
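In terms of the libdrm API, the check-then-commit sequence looks roughly like the following sketch; the object and property IDs (fb_id_prop, crtc_id_prop, and so on) must be discovered through the DRM properties interface at startup and are placeholders here:

    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Sketch of an atomic update: build a request, test the full
     * proposed state without touching the hardware, then commit it. */
    static int atomic_flip(int fd, uint32_t plane_id, uint32_t crtc_id,
                           uint32_t fb_id_prop, uint32_t crtc_id_prop,
                           uint32_t new_fb_id)
    {
        drmModeAtomicReq *req = drmModeAtomicAlloc();
        int ret;

        drmModeAtomicAddProperty(req, plane_id, fb_id_prop, new_fb_id);
        drmModeAtomicAddProperty(req, plane_id, crtc_id_prop, crtc_id);

        /* Check phase: validate the state only. */
        ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL);
        if (ret == 0)
            /* Commit phase: program the state to the hardware. */
            ret = drmModeAtomicCommit(fd, req, 0, NULL);
        drmModeAtomicFree(req);
        return ret;
    }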
For further reading, the inner workings of atomic mode setting have been covered in detail on LWN back in 2015: part 1 and part 2.
Additional features
In the discussion of planes, it has been assumed that all of the hardware's planes are the same. But that's not always the case. There is usually a plane, called the primary plane, for RGB-like color formats, which covers the whole display. The compositor sets up the primary plane to display its on-screen image.
But most hardware provides an additional plane for the mouse pointer, called the cursor plane. This plane only covers a small area and floats above the primary plane. As the name suggests, the compositor uses the cursor plane to display the mouse-pointer image, which can now be moved around without changing the primary plane's on-screen image at all.
Between the primary and cursor plane are overlay planes, which are of various sizes and often support YUV-like color formats. This makes them suitable for displaying streams of video data with low CPU overhead. For that, the video-player application provides the compositor with buffer objects that contain the YUV-based pixel data.
The compositor sets up the overlay plane with a framebuffer of the pixel data. The plane scans out the YUV pixel data and performs the color conversion to RGB in hardware. Using dma-buf, the video player can forward individual YUV frames from a hardware video decoder directly to the compositor, thus leaving the entire video processing to hardware.
If latency of the display update is of critical concern, it can be helpful to hand over mode-setting capabilities to a single application. The compositor therefore leases the functionality to the application. While an application holds an active DRM lease, it has full control over the mode-setting pipeline. This is useful for 3D headsets, which need to tightly coordinate the output frequency and latency of their internal displays to make the 3D illusion work. DRM leases expire or can be revoked, so the compositor ultimately remains in control of mode setting.
While modern compositors use Wayland as their protocol, applications for the X Window System are still common. Xwayland is an X server that runs within a Wayland session. It lets X applications participate in the Wayland sessions transparently by translating between Wayland and X protocols. This works for most use cases.
The common use case Xwayland cannot emulate is screen capturing and screen sharing. X applications have access to the X session's whole window tree, which makes screen capturing easy. For security purposes, the Wayland protocol does not allow applications to read the screen or other application's windows. Wayland compositors therefore provide dedicated implementations for capturing or sharing the screen's content. PipeWire, VNC, or RDP are commonly used for this functionality.
If no compositor is active, Linux displays a text console. DRM supports the kernel's framebuffer console for text output. This DRM fbdev emulation acts like a DRM client from user space, but runs entirely within the kernel. It also provides the old framebuffer interfaces, such as /dev/fb0. Fbdev and DRM's fbdev emulation are on their way to retirement, though. There are ideas for moving much of the console functionality to user space.
At the time of writing this article, one quickly evolving topic for Linux graphics is High Dynamic Range (HDR) rendering, which displays the output with more nuanced colors and lighting, thus showing details that are often lost with traditional rendering. Support for this will position Linux to fulfill the needs of professional graphics artists. Currently, support is still uneven, but it's possible to use HDR in games and Linux desktops are beginning to implement HDR as well.
At this point, we have followed the path of getting the application's content onto the screen in the modern Linux graphics stack—from rendering and memory management to compositing and mode setting. But we've really just scratched the surface. The stack keeps evolving and constantly adds support for new features and hardware.