LWN.net Weekly Edition for June 4, 2020
Welcome to the LWN.net Weekly Edition for June 4, 2020
This edition contains the following feature content:
- Free user space for non-graphics drivers: when should kernel drivers be required to include free user-space components?
- The history and evolution of PHP governance: the PHP project is an interesting exercise in direct democracy; here is how it got there and how it works.
- Capacity awareness for the deadline scheduler: what needs to be done to make the deadline scheduler properly manage asymmetric systems.
- A possible end to the FSGSBASE saga: an Intel processor feature from 2012 may finally get support in the kernel.
- Development statistics for the 5.7 kernel: where the code for the 5.7 release came from.
- Merkle trees and build systems: using Merkle trees in the build system opens up some interesting optimization possibilities.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Free user space for non-graphics drivers
In the kernel graphics world, there has been a longstanding "line in the sand" that disallows merging kernel drivers without a corresponding free-software user-space driver. The idea is that not having a way to test the full functionality means that the kernel developers cannot verify the proper functioning and security of the driver; changes to the kernel driver may lead to unforeseen (and untestable) problems on the user-space side. More recently, though, we have seen other types of devices with complex drivers, but no useful free user-space piece, that have been proposed for inclusion into the kernel; at least one was merged, but the tide has perhaps turned against those types of drivers at this point—or some of them, anyway.
In mid-May, Jeffrey Hugo posted an RFC patch for the "Qualcomm Cloud AI 100" device, which is a PCIe card with an application-specific integrated circuit (ASIC) that targets "deep learning" workloads. The device is also referred to as a QAIC device; it presents a modem host interface (MHI) control path and a DMA engine for the data path. These are exposed in the driver as a Linux character device with ioctl() commands to access the data path.
Dave Airlie, who drew the line in the sand back in 2010, wondered how the QAIC driver was really any different. It has a user-facing API (which he calls the "uapi") with no open-source users and no tests for it.
Hugo said that he would like to put one of the devices "into the hands of Linaro, so that it can be put into KernelCI". In order to do that, he needs to get permission from others at Qualcomm. Beyond that, though, it will be difficult to get anything particularly useful on the user-space side:
I've asked for authorization to develop and publish a simple userspace application that might enable the community to do such testing, but obtaining that authorization has been slow.
Perhaps to Airlie's surprise, though, Greg Kroah-Hartman said that he would not review the driver any further without that kind of information (and code); furthermore, he would not merge it and he would actively oppose anyone else merging it, as well.
Especially given the copyright owner of this code, that would be just crazy and foolish to not have open userspace code as well. Firmware would also be wonderful as well, go poke your lawyers about derivative work issues and the like for fun conversations :)
Hugo said that he would try, but did not seem to have a lot of hope that he would be able to deliver what the kernel developers are looking for. On the other hand, Daniel Vetter was surprised by Kroah-Hartman's position:
Instead it's "totally-not-a-gpu accel driver without open source userspace" _and_ you have to be best buddies with Greg. Or at least not be on the naughty company list. Since for habanalabs all you wanted is a few test cases to exercise the ioctls. Not the entire userspace.
Vetter is referring to the Habana Labs kernel driver that was merged over his, Airlie's, and others' objections. The problems that were pointed out at the time are much the same as those raised for the QAIC driver, but it would seem the outcome is going to be different this time. Kroah-Hartman said that he was particularly concerned with "the copyright owner of this code", which must be referring to Qualcomm. The code is actually copyrighted by the Linux Foundation (Kroah-Hartman's employer) and comes from the Code Aurora Forum, which is a Linux Foundation collaborative project; the Qualcomm Innovation Center (QuIC) is a principal member of that organization.
He noted that Habana Labs has released its library code as open source since the merge, so that situation has been resolved. But he did admit that he was wrong about these types of drivers:
So I was wrong, and you were right, my apologies for my previous stubbornness.
The recent submission of a DirectX driver from Microsoft is in a similar position; Sasha Levin posted an RFC patch for it on May 19.
The use case, at least for now, is not about graphics at all; it is meant to provide a means for a Linux guest on a Windows system to access the GPU for compute purposes. But the blog post announcing the feature seems to indicate that graphics support is at least on the horizon. Vetter said that, as documented, a graphics driver needs a free user space before it can go upstream, but that does not seem to be what is happening here.
He suggested that since the driver is not targeting graphics it could move elsewhere in the tree (such as drivers/hyperv). Airlie thought similarly.
He is concerned, though, that merging it in a Hyper-V-specific part of the tree may lead to trouble down the road if graphics functionality gets added to the driver. But Microsoft's Steve Pronovost said that, while graphics support may be in the future, the driver being proposed would not be used; it is strictly meant for the GPU-compute case. There is work going on to support Linux GUI applications on WSL, but that would be done using the Remote Desktop Protocol (RDP) and a Wayland compositor; that work would be fully open source, he said.
Levin noted that he would personally prefer that the user-space pieces needed for the proposed driver were available as open source "and wish I could do something about it". He is amenable to moving the driver outside of the graphics tree, which would seem to be the path forward. He also pointed out that the RFC "is not a case of 'we want it upstream NOW' but rather 'let's work together to figure out how to do it right' :)".
It would seem, then, that the QAIC driver is going to languish, while the Microsoft driver may be able to find its way into the mainline, even though neither has any significant free user-space code that can be used to test it. In part, that may be due to the difficult relationship Qualcomm has had with the kernel development community over the years. Beyond that, the QAIC device seems to be a more complicated beast, with more ways that things can go wrong—invisibly. But the Microsoft driver has not been merged yet; we will have to wait to see whether its lack of a free user space holds it back—or not.
These are undoubtedly not the last drivers of this sort we will see. There is a trend toward devices that are programmable from user space and the kernel is simply used as a conduit to carry proprietary code and data between user space and the device; only the device maker has any real view into what is actually happening inside it. There is clearly an advantage for device makers to get their drivers into the mainline, but is there any real gain for the community by doing so? As Airlie said about the DirectX driver: "[...] I don't see the value this adds to the Linux ecosystem at all, and I think it's important when putting a burden on upstream that you provide some value."
Users who have these devices may benefit from finding the driver in their distribution kernel, and the device makers get "free maintenance" by the kernel developers, but the community is left with a pile of code that is not particularly useful—and could be fragile in ways that could cause problems in the future. It does not really sound like the usual free-software bargain. Luckily, the kernel developers provide a highly functioning platform that the device makers can use to sell and run their devices; they may just have to maintain and distribute their own drivers in order to do so.
The history and evolution of PHP governance
The PHP language is widely used in solving some of the most interesting technical problems on the web. But for a language with widespread use, it is unique — or at least an outlier — in the way it's governed compared to other open-source projects. Unlike others, PHP governance has grown into something fairly democratic for a project its size, allowing almost anyone to bring an idea to the table. If it's popular enough, that idea can find its way into a future release. That is, of course, as long as there is a developer to put in the work to make it happen. Understanding the governance of the project can help answer questions on why things are the way they are, and add context to the challenges the project will face in the future.
The early days of PHP governance
Looking back to the early 2000s, as was true for many open-source projects at the time, the governance and direction of the project was largely dictated by the simple concept of what has been known in the PHP community as "karma". That is to say, the more contributions you made to the project the more clout you had when it came to deciding which features made it into a release. Especially early on, there was little to no gatekeeping when it came to handing out repository credentials to people who wanted to contribute something interesting. If a developer wanted to add something, the biggest barrier was often only having the technical understanding to do so correctly. A good example of this is SimpleXML, which provides object-oriented mapping of XML documents. The experimental implementation of that feature more or less just appeared in the code base one day without much of any discussion at all. Back in the early 2000s, if you knew how to do it and it seemed reasonable, most of the time your code made it into a release.
From the beginning, PHP was a language born of an itch to scratch. Solutions, rather than concerns over consistency or academic purity, have always been a main goal of the project. This show-me-the-code approach can certainly be credited for the vibrant community at work on the project still today, but it has also led to plenty of problems in regards to governance.
Sidestepping the occasional argument over a relatively minor commit in the early days, the first real struggle PHP faced came around the time of the release of PHP 4, with the introduction of the Zend Engine. Written by contributors Zeev Suraski and Andi Gutmans, it replaced the implementation of the language found in PHP 3 with a more robust engine and API for PHP 4. By itself, the value of the contribution to the project is undeniable. However valuable, the engine and control over it quickly became a point of conflict with the community, as those same two developers founded Zend Technologies in order to sell closed-source technology for PHP based on the open-source engine they had introduced to the language.
The contribution itself rocketed Suraski's and Gutmans's standing in terms of clout when deciding the future direction of the language, but it left many of the other contributors uncomfortable. For starters, especially early on, the number of people who even understood the engine's implementation was limited and documentation was scarce. Perhaps more significant was that Zend's product line included features that arguably belonged in the language, such as debugging and performance enhancements. These circumstances caused a certain amount of resentment among other community members as they started to get the impression that the two developers were preventing features that competed with their commercial offerings from getting into the open-source code base.
This feeling of the project being run by benevolent dictators with commercial interests rose to the point that a significant portion of the contributors began secret discussions on how to separate themselves from Zend through a fork in 2002 — the effort fell apart after Suraski discovered it. Rather than a fork, by 2003 two new projects started to provide open-source alternatives to Zend's offerings: xdebug to provide debugging services and APC, which cached the compiled form of PHP code between requests.
In the years since, there have been multiple efforts to integrate these technologies into PHP's core and none of them have been successful. Today, the xdebug project is still alive and well, continuing to be maintained by its original author, Derick Rethans. As for APC, Zend released as open-source its implementation of the caching and optimization technology called opcache ten years later in 2013, at which point it was promptly bundled into the language and released as part of PHP 5.5; this effectively ended the need for the APC project altogether.
Things have changed significantly, and arguably for the better, since those tense years. As PHP has matured both as a language and project, it has moved away from the haphazard practices of the past and adopted a Request For Comments (RFC) process to govern the discussion of how new features, both ideas and implementations, are brought into the language. The original RFC process itself is nothing new and roughly ten years old, but looking back, its impact has been remarkably positive. Recently a number of changes have been made by the community to improve the process, which makes it worth examining.
The RFC process
The basic steps for getting new code into a PHP release via the RFC process are fairly straightforward today: the first step is to document the proposal in the RFC section of the PHP wiki, followed by sending an email to the Internals mailing list. It is strongly recommended that a preliminary email be sent to Internals before the proposal is even written, in order to ensure that time is not wasted on something that isn't fundamentally viable. The matter is discussed (or not) for a minimum of two weeks, changes are made to the RFC as necessary, and ultimately the author of the RFC opens the issue up for a vote or withdraws it. If the measure passes, then a new thing in the PHP community is approved. If not, there are opportunities to try again in the future.
Since RFCs can be anything from a new feature to a procedural change, what happens next depends on the RFC. For code changes, a working implementation prior to the RFC being accepted is strongly preferred, though it is not a hard-and-fast requirement when the proposal comes from an established contributor. For RFCs from lesser-known contributors, patches are reviewed by the community during the RFC process prior to voting; RFCs from lesser-known contributors that lack a patch, or that give the impression that someone other than the proposer is expected to do the work, have little chance of success. For procedural changes, upon approval the relevant official documentation is updated to reflect the new policies.
The devil, of course, is in the details. Who gets to vote, and what constitutes a passing majority are both questions whose answers have evolved in recent times.
When it came to how an RFC is accepted, there were originally two different sets of criteria depending on the subject. RFCs that affected the language itself (such as new syntax) required 2/3 of the voting population to be in favor of the change, while anything else (such as adding a new function) required a simple majority. In PHP governance, the total voting population is defined by who shows up to vote rather than by a well-defined electorate. For a 2/3 supermajority, the number of "yes" votes must be at least twice the number of "no" votes. This means a vote could have ten or a hundred participants; the number of participants has historically been an indicator of how contentious the subject of the RFC is in the community.
These rules, specifically the fact that so many decisions could be made by a simple majority, had bothered members of the internals community for quite some time, and an RFC to remedy the situation was introduced in 2016. That RFC lay largely dormant, however, until 2019. The issue finally came to a head in early 2019 when contributor Joe Watkins took issue with Suraski's perceived abuses of the simple-majority clause, in an echo of the conflicts of old. On the internals mailing list, he accused Suraski of abusing the spirit of the rules:
You pushed FFI into php-src on a simple majority, it had one user, was incomplete (with leaks), and had zero justification for being included in php-src - it didn't require any internal API's and can function just fine as a PECL extension, still you pushed through with the RFC and it was accepted on a simple majority.
You are now trying to push JIT into php-src on the same slim, clearly unacceptable majority, and even if you change the majority requirements, what's worse is you want to include it in a minor version. Once again, this is an incomplete feature, failing to support the architectures we support and declaring them unimportant. You are pushing PHP towards a future where there is effectively a single contributor, possibly two, able to make changes to Zend+Opcache; You are changing core parts of PHP too fast and making other contributors, including the maintainers of external tooling which the ecosystem requires to function, uncomfortable.
In the end the RFC to remove simple majorities for future RFCs won adoption, with a clear majority of thirty in favor and two against. Suraski was one of the two opposed to the measure; he wrote at the time:
Reasons I voted 'no':
- This is clearly rushed. A more comprehensive RFC is in the works and already under discussion.
- Substantial changes were made last minute, without allowing reasonable time to discuss, or for people to even acquaint themselves with the changes. As an RFC that deals with process, that's an extremely bad precedent.
- It places strong bias for status quo even in cases where one does not exist.
- It does not tackle the voting eligibility issues we have.
The general idea is good, the implementation & surrounding process are bad.
Interested readers can examine the exact changes to the RFC procedure that resulted. In short, under the new rules, all RFC votes require a 2/3 majority of the participants in the vote, effectively preventing changes from passing with a slim majority. If an RFC meets the overall 2/3-majority requirement, simple-majority votes are allowed for related items within that RFC (such as implementation details of a feature).
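To make the arithmetic concrete, here is a minimal illustration (a Python sketch, not anything from the PHP project itself) of the acceptance rule, using vote counts mentioned in this article:
    def rfc_passes(yes, no):
        # 2/3 supermajority: "yes" must be at least twice "no", which is the
        # same as requiring yes / (yes + no) >= 2/3; abstentions are not counted.
        return yes >= 2 * no

    print(rfc_passes(30, 2))     # True: the 2019 voting-rules RFC itself
    print(rfc_passes(108, 48))   # True: scalar type declarations (108/156, about 69%)
    print(rfc_passes(10, 9))     # False: a slim simple majority no longer suffices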
As for who has the right to vote on RFCs, PHP only requires a GitHub account and code contributions to the project, with no limitations on when those contributions were made. Exceptions are made to grant voting rights to developers of large projects that use the PHP language but are not necessarily involved in internals work. This means the PHP project's nearly 2,000 contributor accounts may qualify as potential voters on any given RFC. As a result, language features and procedural changes are generally decided by those who show up to cast their ballots. Generally speaking, a random sampling of accepted RFCs hints that relatively simple measures, like a new function, attract between 20 and 30 participating voters, while more meaningful changes like anonymous classes had 52 votes cast.
Democratic coding
This democratic approach to language features for PHP has its benefits. A relatively low bar of entry helps encourage new contributors while making sure the evolution of PHP doesn't stagnate. Where in a top-down organizational structure momentum depends on those at the top to advance the project, PHP's governance rules are set up in such a way that any of a large group of voters could conceivably assemble the 2/3 of votes needed.
Guard rails of sorts still exist, although less officially now, through the contributions of longtime internals community members who retain a significant ability to influence opinion during discussions. When compared to the governance of PHP before the RFC process, though, that influence is now much farther from absolute. Earlier we saw how this was true in procedural matters, but it holds true even for major language features. One example is the implementation of Scalar Type Declarations, introduced in PHP 7, where even PHP founder Rasmus Lerdorf and other prominent contributors were overruled in a vote of 108 to 48 after heated debate.
In the final analysis, PHP governance continues to be a messy endeavor, with various camps of thought on how the language should move forward in constant conflict with each other. The RFC process has done a great deal to democratize these conflicts and move away from the benevolent-dictator approaches of the past, but little to remove the conflicts themselves. One welcome thing it does provide is a clear path for new contributors to make small contributions, such as a single useful function, and to succeed in seeing their ideas make it into releases without a great deal of turmoil. Contributors looking to become involved in weightier project decisions, however, will still find it to be a discouraging process at times, and should expect to defend their ideas when more seasoned contributors attempt to tear them asunder.
Ultimately, as messy as it is (as democracy can be), PHP governance's track record speaks for itself in its continued production of important and useful code. While PHP governance has dramatically improved over the years, it is unlikely to ever fundamentally depart from the bare-knuckle exchange of ideas it began with; it will merely refine it. As Lerdorf has said, "Ugly problems often need ugly solutions because the pretty solution is sometimes just too much of a hassle for everyone involved." At the time Lerdorf wasn't speaking about governance, but at least when it comes to the PHP project the sentiment continues to have influence today.
Capacity awareness for the deadline scheduler
The Linux deadline scheduler supports realtime systems where applications need to be sure of getting their work done within a specific period of time. It allocates CPU time to deadline tasks in such a way as to ensure that each task's specific timing constraints are met. However, the current implementation does not work well on asymmetric CPU configurations like Arm's big.LITTLE. Dietmar Eggemann recently posted a patch set to address this problem by adding the notion of CPU capacity to the deadline scheduler.
In realtime systems, tasks need to meet certain timing requirements. The Linux kernel includes two realtime scheduling classes to meet the needs of these systems: POSIX realtime (often called just "realtime") and deadline.
The POSIX realtime scheduler uses task priorities as the basis of its decisions; the task with the highest priority will be run first. The deadline scheduler, instead, dispenses with priorities and describes tasks using three parameters: the run time, period, and deadline. The run time is the CPU time that the task requires to finish its immediate work, the period defines the time between two activations of the task, and the deadline is the time by which the task must be able to use its CPU time. Interested readers can find more explanation of the theory behind the Linux realtime schedulers and the differences between them in an earlier article.
The deadline scheduler and asymmetric CPU configurations
The deadline scheduler includes an admission algorithm that runs when a task requests deadline scheduling and ensures that the system will be able to meet the task's needs. A task will be refused entry into the deadline class if accepting it would mean that the realtime guarantees could no longer be met and tasks in the system would miss their deadlines. The algorithm does not, however, guarantee that the deadlines will always be met; it only guarantees bounded tardiness for the deadline tasks and non-starvation for non-deadline ones. This is because the ability to meet deadlines in the general case depends on the tasks already in the system; a detailed explanation is available in this article.
The work of the deadline scheduler becomes more complicated in asymmetric CPU configurations, like big.LITTLE or DynamIQ. Such systems include different types of CPUs, with higher and lower performance. The same task running on a higher-performance ("big") CPU will take less time than when run on a lower-performance ("little") one. The deadline scheduler in current kernels does not take that difference into account, with the result that it can over-allocate the CPU time on lower-performance CPUs. Deadline tasks could end up on a little CPU, scheduled in such a way that they are unable to finish before their deadlines, while they would be able to do so on a higher-performance CPU. On such systems, the admission-control algorithm, which assumes that all CPUs perform at the level of the big ones, could overcommit the system with deadline tasks, making the system unusable.
The information missing in the deadline scheduler is an understanding of CPU capacity — the number of instructions that can be executed in a given time. More details of how it is calculated can be found in this article. The CPU capacity is already used in load balancing and in other situations, for example when changing CPU frequency because of overheating, and it has recently been added to the realtime scheduler. Eggemann's work takes capacity into account in the deadline scheduler's admission-control and task-placement algorithms. After the changes, the deadline scheduler places the tasks in such a way that the available capacity is sufficient to allow tasks to meet their deadlines in both symmetric and asymmetric CPU configurations.
The changes
The admission-control algorithm bases its decisions on the total CPU capacity provided by the system. In symmetric systems, where all CPUs have the same capacity, that total is simply the number of CPUs multiplied by a constant. Current kernels calculate total capacity this way even on asymmetric systems; all CPUs are assumed to have the capacity of the largest ones. The new code changes this metric in the asymmetric case, causing it to calculate the sum of the actual capacities of all active CPUs.
The deadline scheduler's task-placement code also must gain a better understanding of the system's CPU topology. Before moving a task to a new CPU, the scheduler needs to ensure that the new CPU can handle that task. In asymmetric systems, a new type of a check is needed to find out if the CPU's capacity is sufficient to perform a given task's work before its deadline. This fitness check is performed using the following formula:
(CPU capacity) / 1024 >= (task runtime) / (task deadline)
The default CPU capacity is 1024; lower-performance CPUs have a capacity lower than that. The left-hand side of this formula thus yields a fraction indicating the relative capacity of the CPU in question. For example, if the capacity of a small CPU is 462, that fraction is 462/1024, or 0.45. The formula will admit only tasks whose ratio of run time (which is measured relative to a big CPU) to deadline is less than or equal to 0.45. A task with a run time of 13,000µs and a deadline of 16,000µs will not be admitted, since 13,000/16,000 is 0.81, which is larger.
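As a minimal sketch of that check (written in Python for illustration — the kernel's implementation is in C, and the helper name here is made up), using the numbers from the example above:
    def dl_task_fits(cpu_capacity, runtime_us, deadline_us):
        # (CPU capacity) / 1024 >= (task runtime) / (task deadline), written
        # with multiplications so nothing is lost to integer division.
        return cpu_capacity * deadline_us >= 1024 * runtime_us

    print(dl_task_fits(1024, 13000, 16000))  # True: a big CPU can serve this task
    print(dl_task_fits(462, 13000, 16000))   # False: 0.45 < 13000/16000 = 0.81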
This check is used when waking up a deadline task, moving a deadline task to a different CPU, and migrating a task out of a CPU that is going offline. Eggemann showed some example capacity calculations during the discussion of an earlier version of the patch set.
If a task cannot be served according to its deadline, it will miss that deadline. This can always happen with the deadline scheduler; for example, a task may be admitted successfully, but then one of the CPUs goes offline and the overall capacity of the system is reduced. The patch set changes how this situation is handled in an asymmetric configuration: the scheduler will choose the CPU with the maximum available capacity. If there are several such CPUs, it prefers the current one, if possible, to make use of any remaining cache contents.
Limitations and further work
The work on asymmetric CPU support in the deadline scheduler does not end here. The current patch set only supports the case when there is at least one CPU without any deadline tasks; otherwise task placement may still be incorrect. The case of more heavily loaded systems will need to be addressed later.
During the discussion, Juri Lelli pointed out a possible problem: if a set of small deadline tasks starts first, they will be placed on large CPUs. If a bigger task is then admitted, it may not find a CPU large enough to run on. Luca Abeni (co-author of the patch set) responded that they do have an updated patch where the scheduler places tasks on the smallest CPU that can get the work done. This patch will be submitted later.
The patch set has received positive reviews, and we should expect that this fix will become part of the mainline kernel soon. We can also expect to see more patches in this area as there is more work to do; with more asymmetric CPU architectures popping up, users may require better support of such configurations in their workloads.
A possible end to the FSGSBASE saga
The FSGSBASE patch series is up to its thirteenth version as of late May. It enables some "new" instructions for the x86 architecture, opening the way for a number of significant performance improvements. One might think that such a patch series would be a shoo-in, but FSGSBASE has had a troubled history; meanwhile, the delays in getting it merged may have led to a number of users installing root holes on their Linux systems in the hope of improving security.
"Segments" are a holdover from ancient versions of the x86 architecture; they once were distinct regions of memory used to get around the addressing limitations of that era. Virtual memory has done away with the need for segments, but the concept persists; x86_64 processors only implement two of the original segments (called "FS" and "GS"). In these processors, a "segment" is really just an offset into virtual memory with little other meaning; their remaining value comes from the segment-based addressing mode supported by the CPU.
Historic or not, these segment registers are still used. A common use for FS in user space is thread-local storage; each thread has a unique value of the FS base register pointing to its own storage area. Code running in threads can then use segment-based addressing to access local storage without having to worry about where that storage is. The kernel, instead, uses GS in a similar way for per-CPU data. There are some relics of the kernel's one-time use of FS to indicate the address range accessible to user space, but the kernel's get_fs() and set_fs() functions no longer use that segment.
Modifying the segment registers has always been a privileged operation. There is value, though, in letting user space make use of the FS and GS base registers, so the kernel provides that functionality via the arch_prctl() system call. Since the base registers are actually set by the kernel, privileged code can count on knowing what their contents will be (and that said contents make sense).
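As an illustration of that existing interface, here is a hedged sketch (x86-64 Linux only; the syscall number and the ARCH_GET_FS constant come from the kernel's UAPI headers) that reads back the current thread's FS base through arch_prctl() using ctypes:
    # Sketch: read the current thread's FS base via arch_prctl() on x86-64 Linux.
    import ctypes

    NR_ARCH_PRCTL = 158      # x86-64 syscall number for arch_prctl()
    ARCH_GET_FS   = 0x1003   # from the kernel's asm/prctl.h

    libc = ctypes.CDLL(None, use_errno=True)
    fsbase = ctypes.c_ulong(0)
    if libc.syscall(NR_ARCH_PRCTL, ARCH_GET_FS, ctypes.byref(fsbase)) != 0:
        raise OSError(ctypes.get_errno(), "arch_prctl(ARCH_GET_FS) failed")
    print(hex(fsbase.value))  # points at this thread's thread-local storage area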
FSGSBASE
Calling into the kernel to set a register is a relatively expensive operation, though. If the call needs to be done once to set up a thread-local storage area, nobody will notice the cost, but code needing to make frequent changes to the FS or GS base register will be slowed down by the system-call overhead. Actually setting those registers, which are stored in x86 model-specific registers (MSRs), is somewhat costly in its own right. So Intel added a set of instructions to directly manipulate the FS and GS base registers to the "Ivy Bridge" series of processors in 2012. This set of instructions is often referred to as "FSGSBASE".
Before user space can actually use those instructions, though, the kernel must set a special bit enabling them and, despite the time that has passed since they became available, that bit remains unset. Since the kernel has always had control of those registers, it contains a number of assumptions about their contents; just letting user space change them without preparing the kernel first is a recipe for any of a number of easily exploited vulnerabilities.
Avoiding those problems is conceptually fairly simple, though a bit more complex in the implementation. The kernel must take pains to ensure that the FS and (especially) GS registers have correct values on every entry into kernel space. The handling of certain speculative-execution vulnerabilities gets a bit more complicated. And, of course, a control knob must be provided so that administrators can turn FSGSBASE off if need be.
All it takes is somebody to write this code. Intel was slow to post FSGSBASE patches — and nobody else stepped forward to do that work either. When patches were finally posted, they ran into a number of problems in review and have required numerous revisions. The curious can see this message from Thomas Gleixner for an opinionated timeline of events through March 2019. Version 7 of the patch set, posted in May 2019, got as far as being merged into the x86 subsystem tree before various problems came to light; that merge was subsequently reverted in a rather grumpy way. More recently, Sasha Levin has picked up this work (despite not being an Intel employee) and is trying to get it across the line; he may yet succeed for the 5.9 development cycle.
Root holes enter a vacuum
The development of this seemingly simple feature has been a rather long and fraught process; during all of this time, users have been unable to take advantage of it. But users, being users, have proved unwilling to wait. One of the use cases that has created the most pressure is Intel's "Software Guard Extensions" (SGX), which is meant to allow the creation of private "enclaves" to protect privileged code and data. SGX support for the kernel has had its own difficult history and remains unmerged, so developers wanting to figure out how to make use of this feature have been working entirely out of tree.
One of the most prominent projects in this area is Graphene, which describes itself as a "library OS" for secure applications, with SGX support as a headline feature.
Graphene got its start as a research project, but has since received a fair amount of support from companies, including Intel. The project has ambitions for being the standard SGX support platform, and some cloud providers are evidently looking at supporting it — with Intel's blessing.
There is just one little problem with Graphene: working with SGX requires the ability to modify the FS base register frequently. To keep calls into enclaves fast, Graphene loads a little kernel module that enables the FSGSBASE instructions. Since the kernel is not prepared for this, that action immediately opens up a root hole on the system involved — just what one wants to see from a system that is supposed to bring heightened security. Graphene is not alone in this behavior; the Occlum SGX library does the same thing, for example.
It is fair to say that the kernel-development community was, as a whole, unimpressed by this approach to the problem. Don Porter, one of the creators of Graphene, tried to justify enabling FSGSBASE behind the kernel's back by pointing out that SGX projects assume that the host operating system is compromised; SGX exists to protect data in just that situation, after all. Extending that philosophy to compromising the system from the outset, though, is still a hard sell.
In the end, kernel developers can usually understand the idea of using this kind of hack to get a problem out of the way while addressing other issues. The fact that there is no warning to be found on Graphene's flashy web site, or in the papers describing Graphene, that installing the code compromises the system is harder to swallow. There is even, as Levin pointed out, a book called Responsible Genomic Data Sharing that suggests using Graphene, which seems not entirely responsible. After some discussion, Porter came around to the idea that some high-profile warnings are needed to keep potential users from opening up their systems in the name of "security".
Warnings are a step in the right direction, but the proper way to address this problem is to get the FSGSBASE patches into the kernel so that the other hacks are no longer necessary. As an added benefit, these patches make the kernel faster too, since the new instructions are faster than performing operations on MSRs. Proper FSGSBASE support should, thus, make almost everybody happier.
As noted, that will hopefully happen soon. This whole long affair has left a bit of a bad taste in many developers' mouths, though; there is some overt unhappiness with how Intel has handled the situation. Straightening that out may take longer.
Development statistics for the 5.7 kernel
The 5.7 kernel was released on May 31. By all appearances this was a normal development cycle, unaffected by the troubles in the wider world. Still, there are things to be learned by looking at where the code came from this time around. Read on for LWN's traditional look at who contributed to 5.7, who supported that work, and the paths by which it got into the mainline.
Work on 5.7 arrived in the form of 13,901 non-merge changesets contributed by 1,878 developers; that makes it rather busier than the 5.6 cycle was. It's notable that 281 of those developers made their first contribution to the kernel for 5.7, the highest number since 5.0; that is a distinct contrast from 5.6, which saw the lowest number of new contributors since 2013. Perhaps being made to stay at home has inspired more people to put together and send in that first kernel patch.
The most active developers contributing to 5.7 were:
Most active 5.7 developers
By changesets:
  Gustavo A. R. Silva       235   1.7%
  Chris Wilson              231   1.7%
  Geert Uytterhoeven        161   1.2%
  Christoph Hellwig         138   1.0%
  Sean Christopherson       137   1.0%
  Takashi Iwai              132   0.9%
  Mauro Carvalho Chehab     129   0.9%
  Anson Huang               109   0.8%
  Al Viro                   108   0.8%
  Andy Shevchenko           101   0.7%
  Ville Syrjälä              98   0.7%
  Kuninori Morimoto          96   0.7%
  Jani Nikula                95   0.7%
  Thomas Gleixner            91   0.7%
  Colin Ian King             91   0.7%
  Masahiro Yamada            90   0.6%
  Lorenzo Bianconi           90   0.6%
  Jakub Kicinski             86   0.6%
  Ard Biesheuvel             85   0.6%
  Josef Bacik                83   0.6%
By changed lines:
  Greg Kroah-Hartman      41035   6.4%
  Alex Elder              14405   2.3%
  Chris Packham           10886   1.7%
  Mauro Carvalho Chehab   10355   1.6%
  Chris Wilson             7931   1.2%
  Jani Nikula              7719   1.2%
  Marc Zyngier             7659   1.2%
  Srujana Challa           7537   1.2%
  Namjae Jeon              7269   1.1%
  Manivannan Sadhasivam    6836   1.1%
  Jyri Sarha               5622   0.9%
  Linus Walleij            5056   0.8%
  Christoph Hellwig        4957   0.8%
  Laurent Pinchart         4781   0.7%
  Taniya Das               4714   0.7%
  Paul Blakey              4367   0.7%
  Dmitry Bogdanov          4328   0.7%
  Vladimir Oltean          4210   0.7%
  Jerome Brunet            3973   0.6%
  Maxime Jourdan           3921   0.6%
Gustavo A. R. Silva's place at the top of the "by changesets" column is almost entirely a result of his ongoing mission to replace zero-length arrays in structures with flexible arrays throughout the kernel; see this commit for a typical example of this work. Chris Wilson worked exclusively on the Intel i915 graphics driver, Geert Uytterhoeven contributed changes throughout various driver subsystems, Christoph Hellwig worked extensively in the XFS, SCSI, and block subsystems (and beyond), and Sean Christopherson continues to do a lot of work with the KVM hypervisor.
When Greg Kroah-Hartman gets to the top of the "lines changed" column, it usually means he's been busy deleting code, and that is certainly the case this time; he removed the exFAT filesystem (which reappeared later in the filesystem tree) and the wireless USB and UWB drivers from the staging tree. Alex Elder contributed the Qualcomm "IP accelerator" network driver, Chris Packham restored the Octeon USB and Ethernet drivers to the staging tree, and Mauro Carvalho Chehab converted untold numbers of documents to the RST format.
Work in 5.7 was supported by 215 employers that we were able to identify. The most active of those employers were:
Most active 5.7 employers
By changesets:
  Intel                    1682  12.1%
  (Unknown)                1202   8.6%
  Red Hat                   986   7.1%
  (None)                    788   5.7%
  SUSE                      548   3.9%
                            514   3.7%
  Huawei Technologies       508   3.7%
  Mellanox                  492   3.5%
  AMD                       491   3.5%
  Linaro                    412   3.0%
  (Consultant)              386   2.8%
  NXP Semiconductors        374   2.7%
  IBM                       371   2.7%
  Renesas Electronics       315   2.3%
  Linux Foundation          310   2.2%
  Arm                       278   2.0%
                            192   1.4%
  Code Aurora Forum         181   1.3%
  Oracle                    176   1.3%
  Texas Instruments         175   1.3%
By lines changed:
  Intel                   69584  10.9%
  Linux Foundation        45153   7.1%
  Linaro                  44649   7.0%
  (Unknown)               40631   6.4%
  Red Hat                 33022   5.2%
  (None)                  20662   3.2%
                          19940   3.1%
  (Consultant)            19425   3.0%
  Mellanox                19317   3.0%
  SUSE                    19127   3.0%
  Huawei Technologies     18301   2.9%
  Code Aurora Forum       17861   2.8%
  Marvell                 17833   2.8%
  Texas Instruments       17314   2.7%
  IBM                     14973   2.3%
  NXP Semiconductors      14223   2.2%
  AMD                     12787   2.0%
  BayLibre                11445   1.8%
  Samsung                 11238   1.8%
  Allied Telesis          11029   1.7%
As usual, there is little change in this table from one development cycle to the next.
How those changes get into the kernel
The days when developers would send their patches directly to Linus Torvalds are long past; almost all work passes through the hands of one or more subsystem maintainers first. Every maintainer who handles a patch up to (and including) its merging into a subsystem repository will add a Signed-off-by tag to that patch. That helps to identify the chain that got a particular patch into the mainline; it is also useful for generating metrics about who is managing patches in general. A Signed-off-by tag from a developer who is not the author of a patch is almost always indicative of subsystem-maintainer activity, so looking at such tags tells us who those maintainers are.
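Such tags are easy to tally with a bit of scripting; the following rough sketch (not the tooling actually used for these statistics) counts non-author Signed-off-by tags over a release range, assuming it is run inside a kernel repository that has both tags:
    # Rough sketch: count non-author Signed-off-by tags between two releases.
    import collections, subprocess

    log = subprocess.run(
        ["git", "log", "--no-merges", "v5.6..v5.7", "--format=%x1e%ae%x1f%B"],
        capture_output=True, text=True, errors="replace", check=True).stdout

    signoffs = collections.Counter()
    for record in log.split("\x1e")[1:]:
        author_email, body = record.split("\x1f", 1)
        for line in body.splitlines():
            if line.lower().startswith("signed-off-by:"):
                signer = line.split(":", 1)[1].strip()
                if author_email.lower() not in signer.lower():
                    signoffs[signer] += 1

    for signer, count in signoffs.most_common(20):
        print("%6d  %s" % (count, signer))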
In 5.7 the busiest maintainers (as measured by non-author Signed-off-by tags) and the employers that supported them were:
Non-author signoffs in 5.7
Developers:
  David S. Miller          1531  11.6%
  Greg Kroah-Hartman        811   6.1%
  Mark Brown                538   4.1%
  Alex Deucher              429   3.2%
  Andrew Morton             401   3.0%
  Martin K. Petersen        278   2.1%
  Jens Axboe                250   1.9%
  Mauro Carvalho Chehab     236   1.8%
  Paolo Bonzini             235   1.8%
  Shawn Guo                 213   1.6%
  David Sterba              196   1.5%
  Herbert Xu                170   1.3%
  Michael Ellerman          169   1.3%
  Alexei Starovoitov        168   1.3%
  Saeed Mahameed            158   1.2%
  Vinod Koul                158   1.2%
  Hans Verkuil              157   1.2%
  Ingo Molnar               146   1.1%
  Jason Gunthorpe           145   1.1%
  Thomas Gleixner           143   1.1%
Employers:
  Red Hat                  2560  19.3%
  Linaro                   1377  10.4%
  Intel                     986   7.4%
  Linux Foundation          878   6.6%
                            787   5.9%
  Huawei Technologies       488   3.7%
  Mellanox                  486   3.7%
  SUSE                      486   3.7%
                            465   3.5%
  AMD                       463   3.5%
  (None)                    440   3.3%
  Oracle                    411   3.1%
  IBM                       347   2.6%
  Texas Instruments         231   1.7%
  Arm                       231   1.7%
  Code Aurora Forum         213   1.6%
  (Unknown)                 200   1.5%
  Qualcomm                  159   1.2%
  (Consultant)              158   1.2%
  Cisco                     158   1.2%
There may be over 200 companies supporting work on the Linux kernel, but it is still true that half of the patches going into the kernel pass through the hands of gatekeepers working for just five companies.
Once a patch lands in a subsystem Git repository, the trail of Signed-off-by tags ends. With a bit of digging, though, one can still find the traces that are left when a branch from one repository is merged into another; that allows the generation of a picture showing the paths taken by patches once they are applied. The result is a complicated graph, a small piece of which looks like this:
Click on the image above to see the entire graph in its full, eye-straining glory.
One picture that emerges quickly from the graph is that it is still fairly flat. Some large subsystems go through a few layers of maintainers, but there are a lot of trees that feed directly into the mainline. Developers may not send their patches straight to Torvalds anymore, but subsystem maintainers still do.
There is another thing to be seen in this plot. Maintainers are supposed to apply cryptographic signatures to their tags before pushing changes upstream; that is how the recipient knows that the pull request actually came from the person it appears to. Recently, Torvalds asked a maintainer to start using signed tags, and added that "I've been encouraging people to do that even on kernel.org, and we've got fairly high coverage these days". One is naturally led to wonder how high that coverage actually is.
In the plot, trees that are not using signed tags to push changes upstream are colored red; one sees that there are still quite a few of them. Of the 121 trees that Torvalds pulled from during the 5.7 development cycle, 101 were using signed tags, while 20 were not, so that coverage is just over 83%. That only looks at trees directly pulled by Torvalds, though. If one looks at the whole picture, there are 214 subsystem trees involved, of which 167 are using signed tags — 78% of the total. So coverage may be "fairly high", but it is certainly not universal yet.
But, then, the kernel development process has always been a work in progress; like the kernel itself, it will never reach a point where it can be deemed to be "perfect". Meanwhile, it continues to integrate changes at a high rate and release the kernels that we all depend on; perfection is not needed to be good enough most of the time.
Merkle trees and build systems
In traditional build tools like Make, targets and dependencies are always files. Imagine if you could specify an entire tree (directory) as a dependency: You could exhaustively specify a "build root" filesystem containing the toolchain used for building some target as a dependency of that target. Similarly, a rule that creates that build root would have the tree as its target. Using Merkle trees as first-class citizens in a build system gives great flexibility and many optimization opportunities. In this article I'll explore this idea using OSTree, Ninja, and Python.
OSTree
OSTree is like Git, but for storing entire filesystem images such as a complete Linux system. OSTree stores more metadata about its files than Git does: ownership, complete permissions (Git only remembers whether or not a file is executable), and extended attributes ("xattrs"). Like Git, it doesn't store timestamps. OSTree is used by Flatpak, rpm-ostree from Project Atomic/CoreOS, and GNOME Continuous, which is where OSTree was born.
My company has been using OSTree to build and roll out software updates to Linux-based devices for the last four years. OSTree provides deployment tools for distributing images to different machines, deploying or rolling back an image atomically, managing changes to /etc, and so on, but in this article I'll focus on using OSTree for its data model.
Like Git, OSTree stores files in a "Content Addressable Store", which means that you can retrieve the contents of a file if you know the checksum of those contents. OSTree uses SHA-256, but I will use "SHA" and "checksum" interchangeably. This store or "repository" is a directory in the filesystem (for example "ostree/") where each file tracked by OSTree (a "blob" in Git terminology) is stored under ostree/objects/ as a file whose filename is the SHA of its contents. This is something of a simplification because file ownership, permissions, and xattrs are also reflected in the checksum.
A "tree" (directory) is stored as a file that contains a list of files and sub-trees, and their SHAs. The filename of this file, just like for blobs, is the SHA of its contents. This way the entire tree, including its sub-trees and their sub-trees, and the contents of each of the files within, can be uniquely identified by a single SHA. This data structure is called a Merkle tree.
You can have different versions of a tree, like Git commits or Git branches, or completely separate trees, but any common files are stored only once in the OSTree repository (in the figure above, file2.txt and d/file3.txt are identical so they are stored only once in ostree/objects/). Like Git, OSTree has "commit" and "checkout" operations.
OSTree "refs" (short for "references"), similar to Git refs, are how OSTree implements branches and tags. A ref is a metadata file in the OSTree repository: Its filename is whatever you want it to be, such as the branch or tag name, and its content is a single SHA pointing at a tree. The connection to the tree is indirect as it is really the SHA of a "commit" which in turn points at a tree, but in this article I'll ignore commits as they aren't directly relevant.
OSTree + Ninja
Ninja is a build system similar to Make. I covered Ninja for LWN three years ago. Unlike Make, Ninja doesn't support wildcards or loops, so you're supposed to write a "configure" script to generate a Ninja file that specifies each build target explicitly. At my company, the internal build system is a 3,000-line Python file, plus several dozen YAML files with packaging instructions for various components; when run, this Python script generates a 90,000-line Ninja file.
In Ninja (like Make) build targets and inputs are files. OSTree refs are also files. The build system, then, creates a different ref for each build step, and the ref itself is the target (output) of that build rule. For example, the target of a generated Ninja rule might be the file "build/ostree/refs/heads/xyz", where "build/ostree" is the OSTree repository in the "build" output directory, and "xyz" is the ref name.
Here's a concrete example from the build system, where it builds a rootfs for the Linux devices:
    rootfs = ostree_combine([
        l4t_kernel(),
        bionic_userspace(),
        package("a"),
        package("b"),
        container("c"),
    ])
    phony("rootfs", rootfs)
    default(rootfs, ...)
Each of l4t_kernel(), bionic_userspace(), package(), and container() is a Python function that creates an OSTree tree (perhaps by downloading and unpacking a tarball, or by running an upstream makefile, the details don't matter right now), then creates an OSTree ref pointing at this tree, and returns the ref (which is a Python string generated internally, perhaps "build/ostree/refs/heads/package/a").
ostree_combine() is a Python function that takes any number of OSTree refs, each of which points at a tree; it combines them together into a single tree, creates another ref pointing at this tree, and returns the ref (this time it might be called "build/ostree/refs/heads/ostree_combine/c85e333f577b").
But I lie — these functions don't do any of this at all. What they do do is write a Ninja rule that, when invoked via ninja, will carry out those steps.
When you run ninja in an incremental build, if any of the input refs changes — remember, a ref contains a SHA pointing at a tree, so if any file in the tree changes then the ref's content will change — then Ninja will know that the rootfs target is out of date and the ostree_combine rule needs to be re-run.
Crucially, you never need to write out any of these ref filenames explicitly in the build script; you pass them around like any normal Python variable, and other functions can take them and record them as dependencies in their own Ninja rules.
In the example above we also create a Ninja phony rule to create a top-level target name that is convenient to type at the ninja command line, and we add it to Ninja's list of default targets.
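To make the pattern concrete, here is a hedged sketch of what such a generator might look like (the real build system is considerably more involved, and the rule name and paths here are invented); the important part is that the emitted build statement's output and inputs are all OSTree ref files:
    import hashlib

    def ostree_combine(ninja_file, input_refs):
        # ninja_file is an open, writable handle on the generated build.ninja.
        # Name the output ref after its inputs, so identical combinations are shared.
        digest = hashlib.sha256("\n".join(sorted(input_refs)).encode()).hexdigest()[:12]
        output_ref = "build/ostree/refs/heads/ostree_combine/" + digest
        # Emit a Ninja build statement; "ostree-combine" is assumed to be a rule,
        # declared elsewhere, that merges the input trees and updates the output ref.
        ninja_file.write("build %s: ostree-combine %s\n"
                         % (output_ref, " ".join(input_refs)))
        return output_ref   # passed around like any other Python value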
Benefits
The ability to specify entire trees as build dependencies or targets means that the same mechanism can be used for specifying coarse-grained dependencies (such as third-party packages that are being integrated into the rootfs) as for fine-grained dependencies (individual files). One obvious benefit is that toolchains and build environments can be managed explicitly: The rule to compile something can take a "build root" rootfs as one of its dependencies, and chroot into it to run the compilation command (some literature calls this "hermetic builds"). This build rootfs can itself be created by other rules.
Less obvious, but possibly the best thing about this approach, is the ability to pass intermediate build outputs around as variables. We saw an example in the Python snippet earlier, where package("a") returns a target name that we passed into another rule, ostree_combine(). This means you don't have to come up with a name for every single intermediate artifact; you can generate them automatically. The composability leads to concise and readable build scripts: the example above is not at all contrived, it is very similar to the production build script. By making this easy to express, it is easier to exploit opportunities to cache or parallelize steps in the build.
To provide a (somewhat contrived) example: Generating the ldconfig cache only depends on the contents of a few directories like /lib and /usr/lib; similarly, the mandb cache only depends on the contents of /usr/share/man.
Traditionally, these operations are run in series. But a build system that can define a dependency that is a subtree of a previous target could specify separate rules for these operations, run them in parallel, and then merge the results back into a final rootfs tree. In this example, even if the rootfs SHA changes, it's possible that the /usr/share/man subtree hasn't changed, so there's no need to re-run mandb. In the diagram above, red arrows operate on data (file contents); green arrows operate on OSTree metadata.
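In build-script terms, that example might look something like the following sketch; every helper name here (ostree_subtree(), gen_mandb(), gen_ldconfig()) is hypothetical, invented for illustration:
    man_pages = ostree_subtree(rootfs, "/usr/share/man")       # ref for just that subtree
    libs      = ostree_combine([ostree_subtree(rootfs, "/lib"),
                                ostree_subtree(rootfs, "/usr/lib")])
    man_cache = gen_mandb(man_pages)      # re-runs only if /usr/share/man changed
    ld_cache  = gen_ldconfig(libs)        # runs in parallel with gen_mandb
    final_rootfs = ostree_combine([rootfs, man_cache, ld_cache])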
[Update (added paragraph)] You can imagine there are applications of this to "cloud" build systems that farm out the execution of individual build steps to remote build servers: To run this mandb rule, a remote server doesn't need to download the entire rootfs, only /usr/share/man. Merging the output from this build step back into the rootfs can be done by operating solely on Merkle tree metadata.
OSTree's tooling also gives a few compelling benefits that really need to be pointed out, but you can get them by exporting (committing) the final build artifacts to an OSTree repository (you don't need the close OSTree integration, throughout the intermediate build steps, that I have been describing):
- Visibility of changes: My company's continuous integration (CI) system runs ostree diff on each pull request, so that the developers can see exactly which files changed in the output rootfs. This is a wonderful tool for gaining confidence in the correctness of the incremental builds.
- Fast incremental deployment: OSTree provides tools for deploying a tree to a remote device. This is used to deploy changes to devices in the field ("over the air" software updates) but this same, production software update process is fast enough for interactive development (an incremental build + deploy + reboot in under a minute).
Implementation details
Our Python script has various functions for getting files, tarballs, Git snapshots, and apt packages into OSTree. A tree can consist of as little as a single file, and refs are cheap.
There are also various functions to manipulate trees, such as ostree_combine() in the example above, but also ostree_ln(), ostree_mkdir(), and ostree_mv(). These are fast because they operate directly on OSTree metadata; they don't need to do ostree checkout to manipulate the trees. Note that a ref can point to any tree, it doesn't have to be rooted at the "/" of your final image.
To run a command, such as a compilation, there is ostree_mod(), which modifies a tree by running a given command. It will check out the specified tree, optionally chroot into it, run the specified command, and create a new tree from the output. For example:
    ostree_mod(
        input_tree=ostree_combine([build_root, src]),
        command="make -C /src install DESTDIR=/dest",
        chroot=True,
        capture_output_subdir="/dest")
This uses fakeroot and bubblewrap to sandbox the command so that it can't access anything outside of the input tree. Bubblewrap is a tool born from Red Hat's Project Atomic, and used by Flatpak among others, that allows unprivileged users to create secure sandboxes. Here bubblewrap is not used for security, but as a convenient way of ensuring correct, "hermetic" builds. Our version of fakeroot is heavily patched so that the build command sees the file permissions that are stored by OSTree; this allows us to run the build as an unprivileged user but still modify root-owned files.
OSTree's "bare" repository format is used, which means that the checkout operation only needs to create hard-links to the relevant files inside the repository; this needs to be fast because every build rule that calls ostree_mod() involves an ostree checkout and an ostree commit. OverlayFS is used to ensure that the OSTree repository is not modified by accident via those hard-links. This patch for bubblewrap is needed to support OverlayFS; the patch probably isn't upstreamable because it requires additional capabilities, which is at odds with the bubblewrap project's security goals. There are also several OSTree patches, some of which are merged and some not (yet).
apt2ostree
apt2ostree is a tool that has been extracted from our build system. It builds a Debian/Ubuntu rootfs from a list of .deb packages — much like debootstrap or multistrap. Unlike those tools, the output is an OSTree tree rather than a normal directory. It is faster, parallelized, and incremental. It also records package versions in a "lockfile" for reproducible builds.
From a list of .deb package names, apt2ostree downloads and unpacks each package into its own OSTree tree, then it combines these into a single tree (so far this is equivalent to debootstrap's "stage 1"). It then checks out the tree, runs dpkg --configure -a within a chroot ("stage 2"), and commits the result to OSTree.
From a list of packages, apt2ostree performs dependency resolution (via aptly) and generates a "lockfile" that contains the complete list of all packages, their versions, and their SHAs. This lockfile can be committed to Git. Builds from the lockfile are functionally reproducible.
"Stage 1" of apt2ostree is fast for several reasons. It only downloads and extracts any given package once; if it is used in multiple images it doesn't need to be extracted again. This saves disk space too, because the contents of the packages are committed to OSTree so they will share disk space with the built images. Downloading and extracting is done in a separate Ninja rule per package; this allows parallelism (it can be downloading one package at the same time as compiling a second image, or performing other build tasks within the larger build system, all thanks to Ninja) and incremental builds (there is no need to repeat work if the package version hasn't changed). Combining the contents of the packages is fast because it only touches OSTree metadata.
apt2ostree only has a single user as far as I know (my company's build system). See the README file in the apt2ostree repository on GitHub for more information. I don't necessarily expect anyone to use it, but it serves as a good self-contained example of the techniques described in this article.
Conclusions and acknowledgments
We have found that OSTree and Ninja work very well together, thanks to a neat hack: Using a "ref" (a file in the OSTree metadata directory) as the target or dependency of a Ninja rule, to track changes to an entire tree. But most important, I think, is the idea of trees as first-class citizens in a build system. For researchers, OSTree and Ninja provide an easy way to explore these ideas. For production, we have also found OSTree and Ninja to work fantastically well for our use case: system integrators building container images and rootfs images for embedded Linux devices.
Most of these ideas (the good ones, at least) are from my colleague William Manley, who also did most of the implementation of the build system. I merely wrote it up.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Linux 5.7; Devuan 3.0; FreeNAS Linux; Firefox 77; Quotes; ...
- Announcements: Newsletters; conferences; security updates; kernel patches; ...