Leading items
Welcome to the LWN.net Weekly Edition for December 20, 2018
This edition contains the following feature content:
- A 2018 retrospective: our look back at the year that was.
- Python gets a new governance model: the long discussion has come to an end, and the Python project has picked a steering-council model going forward.
- Handling the Kubernetes symbolic link vulnerability: how a serious security vulnerability was diagnosed and fixed.
- Relief for retpoline pain: attempts to claw back some of the performance lost to Spectre mitigation.
- Linux in mixed-criticality systems: another attempt to make Linux available on systems handling safety-critical tasks.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
A 2018 retrospective
The December 20 LWN.net Weekly Edition is the final one for the year; as usual, we will be taking the last week of the year off for a brief rest. LWN, which is about to conclude its 21st year of publication, has had the time to build up some traditions, one of which is a year-end retrospective that evaluates the predictions we made back in January. As usual, some of those predictions aged rather better than others; read on for our report card.
Revisiting the predictions
We started with a vague prediction that there would be an increase in introspection as we think about what our projects and our industry should be trying to achieve in the world. On the industry side that has certainly happened; technology companies have lost their halo and find themselves under increasing levels of scrutiny, and some have had to change course as a result. Whether that process has extended to the free-software community is debatable, though. Some actions, such as the removal of the questionable Speck encryption algorithm from the kernel, show concern about the world we are creating, but there aren't a huge number of examples to point to.
The prediction that we would see more hardware vulnerabilities was published before the disclosure of Meltdown and Spectre, so we could perhaps claim a major success there. By January 2, though, when that article was published, it was obvious that something of that nature was about to surface, so the amount of credit that is due is limited. Whether the level of interest in open hardware has increased as a result is not really clear. The RISC-V architecture has indeed seen such an increase, but that may be more a result of commercial forces than security concerns, and whether RISC-V processors will truly be more secure has yet to be proven.
Concerns about a major security incident at a cloud provider were not realized — so far as we know. Even the various rounds of hardware vulnerabilities appear to have been handled well enough. Serious security breaches continued to surface elsewhere, of course, to nobody's surprise.
Work on alternative container runtimes continued as predicted, as did blockchain hype; no surprises there. The collapse in the prices of many cryptocurrencies and the lack of other convincing blockchain applications suggest that some of the shine has come off of blockchain-based technologies, though.
The prediction that vendors would move closer to mainline kernels was a bit controversial at the time, but things do indeed seem to be moving in that direction. The Android project, in particular, is working hard to make that happen. We are still some distance from being able to run mainline kernels on our mobile devices, but there is a glimmer of hope on the horizon.
Did alternative kernels like Fuchsia gain prominence this year? That may or may not have happened, but those projects have certainly not gone away. It is still true that Linux developers take our domination of small, embedded systems for granted; they can be heard saying that no vendor will want to bother qualifying another kernel for its hardware. Perhaps that is true, but perhaps that is the pride that goes before the fall in that piece of the market.
Wayland support did grow as predicted, but "the long reign of the X Window System" seems far from an end. It probably is true that Python 3 adoption has reached a turning point; those who are still running Python 2 applications are generally thinking seriously about moving forward.
What was missed?
One thing that was certainly missing from the list of predictions was that Linus Torvalds would take a one-month break from running the kernel project and a new code of conduct would be adopted. These changes came out of the blue and have caused quite a bit of controversy both within the kernel project and beyond it. Even developers who have no particular disagreement with the code that was adopted are somewhat unhappy with the way that change came about. So far, though, the sky has not fallen, and kernel development continues pretty much as it did before.
Perhaps Microsoft's rebasing of its Edge browser on top of the Chromium engine could have been foreseen, but we didn't see it coming. The result is a browser ecosystem that increasingly looks like a monoculture, with all of the associated risks. Due to some of the forces mentioned above, the company that now appears to control much of the web (and beyond) is not quite as well trusted as it once was. Once again, Firefox is our main line of defense against this scenario; perhaps the project can come through for us again.
The era of big-money stock deals involving Linux seemed to have ended with the dotcom crash, but then IBM surprised the world by buying Red Hat for $34 billion. It could well be that Red Hat will bring value to IBM that far exceeds the massive premium above the company's valuation that IBM agreed to pay. Or maybe we are seeing another occasion where history rhymes with itself, and an over-the-top valuation of a free-software company is the harbinger of the end of a long, technology-driven boom.
The creation of new free-software licenses has been frowned upon for many years, so one might be forgiven for thinking that innovation in this area had stopped. This innovation resurfaced this year, though, as companies created new licenses intended to increase their incoming revenue from their flagship projects while keeping those projects as something that at least vaguely looks like free software. It is fair to say that these new licenses have gotten a chilly reception, but it is probably also fair to say that this will not stop companies from trying to craft licenses in ways that will improve their bottom line.
Nobody expects the Spanish Inquisition, and almost nobody expected Guido van Rossum's abrupt abdication, in July, of his role as the benevolent dictator for life of the Python project. As a result, the project was unprepared for this change and spent the last six months figuring out how its governance will work going forward. Python the language has never been stronger; Python the project is in no danger, but its developers clearly have some things to work out.
All told, it would be hard to argue that this was not yet another good year for Linux and free software — like the years before it. LWN has been privileged to enjoy a ringside seat for 21 years of this community's growth. We owe this privilege to an engaged and supportive community of readers; no publication could ever ask for a better audience. As this year comes to a close, we would like to thank you all for your continued support of LWN, and to wish you the best of the holiday season, however you may celebrate it.
Python gets a new governance model
Back in late October, when we looked in on the Python governance question, which came about due to the resignation of Guido van Rossum, things seemed to be mostly set for a vote in late November. There were six Python Enhancement Proposals (PEPs) under consideration that would be ranked by voters in a two-week period ending December 1; instant-runoff voting would be used to determine the winner. In the interim, though, much of that changed; the voting period, winner-determination mechanism, and number of PEPs under consideration are all different. But the voting concluded on December 16 and a winner has been declared; PEP 8016 ("The Steering Council Model"), which was added to the mix in early November, came out on top.
Right around the time of our previous article, a new thread was started on the Python committers Discourse instance to discuss the pros and cons of various voting systems. Instant-runoff voting fell out of favor; there were concerns that it didn't truly represent the will of the electorate, as seen in a Burlington, Vermont mayoral election in 2009, for example. The fact that it was put in place by fiat, under a self-imposed deadline, based on in-person conversations at the core developer sprint rather than being hashed out on the Discourse instance or the python-committers mailing list, may have also been a factor, as Nathaniel J. Smith pointed out at the time.
Donald Stufft put together a lengthy summary of many of the different voting systems along with their good and bad attributes. No one had any interest in using "plurality voting" (also known as "first past the post"), but opinions differed on other possibilities. Eventually, PEP 8001 ("Python Governance Voting Process") was changed to use the Condorcet method to determine the winner. A tie or cycle, which are both possible—though unlikely—under the Condorcet method, would result in a re-vote among the tied options. Condorcet has been used by Debian and other projects for many years, which is part of the reason consensus formed around that method.
The winner
In the end, Condorcet led to an election where the results were clear without any real ambiguity about them. As Tim Peters, who was one of the more active developers in the voting-methodology discussion, noted: "Not only was there a flat-out Condorcet ('beats all') winner, but if we throw that winner out, there's also a flat-out Condorcet winner among the 7 remaining - and so on, all the way down to 'further discussion'." Given that the pool of voters was fairly small, 94, and that only 62 people actually voted, there could have been some far messier outcomes.
It is perhaps not surprising that a late entrant into the election was the clear winner. Smith and Stufft authored the PEP; it likely benefited from the discussion of the other PEPs and the changes that were made to them along the way. It also doesn't hurt that it is explicitly intended to be boring, simple, and flexible.
As with most of the other proposals, PEP 8016 creates a council. Various sizes were proposed in the other PEPs, but the steering council of PEP 8016 consists of five people elected by the core team. The definition of the core team is somewhat different from today's set of core developers or committers; the PEP explicitly states that roles other than "developer" could qualify for the core team. Becoming a member of the team simply requires a two-thirds majority vote of the existing members—and no veto by the steering council.
The veto is not well specified in the PEP and was the subject of a question during the discussion process. According to Smith, that idea came from the Django governance document, which was a major influence on the PEP. It is hoped that it would never have to be used, "but there are situations when the alternatives are even worse". There is also an escape hatch if it turns out that a core team member needs to be removed; a super-majority of four council members can vote to do so.
The steering council
The council is imbued with "broad authority to make decisions about the project", but the goal is that it uses that authority rarely; it is meant to delegate its authority broadly. The PEP says that the council should seek consensus, rather than dictate, and that it should define a standard PEP decision-making process that will (hopefully) rarely need council votes to resolve. It is, however, the "court of final appeal" for decisions affecting the language. But the council cannot change the governance PEP; that can only happen via a two-thirds vote of the core team.
The mandate for the council is focused on things like the quality and stability of Python and the CPython implementation, as well as ensuring that contributing to the project is easy so that contributions will continue to flow into it. Beyond that, maintaining the relationship between the core team and the Python Software Foundation (PSF) is another piece of the puzzle.
Steering council members will serve for the length of a single Python feature release; after each release, a new council will be elected. Candidates must be nominated by a core team member; "approval voting" will be used to choose the new council. Each core team member can anonymously vote for zero to five of the candidates; the five with the highest total number of votes will serve on the new council, with ties decided by agreement between the tied candidates or by random choice.
There are some conflict-of-interest rules as well: "While we trust council members to act in the best interests of Python rather than themselves or their employers, the mere appearance of any one company dominating Python development could itself be harmful and erode trust." So no more than two council members can be from the same company; if a third person from the company is elected, they are disqualified and the next highest vote-getter moves up. If the situation arises during the council's term (e.g. a change in employer or an acquisition), enough council members must resign to ensure this makeup. Vacancies on the council (for this or any other reason) will be filled by a vote of the council.
In the event of core team dissatisfaction with the council, a no-confidence vote can be held. A member of the core team can call for such a vote; if any other member seconds the call, a vote is held. The vote can either target a single member of the council or the council as a whole. If two-thirds of the core team votes for no confidence, the councilperson or council is removed. In the latter case, a new council election is immediately triggered.
Some of the other PEPs specified things such as how PEPs would be decided upon or placed various restrictions on who could serve on the council. Victor Stinner's summary of the seven proposals gives a nice overview of the commonalities and differences between them. Many were fairly similar at a high level, most obviously varying on the size of the council, though there are other substantive differences, of course. PEP 8010 ("The Technical Leader Governance Model"), which more or less preserved the "benevolent dictator" model, and PEP 8012 ("The Community Governance Model"), which did not have a central authority, were both outliers. It is interesting to note that 8012 came in second in the voting, while 8010 was one of the least favored governance options.
Another election
Next up is the council election. There are two phases, each of which will last two weeks; first is a nomination period, followed by the actual voting. Van Rossum has not ridden off into the sunset as some might have thought; he was active in some of the threads leading up to the governance election and was the first to start organizing the council election. He asked that the process start after the new year to give folks some time to relax over the holidays. Smith agreed, noting that starting on January 6 would lead to the actual vote starting January 20 and a council elected on February 3.
Overall, the process has gone fairly smoothly since Van Rossum stepped down and the first steps toward new governance were taken back in July. There would seem to be plenty of good candidates for the council, many of whom were active in the governance discussions. The first incarnation of the council will have lots of things to decide, including the PEP approval process, but it won't have all that much time to do so. Instead of the usual 18-month cycle, the council will serve an abbreviated term until Python 3.8 is released, which is currently scheduled for October 2019. The council elected after that should have a full 18 months, unless, of course, the release cadence is shortened. It will all be interesting to watch play out; once again, stay tuned.
Handling the Kubernetes symbolic link vulnerability
A year-old bug in Kubernetes was the topic of a talk given by Michelle Au and Jan Šafránek at KubeCon + CloudNativeCon North America, which was held mid-December in Seattle. In the talk, they looked at the details of the bug and the response from the Kubernetes product security team (PST). While the bug was fairly straightforward, it was surprisingly hard to fix. The whole process also provided experience that will help improve vulnerability handling in the future.
The Kubernetes project first became aware of the problem from a GitHub issue that was created on November 30, 2017. It gave full details of the bug and was posted publicly. That is not the proper channel for reporting Kubernetes security bugs, Au stressed. Luckily, a security team member saw the bug report and cleared out all of the details, moving it to a private issue tracker. There is a documented disclosure process for the project that anyone finding a security problem should follow, she said.
Background
In order to understand the bug, some background on volumes in Kubernetes is needed, much of which can also be found in a blog post by Au and Šafránek. They put up an example pod spec (which can be seen in the slides [PDF]) that was using a volume. When that pod gets scheduled to a node, the volume will be mounted so that containers in the pod can access it. Each container can specify where the volume will be mounted inside the container. A directory is created on the host filesystem where the volume is mounted.
When the container starts, the kubelet node manager needs to tell the container runtime about the volume; both the host filesystem path and the location in the container where it should be bind mounted are needed. In addition, a container can specify a subdirectory of the location where the volume is mounted in the container using the subPath directive. But subPath was subject to a classic symbolic-link race condition vulnerability.
A container could make a subPath name in its view of the volume a symbolic link pointing anywhere it chose. A subsequent container that used the subPath name would be using a link controlled by the owner of the pod. Those links are resolved on the host, so linking the subPath name to / would provide access to the root directory on the host; game over.
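The attack itself needs nothing more than the ability to create a symbolic link inside the shared volume. As a minimal sketch (with hypothetical paths, not code from the actual report), a malicious container could do something like this:

/* Sketch of the attack from inside a malicious container; "/vol" is
 * where the shared volume is mounted in this container.  A later
 * container using the subPath "exploit" would have the kubelet resolve
 * this link on the host and bind mount the host's root directory. */
#include <unistd.h>

int main(void)
{
    return symlink("/", "/vol/exploit");
}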
Šafránek demonstrated the flaw. He showed that it allows access to the root directory of the host, which means that the whole node is compromised. For example, it gives access to the Docker socket so an attacker can run anything in the containers, access any secrets used by the containers, and more. All of that comes about because Kubernetes does not check for symbolic links.
Working out a solution
The problem was reported just before KubeCon Austin, so the PST brainstormed on solutions at that gathering. The first, "naive", idea was simply to resolve the full path, then validate that it still points inside the volume. But there is a time of check to time of use (TOCTTOU) race condition in that scheme. The user can modify the directories after the check but before they are handed to the container runtime.
The next idea was to freeze the filesystem while validating the path and handing it to the container runtime. For Windows, CreateFile() can be used to lock each component of the path until after the runtime mounts it, but something different was needed for Linux. Bind mounting the volume to some Kubernetes directory, outside of user control, and then handing that off is a safe way to get it to the runtime, but it still leaves race conditions: a user could switch the path to contain a symbolic link after any existing links have been resolved, or after the path has been validated to remain inside the volume.
The /proc filesystem contained a clue that was used for the actual solution that was implemented. The links under /proc/PID/fd can reliably be used to bind mount a file descriptor corresponding to the final component of the subPath. The volume directory is opened, then each component of the subPath is opened using openat() while disallowing following symbolic links and validating that the path is still within the volume. The file descriptor file in /proc of the final component is then bind mounted to a safe location and handed off to the container runtime. That eliminates the races and implements a scheme that is not dependent on the underlying filesystem type.
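The kubelet itself is written in Go, so what follows is only a minimal C sketch of the system-call pattern just described; the volume path and subPath components are made up for illustration. Each component is opened with O_NOFOLLOW so that any symbolic link causes the walk to fail, and the resulting file descriptor's /proc entry is what would then be bind mounted to a location outside of user control.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open base/subpath one component at a time, refusing to follow any
 * symbolic link along the way.  Returns a file descriptor on success. */
static int open_subpath_nofollow(const char *base, char *subpath)
{
    int fd = open(base, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    char *comp, *save;

    if (fd < 0)
        return -1;
    for (comp = strtok_r(subpath, "/", &save); comp;
         comp = strtok_r(NULL, "/", &save)) {
        int next = openat(fd, comp, O_RDONLY | O_NOFOLLOW);

        close(fd);
        if (next < 0)
            return -1;    /* symlink or missing component */
        fd = next;
    }
    return fd;
}

int main(void)
{
    char sub[] = "logs/app";    /* hypothetical subPath */
    char proc_path[64];
    int fd = open_subpath_nofollow("/var/lib/kubelet/volume", sub);

    if (fd < 0)
        return 1;
    /* This path can safely be bind mounted and handed to the runtime. */
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    printf("safe handle: %s\n", proc_path);
    close(fd);
    return 0;
}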
Making the fix
It took a fair amount of time to get the fix out there; there were lots of end-to-end tests that needed to be developed and run on both Linux and Windows. But, since Kubernetes is developed in the open, how could this fix be developed in secret? The answer is that a separate repository, kubernetes-security, was used. Only the PST can normally access it, but the PST can give temporary access to those working on the fix. Au and Šafránek lost their access after the fix was released; "we have no idea what's going on there now", Šafránek said.
The development and testing process is similar to that of the open kubernetes repository, but the logs of tests and such for kubernetes-security go to a private bucket that only Google employees can access. Šafránek works for Red Hat in Europe, so sometimes he had to wait for Au, who works for Google in the US, to wake up so that he could find out what went wrong for a test run.
The flaw was disclosed to third-party Kubernetes vendors on the closed kubernetes-distributors-announce mailing list one week before it was publicly disclosed. On March 12, CVE-2017-1002101 was announced, which was roughly four months after it was reported. Kubernetes 1.7, 1.8, and 1.9 were updated and released on that day. The timeline and more can be found in the public post-mortem document.
Au went over some "best practices" for avoiding these kinds of problems. To start with, do not run containers as the root user; containers running as another user only have the same access as that user. That can be enforced by using PodSecurityPolicy, though containers will still run in the root group; the upcoming RunAsGroup feature will address that shortcoming. The security policy can also be used to restrict volume access, though that would not have helped for this particular vulnerability.
Using sandboxed containers is something that is being investigated for future Kubernetes releases. Using gVisor or Kata Containers will provide another security boundary. That is in keeping with a core principle that there should be at least two security barriers around untrusted code. For this vulnerability, a sandbox could have prevented access to the host filesystem. Au said she expects to see some movement on container sandboxes over the next year or so.
She started her talk summary with a reminder to follow the project's security disclosure process. She also suggested that Kubernetes and other projects be "extra cautious" when handling untrusted paths. Symbolic-link races and TOCTTOU are well-known dangers in path handling. In addition, she recommended setting restrictive security policies and using multiple security boundaries.
In answer to a question, Au said that most of the four months was taken up by development and testing, some of which was slowed down by the end-of-year holiday break. About two weeks were taken up with the actual release process. When asked about what could improve for the next CVE, Šafránek said that getting access to the private logs is important; Au said that it is being worked on. She also pointed to the post-mortem document as a good source for improvement ideas.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Seattle for KubeCon NA.]
Relief for retpoline pain
Indirect function calls — calls to a function whose address is stored in a pointer variable — have never been blindingly fast, but the Spectre hardware vulnerabilities have made things far worse. The indirect branch predictor used to speed up indirect calls in the CPU can no longer be used, and performance has suffered accordingly. The "retpoline" mechanism was a brilliant hack that proved faster than the hardware-based solutions that were tried at the beginning. While retpolines took a lot of the pain out of Spectre mitigation, experience over the last year has made it clear that they still hurt. It is thus not surprising that developers have been looking for alternatives to retpolines; several of them have shown up on the kernel lists recently.

The way to make an indirect call faster is to replace it with a direct call; that renders branch prediction unnecessary. Of course, if a direct call would have sufficed in any given situation, the developer would have used it rather than an indirect call, so this replacement is not always straightforward. All of the proposed solutions to retpoline overhead strive to do that replacement in one way or another, though; they vary from the simple to the complex.
Speeding up DMA operations
The simplest method is often the best; that is the approach taken in Christoph Hellwig's patch set speeding up the DMA-mapping code. Setting up DMA buffers can involve a lot of architecture-specific trickery; the DMA mapping layer does its best to hide that trickery behind a common API. As is often the case in the kernel, the code in the middle uses a structure full of function pointers to direct a generic DMA call to the code that can implement it in any specific setting.
It turns out, though, that the most common case for DMA mapping is the simplest: the memory is simply directly mapped in both the CPU's and the device's address space with no particular trickery required. Hellwig's work takes advantage of that fact by testing for this case and calling the direct-mapping support code directly rather than going through a function pointer. So, for example, code that looks like this:
addr = ops->map_page(...);
is transformed into something like:
if (dma_is_direct(ops))
    addr = dma_direct_map_page(...);
else
    addr = ops->map_page(...);
The cost of the if test is more than recouped in the direct-mapping case by avoiding the indirect function call (and it is tiny relative to the cost of that call in the other cases). Jesper Dangaard Brouer, who reported the performance hit in the DMA-mapping code, expressed his happiness at this change: "my XDP performance is back". Barring problems, this change seems likely to be merged sometime soon.
Choosing from a list
In some situations, an indirect function call will end up invoking one out of a relatively small list of known functions; a variant of the above approach can be used to test for each of the known alternatives and call the correct function directly. This patch set from Paolo Abeni implements that approach with a simple set of macros. If a given variable func can point to either of f1() or f2(), the indirect call can be avoided with code that looks like this:
INDIRECT_CALLABLE_DECLARE(f1(args...));
INDIRECT_CALLABLE_DECLARE(f2(args...));
/* ... */
INDIRECT_CALL_2(func, f2, f1, args...);
This code will expand to something like:
if (func == f1)
    f1(args);
else if (func == f2)
    f2(args);
else
    (*func)(args);
Abeni's patch set is aimed at the network stack, so it contains some additional optimizations that can apply when the choice is between the IPv4 and IPv6 versions of a function. He claims a 10% or so improvement for a UDP generic receive offload (GRO) benchmark. Networking maintainer David Miller has indicated a willingness to accept this work, though the current patch set needs a couple of repairs before it can be merged.
Static calls
Sometimes indirect calls reflect a mode of operation in the kernel that is not often changed; in such cases, the optimal approach might be to just turn the indirect call into a direct call and patch the code when the target must be changed. That is the approach taken by the static calls patch set from Josh Poimboeuf.
Imagine a global variable target that can hold a pointer to either of f1() or f2(). This variable could be declared as a static call with a declaration like:
DEFINE_STATIC_CALL(target, f1);
Initially, target will point to f1(). Changing it to point to f2() requires a call like:
static_call_update(target, f2);
Actually calling the function pointed to by target is done with static_call():
static_call(target, args...);
Since changing the target of a call involves code patching, it is an expensive operation and should not be done often. One possible use case for static calls is tracepoints in the kernel, which can have an arbitrary function attached to them, but which are not often changed. Using a static call for that attachment can reduce the runtime overhead of enabling a tracepoint.
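As a rough illustration of that use case (using the API described above, with hypothetical names; the real patches will differ in detail), a tracepoint-style hook might be wired up like this:

/* Default handler: do nothing when the tracepoint is disabled. */
static void trace_nop(unsigned long data) { }

/* Hypothetical tracing handler. */
static void trace_log(unsigned long data)
{
    /* ... emit the trace event ... */
}

DEFINE_STATIC_CALL(my_trace_hook, trace_nop);

/* Hot path: compiles down to a direct call (or a call through a
 * trampoline), with no indirect branch to mispredict. */
static inline void trace_my_event(unsigned long data)
{
    static_call(my_trace_hook, data);
}

/* Rare, expensive operation: enabling the tracepoint patches the
 * call target from trace_nop() to trace_log(). */
static void enable_my_tracepoint(void)
{
    static_call_update(my_trace_hook, trace_log);
}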
This patch set has been through a couple of revisions so far. It implements two different mechanisms. The first tracks all call sites for each static call variable and patches each of them when the target changes; the second stores the target in a trampoline and all calls jump through there. The motivations for the two approaches are not spelled out, but one can imagine that the direct calls will be a little faster, while the trampoline will be quicker and easier to patch when the target changes.
Relpolines/optpolines
A rather more involved and general-purpose approach can be seen in this patch set posted by Nadav Amit in October. Rather than requiring developers to change indirect call sites by hand, Amit adds a new mechanism that optimizes indirect calls on the fly.
The patch set uses some "assembly macro magic" to change how every retpoline injected into the kernel works; the new version contains both fast and slow paths. The fast path is a test and direct call to the most frequently called target (hopefully) from any retpoline, while the slow path is the old retpoline mechanism. In the normal production mode, the fast path should mitigate the retpoline overhead in a large fraction of the calls from that site.

What makes this work interesting is the selection of the target for the fast path. Each "relpoline" (a name that was deemed too close to "retpoline" for comfort and which, as a result, may be renamed to something like "optpoline") starts out in a learning mode where it builds a hash table containing the actual target for each call that is made. After a sufficient number of calls, the most frequently called target is patched directly into the code, and the learning mode ends. To follow changing workloads, relpolines are put back into the learning mode after one minute of operation, a period that Amit says "might be too aggressive".
This mechanism has the advantage of optimizing all indirect calls, not just the ones identified as a problem by a developer. It can also operate on indirect calls added in loadable modules at any point during the system's operation. The results, he says, are "not bad"; they include a 10% improvement in an nginx benchmark. Even on a system with retpolines disabled, simply optimizing the indirect calls yields a 2% improvement for nginx. The downside, of course, is the addition of a fair amount of low-level complexity to implement this mechanism.
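Conceptually, what an optimized call site ends up doing after the learning phase can be sketched as follows; the names here are invented for illustration and do not come from Amit's patches.

typedef void (*handler_t)(void *arg);

void likely_target(void *arg);             /* most common target seen while learning */
void retpoline_indirect_call(handler_t fn, void *arg);  /* the old, slow path */

static void optpoline_call_site(handler_t fn, void *arg)
{
    if (fn == likely_target)
        likely_target(arg);                /* fast path: direct call */
    else
        retpoline_indirect_call(fn, arg);  /* slow path: retpoline */
}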
Response to this patch set has been muted but generally positive. There are, though, lots of suggestions on the details of how this mechanism would work. There may be further optimizations to be had by storing more than one common target, for example. The learning mechanism can probably benefit from some improvement. There was also a suggestion to use a GCC plugin rather than the macro magic to insert the new call mechanism into the kernel. As a result, the patch set is still under development and will likely take some time yet to be ready.
What's next
Various other developers have been working on the indirect call problem as well. Edward Cree, for example, has posted a patch set adding a simple learning mechanism to static calls. Nearly one year after the Spectre vulnerability was disclosed, the development community is clearly still trying to do something about the performance costs the Spectre mitigations have imposed.
The current round of fixes is trying to recover the performance lost when the indirect branch predictor was taken out of the picture. As Cree put it: "Essentially we're doing indirect branch prediction in software because the hardware can't be trusted to get it right; this is sad". Merging four different approaches (at least) to this problem may not be the best solution, especially since this particular vulnerability should eventually be fixed in the hardware, rendering all of the workarounds unnecessary. Your editor would not want to speculate on which of the above patches, if any, will make it into the mainline, though.
Linux in mixed-criticality systems
The Linux kernel is generally seen as a poor fit for safety-critical systems; it was never designed to provide realtime response guarantees or to be certifiable for such uses. But the systems that can be used in such settings lack the features needed to support complex applications. This problem is often solved by deploying a mix of computers running different operating systems. But what if you want to support a mixture of tasks, some safety-critical and some not, on the same system? At a talk given at LinuxLab 2018, Claudio Scordino described an effort to support this type of mixed-criticality system.

For the moment, this work is focused on automotive systems, which have a bunch of non-critical tasks (user interaction and displaying multimedia, for example) and critical tasks (such as autonomous driving and engine control). These tasks can be (and often are) handled with independent computers running different operating systems, but there is a lot of interest in combining these computers into one. The result, should this effort be successful, would be a system that is both cheaper and more flexible.
One way of doing this would be to turn Linux into a fully realtime system. In the past, dual-kernel approaches, such as RTLinux, RTAI, and Xenomai, have been developed toward that end. More recently, the PREEMPT-RT patches have seen the most attention, though Scordino described the result as "soft realtime". The problem with all of these systems is certification for use in safety-critical settings, which is hard (if not impossible) for a kernel as large as Linux. In addition, regulations can prevent its use anyway; European automotive regulations do not allow the use of a shared system for both non-critical tasks and engine control, for example.
So attention has shifted to an alternative approach: resource partitioning that can create a hard separation between tasks on a single platform. Processors for most of the common architectures now support this kind of partitioning. If one of these systems could be successfully partitioned, it would become possible to use Linux for non-critical code and a certified operating system for the rest. This is the focus of the Hercules project, which has been funded by the European Union.
For the safety-critical side of the system, Hercules has chosen Erika Enterprise, a system that is licensed under GPLv2+. It has been designed explicitly for automotive electronic control units, and carries a number of the relevant certifications. Erika is used in production in some cars now; it supports a range of CPUs and can run under a variety of hypervisors.
On the hypervisor side, the system of choice is Jailhouse, which is also available under GPLv2. Jailhouse, too, has been designed for safety-critical applications with certification in mind; to that end, the project has a goal of not exceeding 10,000 lines of code on any architecture that it supports. The plan is to use Jailhouse to run realtime, safety-critical tasks on multicore platforms alongside Linux. It should be able to provide strong and clean isolation between the two while performing at "bare-metal levels".
Jailhouse has the concept of a "root cell" which, while being in control of the system as a whole, is not in full control of the hardware it is running on. The root cell will be running Linux. Other "cells" can run (as an "inmate") any kernel that has Jailhouse support; unlike full virtualization systems, Jailhouse is not able to run unmodified kernels. There is no scheduling built into Jailhouse; each core in the system is given over fully to one system for its use. There is no overcommitting of hardware, and no hardware emulation. Memory is partitioned between the cells, with some set aside for Jailhouse itself.
Linux systems with Jailhouse support have a special device (/dev/jailhouse) used for configuration of the root cell and the loading of inmate systems into the other cells. It uses a rather long and intimidating configuration file, written as a set of C data-structure definitions, that fully describes the hardware and its partitioning; this file can be automatically generated on x86 systems, but must be written by hand for Arm systems.
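To give a feel for that format, here is a heavily abridged, entirely hypothetical fragment in the same style; the structure and field names are invented for illustration, and the real layout is defined by Jailhouse's cell-configuration headers.

#include <stdint.h>

/* Illustrative stand-ins for Jailhouse's real configuration types. */
struct demo_mem_region {
    uint64_t phys_start;
    uint64_t virt_start;
    uint64_t size;
    uint32_t flags;        /* read/write/execute/DMA permissions */
};

struct demo_cell_config {
    char name[32];
    uint64_t cpu_set;      /* bitmap of cores handed to this cell */
    unsigned int num_mem_regions;
    struct demo_mem_region mem_regions[2];
};

/* A cell given one core, some RAM, and a UART (addresses are made up). */
static const struct demo_cell_config rt_inmate = {
    .name = "rt-inmate",
    .cpu_set = 0x4,        /* core 2 only */
    .num_mem_regions = 2,
    .mem_regions = {
        { 0x3f000000, 0x00000000, 0x01000000, 0x7 },  /* RAM */
        { 0xfe201000, 0xfe201000, 0x00001000, 0x3 },  /* UART */
    },
};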
Cells in Jailhouse are isolated from each other, but they are still likely to need to communicate; Jailhouse provides a virtual PCI device for that purpose. There is no multicast capability, which is a bit of a shortcoming, so the Hercules developers have added their own communications library on top. It provides both blocking and non-blocking calls and dynamic message sizes; this code should be released soon.
One of the biggest problems that needs to be solved is avoiding interference between the cells, which can happen even with hard partitioning. Memory bandwidth and cache space can be particularly problematic. One solution for cache contention is to use cache coloring — assigning virtual addresses so that each cell uses a different portion of the cache. Another approach is to use the system's performance counters to monitor the use of memory bandwidth and cache space; tasks can then be throttled if need be to keep them within their limits. Coscheduling, wherein processes are scheduled so as to avoid contending with each other for memory, is also under development. This code, too, is expected to be released soon.
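As a rough sketch of the cache-coloring idea (with illustrative constants, not values from the Hercules code): the "color" of a physical page is determined by the address bits just above the page offset that also index into the last-level cache, so two cells restricted to disjoint sets of colors can never evict each other's cache lines.

#include <stdint.h>

#define PAGE_SHIFT  12     /* 4KB pages */
#define NUM_COLORS  16     /* depends on cache size, ways, and line size */

/* Cache color of a physical page: the low set-index bits above the
 * page offset. */
static inline unsigned int page_color(uint64_t phys_addr)
{
    return (phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1);
}

/* A hypervisor doing cache coloring gives one cell only pages with
 * colors 0-7, and another only pages with colors 8-15. */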
As the session ended, Thomas Gleixner observed from the audience that Hercules is "a nice fairy tale", an embodiment of the design that everybody seems to want. He also said that it is "a pipe dream", though, for a simple reason: there are no CPUs large enough to run this kind of system that have been certified for safety-critical tasks. Without a certified CPU, the system as a whole cannot be certified. Scordino responded that the vendors are working on this problem. Once they have a solution, it would appear that Hercules will be ready to run on it.
[Thanks to LinuxLab and to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]
Page editor: Jonathan Corbet
