
Leading items

Welcome to the LWN.net Weekly Edition for July 28, 2022

This edition contains the following feature content:

  • Digital autonomy and the GNOME desktop: Rob McQueen's GUADEC talk on giving users more control over their data.
  • Living with the Rust trademark: the Rust trademark policy and what it means for Debian and GCC.
  • Stuffing the return stack buffer: a lower-overhead mitigation for the Retbleed vulnerabilities.
  • Support for Intel's Linear Address Masking: storing metadata in unused pointer bits.
  • Docker and the OCI container ecosystem: a survey of Docker and the alternatives built around the OCI specifications.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Digital autonomy and the GNOME desktop

By Jake Edge
July 27, 2022

GUADEC

While GUADEC, the GNOME community's annual conference, has always been held in Europe (or online-only) since it began in 2000, this year's edition was held in North America, specifically in Guadalajara, Mexico, July 20-25. Rob McQueen gave a talk on the first day of the conference about providing solutions that bring some level of digital safety and autonomy to users—and how GNOME can help make that happen. McQueen is the CEO of the Endless OS Foundation, which is an organization geared toward those goals; he was also recently reelected as the president of the GNOME Foundation board of directors.

His talk was meant to introduce and describe an objective that the GNOME board has been discussing and working on regarding the state of the internet today and how GNOME can make that experience better for its users. The cloud-focused computing environment that is prevalent today has a number of problems that could be addressed by such an effort. That topic is related to what he does for work, as well, since Endless OS is working on "bridging the digital divide" by helping those who are not able to access all of the information that is available on today's internet. Some of those efforts are aimed at bringing that data to those who cannot, or perhaps choose not to, directly connect to the internet itself—or only do so sporadically.

The problems

The UN estimates that there will be 2.5 billion more people on the planet by 2050; most of those people, perhaps 2 billion of them, will be born in places where power and connectivity, and thus technology, are quite limited. Meanwhile, there are more smartphones on the planet than people, but both Endless and GNOME have an interest in desktop computing. In order for people to fully participate in the activities that computing can facilitate, such as education, employment, content creation, and more, a form factor beyond what a phone provides is needed.

[Rob McQueen]

Computers these days use a lot of internet, he said. Many users have workloads that are split between the computer in front of them and one in the cloud. That creates infrastructure constraints for personal computing; there is more needed than just a device. That infrastructure consists of all of the disparate pieces that allow the connection to the rest of the world: trenches, wires, towers, satellites, and more.

Some predictions are that in around ten years, satellites and other technology will solve the global connectivity problem, he said. But he has been working in the "digital divide space" for around ten years and that prediction was also made ten years ago. Matching the growth in global population with connectivity infrastructure is an extremely difficult and expensive problem to solve.

Even with internet connectivity, though, there are still plenty of problems. In the free-software world, we are able to examine the software that we run on the computer in front of us, McQueen said. When software is running in the cloud, which is effectively just "someone else's computer", as the snarky definition he referred to notes, that ability is not present. Running it elsewhere means that the user loses control of their data: whether and how well the data is secured, whether it is shared with third parties, whether they will still be able to access it tomorrow, and so on.

Data that is centralized is also a target for attack, he said. There are enormous resources being poured into securing these centralized resources these days. The problem has risen to a level of national concern; for example, the Biden administration in the US has a panel that is advising it on how to secure the internet and the infrastructure it runs on.

[Flash Drives for Freedom]

Loss of data control can have "very real-world consequences". For example, apps that track menstrual cycles, with the convenience of syncing the data to the cloud, can also reveal things that could be dangerous from a legal perspective in the US today. Given the current climate in the US, he said, health data could potentially put someone or their healthcare provider in legal trouble, "or it could even put your life at risk—this is terrifying."

Then there are governments that are trying to quell dissent by use of internet blockages of various sorts. For example, Russia has been limiting access to sites that provide a more balanced view of its war on Ukraine because it has its own narrative to promote. He noted that the NetBlocks organization maps these kinds of network disruptions and tries to tie them to the real-world events that may have triggered them. He said that he could not resist mentioning Flash Drives for Freedom, which creates USB drives containing suppressed information that get smuggled into North Korea; the visual impact of its home page image (seen at right) was "too good not to include" in his slides.

Solutions?

He had just presented "some of the consequences of the way we approach computing" today; he does not have an "amazing answer" of what should be done, but he did have some questions, ideas, and things that are "worth exploring together". GNOME has a focus on software that runs locally, in part because that puts users in control of their data and gives them the ability to look at the source code; ultimately, the project believes those things allow users to have more trust in their computing environment.

McQueen asked a few different questions about how the GNOME project could improve in some of these areas. What can the project add to its desktop to provide more safety for its users and to allow them to have better control over their data? What can it do to help users who live in a country that gets cut off from the internet due to a war? How can the GNOME desktop help block various kinds of threats to its users and their data?

He explained that there are other organizations out there solving some of these problems; GNOME could potentially partner with them to use their technology in its desktop in order to accomplish these goals. He listed several different technology areas that would fit well into the GNOME desktop. The first of those was regarding offline content.

Storage is a reasonable way to work around connectivity woes that come about due to lack of infrastructure, upheavals like wars, or censorship of various kinds. There are a number of projects that exist for "bringing bits of the internet onto the computer in front of you". For example, Kiwix has technology for downloading entire web sites, such as Wikipedia, and making them available offline. It turns out that Wikipedia in Russian has been seeing increased downloads on Kiwix of late since "the Russian government has threatened to block access to Wikipedia for documenting narratives that do not agree with the official position".

Endless OS is collaborating with Learning Equality, which is a non-profit that has created a learning platform called Kolibri. It allows accessing educational resources, including ebooks, videos, audio, and games, in a curated set of courses for offline schools. The Endless Key project uses Kolibri to create offline educational resources for US middle and high school students who do not have internet access at home.

He then turned to peer-to-peer technology, which is where he started his career; after 15 years of working on that, he has learned that it is an extremely difficult problem to solve. He looked for good examples of peer-to-peer tools for the desktop and came up with two. The first is Syncthing (which we looked at a year ago); it is "kind of a decentralized Dropbox". It is a bit difficult to configure, but once that is done, file folders will be synchronized between multiple devices either over the local network or using cloud servers.

The other example is Snapdrop, which is "so simple and so cool". It is a web application that discovers other devices on the network and allows drag-and-drop file transfer among them. Since it is web-based, it is device independent, but it does require that the devices are online to access the web page. The transfer happens in a peer-to-peer fashion, but the application gets loaded from the cloud.

Local first

The third technology area that McQueen wanted to talk about was local-first software. He had borrowed some slides from Peter van Hardenberg at Ink & Switch, which is an "industrial research group" that has been working on local-first software for the last five years. Those researchers have come up with a manifesto of sorts, with seven principles, or ideals, that describe software that is not completely reliant on the network or the cloud, but still provides many of the same features and benefits that users have come to expect.

The first of these ideals is "no spinners"; the user's work is actually on the device in front of them. But on the flipside, their work is not trapped on a single device; through some mechanism, replicas are kept in sync on other devices of interest. The network is optional, however; when it is present, synchronization can happen, but work can still be done without it. The fourth item is that seamless collaboration is a requirement today; it has become an indispensable feature that needs to be incorporated in any synchronization mechanisms that arise.

The data that gets stored needs to remain accessible even if the software that uses it goes away. Digital archivists (and others) worry that we are storing much of the data about our lives today in ways that will not be accessible 20, or even ten or fewer, years from now. For example, it could become impossible to access a document made in Google Docs sometime down the road. Avoiding that has benefits both for individual users and for society as a whole.

Local-first software has the advantage of having privacy and security built in because it is not storing its data in some centralized cloud location that becomes a huge temptation for attackers. That centralized storage is also susceptible to various misdeeds by the companies controlling it—or their employees. Local-first gives users ultimate ownership and control of their data. No cloud-based application provider can cut off access on a whim or at the behest of, say, an oppressive government regime.

McQueen recommended that people visit the Ink & Switch site to find out more. The group has done more than just think about local-first software; it has done some work on using conflict-free replicated data types (CRDTs), which provide eventual consistency for data that is being updated in multiple places. The data structure is a good basis for collaborative tools that can seamlessly move between connected and unconnected modes, he said.
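
To make the idea of a CRDT a bit more concrete, here is a minimal, hand-rolled sketch of the simplest textbook example, a grow-only counter; it is not Automerge or anything GNOME ships, just an illustration of why replicas that merge state in any order, any number of times, end up agreeing:

    /* A grow-only counter (G-Counter), the simplest CRDT: each replica only
     * increments its own slot, and merging takes the element-wise maximum.
     * Because max() is commutative, associative, and idempotent, replicas
     * converge no matter when or in what order they synchronize. */
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define REPLICAS 3

    struct gcounter { uint64_t counts[REPLICAS]; };

    static void increment(struct gcounter *c, int replica)
    {
        c->counts[replica]++;
    }

    static uint64_t value(const struct gcounter *c)
    {
        uint64_t sum = 0;
        for (int i = 0; i < REPLICAS; i++)
            sum += c->counts[i];
        return sum;
    }

    static void merge(struct gcounter *dst, const struct gcounter *src)
    {
        for (int i = 0; i < REPLICAS; i++)
            if (src->counts[i] > dst->counts[i])
                dst->counts[i] = src->counts[i];
    }

    int main(void)
    {
        struct gcounter a = { 0 }, b = { 0 };

        increment(&a, 0);          /* work done offline on device A */
        increment(&b, 1);          /* concurrent work on device B   */
        increment(&b, 1);

        merge(&a, &b);             /* the network returns; sync both ways */
        merge(&b, &a);

        printf("a=%" PRIu64 " b=%" PRIu64 "\n", value(&a), value(&b)); /* both 3 */
        return 0;
    }

Real collaborative editing needs far richer data types than a counter, which is where libraries like Automerge come in, but the convergence property they rely on is the same.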

Development

There is also a case to be made that today's cloud applications are overly expensive to build and operate. They are usually written in several tiers: one for the web-based user interface, a layer for business logic, an API layer that provides access to a storage/database layer, and so on. These often use different programming languages and it all leads to a complex distributed application that can be difficult to scale. He put up a slide of the Cloud Native landscape, as an example of the complexity that arises for cloud applications.

In the self-contained software world, we have a longstanding tradition of writing code that "we can reason about", he said; it is architecturally fairly simple, generally having fewer code bases and using fewer languages. The goal of local-first software is not to reject the cloud, but it is "about rethinking the relationship with the cloud". The cloud has a role in helping to synchronize the data, improving its availability, and in providing additional compute power. With proper key management ("asterisk, it's complicated"), data can be end-to-end encrypted so that the cloud becomes a passive carrier rather than an active participant.

The GNOME Foundation has identified three areas that it plans to provide funding for in the coming years. One is to help bootstrap an app store for GNOME; another is to work on improving the diversity within the GNOME community. The third is to look at ways to integrate decentralized and local-first technologies into the GNOME desktop. McQueen thinks that GNOME is well-positioned to take a lead on bringing some of these technologies to its users. For one thing, GNOME is "very opinionated" about how its software looks and operates. Applying that same approach to local-first software makes sense.

He had some examples of existing tools that already embody some parts of the local-first approach. WebArchives is an application that loads Kiwix files for offline viewing of Wikipedia and other sites. Endless has an Encyclopedia application that is similar, but it is integrated with the desktop search on Endless OS as well. Encyclopedia is currently a separate GTK-based application, but Endless is moving toward integrating all of that into Kolibri for the future, he said.

In the realm of peer-to-peer applications, there is Teleport, which discovers other Teleport-ready devices on the local network and allows transferring files between them. One limitation is that all of the participants need to be running GNOME, but it provides an example of an application that is easy to set up, which could perhaps be merged with techniques from Syncthing or Snapdrop.

Automerge is another project that could be useful for GNOME; it provides a library to do CRDT handling for collaborative applications. It was originally JavaScript-based, but has been rewritten in Rust, which has the advantage of moving away from the "millions of lines of crusty C code that we are running on top of". Using the GTK Rust bindings along with Automerge will allow GNOME to start experimenting with local-first collaborative applications, he said.

He wrapped up by talking about several kinds of applications where it would be useful to have access to the same data in multiple locations without making that data available to cloud providers. For example, health-tracking applications (such as GNOME Health) would benefit from synchronization across devices, but that data is of a particularly personal nature, of course. Contact lists and calendars are additional kinds of applications where multi-device synchronization and (limited) sharing among collaborators make a lot of sense. McQueen thinks that GNOME is in a great position to help set the stage for the computing experience of those 2.5 billion people who are "arriving" over the next 30 years or so. The GNOME Foundation is only one voice in the project, however, so he is hoping to see others join in to work on various aspects of it.

A YouTube video of the talk is available, though the audio volume is rather low.

[I would like to thank LWN subscribers for supporting my trip to Guadalajara, Mexico for GUADEC.]

Comments (2 posted)

Living with the Rust trademark

By Jonathan Corbet
July 21, 2022
The intersection of free software and trademark law has not always been smooth. Free-software licenses have little to say about trademarks but, sometimes, trademark licenses can appear to take away some of the freedoms that free-software licenses grant. The Firefox browser has often been the focal point for trademark-related controversy; happily, those problems appear to be in the past now. Instead, the increasing popularity of the Rust language is drawing attention to its trademark policies.

When a free-software project gets a trademark, it is an indication that the name of that project has come to have some sort of value, and somebody (hopefully the people in charge of the project) wants to control how it is used. They may want to prevent their project's name from being used with versions that have been modified with rent-seeking or malicious features, for example. Other projects might want the exclusive right to market commercial versions of their code under the trademarked name. As a general rule, trademark licenses will restrict the changes to a project's code that can be distributed without changing the name.

As a result of those restrictions, trademark policies can appear to be a violation of free-software principles. But those restrictions apply to the trademarked name, not the code itself; any such restrictions can be avoided just by not using the name. Thus, until 2016, Firefox as distributed by Debian was known as "Iceweasel"; it lacked no functionality and was entirely free software. It is worth noting that version 3 of the GNU General Public License explicitly allows the withholding of trademark rights.

That does not mean, though, that all trademark policies for free-software projects are met with universal acclaim.

Rust in Debian

At the end of June, Luke Leighton filed a Debian bug regarding the trademarks on the Rust and Cargo names:

This is an extremely serious situation that either requires pulling rust and cargo from debian or a rename of both rust and cargo exactly as was done with iceweasel.

Failure to do so is also extremely serious because Unlawful Distribution may still be considered grounds for financial compensation as well as a Legal Notice to Cease and Desist, and also to remove all public and private use of the Trademark from all Records. mailing lists, bugtracker, debian archives - everything.

On July 17, he posted a similar report to the GCC developers mailing list. The discussions in both places were tense at times, with Leighton's strident and sometimes threatening tone (which he attributed to Asperger's) making the discussion less pleasant than it should have been. That should not, however, obscure the fact that he does indeed have a point that both projects need to pay attention to.

Leighton was objecting to the Rust and Cargo trademark policies as they existed in late June. Specifically, he called out this rule:

Distributing a modified version of the Rust programming language or the Cargo package manager and calling it Rust or Cargo requires explicit, written permission from the Rust core team. We will usually allow these uses as long as the modifications are (1) relatively small and (2) very clearly communicated to end-users.

Debian distributes a large set of Rust-related packages, including the rustc compiler and cargo package manager/build system. Most distributions apply patches to programs of this size, and Debian is no exception. Thus, by a strict reading, Debian is likely to be in violation of the above terms.

Debian developer (and Rust package maintainer) Sylvestre Ledru responded that he would "chat with some people on the Rust side about this" and suggested that there was little to worry about; since Ledru also was involved with the resolution of the Iceweasel episode, there was reason to trust his judgment on this. That didn't stop an extended discussion on whether these restrictions made Rust non-free or whether Debian was at risk, of course.

On July 18, Ledru returned with an announcement that a discussion with the Rust Foundation had resulted in some changes to the posted trademark policy. The current version of this policy has a new clause for allowed use of the trademarks:

Distributing a modified version of the Rust programming language or the Cargo package manager, provided that the modifications are limited to:

  • porting the software to a different architecture
  • fixing local paths
  • adding patches that have been released upstream, or
  • adding patches that have been reported upstream, provided that the patch is removed if it is not accepted upstream

From Ledru's point of view, the issue is resolved, and he closed the bug accordingly. Leighton promptly reopened it, leading to continued discussion about whether the issue has truly been resolved or not. It would appear, though, that the project as a whole feels there is no longer anything to worry about.

GCC and Rust

A solution for Debian is not necessarily a solution for GCC, though. The GCC project is not shipping a modified version of rustc; instead, this project has recently approved a plan to merge an independently implemented compiler front-end for the Rust language. By Leighton's interpretation of the trademark policy, the GCC Rust compiler cannot actually be called a Rust compiler. Should this interpretation hold, GCC could end up claiming to compile a language called something like "Gust" or "Oxide" instead.

Mark Wielaard responded, though, that the concern was overwrought:

That looks to me as an overreaching interpretation of how to interpret a trademark. I notice you are the bug reporter. It would only apply if a product based on gcc with the gccrs frontend integrated would claim to be endorsed by the Rust Foundation by using the Rust wordmark. Just using the word rust doesn't trigger confusion about that. And trademarks don't apply when using common words to implement an interface or command line tool for compatibility with a programming language.

He also pointed out that the one-time GCC Java implementation never ran afoul of the Java trademark. Unsurprisingly at this point, Leighton disagreed: "sorry, Mark, you're still misunderstanding, on multiple levels and in so many ways i am having a hard time tracking them all". The conversation did not improve from there until David Edelsohn said (in an apparently private message quoted by Leighton) that the issues raised would be given "due consideration" and that any problem found would be addressed. That, evidently, was the wording that Leighton wanted to hear.

In the short term, there is probably little danger of a "Gust" solution; as was pointed out in the discussion, the Rust Foundation has long been aware of the GCC work and has never raised any concerns about it. That could change at some future point, though, especially if the GCC Rust compiler implements a different version of the language, perhaps with extensions that have not been approved for official Rust™. If GCC Rust were to be successful enough to cause the Rust Foundation to fear loss of control over the language, the results could be bad for everybody involved. It will take some time before GCC Rust can catch up to rustc, though, if it ever does, so this is not an immediate concern.

Free-software licenses are about giving rights to others; trademark licenses, instead, are concerned with restricting rights. So there will always be a perceived conflict between the two. Most of the time, the free-software community has found ways to coexist with trademarks. The outcome in the Rust case will almost certainly be the same as long as the parties involved pay attention and deal with each other in good faith; thus far, there is no indication that anything other than that is happening.

Comments (99 posted)

Stuffing the return stack buffer

By Jonathan Corbet
July 22, 2022
"Retbleed" is the name given to a class of speculative-execution vulnerabilities involving return instructions. Mitigations for Retbleed have found their way into the mainline kernel but, as of this writing, some remaining problems have kept them from the stable update releases. Mitigating Retbleed can impede performance severely, especially on some Intel processors. Thomas Gleixner and Peter Zijlstra think they have found a better way that bypasses the existing mitigations and misleads the processor's speculative-execution mechanisms instead.

If a CPU is to speculate past a return instruction, it must have some idea of where the code will return to. In recent Intel processors, there is a special hidden data structure called the "return stack buffer" (RSB) that caches return addresses for speculation. The RSB can hold 16 entries, so it must drop the oldest entries if a call chain goes deeper than that. As that deep call chain returns, the RSB can underflow. One might think that speculation would just stop at that point but, instead, the CPU resorts to other heuristics, including predicting from the branch history buffer. Alas, techniques for mistraining the branch history buffer are well understood at this point.

As a result, long call chains in the kernel are susceptible to speculative-execution attacks. On Intel processors starting with the Skylake generation, the only way to prevent such attacks is to turn on the indirect branch restricted speculation (IBRS) CPU "feature", which was added by Intel early in the Spectre era. IBRS works, but it has the unwelcome side effect of reducing performance by as much as 30%. For some reason, users lack enthusiasm for this solution.

Another way

Gleixner and Zijlstra decided to try a different approach. Speculative execution of return calls on these processors can only be abused if the RSB underflows. So, if RSB underflow can be prevented, this particular problem will go away. And that, it seems, can be achieved by "stuffing" the RSB whenever it is at risk of running out of entries.

That immediately leads to two new challenges: knowing when the RSB is running low, and finding a way to fill it back up. The first piece is handled by tracking the current call-chain depth — in an approximate way. The build system is modified to create a couple of new sections in the executable kernel image to hold entry and exit thunks for kernel functions and to track them. When RSB stuffing is enabled, the entry thunk will be invoked on entry to each function, and the exit thunk will be run on the way out.

The state of the RSB is tracked with a per-CPU, 64-bit value that is originally set to:

    0x8000 0000 0000 0000

The function entry thunk "increments" this counter by right-shifting it by five bits. The processor will sign-extend the value, so the counter will, after the first call, look like:

    0xfc00 0000 0000 0000

If twelve more calls happen in succession, the sign bit will have been extended all the way to the right and the counter will contain all ones, with bits beginning to fall off the right end; this counter thus cannot reliably count above twelve. In this way it mimics the RSB, which cannot hold more than 16 entries, with a safety margin of four calls; the use of shifts achieves that behavior without the need to introduce a branch. Whenever a return thunk is executed, the opposite happens: the counter is left-shifted by five bits. After twelve returns, the next shift will clear the remaining bits, and the counter will have a value of zero, which is the indication that something must be done to prevent the RSB from underflowing.
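
The shift arithmetic is easier to see in a small user-space sketch; the real accounting lives in per-CPU assembly thunks, so this is only an illustration, and it leans on the arithmetic right shift that gcc and clang apply to signed values (matching the sar instruction the kernel uses):

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for the per-CPU call-depth counter. */
    static int64_t depth = INT64_MIN;        /* 0x8000 0000 0000 0000 */

    static void entry_thunk(void)            /* run on function entry */
    {
        depth >>= 5;    /* the sign bit smears rightward */
    }

    static int return_thunk(void)            /* run on function return */
    {
        depth = (int64_t)((uint64_t)depth << 5);
        return depth == 0;  /* zero: the RSB is about to underflow */
    }

    int main(void)
    {
        entry_thunk();
        printf("after 1 call:   %#llx\n", (unsigned long long)depth); /* 0xfc00... */

        for (int i = 0; i < 12; i++)
            entry_thunk();
        printf("after 13 calls: %#llx\n", (unsigned long long)depth); /* all ones */

        for (int i = 0; i < 16; i++)
            if (return_thunk()) {
                printf("return %d would trigger RSB stuffing\n", i + 1);
                break;
            }
        return 0;
    }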

That "something" is a quick series of function calls (coded in assembly and found at the end of this patch) that adds 16 entries to the call stack, and thus to the RSB as well. Each of those calls, if ever returned from, will immediately execute an int3 instruction; that will stop speculation if those return calls are ever executed speculatively. The actual kernel does not want to execute those instructions (or all of those returns), of course, so the RSB-stuffing code increments the real stack pointer past the just-added call frames.

The end result is an RSB that no longer matches the actual call stack, but which is full of entries that will do no harm if speculated into. At this point, the call-depth counter can be set to -1 (all ones in the two's complement representation) to reflect the fact that the RSB is full. The kernel is now safe against Retbleed exploitation — until and unless another chain of twelve returns happens, in which case the RSB will need to be stuffed again.

Costs

Quite a bit of work has been put into minimizing the overhead of this solution, especially on systems where it is not needed. The kernel is built with direct calls to its functions as usual; at boot time, if the retbleed=stuff option is selected, all of those calls will be patched to go through the accounting thunks instead. The thunks themselves are placed in a huge-page mapping to minimize the translation lookaside buffer overhead. Even so, as the cover letter comments, there are costs: "We both unsurprisingly hate the result with a passion".

Those costs come in a few forms. An "impressive" amount of memory is required to hold the thunks and associated housekeeping. The bloating of the kernel has a performance impact of its own, even on systems where RSB stuffing is not enabled. The extra instructions add to pressure on the instruction cache, slowing execution. That last problem could be mitigated somewhat, the cover letter says, by allocating the thunks at the beginning of each function rather than in a separate section. Gleixner has prepared a GCC patch to make that possible, and reports that some of the performance loss is gained back when it is used.

The cover letter contains a long list of benchmark results comparing the performance of RSB stuffing against that of disabling mitigations entirely and of using IBRS. The numbers for RSB stuffing are eye-opening, including a 382% performance regression for one microbenchmark. In all cases, though, RSB stuffing performs better than IBRS.

Better performance than IBRS is only interesting, though, if the primary goal of blocking Retbleed attacks has been achieved. The cover letter says this:

The assumption is that stuffing at the 12th return is sufficient to break the speculation before it hits the underflow and the fallback to the other predictors. Testing confirms that it works. Johannes [Wikner], one of the retbleed researchers, tried to attack this approach and confirmed that it brings the signal to noise ratio down to the crystal ball level.

There is obviously no scientific proof that this will withstand future research progress, but all we can do right now is to speculate about that.

So RSB stuffing seems to work — for now, at least. That should make it attractive in situations where defending against Retbleed attacks is considered to be necessary; hosting providers with untrusted users would be one obvious example. But nobody will be happy with the overhead, even if it is better than IBRS. For a lot of users, RSB stuffing will be seen as a clever hack that, happily, they do not need to actually use.

Comments (34 posted)

Support for Intel's Linear Address Masking

By Jonathan Corbet
July 25, 2022
A 64-bit pointer can address a lot of memory — far more than just about any application could ever need. As a result, there are bits within that pointer that are not really needed to address memory, and which might be put to other needs. Storing a few bits of metadata within a pointer is a common enough use case that multiple architectures are adding support for it at the hardware level. Intel is no exception; support for its "Linear Address Masking" (LAM) feature has been slowly making its way toward the mainline kernel.

CPUs can support this metadata by simply masking off the relevant bits before dereferencing a pointer. Naturally, every CPU vendor has managed to support this feature differently. Arm's top-byte ignore feature allows the most-significant byte of the address to be used for non-pointing purposes; it has been supported by the Linux kernel since 5.4 came out in 2019. AMD's "upper address ignore" feature, instead, only allows the seven topmost bits to be used in this way; support for this feature was proposed earlier this year but has not yet been accepted.

One of the roadblocks in the AMD case is that this feature would allow the creation of valid user-space pointers that have the most-significant bit set. In current kernels, only kernel-space addresses have that bit set, and an unknown amount of low-level code depends on that distinction. The consequences of confusing user-space and kernel-space addresses could be severe and contribute to the ongoing CVE-number shortage, so developers are nervous about any feature that could cause such confusion to happen. Quite a bit of code would likely have to be audited to create any level of confidence that allowing user-space addresses with that bit set would not open up a whole set of security holes.

Intel's LAM feature offers two modes, both of which are different from anybody else's:

  • LAM_U57 allows six bits of metadata in bits 62 to 57.
  • LAM_U48 allows 15 bits of metadata in bits 62 to 48.

It's worth noting that neither of these modes allows bit 63 (the most-significant bit) to be used for this purpose, so LAM avoids the pitfall that has created trouble for AMD.

Support for LAM is added by this patch set from Kirill Shutemov. Since LAM must be enabled in the CPU by privileged code, the patch set introduces a new API in the form of two new arch_prctl() commands that are designed to be able to support any CPU's pointer-metadata mechanism. The first, ARCH_ENABLE_TAGGED_ADDR, enables the use of LAM for the current process; it takes an integer argument indicating how many bits of data the process wishes to store in pointers and selects the mode (from the above set) that holds at least that many.

Programs trying to use this feature need to know where they can store their metadata within a pointer; this needs to happen in a general way if such programs are to be portable across architectures. The second arch_prctl() operation, ARCH_GET_UNTAG_MASK, returns a 64-bit value with bits set to indicate the available space. The patch set also adds a line to each process's arch_status file in /proc indicating the effective mask.
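
A hedged sketch of how a process might use this proposed interface follows. The ARCH_ENABLE_TAGGED_ADDR and ARCH_GET_UNTAG_MASK names come from the patch set, but the numeric command values below are placeholders (the real ones are defined by the unmerged patches and could still change), the calling convention for fetching the mask is assumed to follow the existing ARCH_GET_* commands, and glibc provides no arch_prctl() wrapper, so the raw system call is used:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Placeholder values: the real numbers come from the (unmerged) patch
     * set, so this only works against a kernel carrying those patches. */
    #ifndef ARCH_ENABLE_TAGGED_ADDR
    #define ARCH_ENABLE_TAGGED_ADDR 0x4002
    #endif
    #ifndef ARCH_GET_UNTAG_MASK
    #define ARCH_GET_UNTAG_MASK     0x4001
    #endif

    int main(void)
    {
        /* Ask for six bits of metadata; on LAM hardware that should select
           LAM_U57, which uses bits 57-62 of the pointer. */
        if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL) != 0) {
            perror("ARCH_ENABLE_TAGGED_ADDR");
            return 1;
        }

        /* Assumed to store the mask through its pointer argument, as the
           existing ARCH_GET_* commands do. */
        uint64_t mask = 0;
        syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);
        printf("untag mask: %#llx\n", (unsigned long long)mask);

        /* With LAM_U57 enabled, the CPU ignores bits 57-62 on dereference,
           so a tag can ride along in them. */
        char *p = malloc(16);
        char *tagged = (char *)((uintptr_t)p | (0x2aULL << 57));
        *tagged = 'x';
        printf("wrote 'x' through %p, read back '%c'\n", (void *)tagged, *p);
        free(p);
        return 0;
    }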

The LAM patches are, for the most part, uncontroversial; the LAM feature is seen as being better designed than AMD's equivalent. That said, there are some ongoing concerns about the LAM_U48 mode in particular that may prevent it from being supported in Linux anytime soon.

Linux has supported five-level page tables on x86 systems since the 4.11 release in 2017. On a system with five-level page tables enabled, 57 bits of address space are available to running processes. The kernel will not normally map memory into the upper nine bits of that address space, which is the part added by the fifth page-table level, out of fear of breaking applications; among other things, some programs may be storing metadata in those bits even without hardware support. More care must be taken when applying this trick since the metadata bits must always be masked out before dereferencing a pointer, but it is possible. Programs that do this would obviously break, though, if those bits became necessary to address memory within their address space.

To avoid this kind of problem, the kernel will only map memory into the upper part of the address space if the application explicitly asks for it in an mmap() call. It's a rare application that will need to do that, and should be a relatively easy thing to add in cases where it's necessary.
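
The usual opt-in mechanism is an address hint above the old 47-bit boundary; a rough sketch follows, with illustrative addresses, and the high mapping only materializes on hardware and kernels with five-level page tables:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1 << 20;

        /* Default behavior: no hint, so the mapping stays below 2^47. */
        void *low = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* An explicit hint above the boundary opts this mapping in to the
           upper part of the address space. */
        void *hint = (void *)(1UL << 48);
        void *high = mmap(hint, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("low:  %p\nhigh: %p\n", low, high);
        return 0;
    }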

The LAM_U48 mode, which uses 15 pointer bits and only leaves 48 significant bits for the actual address, is clearly inconsistent with any attempt to use a 57-bit address space. One might argue that any programmer who tries to use both together deserves the resulting explosions, but it is better to simply not provide useless footguns whenever possible. Since the kernel already plays a role in the use of both modes, it is well placed to ensure that they are not used together.

In Shutemov's patch set, the enabling of LAM_U48 is relegated to a set of optional patches at the end; if they are left out, the kernel will only support the LAM_U57 mode, which is certainly one way to solve the problem. If the patches are included, instead, then user space must choose which of the two features it will use (if either); the mode that is selected first wins. If LAM_U48 is enabled, the ability to map memory in the upper nine bits of the address space will be permanently removed from that process. But if the process has already mapped memory there when it tries to enable LAM_U48, the attempt will fail.

It seems like a reasonable solution that would make all of the functionality available and let processes choose which they will actually use, but developers remain concerned about the LAM_U48 mode. Alexander Potapenko suggested that distributors would want to remove this mode if it makes it into the mainline, but that it would become harder to do so over time as other changes land on top of it. Dave Hansen, one of the x86 maintainers, said that he would not merge LAM_U48 immediately, but would consider doing so in the future.

So, while there does not seem to be much to impede the adoption of LAM in general, it is not clear that all of the LAM patches will be merged anytime soon. If there are people with use cases for LAM_U48 out there, this might be a good time to make those use cases known; otherwise they may find that the feature is unavailable to them.

Comments (12 posted)

Docker and the OCI container ecosystem

July 26, 2022

This article was contributed by Jordan Webb

Docker has transformed the way many people develop and deploy software. It wasn't the first implementation of containers on Linux, but Docker's ideas about how containers should be structured and managed were different from its predecessors. Those ideas matured into industry standards, and an ecosystem of software has grown around them. Docker continues to be a major player in the ecosystem, but it is no longer the only whale in the sea — Red Hat has also done a lot of work on container tools, and alternative implementations are now available for many of Docker's offerings.

Anatomy of a container

A container is somewhat like a lightweight virtual machine; it shares a kernel with the host, but in most other ways it appears to be an independent machine to the software running inside of it. The Linux kernel itself has no concept of containers; instead, they are created by using a combination of several kernel features:

  • Bind mounts and overlayfs may be used to construct the root filesystem of the container.
  • Control groups may be used to partition the host's CPU, memory, and I/O resources among containers.
  • Namespaces are used to create an isolated view of the system for processes running inside the container.

Linux's namespaces are the key feature that allow the creation of containers. Linux supports namespaces for multiple different aspects of the system, including user namespaces for separate views of user and group IDs, PID namespaces for distinct sets of process IDs, network namespaces for distinct sets of network interfaces, and several others. When a container is started, a runtime creates the appropriate control groups, namespaces, and filesystem mounts for the container; then it launches a process inside the environment it has created.
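
The sketch below shows those building blocks in miniature. It is not a real runtime; it skips root-filesystem setup, control groups, and most error handling, and simply creates new UTS, PID, and mount namespaces before running a shell inside them. It needs root (or an additional user namespace) to be permitted to do so:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child(void *arg)
    {
        (void)arg;
        sethostname("container", 9);
        /* Keep mount changes private to the new mount namespace. */
        mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
        /* In the new PID namespace, this shell becomes PID 1 (visible in
           /proc only after a fresh proc mount, omitted here for brevity). */
        execlp("/bin/sh", "sh", (char *)NULL);
        perror("execlp");
        return 1;
    }

    int main(void)
    {
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD,
                          NULL);
        if (pid == -1) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }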

There is some level of disagreement about what that process should be. Some prefer to start an init process like systemd and run a full Linux system inside the container. This is referred to as a "system container"; it was the most common type of container before Docker. System containers continue to be supported by software like LXC and OpenVZ.

Docker's developers had a different idea. Instead of running an entire system inside a container, Docker says that each container should only run a single application. This style of container is known as an "application container." An application container is started using a container image, which bundles the application together with its dependencies and just enough of a Linux root filesystem to run it.

A container image generally does not include an init system, and may not even include a package manager — container images are usually replaced with updated versions rather than updated in place. An image for a statically-compiled application may be as minimal as a single binary and a handful of support files in /etc. Application containers usually don't have a persistent root filesystem; instead, overlayfs is used to create a temporary layer on top of the container image. This is thrown away when the container is stopped. Any persistent data outside of the container image is grafted on to the container's filesystem via a bind mount to another location on the host.
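
A rough sketch of that filesystem arrangement is shown below. The directory names are made up for the example: "image" stands in for the read-only image layer, "upper" and "work" hold the writable scratch layer, "merged" is where the container's root appears, and a bind mount grafts persistent data on top. All of the directories are assumed to exist, and the program must run as root:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Temporary writable layer over the read-only image layer. */
        if (mount("overlay", "merged", "overlay", 0,
                  "lowerdir=image,upperdir=upper,workdir=work") == -1) {
            perror("mount overlay");
            return 1;
        }

        /* Persistent data is typically bind-mounted into the tree; the
           paths here are illustrative placeholders. */
        if (mount("/srv/app-data", "merged/var/lib/app", NULL,
                  MS_BIND, NULL) == -1)
            perror("bind mount");
        return 0;
    }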

The OCI ecosystem

These days, when people talk about containers, they are likely to be talking about the style of application containers popularized by Docker. In fact, unless otherwise specified, they are probably talking about the specific container image format, run-time environment, and registry API implemented by Docker's software. Those have all been standardized by the Open Container Initiative (OCI), which is an industry body that was formed in 2015 by Docker and the Linux Foundation. Docker refactored its software into a number of smaller components; some of those components, along with their specifications, were placed under the care of the OCI. The software and specifications published by the OCI formed the seed for what is now a robust ecosystem of container-related software.

The OCI image specification defines a format for container images that consists of a JSON configuration (containing environment variables, the path to execute, and so on) and a series of tarballs called "layers". The contents of each layer are stacked on top of each other, in series, to construct the root filesystem for the container image. Layers can be shared between images; if a server is running several containers that refer to the same layer, they can potentially share the same copy of that layer. Docker provides minimal images for several popular Linux distributions that can be used as the base layer for application containers.

The OCI also publishes a distribution specification. In this context, "distribution" does not refer to a Linux distribution; it is used in a more general sense. This specification defines an HTTP API for pushing and pulling container images to and from a server; servers that implement this API are called container registries. Docker maintains a large public registry called Docker Hub as well as a reference implementation (called "Distribution", perhaps confusingly) that can be self-hosted. Other implementations of the specification include Red Hat's Quay and VMware's Harbor, as well as hosted offerings from Amazon, GitHub, GitLab, and Google.

A program that implements the OCI runtime specification is responsible for everything pertaining to actually running a container. It sets up any necessary mounts, control groups, and kernel namespaces, executes processes inside the container, and tears down any container-related resources once all the processes inside of it have exited. The reference implementation of the runtime specification is runc, which was created by Docker for the OCI.

There are a number of other OCI runtimes to choose from. For example, crun offers an OCI runtime written in C that has the goal of being faster and more lightweight than runc, which, like most of the rest of the OCI ecosystem, is written in Go. Google's gVisor includes runsc, which provides greater isolation from the host by running applications on top of a user-mode kernel. Amazon's Firecracker is a minimal hypervisor written in Rust that can use KVM to give each container its own virtual machine; Intel's Kata Containers works similarly but supports multiple hypervisors (including Firecracker).

A container engine is a program that ties these three specifications together. It implements the client side of the distribution specification to retrieve container images from registries, interprets the images it has retrieved according to the image specification, and launches containers using a program that implements the runtime specification. A container engine provides tools and/or APIs for users to manage container images, processes, and storage.

Kubernetes is a container orchestrator, capable of scheduling and running containers across hundreds or even thousands of servers. Kubernetes does not implement any of the OCI specifications itself. It needs to be used in combination with a container engine, which manages containers on behalf of Kubernetes. The interface that it uses to communicate with container engines is called the Container Runtime Interface (CRI).

Docker

Docker is the original OCI container engine. It consists of two main user-visible components: a command-line-interface (CLI) client named docker, and a server. The server is named dockerd in Docker's own packages, but the repository was renamed moby when Docker created the Moby Project in 2017. The Moby Project is an umbrella organization that develops open-source components used by Docker and other container engines. When Moby was announced, many found the relationship between Docker and the Moby project to be confusing; it has been described as being similar to the relationship between Fedora and Red Hat.

dockerd provides an HTTP API; it usually listens on a Unix socket named /var/run/docker.sock, but can be made to listen on a TCP socket as well. The docker command is merely a client to this API; the server is responsible for downloading images and starting container processes. The client supports starting containers in the foreground, so that running a container at the command-line behaves similarly to running any other program, but this is only a simulation. In this mode, the container processes are still started by the server, and input and output are streamed over the API socket; when the process exits, the server reports that to the client, and then the client sets its own exit status to match.
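
As a small illustration of that client/server split, the sketch below does a fraction of what the docker CLI does: it opens the default API socket and issues a request to /_ping, one of the simplest endpoints in the Engine API. A running dockerd and permission to open the socket are assumed, and error handling is minimal:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };

        if (fd == -1) {
            perror("socket");
            return 1;
        }
        strncpy(addr.sun_path, "/var/run/docker.sock",
                sizeof(addr.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
            perror("connect");
            return 1;
        }

        /* Speak plain HTTP over the Unix socket, just as the CLI does. */
        const char req[] = "GET /_ping HTTP/1.0\r\nHost: docker\r\n\r\n";
        write(fd, req, sizeof(req) - 1);

        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);   /* typically a 200 response with body "OK" */
        }
        close(fd);
        return 0;
    }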

This design does not play well with systemd or other process supervision tools, because the CLI never has any child processes of its own. Running the docker CLI under a process supervisor only results in supervising the CLI process. This has a variety of consequences for users of these tools. For example, any attempt to limit a container's memory usage by running the CLI as a systemd service will fail; the limits will only apply to the CLI and its non-existent children. In addition, attempts to terminate a client process may not result in terminating all of the processes in the container.

Failure to limit access to Docker's socket can be a significant security hazard. By default dockerd runs as root. Anyone who is able to connect to the Docker socket has complete access to the API. Since the API allows things like running a container as a specific UID and binding arbitrary filesystem locations, it is trivial for someone with access to the socket to become root on the host. Support for running in rootless mode was added in 2019 and stabilized in 2020, but is still not the default mode of operation.

Docker can be used by Kubernetes to run containers, but it doesn't directly support the CRI specification. Originally, Kubernetes included a component called dockershim that provided a bridge between the CRI and the Docker API, but it was deprecated in 2020. The code was spun out of the Kubernetes repository and is now maintained separately as cri-dockerd.

containerd & nerdctl

Docker refactored its software into independent components in 2015; containerd is one of the fruits of that effort. In 2017, Docker donated containerd to the Cloud Native Computing Foundation (CNCF), which stewards the development of Kubernetes and other tools. It is still included in Docker, but it can also be used as a standalone container engine, or with Kubernetes via an included CRI plugin. The architecture of containerd is highly modular. This flexibility helps it to serve as a proving ground for experimental features. Plugins may provide support for different ways of storing container images and additional image formats, for example.

Without any additional plugins, containerd is effectively a subset of Docker; its core features map closely to the OCI specifications. Tools designed to work with Docker's API cannot be used with containerd. Instead, it provides an API based on Google's gRPC. Unfortunately, concerned system administrators looking for access control won't find it here; despite being incompatible with Docker's API, containerd's API appears to carry all of the same security implications.

The documentation for containerd notes that it follows a smart client model (as opposed to Docker's "dumb client"). Among other differences, this means that containerd does not communicate with container registries; instead, (smart) clients are required to download any images they need themselves. Despite the difference in client models, containerd still has a process model similar to that of Docker; container processes are forked from the containerd process. In general, without additional software, containerd doesn't do anything differently from Docker; it just does less.

When containerd is bundled with Docker, dockerd serves as the smart client, accepting Docker API calls from its own dumb client and doing any additional work needed before calling the containerd API; when used with Kubernetes, these things are handled by the CRI plugin. Other than that, containerd didn't really have its own client until relatively recently. It includes a bare-bones CLI called ctr, but this is only intended for debugging purposes.

This changed in December 2020 with the release of nerdctl. Since its release, running containerd on its own has become much more practical; nerdctl features a user interface designed to be compatible with the Docker CLI and provides much of the functionality Docker users would find missing from a standalone containerd installation. Users who don't need compatibility with the Docker API might find themselves quite happy with containerd and nerdctl.

Podman

Podman is a Red Hat-sponsored alternative to Docker that aims to be a drop-in replacement for it. Like Docker and containerd, it is written in Go and released under the Apache 2.0 License, but it is not a fork; it is an independent reimplementation. Red Hat's sponsorship of Podman is likely to be at least partially motivated by the difficulties it encountered during its efforts to make Docker's software interoperate with systemd.

On a superficial level, Podman appears nearly identical to Docker. It can use the same container images, and talk to the same registries. The podman CLI is a clone of docker, with the intention that users migrating from Docker can alias docker to podman and mostly continue with their lives as if nothing had changed.

Originally, Podman provided an API based on the varlink protocol. This meant that while Podman was compatible with Docker on a CLI level, tools that used the Docker API directly could not be used with Podman. In version 3.0, the varlink API was scrapped in favor of an HTTP API, which aims to be compatible with the one provided by Docker while also adding some Podman-specific endpoints. This new API is maturing rapidly, but users of tools designed for Docker would be well-advised to test for compatibility before committing to switch to Podman.

As it is largely a copy of Docker's API, Podman's API doesn't feature any sort of access control, but Podman has some architectural differences that may make that less important. Podman gained support for running in rootless mode early on in its development. In this mode, containers can be created without root or any other special privileges, aside from a small bit of help from the newuidmap and newgidmap helpers. Unlike Docker, when Podman is invoked by a non-root user, rootless mode is used by default.

Users of Podman can also dodge security concerns about its API socket by simply disabling it. Though its interface is largely identical to the Docker CLI, podman is no mere API client. It creates containers for itself without any help from a daemon. As a result, Podman plays nicely with tools like systemd; using podman run with a process supervisor works as expected, because the processes inside the container are children of podman run. The developers of Podman encourage people to use it in this way by providing a command to generate systemd units for Podman containers.

Aside from its process model, Podman caters to systemd users in other ways. While running an init system such as systemd inside of a container is antithetical to the Docker philosophy of one application per container, Podman goes out of its way to make it easy. If the program that the container is set to run is an init system, Podman will automatically mount all the kernel filesystems needed for systemd to function. It also supports reporting the status of containers to systemd via sd_notify(), or handing the notification socket off to the application inside of the container for it to use directly.
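
On the application side of that hand-off, the usual pattern looks like the sketch below, assuming libsystemd is available inside the container image (build with -lsystemd): the service calls sd_notify() once its startup work is done, and the message travels over the $NOTIFY_SOCKET that was passed in.

    #include <systemd/sd-daemon.h>

    int main(void)
    {
        /* ... open sockets, load configuration, do other startup work ... */

        /* Report readiness over $NOTIFY_SOCKET; this is a no-op when the
           variable is not set, so the same binary also runs unsupervised. */
        sd_notify(0, "READY=1");

        /* ... main service loop ... */
        return 0;
    }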

Podman also has some features designed to appeal to Kubernetes users. Like Kubernetes, it supports the notion of a "pod", which is a group of containers that share a common network namespace. It can run containers using Kubernetes configuration files and also generate Kubernetes configurations. However, unlike Docker and containerd, there is no way for Podman to be used by Kubernetes to run containers. This is a deliberate omission. Instead of adding CRI support to Podman, which is a general-purpose container engine, Red Hat chose to sponsor the development of a more specialized alternative in the form of CRI-O.

CRI-O

CRI-O is based on many of the same underpinnings as Podman. So the relationship between CRI-O and Podman could be said to be similar to the one between containerd and Docker; CRI-O delivers much of the same technology as Podman, with fewer frills. This analogy doesn't stretch far, though. Unlike containerd and Docker, CRI-O and Podman are completely separate projects; one is not embedded by the other.

As might be suggested by its name, CRI-O implements the Kubernetes CRI. In fact, that's all that it implements; CRI-O is built specifically and only for use with Kubernetes. It is developed in lockstep with the Kubernetes release cycle, and anything that is not required by the CRI is explicitly declared to be out of scope. CRI-O cannot be used without Kubernetes and includes no CLI of its own; based on the stated goals of the project, any attempt to make CRI-O suitable for standalone use would likely be viewed as an unwelcome distraction by its developers.

Like Podman, the development of CRI-O was initially sponsored by Red Hat; like containerd, it was later donated to the CNCF in 2019. Although they are now both under the aegis of the same organization, the narrow focus of CRI-O may make it more appealing to Kubernetes administrators than containerd. The developers of CRI-O are free to make decisions solely on the basis of maximizing the benefit to users of Kubernetes, whereas the developers of containerd and other container engines have many other types of users and use cases to consider.

Conclusion

These are just a few of the most popular container engines; other projects like Apptainer and Pouch cater to different ecological niches. There are also a number of tools available for creating and manipulating container images, like Buildah, Buildpacks, skopeo, and umoci. Docker deserves a great deal of credit for the Open Container Initiative; the standards and the software that have resulted from this effort have provided the foundation for a wide array of projects. The ecosystem is robust; should one project shut down, there are multiple alternatives ready and available to take its place. As a result, the future of this technology is no longer tied to one particular company or project; the style of containers that Docker pioneered seems likely to be with us for a long time to come.

Comments (48 posted)

Page editor: Jonathan Corbet


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds