Leading items
Welcome to the LWN.net Weekly Edition for February 20, 2020
This edition contains the following feature content:
- Debian discusses how to handle 2038: now that the kernel work to support 32-bit systems past 2038 is mostly done, what will be required to get Debian into shape?
- Finer-grained kernel address-space layout randomization: a new KASLR proposal surfaces.
- Revisiting stable-kernel regressions: how many patches distributed in stable kernels bring problems with them?
- Keeping secrets in memfd areas: a new API to manage memory with secret contents.
- Filesystem UID mapping for user namespaces: yet another shiftfs: a third approach to the problem of shifting user and group IDs for filesystem access within user namespaces.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Debian discusses how to handle 2038
At this point, most of the kernel work to avoid the year-2038 apocalypse has been completed. Said apocalypse could occur when time counted in seconds since 1970 overflows a 32-bit signed value (i.e. time_t). Work in the GNU C Library (glibc) and other C libraries is well underway as well. But the "fun" is just beginning for distributions, especially those that support 32-bit architectures, as a recent Debian discussion reveals. One of the questions is: how much effort should be made to support 32-bit architectures as they fade from use and 2038 draws nearer?
Steve McIntyre started the conversation with a post to the debian-devel mailing list. In it, he noted that Arnd Bergmann, who was copied on the email, had been doing a lot of the work on the kernel side of the problem, but that it is mostly a solved problem for the kernel at this point. McIntyre and Bergmann (not to mention Debian as a whole) are now interested in what is needed to update a complete Linux system, such as Debian, to work with a 64-bit time_t.
McIntyre said that glibc has been working on an approach that splits the problem up based on the architecture targeted. Those that already have a 64-bit time_t will simply have a glibc that works with that ABI. Others that are transitioning from a 32-bit time_t to the new ABI will continue to use the 32-bit version by default in glibc. Applications on the latter architectures can request the 64-bit time_t support from glibc, but then they (and any other libraries they use) will only get the 64-bit versions of the ABI.
One thing that glibc will not be doing is bumping its SONAME (major version, essentially); doing so would make it easier to distinguish versions with and without the 64-bit support for 32-bit architectures. The glibc developers do not consider the change to be an ABI break, because applications have to opt into the change. It would be difficult and messy for Debian to change the SONAME for glibc on its own.
Moving forward with the 32-bit time values will, obviously, run into year-2038 problems, so it makes sense for a system like Debian to request 64-bit time support—except that there is a lot of code out there that will not work at all with the newer ABI. Some of it is in binary form, proprietary applications of various sorts (e.g. games), but there are plenty of problems for the open-source code as well.
Bergmann scanned the libraries in Debian and "identified that about one third of our library packages would need rebuilding (and tracking) to make a (recursive) transition", McIntyre said. He outlined two ways forward that they had come up with.
The first is to rename libraries that need to be fixed to support 64-bit time values so that there could be two versions of them that could both be installed on a single system. That entails fixing a bunch of packages and rebuilding lots of code. "This effort will be *needed* only for the sake of our 32-bit ports, but would affect *everybody*."
The second is to decide which of the 32-bit architectures Debian supports will actually be viable in 2038 and to create new versions of those architectures with different names. There would be two versions of those architecture ports active for, probably, one release, and users would not be able to simply upgrade from one to the other. But it would reduce the impact.
He and Bergmann think the second option is the right way to go, but McIntyre was soliciting input from others in the project. Ansgar Burchardt replied that the i386 port, at least, probably should not even take the second path. It should simply continue using 32-bit time values.
If that is the chosen direction for i386, Russ Allbery suggested that the "cross-grading" feature be fully supported. Cross-grading would allow a Debian system to be upgraded to a new architecture, but it is currently not for the faint of heart.
But Burchardt is not convinced there will be enough i386 Debian systems to matter in even ten years. He showed some numbers from Debian's popularity contest (popcon) on i386 versus amd64 that suggest i386 will only make up around 0.2% of the total in ten years. "For just avoiding the Y2038 problem, i386 might sort itself out already without additional intervention." McIntyre thought that cross-grading support might make an interesting project for an intern, however.
There is work going on in glibc to provide both 32- and 64-bit interfaces simultaneously, Guillem Jover said, which could be used to provide a smoother transition without needing a SONAME change. Bergmann said that the glibc work is proceeding, but he was not sure that it would actually help with the transition that much; the problem is that on the scale of a whole distribution it "adds so much extra work and time before completion that there won't be many people left to use it by the time the work is done. ;-)".
Jover pointed to the large-file support (LFS) transition as something of a model, though he cautioned that "the LFS 'transition' is not one we should be very proud of, as it's been dragging on for a very long time, and it's not even close to be finished". LFS allows applications to handle files with sizes larger than a 32-bit quantity can hold, which parallels the time_t situation. Jover suggested that fully enabling LFS as part of the time_t transition might make sense. Bergmann said that LFS support is "a done deal", as both glibc and musl libc require using 64-bit file sizes (off_t) when 64-bit time values are used.
Lennart Sorensen also saw parallels to the LFS transition, but Ben Hutchings thought otherwise:
Similarly, every program that uses wall-clock time will fail as we approach 2038, and the failure mode is likely to be even worse than with LFS as few programs will check for errors from time APIs.
YunQiang Su suggested a third option to consider: define 64-bit versions of all of the affected interfaces throughout the distribution by adding "64" to the new function and data-structure names, then modify packages to use those interfaces over time. A deadline of, say, 2030 could be set; after that, anything using the older interfaces would be considered to have a release-critical bug. McIntyre said that it would be technically possible to do so, but there are some major downsides:
2030 is also too late to fix the problem - people are already starting to see issues in some applications. We want to make a difference soon: the earlier that we have fixes available, the fewer broken systems will have to be fixed / removed / replaced later.
Most who commented in the thread seem to see i386 as a dying architecture, at least for Debian on the time scale being considered here (18 years); Florian Weimer summed it up as well.
Bergmann generally agreed with that assessment, but thought it made sense to see how things go for a different 32-bit architecture that probably needs to have a user space that supports 64-bit time_t: armhf. Once that work is done, it may be straightforward to apply it to i386 if that is deemed useful—or a decision could be made to phase out i386 sometime before 2038.
But Marco d'Itri wondered why the picture for armhf was different from that of i386. Bergmann pointed out that armhf is being used for a lot of embedded systems and that is likely to continue. McIntyre also noted that the Civil Infrastructure Platform is based on Debian and is often used with armhf. Bergmann gave a summary of some research he did on the use of the various Arm architecture versions, some of which he expects to still be in use, in 32-bit form, in 2038 and beyond.
The consensus seems to be that the second option, a new architecture name that indicates 64-bit time support, is the right way to go and that armhf makes the most sense as a starting point; in his reply to Jover, Bergmann summed it up the same way.
McIntyre is hoping to get started on the work soon, so that an armhf port for 64-bit time might perhaps be released with Debian 11 ("bullseye"), which is presumably coming in mid-2021. Bergmann also noted that Adélie Linux has been working on porting its user space for 64-bit times using musl libc; it has a list of open issues that were found.
Even though the 32-bit architectures were largely the focus of the discussion, there is still quite a bit of work that needs to be done for 64-bit systems as well. While the C libraries will soon fully support 64-bit time_t values, they will also require them for the time interfaces; there is plenty of code assuming 32-bit time values buried in user space, but those assumptions will presumably be found and fixed over the next few years. We may still see a glitch or three in January 2038 (or before), but it seems like the Linux world, at least, will be pretty well prepared long before the appointed hour.
Finer-grained kernel address-space layout randomization
The idea behind kernel address-space layout randomization (KASLR) is to make it harder for attackers to find code and data of interest to use in their attacks by loading the kernel at a random location. But a single random offset is used for the placement of the kernel text, which presents a weakness: if the offset can be determined for anything within the kernel, the addresses of other parts of the kernel are readily calculable. A new "finer-grained" KASLR patch set seeks to remedy that weakness for the text section of the kernel by randomly reordering the functions within the kernel code at boot time.
Kristen Carlson Accardi posted an RFC patch set that implemented a proof-of-concept for finer-grained KASLR in early February. She identified three weaknesses of the existing KASLR:
- low entropy in the randomness that can be applied to the kernel as a whole
- the leak of a single address can reveal the random offset applied to the kernel, thus revealing the rest of the addresses
- the kinds of information leaks needed to reveal the offset abound
"This patch set rearranges your kernel code at load time on a per-function level granularity, with only around a second added to boot time," she wrote.
The changes required are in two main areas. When the kernel is built, a GCC option (-ffunction-sections) is used to place each function in its own .text section. The relocation addresses can be used to allow shuffling the text sections as the kernel is loaded, just after it is decompressed. There are, she noted, tables of addresses in the kernel for things like exception handling and kernel probes (kprobes), but those can be handled too.
The second area of changes is in the loading of the kernel into memory; the boot process was changed to parse the vmlinux ELF file to retrieve the key symbols and collect up a list of .text.* sections to be reordered. The function order is then randomized and any tables are updated as needed.
For debugging the proof-of-concept, a pseudo-random-number generator (PRNG) was used so that the same order could be generated by giving it the identical seed. The patch adding the PRNG, which was authored by Kees Cook, might provide some performance benefits, but Andy Lutomirski objected to using a new, unproven algorithm; he suggested using a deterministic random bit generator (DRBG), such as ChaCha20. Similarly, Jason A. Donenfeld was concerned that the random-number sequence could be predicted from just a few leaked address values, which might defeat the purpose of the feature. Cook said that using ChaCha20 instead was a better idea moving forward.
The patch set removes access to the /proc/kallsyms file, which lists addresses of kernel symbols, for non-root users. Currently kallsyms simply gives addresses of all zeroes when non-root users read it, but the list of symbols is given in the order they appear in the kernel text; that would give away the randomized layout of the kernel, so access was disabled. Cook pointed out that making the kallsyms file unreadable has, in the past, "seemed to break weird stuff in userspace". He suggested either sorting the symbol names alphabetically in the output—or perhaps just waiting to see if there were any complaints.
Impacts
Accardi measured the impact on boot time in a VM and found that it took roughly one second longer to boot, which is fairly negligible for many use cases. The run-time performance is harder to characterize; the all-important kernel build benchmark was about 1% slower than building on the same kernel with just KASLR enabled. Some other workloads performed much worse, "while others stayed the same or were mysteriously better". It probably is greatly dependent on the code flow for the workload, which might make for an area to research in the future; optimizing the function layout for the workload has been shown [PDF] to have a positive effect on performance.
Adding the extra information to the vmlinux ELF file to support finer-grained KASLR increases its size, but there is a much bigger effect from the need to increase the boot heap size. Randomizing the addresses of the sections requires a much bigger heap, 64MB, than current boot heaps (64KB for all compressors except bzip2, which needs 4MB). The problem is that a larger boot heap ends up increasing the size of the kernel image by adding a zero-filled section to accommodate the heap.
One of Cook's patches, which was included in Accardi's patch set, seeks to remedy that problem, but it turned out that the underlying cause was a bug in how the sections in the kernel object are laid out. Arvind Sankar pointed to his patch set from January that would fix the bug, which Cook thought was a much better solution.
Lutomirski also suggested that the sort mechanism being used on the symbol names was too expensive; the swap function being used in the sort() call did quite a bit of unneeded work that could be avoided if a bit more memory were available.
Cook said that he thought there were a number of areas where the tradeoff of memory versus speed needs to be considered. The amount of memory being used by the proof-of-concept is much greater than he expected (58MB in his tests). One of the problems there is that the version of free() used when decompressing the kernel image does not actually free any memory. But Accardi thought that the boot latency of a second or so was not likely to deter those who are interested in having the protection—boot-time minimalists are not likely to use finer-grained KASLR anyway, she said.
Security and alignment
In the cover letter, Accardi analyzed the security properties of the patch set, noting that information leaks are often considered to require local access to the system, but that CVE-2019-0688 demonstrated a remote address leak for Windows. The patch set assumes that information leaks are plentiful, so it is trying to make it harder for attackers even in the presence of these leaks. Quantifying the added difficulty is dependent on a number of factors.
She suggested that other alignments could be considered down the road and that execute-only memory (XOM), if it lands, would make the finer-grained technique more effective against certain kinds of attacks. Function sections could perhaps simply be byte-aligned and padded with INT3 instructions, so that a wrong guess would trigger a trap. But the required alignment of functions on Intel processors is somewhat more complicated. Cook said that 16-byte function alignment, as it is now in the kernel, is wasting some space (and some entropy in the function start addresses) when using finer-grained KASLR.
Jann Horn pointed out that Intel recommends 16-byte alignment for branch targets; other alignments might result in less efficient calls. Sankar noted that the current alignment is not that detrimental to the entropy, but Lutomirski said there is another thing to consider.
He also said that there is a bug with some Intel processors that cannot handle certain kinds of jump instructions that span a cache-line boundary. Peter Zijlstra looked at the erratum document [PDF] and thought it implied a need for 32-byte alignment for functions. Handling that may actually require a change to the kernel overall, Cook thought.
The reaction to the idea of finer-grained KASLR was generally positive. No objections to the goals or the techniques used (at a high level) were heard, anyway. It seems like a nice incremental improvement to KASLR. It can also coexist with various control-flow integrity (CFI) measures that are working their way upstream. As Accardi noted, the idea is not new and there has been quite a bit of research into it. OpenBSD uses a similar technique to randomize its kernel at boot time, for example. There is more work to do, of course, but it would not be a surprise to see finer-grained KASLR in the mainline sometime this year.
Revisiting stable-kernel regressions
Stable-kernel updates are, unsurprisingly, supposed to be stable; that is why the first of the rules for stable-kernel patches requires them to be "obviously correct and tested". Even so, for nearly as long as the kernel community has been producing stable update releases, said community has also been complaining about regressions that make their way into those releases. Back in 2016, LWN did some analysis that showed the presence of regressions in stable releases, though at a rate that many saw as being low enough. Since then, the volume of patches showing up in stable releases has grown considerably, so perhaps the time has come to see what the situation with regressions is with current stable kernels.
As an example of the number of patches going into the stable kernel updates, consider that, as of 4.9.213, 15,648 patches have been added to the original 4.9 release — that is an entire development cycle's worth of patches added to a "stable" kernel. Reviewing all of those to see whether each contains a regression is not practical, even for the maintainers of the stable updates. But there is an automated way to get a sense for how many of those stable-update patches bring regressions with them.
The convention in the kernel community is to add a Fixes tag to any patch fixing a bug introduced by another patch; that tag includes the commit ID for the original, buggy patch. Since stable kernel releases are supposed to be limited to fixes, one would expect that almost every patch would carry such a tag. In the real world, about 40-60% of the commits to a stable series carry Fixes tags; the proportion appears to be increasing over time as the discipline of adding those tags improves.
It is a relatively straightforward task (for a computer) to look at the Fixes tag(s) in any patch containing them, extract the commit IDs of the buggy patches, and see if those patches, too, were added in a stable update. If so, it is possible to conclude that the original patch was buggy and caused a regression in need of fixing. There are, naturally, some complications, including the fact that stable-kernel commits have different IDs than those used in the mainline (where all fixes are supposed to appear first); associating fixes with commits requires creating a mapping between the two. Outright reverts of buggy patches tend not to have Fixes tags, so they must be caught separately. And so on. The end result will necessarily contain some noise, but there is a useful signal there as well.
For the curious, this analysis was done with the stablefixes tool, part of the gitdm collection of repository data-mining hacks. It can be cloned from git://git.lwn.net/gitdm.git.
Back in 2016, your editor came up with a regression rate of at least 2% for the longer-term stable kernels that were maintained at that time. The 4.4 series, which had 1,712 commits then, showed a regression rate of at least 2.3%. Since then, the number of commits has grown considerably — to 14,211 in 4.4.213 — as a result of better discipline and the use of automated tools (including a machine-learning system) to select fixes that were not explicitly earmarked for stable backporting. Your editor fixed up his script, ported it to Python 3, and reran the analysis for the currently supported stable kernels; the results look like this:
Series     Commits   Tags          Fixes   Reverts
5.4.18       2,423   1,482 (61%)      74        29
4.19.102    11,758   5,647 (48%)     588       100
4.14.170    15,527   6,727 (43%)     985       134
4.9.213     15,647   6,286 (40%)     951       139
4.4.213     14,210   5,110 (36%)     834       124
In the above table, Series identifies the stable kernel that was looked at. Commits is the number of commits in that series, while Tags is the number and percentage of those commits with a Fixes tag. The count under Fixes is the number of commits in that series that are explicitly fixing another commit applied to that series. Reverts is the number of those fixes that were outright reverts; a famous person might once have said that reversion is the sincerest form of patch criticism.
Looking at those numbers would suggest that, for example, 3% of the commits in 5.4.18 are fixing other commits, so the bad commit rate would be a minimum of 3%. The situation is not actually that simple, though, for a few reasons. One of those is that a surprising number of the regression fixes appear in the same stable release as the commits they are fixing. In a case like that, while the first commit can indeed be said to have introduced a regression, no stable release actually contained the regression and no user will have ever run into it. Counting those is not entirely fair. If one subtracts out the same-release fixes, the results look like this:
Series     Fixes   Same release   Visible regressions
5.4.18        74             29                    45
4.19.102     588            176                   412
4.14.170     985            253                   732
4.9.213      951            229                   722
4.4.213      834            232                   602
Another question to keep in mind is what to do with all those commits without Fixes tags. Many of them are certainly fixes for bugs introduced in other patches, but nobody went to the trouble of figuring out how the bugs happened. If the numbers in the table above are taken as the total count of regressions in a stable series, that implies that none of the commits without Fixes tags are fixing regressions, which will surely lead to undercounting regression fixes overall. On the other hand, if one assumes that the untagged commits contain regression fixes in the same proportion as the tagged ones, the result could well be a count that is too high.
Perhaps the best thing that can be done is to look at both numbers, with a reasonable certainty that the truth lies somewhere between them:
Series     Visible regressions   Rate (low)   Rate (high)
5.4.18                      45         1.9%          3.0%
4.19.102                   412         3.5%          7.3%
4.14.170                   732         4.7%         10.9%
4.9.213                    722         4.6%         11.5%
4.4.213                    602         4.2%         11.8%
So that is about as good as the numbers are going to get, though there are still some oddball issues. Consider the case of mainline commit 4abb951b73ff ("ACPICA: AML interpreter: add region addresses in global list during initialization"). This commit included a "Cc: stable@vger.kernel.org" tag, so it was duly included (as commit 22083c028d0b) in the 4.19.2 release. It was then reverted in 4.19.3, with the complaint that it didn't actually fix a bug but did cause regressions. This same change returned in 4.19.6 after an explicit request. Then, two commits followed in 4.19.35: commit d4b4aeea5506 addressed a related issue and named the original upstream commit in a Fixes tag, while f8053df634d4 claimed to be the original upstream commit, which had already been applied. That last one looks like a fix for a partially done backport. How does one try to account for a series of changes like that? Honestly, one doesn't even try.
So what can we conclude from all this repository digging? The regression rates seen in 2016 were quite a bit lower than what we are seeing now; that would suggest that the increasing volume of patches being applied to the stable trees is not just increasing the number of regressions, but also the rate of regressions. That is not a good sign. On the other hand, the amount of grumbling about stable regressions seems to have dropped recently. Perhaps that's just because people have gotten used to the situation. Or perhaps the worst problems, such as filesystem-destroying regressions, are no longer getting through, while the problems that do slip through are relatively minor.
Newer kernels have a visibly lower regression rate than the older ones. There are two equally plausible explanations for that. Perhaps the process of selecting patches for stable backporting is getting better, and fewer regressions are being introduced than were before. Or perhaps those kernels just haven't been around for long enough for all of the regressions already introduced to be found and fixed yet. The 2016 article looked at 4.4.14, which had 39 regression fixes (19 fixed in the same release). 4.4.213 now contains 110 fixes for regressions introduced in 4.4.14 or earlier (still 19 fixed in the same release). So there is ample reason to believe that the regression rate in 5.4.18 is higher than indicated above.
In any case, it seems clear that the push to get more and more fixes into the stable trees is unlikely to go away anytime soon. And perhaps that is a good thing; a stable tree with thousands of fixes and a few regressions may still be far more stable than one without all those patches. Even so, it would be good to keep an eye on the regression rate; if that is allowed to get too high, the result is likely to be users moving away from stable updates, which is definitely not the desired result.
Keeping secrets in memfd areas
Back in November 2019, Mike Rapoport made the case that there is too much address-space sharing in Linux systems. This sharing can be convenient and good for performance, but in an era of advanced attacks and hardware vulnerabilities it also facilitates security problems. At that time, he proposed a number of possible changes in general terms; he has now come back with a patch implementing a couple of address-space isolation options for the memfd mechanism. This work demonstrates the sort of features we may be seeing, but some of the hard work has been left for the future.

Sharing of address spaces comes about in a number of ways. Linux has traditionally mapped the kernel's address space into every user-space process; doing so improves performance in a number of ways. This sharing was thought to be secure for years, since the mapping doesn't allow user space to actually access that memory. The Meltdown and Spectre hardware bugs, though, rendered this sharing insecure; thus kernel page-table isolation was merged to break that sharing.
Another form of sharing takes place in the processor's memory caches; once again, hardware vulnerabilities can expose data cached in this shared area. Then there is the matter of the kernel's direct map: a large mapping (in kernel space) that contains all of physical memory. This mapping makes life easy for the kernel, but it also means that all user-space memory is shared with the kernel. In other words, an attacker with even a limited ability to run code in the kernel context may have easy access to all memory in the system. Once again, in an era of speculative-execution bugs, that is not necessarily a good thing.
The memfd subsystem wasn't designed for address-space isolation; indeed, its initial purpose was as a sort of interprocess communication mechanism. It does, however, provide a way to create a memory region attached to a file descriptor with specific characteristics; a memfd can be "sealed", for example, so that a recipient knows that it will not be changed. Rapoport decided that it would be a good foundation on which to build a "secret memory" feature.
Actually creating an isolated memory area requires passing a new flag to memfd_create() called MFD_SECRET. That, however, doesn't describe how this secrecy should be implemented. There are a number of options that offer varying levels of security and performance degradation, so the user has to make a decision. The available options, as implemented in the patch, could easily have been specified directly to memfd_create() with their own flags, but Rapoport decided to require the use of a separate ioctl() call instead. Until the secrecy mode has been specified with this call, the user cannot map the memfd, and thus cannot actually make use of it.
There are two modes implemented so far; the first of them, MFD_SECRET_EXCLUSIVE, does a number of things to hide the memory attached to the memfd from prying eyes. That memory is marked as being unevictable, for example, so it will never be flushed out to swap. The effect is similar to calling mlock(), but with a couple of differences: pages are not actually allocated until they are faulted in, and the limit on the number of locked pages appears to be (perhaps by mistake) implemented separately from the limits imposed by mlock(). There is also no way to unlock pages except by destroying the memfd, which requires unmapping it and closing its file descriptor.
The other thing done by MFD_SECRET_EXCLUSIVE is to remove the pages used by the memfd from the kernel's direct map, making it inaccessible from kernel space. The problem with this is that the direct map is normally set up using huge pages, which makes accessing it far more efficient. Removing individual (small) pages forces huge pages to be broken apart into lots of small pages, slowing the system for everybody. The current code (admittedly a proof of concept) allocates each page independently when it is faulted in, which seems likely to maximize the damage done to the direct mapping. That will need to change before this feature could be seriously considered for merging.
The other mode, MFD_SECRET_UNCACHED does everything MFD_SECRET_EXCLUSIVE does, but also causes the memory to be mapped with caching disabled. That will prevent its contents from ever living in the processor's memory caches, rendering it inaccessible to exploits that use any of a number of hardware vulnerabilities. It also makes access to that memory far slower in general, to the point that it may seem inaccessible to the intended user as well. For small amounts of infrequently accessed data (cryptographic keys, for example) it may be a useful option, though.
In its current form, the feature only allows one mode to be selected. In truth, though, MFD_SECRET_UNCACHED is a strict superset of MFD_SECRET_EXCLUSIVE, so that is not currently a problem. Rapoport suggests that this whole API could change in the future, with an alternative being "something like 'secrecy level' from 'a bit more secret than normally' to 'do your best even at the expense of performance'".
Part of the purpose behind this posting was to get comments on the proposed API, but those have not been forthcoming so far. This may be one of those projects that has to advance further — and get closer to being merge-ready — before developers will take notice. But at least the work itself is not a secret anymore, so interested users can start to think about whether it meets their needs or not.
Filesystem UID mapping for user namespaces: yet another shiftfs
The idea of an ID-shifting virtual filesystem that would remap user and group IDs before passing requests through to an underlying real filesystem has been around for a few years but has never made it into the mainline. Implementations have taken the form of shiftfs and shifting bind mounts. Now there is yet another approach to the problem under consideration; this one takes a theoretically simpler approach that makes almost no changes to the kernel's filesystem layer at all.
ID-shifting filesystems are meant to be used with user namespaces, which have a number of interesting characteristics; one of those is that there is a mapping between user IDs within the namespace and those outside of it. Normally this mapping is set up so that processes can run as root within the namespace without giving them root access on the system as a whole. A user namespace could be configured so that ID zero inside maps to ID 10000 outside, for example; ranges of IDs can be set up in this way, so that ID 20 inside would be 10020 outside. User namespaces thus perform a type of ID shifting now.
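The range-based translation described above can be sketched in a few lines of Python; this is an illustration of the semantics only (the helper name is hypothetical; the kernel performs this lookup internally):

```python
def map_id(inside_id, ranges):
    """Translate an in-namespace ID to an outside ID.

    Each range is an (inside_start, outside_start, count) triple,
    mirroring the lines written to a namespace's uid_map file.
    """
    for inside_start, outside_start, count in ranges:
        if inside_start <= inside_id < inside_start + count:
            return outside_start + (inside_id - inside_start)
    return None  # unmapped IDs appear as the overflow ID (usually 65534)

# A namespace where inside ID 0 maps to outside ID 10000, for 100 IDs:
uid_map = [(0, 10000, 100)]
print(map_id(0, uid_map))   # root inside is ID 10000 outside
print(map_id(20, uid_map))  # ID 20 inside is ID 10020 outside
```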
In systems where user namespaces are in use, it is common to set them up with non-overlapping ranges of IDs as a way of providing isolation between containers. But often complete isolation is not desired. James Bottomley's motivation for creating shiftfs was to allow processes within a user namespace to have root access to a specific filesystem. Christian Brauner, the author of the current patch set, instead describes a use case where multiple containers have access to a shared filesystem and need to be able to access that filesystem with the same user and group IDs. Either way, the point is to be able to set up a mapping for user and group IDs that differs from the mapping established in the namespace itself.
Shiftfs was a virtual filesystem that would pass operations through to an underlying filesystem while remapping (by applying a constant offset) the user and group IDs involved. The later bind-mount implementation did away with the separate filesystem and made the shifting a property of the mount itself. Brauner's approach, apparently sketched out at the 2019 Linux Plumbers Conference, is different; it makes the shifting a property of the user namespace itself.
Processes in Linux, as in any Unix-like system, have associated user and group IDs. It is tempting to think that these IDs control access to files, but that is not quite true; instead, Linux maintains a separate user and group ID for filesystem access. These IDs can be changed (by an appropriately privileged process) using the setfsuid() and setfsgid() system calls. This feature is rarely used, so the filesystem user and group IDs are normally the same as the regular IDs, but the mechanism to separate the two sets of IDs has been there since nearly the beginning.
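The separate filesystem UID is easy to observe from user space; the following is a minimal, Linux-specific sketch that calls the glibc setfsuid() wrapper through ctypes. Since setfsuid() returns the previous filesystem UID, and setting the fsuid to one's own effective UID is always permitted, this is a safe way to read the current value:

```python
import ctypes
import os

# Load glibc; setfsuid() is declared in <sys/fsuid.h> (Linux-specific).
libc = ctypes.CDLL("libc.so.6", use_errno=True)

# setfsuid() returns the previous filesystem UID.  Passing our own
# effective UID is a no-op for a normal process, since the fsuid
# follows the effective UID unless explicitly changed.
previous_fsuid = libc.setfsuid(os.geteuid())
print(previous_fsuid == os.geteuid())  # normally True
```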
The implementation of user namespaces necessarily understands these filesystem IDs (FSIDs), but that understanding has never been exposed outside the kernel. Brauner's patch set works by making the FSIDs visible and explicit, allowing them to be mapped independently of the normal IDs. In particular, it creates two new files (fsuid_map and fsgid_map) under the /proc directory for each process running inside a user namespace. These behave like the existing uid_map and gid_map files, in that they accept one or more ranges of IDs to remap, but they affect the FSIDs instead.
So, for example, a system administrator can, on current systems, map 100 user IDs starting at zero inside the container to the range 10,000-10,099 outside by writing this line to uid_map:
0 10000 100
By default, this mapping will also affect that namespace's FSIDs. But if the FSIDs should be mapped differently, say to a range starting at 20,000, then the administrator could write this to fsuid_map:
0 20000 100
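With the two maps configured this way, the same in-namespace ID resolves to different outside IDs depending on context. A hypothetical sketch of the lookup (the kernel does this internally; the helper name is invented for illustration):

```python
def map_id(inside_id, ranges):
    # Each range is (inside_start, outside_start, count), mirroring
    # the lines written to the /proc map files.
    for inside_start, outside_start, count in ranges:
        if inside_start <= inside_id < inside_start + count:
            return outside_start + (inside_id - inside_start)
    return None

uid_map = [(0, 10000, 100)]    # credentials, ownership as seen outside
fsuid_map = [(0, 20000, 100)]  # filesystem accesses only

# In-namespace root (ID 0) appears as ID 10000 in, say, ps output,
# but its filesystem accesses are performed as outside ID 20000.
print(map_id(0, uid_map))    # 10000
print(map_id(0, fsuid_map))  # 20000
```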
This mechanism is conceptually simpler than the ideas that came before, though it still requires a 24-part patch series to implement. It keeps all of the ID mapping in the same place and doesn't require special filesystem or mount types. So there is definitely something to like here.
There is, though, a significant limitation in this implementation: the FSID mappings are global, and affect all of a container's filesystem activity, regardless of which filesystem is being accessed. The shiftfs or bind-mount approaches, instead, can be set up on a per-filesystem basis. Whether this loss of flexibility matters will depend on the specific use case in question; it seems likely that some users will want the ability to configure access to different filesystems differently. Adding that ability by way of the FSID mechanism may well be a complex task.
Thus far, though, no potential users have spoken up to request this capability. This patch set is young, with the second revision having only just been posted, so it's possible that many users with an interest in this area have not yet encountered it. The third time might be the charm for this sort of ID-shifting capability, but to assume that to be the case would be premature.
Page editor: Jonathan Corbet
