
Leading items

Welcome to the LWN.net Weekly Edition for February 20, 2020

This edition contains the following feature content:

  • Debian discusses how to handle 2038: how much effort to put into keeping 32-bit architectures working past the year-2038 deadline.
  • Finer-grained kernel address-space layout randomization: randomly reordering the kernel's functions at boot time.
  • Revisiting stable-kernel regressions: how many regressions are making it into stable kernel releases?
  • Keeping secrets in memfd areas: address-space isolation options for the memfd mechanism.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Debian discusses how to handle 2038

By Jake Edge
February 19, 2020

At this point, most of the kernel work to avoid the year-2038 apocalypse has been completed. Said apocalypse could occur when time counted in seconds since 1970 overflows a 32-bit signed value (i.e. time_t). Work in the GNU C Library (glibc) and other C libraries is well underway as well. But the "fun" is just beginning for distributions, especially those that support 32-bit architectures, as a recent Debian discussion reveals. One of the questions is: how much effort should be made to support 32-bit architectures as they fade from use and 2038 draws nearer?
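The arithmetic behind the "apocalypse" is easy to check; a quick illustrative sketch (not anyone's actual code):

```python
import datetime

INT32_MAX = 2**31 - 1

def time32_overflow_moment():
    # The instant a signed 32-bit time_t runs out: 2^31 - 1 seconds
    # after the 1970-01-01 Unix epoch.
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(seconds=INT32_MAX)

def wrap32(seconds):
    # Simulate what a signed 32-bit counter does one tick later:
    # wrap around to -2^31, which lands in December 1901.
    return (seconds + 2**31) % 2**32 - 2**31

print(time32_overflow_moment())   # 2038-01-19 03:14:07+00:00
print(wrap32(INT32_MAX + 1))      # -2147483648
```

So the deadline is precisely 03:14:07 UTC on January 19, 2038; a second later, a 32-bit time_t holds a large negative number.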

Steve McIntyre started the conversation with a post to the debian-devel mailing list. In it, he noted that Arnd Bergmann, who was copied on the email, had been doing a lot of the work on the kernel side of the problem, but that it is mostly a solved problem for the kernel at this point. McIntyre and Bergmann (not to mention Debian as a whole) are now interested in what is needed to update a complete Linux system, such as Debian, to work with a 64-bit time_t.

McIntyre said that glibc has been working on an approach that splits the problem up based on the architecture targeted. Those that already have a 64-bit time_t will simply have a glibc that works with that ABI. Others that are transitioning from a 32-bit time_t to the new ABI will continue to use the 32-bit version by default in glibc. Applications on the latter architectures can request the 64-bit time_t support from glibc, but then they (and any other libraries they use) will only get the 64-bit versions of the ABI.
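Why opting in propagates to every library an application uses comes down to structure layout. A simplified sketch with Python's struct module (field widths idealized; real ABIs vary in padding details):

```python
import struct

# Simplified sketch of a struct timespec as seen by 32-bit callers
# (two 32-bit fields) versus the 64-bit-time_t variant (64-bit tv_sec,
# with tv_nsec widened or padded to 64 bits on many 32-bit ABIs).
TIMESPEC32 = struct.Struct("<ii")
TIMESPEC64 = struct.Struct("<qq")

print(TIMESPEC32.size, TIMESPEC64.size)  # 8 16

# A library compiled for one layout that receives data produced for
# the other misreads the fields entirely -- which is why every library
# in an application's dependency chain must opt in together.
packed = TIMESPEC64.pack(2**31 + 1, 0)   # a post-2038 timestamp
sec32, nsec32 = TIMESPEC32.unpack_from(packed)
print(sec32, nsec32)                     # garbage, not 2**31 + 1
```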

One thing that glibc will not be doing is bumping its SONAME (major version, essentially); doing so would make it easier to distinguish versions with and without the 64-bit support for 32-bit architectures. The glibc developers do not consider the change to be an ABI break, because applications have to opt into the change. It would be difficult and messy for Debian to change the SONAME for glibc on its own.

Moving forward with the 32-bit time values will, obviously, run into year-2038 problems, so it makes sense for a system like Debian to request 64-bit time support—except that there is a lot of code out there that will not work at all with the newer ABI. Some of it is in binary form, proprietary applications of various sorts (e.g. games), but there are plenty of problems for the open-source code as well. Bergmann scanned the libraries in Debian and "identified that about one third of our library packages would need rebuilding (and tracking) to make a (recursive) transition", McIntyre said. He outlined two ways forward that they had come up with.
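The "recursive" nature of the transition is the crux: once one library changes its ABI, everything linking against it must follow. A toy sketch of that closure computation (package names are hypothetical):

```python
def rebuild_closure(affected, reverse_deps):
    # Walk outward from the libraries whose ABI embeds time_t,
    # collecting every package that links (directly or indirectly)
    # against something already in the set.
    todo, need_rebuild = list(affected), set(affected)
    while todo:
        lib = todo.pop()
        for user in reverse_deps.get(lib, ()):
            if user not in need_rebuild:
                need_rebuild.add(user)
                todo.append(user)
    return need_rebuild

# Hypothetical reverse-dependency graph: who links against whom.
reverse_deps = {
    "libglib": ["libgtk", "libsoup"],
    "libgtk": ["gimp"],
    "libsoup": [],
    "libpng": ["libgtk"],
}
print(sorted(rebuild_closure({"libglib"}, reverse_deps)))
# ['gimp', 'libglib', 'libgtk', 'libsoup']
```

One affected low-level library can pull a large fraction of the archive into the transition, which is how "about one third of our library packages" comes about.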

The first is to rename libraries that need to be fixed to support 64-bit time values so that there could be two versions of them that could both be installed on a single system. That entails fixing a bunch of packages and rebuilding lots of code. "This effort will be *needed* only for the sake of our 32-bit ports, but would affect *everybody*."

The second is to decide which of the 32-bit architectures Debian supports will actually be viable in 2038 and to create new versions of those architectures with different names. There would be two versions of those architecture ports active for, probably, one release and users would not be able to simply upgrade from one to the other. But it would reduce the impact:

This would allow most of our developers to ignore the problem here (as 64-bit arches are not affected) and let a smaller number of people re-bootstrap with new ABIs with 64-bit time_t embedded.

He and Bergmann think the second option is the right way to go, but McIntyre was soliciting input from others in the project. Ansgar Burchardt replied that the i386 port, at least, probably should not even take the second path. It should simply continue using 32-bit time values.

So maybe just recommend people to move to 64-bit architectures and put 32-bit applications in a time namespace so they believe they are still in 2001 ;-) 32-bit architectures will probably still be useful in embedded contexts for a long time and there it might be easier to just change the ABI, but for a general-purpose distribution we start seeing more and more problems and I don't really see us supporting them as a full architecture in 10+ years.

If that is the chosen direction for i386, Russ Allbery suggested that the "cross-grading" feature be fully supported. Cross-grading would allow a Debian system to be upgraded to a new architecture, but is currently not for the faint of heart:

I'm sure I'm not the only one who is stuck with continuously-upgraded i386 hosts who has been wanting to switch but has been waiting until cross-grading is a little bit less scary.

But Burchardt is not convinced there will be enough i386 Debian systems to matter in even ten years. He showed some numbers from Debian's popularity contest (popcon) on i386 versus amd64 that suggest i386 will only make up around 0.2% of the total in ten years. "For just avoiding the Y2038 problem, i386 might sort itself out already without additional intervention." McIntyre thought that cross-grading support might make an interesting project for an intern, however.
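A projection like that amounts to assuming the i386 share keeps shrinking at roughly its historical rate. A crude decay model (the numbers here are illustrative, not Burchardt's actual popcon figures):

```python
def projected_share(current_share, halving_years, horizon_years):
    # Assume the architecture's share of installations halves every
    # `halving_years`; extrapolate `horizon_years` into the future.
    return current_share * 0.5 ** (horizon_years / halving_years)

# If i386 were at ~2% today and halving every ~3 years, ten years out:
print(round(projected_share(0.02, 3, 10), 4))  # ~0.002, i.e. 0.2%
```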

There is work going on in glibc to provide both 32- and 64-bit interfaces simultaneously, Guillem Jover said, which could be used to provide a smoother transition without needing a SONAME change. Bergmann said that the glibc work is proceeding, but he was not sure that it would actually help with the transition that much; the problem is that on the scale of a whole distribution it "adds so much extra work and time before completion that there won't be many people left to use it by the time the work is done.;-)"

Jover pointed to the large-file support (LFS) transition as something of a model, though he cautioned that "the LFS 'transition' is not one we should be very proud of, as it's been dragging on for a very long time, and it's not even close to be finished". LFS allows applications to handle files with sizes larger than a 32-bit quantity can hold, which parallels the time_t situation. Jover suggested that fully enabling LFS as part of the time_t transition might make sense. Bergmann said that LFS support is "a done deal", as both glibc and musl libc require using 64-bit file sizes (off_t) when 64-bit time values are used.

Lennart Sorensen also saw parallels to the LFS transition, but Ben Hutchings thought otherwise:

LFS is a great example of how *not* to do it. 23 years on, we still have open bugs for programs that should opt in but didn't. Not every program needs to handle > 2 GiB files, but there are now filesystems with 64-bit inode numbers and they break every non-LFS program that calls stat().

Similarly, every program that uses wall-clock time will fail as we approach 2038, and the failure mode is likely to be even worse than with LFS as few programs will check for errors from time APIs.
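The stat() failure mode Hutchings describes can be sketched: a non-LFS program has narrow fields in its stat buffer, and the C library must fail the call rather than silently truncate when a value does not fit (field widths simplified here for illustration):

```python
def legacy_stat_fits(st_ino, st_size):
    # A non-LFS 32-bit program's stat buffer: unsigned 32-bit inode
    # number, signed 32-bit file size (off_t). If either value does
    # not fit, stat() must fail with EOVERFLOW instead of truncating.
    return st_ino < 2**32 and -2**31 <= st_size < 2**31

print(legacy_stat_fits(12345, 10**6))   # True: ordinary file
print(legacy_stat_fits(2**40, 512))     # False: 64-bit inode number
print(legacy_stat_fits(42, 3 * 2**30))  # False: file over 2 GiB
```

The second case is Hutchings's point: even a tiny file on a filesystem with 64-bit inode numbers breaks every non-LFS caller.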

YunQiang Su suggested a third option to consider: define 64-bit versions of all of the affected interfaces throughout the distribution, appending "64" to the names of the new functions and data structures, then modify packages to use those interfaces over time. A deadline of, say, 2030 could be set; after that, anything using the older interfaces would be considered to have a release-critical bug. McIntyre said that it would be technically possible to do so, but there are some major downsides:

The problem here is that we have many thousands of packages to work on, glibc up through other libs to all the applications that use them. It's invasive work, and likely to take a very long time. Since it's work that would *not* be needed for 64-bit architectures, we're also likely to see push-back from some upstreams.

2030 is also too late to fix the problem - people are already starting to see issues in some applications. We want to make a difference soon: the earlier that we have fixes available, the fewer broken systems will have to be fixed / removed / replaced later.

Most who commented in the thread seem to see i386 as a dying architecture, at least for Debian on the time scale being considered here (18 years). Florian Weimer summed it up this way:

My opinion (professional in this case, even) is that i386 users want compatibility with their binaries from 1998. Otherwise they would have rebuilt them for x86-64 by now. Under this worldview, i386 is for backwards compatibility with existing software. Users will want to run these old programs in a time namespace with shifted time, too.

Bergmann generally agreed with that assessment, but thought it made sense to see how things go for a different 32-bit architecture that probably needs to have a user space that supports 64-bit time_t: armhf. Once that work is done, it may be straightforward to apply it to i386 if that is deemed useful—or a decision could be made to phase out i386 sometime before 2038.

But Marco d'Itri wondered why the picture for armhf was different from that for i386. Bergmann pointed out that armhf is being used for a lot of embedded systems and that is likely to continue. McIntyre also noted that the Civil Infrastructure Platform is based on Debian and is often used with armhf. Bergmann gave a summary of some research he did on the use of the various Arm architecture versions, some of which he expects to still be in use, in 32-bit form, in 2038 and beyond:

In some deeply embedded systems, you'd be looking at installing a current version of Debian (because why not) and then running it for decades beyond the end of support without updates. While this is often no problem in the absence of attack vectors, the time32 problem means that a piece of industrial equipment may be created for a 40 year lifetime today and work flawlessly for the first 18 years before suddenly breaking. The sooner time64 gets supported in Debian, the more of them have a chance of surviving.

The consensus seems to be that the second option, a new architecture name that indicates 64-bit time support, is the right way to go and that armhf makes the most sense as a starting point. In his reply to Jover, Bergmann summed it up this way:

So far, armhf is the only Debian target architecture that I know needs this, so it seems best to focus the work on that one. Once the porting work is done and enough bugs are fixed, the other architectures can decide if they still care. If any new 32-bit architectures (rv32, arc?) get added, it would probably be sensible to start out with a time64 port.

McIntyre is hoping to get started on work soon, so that an armhf port for 64-bit time might perhaps be released with Debian 11 ("bullseye"), which is presumably coming in mid-2021. Bergmann also noted that Adélie Linux has been working on porting its user space for 64-bit times using musl libc; it has a list of open issues that were found:

Most of these are for packages that use low-level system calls directly rather than going through glibc, either for syscalls that don't have an abstraction (seccomp, futex, ...) or for implementing a runtime environment for a language other than C.

Even though the 32-bit architectures were largely the focus of the discussion, there is still quite a bit of work that needs to be done for 64-bit systems as well. While the C libraries will soon fully support 64-bit time_t values, they will also require them for the time interfaces; there is plenty of code assuming 32-bit time values buried in user space, but that code will presumably be found and fixed over the next few years. We may still see a glitch or three in January 2038 (or before), but it seems like the Linux world, at least, will be pretty well prepared long before the appointed hour.

Comments (29 posted)

Finer-grained kernel address-space layout randomization

By Jake Edge
February 19, 2020

The idea behind kernel address-space layout randomization (KASLR) is to make it harder for attackers to find code and data of interest to use in their attacks by loading the kernel at a random location. But a single random offset is used for the placement of the kernel text, which presents a weakness: if the offset can be determined for anything within the kernel, the addresses of other parts of the kernel are readily calculable. A new "finer-grained" KASLR patch set seeks to remedy that weakness for the text section of the kernel by randomly reordering the functions within the kernel code at boot time.

Kristen Carlson Accardi posted an RFC patch set that implemented a proof-of-concept for finer-grained KASLR in early February. She identified three weaknesses of the existing KASLR:

  • low entropy in the randomness that can be applied to the kernel as a whole
  • the leak of a single address can reveal the random offset applied to the kernel, thus revealing the rest of the addresses
  • the kinds of information leaks needed to reveal the offset abound

So, the "tl;dr" is: "This patch set rearranges your kernel code at load time on a per-function level granularity, with only around a second added to boot time."

The changes required are in two main areas. When the kernel is built, a GCC option is used to place each function in its own .text section. The relocation addresses can be used to allow shuffling the text sections as the kernel is loaded, just after it is decompressed. There are, she noted, tables of addresses in the kernel for things like exception handling and kernel probes (kprobes), but those can be handled too:

Most of these tables generate relocations and require a simple update, and some tables that have assumptions about order require sorting after the update. In order to modify these tables, we preserve a few key symbols from the objcopy symbol stripping process for use after shuffling the text segments.
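The reordering step itself can be modeled simply: treat each per-function .text.* section as a (name, size) pair and hand out new load offsets in a random order. A toy sketch (not the actual kernel loader code; section names are made up):

```python
import random

def shuffle_layout(sections, seed=None):
    # sections: list of (name, size) pairs, one per function section.
    # Returns {name: new_offset}, packing the sections contiguously
    # in a randomly chosen order.
    rng = random.Random(seed)
    order = list(sections)
    rng.shuffle(order)
    offset, layout = 0, {}
    for name, size in order:
        layout[name] = offset
        offset += size
    return layout

sections = [(".text.func_a", 0x40), (".text.func_b", 0x80),
            (".text.func_c", 0x20)]
print(shuffle_layout(sections, seed=1))
```

Every boot with a different seed produces a different layout, so a leaked function address no longer reveals where any other function lives.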

The second area of changes is in the loading of the kernel into memory; the boot process was changed to parse the vmlinux ELF file to retrieve the key symbols and collect up a list of .text.* sections to be reordered. The function order is then randomized and any tables are updated as needed:

The existing code which updated relocation addresses was modified to account for not just a fixed delta from the load address, but the offset that the function section was moved to. This requires inspection of each address to see if it was impacted by a randomization. We use a bsearch to make this less horrible on [performance].
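The per-address fixup described there amounts to a binary search over the original section boundaries to find which section an address belonged to, then applying that section's displacement. A sketch of the idea (not the kernel's implementation; addresses are invented):

```python
import bisect

def relocate(addr, section_starts, deltas):
    # section_starts: sorted original start addresses of the shuffled
    # sections; deltas[i]: how far section i moved. Find the section
    # containing `addr` and apply its displacement.
    i = bisect.bisect_right(section_starts, addr) - 1
    return addr + deltas[i]

# Original layout: A at 0x0 (0x40 bytes), B at 0x40 (0x80), C at 0xc0.
# New order B, C, A gives these per-section displacements:
starts = [0x00, 0x40, 0xc0]
deltas = [0xc0, -0x40, -0x40]
print(hex(relocate(0x44, starts, deltas)))  # 0x4: inside B, moved down
```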

For debugging the proof-of-concept, a pseudo-random-number generator (PRNG) was used so that the same order could be generated by giving it the identical seed. The patch adding the PRNG, which was authored by Kees Cook, might provide some performance benefits, but Andy Lutomirski objected to using a new, unproven algorithm; he suggested using a deterministic random bit generator (DRBG), such as ChaCha20. Similarly, Jason A. Donenfeld was concerned that the random-number sequence could be predicted from just a few leaked address values, which might defeat the purpose of the feature. Cook said that using ChaCha20 instead was a better idea moving forward.

The patch set removes access to the /proc/kallsyms file, which lists addresses of kernel symbols, for non-root users. Currently kallsyms simply gives addresses of all zeroes when non-root users read it, but the list of symbols is given in the order they appear in the kernel text; that would give away the randomized layout of the kernel, so access was disabled. Cook pointed out that making the kallsyms file unreadable has, in the past, "seemed to break weird stuff in userspace". He suggested either sorting the symbol names alphabetically in the output—or perhaps just waiting to see if there were any complaints.

Impacts

Accardi measured the impact on boot time in a VM and found that it took roughly one second longer to boot, which is fairly negligible for many use cases. The run-time performance is harder to characterize; the all-important kernel build benchmark was about 1% slower than building on the same kernel with just KASLR enabled. Some other workloads performed much worse, "while others stayed the same or were mysteriously better". It probably is greatly dependent on the code flow for the workload, which might make for an area to research in the future; optimizing the function layout for the workload has been shown [PDF] to have a positive effect on performance.

Adding the extra information to the vmlinux ELF file to support finer-grained KASLR increases its size, but there is a much bigger effect from the need to increase the boot heap size. Randomizing the addresses of the sections requires a much bigger heap, 64MB, than current boot heaps (64KB for all compressors except bzip2, which needs 4MB). The problem is that a larger boot heap ends up increasing the size of the kernel image by adding a zero-filled section to accommodate the heap.

One of Cook's patches, which was included in Accardi's patch set, seeks to remedy that, but it turned out that the underlying cause was a bug in how the sections in the kernel object are laid out. Arvind Sankar pointed to his patch set from January that would fix the bug, which Cook thought was a much better solution.

Lutomirski also suggested that the sort mechanism being used on the symbol names was too expensive; the swap function being used in the sort() call did quite a bit of unneeded work if a bit more memory was available:

Unless you are severely memory-constrained, never do a sort with an expensive swap function like this. Instead allocate an array of indices that starts out as [0, 1, 2, ...]. Sort *that* where the swap function just swaps the indices. Then use the sorted list of indices to permute the actual data. The result is exactly one expensive swap per item instead of one expensive swap per swap.
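Lutomirski's trick is easy to demonstrate: sort an array of small indices (where swaps are cheap), then permute the real data in a single pass. A sketch with hypothetical symbol records:

```python
def permutation_sort(items, key=lambda x: x):
    # Sort indices instead of the items themselves: the sort only
    # swaps small integers, and the final list comprehension performs
    # exactly one move of each (potentially large) item.
    order = sorted(range(len(items)), key=lambda i: key(items[i]))
    return [items[i] for i in order]

# Imagine each element is a large symbol record; only integers are
# shuffled during the sort itself.
symbols = [("zswap_init", 0x30), ("acpi_bus_scan", 0x10),
           ("memfd_create", 0x20)]
print(permutation_sort(symbols, key=lambda s: s[0]))
```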

Cook said that he thought there were a number of areas where the tradeoff of memory versus speed need to be considered. The amount of memory being used by the proof-of-concept is much greater than he expected (58MB in his tests). One of the problems there is that the version of free() used when decompressing the kernel image does not actually free any memory. But Accardi thought that the boot latency of a second or so was not likely to deter those who are interested in having the protection—boot-time minimalists are not likely to use finer-grained KASLR anyway, she said.

Security and alignment

In the cover letter, Accardi analyzed the security properties of the patch set, noting that information leaks are often considered to require local access to the system, but that CVE-2019-0688 demonstrated a remote address leak for Windows. The patch set assumes that information leaks are plentiful, so it is trying to make it harder for attackers even in the presence of these leaks. Quantifying the added difficulty is dependent on a number of factors:

Firstly and most obviously, the number of functions you randomize matters. This implementation keeps the existing .text section for code that cannot be randomized - for example, because it was assembly code, or we opted out of randomization for performance reasons. The less sections to randomize, the less entropy. In addition, due to alignment (16 bytes for x86_64), the number of bits in a address that the attacker needs to guess is reduced, as the lower bits are identical.

She suggested that other alignments could be considered down the road and that execute-only memory (XOM), if it lands, would make the finer-grained technique more effective against certain kinds of attacks. Function sections could perhaps simply be byte-aligned and padded with INT3 instructions, so that a wrong guess would trigger a trap. But the required alignment of functions on Intel processors is somewhat more complicated. Cook said that 16-byte function alignment, as it is now in the kernel, is wasting some space (and some entropy in the function start addresses) when using finer-grained KASLR:

I know x86_64 stack alignment is 16 bytes. I cannot find evidence for what function start alignment should be. It seems the linker is 16 byte aligning these functions, when I think no alignment is needed for function starts, so we're wasting some memory (average 8 bytes per function, at say 50,000 functions, so approaching 512KB) between functions. If we can specify a 1 byte alignment for these orphan sections, that would be nice, as mentioned in the cover letter: we lose a 4 bits of entropy to this alignment, since all randomized function addresses will have their low bits set to zero.
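The arithmetic behind the quote is straightforward: N-byte alignment zeroes the low log2(N) bits of every function's start address, and wastes an average of N/2 padding bytes per function. A quick check (Cook's "approaching 512KB" is a rougher rounding of the same estimate):

```python
import math

def entropy_bits_lost(alignment):
    # With N-byte alignment, the low log2(N) bits of every function
    # start address are always zero, so an attacker need not guess them.
    return int(math.log2(alignment))

def expected_padding(alignment, functions):
    # On average, half the alignment granule is wasted per function.
    return alignment // 2 * functions

print(entropy_bits_lost(16))         # 4 bits
print(expected_padding(16, 50_000))  # 400000 bytes
```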

Jann Horn pointed out that Intel recommends 16-byte alignment for branch targets; other alignments might result in less efficient calls. Sankar noted that the current alignment is not that detrimental to the entropy, but Lutomirski said there is another thing to consider:

There is a security consideration here that has nothing to do with entropy per se. If an attacker locates two functions, they learn the distance between them. This constrains what can fit in the gap. Padding reduces the strength of this type of attack, as would some degree of random padding.

He also said that there is a bug with some Intel processors that cannot handle certain kinds of jump instructions that span a cache-line boundary. Peter Zijlstra looked at the erratum document [PDF] and thought it implied a need for 32-byte alignment for functions. Handling that may actually require a change to the kernel overall, Cook thought.

The reaction to the idea of finer-grained KASLR was generally positive. No objections to the goals or the techniques used (at a high level) were heard, anyway. It seems like a nice incremental improvement to KASLR. It can also coexist with various control-flow integrity (CFI) measures that are working their way upstream. As Accardi noted, the idea is not new and there has been quite a bit of research into it. OpenBSD uses a similar technique to randomize its kernel at boot time, for example. There is more work to do, of course, but it would not be a surprise to see finer-grained KASLR in the mainline sometime this year.

Comments (10 posted)

Revisiting stable-kernel regressions

By Jonathan Corbet
February 13, 2020
Stable-kernel updates are, unsurprisingly, supposed to be stable; that is why the first of the rules for stable-kernel patches requires them to be "obviously correct and tested". Even so, for nearly as long as the kernel community has been producing stable update releases, said community has also been complaining about regressions that make their way into those releases. Back in 2016, LWN did some analysis that showed the presence of regressions in stable releases, though at a rate that many saw as being low enough. Since then, the volume of patches showing up in stable releases has grown considerably, so perhaps the time has come to see what the situation with regressions is with current stable kernels.

As an example of the number of patches going into the stable kernel updates, consider that, as of 4.9.213, 15,648 patches have been added to the original 4.9 release — that is an entire development cycle's worth of patches added to a "stable" kernel. Reviewing all of those to see whether each contains a regression is not practical, even for the maintainers of the stable updates. But there is an automated way to get a sense for how many of those stable-update patches bring regressions with them.

The convention in the kernel community is to add a Fixes tag to any patch fixing a bug introduced by another patch; that tag includes the commit ID for the original, buggy patch. Since stable kernel releases are supposed to be limited to fixes, one would expect that almost every patch would carry such a tag. In the real world, about 40-60% of the commits to a stable series carry Fixes tags; the proportion appears to be increasing over time as the discipline of adding those tags improves.

It is a relatively straightforward task (for a computer) to look at the Fixes tag(s) in any patch containing them, extract the commit IDs of the buggy patches, and see if those patches, too, were added in a stable update. If so, it is possible to conclude that the original patch was buggy and caused a regression in need of fixing. There are, naturally, some complications, including the fact that stable-kernel commits have different IDs than those used in the mainline (where all fixes are supposed to appear first); associating fixes with commits requires creating a mapping between the two. Outright reverts of buggy patches tend not to have Fixes tags, so they must be caught separately. And so on. The end result will necessarily contain some noise, but there is a useful signal there as well.
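The core of that analysis can be sketched in a few lines: map each stable commit to its mainline ID, then check whether the target of each Fixes tag was itself shipped in the same series. This is only an illustration of the method, not the real stablefixes tool, and the commit IDs are invented:

```python
import re

FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{6,40})", re.MULTILINE)

def find_stable_regressions(commits):
    # commits: list of (stable_id, mainline_id, message) tuples for
    # one stable series. A commit counts as fixing a regression when
    # the mainline ID named in its Fixes: tag maps to a commit that
    # was also shipped in this series.
    shipped = {mainline: stable for stable, mainline, _ in commits}
    hits = []
    for stable_id, _, message in commits:
        for fixed in FIXES_RE.findall(message):
            for mainline, stable in shipped.items():
                if mainline.startswith(fixed):
                    hits.append((stable_id, stable))
    return hits

commits = [
    ("aaa111", "1111111111ab", "net: add frobnication"),
    ("bbb222", "2222222222cd",
     'net: fix frobnication\n\nFixes: 1111111111ab ("net: add frobnication")'),
]
print(find_stable_regressions(commits))  # [('bbb222', 'aaa111')]
```

The prefix match handles the abbreviated commit IDs that Fixes tags conventionally use.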

For the curious, this analysis was done with the stablefixes tool, part of the gitdm collection of repository data-mining hacks. It can be cloned from git://git.lwn.net/gitdm.git.

Back in 2016, your editor came up with a regression rate of at least 2% for the longer-term stable kernels that were maintained at that time. The 4.4 series, which had 1,712 commits then, showed a regression rate of at least 2.3%. Since then, the number of commits has grown considerably — to 14,211 in 4.4.213 — as a result of better discipline and the use of automated tools (including a machine-learning system) to select fixes that were not explicitly earmarked for stable backporting. Your editor fixed up his script, ported it to Python 3, and reran the analysis for the currently supported stable kernels; the results look like this.

    Series     Commits    Tags           Fixes   Reverts
    5.4.18       2,423    1,482 (61%)       74        29   Details
    4.19.102    11,758    5,647 (48%)      588       100   Details
    4.14.170    15,527    6,727 (43%)      985       134   Details
    4.9.213     15,647    6,286 (40%)      951       139   Details
    4.4.213     14,210    5,110 (36%)      834       124   Details

In the above table, Series identifies the stable kernel that was looked at. Commits is the number of commits in that series, while Tags is the number and percentage of those commits with a Fixes tag. The count under Fixes is the number of commits in that series that are explicitly fixing another commit applied to that series. Reverts is the number of those fixes that were outright reverts; a famous person might once have said that reversion is the sincerest form of patch criticism. Hit the "Details" link for a list of the fixes found for each series.

Looking at those numbers would suggest that, for example, 3% of the commits in 5.4.18 are fixing other commits, so the bad commit rate would be a minimum of 3%. The situation is not actually that simple, though, for a few reasons. One of those is that a surprising number of the regression fixes appear in the same stable release as the commits they are fixing. In a case like that, while the first commit can indeed be said to have introduced a regression, no stable release actually contained the regression and no user will have ever run into it. Counting those is not entirely fair. If one subtracts out the same-release fixes, the results look like this:

    Series     Fixes   Same-release fixes   Visible regressions
    5.4.18        74                   29                    45
    4.19.102     588                  176                   412
    4.14.170     985                  253                   732
    4.9.213      951                  229                   722
    4.4.213      834                  232                   602

Another question to keep in mind is what to do with all those commits without Fixes tags. Many of them are certainly fixes for bugs introduced in other patches, but nobody went to the trouble of figuring out how the bugs happened. If the numbers in the table above are taken as the total count of regressions in a stable series, that implies that none of the commits without Fixes tags are fixing regressions, which will surely lead to undercounting regression fixes overall. On the other hand, if one assumes that the untagged commits contain regression fixes in the same proportion as the tagged ones, the result could well be a count that is too high.

Perhaps the best thing that can be done is to look at both numbers, with a reasonable certainty that the truth lies somewhere between them:

    Series     Visible regressions   Regression rate (low)   Regression rate (high)
    5.4.18                      45                    1.9%                     3.0%
    4.19.102                   412                    3.5%                     7.3%
    4.14.170                   732                    4.7%                    10.9%
    4.9.213                    722                    4.6%                    11.5%
    4.4.213                    602                    4.2%                    11.8%
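The two bounds follow directly from the earlier counts: the low bound assumes every untagged commit is regression-free; the high bound assumes untagged commits hide regressions at the same rate as tagged ones. Checking the 5.4.18 row:

```python
def regression_bounds(commits, tagged, visible_regressions):
    # Low bound: visible regressions over all commits in the series.
    # High bound: visible regressions over tagged commits only,
    # extrapolating the same rate to the untagged remainder.
    return visible_regressions / commits, visible_regressions / tagged

low, high = regression_bounds(commits=2423, tagged=1482,
                              visible_regressions=45)
print(f"{low:.1%} {high:.1%}")  # 1.9% 3.0%
```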

So that is about as good as the numbers are going to get, though there are still some oddball issues. Consider the case of mainline commit 4abb951b73ff ("ACPICA: AML interpreter: add region addresses in global list during initialization"). This commit included a "Cc: stable@vger.kernel.org" tag, so it was duly included (as commit 22083c028d0b) in the 4.19.2 release. It was then reverted in 4.19.3, with the complaint that it didn't actually fix a bug but did cause regressions. This same change returned in 4.19.6 after an explicit request. Then, two commits followed in 4.19.35: commit d4b4aeea5506 addressed a related issue, naming the original upstream commit in a Fixes tag, while f8053df634d4 claimed to be the original upstream commit, which had already been applied. That last one looks like a fix for a partially done backport. How does one try to account for a series of changes like that? Honestly, one doesn't even try.

So what can we conclude from all this repository digging? The regression rates seen in 2016 were quite a bit lower than what we are seeing now; that would suggest that the increasing volume of patches being applied to the stable trees is not just increasing the number of regressions, but also the rate of regressions. That is not a good sign. On the other hand, the amount of grumbling about stable regressions seems to have dropped recently. Perhaps that's just because people have gotten used to the situation. Or perhaps the worst problems, such as filesystem-destroying regressions, are no longer getting through, while the problems that do slip through are relatively minor.

Newer kernels have a visibly lower regression rate than the older ones. There are two equally plausible explanations for that. Perhaps the process of selecting patches for stable backporting is getting better, and fewer regressions are being introduced than were before. Or perhaps those kernels just haven't been around for long enough for all of the regressions already introduced to be found and fixed yet. The 2016 article looked at 4.4.14, which had 39 regression fixes (19 fixed in the same release). 4.4.213 now contains 110 fixes for regressions introduced in 4.4.14 or earlier (still 19 fixed in the same release). So there is ample reason to believe that the regression rate in 5.4.18 is higher than indicated above.

In any case, it seems clear that the push to get more and more fixes into the stable trees is unlikely to go away anytime soon. And perhaps that is a good thing; a stable tree with thousands of fixes and a few regressions may still be far more stable than one without all those patches. Even so, it would be good to keep an eye on the regression rate; if that is allowed to get too high, the result is likely to be users moving away from stable updates, which is definitely not the desired result.

Comments (11 posted)

Keeping secrets in memfd areas

By Jonathan Corbet
February 14, 2020
Back in November 2019, Mike Rapoport made the case that there is too much address-space sharing in Linux systems. This sharing can be convenient and good for performance, but in an era of advanced attacks and hardware vulnerabilities it also facilitates security problems. At that time, he proposed a number of possible changes in general terms; he has now come back with a patch implementing a couple of address-space isolation options for the memfd mechanism. This work demonstrates the sort of features we may be seeing, but some of the hard work has been left for the future.

Sharing of address spaces comes about in a number of ways. Linux has traditionally mapped the kernel's address space into every user-space process; doing so improves performance in a number of ways. This sharing was thought to be secure for years, since the mapping doesn't allow user space to actually access that memory. The Meltdown and Spectre hardware bugs, though, rendered this sharing insecure; thus kernel page-table isolation was merged to break that sharing.

Another form of sharing takes place in the processor's memory caches; once again, hardware vulnerabilities can expose data cached in this shared area. Then there is the matter of the kernel's direct map: a large mapping (in kernel space) that contains all of physical memory. This mapping makes life easy for the kernel, but it also means that all user-space memory is shared with the kernel. In other words, an attacker with even a limited ability to run code in the kernel context may have easy access to all memory in the system. Once again, in an era of speculative-execution bugs, that is not necessarily a good thing.

The memfd subsystem wasn't designed for address-space isolation; indeed, its initial purpose was as a sort of interprocess communication mechanism. It does, however, provide a way to create a memory region attached to a file descriptor with specific characteristics; a memfd can be "sealed", for example, so that a recipient knows that it will not be changed. Rapoport decided that it would be a good foundation on which to build a "secret memory" feature.

Actually creating an isolated memory area requires passing a new flag to memfd_create() called MFD_SECRET. That, however, doesn't describe how this secrecy should be implemented. There are a number of options that offer varying levels of security and performance degradation, so the user has to make a decision. The available options, as implemented in the patch, could easily have been specified directly to memfd_create() with their own flags, but Rapoport decided to require the use of a separate ioctl() call instead. Until the secrecy mode has been specified with this call, the user cannot map the memfd, and thus cannot actually make use of it.

There are two modes implemented so far; the first of them, MFD_SECRET_EXCLUSIVE, does a number of things to hide the memory attached to the memfd from prying eyes. That memory is marked as being unevictable, for example, so it will never be flushed out to swap. The effect is similar to calling mlock(), but with a couple of differences: pages are not actually allocated until they are faulted in, and the limit on the number of locked pages appears to be (perhaps by mistake) implemented separately from the limits imposed by mlock(). There is also no way to unlock pages except by destroying the memfd, which requires unmapping it and closing its file descriptor.

The other thing done by MFD_SECRET_EXCLUSIVE is to remove the pages used by the memfd from the kernel's direct map, making it inaccessible from kernel space. The problem with this is that the direct map is normally set up using huge pages, which makes accessing it far more efficient. Removing individual (small) pages forces huge pages to be broken apart into lots of small pages, slowing the system for everybody. The current code (admittedly a proof of concept) allocates each page independently when it is faulted in, which seems likely to maximize the damage done to the direct mapping. That will need to change before this feature could be seriously considered for merging.

The other mode, MFD_SECRET_UNCACHED, does everything MFD_SECRET_EXCLUSIVE does, but also causes the memory to be mapped with caching disabled. That will prevent its contents from ever living in the processor's memory caches, rendering it inaccessible to exploits that use any of a number of hardware vulnerabilities. It also makes access to that memory far slower in general, to the point that it may seem inaccessible to the intended user as well. For small amounts of infrequently accessed data (cryptographic keys, for example), though, it may be a useful option.

In its current form, the feature only allows one mode to be selected. In truth, though, MFD_SECRET_UNCACHED is a strict superset of MFD_SECRET_EXCLUSIVE, so that is not currently a problem. Rapoport suggests that this whole API could change in the future, with an alternative being "something like 'secrecy level' from 'a bit more secret than normally' to 'do your best even at the expense of performance'".

Part of the purpose behind this posting was to get comments on the proposed API, but those have not been forthcoming so far. This may be one of those projects that has to advance further — and get closer to being merge-ready — before developers will take notice. But at least the work itself is not a secret anymore, so interested users can start to think about whether it meets their needs or not.

Comments (9 posted)

Filesystem UID mapping for user namespaces: yet another shiftfs

By Jonathan Corbet
February 17, 2020
The idea of an ID-shifting virtual filesystem that would remap user and group IDs before passing requests through to an underlying real filesystem has been around for a few years but has never made it into the mainline. Implementations have taken the form of shiftfs and shifting bind mounts. Now there is yet another approach to the problem under consideration; this one involves a theoretically simpler approach that makes almost no changes to the kernel's filesystem layer at all.

ID-shifting filesystems are meant to be used with user namespaces, which have a number of interesting characteristics; one of those is that there is a mapping between user IDs within the namespace and those outside of it. Normally this mapping is set up so that processes can run as root within the namespace without giving them root access on the system as a whole. A user namespace could be configured so that ID zero inside maps to ID 10000 outside, for example; ranges of IDs can be set up in this way, so that ID 20 inside would be 10020 outside. User namespaces thus perform a type of ID shifting now.

In systems where user namespaces are in use, it is common to set them up to use non-overlapping ranges of IDs as a way of providing isolation between containers. But often complete isolation is not desired. James Bottomley's motivation for creating shiftfs was to allow processes within a user namespace to have root access to a specific filesystem. Christian Brauner, the author of the current patch set, instead describes a use case where multiple containers have access to a shared filesystem and need to be able to access that filesystem with the same user and group IDs. Either way, the point is to be able to set up a mapping for user and group IDs that differs from the mapping established in the namespace itself.

Shiftfs was a virtual filesystem that would pass operations through to an underlying filesystem while remapping (by applying a constant offset) the user and group IDs involved. The later bind-mount implementation did away with the separate filesystem and made the shifting a property of the mount itself. Brauner's approach, apparently sketched out at the 2019 Linux Plumbers Conference, is different; it makes the shifting a property of the user namespace itself.

Processes in Linux, as in any Unix-like system, have associated user and group IDs. It is tempting to think that these IDs control access to files, but that is not quite true; instead, Linux maintains a separate user and group ID for filesystem access. These IDs can be changed (by an appropriately privileged process) using the setfsuid() and setfsgid() system calls. This feature is rarely used, so the filesystem user and group IDs are normally the same as the regular IDs, but the mechanism to separate the two sets of IDs has been there since nearly the beginning.

The implementation of user namespaces necessarily understands these filesystem IDs (FSIDs), but that understanding has never been exposed outside the kernel. Brauner's patch set works by making the FSIDs visible and explicit, allowing them to be mapped independently of the normal IDs. In particular, it creates two new files (fsuid_map and fsgid_map) under the /proc directory for each process running inside a user namespace. These behave like the existing uid_map and gid_map files, in that they accept one or more ranges of IDs to remap, but they affect the FSIDs instead.

So, for example, a system administrator can, on current systems, map 100 user IDs starting at zero inside the container to the range 10,000-10,099 outside by writing this line to uid_map:

    0 10000 100

By default, this mapping will also affect that namespace's FSIDs. But if the FSIDs should be mapped differently, say to a range starting at 20,000, then the administrator could write this to fsuid_map:

    0 20000 100

This mechanism is conceptually simpler than the ideas that came before, though it still requires a 24-part patch series to implement. It keeps all of the ID mapping in the same place and doesn't require special filesystem or mount types. So there is definitely something to like here.

There is, though, a significant limitation in this implementation: the FSID mappings are global, and affect all of a container's filesystem activity, regardless of which filesystem is being accessed. The shiftfs or bind-mount approaches, instead, can be set up on a per-filesystem basis. Whether this loss of flexibility matters will depend on the specific use case in question; it seems likely that some users will want the ability to configure access to different filesystems differently. Adding that ability by way of the FSID mechanism may well be a complex task.

Thus far, though, no potential users have spoken up to request this capability. This patch set is young, with the second revision having only just been posted, so it's possible that many users with an interest in this area have not yet encountered it. The third time might be the charm for this sort of ID-shifting capability, but to assume that to be the case would be premature.

Comments (12 posted)

Page editor: Jonathan Corbet


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds