An update on sealed system mappings

By Daroc Alden
February 4, 2025

Jeff Xu has been working on a patch set that makes certain mappings in a process's address space impossible to change, sealing them against tampering. This has some potential security benefits — mainly, making sure that someone cannot relocate the vsyscall and vDSO mappings — but some kernel developers haven't been impressed with the patches. While the core functionality (sealing the mappings) is sound, some of the supporting code for enabling and disabling the new feature caused concern by going against the normal design for such things. Reviewers also questioned how this feature would interact with checkpointing and with sandboxing.

Unlike the mseal() system call, which can be used to seal any memory mapping, Xu's patch set is focused specifically on sealing mappings that the kernel uses, before any user-space code starts executing. The patch set seals the memory mappings for five things: the vDSO (code to implement some system calls in user space), vvar (data for vDSO calls), sigpage (code for implementing signal handling on Arm), uprobes (user-space tracing), and vsyscall (an older and obsolete system-call mechanism). Each of these facilities involves having the kernel map some additional pages into a user-space process; all of them except the uprobe pages are created on process startup, and should by and large remain unmodified until the process dies. Uprobes are inserted dynamically, and the kernel maps the pages at that time, but that mapping also lives until the process is terminated.

Each of these mappings is relied on to perform functions that would otherwise be performed by the kernel. Therefore, it's important for security that user-space code doesn't change these pages. If a process had some vulnerability that did allow changing the pages (or the mappings), it could potentially spoof the results of certain system calls, which would be highly exploitable. While not the most common kind of exploit, it is a known kind of attack. Removing the possibility entirely would be another component of defense-in-depth against malware.

These mappings are generally set to be read-only or execute-only, depending on what the architecture supports, but Xu's patch would go a step further by sealing the mappings themselves, preventing them from being made writable or unmapped. This change should provide additional security without impacting the behavior of well-behaved user-space programs, because they were generally not supposed to be touching any of these mappings in the first place.
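To make the effect of sealing concrete, here is a minimal user-space sketch that uses the general-purpose mseal() system call mentioned above, not the kernel-internal path in Xu's patches. It assumes a 64-bit kernel that provides mseal() (Linux 6.10 or later); the fallback syscall number 462 and the helper name seal_blocks_changes() are assumptions made for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462 /* assumption: used only if the libc headers predate mseal() */
#endif

/*
 * Hypothetical helper: seal one anonymous page, then try to change it.
 * Returns 1 if the sealed page rejected both mprotect() and munmap() with
 * EPERM, 0 if mseal() is not available here, and -1 on anything unexpected.
 */
int seal_blocks_changes(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, p, page, 0UL) != 0) {
        /* ENOSYS: kernel too old; EPERM: 32-bit kernel or a filtering sandbox. */
        int unavailable = (errno == ENOSYS || errno == EPERM);
        munmap(p, page);
        return unavailable ? 0 : -1;
    }

    /* From here on the page stays mapped read-only until the process exits. */
    int mp = mprotect(p, page, PROT_READ | PROT_WRITE);
    int mp_errno = errno;
    int um = munmap(p, page);
    int um_errno = errno;

    return (mp == -1 && mp_errno == EPERM &&
            um == -1 && um_errno == EPERM) ? 1 : -1;
}
```

On a kernel with mseal(), the helper should return 1: the attempt to make the page writable and the attempt to unmap it are both refused. The patch set applies the same protection to the system mappings before user space gets a chance to run.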

Checkpointing

There is one major exception, however: Checkpoint/Restore in Userspace (CRIU), which can freeze an application or container and "checkpoint" its state to disk and then restore it. It needs to move these mappings during the process of restoring a stored application, to return them to where they were located in the checkpointed process. The fourth (and most recent, as of this writing) version of Xu's patch set only works for kernels that are built without checkpoint support, and only on 64-bit x86 and Arm architectures. That limitation was not appreciated by some reviewers. Andrei Vagin, one of the CRIU developers, said:

I like the idea of this patchset, but I don't like the idea of forcing users to choose between this security feature and checkpoint/restore functionality. We need to explore ways to make this feature work with checkpoint/restore. Relying on CAP_CHECKPOINT_RESTORE is the obvious approach.

The CAP_CHECKPOINT_RESTORE capability controls a process's ability to perform some of the privileged operations that CRIU uses to implement checkpointing and restoration, including reading memory map files in /proc for other processes. Since only trusted programs are supposed to be given the capability, allowing such processes to selectively disable the sealing of system mappings should not expose the system to much more risk.
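For readers who have not looked at those memory map files, the sketch below shows how a process can locate one of the kernel-provided mappings in its own /proc/self/maps, which needs no capability at all. The function name find_mapping() is hypothetical, and the parsing assumes the usual "start-end perms offset dev inode name" line layout; CRIU's real parser is considerably more thorough.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up a mapping by name (for example "[vdso]") in
 * /proc/self/maps. Returns 1 and fills in the range if found, 0 if no such
 * mapping exists, and -1 if the maps file cannot be opened.
 */
int find_mapping(const char *name, unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    char line[512];
    int found = 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        unsigned long s, e;
        char label[256] = "";

        /* Skip perms, offset, dev, and inode; the name field is optional. */
        int n = sscanf(line, "%lx-%lx %*s %*s %*s %*s %255[^\n]", &s, &e, label);
        if (n >= 2 && strcmp(label, name) == 0) {
            *start = s;
            *end = e;
            found = 1;
        }
    }
    fclose(f);
    return found;
}
```

Calling find_mapping("[vdso]", &start, &end) on a typical x86-64 system yields the address range that the patch set would seal; "[vvar]" and "[vsyscall]" can be looked up the same way where they exist.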

Vagin also pointed out that CRIU only needs to move mappings during the restore process, not change their properties or contents. Xu thought that was an important detail to consider. He agreed "that forcing users to choose isn't ideal". He questioned whether it made sense to have a prctl() option to control whether mappings are sealed or not on systems with both features enabled, but still thought that the code should unconditionally seal the mappings on systems which did not have checkpointing support enabled.
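The distinction Vagin draws, moving a mapping without touching its properties or contents, corresponds to what mremap() does with the MREMAP_FIXED flag. The sketch below demonstrates that primitive on an ordinary anonymous page; it is not CRIU's restore code, and move_mapping_demo() is a name invented for this example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: relocate a page to a chosen address and check that its
 * contents travelled with it. Returns 1 on success and 0 on failure.
 */
int move_mapping_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    char *src = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return 0;

    /* Reserve a destination address; MREMAP_FIXED will replace this mapping. */
    char *dst = mmap(NULL, page, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) {
        munmap(src, page);
        return 0;
    }

    strcpy(src, "payload");

    char *moved = mremap(src, page, page, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
    if (moved == MAP_FAILED) {
        munmap(src, page);
        munmap(dst, page);
        return 0;
    }

    int ok = (moved == dst) && strcmp(moved, "payload") == 0;
    munmap(moved, page);
    return ok;
}
```

On a sealed mapping, the mremap() call would be expected to fail with EPERM, which is why a restorer that must put the vDSO back at its checkpointed address cannot work once the mappings are sealed.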

Xu's patch set seals the created mappings while the kernel is processing a call to execve(); the proposed prctl() option would not be able to disable the sealing for an existing process, but could be used to turn it off for newly created processes. Vagin agreed that such an option would work for CRIU, which controls the way that restored processes are created. He did mention that other programs, such as gVisor, might not be able to make use of the option, however.

Command-line parameters

The larger source of complaints, however, was how Xu's patch set handled the kernel command-line parameter that it adds (exec.seal_system_mappings). The patch set has both a kernel configuration setting and a kernel command-line parameter. This isn't unusual — many features have a configuration setting to build the feature into the kernel, and then a command-line parameter to enable or disable it at run time. Xu's patch set was set up the opposite way around: the feature was always built into the kernel, the configuration setting controlled whether it defaulted to on or off, and the setting could be overridden by the command line. Lorenzo Stoakes objected that the kernel command-line flag just seems like "a debug flag to experiment on arbitrary kernels".
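The difference between the two arrangements can be written down as a pair of small decision functions. This is only a model of the logic as the article describes it, not code from the patch set; the enum, the function names, and the parameters are all hypothetical.

```c
#include <stdbool.h>

/* What the boot command line said about the feature (hypothetical model). */
enum boot_override { BOOT_UNSET, BOOT_DISABLE, BOOT_ENABLE };

/*
 * Conventional arrangement: the configuration setting decides whether the
 * code is built in at all, and the command line can only toggle a feature
 * that is present.
 */
bool conventional_enabled(bool built_in, bool runtime_default,
                          enum boot_override boot)
{
    if (!built_in)
        return false; /* nothing to enable, whatever the command line says */
    if (boot == BOOT_UNSET)
        return runtime_default;
    return boot == BOOT_ENABLE;
}

/*
 * Arrangement in the posted patches, as described above: the code is always
 * built, the configuration setting only picks the default, and the command
 * line overrides it.
 */
bool as_posted_enabled(bool config_default, enum boot_override boot)
{
    if (boot == BOOT_UNSET)
        return config_default;
    return boot == BOOT_ENABLE;
}
```

In this model, as_posted_enabled(false, BOOT_ENABLE) is true while conventional_enabled(false, false, BOOT_ENABLE) is false: under the posted design a boot parameter can switch sealing on even in a kernel whose builder left it off.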

Xu indicated that the design was deliberate, to allow users to enable the security feature even if a distribution chose not to enable it in its kernel builds. Stoakes was not convinced. Xu reiterated that users should know what they're doing, and if a user wants to enable the feature on a kernel that also supports CRIU, and this breaks things, then that's their prerogative. When Xu proved not to be receptive to changes in this area, Stoakes blocked the patch set from being merged, saying: "This series is unacceptable in its current form as it allows untested architectures and known broken configurations to enable this feature."

Discussion continued even after Stoakes had blocked the most recent version of the patch set. Kees Cook summarized the discussion up to that point, saying that if Xu could clarify some details about which mappings and platforms were included, clean up the command-line parameters, and add a prctl() option, he would have everything needed for the next version of the patch set. Liam Howlett thought that there was more research needed, however:

So we have at least two userspace uses that this will break: checkpoint restore and now gVisor, but who knows what else? How many config options before we decide this can't be just on by default?

[...]

And we really don't understand what it will impact fully - considering v4 is still resulting in new things that will be broken.

Stoakes expressed the same sentiment, saying that they should "tread carefully" with this feature, and calling for extensive testing and additional justification of the security benefits. Cook thought that, while additional testing was wise, it was unreasonable to ask for testing of every combination of architecture and relevant flags. He advocated for starting with the minimum number of supported platforms, and narrowly testing those.

Xu raised the possibility of removing the kernel command-line argument altogether, and relying only on the kernel configuration setting to enable or disable sealing. This would be sufficient for Android and ChromeOS, and sidestep some of the problems reviewers were pointing out. He later suggested that, since CRIU is not the only thing that breaks when his changes are enabled, the build-time conflict between his changes and CRIU provides a false sense of safety and could be removed. Stoakes strongly objected to that idea, saying that any patch must not make it possible for a user to accidentally break their kernel. He does want to see this patch set merged, but only once it doesn't include a boot parameter that will silently break things if used incorrectly. Matthew Wilcox was harsher, implying that Xu was trying to get the bare minimum needed by Google upstream.

Pedro Falcato floated the idea of leaving it up to user space to seal these mappings, pointing out that the GNU C library (glibc) is already working toward sealing some parts of the address space (an effort that might soon be finished). Sealing the vDSO from glibc is more complicated than it might seem, Xu explained. The dynamic linker doesn't know how large the vDSO is, especially since it varies between architectures and kernel versions. Also, uprobe mappings (which are created later) couldn't be sealed by the dynamic linker.
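One way to see the size problem from user space: the auxiliary vector hands a process the base address of the vDSO (AT_SYSINFO_EHDR) but no length. The sketch below, which assumes a glibc-style <link.h> providing the ElfW() macro, estimates the extent of the vDSO image by walking its ELF program headers. The function name vdso_load_span() is hypothetical, and the result covers only the loadable segments of the vDSO itself, not page rounding or the separate vvar mapping.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/auxv.h>

/*
 * Hypothetical helper: return the byte span covered by the vDSO's PT_LOAD
 * segments, or 0 if no vDSO is advertised. On success, *base_out (if not
 * NULL) receives the address taken from AT_SYSINFO_EHDR.
 */
size_t vdso_load_span(uintptr_t *base_out)
{
    uintptr_t base = (uintptr_t)getauxval(AT_SYSINFO_EHDR);
    if (base == 0)
        return 0;

    const ElfW(Ehdr) *ehdr = (const ElfW(Ehdr) *)base;
    const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)(base + ehdr->e_phoff);

    uintptr_t lo = UINTPTR_MAX;
    uintptr_t hi = 0;
    for (size_t i = 0; i < ehdr->e_phnum; i++) {
        if (phdr[i].p_type != PT_LOAD)
            continue;
        uintptr_t seg_start = (uintptr_t)phdr[i].p_vaddr;
        uintptr_t seg_end = seg_start + (uintptr_t)phdr[i].p_memsz;
        if (seg_start < lo)
            lo = seg_start;
        if (seg_end > hi)
            hi = seg_end;
    }
    if (hi <= lo)
        return 0;

    if (base_out != NULL)
        *base_out = base;
    return (size_t)(hi - lo);
}
```

Even this estimate has to be derived by parsing headers, and it says nothing about the neighboring data pages, which gives some feel for why sealing from the dynamic linker is less straightforward than sealing in the kernel, where the exact mapping sizes are known.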

The fifth version of this patch set is now in the works. By removing the command-line parameter that reviewers objected to, and fixing some of the other technical problems with the patches, hopefully Xu will be able to get the next version accepted. That will take time, however, and require another round of review that may provoke an equally detailed set of comments. Still, the kernel is one step closer to locking down another potential security problem before it becomes relevant — without handing users another way to shoot themselves in the foot.




rr needs to remap VDSO etc

Posted Feb 5, 2025 4:33 UTC (Wed) by roc (subscriber, #30627) [Link] (3 responses)

rr is another application that needs to change the mapping of VDSO etc. A prctl or similar option to turn off sealing for future execve()s will be essential for us.

rr needs to remap VDSO etc

Posted Feb 5, 2025 18:30 UTC (Wed) by ljsloz (subscriber, #158382) [Link]

Thanks, noted, I am very concerned about a variety of applications being inadvertently impacted and this only deepens my concern. It feels like at best this should be opt-in at this point.

(- Stoakes of the article)

rr needs to remap VDSO etc

Posted Feb 13, 2025 22:49 UTC (Thu) by jeffxu (guest, #162831) [Link] (1 responses)

Could you please share a link to rr?

rr needs to remap VDSO etc

Posted Feb 14, 2025 8:14 UTC (Fri) by johill (subscriber, #25196) [Link]

This rr, I believe: https://rr-project.org/

too restrictive: sealing by default

Posted Feb 5, 2025 16:49 UTC (Wed) by jreiser (subscriber, #11027) [Link] (4 responses)

There is a model for execution inside a process where the data maintains its presence in memory while the code chains (and changes completely) from phase to phase. Each phase gets invoked with "command-line parameters" which are pointers to the root and various other parts of the persisting in-memory data structure. A very small persistent user-mode supervisor performs the equivalent of execve() from one link in the code chain to the next. This model arose, evolved, and was necessary and useful in environments where physical RAM and address space were constrained, and temporary non-RAM storage was slow or non-existent: say, many large applications before 1975 (sometimes the links in the code chain were called "overlays"); or, today's portable battery-operated devices such as cellular "telephones"! Sealing by default won't work.

too restrictive: sealing by default

Posted Feb 5, 2025 17:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> today's portable battery-operated devices such as cellular "telephones"! Sealing by default won't work.

Why? Modern phones easily have gigabytes of RAM.

too restrictive: sealing by default

Posted Feb 5, 2025 19:40 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

> Why? Modern phones easily have gigabytes of RAM.

And this is what gives us applications that are bloated and slow. What's that law - throwing resources at a project that's running late just makes it later?

You should always write your programs to be as lean and mean as you can. That can be hard at times, but the more you can suppress your desire to (ab)use all the resources at your disposal, the better your program will be.

Cheers,
Wol

too restrictive: sealing by default

Posted Feb 5, 2025 19:44 UTC (Wed) by intelfx (subscriber, #130118) [Link]

> What's that law - throwing resources at a project that's running late just makes it later?

That's a fun one. :)

too restrictive: sealing by default

Posted Feb 13, 2025 13:03 UTC (Thu) by gus3 (guest, #61103) [Link]

You're thinking of Brooks's Law.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds