A sandbox mode for the kernel

By Jonathan Corbet
February 29, 2024

The Linux kernel follows a monolithic design, and that brings a well-known problem: all code in the kernel has access to the entirety of the kernel's address space. As a result, a bug in (for example) an obscure driver may well be exploitable to wreak havoc on core-kernel data structures. Various attempts have been made over the years to increase the degree of isolation within the kernel. The latest of these, "SandBox Mode" proposed by Petr Tesařík, makes it possible for the kernel to run some limited code safely, but it has encountered a bit of a chilly reception.

Sandbox mode

The intent behind this new mode is to allow the kernel to run a function in such a way that it cannot affect the rest of the kernel. In its simplest form, sandbox mode is used by defining a function to be run in this isolated mode:

    #include <linux/sbm.h>

    static SBM_DEFINE_FUNC(untrusted_func, void *input_data, void *output_data);

That function would then be invoked with a sequence like:

    struct sbm sbm;

    sbm_init(&sbm);
    result = sbm_call(&sbm, untrusted_func,
                      SBM_COPY_IN(&sbm, input_buffer, in_size),
                      SBM_COPY_OUT(&sbm, output_buffer, out_size));

This code will result in a call to untrusted_func(). The input and output buffers will be allocated, and the input data copied, before that function is called with pointers to the new buffers. On a successful return, output data will be copied back, and sbm_call() will return the value returned by untrusted_func().
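The copy-in, call, copy-out sequence described above can be modeled in ordinary user-space C. What follows is a minimal sketch, not the kernel's sbm_call(): the names model_sbm_call(), sandboxed_fn, and sum_bytes() are hypothetical and exist only for illustration. It shows the one property that the weak form provides, namely that the called function only ever sees private copies of the caller's buffers.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature for a function that runs on private buffer copies. */
typedef int (*sandboxed_fn)(const void *in, size_t in_size,
                            void *out, size_t out_size);

/*
 * Model of the sequence described in the text: allocate fresh buffers,
 * copy the input in, call the function, copy the output back on success.
 * This is an illustration only, not the kernel implementation.
 */
int model_sbm_call(sandboxed_fn fn, const void *input, size_t in_size,
                   void *output, size_t out_size)
{
    void *in_copy = malloc(in_size ? in_size : 1);
    void *out_copy = calloc(1, out_size ? out_size : 1);
    int ret;

    if (!in_copy || !out_copy) {
        free(in_copy);
        free(out_copy);
        return -ENOMEM;
    }

    memcpy(in_copy, input, in_size);            /* copy in */
    ret = fn(in_copy, in_size, out_copy, out_size);
    if (ret >= 0)
        memcpy(output, out_copy, out_size);     /* copy out on success only */

    free(in_copy);
    free(out_copy);
    return ret;
}

/* Example function: sums the input bytes into an unsigned int result. */
int sum_bytes(const void *in, size_t in_size, void *out, size_t out_size)
{
    const unsigned char *p = in;
    unsigned int sum = 0;

    if (out_size < sizeof(sum))
        return -EINVAL;
    for (size_t i = 0; i < in_size; i++)
        sum += p[i];
    memcpy(out, &sum, sizeof(sum));
    return 0;
}
```

A caller would pass its own buffers to model_sbm_call() and find them untouched if the function returns an error. Nothing in this model stops the function from following a stray pointer elsewhere.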

In the absence of architecture-specific support, that is about all that sandbox mode does; the associated documentation rightly describes this as "weak isolation". It might be enough to trap a simple overflow of the input or output buffers, but it still does not protect the kernel from any stray accesses that go further afield.

In a separate series, Tesařík provided a set of x86-64 architecture hooks that enhance the sandbox to provide stronger isolation. Specifically:

* The sandboxed function will be run with a separate set of page tables that limit its address space to the relevant code, the input buffer (mapped read-only), and the output buffer. As a result, the function will have no access to any other memory in the system. This change has some far-ranging implications; for example, it must be undone if an interrupt arrives so that the interrupt handler can run within the kernel's address space.
* The CPU is put into user mode, so that it cannot execute any privileged instructions; the function runs as if it were a user-space process.
* A separate kernel stack is allocated and the function is called on that stack so that it has no access to the normal kernel stack. There is also a separate exception stack that is used while sandbox mode is in effect.
* Any sort of CPU fault causes the immediate termination of the sandbox and an error return to the caller.

At this point, according to the documentation, sandbox mode provides "strong isolation" that should suffice to prevent the sandboxed function from accessing the rest of the kernel.
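One of these properties, that a fault ends the sandbox and the caller simply sees an error, has a rough user-space analogy using fork(). The sketch below is only that analogy and has nothing to do with how the x86-64 hooks are implemented: run_isolated(), reverse_bytes(), and stray_write() are hypothetical names, and the child process here still sees a copy of the parent's whole address space rather than a restricted set of page tables. It does show a stray write being contained and turned into an error return.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*isolated_fn)(const unsigned char *in, size_t in_size,
                           unsigned char *out, size_t out_size);

/* Run fn in a child process; a fatal signal there becomes -EFAULT here. */
int run_isolated(isolated_fn fn, const void *input, size_t in_size,
                 void *output, size_t out_size)
{
    size_t total = in_size + out_size;
    unsigned char *shared;
    int status, ret;
    pid_t pid;

    if (total == 0)
        return -EINVAL;

    /* One shared mapping holds the input followed by the output. */
    shared = mmap(NULL, total, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -ENOMEM;
    memcpy(shared, input, in_size);

    pid = fork();
    if (pid == 0)
        _exit(fn(shared, in_size, shared + in_size, out_size) == 0 ? 0 : 1);

    if (pid < 0) {
        ret = -EAGAIN;
    } else if (waitpid(pid, &status, 0) < 0) {
        ret = -ECHILD;
    } else if (WIFSIGNALED(status)) {
        ret = -EFAULT;                 /* the "sandbox" faulted */
    } else if (WEXITSTATUS(status) != 0) {
        ret = -EINVAL;                 /* the function reported failure */
    } else {
        memcpy(output, shared + in_size, out_size);
        ret = 0;
    }
    munmap(shared, total);
    return ret;
}

/* Well-behaved example: writes the input bytes in reverse order. */
int reverse_bytes(const unsigned char *in, size_t in_size,
                  unsigned char *out, size_t out_size)
{
    if (out_size < in_size)
        return -1;
    for (size_t i = 0; i < in_size; i++)
        out[i] = in[in_size - 1 - i];
    return 0;
}

/* Misbehaving example: writes through a null pointer and faults. */
int stray_write(const unsigned char *in, size_t in_size,
                unsigned char *out, size_t out_size)
{
    volatile unsigned char *wild = NULL;

    (void)in; (void)in_size; (void)out; (void)out_size;
    *wild = 1;
    return 0;
}
```

Calling run_isolated() with stray_write() returns a negative error and leaves the caller's output buffer alone. The price is a full process creation per call, which is part of what the in-kernel design tries to avoid.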

In search of users

But for what purpose has this mode been created? The documentation says that sandbox mode exists for "parsing data from untrusted sources, especially if the parsing cannot be reasonably done by a user mode helper", but there was no actual user included with the patch series, so there was no way to see what an intended user looks like. That, naturally, led to questions. Andrew Morton remarked that the API seemed overly restrictive and wondered how it would be possible to get any real work done; he asked for an example to clarify the situation, a request that Greg Kroah-Hartman echoed.

Tesařík answered that the framework is "quite limited" in its current form, but that he intended to expand it over time. A bit later, he posted a PGP-key parser that would run within the sandbox mode as an example user, but that did little to increase acceptance of this work. As Dave Hansen pointed out, the kernel does not currently contain a parser for PGP keys, so the new series just raised the question of why that needs to be added too. Hansen said it would be far better to move some existing kernel functionality into the sandbox to show how it could be made safer.

The response to that request was yet another patch series moving the parsing of AppArmor profiles into a sandbox. Supporting this use case required making a number of changes to the sandbox mode itself, including a new "fixup" mechanism designed to make it possible to call specific kernel functions from within the sandbox. So, for example, if code within the sandbox needs to allocate memory, it can call kmalloc(). That call will result in a fault, which will result in the execution of a proxy version of kmalloc() that will restore the kernel's full address space for the duration of the call.
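The shape of such a proxy list can be sketched as a lookup table keyed on the function that the sandboxed code tried to call. This is a simplified user-space model under stated assumptions, not code from the patch series: fake_kmalloc(), proxy_kmalloc(), fixup_table, and fixup_call() are all hypothetical, and the sandbox_active flag merely stands in for switching address spaces.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*kfunc)(long arg);

int sandbox_active;          /* 1 while the model is "inside" the sandbox */
int calls_outside_sandbox;   /* target calls that ran with full access */

/* Stand-in for a kernel function; returns the size rounded up to 8. */
long fake_kmalloc(long size)
{
    if (!sandbox_active)
        calls_outside_sandbox++;
    return (size + 7) & ~7L;
}

/* Stand-in for a kernel function that has no proxy. */
long fake_other(long arg)
{
    return arg;
}

/* Proxy: leave the sandbox for the duration of the call, then re-enter. */
long proxy_kmalloc(long size)
{
    long ret;

    sandbox_active = 0;
    ret = fake_kmalloc(size);
    sandbox_active = 1;
    return ret;
}

struct fixup {
    kfunc target;   /* function the sandboxed code tried to call */
    kfunc proxy;    /* wrapper that is run instead */
};

/* The list that must be kept in sync with what sandboxed code calls. */
const struct fixup fixup_table[] = {
    { fake_kmalloc, proxy_kmalloc },
};

/* Models the fault handler: run the proxy, or terminate the sandbox. */
int fixup_call(kfunc target, long arg, long *result)
{
    size_t n = sizeof(fixup_table) / sizeof(fixup_table[0]);

    for (size_t i = 0; i < n; i++) {
        if (fixup_table[i].target == target) {
            *result = fixup_table[i].proxy(arg);
            return 0;
        }
    }
    sandbox_active = 0;   /* no proxy: the sandbox is terminated */
    return -EFAULT;
}
```

In this model, any function missing from fixup_table ends the sandbox with an error, so every new call made by sandboxed code needs a matching table entry.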

Hansen responded that the "fixup" mechanism looked like a maintenance problem: "Establishing and maintaining this proxy list will be painful. Folks will change the code to call something new and break this *constantly*." He concluded that sandbox mode did not seem like a good idea overall: "I don't see any viable way forward for this approach". He did not even comment on the need to add a special "__nosbm" marker to all functions that might land in the same page as one that has been marked for calls from within sandbox mode — an extra step that seems almost certain to go wrong at some point.

The obvious conclusion is that sandbox mode is unlikely to make it into the mainline in anything resembling its current form. But there is clear value in isolating some kinds of kernel code, if there were only an acceptable way in which it could be done. One possibility is to use BPF, which is intended to provide isolation; non-trivial BPF programs can be tricky to get past the verifier, though, and the fact that they are loaded from user space may make some security-oriented people nervous.

Another possibility might be the user-mode blob feature that was merged into the 4.18 kernel nearly six years ago. It was intended for a similar purpose — the parsing of firewall rules for the BPF-based "bpfilter" subsystem — but has never seen use in the mainline kernel. In response to a query about using this feature instead of a new sandbox mode, Roberto Sassu said that "security people don't feel confident" about using it. The main concern seems to be that, since user-mode blobs run in a separate, user-space process, they would be subject to manipulation by user space; sandbox mode, being fully contained within the kernel, should be better protected.

If complete isolation from user space is also a requirement for this work, then it may be that there are no viable solutions for Linux at this time. Hardening the kernel is a worthy goal, but it is just one of many that have to be traded off in the creation of a kernel that is both useful and maintainable in the real world. In the absence of a better implementation, it would appear that sandbox mode does not offer enough to justify the tradeoffs it would require.

A sandbox mode for the kernel

Posted Feb 29, 2024 16:33 UTC (Thu) by auc (subscriber, #45914) [Link] (5 responses)

Naive question: can nothing be done in this direction using the ring 1/2 levels of Intel-like CPUs? Weren't those initially designed to help isolate, e.g., device drivers?

A sandbox mode for the kernel

Posted Feb 29, 2024 17:00 UTC (Thu) by izbyshev (guest, #107996) [Link]

Even if something can be done, rings 1 and 2 don't exist in the recently proposed X86S.

A sandbox mode for the kernel

Posted Feb 29, 2024 17:04 UTC (Thu) by ErikF (subscriber, #118131) [Link]

x86-64 long mode has removed rings 1 and 2 (https://www.intel.com/content/www/us/en/developer/article...).

A sandbox mode for the kernel

Posted Feb 29, 2024 21:03 UTC (Thu) by draco (subscriber, #1792) [Link] (2 responses)

Linus wrote a withering critique of the ring 1/2 architecture a few years ago: https://www.realworldtech.com/forum/?threadid=200812&...

So I don't think that's going to happen

A sandbox mode for the kernel

Posted Mar 7, 2024 14:21 UTC (Thu) by tesarik (subscriber, #52705) [Link]

FWIW I had a quick glance at CPL 1 and 2:

This thing is x86-specific (whereas my SandBox Mode can be implemented for any platform with an MMU).
Memory protection in Linux is based on paging, which does not offer more than a U/S bit even on x86.

Despite what Linus wrote in the reply linked above, a “Supervisor page” does not mean CPL=0, but rather CPL≠3. So, there is a difference between ring 1 (can access Supervisor pages) and ring 3 (can access only User pages), and there is also a difference between ring 0 (can execute privileged instructions) and ring 1 (cannot execute privileged instructions). From this, it would seem that ring 1 matches SandBox Mode requirements quite nicely, but not really. Since kernel pages are accessible from ring 1, I would have to flush the whole TLB (including global pages) on every SandBox Mode entry. If sandbox code runs in ring 3, I can do lazy TLB invalidation, which is a huge performance win.

A sandbox mode for the kernel

Posted Mar 9, 2024 10:56 UTC (Sat) by khim (subscriber, #9252) [Link]

Keep in mind that his critique is specifically about the x86 implementation, not about the idea of having more than two dedicated levels of privilege separation.

The core of his critique was this: there is basically absolutely no difference between rings 1-3.

These rings were added to the 80286 to support a very limited version of what the iAPX 432 did, and if you structure your OS and all your programs as a bunch of objects that don't use a flat address space and access everything via descriptors, they are useful… but nobody does that (except maybe OS/2), and x86-64 doesn't even support that mode.

These rings are, indeed, supremely useless for most modern OSes.

That doesn't mean all rings are useless; e.g., there are more rings for virtualization, and they are useful. Just not for what is attempted here.

A sandbox mode for the kernel

Posted Feb 29, 2024 19:43 UTC (Thu) by roc (subscriber, #30627) [Link] (1 responses)

Use WebAssembly.

But also, if user-space helpers aren't adequately isolated, fix that! That seems much more generally useful, and leads to a better long-term outcome.

A sandbox mode for the kernel

Posted Mar 7, 2024 14:34 UTC (Thu) by tesarik (subscriber, #52705) [Link]

Yes, SandBox Mode could be redefined as immutable user space. I did consider this option but then decided against it, because I wanted to make it reasonably easy to move existing kernel code into a SandBox Mode. Making a user-mode driver (UMD) is substantially more effort:

UMD is a user-space application. It cannot use the standard kernel APIs.
Data passed between the kernel and UMD must be serialized and deserialized, plus you may have to add some glue code in the kernel.

That said, if such a definition of SandBox Mode is more welcome to the community, it is a viable alternative.

A sandbox mode for the kernel

Posted Feb 29, 2024 19:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

WASM.

/me runs and takes cover.

But really, this should be a good application for eBPF. Especially because the code in question is not performance-critical.

A sandbox mode for the kernel

Posted Mar 1, 2024 13:55 UTC (Fri) by eru (subscriber, #2753) [Link] (11 responses)

since user-mode blobs run in a separate, user-space process, they would be subject to manipulation by user space

How so? Certainly possible if the blob execs or dlopens some file, but if the blob only uses kernel-provided data and code, as described in the linked LWN article, I don't see how it could be manipulated (unless there is a security breach allowing normal programs to hack privileged user-mode programs, but then the game is already lost anyway).

Interference from user space

Posted Mar 1, 2024 15:00 UTC (Fri) by corbet (editor, #1) [Link] (10 responses)

Remember that there are people setting up locked-down systems where even root can only do so much. On such a system, the ability to attack a process with ptrace() could be a real concern.

Interference from user space

Posted Mar 1, 2024 22:48 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

The only way I can think of to make this work is to make the userspace helper exceptional in some way, such as:

* The kernel automatically panics if any part of userspace messes with it.
* The kernel automatically denies any attempt by userspace to mess with it.
* It is owned by some kind of super-root user which is superior to "regular" root, and there is no way for init or its children to obtain super-root creds.
* It is a kernel thread, with all of the usual "can't touch this" special-casing, but it runs in user mode instead of kernel mode.

All of the above are probably difficult for one reason or another.

Interference from user space

Posted Mar 1, 2024 23:40 UTC (Fri) by mjg59 (subscriber, #23239) [Link] (2 responses)

Run the userspace helper in an entirely disjoint set of namespaces that aren't children of anything running elsewhere? (would this work? I assume PIDs end up being screwy in some way, but you could special case this namespace in the ptrace and signal path and maybe that would be good enough)

Interference from user space

Posted Mar 2, 2024 23:14 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (1 responses)

What does the namespace data structure look like? Is there any code path that tries to recursively enumerate the whole set of namespaces (for some particular kind of namespace)?

I tend to imagine that, at a minimum, there must be some code path that reclaims unreachable namespaces when the last process in them dies...

Interference from user space

Posted Mar 5, 2024 23:12 UTC (Tue) by laarmen (subscriber, #63948) [Link]

Wouldn't simple refcounting work to reclaim unreachable namespaces?

Interference from user space

Posted Mar 2, 2024 2:08 UTC (Sat) by roc (subscriber, #30627) [Link]

Maybe the people who don't trust root should run everything in a container where they're root in the container's user namespace but not the toplevel user namespace.

Interference from user space

Posted Mar 3, 2024 19:33 UTC (Sun) by eru (subscriber, #2753) [Link] (2 responses)

Haven't used ptrace (directly), but looking at the manpage, it already comes with a complex set of conditions under which it is allowed, or not. Adding one for "disallowed if the process is a blob" would fit there.

Interference from user space

Posted Mar 3, 2024 21:03 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

There's also /proc/pid/fd that can expose file descriptors.

Interference from user space

Posted Mar 4, 2024 15:07 UTC (Mon) by calumapplepie (guest, #143655) [Link]

Wouldn't having the sandbox task isolated from the kernel necessarily mean that information leaks from it are permissible? I think the fear of allowing ptrace is that it enables attackers to take the now-trusted output of the sandbox task and mess with it, re-opening the attack surface. It doesn't matter if we leak the exact details of what the user-mode task is doing, if those details are only dependent on what the original process was trying to do; an attacker who can ptrace the sandbox can ptrace the original. You can't use the sandbox to, say, break KASLR if it's not mapped to the kernel.**

** This is a blatant lie; we would need to be very careful with what data is passed into the sandbox.

Interference from user space

Posted Mar 7, 2024 14:49 UTC (Thu) by tesarik (subscriber, #52705) [Link]

That's not the only concern. User-mode environment is very different from kernel mode. Quite importantly, it cannot run “inside” kernel mode; that is, if your kernel code wants to delegate something to a user-mode helper, you must create a new task and wait for its completion. A lot can happen there… For example, it creates an opportunity for priority inversion. You'd better not hold any locks while executing a user-mode helper.

SandBox Mode does not change the task state from runnable. SandBox Mode can finish without a task switch.

Interference from user space

Posted Mar 7, 2024 21:30 UTC (Thu) by alkbyby (subscriber, #61687) [Link]

+1 for sharing as many details as possible on why going with user-space helpers didn't work. Whatever user-space lockdown features people think might be missing, introducing them seems a more useful path than adding odd kernel-space trickery. There will always be demand for more lockdowns for user space.

It would also be interesting to see any data on the overhead of weak isolation as explained above, strong isolation, and "honest" user-space context switches.

A sandbox mode for the kernel

Posted Mar 1, 2024 18:35 UTC (Fri) by flussence (guest, #85566) [Link]

This doesn't pass the smell test. Set up a pair of memfds for your ROM/RAM, pass them to a userspace helper running in seccomp mode 1. The tools are already there and they don't have an arbitrary x86-64 constraint besides. Am I missing anything?
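A minimal sketch of that suggestion follows, assuming a Linux system with seccomp support. The function names (make_memfd(), helper_work(), run_seccomp_helper()) and the upper-casing "work" are hypothetical placeholders. The helper is a forked child rather than a separately executed program, and memfd_create() is invoked through syscall() so that no libc wrapper is needed. In strict mode only read(), write(), _exit(), and sigreturn() are permitted, so the mappings must be set up before seccomp is enabled and the child must leave via the plain exit system call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create an anonymous memory file of the given size. */
static int make_memfd(const char *name, size_t size)
{
    int fd = (int)syscall(SYS_memfd_create, name, 0U);

    if (fd >= 0 && ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        fd = -1;
    }
    return fd;
}

/* Placeholder work: copy "ROM" to "RAM", upper-casing ASCII letters. */
static void helper_work(const unsigned char *rom, unsigned char *ram,
                        size_t len)
{
    for (size_t i = 0; i < len; i++)
        ram[i] = (rom[i] >= 'a' && rom[i] <= 'z') ?
                 (unsigned char)(rom[i] - 32) : rom[i];
}

int run_seccomp_helper(const void *input, void *output, size_t len)
{
    int rom_fd, ram_fd, status, ret = -EIO;
    pid_t pid;

    if (len == 0)
        return -EINVAL;
    rom_fd = make_memfd("rom", len);
    ram_fd = make_memfd("ram", len);
    if (rom_fd < 0 || ram_fd < 0 ||
        pwrite(rom_fd, input, len, 0) != (ssize_t)len)
        goto out;

    pid = fork();
    if (pid == 0) {
        /* Map the input read-only and the output read-write. */
        unsigned char *rom = mmap(NULL, len, PROT_READ, MAP_SHARED,
                                  rom_fd, 0);
        unsigned char *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, ram_fd, 0);

        if (rom == MAP_FAILED || ram == MAP_FAILED ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(1);
        helper_work(rom, ram, len);
        syscall(SYS_exit, 0);   /* exit_group() is not allowed here */
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
        pread(ram_fd, output, len, 0) == (ssize_t)len)
        ret = 0;
out:
    if (rom_fd >= 0)
        close(rom_fd);
    if (ram_fd >= 0)
        close(ram_fd);
    return ret;
}
```

If the helper makes any other system call, the kernel kills it with SIGKILL and run_seccomp_helper() returns an error. This sketch does nothing about the ptrace() concern raised elsewhere in the thread.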

A sandbox mode for the kernel

Posted Mar 4, 2024 15:36 UTC (Mon) by calumapplepie (guest, #143655) [Link]

> Supporting this use case required making a number of changes to the sandbox mode itself, including a new "fixup" mechanism designed to make it possible to call specific kernel functions from within the sandbox.

This screams "quick and dirty fix". If C had the obscene metaprogramming power of, say, LISP, then it might make sense to try and write code that can appear in both trusted and untrusted environments, being rewritten to match. But allowing access to functions written and compiled for the general kernel in a sandbox... kind of defeats the point of the sandbox. There shouldn't be a "proxy list"; there should be a separate, defined API of functions a sandboxed process can call.

Code within the sandbox doesn't get to call kmalloc(). It gets to call sandy_malloc, which will do all kinds of fun sanity checks. Now, interestingly, sandy_malloc will start to look a lot like glibc's malloc; one wonders if a sensible implementation of these sandboxes will start to look like a magic process that does work for the kernel. I suppose that's what the discussion in some of the other parts of this thread is about; ultimately, I think that trying to have a second implementation of userspace in the kernel is a bad idea. The one we have is already pretty good.

A sandbox mode for the kernel

Posted Mar 7, 2024 10:09 UTC (Thu) by tesarik (subscriber, #52705) [Link]

Just for the record, I have always been aware of many drawbacks. I wrote this blog post back in December in an attempt to sort out my thoughts. (Some implementation details have been changed since then, e.g. there are no “trampoline interrupt handlers”.)

https://sigillatum.tesarici.cz/2023-12-04-sandbox-mode.html