A sandbox mode for the kernel
Sandbox mode
The intent behind this new mode is to allow the kernel to run a function in such a way that it cannot affect the rest of the kernel. In its simplest form, sandbox mode is used by defining a function to be run in this isolated mode:
#include <linux/sbm.h>
static SBM_DEFINE_FUNC(untrusted_func, void *input_data, void *output_data);
That function would then be invoked with a sequence like:
struct sbm sbm;
sbm_init(&sbm);
result = sbm_call(&sbm, untrusted_func,
                  SBM_COPY_IN(&sbm, input_buffer, in_size),
                  SBM_COPY_OUT(&sbm, output_buffer, out_size));
This code will result in a call to untrusted_func(). The input and output buffers will be allocated, and the input data copied, before that function is called with pointers to the new buffers. On a successful return, output data will be copied back, and sbm_call() will return the value returned by untrusted_func().
In the absence of architecture-specific support, that is about all that sandbox mode does; the associated documentation rightly describes this as "weak isolation". It might be enough to trap a simple overflow of the input or output buffers, but it still does not protect the kernel from any stray accesses that go further afield.
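The copy-in/copy-out behavior described above can be mimicked in ordinary user-space C. What follows is only a sketch of the idea, not the code from the patch series: call_with_copies(), isolated_fn, and to_upper() are hypothetical names invented for illustration, and malloc() stands in for whatever allocator the kernel code uses. It also assumes that a non-negative return value means success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature for a function that is to be isolated. */
typedef int (*isolated_fn)(const void *in, size_t in_size,
                           void *out, size_t out_size);

/*
 * Run fn on private copies of the caller's buffers. The output is copied
 * back only if fn reports success (assumed here: a non-negative return).
 */
int call_with_copies(isolated_fn fn, const void *in, size_t in_size,
                     void *out, size_t out_size)
{
    void *in_copy, *out_copy;
    int ret = -1;

    assert(fn != NULL);
    in_copy = malloc(in_size ? in_size : 1);
    out_copy = calloc(1, out_size ? out_size : 1);

    if (in_copy && out_copy) {
        memcpy(in_copy, in, in_size);            /* copy in */
        ret = fn(in_copy, in_size, out_copy, out_size);
        if (ret >= 0)
            memcpy(out, out_copy, out_size);     /* copy out */
    }
    free(in_copy);
    free(out_copy);
    return ret;
}

/* Example function: upper-case ASCII input, return the byte count. */
int to_upper(const void *in, size_t in_size, void *out, size_t out_size)
{
    const char *src = in;
    char *dst = out;
    size_t i;

    if (out_size < in_size)
        return -1;
    for (i = 0; i < in_size; i++)
        dst[i] = (src[i] >= 'a' && src[i] <= 'z') ? src[i] - 32 : src[i];
    return (int)in_size;
}
```

Because the function only ever sees the copies, a failed call leaves the caller's output buffer untouched. Nothing here stops the function from scribbling on unrelated memory, though, which is the sense in which this level of isolation is weak.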
In a separate series, Tesařík provided a set of x86-64 architecture hooks that enhance the sandbox to provide stronger isolation. Specifically:
- The sandboxed function will be run with a separate set of page tables that limit its address space to the relevant code, the input buffer (mapped read-only), and the output buffer. As a result, the function will have no access to any other memory in the system. This change has some far-ranging implications; for example, it must be undone if an interrupt arrives so that the interrupt handler can run within the kernel's address space.
- The CPU is put into user mode, so that it cannot execute any privileged instructions; the function runs as if it were a user-space process.
- A separate kernel stack is allocated and the function is called on that stack so that it has no access to the normal kernel stack. There is also a separate exception stack that is used while sandbox mode is in effect.
- Any sort of CPU fault causes the immediate termination of the sandbox and an error return to the caller.
At this point, according to the documentation, sandbox mode provides "strong isolation" that should suffice to prevent the sandboxed function from accessing the rest of the kernel.
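The caller-visible contract of strong isolation (the function cannot touch the caller's memory, and any fault becomes an error return) can be illustrated with a user-space analogy. The sketch below uses fork() and a pipe rather than the page-table and CPU-mode switching that the x86-64 series performs, so it shows the semantics and not the mechanism. The names run_isolated(), double_it(), and crash() are hypothetical, and the choice of -EFAULT as the error value is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef int (*isolated_fn)(const void *in, size_t in_size,
                           void *out, size_t out_size);

/* Read exactly len bytes; fail on error or early EOF (child died). */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Run fn in a child process; a fault in fn yields -EFAULT in the caller. */
int run_isolated(isolated_fn fn, const void *in, size_t in_size,
                 void *out, size_t out_size)
{
    int pipefd[2], status = 0, r = 0, got = 0, ret = -EFAULT;
    pid_t pid, w;
    void *tmp;

    assert(fn != NULL);
    if (pipe(pipefd) < 0)
        return -errno;
    pid = fork();
    if (pid < 0) {
        int err = errno;
        close(pipefd[0]);
        close(pipefd[1]);
        return -err;
    }
    if (pid == 0) {
        /* Child: stray writes only damage this process's private copy. */
        close(pipefd[0]);
        r = fn(in, in_size, out, out_size);
        if (write(pipefd[1], &r, sizeof(r)) != (ssize_t)sizeof(r) ||
            write(pipefd[1], out, out_size) != (ssize_t)out_size)
            _exit(1);
        _exit(0);
    }
    /* Parent: collect the result first, then reap the child. */
    close(pipefd[1]);
    tmp = malloc(out_size ? out_size : 1);
    if (tmp && read_full(pipefd[0], &r, sizeof(r)) == 0 &&
        read_full(pipefd[0], tmp, out_size) == 0)
        got = 1;
    close(pipefd[0]);
    do {
        w = waitpid(pid, &status, 0);
    } while (w < 0 && errno == EINTR);
    if (w == pid && got && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        if (r >= 0)
            memcpy(out, tmp, out_size);   /* copy out only on success */
        ret = r;
    }
    free(tmp);
    return ret;
}

/* Example: well-behaved function that doubles one int. */
int double_it(const void *in, size_t in_size, void *out, size_t out_size)
{
    int v;

    if (in_size != sizeof(v) || out_size != sizeof(v))
        return -EINVAL;
    memcpy(&v, in, sizeof(v));
    v *= 2;
    memcpy(out, &v, sizeof(v));
    return 0;
}

/* Example: misbehaving function that faults before it can return. */
int crash(const void *in, size_t in_size, void *out, size_t out_size)
{
    (void)in; (void)in_size; (void)out; (void)out_size;
    raise(SIGSEGV);
    return 0;
}
```

Calling run_isolated() with double_it() fills the output and returns 0; calling it with crash() returns -EFAULT and leaves the caller's buffer as it was. A process switch like this is much heavier than what sandbox mode aims for, which runs the function in the same task.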
In search of users
But for what purpose has this mode been created? The documentation says that sandbox mode exists for "parsing data from untrusted sources, especially if the parsing cannot be reasonably done by a user mode helper", but there was no actual user included with the patch series, so there was no way to see what an intended user looks like. That, naturally, led to questions. Andrew Morton remarked that the API seemed overly restrictive and wondered how it would be possible to get any real work done; he asked for an example to clarify the situation, a request that Greg Kroah-Hartman echoed.
Tesařík answered that the framework is "quite limited" in its current form, but that he intended to expand it over time. A bit later, he posted a PGP-key parser that would run within sandbox mode as an example user, but that did little to increase acceptance of this work. As Dave Hansen pointed out, the kernel does not currently contain a parser for PGP keys, so the new series just raised the question of why that needs to be added too. Hansen said it would be far better to move some existing kernel functionality into the sandbox to show how it could be made safer.
The response to that request was yet another patch series moving the parsing of AppArmor profiles into a sandbox. Supporting this use case required making a number of changes to the sandbox mode itself, including a new "fixup" mechanism designed to make it possible to call specific kernel functions from within the sandbox. So, for example, if code within the sandbox needs to allocate memory, it can call kmalloc(). That call will result in a fault, which will result in the execution of a proxy version of kmalloc() that will restore the kernel's full address space for the duration of the call.
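The shape of such a proxy list can be sketched in a few lines of user-space C. This is a simulation under stated assumptions and not the code from the series: fake_kmalloc(), proxy_kmalloc(), sandbox_fault(), and the full_address_space flag are all invented for illustration, the flag stands in for switching page tables, and the choice to reject any call without a table entry follows from the rule that a fault otherwise terminates the sandbox.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*kfunc)(size_t);

/* One entry per kernel function that sandboxed code is allowed to reach. */
struct fixup {
    kfunc target;   /* function the sandboxed code tried to call */
    kfunc proxy;    /* wrapper that performs the call on its behalf */
};

int full_address_space;   /* stand-in for "kernel page tables active" */
int proxy_calls;          /* counts how often a proxy has run */

/* Stand-in for kmalloc(); it must only run with full access restored. */
void *fake_kmalloc(size_t size)
{
    assert(full_address_space);
    return malloc(size);
}

/* Stand-in for a kernel function that has no proxy. */
void *fake_unlisted(size_t size)
{
    (void)size;
    return NULL;
}

static void *proxy_kmalloc(size_t size)
{
    void *p;

    full_address_space = 1;   /* restore full access for the call */
    p = fake_kmalloc(size);
    full_address_space = 0;   /* drop back to the sandbox's view */
    proxy_calls++;
    return p;
}

static const struct fixup fixups[] = {
    { fake_kmalloc, proxy_kmalloc },
};

/*
 * Stand-in for the fault handler: look up the function that the sandbox
 * tried to call. Returns 0 and stores the proxy's result on a match, or
 * -1 (the sandbox would be terminated) when no proxy is registered.
 */
int sandbox_fault(kfunc target, size_t arg, void **result)
{
    size_t i;

    for (i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
        if (fixups[i].target == target) {
            *result = fixups[i].proxy(arg);
            return 0;
        }
    }
    return -1;
}
```

In this sketch, every kernel function that sandboxed code might call needs its own entry and its own wrapper; a call to anything that is not in the table fails.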
Hansen responded that the "fixup" mechanism looked like a maintenance problem: "Establishing and maintaining this proxy list will be painful. Folks will change the code to call something new and break this *constantly*." He concluded that sandbox mode did not seem like a good idea overall: "I don't see any viable way forward for this approach". He did not even comment on the need to add a special "__nosbm" marker to all functions that might land in the same page as one that has been marked for calls from within sandbox mode, an extra step that seems almost certain to go wrong at some point.
The obvious conclusion is that sandbox mode is unlikely to make it into the mainline in anything resembling its current form. But there is clear value in isolating some kinds of kernel code, if only there were an acceptable way to do it. One possibility is to use BPF, which is intended to provide isolation; non-trivial BPF programs can be tricky to get past the verifier, though, and the fact that they are loaded from user space may make some security-oriented people nervous.
Another possibility might be the user-mode blob feature that was merged into the 4.18 kernel nearly six years ago. It was intended for a similar purpose (the parsing of firewall rules for the BPF-based "bpfilter" subsystem) but has never seen use in the mainline kernel. In response to a query about using this feature instead of a new sandbox mode, Roberto Sassu said that "security people don't feel confident" about using it. The main concern seems to be that, since user-mode blobs run in a separate, user-space process, they would be subject to manipulation by user space; sandbox mode, being fully contained within the kernel, should be better protected.
If complete isolation from user space is also a requirement for this work,
then it may be that there are no viable solutions for Linux at this time.
Hardening the kernel is a worthy goal, but it is just one of many that have
to be traded off in the creation of a kernel that is both useful and
maintainable in the real world. In the absence of a better implementation,
it would appear that sandbox mode does not offer enough to justify the
tradeoffs it would require.
| Index entries for this article | |
|---|---|
| Kernel | Security/Kernel hardening |
Posted Feb 29, 2024 16:33 UTC (Thu)
by auc (subscriber, #45914)
Posted Feb 29, 2024 17:00 UTC (Thu)
by izbyshev (guest, #107996)
Posted Feb 29, 2024 17:04 UTC (Thu)
by ErikF (subscriber, #118131)
Posted Feb 29, 2024 21:03 UTC (Thu)
by draco (subscriber, #1792)
So I don't think that's going to happen
Posted Mar 7, 2024 14:21 UTC (Thu)
by tesarik (subscriber, #52705)
FWIW I had a quick glance at CPL 1 and 2:
Despite what Linus wrote in the reply linked above, a “Supervisor page” does not mean CPL=0, but rather CPL≠3. So, there is a difference between ring 1 (can access Supervisor pages) and ring 3 (can access only User pages), and there is also a difference between ring 0 (can execute privileged instructions) and ring 1 (cannot execute privileged instructions). From this, it would seem that ring 1 matches SandBox Mode requirements quite nicely, but not really. Since kernel pages are accessible from ring 1, I would have to flush the whole TLB (including global pages) on every SandBox Mode entry. If sandbox code runs in ring 3, I can do lazy TLB invalidation, which is a huge performance win.
Posted Mar 9, 2024 10:56 UTC (Sat)
by khim (subscriber, #9252)
Keep in mind that his critique is specifically about the x86 implementation and not about the idea of having more than two dedicated levels of privilege separation. The core of his critique was this: there is basically absolutely no difference between rings 1-3. These rings were added to the 80286 to support a very limited version of what iAPX 432 did, and if you structure your OS and all your programs as a bunch of objects that don't use a flat address space and access everything via descriptors, they are useful… but nobody does that (except maybe OS/2) and x86-64 doesn't even support that mode. These rings are, indeed, supremely useless for most modern OSes. That doesn't mean all rings are useless; e.g., there are more rings for virtualization and they are useful. Just not for what is attempted here.
Posted Feb 29, 2024 19:43 UTC (Thu)
by roc (subscriber, #30627)
But also, if user-space helpers aren't adequately isolated, fix that! That seems much more generally useful, and leads to a better long-term outcome.
Posted Mar 7, 2024 14:34 UTC (Thu)
by tesarik (subscriber, #52705)
Yes, SandBox Mode could be redefined as immutable user space. I did consider this option but then decided against it, because I wanted to make it reasonably easy to move existing kernel code into SandBox Mode. Making a user-mode driver (UMD) is substantially more effort. That said, if such a definition of SandBox Mode is more welcome by the community, it is a viable alternative.
Posted Feb 29, 2024 19:51 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
/me runs and takes a cover.
But really, this should be a good application for eBPF. Especially because the code in question is not performance-critical.
Posted Mar 1, 2024 13:55 UTC (Fri)
by eru (subscriber, #2753)
How so? Certainly possible if the blob execs or dloads some file, but if the blob only uses kernel-provided data and code, as described in the linked lwn article, I don't see how it could be manipulated (unless there is a security breach allowing normal programs to hack privileged user-mode programs, but then the game is already lost anyway).
Posted Mar 1, 2024 15:00 UTC (Fri)
by corbet (editor, #1)
Posted Mar 1, 2024 22:48 UTC (Fri)
by NYKevin (subscriber, #129325)
* The kernel automatically panics if any part of userspace messes with it.
All of the above are probably difficult for one reason or another.
Posted Mar 1, 2024 23:40 UTC (Fri)
by mjg59 (subscriber, #23239)
Posted Mar 2, 2024 23:14 UTC (Sat)
by NYKevin (subscriber, #129325)
I tend to imagine that, at a minimum, there must be some code path that reclaims unreachable namespaces when the last process in them dies...
Posted Mar 5, 2024 23:12 UTC (Tue)
by laarmen (subscriber, #63948)
Posted Mar 2, 2024 2:08 UTC (Sat)
by roc (subscriber, #30627)
Posted Mar 3, 2024 19:33 UTC (Sun)
by eru (subscriber, #2753)
Posted Mar 3, 2024 21:03 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
Posted Mar 4, 2024 15:07 UTC (Mon)
by calumapplepie (guest, #143655)
** This is a blatant lie; we would need to be very careful with what data is passed into the sandbox.
Posted Mar 7, 2024 14:49 UTC (Thu)
by tesarik (subscriber, #52705)
SandBox Mode does not change the task state from runnable. SandBox Mode can finish without a task switch.
Posted Mar 7, 2024 21:30 UTC (Thu)
by alkbyby (subscriber, #61687)
It would also be interesting to see any data on the cost overheads of weak isolation as explained above, strong isolation, and "honest" user-space context switches.
Posted Mar 1, 2024 18:35 UTC (Fri)
by flussence (guest, #85566)
Posted Mar 4, 2024 15:36 UTC (Mon)
by calumapplepie (guest, #143655)
This screams "quick and dirty fix". If C had the obscene metaprogramming power of, say, LISP, then it might make sense to try and write code that can appear in both trusted and untrusted environments, being rewritten to match. But allowing access to functions written and compiled for the general kernel in a sandbox... kind of defeats the point of the sandbox. There shouldn't be a "proxy list"; there should be a separate, defined API of functions a sandboxed process can call.
Code within the sandbox doesn't get to call kmalloc(). It gets to call sandy_malloc, which will do all kinds of fun sanity checks. Now, interestingly, sandy_malloc will start to look a lot like glibc's malloc; one wonders if a sensible implementation of these sandboxes will start to look like a magic process that does work for the kernel. I suppose that's what the discussion in some of the other parts of this thread is about; ultimately, I think that trying to have a second implementation of userspace in the kernel is a bad idea. The one we have is already pretty good.
Posted Mar 7, 2024 10:09 UTC (Thu)
by tesarik (subscriber, #52705)
since user-mode blobs run in a separate, user-space process, they would be subject to manipulation by user space
Remember that there are people setting up locked-down systems where even root can only do so much. On such a system, the ability to attack a process with ptrace() could be a real concern.
Interference from user space
* The kernel automatically denies any attempt by userspace to mess with it.
* It is owned by some kind of super-root user which is superior to "regular" root, and there is no way for init or its children to obtain super-root creds.
* It is a kernel thread, with all of the usual "can't touch this" special-casing, but it runs in user mode instead of kernel mode.
Haven't used ptrace (directly), but looking at the manpage, it already comes with a complex set of conditions under which it is allowed, or not. Adding one for "disallowed if the process is a blob" would fit there.
