Hiding a process's executable from itself
The 2019 incident, which came to be known as CVE-2019-5736, involved a sequence of steps culminating in the overwriting of the runc container-runtime binary from within a container. That binary should not even have been visible within the container, much less writable, but such obstacles look like challenges to a determined attacker. In this case, the attacker was able to gain access to this binary via /proc/self/exe, which always refers to the binary executable for the current process.
Specifically, the attack opens the runc process's /proc/self/exe file, creating a read-only file descriptor — inside the container — for the target binary, which lives outside that container. Once runc exits, the attacker is able to reopen that file descriptor for write access; that descriptor can subsequently be used to overwrite the runc binary. Since runc runs with privilege outside of the container, this becomes a compromise of the host as a whole; see the above-linked article for details.
This vulnerability was closed by having runc copy its binary image into a memfd area and sealing it; control is then passed to that image before entering the container. Sealing prevents modifying the image, but even if that protection fails, the container is running from an independent copy of the binary that will never be used again, so overwriting it is no longer useful. It is a bit of an elaborate workaround, but it plugged the hole at the time.
Scrivano is proposing a different solution to the problem: simply disable access to /proc/self/exe as a way of blocking image-overwrite attacks of this type. Specifically, his patch adds a new prctl() command called PR_SET_HIDE_SELF_EXE that can be used to disable access to /proc/self/exe. Once this option has been enabled, any attempt to open that file from within the owning process will fail with an ENOENT error — as if the file simply did not exist at all. Enabling this behavior is a one-way operation; once it has been turned on, it cannot be disabled until the next execve() call, which resets the option to its disabled state.
This behavior is necessarily opt-in; any program that wants to have its executable image protected from access in this way will have to request it specifically. The intent, though, is that this simple call will be able to replace the more complicated workarounds that are needed to prevent this sort of attack today. A prctl() is a small price to pay if it eliminates the need to create a new copy of the executable image every time a new container is launched.
This new option is thus a simple way of blocking this specific attack, but it leads to some related questions. Hiding the container-runtime binary seems like a less satisfying solution than ensuring that this binary cannot be modified regardless of whether it is visible within a container. It seems to close off one specific path to a compromise without addressing the underlying problem.
More to the point, perhaps, is the question of just how many operations the kernel developers would like to add to prevent access to specific resources that might possibly be misused. There is, conceivably, no end to the files (under /proc and beyond) that might be useful to an attacker who is determined to take over the system. Adding a new prctl() option — and the necessary kernel code to implement it — for every such file could lead to a mess in the long term. There comes a point where it might make more sense to use a security module to implement this sort of policy.
If the development community feels that way, though, it's not saying so — or much of anything else. The patch set has been posted three times and has not received any substantive comments on any of those occasions. There will, presumably, need to be an eventual discussion that decides whether this type of mechanism is the best way of protecting systems against attacks using /proc/self/exe. For the moment, though, it would appear that this change is simply waiting for the wider community to take notice of it.
| Index entries for this article | |
|---|---|
| Kernel | /proc |
| Kernel | Security/Security technologies |
Posted Jan 23, 2023 17:07 UTC (Mon)
by TheGopher (subscriber, #59256)
[Link] (26 responses)
Posted Jan 23, 2023 17:09 UTC (Mon)
by dskoll (subscriber, #1630)
[Link] (2 responses)
Lots of programs assume that /proc is mounted; not mounting it would cause all kinds of annoying failures like the inability to usefully run ps inside a container, for example.
Posted Jan 23, 2023 18:10 UTC (Mon)
by TheGopher (subscriber, #59256)
[Link] (1 responses)
Posted Jan 24, 2023 8:46 UTC (Tue)
by LtWorf (subscriber, #124958)
[Link]
Posted Jan 23, 2023 17:21 UTC (Mon)
by nickodell (subscriber, #125165)
[Link] (1 responses)
Posted Jan 23, 2023 21:10 UTC (Mon)
by jccleaver (guest, #127418)
[Link]
I think a bigger point is that we should probably start swinging back to running more lightweight VMs and fewer containers.
Posted Jan 23, 2023 17:24 UTC (Mon)
by ejr (subscriber, #51652)
[Link] (17 responses)
Posted Jan 24, 2023 1:32 UTC (Tue)
by ringerc (subscriber, #3071)
[Link] (15 responses)
ps is uselessly container-ignorant at the moment. It has no support for filtering by pid namespace, for example. It does not know how to use the NSpid fields in /proc/$pid/status to show a process's pid in another pid namespace, or to display a process identified by a pidns plus a pid in that namespace. It can't even be told to display a tree of processes starting at a specific parent pid; you have to use the otherwise more limited pstree command for that.
And that's just ps. Tools like gdb are completely incapable of functioning usefully across container boundaries and rely on having a gdbserver injected into the target container. Which in turn requires access to /proc, a session with CAP_SYS_PTRACE, etc.
It's a nightmare. Container runtime tools are utterly primitive and provide no assistance whatsoever.
As far as I can tell, believers in os-less containers don't actually think interactive debugging is relevant. I guess you're supposed to use printf() debugging, tracing/APM, and psychic powers.
Posted Jan 24, 2023 8:33 UTC (Tue)
by patrakov (subscriber, #97174)
[Link] (14 responses)
Posted Jan 24, 2023 10:38 UTC (Tue)
by tpo (subscriber, #25713)
[Link] (11 responses)
Objection. It is relevant. I have to do it all the time.
Posted Jan 24, 2023 15:11 UTC (Tue)
by khim (subscriber, #9252)
[Link] (10 responses)
That just means that you don't have a production environment, then. I recommend you get a separate environment to run production in.
Posted Jan 25, 2023 5:20 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (9 responses)
Sometimes you just have to debug what's in prod, because that's where the user/customer is hitting an issue.
You definitely don't want to have to re-deploy a bunch of new container versions with printf-style debugging hacked in, different compile flags with debuginfo, etc as you try to understand the problem.
Prod should be debuggable. Run services with minimal privileges, in bare-bones containers, yes. But have a way to fetch the debuginfo for compiled executables when required, inject an elevated process with a debugger, etc, so you can debug-in-place if and when required.
Posted Jan 25, 2023 12:00 UTC (Wed)
by khim (subscriber, #9252)
[Link] (8 responses)
If your prod is debuggable then it just means your company is not large enough to have a prod. Oh, sure. There are lots of stories like these. But they are not a reason to give someone access to prod with customers' data. Precisely because they come and go, they are annoying (and sometimes take years to find and fix), but they are not enough to permit some random developer access to customers' data. Not only do you have to do that, but you have to ensure that all your temporary logging changes are properly vetted and approved by a security team. Or else regulators would eat you alive. As I have said: if you are allowed to do that, then you don't have a prod but a testing environment exposed to customers. While valuable in many cases, this requires their explicit assent, and it's not a prod. You can afford to do things differently in a testing environment.
Posted Jan 25, 2023 12:22 UTC (Wed)
by tpo (subscriber, #25713)
[Link] (4 responses)
Posted Jan 25, 2023 17:28 UTC (Wed)
by khim (subscriber, #9252)
[Link] (3 responses)
Nope. It's not about context. It's about definition. If you can ssh to your “production” system and do something there, then it's an “advanced testing” environment or something else, but that's not a “production environment”. Most startups don't have production and that's normal. But that doesn't mean you need gdb access in production. You don't. It's not production if you need it.
Posted Jan 25, 2023 18:05 UTC (Wed)
by tpo (subscriber, #25713)
[Link] (2 responses)
Posted Jan 25, 2023 18:09 UTC (Wed)
by tpo (subscriber, #25713)
[Link] (1 responses)
Posted Jan 25, 2023 18:54 UTC (Wed)
by ejr (subscriber, #51652)
[Link]
Posted Jan 25, 2023 16:14 UTC (Wed)
by paulj (subscriber, #341)
[Link] (2 responses)
If you think it is possible to have a full replica in test of your prod, then it just means you haven't worked at the largest tech companies.
Posted Jan 25, 2023 17:35 UTC (Wed)
by khim (subscriber, #9252)
[Link] (1 responses)
What gave you that illusion? Yes, you cannot test everything before deployment, but that just means that your logging and telemetry become important. Especially if you deploy not just to your servers but to millions of devices around the world. But this still doesn't change the fact that production is where you cannot run gdb. It's just the definition of “production”. It deals with sensitive customer data. That's why gdb is not welcome there.
Posted Jan 25, 2023 17:46 UTC (Wed)
by paulj (subscriber, #341)
[Link]
It's a bit nonsensical to think that banning gdb from prod will protect customer data from the people who have access to either the running instance or the code.
What does happen is that you can only access instances of the software that you are responsible for. Everything is as compartmentalised as possible. (Though, this still has limits, given there are groups of people responsible for system-level software on various classes of the fleet).
Posted Jan 24, 2023 20:47 UTC (Tue)
by geofft (subscriber, #59789)
[Link]
I don't dispute that there are some use cases in some industries where specific regulations in particular jurisdictions may prohibit this, but it's certainly not universal (and of course there are plenty of use cases that legitimately count as production that aren't in regulated industries).
Posted Jan 27, 2023 15:00 UTC (Fri)
by jwarnica (subscriber, #27492)
[Link]
If you need to do interactive debugging in, er, some highly constrained environment, then you have two bugs: one is that your system isn't verbose enough and doesn't have good enough logging and metrics, and the other is whatever the original bug is.
Posted Jan 25, 2023 18:13 UTC (Wed)
by ejr (subscriber, #51652)
[Link]
There are environments where having *any* access to the client's / outside data is forbidden.
There are environments where the deploy-ers can have access to anything to keep the deployed working.
And there are those at various levels between. The number of possibilities, each quite correct for its environment, argues to me against any OS taking "sides." I'm not sure the levels of access (MLS, etc.) are sufficiently well-defined for a general OS kernel to mediate that access.
Whether or not the kernel can be configured to an outside model unfortunately is not separate. Unless maybe to the extremes? And let another level / ring dictate access? I'm just a silly library / application developer.
Posted Jan 24, 2023 9:41 UTC (Tue)
by smcv (subscriber, #53363)
[Link]
Even if you don't want the ability to run tools like ps inside the container, increasingly much lower-level library functionality relies on having /proc mounted, mainly for /proc/self/fd, which is used to emulate fd-relative I/O (for example fexecve(3) was unimplementable without /proc mounted until execveat(2) was added to Linux), and is still necessary even on the latest kernels if you want to pass a fd-relative path to a subprocess that expects a filename as a command-line option.
Posted Jan 24, 2023 12:08 UTC (Tue)
by rcampos (subscriber, #59737)
[Link]
Posted Jan 24, 2023 12:12 UTC (Tue)
by judas_iscariote (guest, #47386)
[Link]
Posted Jan 23, 2023 17:53 UTC (Mon)
by jepler (subscriber, #105975)
[Link] (1 responses)
Posted Jan 23, 2023 18:50 UTC (Mon)
by gscrivano (subscriber, #74830)
[Link]
static const char *
proc_map_files_get_link(struct dentry *dentry,
			struct inode *inode,
			struct delayed_call *done)
{
	if (!checkpoint_restore_ns_capable(&init_user_ns))
		return ERR_PTR(-EPERM);
	return proc_pid_get_link(dentry, inode, done);
}
where checkpoint_restore_ns_capable is defined as:
static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
{
	return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
	       ns_capable(ns, CAP_SYS_ADMIN);
}
So you must either have CAP_SYS_ADMIN in the initial user namespace, or have CAP_CHECKPOINT_RESTORE in the user namespace.
Posted Jan 23, 2023 18:30 UTC (Mon)
by nomeata (subscriber, #16315)
[Link] (3 responses)
“Once runc exits, the attacker is able to reopen that file descriptor for write access”
Isn't that the fundamental problem? Why is it ok to turn a read-only file descriptor into a read-write file descriptor?
Posted Jan 23, 2023 18:53 UTC (Mon)
by floppus (guest, #137245)
[Link] (2 responses)
As the linked LWN article says, this is an issue specifically for "containers that run with access to the host root user ID (i.e. UID 0), which, sadly, covers most of the containers being run today." That's the real problem.
Posted Jan 24, 2023 12:11 UTC (Tue)
by rcampos (subscriber, #59737)
[Link] (1 responses)
But this is taking some years, and probably some more to be enabled by default. Furthermore, it will not completely eliminate the need to run _some_ things as root on the host.
Therefore, it is still very useful.
Posted Jan 31, 2023 5:31 UTC (Tue)
by donald.buczek (subscriber, #112892)
[Link]
Posted Jan 23, 2023 19:10 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
In principle, couldn't the kernel simply allow processes to call unlink(2) on proc files (which would make those files vanish from *that process's* view of procfs, but would not affect any other process's view of procfs)? This would presumably require some bookkeeping on the part of the kernel, but I can't imagine it would be all that onerous.
Posted Jan 24, 2023 12:13 UTC (Tue)
by rcampos (subscriber, #59737)
[Link] (4 responses)
Posted Jan 24, 2023 16:07 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
I dunno, maybe the parent has to resolve the symlink, for symmetry with the "real filesystem" case?
Posted Jan 24, 2023 19:20 UTC (Tue)
by rcampos (subscriber, #59737)
[Link] (2 responses)
Posted Jan 24, 2023 19:56 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Jan 25, 2023 0:27 UTC (Wed)
by rcampos (subscriber, #59737)
[Link]
This is how the CVE is exploited: you run your command so that it opens /proc/self/exe, and you get that pointing to the runc binary.
See this blog post for an example exploit of the CVE: https://kinvolk.io/blog/2019/02/runc-breakout-vulnerabili...
IMHO what seems buggy is that you can open /proc/self/exe and that is the runc binary. I don't know now if that is something we can avoid, though. Probably not, for some reason I don't remember now.
Posted Jan 23, 2023 20:41 UTC (Mon)
by developer122 (guest, #152928)
[Link] (7 responses)
Posted Jan 24, 2023 5:37 UTC (Tue)
by developer122 (guest, #152928)
[Link] (6 responses)
Posted Jan 24, 2023 10:25 UTC (Tue)
by matthias (subscriber, #94967)
[Link] (4 responses)
Usually it is opt-in to have any files inside the container, as the filesystem namespace for the container is explicitly created. The problem is that procfs includes the executables, via /proc/pid/exe, even if they should not be accessible from within the container. The solution is this prctl that hides the executables again. Not mounting /proc could be another solution. Unfortunately, many programs rely on /proc being available.
And the executable is not outside of the container. As soon as /proc is mounted, it is inside the container, in much the same way as any executable that is accessible through bind mounts. The executable should be outside of the container, but without this new prctl there was only an ugly workaround to achieve this when /proc has to be mounted.
Posted Jan 25, 2023 2:26 UTC (Wed)
by developer122 (guest, #152928)
[Link] (3 responses)
That executable was effectively linked into the container when runc was commanded to launch itself, with /proc/self/exe effectively acting like a convoluted symlink. The runc outside was told to run /proc/self/exe, which pointed to its own executable (presumably in the host's /bin), and upon launching it, the /proc/self/exe of the runc instance inside the container continued to point to the same location, all the way back to the runc binary on the host.
If runc will only launch executables that are already inside the container, no such reference can be brought inside.
Posted Jan 25, 2023 6:10 UTC (Wed)
by matthias (subscriber, #94967)
[Link] (2 responses)
And it is quite difficult to test whether someone is tricking you into executing /proc/self/exe. It is not just symlinks. Think of an executable starting with #!/proc/self/exe. The problem is that /proc/self/exe is there in the first place, not that runc can be tricked into executing it. This is just one exploit. There can be others.
Also running docker exec would probably always have a race window, where runc is available inside the container. A malicious process inside the container can just wait for any docker exec to happen.
Posted Jan 25, 2023 22:00 UTC (Wed)
by developer122 (guest, #152928)
[Link] (1 responses)
Fair enough, but the vulnerable executable is only visible to runc.
> It is not just symlinks. Think of an executable starting with #!/proc/self/exe.
Executing it in that way would almost certainly refer to the shell binary inside the container.
Posted Jan 25, 2023 22:37 UTC (Wed)
by matthias (subscriber, #94967)
[Link]
> Fair enough, but the vulnerable executable is only visible to runc.
Yes, indeed. runc sets the non-dumpable attribute which should prevent other processes from playing with /proc/pid/exe. However, this gets reset when runc exec()s itself. Therefore the self-exec in the exploits.
>> It is not just symlinks. Think of an executable starting with #!/proc/self/exe.
> Executing it in that way would almost certainly refer to the shell binary inside the container.
As a container does not need to have a shell, runc executes commands directly. And if runc does exec() an executable with the shebang inside, then the proc entry refers to the runc binary.
The key point is that all changes to runc to avoid being tricked into execing itself are just treating the symptoms. The underlying problem is the runc executable being part of the container, and this should just be avoided.
Posted Jan 31, 2023 0:32 UTC (Tue)
by vinipsmaker (guest, #126735)
[Link]
Posted Jan 24, 2023 0:15 UTC (Tue)
by josh (subscriber, #17465)
[Link] (4 responses)
Posted Jan 24, 2023 0:38 UTC (Tue)
by walters (subscriber, #7396)
[Link] (3 responses)
Posted Jan 24, 2023 1:54 UTC (Tue)
by josh (subscriber, #17465)
[Link] (2 responses)
Posted Jan 24, 2023 12:28 UTC (Tue)
by adobriyan (subscriber, #30858)
[Link] (1 responses)
Posted Jan 24, 2023 15:31 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]
Obviously the solution is to not write/release exploitable software in the first place, but we (as a species) seem incapable of doing that. Or at least unwilling to pay the price it would take to do that (longer launch runways, better code review, etc.).
Posted Jan 24, 2023 6:02 UTC (Tue)
by rambolized (guest, #160860)
[Link] (2 responses)
Posted Jan 24, 2023 10:16 UTC (Tue)
by matthias (subscriber, #94967)
[Link] (1 responses)
Posted Jan 24, 2023 12:44 UTC (Tue)
by rambolized (guest, #160860)
[Link]
Posted Jan 30, 2023 22:47 UTC (Mon)
by vinipsmaker (guest, #126735)
[Link]
Lately I've been developing my own sandboxing solution making use of Linux namespaces (for those interested: <https://emilua.gitlab.io/docs/api/0.4/tutorial/linux_name...>). However the use case is not containers. The use case is compartmentalised application development. Long story short, one should be able to make use of Linux namespaces to spawn actors (Lua VMs in my project) inside isolated processes (the goal is to bring a model closer to Capsicum and actor systems to Linux).
In this scenario, the cost of creating a new actor is cheap:
1. fork()'ing a process that was created near main() with very few allocations and fds open.
2. Initial setup for the new namespaces (e.g. mount() calls).
Containers will usually have extra steps (that I skip in my project):
3. Mount an image for a whole Linux mini-distro.
4. exec() into some binary.
However, the lack of an exec() call at the end means that I must be careful not to leak resources from the host, as exec() is the only call that flushes the address space. /proc/self/exe is one of the things I must be careful about.
The beauty of PR_SET_HIDE_SELF_EXE is that it will protect not only containers that exec() at the end; it can also be used in projects such as the one I'm developing, where no exec() call at the end ever happens. I really hope this patch gets merged.
It seems like a game of whack-a-mole to me.
Hide /proc?
The middle-ground is really annoying for anyone who has to live-debug something that was built by an OS-oblivious Dev somewhere who felt that no one would ever have to, say, run a ping command to validate network paths and was super-excited to show how they just saved 28K.
O_BENEATH
Good interface