Slowing the flow of core-dump-related CVEs
Christian Brauner summed up his reasons for reworking the kernel's core-dump handling in 6.16 bluntly: "Because I'm a clown and also I had it with all the CVEs because we provide a **** API for userspace". The handling of core dumps has indeed been a constant source of vulnerabilities; with luck, the 6.16 work will result in rather fewer of them in the future.
The problem with core dumps
A core dump is an image of a process's data areas — everything except the executable text; it can be used to investigate the cause of a crash by examining a process's state at the time things went wrong. Once upon a time, Unix systems would routinely place a core dump into a file called core in the current working directory when a program crashed. The main effects of this practice were to inspire system administrators worldwide to remove core files daily via cron jobs, and to make it hazardous to use the name core for anything you wanted to keep. Linux systems can still create core files, but are usually configured not to.
An alternative that is used on some systems is to have the kernel launch a process to read the core dump from a crashing process and, presumably, do something useful with it. This behavior is configured by writing an appropriate string to the core_pattern sysctl knob. A number of distributors use this mechanism to set up core-dump handlers that phone home to report crashes so that the guilty programs can, hopefully, be fixed.
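As a concrete example, a pipe-style handler is configured by prefixing the pattern with "|"; a distribution might ship a sysctl fragment along these lines (the handler path here is made up, and the format specifiers shown are just a plausible selection):

    kernel.core_pattern=|/usr/lib/crash-handler/handler %p %u %g %s %t

The kernel then launches the named program for each crash, feeding the core dump to its standard input and substituting details such as the process ID, user ID, and signal number for the format specifiers.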
This is the "**** API
" referred to by Brauner; it indeed has a
number of problems. For example, the core-dump handler is launched by the
kernel as a user-mode helper, meaning that it runs fully privileged in the
root namespace. That, needless to say, makes it an attractive target for
attackers. There are also a number of race conditions that emerge from this
design that have led to vulnerabilities of their own.
See, for example, this recent Qualys advisory describing a vulnerability in Ubuntu's apport tool and the systemd-coredump utility, both of which are designed to process core dumps. In short, an attacker starts by running a setuid binary, then forcing it to crash at an opportune moment. While the core-dump handler is being launched (a step that the attacker can delay in various ways), the crashed process is killed outright with a SIGKILL signal, then quickly replaced by another process with the same process ID. The core-dump handler will then begin to examine the core dump from the crashed process, but with the information from the replacement process.
That process is running in its own attacker-crafted namespace, with some strategic environmental changes. In this environment, the core-dump handler's attempt to pass the core-dump socket to a helper can be intercepted; that allows said process to gain access to the file descriptor from which the core dump can be read. That, in turn, gives the attacker the ability to read the (original, privileged) process's memory, happily pillaging any secrets found there. The example given by Qualys obtains the contents of /etc/shadow, which is normally unreadable, but it seems that SSH servers (and the keys in their memory) are vulnerable to the same sort of attack.
Interested readers should consult the advisory for a much more detailed (and coherent) description of how this attack works, as well as information on some previous vulnerabilities in this area. The key takeaways, though, are that core-dump handlers on a number of widely used distributions are vulnerable to this attack, and that reusable integer IDs as a way to identify processes are just as much of a problem as the pidfd developers have been saying over the years.
Toward a better API
The solution to this kind of race condition is to give the core-dump handler a way to know that the process it is investigating is, indeed, the one that crashed. The 6.16 kernel contains two separate changes toward that goal. The first is this patch from Brauner adding a new format specifier ("%F") for the string written to core_pattern. This specifier will cause the core-dump handler to be launched with a pidfd identifying the crashed process installed as file descriptor number three. Since it is a pidfd, it will always refer to the intended process and cannot be fooled by process-ID reuse.
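A minimal sketch of what a handler launched this way might do with that descriptor (assuming a kernel where pidfds are backed by pidfs, so the inode number is a stable identifier; everything beyond that is elided):

    /* Sketch of a pipe-style handler launched via a core_pattern entry
     * that uses the new "%F" specifier: the core dump arrives on stdin
     * as usual, and file descriptor 3 is a pidfd referring to the
     * crashed process. Here it is only inspected before the handler
     * goes on to read the dump. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>
    
    #define CRASHED_PIDFD 3   /* installed by the kernel when %F is used */
    
    int main(void)
    {
        struct stat st;
    
        if (fstat(CRASHED_PIDFD, &st) < 0) {
            perror("fstat(pidfd)");
            return 1;
        }
        /* On pidfs-backed kernels, st.st_ino is a stable identifier for
         * the crashed process that cannot be confused by PID reuse. */
        fprintf(stderr, "handling core dump for pidfd inode %llu\n",
                (unsigned long long)st.st_ino);
    
        /* ... read the core dump from stdin and process it ... */
        return 0;
    }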
This change makes it relatively easy to adapt core-dump handlers to avoid the most recently identified vulnerabilities; it has already been backported to a recent set of stable kernels. But it does not change the basic nature of the core_pattern API, which still requires the launch of a new, fully privileged process to handle each crash. It is, instead, a workaround for one of the worst problems with that API.
The longer-term fix is this series from Brauner, which was also merged for 6.16. It adds a new syntax to core_pattern instructing the kernel to write core dumps to an existing socket; a user-space handler can bind to that socket and accept a new connection for each core dump that the kernel sends its way. The handler must be privileged to bind to the socket, but it remains an ordinary process rather than a kernel-created user-mode helper, and the process that actually reads core dumps requires no special privileges at all. So the core-dump handler can bind to the socket, then drop its privileges and sandbox itself, closing off a number of attack vectors.
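The listening side of such a handler could look roughly like the following sketch; the socket path used here is invented for illustration, and the corresponding core_pattern syntax is not shown:

    /* Rough sketch of a socket-based core-dump handler: bind and listen
     * on an AF_UNIX stream socket, then accept one connection per core
     * dump that the kernel sends. After binding, a real handler would
     * drop privileges and sandbox itself before the accept loop. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>
    
    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, "/run/crash-handler.sock",
                sizeof(addr.sun_path) - 1);
        unlink(addr.sun_path);   /* remove any stale socket from a previous run */
    
        int sock = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sock < 0 || bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(sock, 8) < 0) {
            perror("setting up core-dump socket");
            return 1;
        }
    
        /* ... drop privileges, set up the sandbox ... */
    
        for (;;) {
            int conn = accept(sock, NULL, NULL);
            if (conn < 0)
                continue;
    
            char buf[65536];
            ssize_t n;
            while ((n = read(conn, buf, sizeof(buf))) > 0)
                ;   /* a real handler would store or analyze the dump */
            close(conn);
        }
    }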
Once a new connection has been made, the handler can obtain a pidfd for the crashed process using the SO_PEERPIDFD request for getsockopt(). Once again, the pidfd will refer to the actual crashed process, rather than something an attacker might want the handler to treat like the crashed process. The handler can pass the new PIDFD_INFO_COREDUMP option to the PIDFD_GET_INFO ioctl() command to learn more about the crashed process, including whether the process is, indeed, having its core dumped. There are, in other words, a couple of layers of defense against the sort of substitution attack demonstrated by Qualys.
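Continuing the sketch, the per-connection check might look something like this; SO_PEERPIDFD and PIDFD_GET_INFO are existing interfaces, but the PIDFD_INFO_COREDUMP, coredump_mask, and PIDFD_COREDUMPED names below reflect one reading of the 6.16 changes and should be checked against the 6.16 <linux/pidfd.h>:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/pidfd.h>
    
    /* Returns 1 if the peer of 'conn' is a process whose core is being
     * dumped, 0 if not, and -1 on error. */
    static int peer_is_really_dumping(int conn)
    {
        int pidfd;
        socklen_t len = sizeof(pidfd);
    
        if (getsockopt(conn, SOL_SOCKET, SO_PEERPIDFD, &pidfd, &len) < 0)
            return -1;
    
        struct pidfd_info info;
        memset(&info, 0, sizeof(info));
        info.mask = PIDFD_INFO_COREDUMP;   /* request core-dump state */
    
        if (ioctl(pidfd, PIDFD_GET_INFO, &info) < 0) {
            close(pidfd);
            return -1;
        }
        close(pidfd);
    
        /* Field and flag names here follow the 6.16 uapi as I read it:
         * only trust the connection if the kernel says this process is
         * in fact having its core dumped. */
        return (info.mask & PIDFD_INFO_COREDUMP) &&
               (info.coredump_mask & PIDFD_COREDUMPED);
    }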
The end result is a system for handling core dumps that is more efficient (since there is no need to launch new helper processes each time) and which should be far more resistant to many types of attacks. It may take some time to roll out to deployed systems, since this change seems unlikely to be backported to the stable kernels (though distributors may well choose to backport it to their own kernels). But, eventually, this particular source of CVEs should become rather less productive than it traditionally has been.
Index entries for this article:
Kernel: Releases/6.16
Kernel: Security/Vulnerabilities
Posted Jun 6, 2025 15:11 UTC (Fri)
by epa (subscriber, #39769)
[Link] (33 responses)
Posted Jun 6, 2025 16:51 UTC (Fri)
by AClwn (subscriber, #131323)
[Link] (10 responses)
The obvious retort is that Linux doesn't randomize PIDs and it never will, so the only things you lose by extending PIDs to 64 bits are (1) a little bit of space wherever they're stored and (2) an entire class of PID-reuse security vulnerabilities, and that this is a pretty good tradeoff. I have nothing to say to that; I just wanted to mention PID randomization.
Posted Jun 7, 2025 5:45 UTC (Sat)
by epa (subscriber, #39769)
[Link] (1 responses)
Posted Jun 22, 2025 9:10 UTC (Sun)
by l0kod (subscriber, #111864)
[Link]
For more details, see https://git.kernel.org/torvalds/c/d9d2a68ed44bbae598a81cb...
Posted Jun 8, 2025 9:20 UTC (Sun)
by jreiser (subscriber, #11027)
[Link] (6 responses)
Um, no. A Linear Feedback Shift Register (LFSR) that is based on an irreducible polynomial guarantees uniqueness over its entire period, which is near to 2**N. Just initialize it to a random point in its sequence.
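(For the curious, a minimal sketch of that idea in C; strictly, a primitive polynomial is needed for the near-2^N period, and the tap set used here, 64, 63, 61, 60, is a commonly cited maximal-length choice:)

    /* 64-bit Galois LFSR: with a primitive feedback polynomial it steps
     * through every non-zero 64-bit value exactly once before repeating.
     * Seed it once with a random non-zero value; each step then yields a
     * value not seen before within the period. */
    #include <stdint.h>
    
    static uint64_t lfsr_state;   /* must be initialized to a non-zero random value */
    
    uint64_t lfsr_next_id(void)
    {
        uint64_t lsb = lfsr_state & 1;
    
        lfsr_state >>= 1;
        if (lsb)
            lfsr_state ^= 0xD800000000000000ULL;   /* taps 64, 63, 61, 60 */
        return lfsr_state;
    }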
Posted Jun 8, 2025 10:03 UTC (Sun)
by dezgeg (subscriber, #92243)
[Link] (5 responses)
Posted Jun 8, 2025 18:11 UTC (Sun)
by bmenrigh (subscriber, #63018)
[Link] (4 responses)
Posted Jun 9, 2025 12:00 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (3 responses)
Posted Jun 12, 2025 8:48 UTC (Thu)
by donald.buczek (subscriber, #112892)
[Link] (2 responses)
Posted Jun 12, 2025 11:15 UTC (Thu)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Jun 14, 2025 9:45 UTC (Sat)
by donald.buczek (subscriber, #112892)
[Link]
Note that, although pidfd_open(2) says opening a "/proc/[PID]" directory would be an alternative way to get a PID file descriptor, this is only half true: you can use such a file descriptor with pidfd_* calls, but it is another type of file descriptor, with f_type == PROC_SUPER_MAGIC (0x9fa0), and you can't use the inode number from that kind of file descriptor as a unique process identifier.
I still wish processes had UUIDs.
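(A small illustration of that f_type check, assuming fstatfs() and the PROC_SUPER_MAGIC constant from <linux/magic.h>:)

    /* Tell the two kinds of "pid file descriptors" apart: a descriptor
     * obtained by opening /proc/<pid> reports PROC_SUPER_MAGIC (0x9fa0),
     * one from pidfd_open() does not, and only the latter has a stable,
     * unique inode number. */
    #include <stdbool.h>
    #include <sys/vfs.h>
    #include <linux/magic.h>
    
    static bool fd_is_proc_dir(int fd)
    {
        struct statfs fs;
    
        if (fstatfs(fd, &fs) != 0)
            return false;   /* caller should handle the error separately */
        return fs.f_type == PROC_SUPER_MAGIC;
    }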
Posted Jun 9, 2025 16:37 UTC (Mon)
by dsfch (subscriber, #176007)
[Link]
Posted Jun 6, 2025 16:52 UTC (Fri)
by bluca (subscriber, #118303)
[Link]
Posted Jun 6, 2025 17:09 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Posted Jun 6, 2025 20:42 UTC (Fri)
by warrax (subscriber, #103205)
[Link] (4 responses)
Posted Jun 6, 2025 20:43 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Long IDs make that harder.
Posted Jun 7, 2025 5:42 UTC (Sat)
by epa (subscriber, #39769)
[Link] (2 responses)
Even in ordinary command line use like “see a process id in top and then kill it” there is a race condition and some danger if pids are not unique.
Posted Jun 7, 2025 5:53 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link]
That's true, but in practice infrequent, outside of deliberate attacks.
Posted Jun 7, 2025 6:51 UTC (Sat)
by iabervon (subscriber, #722)
[Link]
Posted Jun 6, 2025 17:38 UTC (Fri)
by Nahor (subscriber, #51583)
[Link] (7 responses)
What do you give to a process that still expects a 32-bit pid when the value does not fit?
With pidfs, you solve the problem for modern applications *now*, while keeping backward compatibility for older code forever (or until we choose to remove support for 32-bit PIDs).
> surely if time_t could become 64 bit, we can do the same for pid_t.
You do know how painful that switch was (and still is, since not every code base has been updated yet), right?
> they will never be used in the long tail of shell scripts and old code
If they won't be updated to pidfs, why do you believe they will be updated for 64-bit pids?
Posted Jun 7, 2025 2:48 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
> What do you give to a process that still expects a 32-bit pid when the value does not fit?
Under the assumption that we're migrating individual PID namespaces rather than a system-wide setting, if a process is in a PID namespace that uses 64-bit PIDs, it should have been migrated to the new API already (or else userspace should not have enabled 64-bit PIDs for this namespace). If it nevertheless asks for a 32-bit PID by calling into the old 32-bit interface, then it gets -ENOSYS or some equivalent, and probably crashes.
Posted Jun 7, 2025 19:29 UTC (Sat)
by Nahor (subscriber, #51583)
[Link] (2 responses)
That looks like a big big can of worms.
How do non-updated apps and updated ones mix? Say an updated shell trying to start an old app or vice-versa?
What/who decides what namespace to use? The kernel? The shell/launcher? The user? How does it/he/she know what namespace to use?
Namespaces work well if a whole ecosystem can be independent from everything else with respect to that namespace. They also work because only the values change, not the types; the binaries are the same (i.e. a shell in one namespace works just as well in another, it will just print different values for pids, or see different files, ...).
Posted Jun 8, 2025 3:16 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
The answers to most of your more specific questions can be summarized as "the distro can do what it sees fit, and if it chooses to do nothing, then it continues to use 32-bit PIDs for everything indefinitely."
Posted Jun 8, 2025 3:20 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Posted Jun 7, 2025 5:38 UTC (Sat)
by epa (subscriber, #39769)
[Link] (2 responses)
Rewriting a C program to use pidfds is a much bigger task, and rewriting a shell script with them is essentially impossible.
Scripting languages like Perl, Python, and Tcl would usually just need the interpreter itself recompiled for 64-bit pids and existing scripts will work unchanged.
Posted Jun 7, 2025 18:58 UTC (Sat)
by Nahor (subscriber, #51583)
[Link]
Most won't, but a script can still make assumptions about the pid size, e.g. that it contains at most 10 digits.
> A C program using pid_t may just need a recompile
Keyword "may".
And in the non-simple cases, the issues are the same for 64-bits pids and pidfds (apps using pid in a "smart" way will be majorly broken, pid64/pidfd cannot be passed as-is when communicating a 32-bit pid apps, ...).
> I imagine a compiler warning can mostly catch this
Only in simple cases, maybe. And AFAIK, currently, compilers will not complain when storing an int64_t in an int32_t without the "-Wconversion" flag (which is not enabled even when using "-Wall -Wextra -pedantic"). And even "-Wconversion" will not complain if there is a cast involved. https://godbolt.org/z/z3jGbYb86
> there will be some code putting a process id into an int
Or putting it in the low bits of an int64_t then use the high bits for something else.
Basically, one can look at what happened during the transition from 32-bit to 64-bit platforms, the switch to large files (>4GB), and the Y2038 problem, to see all the possible issues that can arise.
> Rewriting a C program to use pidfds is a much bigger task
I'm not so sure. Since a pidfd is just an int (the same underlying type as pid_t), and depending on what pid/pidfd are used for, updating could boil down to calling "pidfd_xyz()" instead of "xyz()", or passing a "XYZ_PIDFD" flag.
For the rest, that can be a big task to fix in either case. For instance, if the problem is someone combining the pid with something else in an int64_t, then a pidfd will still work fine, while the pid64 will need a redesign.
Posted Jul 4, 2025 13:51 UTC (Fri)
by judas_iscariote (guest, #47386)
[Link]
You are assuming a carefully written program... there is still code out there that assumes pids are 16 bits and stores them in a ushort. There is incorrect casting, there is code not using pid_t at all... I mean, there is a lot of buggy software out there...
Posted Jun 7, 2025 19:05 UTC (Sat)
by donald.buczek (subscriber, #112892)
[Link] (6 responses)
Posted Jun 7, 2025 19:37 UTC (Sat)
by snajpa (subscriber, #73467)
[Link] (5 responses)
Posted Jun 9, 2025 12:14 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (4 responses)
It very much doesn't, so much so that relying on that combination for uniqueness has caused several CVEs in the past. The start time is not granular enough, and attackers are able to cause a PID + start time clash at their leisure. This is why pidfds exist, and we use them when we need to uniquely identify processes for any security-relevant reason (and, increasingly, for non-security-relevant ones too).
Posted Jun 9, 2025 20:10 UTC (Mon)
by snajpa (subscriber, #73467)
[Link] (3 responses)
Besides, how are you going to use pidfds in this specific case you are replying to? Much confidence in your reply, let's see if you can back that confidence up with something.
Posted Jun 9, 2025 20:26 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (2 responses)
The combination of pidfd inode id plus boot uuid can uniquely identify a process across machines/reboots/everything, so it is suitable for that use case.
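(A sketch of how such an identifier could be assembled, assuming a kernel with pidfd_open() and pidfs-backed pidfds; the boot UUID is read from /proc/sys/kernel/random/boot_id:)

    /* Pair the pidfd's inode number (unique per process on pidfs-capable
     * kernels) with the boot UUID, so the pair stays unique across
     * reboots and machines. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    
    int print_process_identity(pid_t pid)
    {
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0)
            return -1;
    
        struct stat st;
        if (fstat(pidfd, &st) < 0) {
            close(pidfd);
            return -1;
        }
    
        char boot_id[64] = "";
        FILE *f = fopen("/proc/sys/kernel/random/boot_id", "r");
        if (f) {
            if (fgets(boot_id, sizeof(boot_id), f))
                boot_id[strcspn(boot_id, "\n")] = '\0';
            fclose(f);
        }
    
        printf("process identity: boot %s, pidfd inode %llu\n",
               boot_id, (unsigned long long)st.st_ino);
        close(pidfd);
        return 0;
    }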
Posted Jun 10, 2025 16:11 UTC (Tue)
by snajpa (subscriber, #73467)
[Link] (1 responses)
Posted Jun 10, 2025 16:13 UTC (Tue)
by bluca (subscriber, #118303)
[Link]
Posted Jun 6, 2025 16:27 UTC (Fri)
by Nahor (subscriber, #51583)
[Link] (11 responses)
Having a process running all the time (and thus using RAM) for the (hopefully) rare crash is not what I would call a "more efficient" use of resources.
Either way is probably efficient (a helper in stand-by should be using very little resources, while starting a new process is likely very cheap compared to handling a core dump) but I don't know which would be "more efficient", much less if it's significantly so (*).
(*) and if it is, it's probably context dependent too, i.e. if one is memory-bound or CPU+IO bound
Posted Jun 6, 2025 16:51 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (10 responses)
Posted Jun 6, 2025 17:37 UTC (Fri)
by Nahor (subscriber, #51583)
[Link] (9 responses)
The only case I can imagine where the new method would be more efficient is if one already has a crash handler daemon, and uses a core_pattern helper to pass the data from the kernel to the daemon. In this case the helper can now be skipped and the core dump can go directly from the kernel to the daemon. But I doubt this is common usage, if it even exists anywhere, since it would combine the worst of both methods.
Posted Jun 6, 2025 17:54 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (6 responses)
It is, as there's one fewer process. It's already socket activated: the umh only receives the core, it doesn't do any analysis, as that is dangerous. The umh forwards the core to a different socket-activated process, run at minimum privilege, which does the analysis.
> But I doubt this is common usage, if it even exists anywhere, since it would combine the worst of both methods.
It's the most common usage (whether via apport or systemd-coredump or something else); as the article says, just writing files around from the kernel is really bad, and only legacy (or manual) setups do that.
Posted Jun 6, 2025 19:25 UTC (Fri)
by Nahor (subscriber, #51583)
[Link] (5 responses)
If your crash manager already has a persistent process, as systemd does, then yes, you become more efficient. Having a persistent process was a sunk cost for them, since systemd already has one for monitoring services. But the gain in efficiency for them comes from their implementation choices and from the kernel API now matching their usage better; it does not come from a more efficient API. I'm arguing the latter, you're arguing the former.
Posted Jun 6, 2025 20:49 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (4 responses)
It doesn't
Posted Jun 7, 2025 0:41 UTC (Sat)
by Nahor (subscriber, #51583)
[Link] (3 responses)
Uh? Systemd has a persistent process, it's called "systemd" daemon, aka "init", aka PID 1. What do you think monitors the various systemd units, including sockets?
Posted Jun 7, 2025 11:46 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (2 responses)
It's not an _extra_ one as you implied. There's no extra cost, it's already there for other purposes.
Posted Jun 9, 2025 10:04 UTC (Mon)
by paulj (subscriber, #341)
[Link] (1 responses)
Worth it, for me I'd say yes, and I doubt anyone could notice that 1 additional feature, but you can't dismiss the argument others have on the basis there are no extra resources used.
Posted Jun 9, 2025 10:48 UTC (Mon)
by bluca (subscriber, #118303)
[Link]
With this feature it's now kernel -> socket, instead of kernel -> usermode helper -> socket.
Posted Jun 7, 2025 8:41 UTC (Sat)
by james (subscriber, #1325)
[Link] (1 responses)
Posted Jun 7, 2025 15:23 UTC (Sat)
by Nahor (subscriber, #51583)
[Link]
Probably not, but the article does assert the new API is "more efficient (since there is no need to launch new helper processes each time)". I question that (at least a generalization, since it's true for the particular case of systemd, because systemd-coredump's main part is already socket-based, so now it can just drop the core_pattern helper part).
Posted Jun 7, 2025 9:51 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Jun 7, 2025 16:31 UTC (Sat)
by jwadams (subscriber, #123485)
[Link] (1 responses)
Posted Jun 7, 2025 23:30 UTC (Sat)
by quotemstr (subscriber, #45331)
[Link]
Yet it's done. Every major Android app uses some kind of in-process signal-based crash reporting system and it works fine. You barely have to do anything in an async-signal-safe way: just use direct system calls like in https://github.com/linux-on-ibm-z/linux-syscall-support/b..., vfork, and execve a crash handler that ptraces its parent and runs normal signal-unsafe code at its leisure to dump the crashing process --- probably to a format better than traditional coredumps, one like minidump.
Speaking of minidumps: Windows crash dumps are all-userspace and work just fine.
At the most, I'd approve of the Linux kernel having a mechanism to signal some kind of registered crash daemon over IPC when another process crashes. This way, the crashing process doesn't need a signal handler or any async-signal-safe code at all. Linux should just delete all the code that produces actual core dumps and delegate the dirty work to userspace.
> possibly racing with other signal handlers
Safely sharing signal handlers is a problem all its own. Besides: futex works fine even in async-signal-safe code.
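(To make the shape of that approach concrete, a rough sketch; the /usr/lib/mydumper helper path is hypothetical, snprintf() is a shortcut where a real handler would use only async-signal-safe formatting, and a production version would use sigaction() with an alternate stack:)

    /* The crashing process catches the fatal signal, vforks and execs an
     * external helper; the helper can ptrace-attach to the parent and
     * write out a dump while the parent waits, then the signal is
     * re-raised so the process still dies with its original disposition. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>
    
    static void crash_handler(int sig)
    {
        char pid_str[16];
        snprintf(pid_str, sizeof(pid_str), "%d", (int)getpid());
    
        pid_t child = vfork();
        if (child == 0) {
            execl("/usr/lib/mydumper", "mydumper", pid_str, (char *)NULL);
            _exit(127);                /* exec failed */
        } else if (child > 0) {
            waitpid(child, NULL, 0);   /* let the dumper finish */
        }
    
        signal(sig, SIG_DFL);
        raise(sig);
    }
    
    int main(void)
    {
        signal(SIGSEGV, crash_handler);
        signal(SIGABRT, crash_handler);
        /* ... the rest of the program ... */
        return 0;
    }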
Process ids again
> then quickly replaced by another process with the same process ID
Argh. Why don't we move to 64-bit process ids, and guarantee that they are not reused except after a reboot?
There are some fields expecting a smaller value, but surely if time_t could become 64 bit, we can do the same for pid_t.
As it stands pretty much any use of process ids has this race condition. A lot of effort has gone into pidfds, but they will never be used in the long tail of shell scripts and old code.
Unique randomized wide PIDs
- They are unique during the lifetime of the running system thanks to the 64-bit values: at worst, 2^60 - 2*2^32 useful IDs.
- They are always greater than 2^32 and must therefore be stored in 64-bit integer types.
- The initial ID (at boot time) is randomly picked between 2^32 and 2^33, which limits collisions in logs across different boots.
- IDs are sequential, which enables users to order them.
- IDs may not be consecutive but increase with a random 2^4 step, which limits side channels.
Unique randomized wide PIDs
However, I wonder how userspace can easily determine whether a pidfd inode number comes from a system that guarantees uniqueness.
And if we provide a 64-bit API while keeping the 32-bit values for a while for backward compatibility, how long should we wait before switching? ...And in the interim, the problem persists, even for applications that did update.
Remember that this is not just about updating the kernel API, but also updating code wherever pids are used (internally in applications, or externally, i.e. storage, network, ...)
Or do you expect the user to use a different launcher/shell, choosing which to interact with depending on what type of apps they use? And to have different binaries with different pid sizes for apps used in both (shell, launcher, UI, ssh, ...)?
How do apps communicate pids with each other if they are not in the same namespace? Say someone uses an updated "top" command and thus gets 64-bit pids, then tries to use the shell's builtin "kill" command, which still expects a 32-bit pid?
And even for the simple cases, this depends a lot on how 64-bits pids would be implemented, e.g. would this be a compilation flag? Or a "#define USE_PID64"? Or would this be changing all the "pid_t"/"getpid()"/... to "pid64_t"/"getpid64()"/...?
Or assume that a struct containing a pid has a specific size. Or that fields in that struct after the pid are at specific offsets.
Or ... (don't underestimate what people do when they assume something will be true forever)...
Or people might make the same assumption that you did, that pid64 is just a recompilation, then spend a lot of time tracking down bugs. While with pidfs, they would spend time looking at each call site first, fixing the problems before they arise and need tracking.
Which one takes more time will depend on the application. Sometimes it's faster to think things through first, other times it's faster to just try and fix. That very much applies here IMHO.
More efficient?
Now we can remove the middleman.
More efficient?
> but it then needs to start a new process to handle a crash, like core_pattern did, so it's no more efficient than before.
Is this really the sort of efficiency we should be optimising for? I mean, by definition we've just had a program crash: this should not be a common occurrence!