kcmp() breaks loose
The kcmp() system call was added in 2012 to address a problem encountered by the checkpoint/restore in user space (CRIU) effort. The CRIU developers are working (with some success) toward the goal of being able to record the complete state of a set of processes to persistent storage, then restart those processes at some future time, possibly on a different machine. This would be challenging in the best of times, but the CRIU developers have taken on an additional handicap: doing the entire job from user space. Over the years attempts have been made to implement a kernel-based checkpoint mechanism, but none have even come close to being considered for merging. The user-space approach appears to be the only realistic way of solving the checkpoint/restore problem.
CRIU may be banished to user space, but the kernel community has still allowed the addition of features where needed to get the job done. For example, userfaultfd() helps in the migration of process memory, and various features of the clone() system call help with recreating processes that look the same as they did at checkpoint time. These helpers have made the checkpoint/restore job doable while still keeping most of the work out of the kernel.
One problem the CRIU developers encountered early on was determining whether two open file descriptors (possibly found in different processes) refer to the same open file within the kernel. Creating such file descriptors can be done with dup() or clone(); they can be spread across unrelated processes by sending SCM_RIGHTS datagrams. It was easy enough for CRIU to determine that two file descriptors refer to the same file by looking at the relevant entries in /proc; at restore time, that file can be opened again in both places to recreate the situation at checkpoint time.
If, however, two file descriptors refer to the same open file — if, in other words, they refer to the same file structure within the kernel — then replacing them with two independent descriptors at restore time may break the application. CRIU can do the right thing to restore these descriptors correctly, but only if it can detect that they are related at checkpoint time. That detection was not something that the kernel supported at the time.
Querying the provenance of file descriptors is, at its core, asking about the kernel's internal data structures; making that information available must be done with great care. One idea that was discussed early on was to have the kernel export the address of the file structure behind each descriptor; if two descriptors show the same address, then they are entangled and should be recreated in the same mode. But the kernel goes to some length to hide the addresses of its data structures to make attacks harder; this effort is not always successful, but it is deemed worth doing. So exposing addresses in this way is not something that will fly.
Instead, the developers finally added a system call to answer the actual question: are these two descriptors the same? That was kcmp():
int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2);
If type is KCMP_FILE, then the kernel will check whether file descriptor idx1 in the process whose ID is pid1 is the same as descriptor idx2 in pid2. There are a number of other resources that can be queried in the same way, but the question is always the same: are these two the same thing? This is a much safer question for the kernel to answer, but there are still restrictions; in particular, the calling process must have the privilege to use ptrace() on both of the target processes, and all processes involved must be in the same PID namespace.
Even with those restrictions, kcmp() made some people nervous. As a way of containing any possible damage, this system call was only built into kernels configured for checkpoint/restore functionality. If it was absent on most kernels, it could not be used to exploit those kernels, after all.
In the real world, though, the choices made by kernel developers about configuration options mean relatively little. Most users run kernels built by distributors, and distributors have an incentive to enable as many features as possible, even if relatively few users will need them. Most people will not complain about unneeded code in their kernels — code they probably do not even know is there — but they will definitely complain if some feature they need does not work. So, while checkpoint/restore users are relatively rare, distributors (Fedora and Ubuntu, for example) have enabled the feature for those who need it. That has made kcmp() widely available as well.
If you make a feature available, somebody will come along and use it, probably in some way you didn't anticipate. And so, it seems, the Mesa graphics library found a use for kcmp() that has nothing to do with checkpointing. At times, the library can find itself dealing with multiple file descriptors that refer to the same underlying DRM device; in that case, changes made through one descriptor will affect the other, probably in unsatisfying ways. To avoid this problem, Mesa checks (with kcmp()) to ensure that file descriptors are independent when needed.
That check will only work properly if kcmp() is actually available in the running kernel, though, and that is not the case on all distributions. Asking those distributors to enable the full checkpoint/restore functionality for kcmp() seems like overkill, so Chris Wilson instead submitted a patch to make kcmp() selectable independently. Describing the need for the patch, Daniel Vetter said:
Michel Dänzer, who implemented this functionality, defended the use of kcmp() and expressed surprise that it wasn't universally available. He asked what other solution he should have chosen, but got no answer. Kees Cook noted that kcmp() "is a really powerful syscall", but that its use is constrained and it's already widely available, "so it may be okay to expose this".
The first version of the patch enabled kcmp() by default, but that runs counter to normal practice even in the absence of any residual security concerns so, by the third version, the default was changed to "no". The system call will still be enabled, though, if either checkpoint/restore or graphics are enabled, meaning that it will be available on most kernels going forward. It would be fairly surprising if this patch were not merged for 5.12, and distributors may well backport it to older kernels as well.
Posted Feb 11, 2021 16:59 UTC (Thu) by rvolgers (guest, #63218)
https://man7.org/linux/man-pages/man2/kcmp.2.html
> The return value of a successful call to kcmp() is simply the
> result of arithmetic comparison of kernel pointers (when the
> kernel compares resources, it uses their memory addresses).
>
> The easiest way to explain is to consider an example. Suppose
> that v1 and v2 are the addresses of appropriate resources, then
> the return value is one of the following:
>
> 0 v1 is equal to v2; in other words, the two processes
> share the resource.
>
> 1 v1 is less than v2.
>
> 2 v1 is greater than v2.
>
> 3 v1 is not equal to v2, but ordering information is
> unavailable.
>
> On error, -1 is returned, and errno is set appropriately.
>
> kcmp() was designed to return values suitable for sorting. This
> is particularly handy if one needs to compare a large number of
> file descriptors.
In other words, it does not just test for equality, it establishes an ordering (to help reduce the number of kcmp calls). It's easy to see how gaining information about the layout of kernel objects is useful to attackers.
"Pointer obfuscation" is mentioned so I assume the values which are compared are no longer actual pointer values. Does anyone have more information on this?
Posted Feb 11, 2021 17:07 UTC (Thu) by mathstuf (subscriber, #69389)
Posted Feb 11, 2021 17:17 UTC (Thu) by abatters (✭ supporter ✭, #6932)
The obfuscation is done in two steps. First we xor the kernel pointer with a random value, which puts pointer into a new position in a reordered space. Secondly we multiply the xor production with a large odd random number to permute its bits even more (the odd multiplier guarantees that the product is unique ever after the high bits are truncated, since any odd number is relative prime to 2^n).
Posted Feb 12, 2021 13:56 UTC (Fri) by daenzer (subscriber, #7050)
In a follow-up, I suggested another possible solution: Make KCMP_FILE available regardless of CONFIG_CHECKPOINT_RESTORE, but restrict the rest of kcmp to that. But nobody seems to have picked up on it.
P.S. Finally made it into a full-blown LWN article, guess I can retire or at least switch careers now. :)
Posted Feb 12, 2021 16:28 UTC (Fri) by kleptog (subscriber, #1183)
Posted Feb 12, 2021 17:43 UTC (Fri) by nickodell (subscriber, #125165)
Posted Feb 12, 2021 18:35 UTC (Fri) by matthias (subscriber, #94967)
Posted Feb 12, 2021 18:40 UTC (Fri) by mathstuf (subscriber, #69389)
    int fd1 = open("/dev/null", O_RDONLY);
    int fd2 = open("/dev/null", O_RDONLY);
    int fd3 = dup(fd1);
Posted Feb 12, 2021 23:35 UTC (Fri) by NYKevin (subscriber, #129325)
* Whatever systemd/sysvinit/put-your-favorite-alternative-here has attached to the average daemon's stdin/stdout/stderr inside the container.
* /dev/null, as you say.
* /dev/urandom
* Certain files in /etc
* For forking servers, some kinds of sockets and/or pipes, including named fifos and Unix domain sockets.
* Log files and other /var crap.
* Probably half a dozen other things.
Posted Feb 14, 2021 4:43 UTC (Sun) by dullfire (guest, #111432)
Posted Feb 14, 2021 13:11 UTC (Sun) by mathstuf (subscriber, #69389)
Posted Feb 12, 2021 18:57 UTC (Fri) by cjwatson (subscriber, #7322)
Posted Feb 15, 2021 8:48 UTC (Mon) by tdz (subscriber, #58733)
Oh, wow. That functionality is actually available! I've long been looking for how to compare open file descriptions. I'd need this for transactional programming in userspace; similar to what the checkpoint/restore code does. With kcmp, a lot of ugly hacks can go away and the results should be much more accurate.