kcmp() breaks loose

By Jonathan Corbet
February 11, 2021

Given the large set of system calls implemented by the Linux kernel, it would not be surprising for most people to be unfamiliar with a few of them. Not everybody needs to know the details of setresgid(), modify_ldt(), or lookup_dcookie(), after all. But even developers who have a wide understanding of the Linux system-call set may be surprised by kcmp(), which is not enabled by default in the kernel build. It would seem, though, that the word has gotten out, leading to an effort to make kcmp() more widely available.

The kcmp() system call was added in 2012 to address a problem encountered by the checkpoint/restore in user space (CRIU) effort. The CRIU developers are working (with some success) toward the goal of being able to record the complete state of a set of processes to persistent storage, then restart those processes at some future time, possibly on a different machine. This would be challenging in the best of times, but the CRIU developers have taken on an additional handicap: doing the entire job from user space. Over the years attempts have been made to implement a kernel-based checkpoint mechanism, but none have even come close to being considered for merging. The user-space approach appears to be the only realistic way of solving the checkpoint/restore problem.

CRIU may be banished to user space, but the kernel community has still allowed the addition of features where needed to get the job done. For example, userfaultfd() helps in the migration of process memory, and various features of the clone() system call help with recreating processes that look the same as they did at checkpoint time. These helpers have made the checkpoint/restore job doable while still keeping most of the work out of the kernel.

One problem the CRIU developers encountered early on was determining whether two open file descriptors (possibly found in different processes) refer to the same open file within the kernel. Such descriptors can be created with dup() or clone(); they can be spread across unrelated processes by sending SCM_RIGHTS datagrams. It was easy enough for CRIU to determine that two file descriptors refer to the same file on disk by looking at the relevant entries in /proc; at restore time, that file can be opened again in both places to recreate the situation at checkpoint time.

If, however, two file descriptors refer to the same open file — if, in other words, they refer to the same file structure within the kernel — then replacing them with two independent descriptors at restore time may break the application. CRIU can do the right thing to restore these descriptors correctly, but only if it can detect that they are related at checkpoint time. That detection was not something that the kernel supported at the time.
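The difference between the two cases can be made visible from user space. The following is a minimal sketch, not anything CRIU does: the helper name dup_shares_offset() is hypothetical, and it relies only on the documented behavior that descriptors created with dup() share a file offset while separately opened ones do not.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens `path` twice and dup()s the first descriptor, then seeks on the
 * first descriptor only. Returns 1 if the dup()ed descriptor followed the
 * seek while the independently opened one stayed at offset 0, 0 if not,
 * and -1 if a descriptor could not be set up. `path` should name a
 * regular, readable file.
 */
int dup_shares_offset(const char *path)
{
    assert(path != NULL);

    int fd1 = open(path, O_RDONLY);            /* first open file */
    int fd2 = open(path, O_RDONLY);            /* second, independent open file */
    int fd3 = (fd1 >= 0) ? dup(fd1) : -1;      /* shares fd1's open file */
    int result = -1;

    if (fd1 >= 0 && fd2 >= 0 && fd3 >= 0 &&
        lseek(fd1, 1, SEEK_SET) == 1) {
        off_t via_dup = lseek(fd3, 0, SEEK_CUR);   /* moved along with fd1 */
        off_t via_open = lseek(fd2, 0, SEEK_CUR);  /* unaffected */
        result = (via_dup == 1 && via_open == 0);
    }

    if (fd3 >= 0) close(fd3);
    if (fd2 >= 0) close(fd2);
    if (fd1 >= 0) close(fd1);
    return result;
}
```

Calling dup_shares_offset("/bin/sh") should return 1 on a typical Linux system. An application that depends on this sharing would misbehave if a restore handed it two independent descriptors instead.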

Querying the provenance of file descriptors is, at its core, asking about the kernel's internal data structures; making that information available must be done with great care. One idea that was discussed early on was to have the kernel export the address of the file structure behind each descriptor; if two descriptors show the same address, then they are entangled and should be recreated in the same mode. But the kernel goes to some lengths to hide the addresses of its data structures to make attacks harder; this effort is not always successful, but it is deemed worth doing. So exposing addresses in this way is not something that will fly.

Instead, the developers finally added a system call to answer the actual question: are these two descriptors the same? That was kcmp():

    int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
             unsigned long idx2);

If type is KCMP_FILE, then the kernel will check whether file descriptor idx1 in the process whose ID is pid1 is the same as descriptor idx2 in pid2. There are a number of other resources that can be queried in the same way, but the question is always the same: are these two the same thing? This is a much safer question for the kernel to answer, but there are still restrictions; in particular, the calling process must have the privilege to use ptrace() on both of the target processes, and all processes involved must be in the same PID namespace.
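As an illustration, a KCMP_FILE query on two descriptors of the calling process might look like the sketch below. It assumes a Linux system with the kernel headers installed; since glibc provides no kcmp() wrapper, the call goes through syscall(). The helper names kcmp_raw() and same_open_file() are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/kcmp.h>   /* KCMP_FILE */
#include <sys/syscall.h>  /* SYS_kcmp */
#include <sys/types.h>
#include <unistd.h>       /* syscall(), getpid() */

/* Thin wrapper around the raw system call (hypothetical helper). */
static int kcmp_raw(pid_t pid1, pid_t pid2, int type,
                    unsigned long idx1, unsigned long idx2)
{
    return (int)syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}

/*
 * Returns 1 if fd1 and fd2 in the calling process refer to the same open
 * file, 0 if they do not, and -1 (with errno set) if kcmp() failed, for
 * example with ENOSYS on a kernel built without it.
 */
int same_open_file(int fd1, int fd2)
{
    assert(fd1 >= 0 && fd2 >= 0);

    pid_t self = getpid();
    int r = kcmp_raw(self, self, KCMP_FILE,
                     (unsigned long)fd1, (unsigned long)fd2);
    if (r < 0)
        return -1;
    return r == 0;   /* any positive result means "different" */
}
```

A descriptor and its dup() should compare as the same, while the two ends of a pipe should not. Comparing descriptors in other processes works the same way with different PIDs, subject to the ptrace() and PID-namespace restrictions described above.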

Even with those restrictions, kcmp() made some people nervous. As a way of containing any possible damage, this system call was only built into kernels configured for checkpoint/restore functionality. If it was absent on most kernels, it could not be used to exploit those kernels, after all.

In the real world, though, the choices made by kernel developers about configuration options mean relatively little. Most users run kernels built by distributors, and distributors have an incentive to enable as many features as possible, even if relatively few users will need them. Most people will not complain about unneeded code in their kernels — code they probably do not even know is there — but they will definitely complain if some feature they need does not work. So, while checkpoint/restore users are relatively rare, distributors (Fedora and Ubuntu, for example) have enabled the feature for those who need it. That has made kcmp() widely available as well.

If you make a feature available, somebody will come along and use it, probably in some way you didn't anticipate. And so, it seems, the Mesa graphics library found a use for kcmp() that has nothing to do with checkpointing. At times, the library can find itself dealing with multiple file descriptors referring to the same underlying DRM devices; in this case, making changes to one will affect the other, probably in unsatisfying ways. To avoid this problem, Mesa checks (with kcmp()) to ensure that file descriptors are independent when needed.

That check will only work properly if kcmp() is actually available in the running kernel, though, and that is not the case on all distributions. Asking those distributors to enable the full checkpoint/restore functionality for kcmp() seems like overkill, so Chris Wilson instead submitted a patch to make kcmp() selectable independently. Describing the need for the patch, Daniel Vetter said:
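A program that wants to cope with kernels lacking kcmp() could probe for it at run time. The sketch below is one way to do that, under the assumption that a missing system call is reported as ENOSYS; it is not taken from Mesa, and kcmp_available() is a hypothetical name. Other failures, such as EBADF when descriptor 0 is closed, still show that the call exists.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/kcmp.h>   /* KCMP_FILE */
#include <sys/syscall.h>  /* SYS_kcmp */
#include <unistd.h>       /* syscall(), getpid() */

/*
 * Returns 0 if the kcmp() system call fails with ENOSYS (not built into
 * the running kernel, or reported that way by a sandbox), 1 otherwise.
 */
int kcmp_available(void)
{
    pid_t self = getpid();

    errno = 0;
    /* Compare descriptor 0 with itself; only the error code matters. */
    long r = syscall(SYS_kcmp, self, self, KCMP_FILE, 0UL, 0UL);
    assert(r >= -1);   /* kcmp() returns 0..3 on success, -1 on error */

    return !(r == -1 && errno == ENOSYS);
}
```

A seccomp filter may also reject the call with a different error such as EPERM, which this probe does not distinguish from success.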

> It was maybe stupid, but our userspace started relying on fd comparison through sys_kcomp. So for better or worse, if you want to run the mesa3d gl/vk stacks, you need this. Was maybe not the brightest idea, but since enough distros had this enabled by defaults, it wasn't really discovered, and now we're shipping this everywhere.

Michel Dänzer, who implemented this functionality, defended the use of kcmp() and expressed surprise that it wasn't universally available. He asked what other solution he should have chosen, but got no answer. Kees Cook noted that kcmp() "is a really powerful syscall", but that its use is constrained and it's already widely available, "so it may be okay to expose this".

The first version of the patch enabled kcmp() by default, but that runs counter to normal practice, even in the absence of any residual security concerns; by the third version, the default had been changed to "no". The system call will still be enabled, though, if either checkpoint/restore or graphics support is enabled, meaning that it will be available on most kernels going forward. It would be fairly surprising if this patch were not merged for 5.12, and distributors may well backport it to older kernels as well.


kcmp() breaks loose

Posted Feb 11, 2021 16:59 UTC (Thu) by rvolgers (guest, #63218) [Link] (11 responses)

I was wondering why people sounded so scared of this syscall, so I looked it up:

https://man7.org/linux/man-pages/man2/kcmp.2.html

> The return value of a successful call to kcmp() is simply the
> result of arithmetic comparison of kernel pointers (when the
> kernel compares resources, it uses their memory addresses).
>
> The easiest way to explain is to consider an example. Suppose
> that v1 and v2 are the addresses of appropriate resources, then
> the return value is one of the following:
>
> 0 v1 is equal to v2; in other words, the two processes
> share the resource.
>
> 1 v1 is less than v2.
>
> 2 v1 is greater than v2.
>
> 3 v1 is not equal to v2, but ordering information is
> unavailable.
>
> On error, -1 is returned, and errno is set appropriately.
>
> kcmp() was designed to return values suitable for sorting. This
> is particularly handy if one needs to compare a large number of
> file descriptors.

In other words, it does not just test for equality, it establishes an ordering (to help reduce the number of kcmp calls). It's easy to see how gaining information about the layout of kernel objects is useful to attackers.

"Pointer obfuscation" is mentioned so I assume the values which are compared are no longer actual pointer values. Does anyone have more information on this?

kcmp() breaks loose

Posted Feb 11, 2021 17:07 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (1 responses)

I suppose the kernel could have a random mask (generated at boot) that it XORs with each pointer before comparison. That should preserve *an* order, but not the actual in-memory layout order, no?

kcmp() breaks loose

Posted Feb 11, 2021 17:17 UTC (Thu) by abatters (✭ supporter ✭, #6932) [Link]

kernel/kcmp.c says how it is done:

The obfuscation is done in two steps. First we xor the kernel pointer with a random value, which puts pointer into a new position in a reordered space. Secondly we multiply the xor production with a large odd random number to permute its bits even more (the odd multiplier guarantees that the product is unique ever after the high bits are truncated, since any odd number is relative prime to 2^n).
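The two steps quoted above can be sketched in user-space C as follows. This is an illustration under assumptions rather than the kernel's code: the struct and function names are invented, the key values would come from a random source at boot, and unsigned 64-bit arithmetic stands in for the kernel's native word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-boot secrets; both would be chosen at random. */
struct obfuscation_key {
    uint64_t xor_mask;
    uint64_t multiplier;   /* must be odd */
};

/*
 * Step 1: XOR with a random mask. Step 2: multiply by a random odd
 * number, letting the result wrap modulo 2^64. An odd multiplier is
 * invertible modulo 2^64, so distinct inputs give distinct outputs.
 */
uint64_t obfuscate(uint64_t ptr_value, const struct obfuscation_key *key)
{
    assert(key->multiplier & 1);
    return (ptr_value ^ key->xor_mask) * key->multiplier;
}

/* kcmp()-style result: 0 for equal, 1 for less than, 2 for greater than. */
int compare_obfuscated(uint64_t a, uint64_t b,
                       const struct obfuscation_key *key)
{
    uint64_t oa = obfuscate(a, key);
    uint64_t ob = obfuscate(b, key);

    if (oa == ob)
        return 0;
    return oa < ob ? 1 : 2;
}
```

The resulting order is consistent, so it can be used for sorting, but it does not follow the order of the original addresses.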

kcmp() breaks loose

Posted Feb 12, 2021 13:56 UTC (Fri) by daenzer (subscriber, #7050) [Link] (8 responses)

FWIW, Mesa only needs KCMP_FILE, and doesn't care about the difference between positive return values. All it needs to know is whether or not two file descriptors reference the same struct file in the kernel. That's what "this functionality" refers to in my post.

In a follow-up, I suggested another possible solution: Make KCMP_FILE available regardless of CONFIG_CHECKPOINT_RESTORE, but restrict the rest of kcmp to that. But nobody seems to have picked up on it.

P.S. Finally made it into a full-blown LWN article, guess I can retire or at least switch careers now. :)

kcmp() breaks loose

Posted Feb 12, 2021 16:28 UTC (Fri) by kleptog (subscriber, #1183) [Link] (7 responses)

The extra return values are needed for scale if you have lots of FDs. With the extra return values the algorithm for comparing everything to everything else goes from O(n^2) to O(n log n). Given the obfuscation described above I don't see a problem returning the extra info. I can see value in tools like lsof being able to tell if files are the same, and they need to work with lots of FDs.

kcmp() breaks loose

Posted Feb 12, 2021 17:43 UTC (Fri) by nickodell (subscriber, #125165) [Link] (6 responses)

But you only need to perform a kcmp check if the two file descriptors refer to the same file. If they refer to different files, they cannot possibly be the same FD. Is there a situation where you have lots of FDs pointing to the *same* file?

kcmp() breaks loose

Posted Feb 12, 2021 18:35 UTC (Fri) by matthias (subscriber, #94967) [Link]

Checking which FDs belong to the same file will boil down to sorting them (according to some criterion) and then comparing them. Without some kind of sorting you will not end up with O(n*log(n)), but with pairwise testing, i.e. O(n^2). If kcmp() can provide an order (almost) for free, this is probably much cheaper than first constructing such an order in userspace and then using kcmp() only on those files that are the same.
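The sorting approach discussed in this thread can be sketched as below. To keep the example runnable anywhere, the comparison is passed in as a function returning kcmp()-style results (0 equal, 1 less, 2 greater); compare_by_tens() is a made-up stand-in for a real kcmp() call, and the "ordering unavailable" result (3) is not handled. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Three-way comparison in the kcmp() style: 0 equal, 1 less, 2 greater. */
typedef int (*kcmp_like_fn)(int a, int b);

/* qsort() passes no context pointer, so keep the comparison in a global
 * (which makes this sketch unsuitable for concurrent use). */
static kcmp_like_fn active_cmp;

/* Translate the 0/1/2 convention into qsort()'s negative/zero/positive. */
static int qsort_adapter(const void *pa, const void *pb)
{
    int r = active_cmp(*(const int *)pa, *(const int *)pb);

    if (r == 0)
        return 0;
    return r == 1 ? -1 : 1;
}

/*
 * Sorts `fds` so that equal descriptors become adjacent, using
 * O(n log n) comparisons, and returns the number of distinct groups.
 */
size_t group_descriptors(int *fds, size_t n, kcmp_like_fn cmp)
{
    assert(cmp != NULL);
    if (n == 0)
        return 0;

    active_cmp = cmp;
    qsort(fds, n, sizeof(fds[0]), qsort_adapter);

    size_t groups = 1;
    for (size_t i = 1; i < n; i++)
        if (cmp(fds[i - 1], fds[i]) != 0)   /* neighbor differs: new group */
            groups++;
    return groups;
}

/* Stand-in comparison: values count as "equal" when a / 10 == b / 10. */
int compare_by_tens(int a, int b)
{
    int ka = a / 10, kb = b / 10;

    if (ka == kb)
        return 0;
    return ka < kb ? 1 : 2;
}
```

With the stand-in comparison, the values 31, 12, 35, 17, and 4 fall into three groups. With only an equal/not-equal answer available, finding the same groups would need pairwise comparisons.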

kcmp() breaks loose

Posted Feb 12, 2021 18:40 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (3 responses)

Isn't it trivial to open a file multiple times?

int fd1 = open("/dev/null", O_RDONLY);
int fd2 = open("/dev/null", O_RDONLY);
int fd3 = dup(fd1);

kcmp() breaks loose

Posted Feb 12, 2021 23:35 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

More prosaically, in the case where you need to hibernate an entire container, lots of processes are likely to have certain files open at any given time:

* Whatever systemd/sysvinit/put-your-favorite-alternative-here has attached to the average daemon's stdin/stdout/stderr inside the container.
* /dev/null, as you say.
* /dev/urandom
* Certain files in /etc
* For forking servers, some kinds of sockets and/or pipes, including named fifos and Unix domain sockets.
* Log files and other /var crap.
* Probably half a dozen other things.

kcmp() breaks loose

Posted Feb 14, 2021 4:43 UTC (Sun) by dullfire (guest, #111432) [Link] (1 responses)

Incidentally: "/dev/null" is probably one of the very few cases where userspace won't actually care if it gets restored as the same file entry as another process or not.

kcmp() breaks loose

Posted Feb 14, 2021 13:11 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

While true, one still needs to know that `/dev/null` is the backing file of a given fd to know whether to "ignore" it or not when restoring.

kcmp() breaks loose

Posted Feb 12, 2021 18:57 UTC (Fri) by cjwatson (subscriber, #7322) [Link]

I'd have thought that it would be quite common for many processes to share FDs due to fork().

kcmp() breaks loose

Posted Feb 15, 2021 8:48 UTC (Mon) by tdz (subscriber, #58733) [Link]

> determining whether two open file descriptors (possibly found in different processes) refer to the same open file within the kernel

Oh, wow. That functionality is actually available! I've long been looking for how to compare open file descriptions. I'd need this for transactional programming in userspace, similar to what the checkpoint/restore code does. With kcmp, a lot of ugly hacks can go away and the results should be much more accurate.