|This article brought to you by LWN subscribers|
Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.
A February linux-kernel mailing list discussion of a patch that extends the use of the CAP_COMPROMISE_KERNEL capability soon evolved into a discussion of the specific uses (or abuses) of the CAP_SYS_RAWIO capability within the kernel. However, in reality, the discussion once again exposes some general difficulties in the Linux capabilities implementation—difficulties that seem to have no easy solution.
The discussion began when Kees Cook submitted a patch to guard writes to model-specific registers (MSRs) with a check to see if the caller has the CAP_COMPROMISE_KERNEL capability. MSRs are x86-specific control registers that are used for tasks such as debugging, tracing, and performance monitoring; those registers are accessible via the /dev/cpu/CPUNUM/msr interface. CAP_COMPROMISE_KERNEL (formerly known as CAP_SECURE_FIRMWARE) is a new capability designed for use in conjunction with UEFI secure boot, which is a mechanism to ensure that the kernel is booted from an on-disk representation that has not been modified.
If a process has the CAP_COMPROMISE_KERNEL capability, it can perform operations that are not allowed in a secure-boot environment; without that capability, such operations are denied. The idea is that if the kernel detects that it has been booted via the UEFI secure-boot mechanism, then this capability is disabled for all processes. In turn, the lack of that capability is intended to prevent operations that can modify the running kernel. CAP_COMPROMISE_KERNEL is not yet part of the mainline kernel, but already exists as a patch in the Fedora distribution and Matthew Garrett is working towards its inclusion in the mainline kernel.
H. Peter Anvin wondered whether CAP_SYS_RAWIO did not already suffice for Kees's purpose. In response, Kees argued that CAP_SYS_RAWIO is for governing reads: "writing needs a much stronger check". Kees went on to elaborate:
This in turn led to a short discussion about whether a capability was the right way to achieve the goal of restricting certain operations in a secure-boot environment. Kees was inclined to think it probably was the right approach, but deferred to Matthew Garrett, implementer of much of the secure-boot work on Fedora. Matthew thought that a capability approach seemed the best fit, but noted:
In the current mainline kernel, the CAP_SYS_RAWIO capability is checked in the msr_open() function: if the caller has that capability, then it can open the MSR device and perform reads and writes on it. The purpose of Kees's patch is to add a CAP_COMPROMISE_KERNEL check on each write to the device, so that in a secure-boot environment the MSR devices are readable, but not writeable. The problem that Matthew alludes to is that this approach has the potential to break user space because, formerly, there was no capability check on MSR writes. An application that worked prior to the introduction of CAP_COMPROMISE_KERNEL can now fail in the following scenario:
The last of the above steps would formerly have succeeded, but, with the addition of the CAP_COMPROMISE_KERNEL check, it now fails. In a subsequent reply, Matthew noted that QEMU was one program that was broken by a scenario similar to the above. Josh Boyer noted that Fedora has had a few reports of applications breaking on non-secure-boot systems because of scenarios like this. He highlighted why such breakages are so surprising to users and why the problem is seemingly unavoidable:
Really though, the main issue is that you cannot introduce new caps to enforce finer grained access without breaking something.
Shortly afterward, Peter stepped back to ask a question about the bigger picture: why should CAP_SYS_RAWIO be allowed on a secure-boot system? In other words, rather than adding a new CAP_COMPROMISE_KERNEL capability that is disabled in secure-boot environments, why not just disable CAP_SYS_RAWIO in such environments, since it is the possession of that capability that permits compromising a booted kernel?
That led Matthew to point out a major problem with CAP_SYS_RAWIO:
To see what Matthew is talking about, we need to look at a little history. Back in January 1999, when capabilities first appeared with the release of Linux 2.2, CAP_SYS_RAWIO was a single-purpose capability. It was used in just a single C file in the kernel source, where it governed access to two system calls: iopl() and ioperm(). Those system calls permit access to I/O ports, allowing uncontrolled access to devices (and providing various ways to modify the state of the running kernel); hence the requirement for a capability in order to employ the calls.
The problem was that CAP_SYS_RAWIO rapidly grew to cover a range of other uses. By the time of Linux 2.4.0, there were 37 uses across 24 of the kernel's C source files, and looking at the 3.9-rc2 kernel, there are 69 uses in 43 source files. By either measure, CAP_SYS_RAWIO is now the third most commonly used capability inside the kernel source (after CAP_SYS_ADMIN and CAP_NET_ADMIN).
Peter had some choice words to describe the abuse of CAP_SYS_RAWIO to protect operations on SCSI devices. The problem, of course, is that in order to perform relatively harmless SCSI operations, an application requires the same capability that can trivially be used to damage the integrity of a secure-boot system. And that, as Matthew went on to point out, is the point of CAP_COMPROMISE_KERNEL: to disable the truly dangerous operations (such as MSR writes) that CAP_SYS_RAWIO permits, while still allowing the less dangerous operations (such as the SCSI device operations).
All of this leads to a conundrum that was nicely summarized by Matthew. On the one hand, CAP_COMPROMISE_KERNEL is needed to address the problem that CAP_SYS_RAWIO has become too diffuse in its meaning. On the other hand, the addition of CAP_COMPROMISE_KERNEL checks in places where there were previously no capability checks in the kernel means that applications that drop all capabilities will break. There is no easy way out of this difficulty. As Peter noted: "We thus have a bunch of unpalatable choices, **all of which are wrong**".
Some possible resolutions of the conundrum were mentioned by Josh Boyer earlier in the thread: CAP_COMPROMISE_KERNEL could be treated as a "hidden" capability whose state could be modified only internally by the kernel. Alternatively, CAP_COMPROMISE_KERNEL might be specially treated, so that it can be dropped only by a capset() call that operates on that capability alone; in other words, if a capset() call specified dropping multiple capabilities, including CAP_COMPROMISE_KERNEL, the state of the other capabilities would be changed, but not the state of CAP_COMPROMISE_KERNEL. The problem with these approaches is that they special-case the treatment of CAP_COMPROMISE_KERNEL in a surprising way (and surprises in security-related APIs have a way of coming back to bite in the future). Furthermore, it may well be the case that analogous problems are encountered in the future with other capabilities; handling each of these as a special case would further add to the complexity of the capabilities API.
The discussion in the thread touched on a number of other difficulties with capabilities. Part of the solution to the problem of the overly broad effect of CAP_SYS_RAWIO (and CAP_SYS_ADMIN) might be to split the capability into smaller pieces—replace one capability with several new capabilities that each govern a subset of the operations governed by the old capability. Each privileged operation in the kernel would then check to see whether the caller had either the old or the new privilege. This would allow old binaries to continue to work while allowing new binaries to employ the new, tighter capability. The risk with this approach is, as Casey Schaufler noted, the possibility of an explosion in the number of capabilities, which would further complicate administering capabilities for applications. Furthermore, splitting capabilities in this manner doesn't solve the particular problem that the CAP_COMPROMISE_KERNEL patches attempt to solve for CAP_SYS_RAWIO.
With 502 uses in the 3.9-rc2 kernel, CAP_SYS_ADMIN is the most egregious example of this problem. That problem itself would appear to spring from the Linux kernel development model: the decisions about which capabilities should govern new kernel features typically are made by individual developer in a largely decentralized and uncoordinated manner. Without having a coordinated big picture, many developers have adopted the seemingly safe choice, CAP_SYS_ADMIN. A related problem is that it turns out that a number of capabilities allow escalation to full root privileges in certain circumstances. To some degree, this is probably unavoidable, and it doesn't diminish the fact that a well-designed capabilities scheme can be used to reduce the attack surface of applications.
One approach that might help solve the problem of overly broad capabilities is hierarchical capabilities. The idea, mentioned by Peter, is to split some capabilities in a fashion similar to the way that the root privilege was split into capabilities. Thus, for instance, CAP_SYS_RAWIO could become a hierarchical capability with sub-capabilities called (say) CAP_DANGEROUS and CAP_MOSTLY_HARMLESS. A process that gained or lost CAP_SYS_RAWIO would implicitly gain or lose both CAP_DANGEROUS and CAP_MOSTLY_HARMLESS, in the same way that transitions to and from an effective user ID of 0 grant and drop all capabilities. In addition, sub-capabilities could be raised and dropped independently of their "siblings" at the same hierarchical level. However, sub-capabilities are not a concept that currently exists in the kernel, and it's not clear whether the existing capabilities API could be tweaked in such a way that they could be implemented sanely. Digging deeper into that topic remains an open challenge.
The CAP_SYS_RAWIO discussion touched on a long list of difficulties in the current Linux capabilities implementation: capabilities whose range is too broad, the difficulties of splitting capabilities while maintaining binary compatibility (and, conversely, the administrative difficulties associated with defining too large a set of capabilities), the as-yet poor adoption of binaries with file capabilities vis-a-vis traditional set-user-ID binaries, and the (possible) need for an API for hierarchical capabilities. It would seem that capabilities still have a way to go before they can deliver on the promise of providing a manageable mechanism for providing discrete, non-elevatable privileges to applications.
Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds