Namespaced file capabilities

By Jonathan Corbet
June 30, 2017

The kernel's file capabilities mechanism is a bit of an awkward fit with user namespaces, in that all namespaces have the same view of the capabilities associated with a given executable file. There is a patch set under consideration that adds awareness of user namespaces to file capabilities, but it has brought forth some disagreement on how such a mechanism should work. The question is, in brief: how should a set of file capabilities be picked for any given user namespace?

The Linux capabilities mechanism is meant to allow privileges to be granted to processes in a manner that is more fine-grained than the classic Unix "root can do anything" approach. So, for example, an otherwise unprivileged program that needs to be able to send a signal to an unrelated process could be given CAP_KILL rather than full root privileges. Capabilities have not revolutionized privilege management as had once been hoped, but they can still have their uses.

In a typical Unix system, privileged operations are made available to ordinary users by way of setuid programs. In a system with capabilities, it is natural to want to associate capabilities with an executable program instead, once again in the hope of limiting the amount of privilege that must be granted. File capabilities, added for the 2.6.24 kernel, provide that feature.

User namespaces allow a set of processes to run as root within the namespace, while mapping the root ID (and possibly others) to normal IDs for actions (such as filesystem access) involving the rest of the system. A process running as root within a user namespace can create a setuid-root binary that will only work as intended within that namespace; it will not be usable to escalate privileges outside of the namespace. The same is not true of file capabilities, though; all user namespaces have the same view of the capabilities associated with an executable file and, since processes in a user namespace lack privilege in the root namespace, they cannot change those capabilities.
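The ID mapping described above can be sketched in a few lines. This is only an illustration: it assumes the three-column layout used by /proc/PID/uid_map (first ID inside the namespace, first ID outside, length of the range), and the function names and the example map are hypothetical.

```python
def parse_uid_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(field) for field in fields)
        entries.append((inside, outside, length))
    return entries


def to_outside_uid(entries, inside_uid):
    """Translate a UID seen inside the namespace to the UID used outside it.

    Returns None when the UID has no mapping.
    """
    for inside, outside, length in entries:
        if inside <= inside_uid < inside + length:
            return outside + (inside_uid - inside)
    return None


# Hypothetical map: root inside the namespace is UID 1000 outside of it.
example_map = parse_uid_map("0 1000 1\n")
print(to_outside_uid(example_map, 0))  # 1000
print(to_outside_uid(example_map, 1))  # None (unmapped)
```

With a single-entry map like this one, only root inside the namespace corresponds to an ID outside of it; every other ID is unmapped.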

File capabilities are implemented using extended attributes; in particular, they are stored in the security.capability attribute. The kernel handles the security.* extended-attribute namespace specially; only a privileged program (one possessing the CAP_SYS_ADMIN capability in particular) can change those attributes. So it is not possible for an unprivileged container running within a user namespace to add capabilities to a file; there is, in any case, no way to store extended attributes such that they are only visible within a given user namespace.
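For readers curious about what that attribute contains, here is a rough sketch of a decoder for the security.capability value. It assumes the little-endian layout from the kernel's capability header (a 32-bit magic/flags word followed by pairs of permitted and inheritable words) and handles only revisions 1 and 2; the function name and the sample bytes are made up for illustration.

```python
import struct

# Constants as defined in the kernel's <linux/capability.h>.
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_1 = 0x01000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001

CAP_NET_RAW = 13  # bit number of CAP_NET_RAW


def decode_file_caps(raw):
    """Decode a security.capability value into its flag and bitmasks."""
    (magic_etc,) = struct.unpack_from("<I", raw, 0)
    revision = magic_etc & VFS_CAP_REVISION_MASK
    if revision == VFS_CAP_REVISION_1:
        words = 1
    elif revision == VFS_CAP_REVISION_2:
        words = 2
    else:
        raise ValueError(f"unsupported revision {revision:#x}")

    permitted = inheritable = 0
    for i in range(words):
        # Each 8-byte entry holds one 32-bit slice of both masks.
        perm_word, inh_word = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= perm_word << (32 * i)
        inheritable |= inh_word << (32 * i)

    return {
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
    }


# Sample value built by hand: revision 2, effective flag set,
# CAP_NET_RAW permitted (what "cap_net_raw+pe" would describe).
sample = struct.pack("<IIIII", VFS_CAP_REVISION_2 | VFS_CAP_FLAGS_EFFECTIVE,
                     1 << CAP_NET_RAW, 0, 0, 0)
caps = decode_file_caps(sample)
print(caps["effective"], hex(caps["permitted"]))  # True 0x2000
```

On a Linux system, the raw bytes for a real file could be obtained with os.getxattr(path, "security.capability"), which raises OSError if the attribute is absent.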

The proposed patch set, posted by Stefan Berger, aims to change that by extending the extended-attribute syntax. This is done by decorating attributes with syntax describing the user ID (in the root namespace) associated with UID zero within a user namespace. So, for example, if a user with UID 1000 starts a user namespace, processes running as root within that namespace will access the filesystem with the original ID of 1000. If that user adds capabilities to a file within the user namespace, those capabilities will actually be stored in an extended attribute named:

    security.capability@uid=1000

Outside the namespace, this new attribute will have no effect. Within any namespace mapped to UID 1000, though, that attribute will appear as simply security.capability, so the program contained within that file will run with those capabilities in its masks.
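The renaming described above can be modeled as a pair of small functions. This is a toy model of the naming scheme as the article describes it, not the code from the patch set; the function names are hypothetical, and the treatment of attributes tagged for some other UID (passed through unchanged) is an assumption.

```python
BASE = "security.capability"


def on_disk_name(name, root_kuid):
    """Name under which an attribute set inside a namespace would be stored.

    root_kuid is the ID, in the root namespace, that UID 0 of the
    caller's user namespace maps to (0 for the root namespace itself).
    """
    if name == BASE and root_kuid != 0:
        return f"{name}@uid={root_kuid}"
    return name


def visible_name(stored_name, root_kuid):
    """Name that a namespace whose root maps to root_kuid would see."""
    if root_kuid != 0 and stored_name == f"{BASE}@uid={root_kuid}":
        return BASE
    # Assumption: anything else is passed through unchanged.
    return stored_name


stored = on_disk_name(BASE, 1000)
print(stored)                      # security.capability@uid=1000
print(visible_name(stored, 1000))  # security.capability
print(visible_name(stored, 2000))  # security.capability@uid=1000
```

Only the namespace whose root maps to UID 1000 sees the decorated attribute under the plain name, which is the property the file-capabilities code depends on.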

This mechanism does not apply to extended attributes in general; it is, instead, restricted to a specific set of attributes that the kernel cares about. In the patch set, security.capability is obviously one of those attributes; the other is security.selinux, allowing for namespace-specific SELinux labels on files. The SELinux attribute was later removed, though, after SELinux maintainer Stephen Smalley pointed out that it would not work as intended.

Casey Schaufler objected to this mechanism, noting that if two user namespaces are both running mapped to UID 1000 and sharing a directory tree, file capabilities set in one of those namespaces will be visible in the other. He argued that the user ID is the wrong key to use for file capabilities; instead, he said, there should be some sort of persistent ID associated with the user namespace itself. Serge Hallyn (who had posted a namespaced file-capabilities patch of his own that had served as inspiration for Berger's work) disagreed, though, saying that the feature was working as designed.

James Bottomley, instead, objected that this mechanism will work poorly on systems where user IDs for containers are allocated dynamically. He asked for a simple @uid suffix, which would be picked up in any user namespace. Hallyn indicated openness to adding that suffix as an additional feature.

It would seem that most of the concerns about the feature itself have been headed off, so this patch set may be well on its way toward acceptance. That does, of course, leave out the biggest point of contention of all, one that was inevitable in retrospect: the proper formatting of the namespace-specific extended-attribute names. So the final form of the attribute may be something like security.ns@uid=1000@@.capability when the dust settles. Otherwise, though, namespaced file capabilities may be a kernel feature in the relatively near future.

Namespaced file capabilities

Posted Jul 1, 2017 11:55 UTC (Sat) by darwish (guest, #102479)

Casey's comment on the danger & unobviousness of linking the whole affair with UIDs is spot on.

Unfortunately a lot of kernel features get implemented from the view point of kernel developers rather than the view of user space developers and administrators in question.

Namespaced file capabilities

Posted Jul 3, 2017 20:18 UTC (Mon) by drag (guest, #31333) (8 responses)

One of the ways that containers are distributed between users is to create file system images.

How does this deal with situations where a file system created by one user on a system gets copied and re-used by another user on the same or a different system?

You can't know the UID of the user using the container ahead of time.

Namespaced file capabilities

Posted Jul 5, 2017 0:23 UTC (Wed) by hallyn (subscriber, #22558) (7 responses)

You do not have to fill in the UID manually. It's supposed to be transparent to you. From the point of view of a process in the container, the file capabilities are completely uid-agnostic.

When you then 'setcap cap_net_raw+pe /bin/ping', the kernel will automatically rewrite the xattr for you as one tagged with the kuid of root in your container. When you just 'getcap /bin/ping', it will show it as a regular security.capability.

So if you create a tarfile (respecting the xattrs, which is a trick in itself :) containing that file inside one container, then untar it in the other, the capability on the new file will have the correct new root kuid.
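A toy model of that round trip follows. It is only a sketch: a dictionary stands in for a file's extended attributes, the root kuids (100000 and 200000) are made-up values, the attribute value is a placeholder rather than a real encoded capability set, and untagged attributes are ignored for simplicity.

```python
BASE = "security.capability"


def stored_name(root_kuid):
    """On-disk attribute name used for a namespace whose root is root_kuid."""
    return BASE if root_kuid == 0 else f"{BASE}@uid={root_kuid}"


def set_caps(xattrs, value, root_kuid):
    """Model of setting file capabilities from inside a container."""
    xattrs[stored_name(root_kuid)] = value


def get_caps(xattrs, root_kuid):
    """Model of reading file capabilities from inside a container."""
    return xattrs.get(stored_name(root_kuid))


# Container A (hypothetical root kuid 100000) sets a capability on its ping.
ping_in_a = {}
set_caps(ping_in_a, b"placeholder-caps", 100000)

# tar running inside A only ever sees the plain attribute name.
archive_entry = {BASE: get_caps(ping_in_a, 100000)}

# Extracting inside container B (hypothetical root kuid 200000) sets the
# plain attribute again, and the stored name picks up B's root kuid.
ping_in_b = {}
set_caps(ping_in_b, archive_entry[BASE], 200000)

print(sorted(ping_in_a))  # ['security.capability@uid=100000']
print(sorted(ping_in_b))  # ['security.capability@uid=200000']
```

The archive itself never contains a UID, which is why nothing needs to be known about the destination container ahead of time.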

Namespaced file capabilities

Posted Jul 5, 2017 17:20 UTC (Wed) by nybble41 (subscriber, #55106) (6 responses)

> So if you create a tarfile (respecting the xattrs, which is a trick in itself :) containing that file inside one container, then untar it in the other, the capability on the new file will have the correct new root kuid.

I believe the question was about moving filesystem _images_ from one context to another, not tarfiles. If I understand correctly, the filesystem image will retain the original root kuid in the xattrs, which is not too different from normal UID/GID handling when filesystem images are moved between systems. However, unlike the root UID and GID which can be fixed with chown/chgrp, I'm not sure there is a good way to change the root kuid in the xattrs short of recreating all the capabilities.

Namespaced file capabilities

Posted Jul 5, 2017 18:39 UTC (Wed) by hallyn (subscriber, #22558) (5 responses)

How exactly is the filesystem image being moved?

You mention chown/chgrp. So long as that is happening, then any capability xattrs are being automatically removed anyway. So just as you have to re-set the setuid and setgid bits, you'll have to re-set the xattrs you care to preserve. As soon as there is agreement on the format for this, I'll write a patch for lxd's fuidshift to do this for namespaced xattrs.

If you're using something like shiftfs, then shiftfs will simply have to do the right shifting for the xattrs just as it does for ownership.

What other ways are you thinking of?

The two examples listed above are ways that root on the host could 'move' the filesystem image on behalf of the container. The tar example has the advantage that an unprivileged user on the host can do it entirely without becoming host-root, so long as both contexts are allocated to the user. Supporting that, and doing so without any risk of leaking privilege out of the user namespace, is an important feature here.

Namespaced file capabilities

Posted Jul 6, 2017 15:37 UTC (Thu) by nybble41 (subscriber, #55106) (4 responses)

> What other ways are you thinking of?

I suppose the ideal case would be mounting the filesystem image from inside the user namespace, so that the image contains the container's UIDs and GIDs and there is no need to remap anything. So far as I know, that isn't allowed yet because there are too many potential vulnerabilities in the filesystem code to permit mounting untrusted block devices, including loopback devices. (Somehow removable media gets a free pass here. I understand that a potentially remote vulnerability is worse than one that requires physical access, but I don't buy the argument that limited physical access to plug in a USB device is equivalent to having root privileges. Just consider all the photo kiosks with exposed USB ports... the user is not intended to have full control, and root access could expose quite a bit of private data as well as provide a vector for spreading malware.)

> You mention chown/chgrp. So long as that is happening, then any capability xattrs are being automatically removed anyway.

I was thinking of systemd-nspawn --private-users-chown rather than fuidshift, but the same principle applies. It doesn't appear that systemd-nspawn preserves file capabilities when reassigning UIDs anyway, so this won't be any different from the current situation. You would still need to run a script as root inside the container afterward to restore the capabilities.

Namespaced file capabilities

Posted Jul 6, 2017 15:58 UTC (Thu) by hallyn (subscriber, #22558) (3 responses)

(my desktop doesn't auto-mount usb sticks :)

Indeed, actual mounting of filesystems in a user namespace is some time away, and if/when it does happen it's likely to be through fuse.

Two notes regarding avoiding the need to re-attach the capability xattrs. First, it's currently the case that if you go ahead and set a global capability (no uid= tag), it will be respected in all namespaces. Secondly, as James had suggested, we could add a 'uid=' (not followed by a number) tag, which would mean "this capability will work in any user namespace other than the initial one (or rather any where root is not mapped to kuid 0)." For the case where your host init system, or docker as host root, is arranging things, this could be useful.

I bet systemd would accept a patch to have it preserve namespaced file capabilities (once they are supported), so that you wouldn't have to have a script do it inside the container.
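The three kinds of attribute mentioned in this comment (untagged, tagged with a specific uid, and the proposed numberless tag) can be summarized in a short sketch. Everything here is hypothetical: the spelling of the wildcard tag was still under discussion, and the returned list only says which names would apply to a namespace, not which one would win if several were present.

```python
BASE = "security.capability"
WILDCARD = BASE + "@uid="  # hypothetical spelling of the numberless tag


def applicable_names(root_kuid):
    """On-disk names that would grant capabilities in a given namespace.

    root_kuid is the kuid that root in the namespace maps to
    (0 means root is mapped to kuid 0, as in the initial namespace).
    The order of the result does not imply any precedence.
    """
    names = [BASE]  # an untagged capability is respected everywhere
    if root_kuid != 0:
        names.append(f"{BASE}@uid={root_kuid}")  # tagged for this root kuid
        names.append(WILDCARD)  # any namespace where root is not kuid 0
    return names


print(applicable_names(0))
print(applicable_names(1000))
```

In this model, a namespace whose root maps to kuid 0 is only affected by the untagged attribute, while every other namespace would also honor its own tagged attribute and the wildcard one.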

Namespaced file capabilities

Posted Jul 6, 2017 16:47 UTC (Thu) by nybble41 (subscriber, #55106) (2 responses)

> (my desktop doesn't auto-mount usb sticks :)

Neither does mine, but any user logged in to a local console can manually mount a USB device of their choice through udisks and exploit filesystem vulnerabilities. On my personal systems that isn't so much of a problem, because I'm both the only local user and the admin (via sudo). It's more of a problem for devices such as kiosks which have other protection against the usual physical attacks—such as being located in a public place and monitored by security cameras—but need to expose USB ports, SD card slots, etc. for use by untrusted individuals. I suppose they could mount the raw block device via FUSE as an untrusted user, bypassing the kernel's filesystem code altogether, but it would be better if filesystems could just treat data from block devices as untrusted input. Besides the obvious security implications this would also help protect the system against more pedestrian data corruption.

I really have to wonder what the point is of an option like "nodev" or "nosuid" when (a) creating device nodes or SUID-root executables the normal way already requires root (or equivalent capabilities) and (b) if a user can create an accessible device node or SUID-root executable _without_ root, by directly modifying the filesystem, they are already presumed to be able to mount a corrupted filesystem, which is equivalent to having root.

Namespaced file capabilities

Posted Jul 6, 2017 17:21 UTC (Thu) by smcv (subscriber, #53363) (1 responses)

> any user logged in to a local console can manually mount a USB device of their choice through udisks

Only if the policies loaded by polkit say they can. The default policies provided with udisks assume that your system is a typical laptop, desktop or server, where physical access means the attacker has essentially already won; but on a kiosk-style system you don't have to use those defaults.

Namespaced file capabilities

Posted Jul 6, 2017 19:05 UTC (Thu) by nybble41 (subscriber, #55106)

> Only if the policies loaded by polkit say they can. ... on a kiosk-style system you don't have to use those defaults.

Right, but part of the function of the kiosk (e.g., a self-service photo order kiosk) is loading files from user-supplied removable media. I would imagine such kiosks don't typically use udisks for this, but they still need to access the files somehow.