LWN: Comments on "Namespaces in operation, part 5: User namespaces"

Namespaces in operation, part 5: User namespaces

marcozov — Sun, 14 Aug 2022 13:05:34 +0000

Thanks, I also found the related change in the code!

Namespaces in operation, part 5: User namespaces

izbyshev — Sun, 14 Aug 2022 10:20:43 +0000

Looks like the issue with setgroups() that another comment talks about: https://lwn.net/Articles/635559/.

Namespaces in operation, part 5: User namespaces

marcozov — Sun, 14 Aug 2022 09:52:11 +0000

Thanks for the article, it is really nice to read and follow.

I'm having an issue when trying to reproduce part of the mentioned steps (which should all be doable as non-root, as far as I understood).
In particular, when running `./demo_userns x`, I *can* run `echo '0 1000 1' > /proc/$DEMO_PID/uid_map` successfully (and the output of the demo_userns program is updated with the new user id, 0), but I *cannot* run `echo '0 1000 1' > /proc/$DEMO_PID/gid_map`.
If I try to run those echo commands as root, they both work, but this sounds a bit against the purpose of this article (which is about being able to do root actions in a restricted environment -- the new user namespace).

If I proceed with the demo, I have a similar problem when I run `./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' bash`:
```
write /proc/10568/gid_map: Operation not permitted
bash: initialize_job_control: no job control in background: Bad file descriptor
```
Removing the -G part makes the error go away here as well.

Any clue on how I can debug this? As far as I understood, if a process is in the parent user namespace it should automatically have the necessary capabilities (CAP_SETUID, CAP_SETGID) to write to the uid_map / gid_map files of the process in the new user namespace.
Is there anything that I can check to validate this? If I run `cat /proc/$$/status | grep Cap`, I get:
```
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
```
which I'm not sure how to interpret.
In particular, I'm referring to the three rules defined under `Rules for writing to mapping files`. The first one should always hold (based on my understanding, CAP_SETUID/CAP_SETGID are always valid for processes in the parent user namespace). The second one seems to hold as well (the spawned terminal is indeed in the user namespace of the shell that ran `./demo_userns x`). The third one seems a bit more ambiguous: the first statement holds (that's basically defined via `0 1000 1`), the second one not really (according to the `cat /proc/$$/status | grep Cap` output), but to me it looks like only one of the two has to hold. Furthermore, if this is the problem, I would expect that writing to the uid_map would lead to the same error.
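
For reading the `Cap*` masks above, a minimal sketch that tests individual bits may help. It assumes the bit numbers from `<linux/capability.h>` (CAP_SETGID is bit 6, CAP_SETUID is bit 7); the helper name `has_cap` is made up for illustration.

```python
# Bit numbers as defined in <linux/capability.h>.
CAP_SETGID = 6
CAP_SETUID = 7


def has_cap(mask_hex: str, cap_bit: int) -> bool:
    """Return True if bit `cap_bit` is set in a hex mask such as the
    CapEff value shown in /proc/PID/status."""
    return bool((int(mask_hex, 16) >> cap_bit) & 1)


if __name__ == "__main__":
    # Values copied from the output quoted above.
    status = {
        "CapPrm": "0000000000000000",
        "CapEff": "0000000000000000",
        "CapBnd": "0000003fffffffff",
    }
    for name, mask in status.items():
        print(name,
              "CAP_SETUID" if has_cap(mask, CAP_SETUID) else "-",
              "CAP_SETGID" if has_cap(mask, CAP_SETGID) else "-")
```

Read this way, the shell holds neither capability in its permitted or effective set, while the bounding set includes both. As far as I can tell, these fields describe the process's capabilities in its own user namespace only, so they would not show any capabilities it holds over a child namespace it created.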

Namespaces in operation, part 5: User namespaces

fusillator — Sun, 25 Nov 2018 17:39:23 +0000

> Moreover I don't get why the writing of the mapping isn't accomplished in the childFunc which has full set of capabilities in the context of the new namespace before executing the shell (following the first point of the third rules in the section Rules for writing to mapping files this should be feasible), this would avoid the need of a sync mechanism.

Since the command userns_child_exec takes the user mappings as argument, mappings to arbitrary user IDs (group IDs) in the parent user namespace must be allowed (see the second point of third rule in the section Rules for writing to mapping files).
Conversely, if the mapping is made from the cloned child in the new namespace, it's only possible to map the user id of the parent process in the parent namespace to any uid in the new namespace, root included.
When a process clones itself the user and group ids in the parent namespace are inherited by the child.

> In order to write on the mapping file from an unprivileged user (in the context of the parent namespace) the capabilities CAP_SETUID, CAP_SETGID needs to be granted to the calling process (see the first rule in the section Rules for writing to mapping files). The author doesn't show how to grant these privileges, a way is enabling the effective and permitted flags on the userns_child_exec binary.

Other rules control how capabilities propagate between parent and child namespaces. From the subsequent article, https://lwn.net/Articles/540087/:
"When a user namespace is created, the kernel records the effective user ID of the creating process as being the "owner" of the namespace. A process whose effective user ID matches that of the owner of a user namespace and which is a member of the parent namespace has all capabilities in the namespace."
So the required capabilities cap_set{uid,gid} are granted to the unprivileged parent process in the new namespace by default.

Namespaces in operation, part 5: User namespaces

fusillator — Sun, 25 Nov 2018 03:36:41 +0000

This is impressive but a bit hard to follow for a noob like me.

I think these are the main points of the article:
When a user namespace is created by clone, the first cloned process in the new namespace is granted a full set of capabilities in the new namespace.
Invoking exec* functions changes the calling process capabilities following the rules for the transformation of capabilities during exec*:
1) pI' = pI
2) pP' = (X & fP) | (pI & fI)
3) pE' = fE & pP'
Hence ns_child_exec gets the permission error because the child shell in the new user namespace is executed as an unprivileged user and the permitted, effective, and inheritable flags weren't set appropriately.
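
The three rules can be tried out on plain integers used as bit masks. This is only a sketch of the formulas exactly as written above: the function name `exec_transform` and the 38-bit `ALL` mask are assumptions for illustration, and the kernel's special handling of UID 0 is not modeled.

```python
def exec_transform(pI: int, fP: int, fI: int, fE: int, X: int):
    """Apply the three rules listed above to integer bit masks.

    pI         -- inheritable set of the process before exec
    fP, fI, fE -- permitted / inheritable / effective sets of the file
    X          -- capability bounding set
    Returns the tuple (pI', pP', pE').
    """
    new_pI = pI                      # rule 1
    new_pP = (X & fP) | (pI & fI)    # rule 2
    new_pE = fE & new_pP             # rule 3
    return new_pI, new_pP, new_pE


# Illustrative "all capabilities" mask; the width is an assumption.
ALL = (1 << 38) - 1

if __name__ == "__main__":
    # A binary with no file capabilities: nothing survives the exec.
    print(exec_transform(pI=0, fP=0, fI=0, fE=0, X=ALL))
    # A binary with full permitted and effective file capabilities.
    print(exec_transform(pI=0, fP=ALL, fI=0, fE=ALL, X=ALL))
```

Note that the permitted set held before the exec does not appear on the right-hand side of any rule, which is why a full set of capabilities in the new namespace does not by itself carry over to the shell.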

The code userns_child_exec.c takes care of writing the root map for the new user namespace from the parent.
First it creates the new user namespace by means of clone, then it ensures the map files are written by the parent process before the cloned child launches the shell, using a pipe for interprocess communication. The pipe is duplicated when the process clones, so each process (parent and child) has its own copies of the two file descriptors (pipe_fd[0] for reading, pipe_fd[1] for writing) pointing to the same pipe. The parent uses the write end of the pipe, closing it once it has completed the user mapping. The cloned child closes its own write end and uses the read end (it reads a character from the pipe, expecting end-of-file) to sync with the parent, ensuring the map was written by the calling process in the context of the parent namespace before the shell is executed.
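
The pipe handshake described above can be sketched without any namespaces at all. The following is a simplified stand-in that uses `os.fork()` instead of clone; the second pipe exists only so the demo can report what the child saw and is not part of the original program.

```python
import os


def run_sync_demo() -> bytes:
    """Fork a child that waits for EOF on a pipe before proceeding.
    Returns b"eof" if the child saw end-of-file as expected."""
    sync_r, sync_w = os.pipe()       # parent closes sync_w to signal "done"
    report_r, report_w = os.pipe()   # demo only: child reports back

    pid = os.fork()
    if pid == 0:
        # Child: close our copy of the write end, otherwise the read
        # below could never see EOF.
        os.close(sync_w)
        os.close(report_r)
        data = os.read(sync_r, 1)    # blocks until the parent closes sync_w
        os.write(report_w, b"eof" if data == b"" else b"data")
        os._exit(0)                  # the real program would exec the shell here

    # Parent.
    os.close(sync_r)
    os.close(report_w)
    # ... the real program writes uid_map / gid_map at this point ...
    os.close(sync_w)                 # signals the child that the maps are written
    result = os.read(report_r, 16)
    os.close(report_r)
    os.waitpid(pid, 0)
    return result


if __name__ == "__main__":
    print(run_sync_demo())
```

The key detail is that the child must close its own write descriptor; as long as any write end stays open, the read never returns end-of-file.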

These are my considerations:
In order to write on the mapping file from an unprivileged user (in the context of the parent namespace) the capabilities CAP_SETUID, CAP_SETGID needs to be granted to the calling process (see the first rule in the section Rules for writing to mapping files). The author doesn't show how to grant these privileges, a way is enabling the effective and permitted flags on the userns_child_exec binary.

Moreover I don't get why the writing of the mapping isn't accomplished in the childFunc which has full set of capabilities in the context of the new namespace before executing the shell (following the first point of the third rules in the section Rules for writing to mapping files this should be feasible), this would avoid the need of a sync mechanism.

The author states that to avoid losing the capabilities the parent needs to change the user mapping before executing the shell (since the shell capabilities flags aren't set appropriately for the exec transformation).
Anyway, I think that if this sync check weren't done, the shell might execute with the unprivileged uid from /proc/sys/kernel/overflowuid until the parent is able to write the mapping. The sync is necessary to avoid a race condition during the shell execution and to ensure the shell is running as a privileged process in the new user namespace from the very beginning.

Namespaces in operation, part 5: User namespaces

mkerrisk — Sun, 29 Oct 2017 08:49:21 +0000

Slides from my October 2017 presentation on User Namespaces at Open Source Summit Europe can be found here.

Complexity?

bandrami — Tue, 14 Apr 2015 04:51:00 +0000

> NT-derivative systems have used ACLs for a long time, and they're much more capable.

And this is not *necessarily* a good thing. Taking Unix permissions and then adding capabilities and ACLs triples (more than triples, really) the logic required to statically verify a configuration. I guess I sort of appreciate the idea that a library that isn't there can't be misconfigured -- I don't run ACLs or CAP_*s or namespaces on my production Linux servers for that reason even though that takes rebuilding the kernel. It's the same argument I have with mandatory access control systems: knobs I can twist are knobs that I can twist the wrong way. I want my security system to be so brain-dead that I can verify it at 3am in a loud server room with a client calling me every 30 seconds.

Namespaces in operation, part 5: User namespaces

mkerrisk — Thu, 05 Mar 2015 08:32:26 +0000

Note that because of the Linux 3.19 changes that fixed a user namespace security loophole related to the setgroups() system call, the userns_child_exec.c program needs modifications in order to be able to use GID maps on Linux 3.19 and later (and also on earlier stable kernel series that backported the changes). A revised (and backward compatible) version of this program with the necessary changes can be found in the revised user_namespaces(7) man page that will appear in a few days time. (Look for the definition and use of the proc_setgroup_write() function in the example program.)
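
The ordering requirement can be sketched as follows, assuming the rule documented in user_namespaces(7): an unprivileged writer must first write "deny" to /proc/PID/setgroups before it may write gid_map. This is not the man page's example program; `write_gid_map` and its `proc_root` parameter are hypothetical names, the latter added only so the function can be tried against a scratch directory.

```python
import os


def write_gid_map(pid: int, mapping: str, proc_root: str = "/proc") -> None:
    """Disable setgroups() for the target's user namespace, then write
    its GID map. `mapping` is a line such as "0 1000 1"."""
    base = os.path.join(proc_root, str(pid))

    # The setgroups file does not exist on kernels without the 3.19
    # change, so skipping it when absent keeps older kernels working.
    setgroups_path = os.path.join(base, "setgroups")
    if os.path.exists(setgroups_path):
        with open(setgroups_path, "w") as f:
            f.write("deny")

    with open(os.path.join(base, "gid_map"), "w") as f:
        f.write(mapping + "\n")
```

A call such as `write_gid_map(child_pid, "0 1000 1")` would stand in for the GID-map write in the parent, where `child_pid` is the PID of the process in the new namespace.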

Example fails on today's Ubuntu 13.04 daily

BernardB — Thu, 07 Mar 2013 14:05:45 +0000

Okay, having dug deeper, it turns out that the examples require CONFIG_USER_NS. As the article points out, 3.8 was still missing the changes for XFS and other filesystems. Unsurprisingly, Ubuntu 13.04 chose XFS and NFS support over CONFIG_USER_NS. Bummer :P

"Soon after 13.04 they will be fully supported." -- http://permalink.gmane.org/gmane.linux.kernel.containers....

Example fails on today's Ubuntu 13.04 daily

BernardB — Thu, 07 Mar 2013 13:42:21 +0000

No luck trying this out on today's Ubuntu 13.04 daily build:

```
$ id
uid=1000(bernard) gid=1000(bernard)
$ uname -a
Linux dev32 3.8.0-11-generic #20-Ubuntu SMP Tue Mar 5 20:33:22 UTC 2013 i686 athlon i686 GNU/Linux
$ gcc -o demo_userns demo_userns.c  -lcap
$ ./demo_userns
clone: Invalid argument
$ sudo ./demo_userns # It was worth a shot!
clone: Invalid argument
$ strace -e clone ./demo_userns
clone(child_stack=0x814a064, flags=0x10000000|SIGCHLD) = -1 EINVAL (Invalid argument)
clone: Invalid argument
$ apt-cache policy linux-image-`uname -r`
linux-image-3.8.0-11-generic:
  Installed: 3.8.0-11.20
  Candidate: 3.8.0-11.20
  Version table:
 *** 3.8.0-11.20 0
        500 http://gb.archive.ubuntu.com/ubuntu/ raring/main i386 Packages
        100 /var/lib/dpkg/status
```

I've yet to delve into the kernel source to find where EINVAL is coming from, but can anyone see if I am missing something obvious? Or maybe it's because Ubuntu's done something magic to their kernel? (The Makefile in their Linux sources purports to be 3.8.2).

Namespaces in operation, part 5: User namespaces

kevinm — Thu, 07 Mar 2013 02:40:09 +0000

So, a UID in the parent namespace that isn't mapped in the child namespace is mapped to a default UID; but what about a UID in the child namespace that isn't mapped? What UID will that have in the parent namespace (for example, when a process in the child namespace with UID=0 uses seteuid(9999), where child namespace UID 9999 isn't included in any mapping)?

Complexity?

etienne — Tue, 05 Mar 2013 10:29:44 +0000

> permissions are granted, normally, to users, not programs

Maybe that is not complex enough, and permissions should be granted to what the program is doing:
- if the program is updating itself (when no package manager) it should have rights to overwrite its own binaries
- if the program is configuring itself (when user changes something) it should have rights to change its configuration files
- if the program is being only "used", it shall do none of the above.

Ever seen a security system blocking half of the upgrade of a package?
I did not say I would like to manage such a system...

Complexity?

malor — Tue, 05 Mar 2013 10:07:46 +0000

The old Unix permissions system actually isn't very complex, which is its central problem. The permissions are very coarse, and it's very hard to describe complex security arrangements using those very dull tools. It's primarily based on user/group/other, read/write/execute, and the various permutations of those three permissions, granted to those three broad categories. And then you've got system-wide capabilities, which either grant or deny access to users to do things that can be dangerous to the system as a whole. As they presently stand, Unix permissions are very coarsely defined, and can be very far-reaching. Granting a given permission to a program can have nasty security implications that are difficult to understand.

On the Windows side, NT-derivative systems have used ACLs for a long time, and they're much more capable. The permissions themselves are fairly fine-grained, and then you can specify to a gnat's eyebrow exactly who should and should not get them. As long as you realize that the permissions system is looking for any possible excuse to deny a permission, and only if it A) can't find any reason to reject someone, and B) finds an explicit authorization, will it finally grant a permission. Just think of the NT permissions system as a big asshole, and the whole system ends up being easily understandable, and very powerful.

But, ACLs have a very fundamental problem: permissions are granted, normally, to users, not programs, so they do almost nothing to protect programs from each other. If they're being run by the same user (say, "malor"), then they can mess each other up. If I'm running Internet Explorer, then it has any permission that I do, and if it's hijacked, it can erase or corrupt anything that I could erase or corrupt.

Namespaces are kind of an ugly hack that seem to have three basic goals:

Preserve compatibility with the old Unix blunt instruments;
Allow finer-grained permission controls;
Assign permissions based on programs, rather than users

Once this stuff has been really integrated into the system software, running Firefox as "malor" should grant a very limited exposure to my other files, should it be hijacked. The browser process might be restricted to creating new files in a download directory only, with no other write access anywhere in the filesystem. A separate, user-facing program might have the authorization to rewrite user configuration files, like bookmarks or the settings in about:config. By separating them in this way, it will be enormously harder for a remote exploit, even in a full-featured language like Java, to escape the virtual sandbox it's in. It probably won't be impossible, but it should be much more difficult, perhaps requiring a specific exploit be written to attack your particular combination of OS and Firefox, making it non-feasible for mass exploit attempts.

This is kind of the same thing that Microsoft and Apple are trying to do with their DRM-based software stores, and highly restricted environments, but in this case, YOU hold the keys, not Microsoft or Apple.

The overall solution ends up being kind of ugly, because of the simultaneous need to maintain compatibility with a 40-year-old permissions system, and also to implement a bunch of new permission types that have never existed in Unix before, but I'll tell you this: I'll take an ugly system I can control myself over an imposed system by a corporation any day. If I want the best security, where programs are isolated from one another, but I also want to own my own hardware, Linux namespaces seem to be the way forward.

Hurd?

cesarb — Sat, 02 Mar 2013 19:15:51 +0000

> As we noted in an earlier article, one of the motivations for implementing user namespaces is to give non-root applications access to functionality that was formerly limited to the root user.

Wasn't that one of the motivations for the microkernel design of GNU Hurd?

Namespaces in operation, part 5: User namespaces

darwish07 — Fri, 01 Mar 2013 23:51:19 +0000

Thanks for providing such an interesting, and quite informative, article!

Complexity?

hummassa — Thu, 28 Feb 2013 17:26:24 +0000

IMHO, (2).

Windows-like ACLs (again IMHO) are simpler to apply but cause more esoteric and difficult-to-debug problems.

Namespaces in operation, part 5: User namespaces

mabshoff — Thu, 28 Feb 2013 14:33:36 +0000

> From a quick google, I found this: [SNIP]

Yeah, that was the first hit I got, too, but I discarded it for the reason listed below.

> So I still stand by my previous comment. Around a megabyte :)

Well, that specific patch is for a RHEL 5 based kernel, i.e. on top of their version of 2.6.18. The RHEL 6 based 2.6.32 kernel patch currently weighs in at 1.3 MB (see [1]). And that patch dates from March 4th 2011, so I would hardly call it current :p.

Anyway, with ploop and some of their other bits being out of mainline for now, their patch is a little like the RT patch set: it grows at times and shrinks at others, because as patches move from it into mainline, new patches for new functionality get added on top. At least, after many years of living mostly out of mainline, their efforts like CRIU have shown that you can merge it into mainline assuming all interested parties collaborate, and that is a really positive development imho.

Cheers,

Michael

[1] http://download.openvz.org/kernel/branches/2.6.32/2.6.32-...

Complexity?

renox — Thu, 28 Feb 2013 10:31:44 +0000

Unix user/group management has always looked very complex to me, I wonder if this is because
1) I've not invested enough effort to understand Unix management
2) the problem is itself very complex
3) this is an historical baggage/legacy and other approaches (Plan9? Windows?) could provide the same type of services but in a simpler way..

Thoughts?

Namespaces in operation, part 5: User namespaces

SEJeff — Thu, 28 Feb 2013 04:25:25 +0000

From a quick google, I found this:
http://openvz.org/Kernel_build#Rebuilding_kernel_from_sou...

```
[jeff@omniscience tmp]$ wget -q http://download.openvz.org/kernel/branches/2.6.18/028stab...
[jeff@omniscience tmp]$ du -hs patch-ovz028stab056.1-combined.gz
1.2M patch-ovz028stab056.1-combined.gz
[jeff@omniscience tmp]$ gzip -d patch-ovz028stab056.1-combined.gz
[jeff@omniscience tmp]$ du -hs patch-ovz028stab056.1-combined
4.6M patch-ovz028stab056.1-combined
```

I did the same thing about a year ago and the results were the same. So I still stand by my previous comment. Around a megabyte :)

Namespaces in operation, part 5: User namespaces

mabshoff — Wed, 27 Feb 2013 22:34:16 +0000

Well, I am not quite sure where the 1 MB patch figure comes from, but all the RHEL 6.x based patches weigh in at 27 MB unpacked. Note that this is 2.6.32 vanilla -> RHEL 6.x+ovz, so I do assume that the vast majority of that diff is the RHEL 6.x changes. Either way, as you mentioned a massive amount of code from the people working for Parallels has been merged, so I would be curious what the RHEL 7.0 diff will look like. I guess we will know in a couple months.

Cheers,

Michael

Namespaces in operation, part 5: User namespaces

ebiederm — Wed, 27 Feb 2013 22:11:16 +0000

Oh I would say that the user namespaces at least are much closer to the original vserver approach (which uses a fixed number of the high bits as the container id) and a fair bit better than either approach, as all of the weird corner cases of mixing userspace uids and gids and the kernel uids and gids are handled.

That is what the remaining XFS work is about: ensuring that XFS doesn't mix user space uids with in-kernel uids without adding the appropriate translations, and making it hard to confuse those two kinds of uids in the future. XFS has a very unique architecture for its in-kernel filesystem data structures and many more user-facing ioctls than most filesystems, which means it can't be treated like just another filesystem.

What was not mentioned is that when a process in a user namespace interacts with files, the interaction is the same as interacting with processes. When a file is created, the uid of the process is mapped into the initial user namespace, and those mapped uids are stored on disk. Meanwhile, when the process in a user namespace stats those files, the uids are mapped back into its namespace, so it sees the uids it wrote with instead of the uids that are stored on disk.

This allows quotas and other filesystem features to work with user namespaces without any changes to the on-disk format.
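
The round trip described above can be sketched as a lookup over uid_map entries. It assumes a single level of nesting (the parent is the initial namespace), the three-field "inside outside length" map format, and the default overflow UID of 65534; all function names are made up for illustration.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def parse_map(text: str):
    """Parse uid_map-style lines ("inside outside length") into tuples."""
    entries = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries


def to_outside(uid: int, entries):
    """Map a UID seen inside the namespace to the one stored on disk.
    Returns None when the UID has no mapping."""
    for inside, outside, length in entries:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None


def to_inside(uid: int, entries) -> int:
    """Map an on-disk UID back into the namespace, as stat() would show it.
    Unmapped UIDs show up as the overflow UID."""
    for inside, outside, length in entries:
        if outside <= uid < outside + length:
            return inside + (uid - outside)
    return OVERFLOW_UID


if __name__ == "__main__":
    entries = parse_map("0 1000 1")
    on_disk = to_outside(0, entries)            # file created by UID 0 inside
    print(on_disk, to_inside(on_disk, entries))
    print(to_inside(0, entries))                # real root's files, seen from inside
```

With the map `0 1000 1`, a file created by UID 0 inside the namespace is stored as UID 1000 and reads back as UID 0, while a file owned by the real UID 0 appears inside as 65534.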

Namespaces in operation, part 5: User namespaces

SEJeff — Wed, 27 Feb 2013 20:52:16 +0000

@einstein: Parallels (virtuozzo/openvz authors) have been some of the primary contributors to the upstream namespace support in the kernel. While I cringe at seeing the 1Mb+ patch that openvz is, I've got to give them props for going about things the right (and very long) way of getting small bits upstream at a time.

Namespaces in operation, part 5: User namespaces

einstein — Wed, 27 Feb 2013 19:59:54 +0000

It looks like we're oh so slowly and painfully discovering and re-inventing openvz a little bit at a time. Hopefully we'll get there before too many more years.

Namespaces in operation, part 5: User namespaces

nix — Wed, 27 Feb 2013 19:50:48 +0000

Simply excellent documentation. Would that all docs were like this.