SO_PEERCGROUP: which container is calling?

By Jonathan Corbet
March 18, 2014
As various container solutions on Linux approach maturity, distribution developers are thinking more about the infrastructure needed to manage a system full of containers. Toward that goal, Vivek Goyal recently posted a patch allowing a process to determine which control group contains a process at the other end of a Unix-domain socket. The patch is relatively simple, but it still kicked off a lengthy discussion making it clear that, among other things, there is still resistance to using modern Linux kernel facilities to implement new features.

The patch in question adds a new command (SO_PEERCGROUP) to the getsockopt() system call. A process can invoke this command on an open Unix-domain socket and get back the name of the control group containing the process at the other end. Or something close to that: what is returned is the control group the peer process was in when the connection was established; that process may have moved in the meantime. The information may thus be a bit outdated, but SO_PEERCGROUP mirrors the existing SO_PEERCRED command in this regard. Connection-time information is deemed to be good enough for the targeted use case, which is allowing the system security services daemon (SSSD) to make policy decisions based on which container it is talking to.
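Since the proposed option is modeled on SO_PEERCRED, a look at the existing call may help to show what a daemon would do with it. The following is a minimal sketch in Python rather than a definitive implementation: the helper name `peer_credentials()` is invented for illustration, and `socketpair()` stands in for a real client connection. No attempt is made to call SO_PEERCGROUP itself, since its option number depends on the patch.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid.
UCRED_FORMAT = "iII"


def peer_credentials(sock):
    """Return the (pid, uid, gid) recorded for the peer at connection time."""
    raw = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(UCRED_FORMAT)
    )
    return struct.unpack(UCRED_FORMAT, raw)


# Both ends of a socketpair belong to this process, so the "peer"
# reported here is simply ourselves.
daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pid, uid, gid = peer_credentials(daemon_end)
print(f"peer pid={pid} uid={uid} gid={gid}")
```

As with the proposed control-group query, the values describe the peer as it was when the connection was set up; nothing here notices if the peer later changes its credentials or exits.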

The main critic of this patch was Andy Lutomirski, who had a number of complaints about it. In the end, though, the key point may have been described in this message:

My a priori opinion is that this is a terrible idea. cgroups are a nasty interface, and letting knowledge of cgroups leak into the programs that live in the groups (as opposed to the cgroup manager) seems like a huge mistake to me.

Part of this complaint was a bit off the mark: the idea is to not require awareness of control groups for processes running inside containers. But, even without that, Andy appears to be against the use of control groups in general. He is certainly not alone in that point of view.

Andy came up with three alternative approaches by which a daemon process could identify which container is connecting to it, but those have run into resistance as well. The first of those was to put the containers inside user namespaces. The user-ID mapping performed by user namespaces would then allow each connecting process to be identified with the existing SO_PEERCRED mechanism or with an SCM_CREDENTIALS control message. Adding user namespaces to the mix should also make containers more secure, he said.
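For readers unfamiliar with the SCM_CREDENTIALS mechanism mentioned above, here is a rough sketch of the receiving side in Python. The function name `receive_with_credentials()` is hypothetical, and a datagram `socketpair()` again stands in for a real client. The receiver enables SO_PASSCRED so that the kernel attaches the sender's credentials to each message.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid.
UCRED_FORMAT = "iII"
UCRED_SIZE = struct.calcsize(UCRED_FORMAT)


def receive_with_credentials(sock, bufsize=1024):
    """Receive one message and the sender credentials attached to it, if any."""
    data, ancdata, _flags, _addr = sock.recvmsg(bufsize, socket.CMSG_SPACE(UCRED_SIZE))
    creds = None
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            creds = struct.unpack(UCRED_FORMAT, cdata[:UCRED_SIZE])
    return data, creds


receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
# Ask the kernel to attach the sender's credentials to incoming messages.
receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
sender.send(b"hello")
message, creds = receive_with_credentials(receiver)
print(message, creds)
```

Unlike SO_PEERCRED, these credentials are attached per message rather than recorded once at connection time. The IDs are reported as seen from the receiver's side, which is why a per-container user-ID mapping could, in principle, tell containers apart.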

The objection to this approach was best summed up by Vivek:

Using user namespaces sounds like the right way to do it (at least conceptually). But I think hurdle here is that people are not convinced yet that user namespaces are secure and work well. IOW, some people don't seem to think that user namespaces are ready yet.

Simo Sorce echoed these concerns and also added that he is not in a position to make the target container mechanism (Docker) use user namespaces. Eric Biederman, the developer of user namespaces, asked for specifics of any problems and observed: "It seems strange to work around a feature that is 99% of the way to solving their problem with more kernel patches."

Strange or not, there does not appear to be a lot of interest in exploring the use of user namespaces as a solution to this particular problem. Like control groups, user namespaces are a relatively new, Linux-specific mechanism; getting developers to adopt such features is often a challenge. In this case, concerns about a lack of maturity can only serve to deprive user namespaces of testing, prolonging any such immaturity further.

Andy's second suggestion was to get the container information out of /proc, using the process ID of the connecting process. Simo responded that use of process IDs can suffer from race conditions; processes can come and go quickly on some systems. The third idea was to just keep a separate socket open into each container; this idea was dismissed as being on the messy and inelegant side, but nobody said that it wouldn't work.
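The /proc approach can be sketched as follows; this is an illustration under stated assumptions, not code from the discussion. The function names are invented, the sample text is made up, and the process ID is assumed to have come from something like SO_PEERCRED. Each line of /proc/<pid>/cgroup has the form "hierarchy-ID:controller-list:path".

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path)."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # The path may itself contain colons, so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def cgroups_of(pid):
    """Read the control groups of a process; racy if the PID has been recycled."""
    with open(f"/proc/{pid}/cgroup") as f:
        return parse_proc_cgroup(f.read())


# Made-up sample content, for illustration only.
sample = "4:memory:/container1\n3:cpu,cpuacct:/container1\n"
print(parse_proc_cgroup(sample))
```

The race that Simo describes sits between learning the process ID and opening the file: if the peer exits and its ID is reused in that window, `cgroups_of()` quietly describes some other process.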

The end result was a conversation that, by all appearances, convinced nobody. In the process, it has highlighted a question that often comes up in the kernel community: once we add interesting new features, to what extent can we integrate those features with others or expect developers to use them? Expect to see this kind of debate more often as the kernel continues to develop and acquires more features that were never envisioned by any of the Unix standards bodies. A lot of work is going into adding new capabilities to the kernel; it would seem strange if we were so unconvinced by our own work that we did not expect others to make use of it.

Index entries for this article
Kernel: Containers


SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 5:48 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (12 responses)

What we really need is a way to have race-free process management. Cgroups goes a long way towards it, but it lacks an important piece - process handles.

Since we already have eventfd, timerfd and signalfd, it's only natural to have a pidfd() which would allow processes to be opened as files, with the usual refcounted semantics.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 8:36 UTC (Thu) by iq-0 (subscriber, #36655) [Link] (1 responses)

So true.

But that doesn't really address the problem above. It could, though, if one also had a way to:
- Get a pidfd for a process connected through a Unix socket (like SO_PEERCGROUP, only then an SO_PEERPIDFD or so)
- Ask the kernel to compare two pidfds and answer whether they refer to the same process (not elegant), or better, have the kernel always return the same fd when multiple instances of a pidfd are requested, though that might have some tricky performance consequences when used often

For this specific problem (you receive a connection or, even better, a message/dgram) you actually want a simple, unique process-identifying token that:
- Has no clear meaning (aka: it shouldn't be possible to tell whether it's the original "init" process or whether process A is older/newer than B)
- Is unique as long as somebody could have some reference to it (like having an open pidfd(), or somebody having the procdir still open or so)

In that case a daemon can use the pidfd() logic to track any and all processes it's interested in, and as long as it has that information it can safely match tokens that it got via e.g. recvmsg() with one of those processes (without needing any additional syscall).

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 19:25 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

An interesting way to have something like pidfd is to open the /proc/<pid> directory and hold the descriptor open. The kernel should also keep track of the open /proc/<pid> directories and not reuse those PIDs for other processes.
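A rough Python sketch of the "hold /proc/<pid> open" idea follows; the helper names are hypothetical. It opens the directory once and then reads files relative to that descriptor instead of rebuilding the path from the numeric PID each time.

```python
import os


def open_process_dir(pid):
    """Open /proc/<pid> and return a directory descriptor to use as a handle."""
    return os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)


def read_process_file(proc_dir_fd, name):
    """Read a file such as 'stat' or 'cgroup' relative to the held descriptor."""
    fd = os.open(name, os.O_RDONLY, dir_fd=proc_dir_fd)
    with os.fdopen(fd, "r") as f:
        return f.read()


# Use our own process as the target so the example is self-contained.
handle = open_process_dir(os.getpid())
stat_line = read_process_file(handle, "stat")
os.close(handle)
print(stat_line.split()[0])  # first field of /proc/<pid>/stat is the PID
```

Whether a lookup through such a descriptor reliably fails after the target exits, rather than resolving to a later process with the same PID, is kernel behavior that should be verified before it is relied upon. The sketch also leaves the initial open by numeric PID just as racy as before.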

That should take care of most of the PID races. It should also take care of establishing the identity of the process in question.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 18:57 UTC (Thu) by luto (subscriber, #39314) [Link] (9 responses)

The trouble with pidfd (or at least my understanding of it) is that it doesn't really help with process hierarchies. We need a way to find ancestors of a process.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 19:20 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

Process hierarchies are largely useless on the classic Unix systems, they are far too fluid and racy.

Cgroups really solves this problem by providing more logical process grouping.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 19:47 UTC (Thu) by luto (subscriber, #39314) [Link] (7 responses)

We have PR_SET_CHILD_SUBREAPER, which makes hierarchies work quite sensibly. When I say "hierarchy", I mean the hierarchy before reparenting kills it.
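For reference, marking a process as a child subreaper is a single prctl() call. The sketch below goes through ctypes because Python's standard library has no prctl() wrapper; the function names are invented for illustration, and the constants 36 and 37 are the values of PR_SET_CHILD_SUBREAPER and PR_GET_CHILD_SUBREAPER in <linux/prctl.h>.

```python
import ctypes
import os

PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37

# Load the C library of the running process to reach prctl().
libc = ctypes.CDLL(None, use_errno=True)
ZERO = ctypes.c_ulong(0)


def set_child_subreaper(enabled=True):
    """Make (or stop making) this process adopt its orphaned descendants."""
    arg = ctypes.c_ulong(1 if enabled else 0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, arg, ZERO, ZERO, ZERO) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def is_child_subreaper():
    """Query the current subreaper flag of this process."""
    flag = ctypes.c_int(0)
    if libc.prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag), ZERO, ZERO, ZERO) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return bool(flag.value)


set_child_subreaper(True)
print("subreaper:", is_child_subreaper())
set_child_subreaper(False)  # restore the default
```

With the flag set, descendants whose parents exit are reparented to this process instead of to init, which is what keeps the hierarchy intact.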

The systemd folks seem to be pushing in the direction of forcing cgroups and the process hierarchy to be consistent, which suggests that what they really should be tracking is the process hierarchy, rather than overlaying cgroups on top just to track things.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 19:58 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

PR_SET_CHILD_SUBREAPER (which is a recent and non-standard option in itself) might work for the PID1 or similar-level programs, but it's far less useful for stuff like Docker's system daemon.

There are several issues:

1) It only works for the child processes. A system-level daemon in Docker might not be the parent of the containers. Think about a situation where you might want a system-level PostgreSQL to use cgroups for authorization.

2) It doesn't solve the problem of enumeration. With cgroups it's easy to get all the processes that are confined in one, and it's quite natural to use them to get important statistics like RAM usage.
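The enumeration point can be illustrated with a small sketch: each cgroup directory exposes a cgroup.procs file listing the member process IDs, one per line. The function names are made up, and the example reads a stand-in file from a temporary directory so that it runs without a mounted cgroup hierarchy.

```python
import os
import tempfile


def parse_pid_list(text):
    """Turn the whitespace-separated contents of cgroup.procs into integers."""
    return [int(token) for token in text.split()]


def processes_in_cgroup(cgroup_dir):
    """List the PIDs in the cgroup rooted at cgroup_dir."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return parse_pid_list(f.read())


# Stand-in directory with made-up PIDs; a real caller would pass the
# cgroup's directory under wherever the hierarchy is mounted.
with tempfile.TemporaryDirectory() as fake_cgroup:
    with open(os.path.join(fake_cgroup, "cgroup.procs"), "w") as f:
        f.write("101\n202\n303\n")
    demo_pids = processes_in_cgroup(fake_cgroup)

print(demo_pids)
```

The list is only a snapshot; processes can enter or leave the group between the read and any later use of the PIDs.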

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 20:05 UTC (Thu) by luto (subscriber, #39314) [Link] (5 responses)

For Docker, etc, at the end of the day, there has to be some coordination between the container manager and whatever is trying to identify processes. (To further muddy the waters, there's setns.)

For enumeration, there's /proc/pid/task/tid/children, which is currently hidden behind CONFIG_CHECKPOINT_RESTORE.

But things like "you might want a system-level PostgreSQL to use cgroups for authorization" bug me. *Why* do you want PostgreSQL to use cgroups? What are you actually trying to achieve? Why is cgroups the right thing to use?

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 20:14 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

Suppose that we have a process hierarchy in '/proc/sys/fs/cgroup/container1', perhaps also namespaced.

I want to give access to these processes connecting through PostgreSQL's Unix socket as a PostgreSQL user named 'Bob'. This way processes from this container can use their databases without password-based authentication.

It's certainly possible to add something like a system-level credentials daemon to authorize such requests, and it looks like systemd is moving in this direction. But it also looks far less elegant.

And regarding /proc/pid/task/tid/children - it looks like an attempt to re-create cgroups. Which is kinda dumb, since cgroups already has a 'freeze' controller to stop process groups. It would have been natural to add support for state serialization to it.

So yeah, it looks like another cgroups-related clusterk-suck.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 20:30 UTC (Thu) by luto (subscriber, #39314) [Link] (3 responses)

Sticking a process hierarchy in '/proc/sys/fs/cgroup/container1', by itself, is simply not valid as a credential. It does not prevent outside processes from ptracing in. It does not prevent outside processes from messing with the stuff inside via proc.

To be fair, process hierarchies are also not useful as a security boundary. I mentioned them because I assumed that the intended use case of SO_PEERCGROUP was to figure out where syslog messages came from. I don't think that the pid should be used for authentication.

The right solution IMO is to either use namespaces directly (and to improve the API if needed) or to use multiple sockets.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 20:44 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Of course. But in the cases where we want to do a per-cgroup authentication it's usually the case that each cgroup can't access other cgroups (because of different user IDs or namespaces).

Using namespaces is a bit problematic. They are not always _necessary_ for the cgroups use-cases. For instance, we use cgroups mostly to contain the resource usage so a malfunctioning program can't OOM processes from other cgroups. There's no need to create a separate namespace - all processes use the same NFS-mounted /home partition, for example.

A couple of our services use SO_PEERCRED to get the /proc/pid/cgroup information to charge resource usage to a correct task. Isolating malicious processes ptrace()-ing processes of the same user in another cgroup is not an issue for us.

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 20:50 UTC (Thu) by luto (subscriber, #39314) [Link] (1 responses)

If you're not worried about malicious processes, why not just have the process in question send over its cgroup name?

SO_PEERCGROUP: which container is calling?

Posted Mar 20, 2014 20:58 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

We're not worried up to a point. A user's task ptrace()-ing another task of the same user is OK. In this case it's the user's own problem to find out why resources used were charged to the wrong task.

However, giving users the ability to impersonate other users is a no-no.

Currently, we use SO_PEERCRED to get the target process ID and read the /proc/pid/cgroup file. All of the sensitive information is encoded in the cgroup name, so it's not really a problem for us if there's a PID race.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds