LWN.net Logo

Stepping closer to practical containers: "syslog" namespaces

By Michael Kerrisk
December 5, 2012

The abstract goal of containers is, in effect, to provide a group of processes with the illusion that that they are the only processes on the system. When fully implemented, this feature has the potential to realize many practical benefits, such as light-weight virtualization and checkpoint/restore.

In order to give the processes in a container the illusion that there are no other processes on the system, various global system resources must be wrapped in abstractions that make it appear that each container has its own instance of the resources. This has been achieved by the addition of "namespaces" for a number of global resources. Each namespace provides an isolated view of a particular global resource to the set of processes that are members of that namespace.

Step by step, more and more global resources have been wrapped in namespaces, and before we look at another step in this path it's worth reviewing the progress to date.

Namespaces so far

The first step in the journey was mount namespaces, which can be used to provide a group of processes with a private view of the mount points that make up the filesystem hierarchy. Mount namespaces first appeared in the mainline kernel in 2002, with the release of Linux 2.4.19. The clone() flag used to create mount namespaces was given the rather generic name CLONE_NEWNS for "new namespace", implying that no one was then really considering the possibility that there might be other kinds of namespaces; at that time, of course, containers were no more than a gleam in the eyes of some developers.

However, as the concept of containers took hold, a number of other namespaces have followed. Network namespaces were added to provide a group of processes with a private view of the network (network devices, IP addresses, IP routing tables, port number space, and so on). PID namespaces isolated the global "PID number space" resource, so that processes in separate PID namespaces can have the same PIDs—in particular, each namespace can have its own 'init' (PID 1), the "ancestor of all processes". PID namespaces also allow techniques such as freezing the processes in a container and then restoring them on another system while maintaining the same PIDs.

Several other global resources have likewise been wrapped in namespaces, so that there are also IPC namespaces (initially implemented to isolate System V IPC identifiers and later to isolate instances of the virtual filesystems used in the implementation of POSIX message queues) and UTS namespaces (which wrap the nodename and domainname identifiers returned by uname(2)). Work on one of the more complex namespaces, user namespaces, was started in about Linux 2.6.23 and seems to be edging towards completion. When complete, user namespaces will allow per-namespace mappings of user and group IDs, so that, for example, it will be possible for a process to be root inside a container without having root privileges in the system as a whole.

Of course, a Linux system has a large number of global resources, each of which could conceivably be wrapped in a namespace. At the more extreme end, for example, even a resource such as the system time could be wrapped, so that different containers could maintain different concepts of the time. (A time namespace was once proposed, but the implementation was not merged.) The trick is to determine the minimum set of resources that need to be wrapped for the practical implementation of containers. (Of course, this "minimum set" may well grow over time, as people develop new uses for containers.) A related question is how those wrappings should be grouped so as to avoid an explosion of namespaces that would increase application complexity. So, for example, System V IPC and POSIX message queues could conceivably have been wrapped in different namespaces, but the kernel developers concluded that it makes practical sense to group them in a single "IPC" namespace.

The global kernel log problem

What is necessary for the practical implementation of containers sometimes only becomes clear when one starts doing, well, practical things. Thus, it was that in early 2010 Jean-Marc Pigeon reported that he had written a small utility to build containers using the clone() system call that worked fine, except that "HOST and all containers share the SAME /proc/kmsg, meaning kernel syslog information are scrambled (useless)".

What Jean-Marc was discovering is that the kernel log is one of the global resources that is not wrapped in a namespace. He went on to note another ill-effect: "I have in iptables, reject packet logging on the HOST, [but as soon as] rsyslog is started on one container, I can't see my reject packet log any more." In other words, starting a syslog daemon on the host or any container sucks up all of the kernel log messages produced on the host or in any container. The point here about iptables is particularly relevant: the inability to isolate kernel log messages from iptables is a significant practical problem when trying to employ the network namespaces facility that the kernel already provides.

In response to Jean-Marc's question about how the problem could be fixed, Serge Hallyn replied:

Well, the results of do_syslog() should be containerized. Kernel messages (oopses for instance) should always go to the initial container. Shouldn't be hard to do, but the question is what do we tie it to? User namespace? Network namespace? … I'm tempted to say userns makes the most sense - if you start a new userns you likely always want private syslog, whereas with netns and pidns you may not.

do_syslog() is the kernel function that encapsulates the main logic of the syslog(2) system call. That system call retrieves messages from the kernel log ring buffer (and performs a range of control operations on the log buffer) that is populated by messages created using the kernel's printk() function. Thus, though discussions on this topic have tended to use the term "syslog namespace", that is something of a misnomer: what is really meant is wrapping the kernel log resource in a namespace.

To avoid possible confusion, it is probably worth noting that the syslog(2) system call is a quite different thing from the syslog(3) library function, which writes messages to the UNIX domain datagram socket (/dev/log) from which the user-space syslog daemon (rsyslogd or similar) retrieves messages. (Because of this collision of names, the GNU C library exposes the syslog(2) system call under a quite different name: klogctl().) A picture helps clarify things:

[Interrelationship of
logging primitives]

First attempts at a solution

In the event, "containerizing" do_syslog() turned out to be more difficult than Serge thought. His first shot at addressing the problem (a "gross hack" to "provide each user namespace with its own syslog ring buffer") quickly uncovered a further difficulty: the kernel's printk() is sometimes called in contexts where there is no way to determine in which of the per-namespace ring buffers a message should be logged. For example, if the kernel is executing a network interrupt (to process an incoming network packet) and wants to log a message, that message should not be sent to the per-namespace kernel log of the interrupted process. Rather, the message should be sent to the kernel log associated with the network namespace for the network device; however, the kernel data structures provide no way to obtain a reference to that kernel log.

Jean-Marc himself also made an attempt at implementing a solution. However, Serge pointed out that Jean-Marc's patch suffered some of the same problems as his own earlier attempt. Serge went on to describe what he thought would be the correct solution, which would require the creation of a separate syslog namespace. His proposed solution can be paraphrased as follows:

  1. The core of vprintk_emit() (which contains most of implementation of the printk() function) should be moved into a new nsvprintk_emit() function that takes an argument that specifies a syslog namespace.

  2. vprintk_emit() would then become a wrapper around nsvprintk_emit() that specifies the "initial" syslog namespace (i.e., the syslog namespace of the host system).

  3. A namespace-aware version of printk(), called (say) nsprintk(), should be created. That function would take a syslog namespace argument and pass it to nsvprintk_emit().

  4. The kernel log ring buffer should be "containerized" as per Serge's initial patch. Thus each syslog namespace would have its own ring buffer, and syslog(2) would operate on the per-namespace ring buffer of the calling process.

  5. At call sites in the kernel code where it is not appropriate to use the syslog namespace of the currently executing process, calls to printk() should be replaced with calls to nsprintk() that pass a suitable syslog namespace argument.

Although Jean-Marc made a few more efforts to rework his patch in the following weeks, the effort ultimately petered out without much further comment or consensus on a solution. It seems that Serge and other kernel developers realized that the problem was more complex than first thought and they had neither the time to implement a solution themselves nor to help Jean-Marc toward implementing a solution.

The main difficulty lies in the last of the points above, and its solution was not really elaborated in Serge's mail. The kernel data structures and code need to be modified to add suitable hooks to handle the "no current process context problem"—the cases where printk() is called from a context in which the currently executing process can't be used to identify a suitable syslog namespace to which a message should be logged.

Restarting work on a solution

Work in this area then seems to have gone quiet for more than two years, until a few days ago when Serge proposed a new proof-of-concept patch set, pretty much along the lines he described two years earlier. His description of the patch noted that:

The syslog ns is tied to a user namespace. You must create a new user namespace before you can create a new sylog ns. The syslog ns is created through a new command (11) to the __NR_syslog system call.

Once a task enters a new syslog ns, it's "dmesg", "dmesg -c" and /dev/kmsg actions affect only itself, so that user-created syslog messages no longer are confusingly combined in the host's syslog.

In other words, Serge's patch provides isolation for the kernel log by implementing a new dedicated namespace for that purpose (rather than providing the isolation by attaching the implementation to one of the existing namespaces). Each syslog namespace instance would be tied to a particular user namespace.

Normally, new namespaces of each type are created by suitable flags to the clone() system call. Thus, for example, there are clone flags such as CLONE_NEWUTS and CLONE_NEWUSER. However, a while ago, the kernel developers realized that the flag space for clone() was exhausted. (Providing additional flag space was one of the motivations behind the proposal to add an eclone() system call, a proposal that was ultimately unsuccessful.) For this reason, Serge proposed instead to use a new command to the syslog() system call to create syslog namespace instances.

Serge went on to note:

"printk" itself always goes to the initial syslog_ns, and consoles belong only to the initial syslog_ns. However printks relating to a specific network namespace, for instance, can now be targeted to the syslog ns for the user ns which owns the network ns, aiding in debugging in a container.

Serge's patch would solve the "no current process context problem" as follows. As noted above, this case is handled by an nsprintk()-style function that takes an argument (of type struct syslog_ns *) that identifies the syslog namespace to which the log message should be sent. The value for that argument can be obtained via the struct net structure for the network namespace instance: in the current user namespace implementation (git tree), when a network namespace is created using clone(), a pointer to the corresponding user namespace instance of the caller is stored in the net structure. Serge's patch in turn provides a linkage from that user namespace structure to the corresponding syslog namespace.

Eric Biederman, the maintainer of the user namespace git tree, agreed with Serge's overall approach, but queried one particular point:

I am not a fan of how this ties into the user namespace. I would prefer closer or looser ties. The recursive reference count loop where a userns refers to a syslogns and that syslogns refers to the same userns is unpleasant.

In Serge's implementation, the syslog and user namespaces are maintained as separate structures, but, as the recursive pointers between the two namespace structures and the need to create a new user namespace before creating a syslog namespace indicate, instances of each namespace are not truly independent. In Eric's view then, the syslog and user namespace structures should either be more fully decoupled, or they should be much more tightly coupled.

Eric went on later to note that:

There is an argument to be made that syslog messages are the kind of security identifiers like uid, gids, and keys that should be part of a user namespace. I'm not fully convinced but there are some DOS attacks that would naturally prevent.

The discussion ultimately led Serge to conclude that the syslog resource should instead be grouped as part of the user namespace rather than as a separate namespace:

I can't really think of a good case for not putting the syslogns straight into the userns (i.e. not having a separate syslogns), so I'd say let's go that route.

Serge's patch seems to have inspired another group to try implementing syslog namespaces. A couple of days after Serge's patch, Rui Xiang posted some patches that he and his colleague Libo Chen had developed to implement similar functionality. Rui began by noting a couple of the obvious differences in their patch set:

In Serge's patch [...] syslog_namespace was tied to a user namespace. We add syslog_ns tied to nsproxy instead, and implement ns_printk in ip_table context.

We add syslog_namespace as a part of nsproxy, and a new flag CLONE_SYSLOG to unshare syslog area.

Using nsproxy is the conventional way of dealing with the namespaces associated with a process: it is a structure that contains pointers to structures describing each of the namespaces that a process is associated with. This contrasts with Serge's original approach, which hung the syslog namespace off the user namespace.

Rui's team also took advantage of a detail that Serge perhaps overlooked: there happens to be one spare bit in the flag space for clone() because the CLONE_STOPPED flag was removed several kernel releases ago. Therefore, Rui's team repurposed that bit. Normally, it would not be safe to recycle flag bits in this way, but the CLONE_STOPPED flag has a special history. It was initially proposed for use specifically in the NPTL threading implementation, but the final implementation abandoned the flag in favor of a different approach. As such, CLONE_STOPPED is likely never to have had serious user-space users.

Unsurprisingly, the overall approaches of the two patch sets have many similarities, but there are differences in details such as how a syslog namespace is associated with a struct net in order to solve the "no current process context problem".

Although kernel flame wars between competing implementations are what often make the biggest headlines in the online press, the subsequent exchange between Serge, Rui, and Libo demonstrated that life on developer mailing lists is usually more cordial. Serge asked:

I understand that user namespaces aren't 100% usable yet, but looking long term, is there a reason to have the syslog namespace separate from user namespace?

In response, Rui noted:

Actually we don't have strong preference. We'll think more about it. Hope we can make consensus with Eric.

That in turn led Serge to ask Rui and Libo if his patch set might suffice for their needs, with the gracious note that:

I'm not at all wedded to my patchset. I'm happy to go with something else entirely. My set was just a proof of concept.

There is one other notable difference in functionality between the two patch sets. In Serge's patch set, system consoles belonged (by intention) only to the initial syslog namespace, meaning that kernel log messages from other syslog namespace instances can't be displayed on consoles. By contrast, Rui and Libo's patches include consoles in the syslog namespace, so that kernel messages from syslog namespaces other than the initial namespace can be displayed on consoles. Rui and Libo would like this functionality in order to be able to obtain kernel log messages from containers when monitoring embedded devices that provide access to the console over a serial port.

The summary of the discussion is that there are useful pieces in both patches. Serge plans to revise his patch to merge the syslog namespace functionality into user namespaces, add the console functionality desired by Rui and Libo, and add some in-kernel uses of the namespace-aware printk() interface as a proof-of-concept for the implementation (as was done in the patches by Rui and Libo).

Concluding remarks

The history of the work to provide syslog namespaces (or as it might better be termed, namespace isolation for the kernel log) presents a microcosm of work on namespaces in general. As has often been the case, the implementation of namespaces often turns out to be surprisingly complex. Much of that complexity hinges on detailed questions of functionality (for example, the behavior of consoles in this case) and the question of whether resources should be grouped inside a new namespace or within an existing namespace. In the case of syslog namespaces, it looks like a number of decisions have been made; there will probably be a few more rounds of patches, but there seems to be general consensus on the direction forward. Thus, there is a reasonable chance that proper namespace isolation of kernel logging will appear in the kernel sometime around Linux 3.9 or soon afterward.


(Log in to post comments)

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 6, 2012 19:43 UTC (Thu) by mhelsley (subscriber, #11324) [Link]

It seems to me that all of this assumes only one namespace could be relevant to any given printk. I don't think that's a safe assumption -- it might be that multiple namespaces with differing user/syslog namespaces are relevant to the same printk. What should happen then?

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 7, 2012 2:22 UTC (Fri) by Beolach (subscriber, #77384) [Link]

Possibly w/ multiple nsprintk() calls, specifying all of the relevant namespaces (1 in each call). Assuming you can identify them... but hopefully that wouldn't be difficult (if it's not possible, then maybe it is a safe assumption that the message is only relevant to one namespace).

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 9, 2012 22:08 UTC (Sun) by nlucas (subscriber, #33793) [Link]

Couldn't this be simpler and more flexible by combining the structured syslog messages patch and an user-space filter?

Something like having a single kernel ring buffer (like now), that a single daemon would listen to, but have per namespace buffers that would be feed by that daemon according with user-space rules.

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 10, 2012 0:54 UTC (Mon) by dlang (✭ supporter ✭, #313) [Link]

That's a really good point, if the log is a JSON structure, add a tag section that tells you all the possible namespace info, and then you can have a program filter it and distribute it to one or more destinations.

I would say that this filtering should probably be happening in userspace, where the tools like rsyslog already support very extensive logic for doing this.

This includes the ability to deliver to additional copies of rsyslog that are running inside your namespace containers

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 13, 2012 5:38 UTC (Thu) by eternaleye (guest, #67051) [Link]

The problem there is information disclosure - there may well be various pieces of information you don't want a container to see, and a userspace filter inside of said container can't do that.

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 13, 2012 5:41 UTC (Thu) by eternaleye (guest, #67051) [Link]

Ah, looks like I misread your statement, and you were suggesting that the initial namespace's syslog daemon would redispatch to containers. That makes more sense, but it also increases the maintenance burden - you need a way of mapping the tags to the delivery endpoint of the syslog daemon in the container, and it still will be apparent that the source is an external syslog daemon (tcp, udp, what have you) rather than fulfilling the stated goal of the container appearing to be a fully isolated system.

Stepping closer to practical containers: "syslog" namespaces

Posted Dec 13, 2012 9:43 UTC (Thu) by nlucas (subscriber, #33793) [Link]

You will already need to do that if you do it on the kernel, and it would be more difficult to modify the mapping (and puts policy on the kernel). The only difference is the kernel now doesn't need to have a default policy (it doesn't break anything because it's a new feature). The default can just be that all containers start in the same "syslog" namespace (like now).

As for not being transparent, I'm talking of a ring-buffer per container, so it's fully transparent for the applications. dmesg will output the namespace ring-buffer contents.

The initial namespace "syslog" daemon would dispatch messages from the unfiltered ring-buffer to the other namespaces ring-buffers. The difference would be that the syslog(2) syscall would be namespace aware and return messages only from the current namespace ring-buffer.

Stepping closer to practical containers: "syslog" namespaces

Posted Jan 31, 2013 4:44 UTC (Thu) by satish6541 (guest, #89092) [Link]

Hi,
I am currently using chroot jails and network namespace combination. I am executing my syslog in both my root as well as in my chroot environment. In this case should the kernel logging messages be jail specific? I can sometimes see the kernel messages inside my chroot jail, and most of the times not. (for example making the interface up or down).

Interfaces mapped to a namespace of a chroot jail, the kernel logging happens in the default root and not in the chroot. (Is this behaviour in conjunction to the problem mentioned in your article? ).

Please let me know your views.

Regards,


Stepping closer to practical containers: "syslog" namespaces

Posted Jun 30, 2013 14:07 UTC (Sun) by jeanmarc (guest, #91641) [Link]

Kernel 3.9.4
Still same problem, container iptable logs go to host OR container, seems to be a round robin distribution, same for HOST iptable logs (HOST syslog reaching container! this NOT good at all). Note: openvz kernel do not have this issue for a very long time now (I would say 6 years).

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds