Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc8, reluctantly released by Linus on December 3. "I really didn't want it to come to this, but I was uncomfortable doing the 3.7 release yesterday due to last-minute issues, and decided to sleep on it. And today, I ended up even *less* comfortable about it due to the resurrection of a kswapd issue, so I decided that I'm going to do another -rc after all." As he points out, that implies that the 3.8 merge window will run close to the holidays.

Stable updates: 3.6.9, 3.4.21 and 3.0.54 were released on December 3. Meanwhile, 3.2.35 is in the review process; its release can be expected at any time.

Quotes of the week

I'm all in favour of "whence", which is indeed the name of that lseek argument - since mediaeval times I believe.

It's good to have words like that in the kernel source: while you're in the mood, please see if you can find good homes for "whither" and "thrice" and "widdershins".

Hugh Dickins

Took vacation last week, spent most of it doing userspace coding. It was joyous.
Rusty Russell

If yes, yet again this illustrates why the use of atomic types leads people down the path of believing that their code somehow becomes magically safe through the use of this smoke-screen. IMHO, every use of atomic_t must be questioned and carefully analysed before it gets into the kernel - many are buggy through assumptions that atomic_t buys you something magic.
Russell King

A FALLOC_FL_NO_HIDE_STALE followup

By Jonathan Corbet
December 5, 2012
Last week's edition included an article on the addition of the FALLOC_FL_NO_HIDE_STALE flag to the fallocate() system call. Some developers, objecting to the patch and the way it got into the kernel, had called for it to be reverted before the 3.7 release went final. At the time, Linus had not made any remarks in the discussion or indicated whether he would accept the revert.

That situation changed after Linus was prompted by Martin Steigerwald. His response was clear enough:

If you want something reverted, you show me the *technical* reason for it. Not the "ooh, I'm so annoyed by how this was done" reason for it.

And if your little feelings got hurt, get your mommy to tuck you in, don't email me about it. Because I'm not exactly known for my deep emotional understanding and supportive personality, am I?

There were some technical reasons offered in the discussion, along with the more general process-oriented complaints. But it seems clear that Linus has not found that discussion convincing. So, in the absence of a surprise from somewhere, it seems that the new fallocate() flag will remain for the 3.7 release, at which point it will become part of the kernel's user-space ABI.

Kernel development news

Stepping closer to practical containers: "syslog" namespaces

By Michael Kerrisk
December 5, 2012

The abstract goal of containers is, in effect, to provide a group of processes with the illusion that they are the only processes on the system. When fully implemented, this feature has the potential to realize many practical benefits, such as light-weight virtualization and checkpoint/restore.

In order to give the processes in a container the illusion that there are no other processes on the system, various global system resources must be wrapped in abstractions that make it appear that each container has its own instance of the resources. This has been achieved by the addition of "namespaces" for a number of global resources. Each namespace provides an isolated view of a particular global resource to the set of processes that are members of that namespace.

Step by step, more and more global resources have been wrapped in namespaces, and before we look at another step in this path it's worth reviewing the progress to date.

Namespaces so far

The first step in the journey was mount namespaces, which can be used to provide a group of processes with a private view of the mount points that make up the filesystem hierarchy. Mount namespaces first appeared in the mainline kernel in 2002, with the release of Linux 2.4.19. The clone() flag used to create mount namespaces was given the rather generic name CLONE_NEWNS for "new namespace", implying that no one was then really considering the possibility that there might be other kinds of namespaces; at that time, of course, containers were no more than a gleam in the eyes of some developers.

However, as the concept of containers took hold, a number of other namespaces have followed. Network namespaces were added to provide a group of processes with a private view of the network (network devices, IP addresses, IP routing tables, port number space, and so on). PID namespaces isolated the global "PID number space" resource, so that processes in separate PID namespaces can have the same PIDs—in particular, each namespace can have its own 'init' (PID 1), the "ancestor of all processes". PID namespaces also allow techniques such as freezing the processes in a container and then restoring them on another system while maintaining the same PIDs.

Several other global resources have likewise been wrapped in namespaces, so that there are also IPC namespaces (initially implemented to isolate System V IPC identifiers and later to isolate instances of the virtual filesystems used in the implementation of POSIX message queues) and UTS namespaces (which wrap the nodename and domainname identifiers returned by uname(2)). Work on one of the more complex namespaces, user namespaces, was started in about Linux 2.6.23 and seems to be edging towards completion. When complete, user namespaces will allow per-namespace mappings of user and group IDs, so that, for example, it will be possible for a process to be root inside a container without having root privileges in the system as a whole.

Of course, a Linux system has a large number of global resources, each of which could conceivably be wrapped in a namespace. At the more extreme end, for example, even a resource such as the system time could be wrapped, so that different containers could maintain different concepts of the time. (A time namespace was once proposed, but the implementation was not merged.) The trick is to determine the minimum set of resources that need to be wrapped for the practical implementation of containers. (Of course, this "minimum set" may well grow over time, as people develop new uses for containers.) A related question is how those wrappings should be grouped so as to avoid an explosion of namespaces that would increase application complexity. So, for example, System V IPC and POSIX message queues could conceivably have been wrapped in different namespaces, but the kernel developers concluded that it makes practical sense to group them in a single "IPC" namespace.

The global kernel log problem

What is necessary for the practical implementation of containers sometimes only becomes clear when one starts doing, well, practical things. Thus it was that, in early 2010, Jean-Marc Pigeon reported that he had written a small utility to build containers using the clone() system call that worked fine, except that "HOST and all containers share the SAME /proc/kmsg, meaning kernel syslog information are scrambled (useless)".

What Jean-Marc was discovering is that the kernel log is one of the global resources that is not wrapped in a namespace. He went on to note another ill-effect: "I have in iptables, reject packet logging on the HOST, [but as soon as] rsyslog is started on one container, I can't see my reject packet log any more." In other words, starting a syslog daemon on the host or any container sucks up all of the kernel log messages produced on the host or in any container. The point here about iptables is particularly relevant: the inability to isolate kernel log messages from iptables is a significant practical problem when trying to employ the network namespaces facility that the kernel already provides.

In response to Jean-Marc's question about how the problem could be fixed, Serge Hallyn replied:

Well, the results of do_syslog() should be containerized. Kernel messages (oopses for instance) should always go to the initial container. Shouldn't be hard to do, but the question is what do we tie it to? User namespace? Network namespace? … I'm tempted to say userns makes the most sense - if you start a new userns you likely always want private syslog, whereas with netns and pidns you may not.

do_syslog() is the kernel function that encapsulates the main logic of the syslog(2) system call. That system call retrieves messages from the kernel log ring buffer (and performs a range of control operations on the log buffer) that is populated by messages created using the kernel's printk() function. Thus, though discussions on this topic have tended to use the term "syslog namespace", that is something of a misnomer: what is really meant is wrapping the kernel log resource in a namespace.

To avoid possible confusion, it is probably worth noting that the syslog(2) system call is a quite different thing from the syslog(3) library function, which writes messages to the UNIX domain datagram socket (/dev/log) from which the user-space syslog daemon (rsyslogd or similar) retrieves messages. (Because of this collision of names, the GNU C library exposes the syslog(2) system call under a quite different name: klogctl().) A picture helps clarify things:

[Interrelationship of logging primitives]
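The distinction between the two interfaces can be made concrete with a small user-space sketch. It assumes a Linux system with a C library that exports klogctl(); the helper names kernel_log_buffer_size() and log_to_daemon() are invented for this illustration. The command number 10 (SYSLOG_ACTION_SIZE_BUFFER) comes from the syslog(2) manual page. On systems where the call is missing or not permitted, the first helper simply returns None.

```python
import ctypes
import syslog

# Command number documented in the syslog(2) manual page.
SYSLOG_ACTION_SIZE_BUFFER = 10


def kernel_log_buffer_size():
    """Ask the kernel for the size of its log ring buffer via klogctl().

    klogctl() is the C library's name for the syslog(2) system call.
    Returns the size in bytes, or None if the call is unavailable or
    the caller lacks permission.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        klogctl = libc.klogctl
    except (OSError, AttributeError, TypeError):
        return None  # no klogctl() here (not Linux, or unusual libc)
    klogctl.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int]
    klogctl.restype = ctypes.c_int
    size = klogctl(SYSLOG_ACTION_SIZE_BUFFER, None, 0)
    return size if size > 0 else None


def log_to_daemon(message):
    """Send a message with syslog(3).

    This goes to the user-space syslog daemon through /dev/log; it never
    touches the kernel log ring buffer queried above.
    """
    syslog.syslog(syslog.LOG_INFO, message)
```

The first helper talks to the kernel log; the second talks to the daemon. Only the first is affected by the namespace work described in this article.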

First attempts at a solution

In the event, "containerizing" do_syslog() turned out to be more difficult than Serge thought. His first shot at addressing the problem (a "gross hack" to "provide each user namespace with its own syslog ring buffer") quickly uncovered a further difficulty: the kernel's printk() is sometimes called in contexts where there is no way to determine in which of the per-namespace ring buffers a message should be logged. For example, if the kernel is executing a network interrupt (to process an incoming network packet) and wants to log a message, that message should not be sent to the per-namespace kernel log of the interrupted process. Rather, the message should be sent to the kernel log associated with the network namespace for the network device; however, the kernel data structures provide no way to obtain a reference to that kernel log.

Jean-Marc himself also made an attempt at implementing a solution. However, Serge pointed out that Jean-Marc's patch suffered some of the same problems as his own earlier attempt. Serge went on to describe what he thought would be the correct solution, which would require the creation of a separate syslog namespace. His proposed solution can be paraphrased as follows:

  1. The core of vprintk_emit() (which contains most of the implementation of the printk() function) should be moved into a new nsvprintk_emit() function that takes an argument that specifies a syslog namespace.

  2. vprintk_emit() would then become a wrapper around nsvprintk_emit() that specifies the "initial" syslog namespace (i.e., the syslog namespace of the host system).

  3. A namespace-aware version of printk(), called (say) nsprintk(), should be created. That function would take a syslog namespace argument and pass it to nsvprintk_emit().

  4. The kernel log ring buffer should be "containerized" as per Serge's initial patch. Thus each syslog namespace would have its own ring buffer, and syslog(2) would operate on the per-namespace ring buffer of the calling process.

  5. At call sites in the kernel code where it is not appropriate to use the syslog namespace of the currently executing process, calls to printk() should be replaced with calls to nsprintk() that pass a suitable syslog namespace argument.
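The first four steps of that proposal can be modeled in a few lines of user-space Python. This is only a sketch of the layering, not kernel code: the function names follow the paraphrase above, while the SyslogNamespace class, the tiny buffer capacity, and do_syslog_read_all() are assumptions made for illustration.

```python
from collections import deque


class SyslogNamespace:
    """Hypothetical stand-in for a per-namespace kernel log ring buffer."""

    def __init__(self, name, capacity=4):
        self.name = name
        # A bounded deque discards its oldest entry when full, which is
        # how a ring buffer behaves once it wraps.
        self.ring = deque(maxlen=capacity)


# The "initial" syslog namespace, i.e. that of the host system.
INIT_SYSLOG_NS = SyslogNamespace("init")


def nsvprintk_emit(ns, fmt, *args):
    """Step 1: the core logging routine, with an explicit namespace."""
    ns.ring.append(fmt % args)


def vprintk_emit(fmt, *args):
    """Step 2: a wrapper that always targets the initial namespace."""
    nsvprintk_emit(INIT_SYSLOG_NS, fmt, *args)


def printk(fmt, *args):
    """Existing callers are unchanged and keep logging to the host."""
    vprintk_emit(fmt, *args)


def nsprintk(ns, fmt, *args):
    """Step 3: the namespace-aware variant of printk()."""
    nsvprintk_emit(ns, fmt, *args)


def do_syslog_read_all(ns):
    """Step 4: a read sees only the caller's own ring buffer."""
    return list(ns.ring)
```

With a second SyslogNamespace standing in for a container, a printk() call shows up only in INIT_SYSLOG_NS, and an nsprintk() call shows up only in the namespace that was passed to it. Step 5, choosing the right namespace at each call site, is the part this model leaves out.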

Although Jean-Marc made a few more efforts to rework his patch in the following weeks, the effort ultimately petered out without much further comment or consensus on a solution. It seems that Serge and other kernel developers realized that the problem was more complex than first thought, and that they had time neither to implement a solution themselves nor to help Jean-Marc toward one.

The main difficulty lies in the last of the points above, and its solution was not really elaborated in Serge's mail. The kernel data structures and code need to be modified to add suitable hooks to handle the "no current process context problem"—the cases where printk() is called from a context in which the currently executing process can't be used to identify a suitable syslog namespace to which a message should be logged.

Restarting work on a solution

Work in this area then seems to have gone quiet for more than two years, until a few days ago when Serge proposed a new proof-of-concept patch set, pretty much along the lines he described two years earlier. His description of the patch noted that:

The syslog ns is tied to a user namespace. You must create a new user namespace before you can create a new sylog ns. The syslog ns is created through a new command (11) to the __NR_syslog system call.

Once a task enters a new syslog ns, it's "dmesg", "dmesg -c" and /dev/kmsg actions affect only itself, so that user-created syslog messages no longer are confusingly combined in the host's syslog.

In other words, Serge's patch provides isolation for the kernel log by implementing a new dedicated namespace for that purpose (rather than providing the isolation by attaching the implementation to one of the existing namespaces). Each syslog namespace instance would be tied to a particular user namespace.

Normally, new namespaces of each type are created by suitable flags to the clone() system call. Thus, for example, there are clone flags such as CLONE_NEWUTS and CLONE_NEWUSER. However, a while ago, the kernel developers realized that the flag space for clone() was exhausted. (Providing additional flag space was one of the motivations behind the proposal to add an eclone() system call, a proposal that was ultimately unsuccessful.) For this reason, Serge proposed instead to use a new command to the syslog() system call to create syslog namespace instances.

Serge went on to note:

"printk" itself always goes to the initial syslog_ns, and consoles belong only to the initial syslog_ns. However printks relating to a specific network namespace, for instance, can now be targeted to the syslog ns for the user ns which owns the network ns, aiding in debugging in a container.

Serge's patch would solve the "no current process context problem" as follows. As noted above, this case is handled by an nsprintk()-style function that takes an argument (of type struct syslog_ns *) that identifies the syslog namespace to which the log message should be sent. The value for that argument can be obtained via the struct net structure for the network namespace instance: in the current user namespace implementation (git tree), when a network namespace is created using clone(), a pointer to the corresponding user namespace instance of the caller is stored in the net structure. Serge's patch in turn provides a linkage from that user namespace structure to the corresponding syslog namespace.
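The chain of pointers described here can be sketched as a toy model. The class and field names (Net.user_ns, UserNamespace.syslog_ns) are assumptions chosen to mirror the description; the real kernel structures carry far more state, and log_dropped_packet() is a hypothetical call site.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyslogNs:
    """Toy syslog namespace: just a list of logged messages."""
    messages: List[str] = field(default_factory=list)


@dataclass
class UserNamespace:
    # Linkage from the user namespace to its syslog namespace
    # (field name assumed for this sketch).
    syslog_ns: SyslogNs = field(default_factory=SyslogNs)


@dataclass
class Net:
    # Pointer to the user namespace of the process that created this
    # network namespace, recorded at creation time.
    user_ns: UserNamespace


def nsprintk(ns, message):
    """Log into an explicitly chosen syslog namespace."""
    ns.messages.append(message)


def log_dropped_packet(net, detail):
    """Hypothetical call site running without a usable current process.

    The target log is found by walking net -> user_ns -> syslog_ns,
    never by looking at whichever process happened to be interrupted.
    """
    nsprintk(net.user_ns.syslog_ns, "DROP " + detail)
```

A packet logged against a container's Net object lands in that container's log and nowhere else, regardless of which process was running when the packet arrived.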

Eric Biederman, the maintainer of the user namespace git tree, agreed with Serge's overall approach, but queried one particular point:

I am not a fan of how this ties into the user namespace. I would prefer closer or looser ties. The recursive reference count loop where a userns refers to a syslogns and that syslogns refers to the same userns is unpleasant.

In Serge's implementation, the syslog and user namespaces are maintained as separate structures, but, as the recursive pointers between the two namespace structures and the need to create a new user namespace before creating a syslog namespace indicate, instances of each namespace are not truly independent. In Eric's view then, the syslog and user namespace structures should either be more fully decoupled, or they should be much more tightly coupled.

Eric went on later to note that:

There is an argument to be made that syslog messages are the kind of security identifiers like uid, gids, and keys that should be part of a user namespace. I'm not fully convinced but there are some DOS attacks that would naturally prevent.

The discussion ultimately led Serge to conclude that the syslog resource should instead be grouped as part of the user namespace rather than as a separate namespace:

I can't really think of a good case for not putting the syslogns straight into the userns (i.e. not having a separate syslogns), so I'd say let's go that route.

Serge's patch seems to have inspired another group to try implementing syslog namespaces. A couple of days after Serge's patch, Rui Xiang posted some patches that he and his colleague Libo Chen had developed to implement similar functionality. Rui began by noting a couple of the obvious differences in their patch set:

In Serge's patch [...] syslog_namespace was tied to a user namespace. We add syslog_ns tied to nsproxy instead, and implement ns_printk in ip_table context.

We add syslog_namespace as a part of nsproxy, and a new flag CLONE_SYSLOG to unshare syslog area.

Using nsproxy is the conventional way of dealing with the namespaces associated with a process: it is a structure that contains pointers to structures describing each of the namespaces that a process is associated with. This contrasts with Serge's original approach, which hung the syslog namespace off the user namespace.
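As a rough illustration of the nsproxy approach, the sketch below models the structure as a record with one reference per namespace type. The first five field names match the kernel's struct nsproxy of this era; the syslog_ns member and the clone_with_new_syslog() helper are assumptions standing in for what Rui and Libo's patches would add, and namespaces are represented by plain strings.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NsProxy:
    """Simplified model of nsproxy: one reference per namespace type."""
    uts_ns: str
    ipc_ns: str
    mnt_ns: str
    pid_ns: str
    net_ns: str
    syslog_ns: str  # hypothetical new member for the kernel log


def clone_with_new_syslog(parent, new_syslog_ns):
    """Model a clone() that unshares only the syslog namespace.

    Every other namespace reference is inherited from the parent.
    """
    return replace(parent, syslog_ns=new_syslog_ns)
```

In this arrangement the syslog namespace is a peer of the others and can be unshared independently, whereas in Serge's original scheme it could only be reached by way of a user namespace.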

Rui's team also took advantage of a detail that Serge perhaps overlooked: there happens to be one spare bit in the flag space for clone() because the CLONE_STOPPED flag was removed several kernel releases ago. Therefore, Rui's team repurposed that bit. Normally, it would not be safe to recycle flag bits in this way, but the CLONE_STOPPED flag has a special history. It was initially proposed for use specifically in the NPTL threading implementation, but the final implementation abandoned the flag in favor of a different approach. As such, CLONE_STOPPED is likely never to have had serious user-space users.

Unsurprisingly, the overall approaches of the two patch sets have many similarities, but there are differences in details such as how a syslog namespace is associated with a struct net in order to solve the "no current process context problem".

Although kernel flame wars between competing implementations are what often make the biggest headlines in the online press, the subsequent exchange between Serge, Rui, and Libo demonstrated that life on developer mailing lists is usually more cordial. Serge asked:

I understand that user namespaces aren't 100% usable yet, but looking long term, is there a reason to have the syslog namespace separate from user namespace?

In response, Rui noted:

Actually we don't have strong preference. We'll think more about it. Hope we can make consensus with Eric.

That in turn led Serge to ask Rui and Libo if his patch set might suffice for their needs, with the gracious note that:

I'm not at all wedded to my patchset. I'm happy to go with something else entirely. My set was just a proof of concept.

There is one other notable difference in functionality between the two patch sets. In Serge's patch set, system consoles belonged (by intention) only to the initial syslog namespace, meaning that kernel log messages from other syslog namespace instances can't be displayed on consoles. By contrast, Rui and Libo's patches include consoles in the syslog namespace, so that kernel messages from syslog namespaces other than the initial namespace can be displayed on consoles. Rui and Libo would like this functionality in order to be able to obtain kernel log messages from containers when monitoring embedded devices that provide access to the console over a serial port.

The summary of the discussion is that there are useful pieces in both patches. Serge plans to revise his patch to merge the syslog namespace functionality into user namespaces, add the console functionality desired by Rui and Libo, and add some in-kernel uses of the namespace-aware printk() interface as a proof-of-concept for the implementation (as was done in the patches by Rui and Libo).

Concluding remarks

The history of the work to provide syslog namespaces (or as it might better be termed, namespace isolation for the kernel log) presents a microcosm of work on namespaces in general. As has often been the case, the implementation of a namespace turns out to be surprisingly complex. Much of that complexity hinges on detailed questions of functionality (for example, the behavior of consoles in this case) and the question of whether resources should be grouped inside a new namespace or within an existing namespace. In the case of syslog namespaces, it looks like a number of decisions have been made; there will probably be a few more rounds of patches, but there seems to be general consensus on the direction forward. Thus, there is a reasonable chance that proper namespace isolation of kernel logging will appear in the kernel sometime around Linux 3.9 or soon afterward.

Optimizing stable pages

By Jonathan Corbet
December 5, 2012
The term "stable pages" refers to the concept that the system should not modify the data in a page of memory while that page is being written out to its backing store. Much of the time, writing new data to in-flight pages is not actively harmful; it just results in the writing of the newer data sooner than might be expected. But sometimes, modification of in-flight pages can create trouble; examples include hardware where data integrity features are in use, higher-level RAID implementations, or filesystem-implemented compression schemes. In those cases, unexpected data modification can cause checksum failures or, possibly, data corruption.

To avoid these problems, the stable pages feature was merged for the 3.0 development cycle. This relatively simple patch set, by Darrick Wong, ensures that any thread trying to modify an under-writeback page blocks until the pending write operation is complete. It appeared to solve the problem; by blocking inopportune data modifications, potential problems were avoided and everybody would be happy.

Except that not everybody was happy. In early 2012, some users started reporting performance problems associated with stable pages. In retrospect, such reports are not entirely surprising; any change that causes processes to block and wait for asynchronous events is unlikely to make things go faster. In any case, the reported problems were more severe than anybody expected, with multi-second stalls being observed at times. As a result, some users (Google, for example) have added patches to their kernels to disable the feature. The performance costs are too high, and, in the absence of a use case like those described above, there is no real advantage to using stable pages in the first place.

So now Darrick is back with a new patch set aimed at improving this situation. The core idea is simple enough: a new flag (BDI_CAP_STABLE_WRITES) is added to the backing_dev_info structure used to describe a storage device. If that flag is set, the memory management code will enforce stable pages as is done in current kernels. Without the flag, though, attempts to write a page will not be forced to wait for any current writeback activity. So the flag gives the ability to choose between a slow (but maybe safer) mode or a higher-performance mode.

Much of the discussion around this patch set has focused on just how that flag gets set. One possibility is that the driver for the low-level storage device will turn on stable pages; that can happen, for example, when hardware data integrity features are in use. Filesystem code could also enable stable pages if, for example, it is compressing data transparently as that data is written to disk. Thus far, things work fine: if either the storage device or the filesystem implementation requests stable pages, they will be enforced; otherwise things will run in the faster mode.
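The decision being described is small enough to model directly. In the sketch below, only the flag name BDI_CAP_STABLE_WRITES comes from the patch set; its numeric value, the two classes, and the helper functions are assumptions made for illustration.

```python
# Flag name from the patch set; the value here is arbitrary.
BDI_CAP_STABLE_WRITES = 0x1


class BackingDevInfo:
    """Toy version of the structure describing a storage device."""

    def __init__(self):
        self.capabilities = 0


class Page:
    """Toy page that only tracks whether writeback is in progress."""

    def __init__(self):
        self.under_writeback = False


def require_stable_writes(bdi):
    """Called by a driver or filesystem that needs stable pages."""
    bdi.capabilities |= BDI_CAP_STABLE_WRITES


def must_wait_before_modifying(page, bdi):
    """Return True if a writer has to block until writeback completes."""
    if not page.under_writeback:
        return False  # nothing in flight, so nothing to wait for
    return bool(bdi.capabilities & BDI_CAP_STABLE_WRITES)
```

Because the flag is only ever set by the storage driver or the filesystem, a page under writeback forces a wait only on devices where somebody actually depends on the data staying put.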

The real question is whether the system administrator should be able to change this setting. Initial versions of the patch gave complete control over stable pages to the user by way of a sysfs attribute, but a number of developers complained about that option. Neil Brown pointed out that, if the flag could change at any time, he could never rely on it within the MD RAID code; stable pages that could disappear without warning at any time might as well not exist at all. So there was little disagreement that users should never be able to turn off the stable-pages flag. That left the question of whether they should be able to enable the feature, even if neither the hardware nor the filesystem needs it, presumably because it would make them feel safer somehow. Darrick had left that capability in, saying:

I dislike the idea that if a program is dirtying pages that are being written out, then I don't really know whether the disk will write the before or after version. If the power goes out before the inevitable second write, how do you know which version you get? Sure would be nice if I could force on stable writes if I'm feeling paranoid.

Once again, the prevailing opinion seemed to be that there is no actual value provided to the user in that case, so there is no point in making the flag user-settable in either direction. As a result, subsequent updates from Darrick took that feature out.

Finally, there was some disagreement over how to handle the ext3 filesystem, which is capable of modifying journal pages during writeback even when stable pages are enabled. Darrick's patch changed the filesystem's behavior in a significant way: if the underlying device indicates that stable pages are needed and the filesystem is to be mounted in the data=ordered mode, the filesystem will complain and mount it read-only. The idea was that, now that the kernel could determine that a specific configuration was unsafe, it should refuse to operate in that mode.

At this point, Neil returned to point out that, with this behavior, he would not be able to set the "stable pages required" flag in the MD RAID code. Any system running an ext3 filesystem over an MD volume would break, and he doesn't want to deal with the subsequent bug reports. Neil has requested a variant on the flag whereby the storage level could request stable pages on an optional basis. If stable pages are available, the RAID code can depend on that behavior to avoid copying the data internally. But that code can still work without stable pages (by copying the data, thus stabilizing it) as long as it knows that stable pages are unavailable.
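The trade-off Neil describes can be illustrated with a short sketch. Everything in it is hypothetical: raid_stage_write() stands in for whatever the RAID code would do before computing parity or checksums, and a simple XOR plays the role of the checksum.

```python
def raid_stage_write(page, stable_pages_available):
    """Return the buffer from which checksums and I/O will be issued.

    With stable pages, the caller's buffer cannot change during
    writeback, so it is used in place. Without them, a private copy is
    taken so that later modifications cannot invalidate the checksum.
    """
    if stable_pages_available:
        return page
    return bytearray(page)


def xor_checksum(buf):
    """Stand-in checksum: XOR of all bytes in the buffer."""
    result = 0
    for byte in buf:
        result ^= byte
    return result
```

The copy costs memory and CPU time, which is why the RAID code would like to skip it; it can only do so if it knows for certain whether stable pages are in effect.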

Thus far, no patches adding that feature have appeared; Darrick did, however, post a patch set aimed at simply fixing the ext3 problem. It works by changing the stable page mechanism to not depend on the PG_writeback page flag; instead, it uses a new flag called PG_stable. That allows the journaling layer to mark its pages as being stable without making them look like writeback pages, solving the problem. Comments from developers have pointed out some issues with the patches, not the least of which is that page flags are in extremely short supply. Using a flag to work around a problem with a single, old filesystem may not survive the review process.

The end result is that, while the form of the solution to the stable page performance issue is reasonably clear, there are still a few details to be dealt with. There appears to be enough interest in fixing this problem to get something worked out. Needless to say, that will not happen for the 3.8 development cycle, but having something in place for 3.9 looks like a reasonable goal.

The PowerClamp driver

By Jonathan Corbet
December 5, 2012
The kernel's power management subsystem has become increasingly effective over recent years, to the point that our CPU power management is said to be second to none. But, while the kernel endeavors to minimize the power consumed by a given workload, it lacks mechanisms to put an overall limit on the amount of power consumed. The recently-announced PowerClamp driver by Jacob Pan and Arjan van de Ven is intended to change that situation on Intel processors.

Most users will never want to use PowerClamp. As a general rule, when one has purchased hardware with a given computational capability, one wants that full capability to be available when needed. But there are situations where it makes sense to run a system below its full speed. Data centers have power-consumption and cooling constraints that can argue against running all systems flat-out all the time. Even the owner of an individual laptop or handheld system may wish to ensure that its operating temperature does not exceed a given value; an overly hot laptop can be uncomfortable to work with, even if it is still working within its specified temperature range. So there can be value in telling the system to run slower at times.

The PowerClamp driver allows the system administrator to set a desired idle percentage by way of a sysfs attribute. That percentage is capped at 50% in the current implementation. Once a percentage has been set, the kernel monitors the actual idle time for each processor in the system. Should a processor's idle time fall below the desired idle percentage, a special kernel thread (called kidle_inject/N, where N is the number of the CPU to which the thread is assigned) is created to take corrective action.

That thread operates as a high-priority realtime process, so it is able to respond quickly when needed. Its job is relatively simple: look at the amount of idle time on its assigned CPU and calculate the difference from the desired idle time. Then, periodically, the thread will run, disable the clock tick, and force the CPU into a sleep state for the required amount of time. The sleeping is done for a given number of jiffies, so the sleep states tend to be relatively long — a necessary condition for an effective reduction in power usage.

Naturally, the PowerClamp thread will continue to monitor actual idle time as it operates, adjusting the amount of forced sleep time as needed. It also monitors the amount of desired sleep time that is lost to interrupts. Interrupts remain enabled during the forced sleep, so they can bring the processor back to an operational state before the PowerClamp driver would have otherwise done so. Over time, the amount of sleep time lost in this manner is tracked; the driver will then attempt to compensate by increasing the amount of forced sleep time to try to pull the CPU back to the original idle time target.
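The arithmetic behind that adjustment might look something like the following sketch. It is an assumption-laden model, not the driver's algorithm: only the 50% cap comes from the description above, and forced_idle_percent() with its parameters is invented for illustration.

```python
MAX_TARGET_PERCENT = 50  # cap on the desired idle percentage


def forced_idle_percent(target, natural_idle, lost_to_interrupts=0.0):
    """Estimate how much idle time to inject in the next interval.

    target:             desired idle percentage set by the administrator
    natural_idle:       idle percentage the CPU reached on its own
    lost_to_interrupts: injected idle percentage that interrupts cut
                        short during earlier intervals
    """
    target = min(target, MAX_TARGET_PERCENT)
    shortfall = max(0.0, target - natural_idle)
    if shortfall == 0:
        return 0.0  # the CPU is already idle enough
    # Ask for extra sleep to make up for what interrupts took away.
    return min(100.0, shortfall + lost_to_interrupts)
```

For example, a 30% target on a CPU that is naturally idle 10% of the time calls for 20% of injected idle time, or 25% if interrupts have been eating five points of the forced sleep.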

By itself, PowerClamp can come close to achieving the desired level of idle time on a system with a changing workload. Often, though, the real goal is not idle time as such; instead, the purpose is to keep the system within a given level of power consumption or a set of thermal limits. Doing that will require the implementation of additional logic in user space. By monitoring the parameter of interest, a user-space process can implement a control loop that adjusts the desired level of idle time as needed. The PowerClamp driver can respond relatively quickly to those changes, giving the control process an effective tool for the management of the amount of power used by the system.
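A user-space control loop of that kind could be as simple as a proportional controller. The sketch below is one possible shape for a single step of such a loop; the function name, the gain value, and the choice of a proportional rule are all assumptions, and the code deliberately leaves out how the measurement is read and how the result is written to the driver's sysfs attribute.

```python
def control_step(current_idle, measured, limit, gain=2.0, max_idle=50):
    """Compute the next idle percentage to request from the driver.

    current_idle: idle percentage currently requested
    measured:     latest reading of the monitored parameter
                  (for example, temperature or power draw)
    limit:        value the parameter should not exceed
    gain:         idle percentage points added per unit over the limit
    max_idle:     upper bound accepted by the driver
    """
    error = measured - limit
    new_idle = current_idle + gain * error
    # Clamp to the range the driver will accept.
    return max(0, min(max_idle, round(new_idle)))
```

Run periodically, a step like this raises the requested idle time while the reading is over the limit and lowers it again once there is headroom.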

The driver has been through a couple of revisions with little in the way of substantive comments. This patch poses a relatively small risk to the system, since it does not do anything if the feature is not in use. It could thus conceivably be ready for merging as soon as the 3.8 development cycle. Some more information can be found in the documentation file included with the patch.

Patches and updates

Miscellaneous

  • Lucas De Marchi: kmod 12 (December 5, 2012)

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds