Brief items
The current development kernel is 3.7-rc8, reluctantly
released by Linus on December 3.
"I really didn't want it to come to this, but I was uncomfortable
doing the 3.7 release yesterday due to last-minute issues, and decided to
sleep on it. And today, I ended up even *less* comfortable about it due to
the resurrection of a kswapd issue, so I decided that I'm going to do
another -rc after all." As he points out, that implies that the 3.8
merge window will run close to the holidays.
Stable updates:
3.6.9,
3.4.21 and 3.0.54 were released on December 3.
Meanwhile, 3.2.35 is in the review process;
its release can be expected at any time.
I'm all in favour of "whence", which is indeed the name of that
lseek argument - since mediaeval times I believe.
It's good to have words like that in the kernel source: while
you're in the mood, please see if you can find good homes for
"whither" and "thrice" and "widdershins".
—
Hugh Dickins
Took vacation last week, spent most of it doing userspace coding.
It was joyous.
—
Rusty Russell
If yes, yet again this illustrates why the use of atomic types
leads people down the path of believing that their code somehow
becomes magically safe through the use of this smoke-screen. IMHO,
every use of atomic_t must be questioned and carefully analysed
before it gets into the kernel - many are buggy through assumptions
that atomic_t buys you something magic.
—
Russell King
By Jonathan Corbet
December 5, 2012
Last week's edition included
an article on the
addition of the FALLOC_FL_NO_HIDE_STALE flag to the
fallocate() system call. Some developers, objecting to the
patch and the way it got into the kernel, had called for it to be reverted
before the 3.7 release went final. At the time, Linus had not made any
remarks in the discussion or indicated whether he would accept the revert.
That situation changed after Linus was prompted by Martin Steigerwald. His response was clear enough:
If you want something reverted, you show me the *technical* reason
for it. Not the "ooh, I'm so annoyed by how this was done" reason
for it.
And if your little feelings got hurt, get your mommy to tuck you
in, don't email me about it. Because I'm not exactly known for my
deep emotional understanding and supportive personality, am I?
There were some technical reasons offered in the discussion, along with the
more general process-oriented complaints. But it seems clear that Linus
has not found that discussion convincing. So, in the absence of a surprise
from somewhere, it seems that the new fallocate() flag will remain
for the 3.7 release, at which point it will become part of the kernel's
user-space ABI.
Kernel development news
By Michael Kerrisk
December 5, 2012
The abstract goal of containers is, in
effect, to provide a group of processes with the illusion that they
are the only processes on the system. When fully implemented, this feature
has the potential to realize many practical benefits, such as light-weight
virtualization and checkpoint/restore.
In order to give the processes in a container the illusion that there
are no other processes on the system, various global system resources must
be wrapped in abstractions that make it appear that each container has its
own instance of the resources. This has been achieved by the addition of
"namespaces" for a number of global resources. Each namespace provides an
isolated view of a particular global resource to the set of processes that
are members of that namespace.
Step by step, more and more global resources have been wrapped in
namespaces, and before we look at another step in this path it's worth
reviewing the progress to date.
Namespaces so far
The first step
in the journey was mount namespaces, which can be used to provide a group
of processes with a private view of the mount points that make up the
filesystem hierarchy. Mount namespaces first appeared in the mainline
kernel in 2002, with the release of Linux 2.4.19. The clone()
flag used to create mount namespaces was given the rather generic name
CLONE_NEWNS for "new namespace", implying that no one was then
really considering the possibility that there might be other kinds of
namespaces; at that time, of course, containers were no more than a gleam
in the eyes of some developers.
However, as the concept of containers took hold, a number of other
namespaces have followed. Network
namespaces were added to provide a group of processes with a private
view of the network (network devices, IP addresses, IP routing tables, port
number space, and so on). PID namespaces
isolated the global "PID number space" resource, so that processes in
separate PID namespaces can have the same PIDs—in particular, each
namespace can have its own 'init' (PID 1), the "ancestor of all
processes". PID namespaces also allow techniques such as freezing the
processes in a container and then restoring them on another system while
maintaining the same PIDs.
Several other global resources have likewise been wrapped in
namespaces, so that there are also IPC
namespaces (initially implemented to isolate System V IPC identifiers
and later to isolate instances of the
virtual filesystems used in the implementation of POSIX message queues) and
UTS namespaces (which wrap the
nodename and domainname identifiers returned by uname(2)).
Work on one of the more complex namespaces, user namespaces, was started in about Linux
2.6.23 and seems to be edging towards
completion. When complete, user namespaces will allow per-namespace mappings
of user and group IDs, so that, for example, it will be possible for a process
to be root inside a container without having root privileges in the system
as a whole.
Of course, a Linux system has a large number of global resources, each
of which could conceivably be wrapped in a namespace. At the more extreme
end, for example, even a resource such as the system time could be wrapped,
so that different containers could maintain different concepts of the
time. (A time namespace was once proposed,
but the implementation was not merged.) The trick is to determine the
minimum set of resources that need to be wrapped for the practical
implementation of containers. (Of course, this "minimum set" may well grow
over time, as people develop new uses for containers.) A related question
is how those wrappings should be grouped so as to avoid an explosion of
namespaces that would increase application complexity. So, for example,
System V IPC and POSIX message queues could conceivably have been
wrapped in different namespaces, but the kernel developers concluded that
it makes practical sense to group them in a single "IPC" namespace.
The global kernel log problem
What is necessary for the practical implementation of containers
sometimes only becomes clear when one starts doing, well, practical
things. Thus it was that in early 2010 Jean-Marc Pigeon reported that he had written a small utility
to build containers using the clone() system call that worked
fine, except that "HOST and all containers share the SAME /proc/kmsg,
meaning kernel syslog information are scrambled (useless)".
What Jean-Marc was discovering is that the kernel log is one of the
global resources that is not wrapped in a namespace. He went on to note
another ill-effect: "I have in iptables, reject packet logging on the
HOST, [but as soon as] rsyslog is started on one container, I can't see my
reject packet log any more." In other words, starting a
syslog daemon on the host or any container sucks up all of the
kernel log messages produced on the host or in any container. The point
here about iptables is particularly relevant: the inability to
isolate kernel log messages from iptables is a significant
practical problem when trying to employ the network namespaces facility
that the kernel already provides.
In response to Jean-Marc's question about how the problem could be
fixed, Serge Hallyn replied:
Well, the results of do_syslog() should be containerized. Kernel messages (oopses for
instance) should always go to the initial container. Shouldn't be hard to
do, but the question is what do we tie it to? User namespace? Network
namespace? … I'm tempted to say userns makes the most sense - if
you start a new userns you likely always want private syslog, whereas with
netns and pidns you may not.
do_syslog() is the kernel function that encapsulates the main
logic of the syslog(2)
system call. That system call retrieves messages from the kernel log ring
buffer (and performs a range of control operations on the log buffer) that
is populated by messages created using the kernel's printk()
function. Thus, though discussions on this topic have tended to use the
term "syslog namespace", that is something of a misnomer: what is really
meant is wrapping the kernel log resource in a namespace.
To avoid possible confusion, it is probably worth noting that the
syslog(2) system call is a quite different thing from the syslog(3)
library function, which writes messages to the UNIX domain datagram socket
(/dev/log) from which the user-space syslog daemon
(rsyslogd or similar) retrieves messages. (Because of this
collision of names, the GNU C library exposes the syslog(2) system
call under a quite different name: klogctl().) A picture helps
clarify things:
First attempts at a solution
In the event, "containerizing" do_syslog() turned out to be
more difficult than Serge thought. His first
shot at addressing the problem (a "gross hack" to "provide each
user namespace with its own syslog ring buffer") quickly uncovered
a further difficulty: the kernel's
printk() is sometimes called in contexts where there is no way to
determine in which of the per-namespace ring buffers a message should be
logged. For example, if the kernel is executing a network interrupt (to
process an incoming network packet) and wants to log a message, that
message should not be sent to the per-namespace kernel log of the
interrupted process. Rather, the message should be sent to the kernel log
associated with the network namespace for the network device; however,
the kernel data structures provide no way to obtain a reference to that
kernel log.
Jean-Marc himself also made an attempt
at implementing a solution. However, Serge pointed out that Jean-Marc's patch suffered
some of the same problems as his own earlier attempt. Serge went on to
describe what he thought would be the correct solution, which would require
the creation of a separate syslog namespace. His proposed solution can be
paraphrased as follows:
- The core of vprintk_emit() (which contains most of the
implementation of the printk() function) should be moved into
a new nsvprintk_emit() function that takes an argument that specifies a
syslog namespace.
- vprintk_emit() would then become a wrapper around
nsvprintk_emit() that specifies the "initial" syslog namespace
(i.e., the syslog namespace of the host system).
- A namespace-aware version of printk(), called (say)
nsprintk(), should be created. That function would take a syslog
namespace argument and pass it to nsvprintk_emit().
- The kernel log ring buffer should be "containerized" as per Serge's
initial patch. Thus each syslog namespace would have its own ring buffer,
and syslog(2) would operate on the per-namespace ring buffer of
the calling process.
- At call sites in the kernel code where it is not appropriate to use the
syslog namespace of the currently executing process, calls to
printk() should be replaced with calls to nsprintk() that
pass a suitable syslog namespace argument.
Although Jean-Marc made a few more efforts to rework his patch in the
following weeks, the effort ultimately petered out without much further
comment or consensus on a solution. It seems that Serge and other kernel developers realized that the
problem was more complex than first thought and that they had time neither
to implement a solution themselves nor to help Jean-Marc toward
implementing one.
The main difficulty lies in the last of the points above, and its
solution was not really elaborated in Serge's mail. The kernel data
structures and code need to be modified to add suitable hooks to handle the
"no current process context problem"—the cases where
printk() is called from a context in which the currently executing
process can't be used to identify a suitable syslog namespace to which a
message should be logged.
Restarting work on a solution
Work in this area then seems to have gone quiet for more than two
years, until a few days ago when Serge proposed a new proof-of-concept patch set, pretty much
along the lines he described two years earlier. His description of the
patch noted that:
The syslog ns is tied to a user
namespace. You must create a new user namespace before you can create a
new sylog ns. The syslog ns is created through a new command (11) to
the __NR_syslog system call.
Once a task enters a new syslog ns, it's "dmesg", "dmesg -c" and /dev/kmsg
actions affect only itself, so that user-created syslog messages no longer
are confusingly combined in the host's syslog.
In other words, Serge's patch provides isolation for the kernel log by
implementing a new dedicated namespace for that purpose (rather than
providing the isolation by attaching the implementation to one of the
existing namespaces). Each syslog namespace instance would be tied to a
particular user namespace.
Normally, new namespaces of each type are created by suitable flags to
the clone() system call. Thus, for example, there are clone flags
such as CLONE_NEWUTS and CLONE_NEWUSER. However, a while
ago, the kernel developers realized that the flag space for
clone() was exhausted. (Providing additional flag space was one of
the motivations behind the proposal to add
an eclone() system call, a proposal that was ultimately
unsuccessful.) For this reason, Serge proposed instead to use a new command
to the syslog() system call to create syslog namespace instances.
Serge went on to note:
"printk" itself always goes
to the initial syslog_ns, and consoles belong only to the initial
syslog_ns. However printks relating to a specific network namespace, for
instance, can now be targeted to the syslog ns for the user ns which owns
the network ns, aiding in debugging in a container.
Serge's patch would solve the "no current process context problem" as
follows. As noted above, this case is handled by an
nsprintk()-style function that takes an argument (of type
struct syslog_ns *) that identifies the syslog namespace
to which the log message should be sent. The value for that argument can be
obtained via the struct net structure for the network
namespace instance: in the current user namespace implementation (git
tree), when a network namespace is created using clone(), a
pointer to the corresponding user namespace instance of the caller is
stored in the net structure. Serge's patch in turn provides a
linkage from that user namespace structure to the corresponding syslog
namespace.
Eric Biederman, the maintainer of the user namespace git tree, agreed with Serge's overall approach, but
queried one particular point:
I am not a fan of how this ties into the user namespace. I would prefer
closer or looser ties. The recursive reference count loop where a userns
refers to a syslogns and that syslogns refers to the same userns is
unpleasant.
In Serge's implementation, the syslog and user namespaces are
maintained as separate structures, but, as the recursive pointers between
the two namespace structures and the need to create a new user namespace
before creating a syslog namespace indicate, instances of each namespace
are not truly independent. In Eric's view then, the syslog and user
namespace structures should either be more fully decoupled, or they should
be much more tightly coupled.
Eric went on later to note that:
There is an argument to be made that syslog messages are the kind of
security identifiers like uid, gids, and keys that should be part of a user
namespace. I'm not fully convinced but there are some DOS attacks that
would naturally prevent.
The discussion ultimately led Serge to conclude that the syslog resource should
be grouped as part of the user namespace rather than as a separate
namespace:
I can't really think of a good case for not putting the syslogns straight
into the userns (i.e. not having a separate syslogns), so I'd say let's go
that route.
Serge's patch seems to have inspired another group to try implementing
syslog namespaces. A couple of days after Serge's patch, Rui Xiang posted some patches that he and his colleague
Libo Chen had developed to implement similar functionality. Rui began by
noting a couple of the obvious differences in their patch set:
In Serge's patch [...] syslog_namespace was tied to a user namespace. We add
syslog_ns tied to nsproxy instead, and implement ns_printk in ip_table
context.
We add syslog_namespace as a part of nsproxy, and a new flag
CLONE_SYSLOG to unshare syslog area.
Using nsproxy is the conventional way of dealing with the
namespaces associated with a process: it is a structure that contains
pointers to structures describing each of the namespaces that a process is
associated with. This contrasts with Serge's original approach, which hung
the syslog namespace off the user namespace.
Rui's team also took advantage of a detail that Serge perhaps
overlooked: there happens to be one spare bit in the flag space for
clone() because the CLONE_STOPPED flag was removed
several kernel releases ago. Therefore, Rui's team repurposed that
bit. Normally, it would not be safe to recycle flag bits in this way, but
the CLONE_STOPPED flag has a special history. It was initially
proposed for use specifically in the NPTL threading implementation, but the
final implementation abandoned the flag in favor of a different
approach. As such, CLONE_STOPPED is likely never to have had
serious user-space users.
Unsurprisingly, the overall approaches of the two patch sets have many
similarities, but there are differences in details such as how a syslog
namespace is associated with a struct net in order to solve
the "no current process context problem".
Although kernel flame wars between competing implementations are what
often make the biggest headlines in the online press, the subsequent
exchange between Serge, Rui, and Libo demonstrated that life on developer
mailing lists is usually more cordial. Serge asked:
I understand that user namespaces aren't 100% usable yet, but looking
long term, is there a reason to have the syslog namespace separate
from user namespace?
In response, Rui noted:
Actually we don't have strong preference. We'll think more about it. Hope
we can make consensus with Eric.
That in turn led Serge to ask Rui and
Libo if his patch set might suffice for their needs, with the gracious note
that:
I'm not at all wedded to my patchset. I'm happy to go with something else
entirely. My set was just a proof of concept.
There is one other notable difference in functionality between the two
patch sets. In Serge's patch set, system consoles belonged (by intention)
only to the initial syslog namespace, meaning that kernel log
messages from other syslog namespace instances can't be displayed on
consoles. By contrast, Rui and Libo's patches include consoles in the
syslog namespace, so that kernel messages from syslog namespaces other than
the initial namespace can be displayed on consoles. Rui and Libo would like
this functionality in order to be able to obtain kernel log messages from
containers when monitoring embedded devices that provide access to the
console over a serial port.
The summary of the discussion is that there are useful pieces in
both patches. Serge plans to revise his
patch to merge the syslog namespace functionality into user namespaces, add
the console functionality desired by Rui and Libo, and add some in-kernel
uses of the namespace-aware printk() interface as a
proof-of-concept for the implementation (as was done in the patches by Rui
and Libo).
Concluding remarks
The history of the work to provide syslog namespaces (or, as it might
better be termed, namespace isolation for the kernel log) presents a
microcosm of work on namespaces in general. As has often been the case, the
implementation of namespaces turns out to be surprisingly
complex. Much of that complexity hinges on detailed questions of
functionality (for example, the behavior of consoles in this case) and the
question of whether resources should be grouped inside a new namespace or
within an existing namespace. In the case of syslog namespaces, it looks
like a number of decisions have been made; there will probably be a few
more rounds of patches, but there seems to be general consensus on the
direction forward. Thus, there is a reasonable chance that proper namespace
isolation of kernel logging will appear in the kernel sometime around Linux
3.9 or soon afterward.
By Jonathan Corbet
December 5, 2012
The term "stable pages" refers to the concept that the system should not
modify the data in a page of memory while that page is being written out to
its backing store. Much of the time, writing new data to in-flight pages
is not actively
harmful; it just results in the writing of the newer data sooner than might
be expected. But sometimes, modification of in-flight pages can create
trouble; examples include hardware where data integrity features are in
use, higher-level RAID implementations, or filesystem-implemented
compression schemes. In those cases, unexpected data modification can
cause checksum failures or, possibly, data corruption.
To avoid these problems, the stable pages
feature was merged for the 3.0 development cycle. This relatively
simple patch set ensures that any thread trying to modify an
under-writeback page blocks until the pending write operation is complete.
This patch set, by Darrick Wong, appeared to solve the problem; by blocking
inopportune data modifications, potential problems were avoided and
everybody would be happy.
Except that not everybody was happy. In early 2012, some users started reporting performance problems associated with
stable pages. In retrospect, such reports are not entirely surprising;
any change that causes processes to block and wait for asynchronous events
is unlikely to make things go faster. In any case, the reported problems
were more severe than anybody expected, with multi-second stalls being
observed at times. As a result, some users (Google, for example) have added patches to
their kernels to disable the feature. The performance costs are too high,
and, in the absence of a use case like those described above, there is no
real advantage to using stable pages in the first place.
So now Darrick is back with a new patch set
aimed at improving this situation. The core idea is simple enough: a new
flag (BDI_CAP_STABLE_WRITES) is added to the
backing_dev_info structure used to describe a storage device. If
that flag is set, the memory management code will enforce stable pages as
is done in current kernels. Without the flag, though, attempts to modify a
page will not be forced to wait for any current writeback activity. So the
flag gives the ability to choose between a slow (but maybe safer) mode or a
higher-performance mode.
Much of the discussion around this patch set has focused on just how that
flag gets set. One possibility is that the driver for the low-level
storage device will turn on stable pages; that can happen, for example,
when hardware data integrity features are in use. Filesystem code could
also enable stable pages if, for example, it is compressing data
transparently as that data is written to disk. Thus far, things work fine:
if either the storage device or the filesystem implementation requests
stable pages, they will be enforced; otherwise
things will run in the faster mode.
The real question is whether the system administrator should be able to
change this setting. Initial versions of the patch gave complete control over
stable pages to the user by way of a sysfs attribute, but a number of
developers complained about that option. Neil Brown pointed out that, if the flag could change at
any time, he could never rely on it within the MD RAID code; stable pages
that could disappear without warning at any time might as well not exist at
all. So there was
little disagreement that users should never be able to turn off the
stable-pages flag. That left the question of whether they should be able
to enable the feature, even if neither the hardware nor the
filesystem needs it, presumably because it would make them feel safer
somehow. Darrick had left that capability in, saying:
I dislike the idea that if a program is dirtying pages that are
being written out, then I don't really know whether the disk will
write the before or after version. If the power goes out before
the inevitable second write, how do you know which version you get?
Sure would be nice if I could force on stable writes if I'm feeling
paranoid.
Once again, the prevailing opinion seemed to be that there is no actual
value provided to the user in that case, so there is no point in making the
flag user-settable in either direction. As a result, subsequent updates
from Darrick took that feature out.
Finally, there was some disagreement over how to handle the ext3
filesystem, which is capable of modifying journal pages during writeback
even when stable pages are enabled. Darrick's patch changed the
filesystem's behavior in a significant way: if the underlying device
indicates that stable pages are needed and the filesystem is to be mounted
in the data=ordered mode, the filesystem will complain and mount
it read-only. The idea was that, now that the kernel could determine that
a specific configuration was unsafe, it should refuse to operate in that
mode.
At this point, Neil returned to point out
that, with this behavior, he would not be able to set the "stable pages
required" flag in the MD RAID code. Any system running an ext3 filesystem
over an MD volume would break, and he doesn't want to deal with the
subsequent bug reports. Neil has requested a variant on the flag whereby
the storage level could request stable pages on an optional basis. If
stable pages are available, the RAID code can depend on that behavior to
avoid copying the data internally. But that code can still work without
stable pages (by copying the data, thus stabilizing it) as long as it knows
that stable pages are unavailable.
Thus far, no patches adding that feature have appeared;
Darrick did, however, post a patch set
aimed at simply fixing the ext3 problem. It works by changing the stable
page mechanism to not depend on the PG_writeback page flag;
instead, it uses a new flag called PG_stable. That allows the
journaling layer to mark its pages as being stable without making them look
like writeback pages, solving the problem. Comments from developers have
pointed out some issues with the patches, not the least of which is that
page flags are in extremely short supply. Using a flag to work around a
problem with a single, old filesystem may not survive the review process.
The end result is that, while the form of the solution to the stable page
performance issue is reasonably clear, there are still a few details to be
dealt with. There appears to be enough interest in fixing this problem
to get something worked out. Needless to say, that will not happen for the
3.8 development cycle, but having something in place for 3.9 looks like a
reasonable goal.
By Jonathan Corbet
December 5, 2012
The kernel's power management subsystem has become increasingly effective
over recent years, to the point that our CPU power management is said to be
second to none. But, while the kernel endeavors to minimize the power
consumed by a given workload, it lacks mechanisms to put an overall limit
on the amount of power consumed. The recently-announced
PowerClamp driver by Jacob Pan and Arjan van
de Ven is intended to change that situation on Intel processors.
Most users will never want to use PowerClamp. As a general rule,
when one has purchased hardware with a given computational capability, one
wants that full capability to be available when needed. But there are
situations where it makes sense to run a system below its full speed. Data
centers have power-consumption and cooling constraints that can argue
against running all systems flat-out all the time. Even the owner of an
individual laptop or handheld system may wish to ensure that its operating
temperature does not exceed a given value; an overly hot laptop can be
uncomfortable to work with, even if it is still working within its
specified temperature range. So there can be value in telling the system
to run slower at times.
The PowerClamp driver allows the system administrator to set a desired idle
percentage by way of a sysfs attribute. That percentage is capped at 50%
in the current implementation. Once a percentage has been set, the kernel
monitors the actual idle time for each processor in the system. Should a
processor's idle time fall below the desired idle percentage, a special
kernel thread
(called kidle_inject/N, where N is the number of the CPU
to which the thread is assigned) is created to take corrective
action.
That thread operates as a high-priority realtime process, so it is able to
respond quickly when needed. Its job is relatively simple: look at the amount
of idle time on its assigned CPU and calculate the difference from the
desired idle time. Then, periodically, the thread will run, disable the
clock tick, and force the CPU into a sleep state for the required amount
of time. The sleeping is done for a given number of jiffies, so
the sleep states tend to be relatively long — a necessary condition for an
effective reduction in power usage.
Naturally, the PowerClamp thread will continue to monitor actual idle time
as it operates, adjusting the amount of forced sleep time as needed. It
also monitors the amount of desired sleep time that is lost to interrupts.
Interrupts remain enabled during the forced sleep, so they can bring the
processor back to an operational state before the PowerClamp driver would
have otherwise done so. Over time, the amount of sleep time lost in this
manner is tracked; the driver will then attempt to compensate by increasing
the amount of forced sleep time to try to pull the CPU back to the original
idle time target.
By itself, PowerClamp can come close to achieving the desired level of idle
time on a system with a changing workload. Often, though, the real goal is
not idle time as such; instead, the purpose is to keep the system within a
given level of power consumption or a set of thermal limits. Doing that
will require the implementation of additional logic in user space. By
monitoring the parameter of interest, a user-space process can implement a
control loop that adjusts the desired level of idle time as needed. The
PowerClamp driver can respond relatively quickly to those changes, giving the
control process an effective tool for the management of the amount of power
used by the system.
The driver has been through a couple of revisions with little in the way of
substantive comments. This patch poses a relatively small risk to the
system, since it
does not do anything if the feature is not in use. It could thus conceivably
be ready for merging as soon as the 3.8 development cycle. Some more
information can be found in the documentation
file included with the patch.
Patches and updates
Miscellaneous
- Lucas De Marchi: kmod 12. (December 5, 2012)
Page editor: Jonathan Corbet