The current development kernel is 3.7-rc1, released
on October 14. See the separate
article, below, for a summary of the final items added during the 3.7 merge
window.
Stable updates: 3.0.46, 3.4.14, 3.5.7
and 3.6.2 were released on
October 12. 3.5.7 is the end of the line for updates in the 3.5 series.
Apparently it is a bad idea to compose and send a patch while in a
C++ standards committee meeting where people are arguing about
— Paul McKenney
I believe the answer is that recent vulnerabilities have led us to
abandon the idea that we can trust anything in userspace, and
retreat to the kernel. The concept that the kernel is more secure
because we didn't include lots of crap seems to be a heretical
Long experience with file systems shows us that they are like fine
wine; they take time to mature. Whether you're talking about
ext2/3/4, btrfs, Sun's ZFS, Digital's ADVFS, IBM's JFS or GPFS
etc., and whether you're talking about file systems developed using
open source or more traditional corporate development processes, it
takes a minimum of 3-5 years and 50-200 PY's of effort to create a
fully production-ready file system from scratch.
— Ted Ts'o
I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been
— No-such-signoff-by: Hugh Dickins
The requirement for a FIPS 140-2 module is to disable the entire
module if any component of its self test or integrity test fails.
There are two solutions that were contemplated for disabling the
module: having a kind of global status of the crypto API that makes
it non-responsive in case of an integrity/self-test error. The
other solution is to simply terminate the entire kernel. As the
former one also will lead to a kernel failure eventually as many
parts of the kernel depend on the crypto API, the implementation of
the latter option was chosen.
— Stephan Mueller
don't try to load an unsigned module in FIPS mode
What is the proper amount of time to wait upon receiving an email
containing obviously incorrect statements about Linux kernel code
before sending a "you have got to be kidding" email response?
Should I just hope the sender realizes their foolishness on their
own and give them N hours to rescind the statement and fix up their
insane patch and resend it, thereby giving them a grace period? If
so, what is the proper value for N?
Or is it fair game to let loose and channel up the Torvalds-like
daemons within my keyboard, with the hope that it would actually do
some good and they would learn from their mistakes?
— Greg Kroah-Hartman
Al Viro has been busily reworking the creation of kernel threads, making
the code cleaner and less architecture-specific. As part of that exercise,
he has posted a lengthy document on how kernel thread creation works now
and the changes that he is making. Worth a read for those interested in
how this part of the core kernel works.
"Old implementation of kernel_thread() had been rather convoluted. In the
best case, it filled struct pt_regs according to its arguments and passed
them to do_fork(). The goal was to fool the code doing return from
syscall into *not* leaving the kernel mode, so that newborn would have
(after emptying its kernel stack) end up in a helper function with the
right values in registers. Running in kernel mode.
Kernel development news
Linus pulled a total of 10,409 non-merge changesets into the mainline
before closing the merge window for the 3.7 development cycle. That makes
3.7 one of the most active development cycles in recent history; only 3.2,
with 10,214 changesets in the merge window, comes close. Clearly, there is
a lot going on in the kernel development community.
Interestingly, Linus expressed some skepticism about some of this cycle's
work in the 3.7-rc1 announcement. For
example, the discussion on the 64-bit ARM patch
set concluded some time ago, but Linus came in with a late opinion:
[L]et's see how many years we'll need before the arm people do what
every single other 64-bit arch has ever done: merge back with the
32-bit code. As usual, people claimed that there were tons of
reasons why *this* time was different, and as usual it's almost
certainly going to be BS in the end, and a few years from now we'll
have big patches trying to merge it all back. But maybe it really
*was* different this time. Snicker.
He also expressed some grumpiness about the user-space API header file split — an enormous
set of patches that is only partially merged for 3.7. Header file
cleanups, he says, are just too much pain for the benefit that results, so
he will not consider any more of them in the future.
Grumbles notwithstanding, he pulled all of this work — and much more — for
3.7. The user-visible changes merged since last week's summary include:
- Support for signed kernel modules has
been merged. With this feature turned on, the kernel will refuse to
load modules that have not been signed with a recognized key. Among
other users, full support of UEFI secure boot requires this
capability. There is also a mode where unsigned modules will still be
loaded, but the kernel will be tainted in the process.
- NFS 4.1 support is no longer considered experimental.
- The MD RAID layer now supports TRIM ("discard") operations.
- New hardware support includes TI LM355x and LM3642 LED controllers,
Atmel At91 two-wire interface controllers (replaced driver), and
Renesas R-Car I2C controllers.
Changes visible to kernel developers include:
- The "UAPI disintegration" patch sets have been pulled into quite
a few subsystem trees, causing a lot of header file (and related)
churn. A fair amount of this work was deferred to 3.8 as well,
though, so this job is not yet done.
- The kerneldoc subsystem can now output documents in the HTML5 format.
- The kernel now has a generic cooling subsystem based on cpufreq; see
the documentation for (a few) details.
- It's worth noting that some kernel developers have expressed
grumpiness about the increase in build time caused by the addition of
the signed module feature. Anybody whose work involves doing lots of
fast kernel builds will probably want to turn that feature off.
At this point it is time to perform the final stabilization work on all
these changes. If things go according to the usual schedule, that should
result in the final 3.7 release sometime in early December.
Other than the merging of the server-side component of TCP Fast Open, one of the few
user-space API changes that has gone into the just-closed 3.7 merge window
is the addition of a new EPOLL_CTL_DISABLE operation for the
epoll_ctl() system call. It's interesting to look at this
operation as an illustration of the sometimes unforeseen complexities of
dealing with multithreaded applications; that examination is the subject of this
article. However, the addition of the EPOLL_CTL_DISABLE feature
highlights some common problems in the design of the APIs that the kernel
presents to user space. (To be clear: EPOLL_CTL_DISABLE is the
fix to a past design problem, not a design problem itself.) These
design problems will be the subject of a follow-on article next week.
Understanding the need for EPOLL_CTL_DISABLE requires an
understanding of several features of the epoll API. For those who are
unfamiliar with epoll, we begin with a high-level picture of how the API
works. We then look at the problem that EPOLL_CTL_DISABLE is
designed to solve, and how it solves that problem.
An overview of the epoll API
The (Linux-specific) epoll API allows an application to monitor
multiple file descriptors in order to determine which of the descriptors
are ready to perform I/O. The API was designed as a more efficient
replacement for the traditional select() and
poll() system calls. Roughly speaking, the performance of those
older APIs scales linearly with the number of file descriptors being
monitored. That behavior makes select() and poll() poorly suited for
modern network applications that may handle thousands of file descriptors
simultaneously.
The poor performance of select() and poll() is an
inescapable consequence of their design. For each monitoring operation,
both system calls require the application to give the kernel a complete
list of all of the file descriptors that are of interest. And on each call,
the kernel must re-examine the state of all of those descriptors and then
pass a data structure back to the application that describes the readiness
of the descriptors.
The underlying problem of the older APIs is that they don't allow an
application to inform the kernel about its ongoing interest in a
(typically unchanging) set of file descriptors. If the kernel had that
information, then, as each file descriptor became ready, it could record
the fact in preparation for the next request by the application for the set
of ready file descriptors. The epoll API allows exactly that approach, by
splitting the monitoring API up across three system calls:
- epoll_create() creates an internal kernel data structure
("an epoll instance") that is used to record the set of file descriptors
that the application is interested in monitoring. The call returns a file
descriptor that is used in the remaining epoll APIs.
- epoll_ctl() allows the application to inform the kernel
about the set of file descriptors it would like to monitor by adding
(EPOLL_CTL_ADD) and removing (EPOLL_CTL_DEL) file
descriptors from the interest list of the epoll
instance. epoll_ctl() can also modify (EPOLL_CTL_MOD) the
set of events that are to be monitored for a file descriptor that is
already in the interest list. Once a file descriptor has been recorded in
the interest list, the kernel tracks I/O events for the file descriptor
(e.g., the arrival of new input); if the event causes the file descriptor
to become ready, the kernel places the descriptor on the ready list
of the epoll instance, in preparation for the next call to epoll_wait().
- epoll_wait() requests the kernel to return one or more
ready file descriptors. The kernel satisfies this request by simply
fetching items from the ready list (the call can block if there
are no descriptors that are yet ready). The application uses
epoll_wait() each time it wants to check for changes in the
readiness of file descriptors. What is notable about epoll_wait()
is that the application does not need to pass in a list of file descriptors
on each call: the kernel already has that information via preceding calls
to epoll_ctl(). In addition, there is no need to rescan the
complete set of file descriptors to see which are ready; the kernel has
already been recording that information on an ongoing basis because it
knows which file descriptors the application is interested in.
Schematically, the epoll API operates as shown in the following diagram:
Because the kernel is able to maintain internal state about the set of
file descriptors in which the application is interested,
epoll_wait() is much more efficient than select() and
poll(). Roughly speaking, its performance scales according to the
number of ready file descriptors, rather than the total number of file
descriptors being monitored.
Epoll and multithreaded applications: the problem
The author of the patch that implements EPOLL_CTL_DISABLE,
Paton Lewis, is not a regular kernel hacker. Rather, he's a developer with
a particular user-space itch, and it would seem that a kernel change is the
only way of scratching that itch. In the description accompanying the first
iteration of his patch, Paton began with the following observation:
It is not currently possible to reliably delete epoll items when using
the same epoll set from multiple threads. After calling epoll_ctl with
EPOLL_CTL_DEL, another thread might still be executing code related to an
event for that epoll item (in response to epoll_wait). Therefore the
deleting thread does not know when it is safe to delete resources
pertaining to the associated epoll item because another thread might be
using those resources.
The deleting thread could wait an arbitrary amount of time after
calling epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this
is inefficient and could result in the destruction of resources before
another thread is done handling an event returned by epoll_wait.
The fact that the kernel records internal state is the source of a
complication for multithreaded applications. The complication arises from
the fact that applications may also want to maintain state information
about file descriptors. One possible reason for doing this is to prevent
file descriptor starvation, the phenomenon that can occur when, for
example, an application determines that a file descriptor has data
available for reading and then attempts to read all of the available
data. It could happen that there is a very large amount of data available
(for example, another application may be continuously writing data on the
other end of a socket connection). Consequently, the reading application
would be tied up for a long period; meanwhile, it does not service I/O
events on the other file descriptors—those descriptors are starved of
service by the application.
The solution to file descriptor starvation is for the application to
maintain a user-space data structure that caches the readiness of each of
the file descriptors that it is monitoring. Whenever epoll_wait()
informs the application that a file descriptor is ready, then, instead of
performing as much I/O as possible on the descriptor, the application makes
a record in its cache that the file descriptor is ready. The application
logic then takes the form of a loop that (a) periodically calls
epoll_wait() and (b) performs a limited amount of I/O on
the file descriptors that are marked as ready in the user-space
cache. (When the application finds that I/O is no longer possible on one of
the file descriptors, then it can mark that descriptor as not ready in the
cache.)
Thus, we have a scenario where both the kernel and a user-space
application are maintaining state information about the same
resources. This can potentially lead to race conditions when competing
threads in a multithreaded application want to update state information in
both places. The most fundamental piece of state information maintained in
both places is "existence".
For example, suppose that an application thread determines that it is
no longer necessary to monitor a file descriptor. The thread would first
check to see whether the file descriptor is marked as ready in the
user-space cache (i.e., there may still be some outstanding I/O to
perform), and then, if the file descriptor is not ready, the thread would
delete the file descriptor from the user-space cache and from the kernel's
epoll interest list using the epoll_ctl(EPOLL_CTL_DEL)
operation. However, these steps could fall afoul of races in scenarios such
as the following, involving two threads operating on file descriptor 9:
- Thread A determines from the user-space cache that descriptor 9 is not
ready.
- Thread B calls epoll_wait(); the call indicates descriptor 9 as ready.
- Thread B records descriptor 9 as being ready inside the user-space cache
so that I/O can later be performed.
- Thread A deletes descriptor 9 from the user-space cache.
- Thread A deletes descriptor 9 from the kernel's epoll interest list.
Following the above scenario, some data will be lost. Other scenarios could
lead to a corrupted cache or an application crash.
No use of (per-file-descriptor) mutexes can eliminate the sorts of
races described here, short of protecting the calls to
epoll_wait() with a (global) mutex, which has the effect of
destroying concurrency. (If one thread is blocked in an
epoll_wait() call, then any other thread that tries to acquire
the corresponding mutex will also block.)
Epoll and multithreaded applications: the solution
Paton's solution to this problem is to extend the epoll API with a new
operation that atomically prevents other threads from receiving further
indications that a file descriptor is ready, while at the same time
informing the caller whether another thread has "recently" been told the
file descriptor is ready. The new operation relies on some of the inner
workings of the epoll API.
When adding (EPOLL_CTL_ADD) or modifying
(EPOLL_CTL_MOD) a file descriptor in the interest list, the
application specifies a mask of I/O events that are of interest for the
descriptor. For example, the mask might include both EPOLLIN and
EPOLLOUT, if the application wants to know when the file
descriptor becomes either readable or writable. In addition, the kernel
implicitly adds two further flags to the events mask in the interest list:
EPOLLERR, which requests monitoring for error conditions, and
EPOLLHUP, which requests monitoring for a "hangup" (e.g., we are
monitoring the read end of a pipe, and the write end is closed). When a
file descriptor becomes ready, epoll_wait() returns a mask that
contains all of the requested events for which the file descriptor is
ready. For example, if an application requests monitoring of the read end
of a pipe using EPOLLIN and the write end of the pipe is closed,
then epoll_wait() will return an events mask that includes both
EPOLLIN and EPOLLHUP.
As well as the flags that can be used to monitor file descriptors for
various I/O events, there are a few "operational flags"—flags that
modify the semantics of the monitoring operation itself. One of these is
EPOLLONESHOT. If this flag is specified in the events mask for a
file descriptor, then, once the file descriptor becomes ready and is
returned by a call to epoll_wait(), it is disabled from further
monitoring (but remains in the interest list). If the application is
interested in monitoring the file descriptor once more, then it must re-enable
the file descriptor using the epoll_ctl(EPOLL_CTL_MOD) operation.
[Diagram: the per-descriptor events mask recorded in an epoll interest
list, divided into I/O event flags and operational flags.]
The implementation of EPOLLONESHOT relies on a trick. If this
flag is set, then, when the file descriptor is reported as being ready
via epoll_wait(), the kernel clears all of the
"non-operational flags" (i.e., the I/O event flags) in the events mask for
that file descriptor. This serves as a later cue to the kernel that it
should not track I/O events for this file descriptor.
By now, we finally have enough details to understand Paton's extension
to the epoll API—the epoll_ctl(EPOLL_CTL_DISABLE)
operation—that allows multithreaded applications to avoid the kind of
races described above. Successful use of this extension requires the
following:
- The user-space cache that describes file descriptors should also
include a per-descriptor "delete-when-done" flag that defaults to false but
can be set true when one thread wants to inform another thread that a
particular file descriptor should be deleted.
- All epoll_ctl() calls that add or modify file descriptors
in the interest list must specify the EPOLLONESHOT flag.
- The epoll_ctl(EPOLL_CTL_DISABLE) operation should be used
as described in a moment.
In addition, calls to epoll_ctl(EPOLL_CTL_DISABLE) and
accesses to the user-space cache must be suitably protected with
per-file-descriptor mutexes. We won't go into details here, but the second version of Paton's patch adds a
sample application to the kernel source tree (under
tools/testing/selftests/epoll/test_epoll.c) that demonstrates the
technique.
The new epoll operation is employed via the following call:

    epoll_ctl(epfd, EPOLL_CTL_DISABLE, fd, NULL);

where epfd is a file descriptor referring to an epoll
instance and fd is the file descriptor in the interest list that is
to be disabled. The semantics of this operation handle two cases:
- One or more of the I/O event flags is set in the interest list
entry for fd. This means that, since the last epoll_ctl()
operation that added or modified this interest list entry, no other thread
has executed an epoll_wait() call that indicated this file
descriptor as being ready. In this case, the kernel clears the I/O event
flags in the interest list entry, which prevents subsequent
epoll_wait() calls from returning the file descriptor as being
ready. The epoll_ctl(EPOLL_CTL_DISABLE) call then returns zero to
the caller. At this point, the caller knows that no other thread is
operating on the file descriptor, and it can thus safely delete the
descriptor from the user-space cache and from the kernel interest list.
- No I/O event flag is set in the interest list entry for
fd. This means that since the last epoll_ctl() operation
that added or modified this interest list entry, another thread has
executed an epoll_wait() call that indicated this file descriptor
as being ready. In this case, epoll_ctl(EPOLL_CTL_DISABLE) returns
–1 with errno set to EBUSY. At this point, the
caller knows that another thread is operating on the descriptor, so it sets
the descriptor's "delete-when-done" flag in the user-space cache to
indicate that the other thread should delete the file descriptor once
it has finished using it.
Thus, we see that with a moderate amount of effort, and a little help
from a new kernel interface, a race can be avoided when deleting file
descriptors in multithreaded applications that wish to avoid file
descriptor starvation.
There was relatively little comment on the first iteration of Paton's
patch. The only substantive comments came from Christof Meerwald; in
response to these, Paton created the second version of his patch. That
version received no comments, and was incorporated into 3.7-rc1. It
would be nice to think that the relative paucity of comments reflects the
silent agreement that Paton's approach is correct. However, one is left
with the nagging feeling that in fact few people have reviewed the patch,
which leaves open the question: is this the best solution to the problem?
Although EPOLL_CTL_DISABLE solves the problem, the solution is
neither intuitive nor easy to use. The main reason for this is that
EPOLL_CTL_DISABLE is a bolt-on hack to the epoll API that
satisfies the requirement (often repeated
by Linus Torvalds) that existing user-space applications must not be
broken by making a kernel ABI change. Within that constraint,
EPOLL_CTL_DISABLE may be the best solution to the
problem. However, a better solution might well have been possible if it had
been incorporated into the original design of the
epoll API. Next week's follow-on article will consider whether a better
initial solution could have been found and also consider why it might not
be possible to find a better solution within the constraints of the current
API.
Finally, it's worth noting that the EPOLL_CTL_DISABLE feature
is not yet cast in stone, although it will become so in about two months,
when Linux 3.7 is released. In the meantime, if someone comes up with a
better idea to solve the problem, then the existing approach could be
modified or replaced.
The Linux kernel's software interrupt ("softirq") mechanism is a bit of a
strange beast. It is an obscure holdover from the earliest days of Linux
and a mechanism that few kernel developers ever deal with directly. Yet it
is at the core of much of the kernel's most important processing.
Occasionally softirqs make their presence known in undesired ways; it is
not surprising that the kernel's frequent problem child — the realtime
preemption patch set — has often run afoul of them. Recent versions of
that patch set embody a new approach to the software interrupt problem that
merits a look.
A softirq introduction
In the announcement for the 3.6.1-rt1 patch
set, Thomas Gleixner described software interrupts this way:
First of all, it's a conglomerate of mostly unrelated jobs, which
run in the context of a randomly chosen victim w/o the ability to
put any control on them.
The softirq mechanism is meant to handle processing that is almost — but
not quite — as important as the handling of hardware interrupts. Softirqs
run at a high priority (though with an interesting exception, described
below), but with
hardware interrupts enabled. They thus will normally preempt any work
except the response to a "real" hardware interrupt.
Once upon a time, there were 32 hardwired software interrupt vectors, one
assigned to each device driver or related task. Drivers have, for the most
part, been detached from software interrupts for a long time — they still
use softirqs, but that access has been laundered through intermediate APIs
like tasklets and timers. In current kernels there are ten softirq vectors
defined; two for tasklet processing, two for networking, two for the block
layer, two for timers, and one each for the scheduler and read-copy-update
processing. The kernel maintains a per-CPU bitmask indicating which
softirqs need processing at any given time. So, for example, when a kernel
subsystem calls tasklet_schedule(), the TASKLET_SOFTIRQ
bit is set on the corresponding CPU and, when softirqs are processed, the
tasklet will be run.
There are two places where software interrupts can "fire" and preempt
the current thread. One of them is at the end of the processing for a hardware
interrupt; it is common for interrupt handlers to raise softirqs, so it
makes sense (for latency and optimal cache use) to process them as soon as
hardware interrupts can be
re-enabled. The other possibility is anytime that kernel code re-enables
softirq processing (via a call to functions like local_bh_enable()
or spin_unlock_bh()). The end result is that the accumulated
softirq work (which can be substantial) is executed in the context of
whichever process happens to be running at the wrong time; that is the
"randomly chosen victim" aspect that Thomas was talking about.
Readers who have looked at the process mix on their systems may be wondering
where the ksoftirqd processes fit into the picture. These
processes exist to offload softirq processing when the load gets too heavy.
If the regular, inline softirq processing code loops ten times and still
finds more softirqs to process (because they continue to be raised), it
will wake the appropriate ksoftirqd process (there is one per CPU)
and exit; that process will
eventually be scheduled and pick up running softirq handlers.
The ksoftirqd process will also be poked if a softirq is raised outside of (hardware or software)
interrupt context; that is necessary because, otherwise, an arbitrary
amount of time might pass before softirqs are processed again. In older
kernels, the ksoftirqd processes ran at the lowest possible priority,
meaning that softirq processing was, depending on where it was being run,
either the highest-priority or the lowest-priority work on the system.
Since 2.6.23,
ksoftirqd runs at normal user-level priority by default.
Softirqs in the realtime setting
On normal systems, the softirq mechanism works well enough that there has
not been much motivation to change it, though, as described in "The new visibility of RCU processing,"
read-copy-update work has been moved into its own helper threads for the
3.7 kernel. In the realtime world, though, the concept of forcing
arbitrary processes to do random work tends to be unpopular, so the
realtime patches have traditionally pushed all softirq processing into
separate threads, each with its own priority. That allowed, for example,
the priority of network softirq handling to be raised on systems where
networking needed realtime response; conversely, it could be lowered on
systems where response to network events was less critical.
Starting with the 3.0 realtime patch set, though, that capability went away. It
worked less well with the new approach to
per-CPU data adopted then, and, as Thomas said, the per-softirq threads
posed configuration problems:
It's extremely hard to get the parameters right for a RT system in
general. Adding something which is obscure as soft interrupts to
the system designers todo list is a bad idea.
So, in 3.0, softirq handling looked very similar to how things are done in
the mainline kernel. That improved the code and increased performance on
untuned systems (by eliminating the context switch to the softirq thread),
but took away the ability to finely tweak things for those
who were inclined to do so. And realtime developers tend to be highly
inclined to do just that. The result, naturally, is that some users
complained about the changes.
In response, in 3.6.1-rt1, the handling of softirqs has changed again.
Now, when a thread raises a softirq, the specific interrupt in question
(network receive processing, say) is remembered by the kernel. As soon as
the thread exits the context where software interrupts are disabled, that
one softirq (and no others) will be run. That has the effect of minimizing
softirq latency (since softirqs are run as soon as possible); just as
importantly, it also ties
processing of softirqs to the processes that generate them. A process
raising networking softirqs will not be bogged down processing some other
process's timers. That keeps the work local, avoids nondeterministic
behavior caused by running another process's softirqs, and causes softirq
processing to naturally run with the priority of the process that created
the work in the first place.
There is an exception, of course: softirqs raised in hardware interrupt
context cannot be handled in this way. There is no general way to
associate a hardware interrupt with a specific thread, so it is not
possible to force the responsible thread to do the necessary processing.
The answer in this case is to just hand those softirqs to the
ksoftirqd process and be done with it.
A logical next step, hinted at by Thomas, is to move from an environment
where all softirqs are disabled to one where only specific softirqs are. Most
code that disables softirq handling is only concerned with one specific
handler; all the others could be allowed to run as usual. Going further,
he adds: "the nicest solution would be to get rid of them
completely." The elimination of the softirq mechanism has been on
the "todo" list for a long time, but nobody has, yet, felt the pain
strongly enough to actually do that work.
The nature of the realtime patch set has often been that its users feel the
pain of mainline kernel shortcomings before the rest of us do. That has
caused a great many mainline fixes and improvements to come from the realtime
community. Perhaps that will eventually happen again for softirqs. For
the time being, though, realtime users have an improved softirq mechanism
that should give the desired results without the need for difficult
low-level tuning. Naturally, Thomas is looking for people to test this
change and report back on how well it works with their workloads.
Patches and updates
- Thomas Gleixner: 3.6.1-rt2 (October 17, 2012).
Page editor: Jonathan Corbet