Kernel development
Brief items
Kernel release status
The current development kernel is 3.7-rc1, released on October 14. See the separate article, below, for a summary of the final items added during the 3.7 merge window.

Stable updates: 3.0.46, 3.4.14, 3.5.7 and 3.6.2 were released on October 12. 3.5.7 is the end of the line for updates in the 3.5 series.
Quotes of the week
There are two solutions that were contemplated for disabling the module: having a kind of global status of the crypto API that makes it non-responsive in case of an integrity/self-test error. The other solution is to simply terminate the entire kernel. As the former one also will lead to a kernel failure eventually as many parts of the kernel depend on the crypto API, the implementation of the latter option was chosen.
Should I just hope the sender realizes their foolishness on their own and give them N hours to rescind the statement and fix up their insane patch and resend it, thereby giving them a grace period? If so, what is the proper value for N?
Or is it fair game to let loose and channel up the Torvalds-like daemons within my keyboard, with the hope that it would actually do some good and they would learn from their mistakes?
Al Viro's new execve/kernel_thread design
Al Viro has been busily reworking the creation of kernel threads, making the code cleaner and less architecture-specific. As part of that exercise, he has posted a lengthy document on how kernel thread creation works now and the changes that he is making. Worth a read for those interested in how this part of the core kernel works. "Old implementation of kernel_thread() had been rather convoluted. In the best case, it filled struct pt_regs according to its arguments and passed them to do_fork(). The goal was to fool the code doing return from syscall into *not* leaving the kernel mode, so that newborn would have (after emptying its kernel stack) end up in a helper function with the right values in registers. Running in kernel mode."
Kernel development news
3.7 merge window: conclusion and summary
Linus pulled a total of 10,409 non-merge changesets into the mainline before closing the merge window for the 3.7 development cycle. That makes 3.7 one of the most active development cycles in recent history; only 3.2, with 10,214 changesets in the merge window, comes close. Clearly, there is a lot going on in the kernel development community.

Interestingly, Linus expressed some skepticism about some of this cycle's work in the 3.7-rc1 announcement. For example, the discussion on the 64-bit ARM patch set concluded some time ago, but Linus came in with a late opinion of his own.
He also expressed some grumpiness about the user-space API header file split — an enormous set of patches that is only partially merged for 3.7. Header file cleanups, he says, are just too much pain for the benefit that results, so he will not consider any more of them in the future.
Grumbles notwithstanding, he pulled all of this work — and much more — for 3.7. The user-visible changes merged since last week's summary include:
- Support for signed kernel modules has
been merged. With this feature turned on, the kernel will refuse to
load modules that have not been signed with a recognized key. Among
other users, full support of UEFI secure boot requires this
capability. There is also a mode where unsigned modules will still be
loaded, but the kernel will be tainted in the process.
- NFS 4.1 support is no longer considered experimental.
- The MD RAID layer now supports TRIM ("discard") operations.
- New hardware support includes TI LM355x and LM3642 LED controllers, Atmel AT91 two-wire interface controllers (a replacement driver), and Renesas R-Car I2C controllers.
Changes visible to kernel developers include:
- The "UAPI disintegration" patch sets have been pulled into quite
a few subsystem trees, causing a lot of header file (and related)
churn. A fair amount of this work was deferred to 3.8 as well,
though, so this job is not yet done.
- The kerneldoc subsystem can now output documents in the HTML5 format.
- The kernel now has a generic cooling subsystem based on cpufreq; see
Documentation/thermal/cpu-cooling-api.txt
for (a few) details.
- It's worth noting that some kernel developers have expressed grumpiness about the increase in build time caused by the addition of the signed module feature. Anybody whose work involves doing lots of fast kernel builds will probably want to turn that feature off.
At this point it is time to perform the final stabilization work on all these changes. If things go according to the usual schedule, that should result in the final 3.7 release sometime in early December.
EPOLL_CTL_DISABLE and multithreaded applications
Other than the merging of the server-side component of TCP Fast Open, one of the few user-space API changes to go into the just-closed 3.7 merge window is the addition of a new EPOLL_CTL_DISABLE operation for the epoll_ctl() system call. It's interesting to look at this operation as an illustration of the sometimes unforeseen complexities of dealing with multithreaded applications; that examination is the subject of this article. In addition, EPOLL_CTL_DISABLE highlights some common problems in the design of the APIs that the kernel presents to user space. (To be clear: EPOLL_CTL_DISABLE is the fix to a past design problem, not a design problem itself.) Those design problems will be the subject of a follow-on article next week.
Understanding the need for EPOLL_CTL_DISABLE requires an understanding of several features of the epoll API. For those who are unfamiliar with epoll, we begin with a high-level picture of how the API works. We then look at the problem that EPOLL_CTL_DISABLE is designed to solve, and how it solves that problem.
An overview of the epoll API
The (Linux-specific) epoll API allows an application to monitor multiple file descriptors in order to determine which of the descriptors are ready to perform I/O. The API was designed as a more efficient replacement for the traditional select() and poll() system calls. Roughly speaking, the performance of those older APIs scales linearly with the number of file descriptors being monitored. That behavior makes select() and poll() poorly suited for modern network applications that may handle thousands of file descriptors simultaneously.
The poor performance of select() and poll() is an inescapable consequence of their design. For each monitoring operation, both system calls require the application to give the kernel a complete list of all of the file descriptors that are of interest. And on each call, the kernel must re-examine the state of all of those descriptors and then pass a data structure back to the application that describes the readiness of the descriptors.
The underlying problem of the older APIs is that they don't allow an application to inform the kernel about its ongoing interest in a (typically unchanging) set of file descriptors. If the kernel had that information, then, as each file descriptor became ready, it could record the fact in preparation for the next request by the application for the set of ready file descriptors. The epoll API allows exactly that approach, by splitting the monitoring API up across three system calls:
- epoll_create() creates an internal kernel data structure
("an epoll instance") that is used to record the set of file descriptors
that the application is interested in monitoring. The call returns a file
descriptor that is used in the remaining epoll APIs.
- epoll_ctl() allows the application to inform the kernel
about the set of file descriptors it would like to monitor by adding
(EPOLL_CTL_ADD) and removing (EPOLL_CTL_DEL) file
descriptors from the interest list of the epoll
instance. epoll_ctl() can also modify (EPOLL_CTL_MOD) the
set of events that are to be monitored for a file descriptor that is
already in the interest list. Once a file descriptor has been recorded in
the interest list, the kernel tracks I/O events for the file descriptor
(e.g., the arrival of new input); if the event causes the file descriptor
to become ready, the kernel places the descriptor on the ready list
of the epoll instance, in preparation for the next call to
epoll_wait().
- epoll_wait() requests the kernel to return one or more ready file descriptors. The kernel satisfies this request by simply fetching items from the ready list (the call can block if there are no descriptors that are yet ready). The application uses epoll_wait() each time it wants to check for changes in the readiness of file descriptors. What is notable about epoll_wait() is that the application does not need to pass in a list of file descriptors on each call: the kernel already has that information via preceding calls to epoll_ctl(). In addition, there is no need to rescan the complete set of file descriptors to see which are ready; the kernel has already been recording that information on an ongoing basis because it knows which file descriptors the application is interested in.
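To make the flow concrete, here is a minimal sketch (in C, with error handling abbreviated) of the three calls in action, monitoring standard input for readability; a real application would add more descriptors to the interest list and perform actual I/O in the loop:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    int main(void)
    {
        struct epoll_event ev, events[10];
        int epfd, nready, i;

        /* Create an epoll instance. (The size argument is ignored by
           modern kernels, but must be greater than zero.) */
        epfd = epoll_create(10);
        if (epfd == -1) {
            perror("epoll_create");
            exit(EXIT_FAILURE);
        }

        /* Add standard input to the interest list, asking to be told
           when it becomes readable. */
        ev.events = EPOLLIN;
        ev.data.fd = STDIN_FILENO;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
            perror("epoll_ctl");
            exit(EXIT_FAILURE);
        }

        for (;;) {
            /* Fetch up to ten entries from the ready list; block if
               the ready list is empty. */
            nready = epoll_wait(epfd, events, 10, -1);
            if (nready == -1) {
                perror("epoll_wait");
                exit(EXIT_FAILURE);
            }
            for (i = 0; i < nready; i++)
                printf("fd %d ready (events mask %#x)\n",
                       events[i].data.fd, (unsigned) events[i].events);
            /* A real application would now perform I/O on the ready
               descriptors. */
        }
    }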
Because the kernel is able to maintain internal state about the set of file descriptors in which the application is interested, epoll_wait() is much more efficient than select() and poll(). Roughly speaking, its performance scales according to the number of ready file descriptors, rather than the total number of file descriptors being monitored.
Epoll and multithreaded applications: the problem
The author of the patch that implements EPOLL_CTL_DISABLE, Paton Lewis, is not a regular kernel hacker. Rather, he's a developer with a particular user-space itch, and it would seem that a kernel change is the only way of scratching that itch. In the description accompanying the first iteration of his patch, Paton began with the following observation:
It is not currently possible to reliably delete epoll items when using the same epoll set from multiple threads. After calling epoll_ctl with EPOLL_CTL_DEL, another thread might still be executing code related to an event for that epoll item (in response to epoll_wait). Therefore the deleting thread does not know when it is safe to delete resources pertaining to the associated epoll item because another thread might be using those resources.
The deleting thread could wait an arbitrary amount of time after calling epoll_ctl with EPOLL_CTL_DEL and before deleting the item, but this is inefficient and could result in the destruction of resources before another thread is done handling an event returned by epoll_wait.
The fact that the kernel records internal state is the source of a complication for multithreaded applications: the application may also want to maintain its own state information about file descriptors. One possible reason for doing this is to prevent file descriptor starvation, the phenomenon that can occur when, for example, an application determines that a file descriptor has data available for reading and then attempts to read all of the available data. It could happen that there is a very large amount of data available (for example, another application may be continuously writing data on the other end of a socket connection). Consequently, the reading application would be tied up for a long period; meanwhile, it does not service I/O events on the other file descriptors—those descriptors are starved of service by the application.
The solution to file descriptor starvation is for the application to maintain a user-space data structure that caches the readiness of each of the file descriptors that it is monitoring. Whenever epoll_wait() informs the application that a file descriptor is ready, then, instead of performing as much I/O as possible on the descriptor, the application makes a record in its cache that the file descriptor is ready. The application logic then takes the form of a loop that (a) periodically calls epoll_wait() and (b) performs a limited amount of I/O on the file descriptors that are marked as ready in the user-space cache. (When the application finds that I/O is no longer possible on one of the file descriptors, then it can mark that descriptor as not ready in the cache.)
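In outline, the resulting loop might look like the following sketch (the fd_ready array, the sizes, and the one_iteration() helper are names invented for illustration; the file descriptors are assumed to be nonblocking):

    #include <errno.h>
    #include <unistd.h>
    #include <sys/epoll.h>

    #define MAX_FDS     1024
    #define MAX_EVENTS  64

    static int fd_ready[MAX_FDS];   /* the user-space readiness cache */

    /* One iteration of the application's main loop. */
    static void one_iteration(int epfd)
    {
        struct epoll_event events[MAX_EVENTS];
        char buf[4096];
        ssize_t n;
        int nready, i, fd;

        /* (a) Record newly ready descriptors in the cache. */
        nready = epoll_wait(epfd, events, MAX_EVENTS, 0);
        for (i = 0; i < nready; i++)
            fd_ready[events[i].data.fd] = 1;

        /* (b) Perform a limited amount of I/O on each descriptor that
           the cache marks as ready, so that no single descriptor can
           starve the others. */
        for (fd = 0; fd < MAX_FDS; fd++) {
            if (!fd_ready[fd])
                continue;
            n = read(fd, buf, sizeof(buf));
            if (n == -1 && errno == EAGAIN)
                fd_ready[fd] = 0;   /* no more I/O possible for now */
            /* ... otherwise process the data (or handle EOF/errors) ... */
        }
    }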
Thus, we have a scenario where both the kernel and a user-space application are maintaining state information about the same resources. This can potentially lead to race conditions when competing threads in a multithreaded application want to update state information in both places. The most fundamental piece of state information maintained in both places is "existence".
For example, suppose that an application thread determines that it is no longer necessary to monitor a file descriptor. The thread would first check to see whether the file descriptor is marked as ready in the user-space cache (i.e., there may still be some outstanding I/O to perform), and then, if the file descriptor is not ready, the thread would delete the file descriptor from the user-space cache and from the kernel's epoll interest list using the epoll_ctl(EPOLL_CTL_DEL) operation. However, these steps could fall afoul of scenarios such as the following, in which two threads operate on file descriptor 9:
1. Thread 1 determines from the user-space cache that descriptor 9 is not ready.
2. Thread 2 calls epoll_wait(); the call indicates descriptor 9 as ready.
3. Thread 2 records descriptor 9 as being ready inside the user-space cache, so that I/O can later be performed.
4. Thread 1 deletes descriptor 9 from the user-space cache.
5. Thread 1 deletes descriptor 9 from the kernel's epoll interest list using epoll_ctl(EPOLL_CTL_DEL).
Following the above scenario, some data will be lost. Other scenarios could lead to a corrupted cache or an application crash.
No use of (per-file-descriptor) mutexes can eliminate the sorts of races described here, short of protecting the calls to epoll_wait() with a (global) mutex, which has the effect of destroying concurrency. (If one thread is blocked in an epoll_wait() call, then any other thread that tries to acquire the corresponding mutex will also block.)
Epoll and multithreaded applications: the solution
Paton's solution to this problem is to extend the epoll API with a new operation that atomically prevents other threads from receiving further indications that a file descriptor is ready, while at the same time informing the caller whether another thread has "recently" been told the file descriptor is ready. The new operation relies on some of the inner workings of the epoll API.
When adding (EPOLL_CTL_ADD) or modifying (EPOLL_CTL_MOD) a file descriptor in the interest list, the application specifies a mask of I/O events that are of interest for the descriptor. For example, the mask might include both EPOLLIN and EPOLLOUT, if the application wants to know when the file descriptor becomes either readable or writable. In addition, the kernel implicitly adds two further flags to the events mask in the interest list: EPOLLERR, which requests monitoring for error conditions, and EPOLLHUP, which requests monitoring for a "hangup" (e.g., we are monitoring the read end of a pipe, and the write end is closed). When a file descriptor becomes ready, epoll_wait() returns a mask that contains all of the requested events for which the file descriptor is ready. For example, if an application requests monitoring of the read end of a pipe using EPOLLIN and the write end of the pipe is closed, then epoll_wait() will return an events mask that includes both EPOLLIN and EPOLLHUP.
As well as the flags that can be used to monitor file descriptors for various I/O events, there are a few "operational flags"—flags that modify the semantics of the monitoring operation itself. One of these is EPOLLONESHOT. If this flag is specified in the events mask for a file descriptor, then, once the file descriptor becomes ready and is returned by a call to epoll_wait(), it is disabled from further monitoring (but remains in the interest list). If the application is interested in monitoring the file descriptor once more, it must re-enable the file descriptor using the epoll_ctl(EPOLL_CTL_MOD) operation.
Per-descriptor events mask recorded in an epoll interest list:

- Operational flags: EPOLLONESHOT, EPOLLET, ...
- I/O event flags: EPOLLIN, EPOLLOUT, EPOLLHUP, EPOLLERR, ...
The implementation of EPOLLONESHOT relies on a trick. If this flag is set, then, when the file descriptor is indicated as being ready via epoll_wait(), the kernel clears all of the "non-operational flags" (i.e., the I/O event flags) in the events mask for that file descriptor. The cleared flags serve as a later cue to the kernel that it should not track I/O events for this file descriptor.
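In code, the one-shot cycle looks something like this sketch (the helper names are invented for illustration):

    #include <sys/epoll.h>

    /* Add 'fd' to the interest list for one-shot input monitoring. */
    static int add_oneshot(int epfd, int fd)
    {
        struct epoll_event ev;

        ev.events = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fd;
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    /* After epoll_wait() has returned 'fd' once, monitoring of the
       descriptor is disabled; this re-arms it for one more
       notification. */
    static int rearm_oneshot(int epfd, int fd)
    {
        struct epoll_event ev;

        ev.events = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fd;
        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
    }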
By now, we finally have enough details to understand Paton's extension to the epoll API—the epoll_ctl(EPOLL_CTL_DISABLE) operation—that allows multithreaded applications to avoid the kind of races described above. Using this extension successfully requires the following:
- The user-space cache that describes file descriptors should also
include a per-descriptor "delete-when-done" flag that defaults to false but
can be set true when one thread wants to inform another thread that a
particular file descriptor should be deleted.
- All epoll_ctl() calls that add or modify file descriptors
in the interest list must specify the EPOLLONESHOT flag.
- The epoll_ctl(EPOLL_CTL_DISABLE) operation should be used as described in a moment.
In addition, calls to epoll_ctl(EPOLL_CTL_DISABLE) and accesses to the user-space cache must be suitably protected with per-file-descriptor mutexes. We won't go into details here, but the second version of Paton's patch adds a sample application to the kernel source tree (under tools/testing/selftests/epoll/test_epoll.c) that demonstrates the principles.
The new epoll operation is employed via the following call:
epoll_ctl(epfd, EPOLL_CTL_DISABLE, fd, NULL);
epfd is a file descriptor referring to an epoll
instance. fd is the file descriptor in the interest list that is
to be disabled. The semantics of this operation handle two cases:
- One or more of the I/O event flags is set in the interest list
entry for fd. This means that, since the last epoll_ctl()
operation that added or modified this interest list entry, no other thread
has executed an epoll_wait() call that indicated this file
descriptor as being ready. In this case, the kernel clears the I/O event
flags in the interest list entry, which prevents subsequent
epoll_wait() calls from returning the file descriptor as being
ready. The epoll_ctl(EPOLL_CTL_DISABLE) call then returns zero to
the caller. At this point, the caller knows that no other thread is
operating on the file descriptor, and it can thus safely delete the
descriptor from the user-space cache and from the kernel interest list.
- No I/O event flag is set in the interest list entry for fd. This means that, since the last epoll_ctl() operation that added or modified this interest list entry, another thread has executed an epoll_wait() call that indicated this file descriptor as being ready. In this case, epoll_ctl(EPOLL_CTL_DISABLE) returns –1 with errno set to EBUSY. At this point, the caller knows that another thread is operating on the descriptor, so it sets the descriptor's "delete-when-done" flag in the user-space cache to indicate that the other thread should delete the file descriptor when it has finished using it.
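Putting the pieces together, a deleting thread might use the operation as in the following sketch, based on the semantics just described (the fd_info structure and try_delete() helper are invented for illustration; as noted above, the same mutex must also be held by threads that handle events for the descriptor and consult the cache):

    #include <errno.h>
    #include <pthread.h>
    #include <sys/epoll.h>

    struct fd_info {
        pthread_mutex_t mutex;      /* protects this entry and the
                                       epoll calls for the descriptor */
        int delete_when_done;       /* the "delete-when-done" flag */
        /* ... cached readiness and other per-descriptor state ... */
    };

    /* Try to delete 'fd'; returns 1 if this thread performed the
       deletion, 0 if the thread currently handling an event for the
       descriptor was asked to do it instead. */
    static int try_delete(int epfd, int fd, struct fd_info *info)
    {
        int deleted = 0;

        pthread_mutex_lock(&info->mutex);
        if (epoll_ctl(epfd, EPOLL_CTL_DISABLE, fd, NULL) == 0) {
            /* No other thread has been told that 'fd' is ready; it is
               now disabled, so it can safely be deleted. */
            epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
            /* ... also delete 'fd' from the user-space cache ... */
            deleted = 1;
        } else if (errno == EBUSY) {
            /* Another thread was told that 'fd' is ready; ask it to
               perform the deletion when it has finished its I/O. */
            info->delete_when_done = 1;
        }
        pthread_mutex_unlock(&info->mutex);
        return deleted;
    }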
Thus, we see that with a moderate amount of effort, and a little help from a new kernel interface, a race can be avoided when deleting file descriptors in multithreaded applications that wish to avoid file descriptor starvation.
Concluding remarks
There was relatively little comment on the first iteration of Paton's patch. The only substantive comments came from Christof Meerwald; in response to these, Paton created the second version of his patch. That version received no comments, and was incorporated into 3.7-rc1. It would be nice to think that the relative paucity of comments reflects silent agreement that Paton's approach is correct. However, one is left with the nagging feeling that, in fact, few people have reviewed the patch, which leaves open the question: is this the best solution to the problem?
Although EPOLL_CTL_DISABLE solves the problem, the solution is neither intuitive nor easy to use. The main reason for this is that EPOLL_CTL_DISABLE is a bolt-on hack to the epoll API that satisfies the requirement (often repeated by Linus Torvalds) that existing user-space applications must not be broken by making a kernel ABI change. Within that constraint, EPOLL_CTL_DISABLE may be the best solution to the problem. However, it seems likely that a better solution would have been possible had it been incorporated into the original design of the epoll API. Next week's follow-on article will consider whether a better initial solution could have been found, and why it might not be possible to find one within the constraints of the current API.
Finally, it's worth noting that the EPOLL_CTL_DISABLE feature is not yet cast in stone, although it will become so in about two months, when Linux 3.7 is released. In the meantime, if someone comes up with a better idea to solve the problem, then the existing approach could be modified or replaced.
Software interrupts and realtime
The Linux kernel's software interrupt ("softirq") mechanism is a bit of a strange beast. It is an obscure holdover from the earliest days of Linux and a mechanism that few kernel developers ever deal with directly. Yet it is at the core of much of the kernel's most important processing. Occasionally softirqs make their presence known in undesired ways; it is not surprising that the kernel's frequent problem child — the realtime preemption patch set — has often run afoul of them. Recent versions of that patch set embody a new approach to the software interrupt problem that merits a look.
A softirq introduction
In the announcement for the 3.6.1-rt1 patch set, Thomas Gleixner described software interrupts as a conglomerate of mostly unrelated jobs that run in the context of a "randomly chosen victim" process, without any way to exert control over them.
The softirq mechanism is meant to handle processing that is almost — but not quite — as important as the handling of hardware interrupts. Softirqs run at a high priority (though with an interesting exception, described below), but with hardware interrupts enabled. They thus will normally preempt any work except the response to a "real" hardware interrupt.
Once upon a time, there were 32 hardwired software interrupt vectors, one assigned to each device driver or related task. Drivers have, for the most part, been detached from software interrupts for a long time — they still use softirqs, but that access has been laundered through intermediate APIs like tasklets and timers. In current kernels there are ten softirq vectors defined: two for tasklet processing, two for networking, two for the block layer, two for timers, and one each for the scheduler and read-copy-update processing. The kernel maintains a per-CPU bitmask indicating which softirqs need processing at any given time. So, for example, when a kernel subsystem calls tasklet_schedule(), the TASKLET_SOFTIRQ bit is set on the corresponding CPU and, when softirqs are processed, the tasklet will be run.
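For reference, the full set of vectors is defined in include/linux/interrupt.h; in the 3.6 kernel, the list looks approximately like this:

    enum
    {
        HI_SOFTIRQ = 0,         /* high-priority tasklets */
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        BLOCK_IOPOLL_SOFTIRQ,
        TASKLET_SOFTIRQ,        /* normal tasklets */
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ,
        RCU_SOFTIRQ,            /* read-copy-update processing */

        NR_SOFTIRQS
    };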
There are two places where software interrupts can "fire" and preempt the current thread. One of them is at the end of the processing for a hardware interrupt; it is common for interrupt handlers to raise softirqs, so it makes sense (for latency and optimal cache use) to process them as soon as hardware interrupts can be re-enabled. The other possibility is anytime that kernel code re-enables softirq processing (via a call to functions like local_bh_enable() or spin_unlock_bh()). The end result is that the accumulated softirq work (which can be substantial) is executed in the context of whichever process happens to be running at the wrong time; that is the "randomly chosen victim" aspect that Thomas was talking about.
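The second case is easily seen in code using the "_bh" spinlock variants; here is a brief sketch (my_lock and the function around it are invented for illustration):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(my_lock);

    static void update_state_shared_with_softirq(void)
    {
        /* Take the lock and disable softirq processing on this CPU,
           so that a softirq handler cannot deadlock against us. */
        spin_lock_bh(&my_lock);
        /* ... manipulate data also touched by a softirq handler ... */
        spin_unlock_bh(&my_lock);
        /* Any softirqs raised in the meantime may run right here, in
           the context of the current process. */
    }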
Readers who have looked at the process mix on their systems may be wondering where the ksoftirqd processes fit into the picture. These processes exist to offload softirq processing when the load gets too heavy. If the regular, inline softirq processing code loops ten times and still finds more softirqs to process (because they continue to be raised), it will wake the appropriate ksoftirqd process (there is one per CPU) and exit; that process will eventually be scheduled and pick up running softirq handlers. Ksoftirqd will also be poked if a softirq is raised outside of (hardware or software) interrupt context; that is necessary because, otherwise, an arbitrary amount of time might pass before softirqs are processed again. In older kernels, the ksoftirqd processes ran at the lowest possible priority, meaning that softirq processing was, depending on where it was being run, either the highest-priority or the lowest-priority work on the system. Since 2.6.23, ksoftirqd has run at normal user-level priority by default.
Softirqs in the realtime setting
On normal systems, the softirq mechanism works well enough that there has not been much motivation to change it, though, as described in "The new visibility of RCU processing," read-copy-update work has been moved into its own helper threads for the 3.7 kernel. In the realtime world, though, the concept of forcing arbitrary processes to do random work tends to be unpopular, so the realtime patches have traditionally pushed all softirq processing into separate threads, each with its own priority. That allowed, for example, the priority of network softirq handling to be raised on systems where networking needed realtime response; conversely, it could be lowered on systems where response to network events was less critical.
Starting with the 3.0 realtime patch set, though, that capability went away. It worked less well with the new approach to per-CPU data adopted then and, as Thomas said, the per-softirq threads posed configuration problems of their own.
So, in 3.0, softirq handling looked very similar to how things are done in the mainline kernel. That improved the code and increased performance on untuned systems (by eliminating the context switch to the softirq thread), but took away the ability to finely tweak things for those who were inclined to do so. And realtime developers tend to be highly inclined to do just that. The result, naturally, is that some users complained about the changes.
In response, in 3.6.1-rt1, the handling of softirqs has changed again. Now, when a thread raises a softirq, the specific interrupt in question (network receive processing, say) is remembered by the kernel. As soon as the thread exits the context where software interrupts are disabled, that one softirq (and no others) will be run. That has the effect of minimizing softirq latency (since softirqs are run as soon as possible); just as importantly, it also ties processing of softirqs to the processes that generate them. A process raising networking softirqs will not be bogged down processing some other process's timers. That keeps the work local, avoids nondeterministic behavior caused by running another process's softirqs, and causes softirq processing to naturally run with the priority of the process creating the work in the first place.
There is an exception, of course: softirqs raised in hardware interrupt context cannot be handled in this way. There is no general way to associate a hardware interrupt with a specific thread, so it is not possible to force the responsible thread to do the necessary processing. The answer in this case is to just hand those softirqs to the ksoftirqd process and be done with it.
A logical next step, hinted at by Thomas, is to move from an environment
where all softirqs are disabled to one where only specific softirqs are. Most
code that disables softirq handling is only concerned with one specific
handler; all the others could be allowed to run as usual. Going further,
he adds: "the nicest solution would be to get rid of them completely." The elimination of the softirq mechanism has been on
the "todo" list for a long time, but nobody has, yet, felt the pain
strongly enough to actually do that work.
The nature of the realtime patch set has often been that its users feel the pain of mainline kernel shortcomings before the rest of us do. That has caused a great many mainline fixes and improvements to come from the realtime community. Perhaps that will eventually happen again for softirqs. For the time being, though, realtime users have an improved softirq mechanism that should give the desired results without the need for difficult low-level tuning. Naturally, Thomas is looking for people to test this change and report back on how well it works with their workloads.