
Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc2, released on October 20. Linus comments:

Anyway, it's been roughly a week, and -rc2 is out. The most noticeable thing tends to be fixing various fallout issues - there's lots of patches to finish up (and fix the fallout) from the UAPI include file reorganization, for example, but there's also some changes to how module signing is done etc etc.

Stable updates: 3.0.47, 3.4.15, and 3.6.3 were released on October 21; each contains another set of important fixes. Note that 3.4.15 and 3.6.3 also contain an ext4 data-corruption bug (as do their immediate predecessors and 3.5.7), so waiting for the next update might be advisable. 3.0.47, for its part, contains a block subsystem patch that "could cause problems"; 3.0.48 was released on October 22 with a revert.

Meanwhile, 3.2.32 was released on October 18.

Comments (none posted)

Quotes of the week

When I was on Plan 9, everything was connected and uniform. Now everything isn't connected, just connected to the cloud, which isn't the same thing. And uniform? Far from it, except in mediocrity. This is 2012 and we're still stitching together little microcomputers with HTTPS and ssh and calling it revolutionary. I sorely miss the unified system view of the world we had at Bell Labs, and the way things are going that seems unlikely to come back any time soon.
Rob Pike

local_irq_save() and local_irq_restore() were mistakes :( It's silly to write what appears to be a C function and then have it operate like Pascal (warning: I last wrote some Pascal in 66 B.C.).
Andrew Morton

So this is it. The big step why we did all the work over the past kernel releases. Now everything is prepared, so nothing protects us from doing that big step.

           |  |            \  \ nnnn/^l      |  |
           |  |             \  /     /       |  |
           |  '-,.__   =>    \/   ,-`    =>  |  '-,.__
           | O __.´´)        (  .`           | O __.´´)
            ~~~   ~~          ``              ~~~   ~~
Jiri Slaby

So really Rasberry Pi and Broadcom - get a big FAIL for even bothering to make a press release for this, if they'd just stuck the code out there and gone on with things it would have been fine, nobody would have been any happier, but some idiot thought this crappy shim layer deserved a press release, pointless.
Dave Airlie

Comments (6 posted)

Ext4 data corruption trouble [Updated]

By Jonathan Corbet
October 24, 2012
Stable kernel updates are supposed to be just that — stable. But they are not immune to bugs, as a recent ext4 filesystem problem has shown. In short: ext4 users would be well advised to avoid versions 3.4.14, 3.4.15, 3.5.7, 3.6.2, and 3.6.3; they all contain a patch which can, in some situations, cause filesystem corruption.

The problem, as explained in this note from Ted Ts'o, has to do with how the ext4 journal is managed. In some situations, unmounting the filesystem fails to truncate the journal, leaving stale (but seemingly valid) data there. After a single unmount/remount (or reboot) cycle little harm is done; some old transactions just get replayed unnecessarily. If the filesystem is quickly unmounted again, though, the journal can be left in a corrupted state; that corruption will be helpfully replayed onto the filesystem at the next mount.

Fixes are in the works. The ext4 developers are taking some time, though, to be sure that the problem has been fully understood and completely fixed; there are signs that the bug may have roots far older than the patch that actually caused it to bite people. Once that process is complete, there should be a new round of stable updates (possibly even for 3.5, which is otherwise at end-of-life) and the world will be safe for ext4 users again.

(Thanks are due to LWN reader "nix" who alerted readers in the comments and reported the bug to the ext4 developers).

Update: Ted now thinks that his initial diagnosis was incomplete at best; the problem is not as well understood as it seemed. Stay tuned.

Comments (46 posted)

Raspberry Pi VideoCore driver code released

The Raspberry Pi Foundation has announced that the source code for its video driver is now available under the BSD license. "If you’re not familiar with the status of open source drivers on ARM SoCs this announcement may not seem like such a big deal, but it does actually mean that the BCM2835 used in the Raspberry Pi is the first ARM-based multimedia SoC with fully-functional, vendor-provided (as opposed to partial, reverse engineered) fully open-source drivers, and that Broadcom is the first vendor to open their mobile GPU drivers up in this way."

Comments (60 posted)

Kernel development news

Small-task packing

By Jonathan Corbet
October 24, 2012
The Linux scheduler, in its various forms, has always been optimized for the (sometimes conflicting) goals of throughput and interactivity. Balancing those two objectives across all possible workloads has proved to be enough of a challenge over the years; one could argue that the last thing the scheduler developers need is yet another problem to worry about. In recent times, though, that is exactly what has happened: the scheduler is now expected to run the workload while also minimizing power consumption. Whether the system lives in a pocket or in a massive data center, the owner is almost certainly interested in more power-efficient operation. This problem has proved to be difficult to solve, but Vincent Guittot's recently posted small-task packing patch set may be a step in the right direction.

A "small task" in this context is one that uses a relatively small amount of CPU time; in particular, small tasks are runnable less than 25% of the time. Such tasks, if they are spread out across a multi-CPU system, can cause processors to stay awake (and powered up) without actually using those processors to any great extent. Rather than keeping all those CPUs running, it clearly makes sense to coalesce those small tasks onto a smaller number of processors, allowing the remaining processors to be powered down.

The first step toward this goal is, naturally, to be able to identify those small tasks. That can be a challenge: the scheduler in current kernels does not collect the information needed to make that determination. The good news is that this problem has already been solved by Paul Turner's per-entity load tracking patch set, which allows for proper tracking of the load added to the system by every "entity" (being either a process or a control group full of processes) in the system. This patch set has been out-of-tree for some time, but the clear plan is to merge it sometime in the near future.

The kernel's scheduling domains mechanism represents the topology of the underlying system; among other things, it is intended to help the scheduler decide when it makes sense to move a process from one CPU to another. Vincent's patch set starts by adding a new flag bit to indicate when two CPUs (or CPU groups, at the higher levels) share the same power line. In the shared case, the two CPUs cannot be powered down independently of each other. So, when two CPUs live in the same power domain, moving a process from one to the other will not significantly change the system's power consumption. By default, the "shared power line" bit is set for all CPUs; that preserves the scheduler's current behavior.

The real goal, from the power management point of view, is to vacate all CPUs on a given power line so the whole set can be powered down. So the scheduler clearly wants to use the new information to move small tasks out of CPU power domains. As we have recently seen, though, process-migration code needs to be written carefully lest it impair the performance of the scheduler as a whole. So, in particular, it is important that the scheduler not have to scan through a (potentially long) list of CPUs when contemplating whether a small task should be moved or not. To that end, Vincent's patch assigns a "buddy" to each CPU at system initialization time. Arguably "buddy" is the wrong term to use, since the relationship is a one-way affair; a CPU can dump small tasks onto its buddy (and only onto the buddy), but said buddy cannot reciprocate.

Imagine, for a moment, a simple two-socket, four-CPU system that looks (within the constraints of your editor's severely limited artistic capabilities) like this:

[System diagram]

For each CPU, the scheduler tries to find the nearest suitable CPU on a different power line to buddy it with. The most "suitable" CPU is typically the lowest-numbered one in each group, but, on heterogeneous systems, the code will pick the CPU with the lowest power consumption on the assumption that it is the most power-efficient choice. So, if each CPU and each socket in the above system could be powered down independently, the buddy assignments would look like this:

[Buddies]

Note that CPU 0 has no buddy, since it is the lowest-numbered processor in the system. If CPUs 2 and 3 shared a power line, the buddy assignments would be a little different:

[Buddies]

In each case, the purpose is to define an easy path by which an alternative, power-independent CPU can be chosen as the new home for a small task.

With that structure in place, the actual changes to the scheduler are quite small. The normal load-balancing code is unaffected for the simple reason that small tasks, since they are more likely than not to be sleeping when the load balancer runs, tend not to be moved in the balancing process. Instead, the scheduler will, whenever a known small task is awakened, consider whether that task should be moved from its current CPU to the buddy CPU. If the buddy is sufficiently idle, the task will be moved; otherwise the normal wakeup logic runs as always. Over time, small tasks will tend to migrate toward the far end of the buddy chain as long as the load on those processors does not get too high. They should, thus, end up "packed" on a relatively small number of power-efficient processors.

Vincent's patch set included some benchmark results showing that throughput with the modified scheduler is essentially unchanged. Power consumption is a different story, though; using "cyclictest" as a benchmark, he showed power consumption at about ⅓ its previous level. The benefits are sure to be smaller with a real-world workload, but it seems clear that pushing small tasks toward a small number of CPUs can be a good move. Expect discussion of approaches like this one to pick up once the per-entity load tracking patches have found their way into the mainline.

Comments (5 posted)

EPOLL_CTL_DISABLE, epoll, and API design

By Michael Kerrisk
October 23, 2012

In an article last week, we saw that the EPOLL_CTL_DISABLE operation proposed by Paton Lewis provides a way for multithreaded applications that cache information about file descriptors to safely delete those file descriptors from an epoll interest list. For the sake of brevity, in the remainder of this article we'll use the term "the EPOLL_CTL_DISABLE problem" to label the underlying problem that EPOLL_CTL_DISABLE solves.

This article revisits the EPOLL_CTL_DISABLE story from a different angle, with the aim of drawing some lessons about the design of the APIs that the kernel presents to user space. The initial motivation for pursuing this angle arises from the observation that the EPOLL_CTL_DISABLE solution has some difficulties of its own. It is neither intuitive (it relies on some non-obvious details of the epoll implementation) nor easy to use. Furthermore, the solution is somewhat limiting, since it forces the programmer to employ the EPOLLONESHOT flag. Of course, these difficulties arise at least in part because EPOLL_CTL_DISABLE is designed so as to satisfy one of the cardinal rules of Linux development: interface changes must not break existing user-space applications.

If there had been an awareness of the EPOLL_CTL_DISABLE problem when the epoll API was originally designed, it seems likely that a better solution would have been built, rather than bolting on EPOLL_CTL_DISABLE after the fact. Leaving aside the question of what that solution might have been, there's another interesting question: could the problem have been foreseen?

One might suppose that predicting the EPOLL_CTL_DISABLE problem would have been quite difficult. However, the synchronized-state problem is well known and the epoll API was designed to be thread friendly. Furthermore, the notion of employing a user-space cache of the ready list to prevent file descriptor starvation was documented in the epoll(7) man page (see the sections "Example for Suggested Usage" and "Possible Pitfalls and Ways to Avoid Them") that was supplied as part of the original implementation.

In other words, almost all of the pieces of the puzzle were known when the epoll API was designed. The one fact whose implications might not have been clear was the presence of a blocking interface (epoll_wait()) in the API. One wonders if more review (and building of test applications) as the epoll API was being designed might have uncovered the interaction of epoll_wait() with the remaining well-known pieces of the puzzle, and resulted in a better initial design that addressed the EPOLL_CTL_DISABLE problem.

So, the first lesson from the EPOLL_CTL_DISABLE story is that more review is necessary in order to create better API designs (and we'll see further evidence supporting that claim in a moment). Of course, the need for more review is a general problem in all aspects of Linux development. However, the effects of insufficient review can be especially painful when it comes to API design. The problem is that once an API has been released, applications come to depend on it, and it becomes at the very least difficult, or, more likely, impossible to later change the aspects of the API's behavior that applications depend upon. As a consequence, a mistake in API design by one kernel developer can create problems that thousands of user-space developers must live with for many years.

A second lesson about API design can be found in a comment that Paton made when responding to a question from Andrew Morton about the design of EPOLL_CTL_DISABLE. Paton was speculating about whether a call of the form:

    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &epoll_event);

could be used to provide the required functionality. The EPOLL_CTL_DEL operation does not currently use the fourth argument of epoll_ctl(), and applications should specify it as NULL (but more on that point in a moment). The idea would be that "epoll_ctl [EPOLL_CTL_DEL] could set a bit in epoll_event.events (perhaps called EPOLLNOTREADY)" to notify the caller that the file descriptor was in use by another thread.

But Paton noted a shortcoming of this approach:

However, this could cause a problem with any legacy code that relies on the fact that the epoll_ctl epoll_event parameter is ignored for EPOLL_CTL_DEL. Any such code which passed an invalid pointer for that parameter would suddenly generate a fault when running on the new kernel code, even though it worked fine in the past.

In other words, although the EPOLL_CTL_DEL operation doesn't use the epoll_event argument, the caller is not required to specify it as NULL. Consequently, existing applications are free to pass random addresses in epoll_event. If the kernel now started using the epoll_event argument for EPOLL_CTL_DEL, it seems likely that some of those applications would break. Even though those applications might be considered poorly written, that's no justification for breaking them. Quoting Linus Torvalds:

We care about user-space interfaces to an insane degree. We go to extreme lengths to maintain even badly designed or unintentional interfaces. Breaking user programs simply isn't acceptable.

The lesson here is that when an API doesn't use an argument, usually the right thing to do is for the implementation to include a check that requires the argument to have a suitable "empty" value, such as NULL or zero. Failure to do that means that we may later be prevented from making the kind of API extensions that Paton was talking about. (We can leave aside the question of whether this particular extension to the API was the right approach. The point is that the option to pursue this approach was unavailable.) The kernel-user-space API provides numerous examples of failure to do this sort of checking.

However, there is yet more life in this story. Although there have been many examples of system calls that failed to check that "empty" values were passed for unused arguments, it turns out that epoll_ctl(EPOLL_CTL_DEL) fails to include the check for another reason. Quoting the BUGS section of the epoll_ctl() man page:

In kernel versions before 2.6.9, the EPOLL_CTL_DEL operation required a non-NULL pointer in event [the epoll_event argument], even though this argument is ignored. Since Linux 2.6.9, event can be specified as NULL when using EPOLL_CTL_DEL. Applications that need to be portable to kernels before 2.6.9 should specify a non-NULL pointer in event.

In other words, applications that use EPOLL_CTL_DEL are not only permitted to pass random values in the epoll_event argument: if they want to be portable to Linux kernels before 2.6.9 (which fixed the problem), they are required to pass a pointer to some random, but valid user-space address. (Of course, most such applications would simply allocate an unused epoll_event structure and pass a pointer to that structure.) Here, we're back to the first lesson: more review of the initial epoll API design would almost certainly have uncovered this fairly basic design error. (It's this writer's contention that one of the best ways to conduct that sort of review is by thoroughly documenting the API, but he admits to a certain bias on this point.)

Failing to check that unused arguments (or unused pieces of arguments) have "empty" values can cause subtle problems long after the fact. Anyone looking for further evidence on that point does not need to go far: the epoll_ctl() system call provides another example.

Linux 3.5 added a new epoll flag, EPOLLWAKEUP, that can be specified in the epoll_event.events field passed to epoll_ctl(). The effect of this flag is to prevent the system from being suspended while epoll readiness events are pending for the corresponding file descriptor. Since this flag has a system-wide effect, the caller must have a capability, CAP_BLOCK_SUSPEND (initially misnamed CAP_EPOLLWAKEUP).

In the initial EPOLLWAKEUP implementation, if the caller did not have the CAP_BLOCK_SUSPEND capability, then epoll_ctl() returned an error so that the caller was informed of the problem. However, Jiri Slaby reported that the new flag caused a regression: an existing program failed because it was setting formerly unused bits in epoll_event.events when calling epoll_ctl(). When one of those bits acquired a meaning (as EPOLLWAKEUP), the call failed because the program lacked the required capability. The problem of course is that epoll_ctl() has never checked the flags in epoll_event.events to ensure that the caller has specified only flag bits that are actually implemented in the kernel. Consequently, applications were free to pass random garbage in the unused bits.

When one of those random bits suddenly caused the application to fail, what should be done? Following the logic outlined above, of course the answer is that the kernel must change. And that is exactly what happened in this case. A patch was applied so that if the EPOLLWAKEUP flag was specified in a call to epoll_ctl() and the caller did not have the CAP_BLOCK_SUSPEND capability, then epoll_ctl() silently ignored the flag instead of returning an error. Of course, in this case, the calling application might easily carry on, unaware that the request for EPOLLWAKEUP semantics had been ignored.

One might observe that there is a certain arbitrariness about the approach taken to dealing with the EPOLLWAKEUP breakage. Taken to the extreme, this type of logic would say that the kernel can never add new flags to APIs that didn't hitherto check their bit-mask arguments—and there is a long list of such system calls (mmap(), splice(), and timer_settime(), to name just a few). Nevertheless, new flags are added. So, for example, Linux 2.6.17 added the epoll event flag EPOLLRDHUP, and since no one complained about a broken application, the flag remained. It seems likely that the same would have happened for the original implementation of EPOLLWAKEUP that returned an error when CAP_BLOCK_SUSPEND was lacking, if someone hadn't chanced to make an error report.

As an aside to the previous point, in cases where someone reports a regression after an API change has been officially released, there is a conundrum. On the one hand, there may be old applications that depend on the previous behavior; on the other hand, newer applications may already depend on the newly implemented change. At that point, there is no simple remedy: to fix things almost certainly means that some applications must break.

We can conclude with two observations, one specific, and the other more general. The specific observation is that, ironically, EPOLL_CTL_DISABLE itself seems to have had surprisingly little review before being accepted into the 3.7 merge window. And in fact, now that more attention has been focused on it, it looks as though the proposed API will see some changes. So, we have a further, very current, piece of evidence that there is still insufficient review of kernel-user-space APIs.

More generally, the problem seems to be that—while the kernel code gets reviewed on many dimensions—it is relatively uncommon for kernel-user-space APIs to be reviewed on their own merits. The kernel has maintainers for many subsystems. By now, the time seems ripe for there to be a kernel-user-space API maintainer—someone whose job it is to actively review and ack every kernel-user-space API change, and to ensure that test cases and sufficient documentation are supplied with the implementation of those changes. Lacking such a maintainer, it seems likely that we'll see many more cases where kernel developers add badly designed APIs that cause years of pain [PDF] for user-space developers.

Comments (29 posted)

The 2012 realtime minisummit

By Jake Edge
October 24, 2012

As is generally the case when realtime Linux developers get together, the discussion soon turns to how (and when) to get the remaining pieces of the realtime patch set into the mainline. That was definitely the case at the 2012 realtime minisummit, which was held October 18 in conjunction with the 14th Real Time Linux Workshop (RTLWS) in Chapel Hill, North Carolina. Some other topics were addressed as well, of course, and a lively discussion, which Thomas Gleixner characterized as "twelve people sitting around a table not agreeing on anything", ensued. Gleixner's joke was just that, as there was actually a great deal of agreement around that table.

I unfortunately missed the first hour or so of the minisummit, so I am using Darren Hart's notes, Gleixner's recap for the entire workshop on October 19, and some conversations with attendees as the basis for the report on that part of the meeting.

[RTLWS group photo]

Development process

The first topic was on using Bugzilla to track bugs in the realtime patches. Hart and Clark Williams have agreed to shepherd Bugzilla entries to help ensure that bugs have useful information and provide the pieces the developers need to track the problems down. Bugs can now be reported to the kernel Bugzilla using PREEMPT_RT for the "Tree" field. Doing so will send an email to developers who have registered their interest with Hart.

Gleixner has "mixed feelings" about it because it involves "web browsers, mouse clicks and other things developers hate". Previously, the normal way to report a bug was via the realtime or kernel mailing lists, but Bugzilla does provide a way to attach large files (e.g. log files) to bugs, which may prove helpful. The realtime hackers will know better in a year how well Bugzilla is working out and will report on it then, he said.

There was general agreement that the development process for realtime is working well. Currently, Gleixner is maintaining a patch set based on 3.6, which will be turned over to Steven Rostedt when it stabilizes. Rostedt then follows the mainline stable releases and is, in effect, the stable "team" for realtime. Those stable kernels are the ones that users and distributions generally base their efforts on. In the future, Gleixner has plans to update his 3.6-rt tree with incremental patches that have already been merged into other stable realtime kernels (3.0, 3.2, 3.4) to keep it closer to the mainline 3.6 stable release.

There was some discussion of the long-term support initiative (LTSI) kernels and what relationship those kernels have with the realtime stable kernels. The answer is: not much. LTSI plans to have realtime versions of its kernels, but when Hart suggested aligning the realtime kernel versions with those of LTSI, it was not met with much agreement. Gleixner said that the LTSI kernels would likely be supported for years, "probably decades", which is well beyond the scope of what the realtime developers are interested in doing.

3.6 softirq changes

One of the topics that came up frequently as part of both the workshop/minisummit and the extensive hallway/micro-brewery track was Gleixner's softirq processing changes released in 3.6-rt1. The locks for the ten different softirq types have been separated so that the softirqs raised in the context of a thread can be handled in that thread—without having to handle unrelated softirqs. This solves a number of problems with softirq handling (victimizing unrelated threads to process softirqs, configuring separate softirq thread priorities to get the desired behavior, etc.), but is a big change from the existing mainline implementation—as well as from previous realtime patch sets.

In the minisummit, Gleixner emphasized that more testing of the patches is needed. Networking, which is the most extensive user of softirqs in the kernel, needs more testing in particular. But the larger issue is the possibility of eventually eliminating softirqs in the kernel completely. To that end, each of the specific kernel softirq-using subsystems was discussed, with an eye toward eliminating the softirq dependency for both realtime and mainline.

The use of softirqs in the network subsystem is "massive" and even the network developers are not quite sure why it all works, according to Gleixner. But, softirqs seem to work fine for Linux networking, though the definition of "working" is not necessarily realtime friendly. If the kernel can pass the network throughput tests and fill the links on high-speed test hardware, then it is considered to be working. Any alternate solution will have to meet or exceed the current performance, which may be difficult.

The block subsystem's use of softirqs is mostly legacy code. Something like 90% of the deferred work has been shifted to workqueues over the years. Eliminating the rest won't be too difficult, Gleixner said.

The story with tasklets is similar. They should be "easy to get rid of", he said; it will just be a lot of work. Tasklets are typically used by legacy drivers and are not on a performance-critical path. Tasklet handling could be moved to its own thread, Rostedt suggested, but Gleixner thought it would be better to eliminate them entirely.

The timer softirq, which is used for the timer wheel (described and diagrammed in this LWN article), is more problematic. The timer wheel is mostly used for timeouts in the network stack and elsewhere, so it is pretty low priority. It can't run with interrupts disabled in either the mainline or in the realtime kernel, but it has to run somewhere, so pushing it off to ksoftirqd is a possibility.

The high-resolution timers softirq is mostly problematic because of POSIX timers and their signal-delivery semantics. Determining which thread should be the "victim" to deliver the signal to can be a lengthy process, so it is not done in the softirq handler in the realtime patches as it is in mainline. One solution that may be acceptable to mainline developers is to set a flag in the thread which requested the timer, and allow it to do all of the messy victim-finding and signal delivery. That would mean that the thread which requests a POSIX timer pays the price for its semantics.

Williams asked whether users were not being advised to avoid signal-based timers. Gleixner said that he tells users to "use pthreads". But, "customers aren't always reasonable", Frank Rowand observed. He pointed out that some customers he knows of are using floating point in the kernel and, now that they have hardware floating point, want to add that context to what is saved during context switches. Paul McKenney noted that many processors have lots of floating-point registers, which can add multiple hundreds of microseconds to save or restore. Similar problems exist for the auto-vectorization code being added to GCC, which will result in many more registers needing to be saved.

Back to the softirqs, McKenney said that the read-copy-update (RCU) work had largely moved to threads in 3.6, but that not all of the processing had moved out of the softirq. He had tried to move it out completely in an earlier patch, but Linus Torvalds "kicked it out immediately". He has some ideas for addressing those complaints, though, so eliminating the RCU softirq should be possible.

Finally, the scheduler softirq does "nothing useful that I can see", Gleixner said. It mostly consists of heuristics to do load balancing, and Peter Zijlstra may be amenable to moving it elsewhere. Mike Galbraith pointed out that the NUMA scheduling work will make the problem worse, as will power management. ARM's big.LITTLE scheduling could also complicate things, Rowand said.

There is a great deal of interest in getting those changes into the 3.2 and 3.4 realtime kernels. Later in the meeting, Rostedt said that he would create an unstable branch of those kernels to facilitate that. The modifications are "pretty local", Gleixner said, so it should be fairly straightforward to backport the changes. In addition, it is unlikely that backports of other fixes into the mainline stable kernels (which are picked up by the realtime stable kernels) will touch the changed areas, so the ongoing maintenance should not be a big burden.

Upstreaming

Gleixner said that he is "swamped" by a variety of tasks, including stabilizing the realtime tree, the softirq split, and a "huge backlog" of work that needs to be done for the CPU hotplug rework. Part of the latter was merged for 3.7, but there is lots more to do. Rusty Russell has offered to help once Gleixner gets the infrastructure in place, so he needs to "get that out the door". Beyond that, he also spends a lot of time tracking down bugs found by the Open Source Automation Development Lab (OSADL) testing and from Red Hat bug reports.

He needs some help from the other realtime kernel developers in order to move more of the patch set into the mainline. Those in the room seemed very willing to help. The first step is to go through all of the realtime patches and work on any that are "halfway reasonable to get upstream".

One of the top priorities for upstreaming is not a kernel change, but a change needed in the GNU C library (glibc). Gleixner noted that the development process for glibc has gotten a "lot better" recently and that the new maintainers are doing a "great job". That means that a longstanding problem with condvars and priority inheritance may finally be addressed.

When priority inheritance was added to the kernel, Ulrich Drepper wrote the user-space portion for glibc. He had a solution for the problem of condvars not being able to specify that they want to use a priority-inheriting mutex, but that solution was one that Gleixner and Ingo Molnar didn't like, so nothing was added to glibc.

Three years ago, Hart presented a solution at the RTLWS in Dresden, but he was unable to get it into glibc. It is a real problem for users, according to Gleixner and Williams, so Hart's solution (or something derived from it) should be merged into glibc. Hart said he would put that at the top of his list.

SLUB

Another area that should be fairly easy to get upstream is the set of changes to the SLUB allocator that make it work with the realtime code. SLUB developer Christoph Lameter has done some work to make the core allocator lockless and to avoid disabling interrupts or preemption. Lameter's work was mostly aimed at enterprise users on large NUMA systems, but it should also help make SLUB work better with realtime.

If SLUB can be made to work relatively easily, Gleixner would be quite willing to drop support for SLAB. The SLOB allocator is targeted at smaller embedded systems, including those without an MMU, so it is not an interesting target; besides which, SLOB's "performance is terrible", Rostedt said. During the minisummit, Williams was able to build and boot a realtime kernel using SLUB, which "didn't explode right away", Gleixner reported in the recap. That, coupled with SLUB's better NUMA performance, may make it a much better target anyway, he said.

Switching to SLUB might also get rid of a whole pile of "intrusive changes" in the memory allocator code. The realtime memory management changes will be some of the hardest to sell to the upstream developers, so any reduction in the size of those patches will be welcome.

There are a number of places where drivers call local_irq_save() and local_irq_restore() that have been changed in the realtime tree to call *_nort() variants. About 25 files use those variants, mostly in drivers designed for uniprocessor machines that have never been fixed for multiprocessor systems. No one really cares about those drivers any more, Gleixner said, so the _nort changes can either go into the mainline or be trivially maintained out of it.

Spinlocks

Bit spinlocks (i.e. single bits used as spinlocks) need to be changed to support realtime; that change can probably be sold upstream because it would add lockdep coverage. Right now, bit spinlocks are not checked by lockdep, which is a problem for debugging. In converting bit spinlocks to regular spinlocks, Gleixner said, he found three or four locking bugs in the mainline, so having a way to check them would clearly be beneficial.

The problem is that bit spinlocks typically use flag bits in size-constrained structures (e.g. struct page). But, for debugging, it would be acceptable to grow those structures when lockdep is enabled. For realtime, there is a need to just "live with the fact that we are growing some structures", Gleixner said. There aren't that many bit spinlocks in any case; two others that he mentioned were the buffer head lock and the journal head lock.

Hart brought up the sleeping spinlock conversion, but Gleixner said that part is the least of his worries. Most of the annotations needed have already been merged, as have the header file changes. The patches are "really unintrusive now", though it is still a big change.

The CPU hotplug rework should eliminate most of the changes required for realtime once it gets merged. The migrate enable and disable patches are self-contained. The high-resolution timers changes and softirq changes can be fairly self-contained as well. Overall, getting the realtime patches upstream is "not that far away", Gleixner said, though some thought is needed on good arguments to get around the "defensive list" of some mainline developers.

To try to ensure they hadn't skipped over anything, Williams put up a January email from Gleixner with a "to do" list for the realtime patches. There are some printk()-related issues that were on the list. Gleixner said those still linger, and it will be "messy" to deal with them.

Zijlstra was at one time opposed to the explicit migrate enable/disable calls, but that may not be true anymore, Gleixner said. When trying to get the infrastructure merged, though, there will be the question of who uses the code. It is a "hen-egg problem", but there needs to be a way to ensure that processes do not move between CPUs, particularly when handling per-CPU variables.

In the mainline, spinlocks disable preemption (which disables migration), but that is not true in realtime. The current mainline behavior is somewhat "magic"; realtime adds an explicit way to disable migration when that is truly what's needed. As Paul Gortmaker put it, "making it explicit is an argument for it in its own right". Gleixner said he would talk to Zijlstra about a use case and get the code into shape for the mainline.

Gortmaker asked if there were any softirq uses that could be completely eliminated. McKenney believes he can do so for the RCU softirq, though he noted that he has never successfully done so in the past. High-resolution timers and the timer wheel can both move out of softirqs, Gleixner said, though the former may be tricky. The block layer softirq work can be moved to workqueues, but the network stack is the big issue.

One possible solution for the networking softirqs is something Rostedt calls "ENAPI" (even newer API, after "NAPI", the new API). When using threaded interrupt handlers, the polling that is currently done in the softirq handler could be done directly in the interrupt handler thread. If that works, and shows a performance benefit, Gleixner said, the network driver writers will do much of the work on the conversion.

Wait queues are another problem area. While most are "pretty straightforward", there are some where the wait queue has a callback that is invoked on wakeup for every woken task. Those callbacks could do almost anything, including sleep, which prevents those wait queues from being converted to use raw locks. Many can be converted, but not "NFS and other places with massive callbacks", Gleixner said.

There are a number of pieces that should be able to go into the mainline largely uncontested. Code to shorten the time that locks are held and to reduce the interrupt-disabled and preemption-disabled regions is probably non-controversial. The _nort annotations may also fall into that category, as they don't hurt anything in the mainline.

CPU isolation

The final item on the day's agenda was a feature that is not part of the realtime patches, but is of interest to many of the same users: CPU isolation. That feature, also known by names such as "adaptive NOHZ", would allow users to dedicate one or more cores to user-space processing by removing all kernel processing from those cores. Currently, nearly all processing can be moved to other cores using CPU affinity, but some kernel housekeeping (notably CPU time accounting and RCU) will still run on those CPUs.

Frédéric Weisbecker has been working on CPU isolation, and he attended the minisummit at least partly to give an update on the status of the feature. Accounting for CPU time without the presence of the timer tick is one of the areas that needs work. Users still want to see things like load averages reflect the time being spent in user-space processing on an isolated CPU, but that information is normally updated in the timer tick interrupt.

In order to isolate the CPU, though, the timer tick needs to be turned off. In the recap, Gleixner noted that the high-performance computing users of the feature are not so concerned about the time spent in the timer tick (which is minimal) as about the cache effects of running that code. Knocking data and instructions out of the cache can result in a 3% performance hit, which is significant for those workloads.

To account for CPU time usage without the tick, adaptive NOHZ will use the same hooks that RCU uses to calculate the CPU usage. While the CPU is isolated, the CPU time will just be calculated, but won't be updated until the user-space process enters the kernel (e.g. via a system call). The tick might be restarted when system calls are made, which will eventually occur so that the CPU-bound process can report its results or get new data. Restarting the tick would allow the CPU accounting and RCU housekeeping to be done. Weisbecker felt that it should only be restarted if it was needed for RCU; even that might possibly be offloaded to a separate CPU.

That led to a discussion of what restrictions there are on using CPU isolation. There was talk of trying to determine which system calls will actually require restarting the tick, but that was deemed too kernel-version-specific to be useful. The guideline will be that running anything other than a single thread that makes no system calls on the CPU may result in less than 100% of the CPU being available. Gleixner suggested adding a tracepoint that would indicate when the CPU exited isolated mode and why. McKenney suggested a warning like "this code needs a more deterministic universe", to some chuckles around the table. Weisbecker and Rostedt plan to work on CPU isolation in the near term, with an eye toward getting it upstream soon.

And that is pretty much where the realtime minisummit ended. While there is plenty of work still to do, it is clear that there is increasing interest in "finishing" the task of getting the realtime changes merged. In the recap session, Gleixner confessed to being tired of maintaining the patch set, a feeling that is probably shared by others. Given that the mainline has already benefited from merged realtime changes, that seems likely to continue as more goes upstream. The trick will be in convincing the other kernel hackers of that.

[ I would like to thank OSADL and RTLWS for supporting my travel to the minisummit and workshop. ]


Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds