Kernel development
Brief items
Kernel release status
The current development kernel is 3.19-rc3, released on January 5. "It's a day delayed - not because of any particular development issues, but simply because I was tiling a bathroom yesterday. But rc3 is out there now, and things have stayed reasonably calm. I really hope that implies that 3.19 is looking good, but it's equally likely that it's just that people are still recovering from the holiday season."
3.19-rc2 was released, with a minimal set of changes, on December 28.
Stable updates: there have been no stable updates released in the last two weeks. As of this writing, the 3.10.64, 3.14.28, 3.17.8, and 3.18.2 updates are in the review process; they can be expected on or after January 9. Note that 3.17.8 will be the final update in the 3.17 series.
Quotes of the week
Kernel development news
Haunted by ancient history
Kernel development policy famously states that changes are not allowed to break user-space programs; any patch that does break things will be reverted. That policy has been put to the test over the last week, when two such changes were backed out of the mainline repository. These actions demonstrate that the kernel developers are serious about the no-regressions policy, but they also show what's involved in actually living up to such a policy.
The ghost of wireless extensions
Back in the dark days before the turn of the century, support for wireless networking in the kernel was minimal at best. The drivers that did exist mostly tried to make wireless adapters look like Ethernet cards with a few extra parameters. After a while, those parameters were standardized, after a fashion, behind the "wireless extensions" interface. This ioctl()-based interface was never well loved, but it did the job for some years until the developers painted themselves into a corner in 2006. Conflicting compatibility issues brought development of that API to a close; the good news was that there was already a plan to supersede it with the then under-development nl80211 API.
Years later, nl80211 is the standard interface to the wireless subsystem. The wireless extensions, which are now just a compatibility interface over nl80211, have been deprecated for years, and the relevant developers would like to be rid of them entirely. So it was perhaps unsurprising to see a patch merged for 3.19 that took away the ability to configure the wireless extensions into the kernel.
Equally unsurprising, though, was the flurry of complaints that came shortly thereafter. It seems that the wicd network manager still uses the wireless extensions API. But, perhaps more importantly, the user-space tools (iwconfig for example) that were part of the wireless extensions still use it — and they, themselves, are still in use in countless scripts. So this change looked set to break quite a few systems. As a result, Jiri Kosina posted a patch reverting the change and Linus accepted it immediately.
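To see why so much user space is entangled with the old interface, consider what a wireless-extensions-era tool does to probe an interface. The following is a rough, hypothetical sketch in the spirit of iwconfig (the interface name and error handling are purely illustrative); it is not code from the actual tools:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/wireless.h>

    int main(void)
    {
        struct iwreq wrq;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&wrq, 0, sizeof(wrq));
        /* "wlan0" is just an example interface name */
        strncpy(wrq.ifr_ifrn.ifrn_name, "wlan0", IFNAMSIZ);

        /*
         * SIOCGIWNAME asks the kernel for the wireless protocol name;
         * it fails if the device is not wireless or if the kernel was
         * built without wireless-extensions support.
         */
        if (ioctl(sock, SIOCGIWNAME, &wrq) < 0)
            perror("not a wireless interface (or no wireless extensions)");
        else
            printf("wireless protocol: %s\n", wrq.u.name);

        close(sock);
        return 0;
    }

A kernel built without the wireless-extensions option makes every such ioctl() fail, which is exactly what tools and scripts of this vintage are not prepared for.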
There were complaints from developers that users will never move away from the old commands on their own and that some pushing is required, but it is not the kernel's place to do that pushing. A better approach, as Ted Ts'o suggested, would be to nag users of the old interface about its deprecation rather than removing it outright.
Such an approach would avoid breaking user scripts, but it would still take a long time before all users of the old API had moved over, so the kernel is stuck with supporting the wireless extensions API into the 2020s.
Bogomips
Rather older than the wireless extensions is the concept of "bogomips," an estimation of processor speed used in (some versions of) the kernel for short delay loops. The bogomips value printed during boot (and found in /proc/cpuinfo) is only loosely correlated with the actual performance of the processor, but people like to compare bogomips values anyway. It seems that some user-space code uses the bogomips value for its own purposes as well.
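Code of that sort typically just scrapes the number out of /proc/cpuinfo. A minimal, hypothetical sketch of such a consumer (not taken from any particular program) might look like this:

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>

    /* Return the first BogoMIPS value found in /proc/cpuinfo, or -1.0 on failure */
    static double read_bogomips(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        double value = -1.0;

        if (!f)
            return -1.0;
        while (fgets(line, sizeof(line), f)) {
            /* x86 spells the field "bogomips"; ARM kernels have used "BogoMIPS" */
            if (!strncasecmp(line, "bogomips", 8)) {
                char *colon = strchr(line, ':');
                if (colon && sscanf(colon + 1, "%lf", &value) == 1)
                    break;
            }
        }
        fclose(f);
        return value;
    }

    int main(void)
    {
        printf("bogomips: %.2f\n", read_bogomips());
        return 0;
    }

Code like this simply stops working when the field disappears from /proc/cpuinfo, which is the kind of breakage that followed the ARM change described below.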
If bogomips deserved the "bogo" part of the name back in the beginning, it has only become more deserving over time. Features like voltage and frequency scaling will cause a processor's actual performance to vary over time. The calculated bogomips value can differ significantly depending on how successful the processor is in doing branch prediction while running the calibration loop. Heterogeneous processors make the situation even more complicated. For all of these reasons, the actual use of the bogomips value in the kernel has been declining over time.
The ARM architecture code, on reasonably current processors, does not use that value at all, preferring to poll a high-resolution timer instead. On some subarchitectures the calculated bogomips value differed considerably from what some users thought was right, leading to complaints. In response, the ARM developers decided to simply remove the bogomips value from /proc/cpuinfo entirely. The patch was accepted for the 3.12 release in 2013.
Nearly a year and a half later, Pavel Machek complained that the change broke pyaudio on his system. Noting that others had complained as well, he posted a patch reverting the change. It was, he said, a user-space regression and, thus, contrary to kernel policy.
Reverting this change was not a popular idea in the ARM camp; Nicolas Pitre tried to block it, saying that "No setups actually relying on this completely phony bogomips value bearing no links to hardware reality could have been qualified as 'working'."
Linus was unsympathetic, though, saying that regressions were not to be tolerated and that "The kernel serves user space. That's what we do." The change was duly reverted; ARM kernels starting with 3.19 will export a bogomips value again; one assumes the change will make it into the stable tree as well.
That still leaves the little problem that the bogomips value calculated on current ARM CPUs violates user expectations; people wonder why their shiny new CPU shows up as having 6.0 bogomips. Even ARM systems are expected to be faster than that. The problem, according to Nicolas, is that a constant calculated to help with the timer-based delay loops was being stored as the bogomips value; the traditional bogomips value was no longer calculated at all. There is no real reason, he said, to conflate those two values. So he has posted a patch causing bogomips to be calculated by timing the execution of a tight "do-nothing" loop — the way it was done in the beginning.
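The idea behind the traditional calculation is easy to sketch outside the kernel: time a run of a tight do-nothing loop and scale the loops-per-second figure by the traditional factor of 500,000, on the old assumption that each iteration amounts to roughly two "bogo-instructions". This user-space illustration is only a sketch of the concept, not Nicolas's actual patch:

    #include <stdio.h>
    #include <time.h>

    /* Keep the compiler from optimizing the do-nothing loop away */
    static void __attribute__((noinline)) delay_loop(unsigned long loops)
    {
        volatile unsigned long i;
        for (i = 0; i < loops; i++)
            ;
    }

    int main(void)
    {
        const unsigned long loops = 100000000UL;
        struct timespec start, end;
        double seconds, bogomips;

        clock_gettime(CLOCK_MONOTONIC, &start);
        delay_loop(loops);
        clock_gettime(CLOCK_MONOTONIC, &end);

        seconds = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
        /* Loops per second, scaled by the traditional factor of 500,000 */
        bogomips = loops / seconds / 500000.0;
        printf("roughly %.2f BogoMIPS\n", bogomips);
        return 0;
    }

The number produced this way still says little about real performance, of course, but it at least behaves the way users of the old value expect.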
The bogomips value has long since outlived its value for the kernel itself. It is calculated solely for user space, and, even there, its value is marginal at best. As Alan Cox put it, bogomips is mostly printed "for the user so they can copy it to tweet about how neat their new PC is". But, since some software depends on its presence, the kernel must continue to provide this silly number despite the fact that it reflects reality poorly at best. Even a useless number has value if it keeps programs from breaking.
The problem with nested sleeping primitives
Waiting for events in an operating system is an activity that is fraught with hazards; without a great deal of care, it is easy to miss the event that is being waited for. The result can be an infinite wait — an outcome which tends to be unpopular with users. The relevant code has long since been pulled into the core kernel with the idea that, given the right API, wait-related race conditions can be avoided. Recent experience shows, though, that the situation is not always quite that simple.
Many years ago, kernel code that needed to wait for an event would execute something like this:
    while (!condition)
        sleep_on(&wait_queue);
The problem with this code is that, should the condition become true between the test in the while loop and the call to sleep_on(), the wakeup could be lost and the sleep would last forever. For this reason, sleep_on() was deprecated for a long time and no longer exists in the kernel.
The contemporary pattern looks more like this:
    DEFINE_WAIT(wait);

    while (1) {
        prepare_to_wait(&queue, &wait, state);
        if (condition)
            break;
        schedule();
    }
    finish_wait(&queue, &wait);
Here, prepare_to_wait() will enqueue the thread on the given queue and put it into the given execution state, which is usually either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. Normally, that state will cause the thread to block once it calls schedule(). If the wakeup happens first, though, the process state will be set back to TASK_RUNNING and schedule() will return immediately (or, at least, as soon as it decides this thread should run again). So, regardless of the timing of events, this code should work properly. The numerous variants of the wait_event() macro expand into a similar sequence of calls.
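As an illustration, a call like wait_event(queue, condition) behaves roughly like the following; this is a simplified rendition of the idea rather than the literal macro text found in the kernel's wait.h:

    /* Roughly what wait_event(queue, condition) does -- simplified sketch */
    if (!(condition)) {
        DEFINE_WAIT(__wait);

        for (;;) {
            prepare_to_wait(&queue, &__wait, TASK_UNINTERRUPTIBLE);
            if (condition)
                break;
            schedule();
        }
        finish_wait(&queue, &__wait);
    }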
Signs of trouble can be found in messages like the following, which are turning up on systems running the 3.19-rc kernels:
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
This message, the result of some new checks added for 3.19, indicates that a thread is performing an action that could block while it is ostensibly already in a sleeping state. One might wonder how that can be, but it is not that hard to understand in the light of the sleeping code above.
The "condition" checked in that code is often a function call; that function may perform a fair amount of processing on its own, and it may need to acquire locks to check the wakeup condition safely. That is where the trouble comes in: should the condition-checking function call something like mutex_lock(), it will execute its own version of the going-to-sleep code, changing the task state and potentially interfering with the outer sleeping code. For this reason, nesting sleeping primitives in this way is discouraged; the new warning was added to point the finger at code performing this kind of nesting, and it turns out that such nesting happens rather more often than the scheduler developers would have liked.
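In concrete terms, the problematic pattern looks something like the following sketch; the device structure, its mutex, and the data_ready flag are hypothetical:

    DEFINE_WAIT(wait);
    bool done;

    while (1) {
        prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
        /*
         * Checking the condition requires a lock; mutex_lock() can
         * itself sleep, resetting the task state that was just set
         * by prepare_to_wait() and triggering the new warning.
         */
        mutex_lock(&dev->lock);
        done = dev->data_ready;
        mutex_unlock(&dev->lock);
        if (done)
            break;
        schedule();
    }
    finish_wait(&queue, &wait);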
So what is a developer to do if the need arises to take locks while checking the sleep condition? One solution was added in 3.19; it takes the form of a new pattern that looks like this:
    DEFINE_WAIT_FUNC(wait, woken_wake_function);

    add_wait_queue(&queue, &wait);
    while (1) {
        if (condition)
            break;
        wait_woken(&wait, state, timeout);
    }
    remove_wait_queue(&queue, &wait);
The new wait_woken() function encapsulates most of the logic needed to wait for a wakeup. At first glance, though, it looks like it would suffer from the same problem as sleep_on(): what happens if the wakeup comes between the condition test and the wait_woken() call? The key here is in the use of a special wakeup function called woken_wake_function(). The DEFINE_WAIT_FUNC() macro at the top of the above code sequence associates this function with the wait queue entry, changing what happens when the wakeup arrives.
In particular, that change causes a special flag (WQ_FLAG_WOKEN) to be set in the flags field of the wait queue entry. If wait_woken() sees that flag, it knows that the wakeup already occurred and doesn't block. Otherwise, the wakeup has not occurred, so wait_woken() can safely call schedule() to wait.
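Stripped of the memory barriers and some of the bookkeeping found in the real kernel/sched/wait.c code, the two sides of that handshake look roughly like this:

    long wait_woken(wait_queue_t *wait, unsigned mode, long timeout)
    {
        set_current_state(mode);
        /* If the wakeup already happened, WQ_FLAG_WOKEN is set; skip the sleep */
        if (!(wait->flags & WQ_FLAG_WOKEN))
            timeout = schedule_timeout(timeout);
        __set_current_state(TASK_RUNNING);
        /* Consume the flag so a later trip through the loop can sleep again */
        wait->flags &= ~WQ_FLAG_WOKEN;
        return timeout;
    }

    int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
    {
        /* Record that the wakeup has happened before waking the sleeper */
        wait->flags |= WQ_FLAG_WOKEN;
        return default_wake_function(wait, mode, sync, key);
    }

Because the flag is set before the sleeper is awakened and checked after the task state is set, a wakeup that races with the condition test cannot be lost.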
This pattern solves the problem, but there is a catch: every place in the kernel that might be using nested sleeping primitives needs to be found and changed. There are a lot of places to look for problems and potentially fix, and the fix is not an easy, mechanical change. It would be nicer to come up with a version of wait_event() that doesn't suffer from this problem in the first place or, failing that, with something new that can be easily substituted for wait_event() calls.
Kent Overstreet thinks he has that replacement in the form of the "closure" primitive used in the bcache subsystem. Closures work in a manner similar to wait_woken() in that the wakeup state is stored internally to the relevant data structure; in this case, though, an atomic reference count is used. Interested readers can see drivers/md/bcache/closure.h and closure.c for the details. Scheduler developer Peter Zijlstra is not convinced about the closure code, but he agrees that it would be nice to have a better solution.
The form of that solution is thus unclear at this point. What does seem clear is that the current nesting of sleeping primitives needs to be fixed. So, one way or another, we are likely to see a fair amount of work going into finding and changing problematic calls over the next few development cycles. Until that work is finished, warnings from the new debugging code are likely to be a common event.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet