Kernel development
Brief items
Kernel release status
The current development kernel is 3.19-rc3, released on January 5. "It's a day delayed - not because of any particular development issues, but simply because I was tiling a bathroom yesterday. But rc3 is out there now, and things have stayed reasonably calm. I really hope that implies that 3.19 is looking good, but it's equally likely that it's just that people are still recovering from the holiday season."
3.19-rc2 was released, with a minimal set of changes, on December 28.
Stable updates: there have been no stable updates released in the last two weeks. As of this writing, the 3.10.64, 3.14.28, 3.17.8, and 3.18.2 updates are in the review process; they can be expected on or after January 9. Note that 3.17.8 will be the final update in the 3.17 series.
Quotes of the week
Kernel development news
Haunted by ancient history
Kernel development policy famously states that changes are not allowed to break user-space programs; any patch that does break things will be reverted. That policy has been put to the test over the last week, when two such changes were backed out of the mainline repository. These actions demonstrate that the kernel developers are serious about the no-regressions policy, but they also show what's involved in actually living up to such a policy.
The ghost of wireless extensions
Back in the dark days before the turn of the century, support for wireless networking in the kernel was minimal at best. The drivers that did exist mostly tried to make wireless adapters look like Ethernet cards with a few extra parameters. After a while, those parameters were standardized, after a fashion, behind the "wireless extensions" interface. This ioctl()-based interface was never well loved, but it did the job for some years until the developers painted themselves into a corner in 2006. Conflicting compatibility issues brought development of that API to a close; the good news was that there was already a plan to supersede it with the then under-development nl80211 API.
Years later, nl80211 is the standard interface to the wireless subsystem. The wireless extensions, which are now just a compatibility interface over nl80211, have been deprecated for years, and the relevant developers would like to be rid of them entirely. So it was perhaps unsurprising to see a patch merged for 3.19 that took away the ability to configure the wireless extensions into the kernel.
Equally unsurprising, though, was the flurry of complaints that came shortly thereafter. It seems that the wicd network manager still uses the wireless extensions API. But, perhaps more importantly, the user-space tools (iwconfig for example) that were part of the wireless extensions still use it — and they, themselves, are still in use in countless scripts. So this change looked set to break quite a few systems. As a result, Jiri Kosina posted a patch reverting the change and Linus accepted it immediately.
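To see why so much user space is entangled with the old interface, consider what a wireless-extensions-era tool does to probe an interface. The following is a rough, hypothetical sketch in the spirit of iwconfig (the interface name and error handling are purely illustrative); it is not code from the actual tools:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/wireless.h>

    int main(void)
    {
        struct iwreq wrq;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        memset(&wrq, 0, sizeof(wrq));
        /* "wlan0" is just an example interface name */
        strncpy(wrq.ifr_ifrn.ifrn_name, "wlan0", IFNAMSIZ);

        /*
         * SIOCGIWNAME asks the kernel for the wireless protocol name;
         * it fails if the device is not wireless or if the kernel was
         * built without wireless-extensions support.
         */
        if (ioctl(sock, SIOCGIWNAME, &wrq) < 0)
            perror("not a wireless interface (or no wireless extensions)");
        else
            printf("wireless protocol: %s\n", wrq.u.name);

        close(sock);
        return 0;
    }

A kernel built without the wireless-extensions option makes every such ioctl() fail, which is exactly what tools and scripts of this vintage are not prepared for.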
There were complaints from developers that users will never move away from the old commands on their own and that some pushing is required, but it is not the kernel's place to do that pushing. A better approach, as Ted Ts'o suggested, would be to nag users of the old interface about its deprecation rather than removing it outright.
Such an approach would avoid breaking user scripts, but it would still take a long time before all users of the old API had moved over, so the kernel is stuck with supporting the wireless extensions API into the 2020s.
Bogomips
Rather older than the wireless extensions is the concept of "bogomips," an estimation of processor speed used in (some versions of) the kernel for short delay loops. The bogomips value printed during boot (and found in /proc/cpuinfo) is only loosely correlated with the actual performance of the processor, but people like to compare bogomips values anyway. It seems that some user-space code uses the bogomips value for its own purposes as well.
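Code of that sort typically just scrapes the number out of /proc/cpuinfo. A minimal, hypothetical sketch of such a consumer (not taken from any particular program) might look like this:

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>

    /* Return the first BogoMIPS value found in /proc/cpuinfo, or -1.0 on failure */
    static double read_bogomips(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        double value = -1.0;

        if (!f)
            return -1.0;
        while (fgets(line, sizeof(line), f)) {
            /* x86 spells the field "bogomips"; ARM kernels have used "BogoMIPS" */
            if (!strncasecmp(line, "bogomips", 8)) {
                char *colon = strchr(line, ':');
                if (colon && sscanf(colon + 1, "%lf", &value) == 1)
                    break;
            }
        }
        fclose(f);
        return value;
    }

    int main(void)
    {
        printf("bogomips: %.2f\n", read_bogomips());
        return 0;
    }

Code like this simply stops working when the field disappears from /proc/cpuinfo, which is the kind of breakage that followed the ARM change described below.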
If bogomips deserved the "bogo" part of the name back in the beginning, it has only become more deserving over time. Features like voltage and frequency scaling will cause a processor's actual performance to vary over time. The calculated bogomips value can differ significantly depending on how successful the processor is in doing branch prediction while running the calibration loop. Heterogeneous processors make the situation even more complicated. For all of these reasons, the actual use of the bogomips value in the kernel has been declining over time.
The ARM architecture code, on reasonably current processors, does not use that value at all, preferring to poll a high-resolution timer instead. On some subarchitectures the calculated bogomips value differed considerably from what some users thought was right, leading to complaints. In response, the ARM developers decided to simply remove the bogomips value from /proc/cpuinfo entirely. The patch was accepted for the 3.12 release in 2013.
Nearly a year and a half later, Pavel Machek complained that the change broke pyaudio on his system. Noting that others had complained as well, he posted a patch reverting the change. It was, he said, a user-space regression and, thus, contrary to kernel policy.
Reverting this change was not a popular idea in the ARM camp; Nicolas Pitre tried to block it, saying that "No setups actually relying on this completely phony bogomips value bearing no links to hardware reality could have been qualified as 'working'."
Linus was unsympathetic, though, saying that regressions were not to be tolerated and that "The kernel serves user space. That's what we do." The change was duly reverted; ARM kernels starting with 3.19 will export a bogomips value again; one assumes the change will make it into the stable tree as well.
That still leaves the little problem that the bogomips value calculated on current ARM CPUs violates user expectations; people wonder why their shiny new CPU shows up as having 6.0 bogomips. Even ARM systems are expected to be faster than that. The problem, according to Nicolas, is that a constant calculated to help with the timer-based delay loops was being stored as the bogomips value; the traditional bogomips value was no longer calculated at all. There is no real reason, he said, to conflate those two values. So he has posted a patch causing bogomips to be calculated by timing the execution of a tight "do-nothing" loop — the way it was done in the beginning.
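The idea behind the traditional calculation is easy to sketch outside the kernel: time a run of a tight do-nothing loop and scale the loops-per-second figure by the traditional factor of 500,000, on the old assumption that each iteration amounts to roughly two "bogo-instructions". This user-space illustration is only a sketch of the concept, not Nicolas's actual patch:

    #include <stdio.h>
    #include <time.h>

    /* Keep the compiler from optimizing the do-nothing loop away */
    static void __attribute__((noinline)) delay_loop(unsigned long loops)
    {
        volatile unsigned long i;
        for (i = 0; i < loops; i++)
            ;
    }

    int main(void)
    {
        const unsigned long loops = 100000000UL;
        struct timespec start, end;
        double seconds, bogomips;

        clock_gettime(CLOCK_MONOTONIC, &start);
        delay_loop(loops);
        clock_gettime(CLOCK_MONOTONIC, &end);

        seconds = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
        /* Loops per second, scaled by the traditional factor of 500,000 */
        bogomips = loops / seconds / 500000.0;
        printf("roughly %.2f BogoMIPS\n", bogomips);
        return 0;
    }

The number produced this way still says little about real performance, of course, but it at least behaves the way users of the old value expect.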
The bogomips value has long since outlived its value for the kernel itself. It is calculated solely for user space, and, even there, its value is marginal at best. As Alan Cox put it, bogomips is mostly printed "for the user so they can copy it to tweet about how neat their new PC is". But, since some software depends on its presence, the kernel must continue to provide this silly number despite the fact that it reflects reality poorly at best. Even a useless number has value if it keeps programs from breaking.
The problem with nested sleeping primitives
Waiting for events in an operating system is an activity that is fraught with hazards; without a great deal of care, it is easy to miss the event that is being waited for. The result can be an infinite wait — an outcome which tends to be unpopular with users. The relevant code has long since been pulled into the core kernel with the idea that, given the right API, wait-related race conditions can be avoided. Recent experience shows, though, that the situation is not always quite that simple.
Many years ago, kernel code that needed to wait for an event would execute something like this:
    while (!condition)
        sleep_on(&wait_queue);
The problem with this code is that, should the condition become true between the test in the while loop and the call to sleep_on(), the wakeup could be lost and the sleep would last forever. For this reason, sleep_on() was deprecated for a long time and no longer exists in the kernel.
The contemporary pattern looks more like this:
    DEFINE_WAIT(wait);

    while (1) {
        prepare_to_wait(&queue, &wait, state);
        if (condition)
            break;
        schedule();
    }
    finish_wait(&queue, &wait);
Here, prepare_to_wait() will enqueue the thread on the given queue and put it into the given execution state, which is usually either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. Normally, that state will cause the thread to block once it calls schedule(). If the wakeup happens first, though, the process state will be set back to TASK_RUNNING and schedule() will return immediately (or, at least, as soon as it decides this thread should run again). So, regardless of the timing of events, this code should work properly. The numerous variants of the wait_event() macro expand into a similar sequence of calls.
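As an illustration, a call like wait_event(queue, condition) behaves roughly like the following; this is a simplified rendition of the idea rather than the literal macro text found in the kernel's wait.h:

    /* Roughly what wait_event(queue, condition) does -- simplified sketch */
    if (!(condition)) {
        DEFINE_WAIT(__wait);

        for (;;) {
            prepare_to_wait(&queue, &__wait, TASK_UNINTERRUPTIBLE);
            if (condition)
                break;
            schedule();
        }
        finish_wait(&queue, &__wait);
    }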
Signs of trouble can be found in messages like the following, which are turning up on systems running the 3.19-rc kernels:
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff910a0f7a>] prepare_to_wait+0x2a/0x90
This message, the result of some new checks added for 3.19, indicates that a thread is performing an action that could block while it is ostensibly already in a sleeping state. One might wonder how that can be, but it is not that hard to understand in the light of the sleeping code above.
The "condition" checked in that code is often a function call; that function may perform a fair amount of processing on its own, and it may need to acquire locks to check the wakeup condition safely. That is where the trouble comes in: should the condition-checking function call something like mutex_lock(), it will execute its own version of the going-to-sleep code, changing the task state and potentially interfering with the outer sleeping code. For this reason, nesting sleeping primitives in this way is discouraged; the new warning was added to point the finger at code performing this kind of nesting, and it turns out that such nesting happens rather more often than the scheduler developers would have liked.
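In concrete terms, the problematic pattern looks something like the following sketch; the device structure, its mutex, and the data_ready flag are hypothetical:

    DEFINE_WAIT(wait);
    bool done;

    while (1) {
        prepare_to_wait(&queue, &wait, TASK_INTERRUPTIBLE);
        /*
         * Checking the condition requires a lock; mutex_lock() can
         * itself sleep, resetting the task state that was just set
         * by prepare_to_wait() and triggering the new warning.
         */
        mutex_lock(&dev->lock);
        done = dev->data_ready;
        mutex_unlock(&dev->lock);
        if (done)
            break;
        schedule();
    }
    finish_wait(&queue, &wait);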
So what is a developer to do if the need arises to take locks while checking the sleep condition? One solution was added in 3.19; it takes the form of a new pattern that looks like this:
    DEFINE_WAIT_FUNC(wait, woken_wake_function);

    add_wait_queue(&queue, &wait);
    while (1) {
        if (condition)
            break;
        wait_woken(&wait, state, timeout);
    }
    remove_wait_queue(&queue, &wait);
The new wait_woken() function encapsulates most of the logic needed to wait for a wakeup. At first glance, though, it looks like it would suffer from the same problem as sleep_on(): what happens if the wakeup comes between the condition test and the wait_woken() call? The key here is in the use of a special wakeup function called woken_wake_function(). The DEFINE_WAIT_FUNC() macro at the top of the above code sequence associates this function with the wait queue entry, changing what happens when the wakeup arrives.
In particular, that change causes a special flag (WQ_FLAG_WOKEN) to be set in the flags field of the wait queue entry. If wait_woken() sees that flag, it knows that the wakeup already occurred and doesn't block. Otherwise, the wakeup has not occurred, so wait_woken() can safely call schedule() to wait.
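Stripped of the memory barriers and some of the bookkeeping found in the real kernel/sched/wait.c code, the two sides of that handshake look roughly like this:

    long wait_woken(wait_queue_t *wait, unsigned mode, long timeout)
    {
        set_current_state(mode);
        /* If the wakeup already happened, WQ_FLAG_WOKEN is set; skip the sleep */
        if (!(wait->flags & WQ_FLAG_WOKEN))
            timeout = schedule_timeout(timeout);
        __set_current_state(TASK_RUNNING);
        /* Consume the flag so a later trip through the loop can sleep again */
        wait->flags &= ~WQ_FLAG_WOKEN;
        return timeout;
    }

    int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
    {
        /* Record that the wakeup has happened before waking the sleeper */
        wait->flags |= WQ_FLAG_WOKEN;
        return default_wake_function(wait, mode, sync, key);
    }

Because the flag is set before the sleeper is awakened and checked after the task state is set, a wakeup that races with the condition test cannot be lost.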
This pattern solves the problem, but there is a catch: every place in the kernel that might be using nested sleeping primitives needs to be found and changed. There are a lot of places to look for problems and potentially fix, and the fix is not an easy, mechanical change. It would be nicer to come up with a version of wait_event() that doesn't suffer from this problem in the first place or, failing that, with something new that can be easily substituted for wait_event() calls.
Kent Overstreet thinks he has that replacement in the form of the "closure" primitive used in the bcache subsystem. Closures work in a manner similar to wait_woken() in that the wakeup state is stored internally to the relevant data structure; in this case, though, an atomic reference count is used. Interested readers can see drivers/md/bcache/closure.h and closure.c for the details. Scheduler developer Peter Zijlstra is not convinced about the closure code, but he agrees that it would be nice to have a better solution.
The form of that solution is thus unclear at this point. What does seem clear is that the current nesting of sleeping primitives needs to be fixed. So, one way or another, we are likely to see a fair amount of work going into finding and changing problematic calls over the next few development cycles. Until that work is finished, warnings from the new debugging code are likely to be a common event.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet