
Kernel development

Brief items

Kernel release status

The 3.9 kernel is out, released by Linus on April 28. Headline features in 3.9 include KVM virtualization on the ARM architecture, the near-completion of user namespace support, PowerClamp support, the dm-cache device mapper target, RAID5/6 support in the Btrfs filesystem, the ARC and MetaG architectures, and more. See the KernelNewbies 3.9 page for lots of details.

The 3.10 merge window is open; see the separate summary below for an overview of what has been merged so far.

Stable updates: 3.8.9, 3.4.42, 3.2.44 and 3.0.75 all came out on April 25; 3.8.10 followed one day later with a fix for a build problem. 3.8.11, 3.4.43, and 3.0.76 were released on May 1.

Comments (none posted)

Quotes of the week

If you're using custom kernels with [CyanogenMod] (particularly with nightly builds), be aware of this. Things will break. We aren't conspiring against you, nor are we "violating the spirit of the GPL". At the same time, I will freely admit that we aren't looking out for you either. If you want to run a fork of a single subsystem of a fast-moving highly interdependent codebase, you will find dragons waiting for you.
Steve Kondik

I am trying to avoid perpetrating^Winventing new RCU flavors, at least for the next week or two.

Speaking as an IBMer, I can only hang my head in shame at my inability thus far to come up with an acronym that is at least five letters long, but consisting only of vowels.

Paul McKenney

Comments (8 posted)

Deloget: The SoC GPU driver interview

In a lengthy blog post, Emmanuel Deloget interviews nine developers of GPU drivers and tools for ARM system-on-chip (SoC) devices. Questions range from the status of various projects and how the projects got started to intra-project collaboration and the future of the ARM platform. The developers and projects are: Connor Abbott - Open GPU Tools, Eric Faye-Lund - grate (for Tegra GPUs), Herman H. Hermitage - Videocore (for Broadcom GPUs), Luc Verhaegen - lima (for MALI GPUs), Matthias Gottschlag - Videocore (for Broadcom GPUs), Rob Clark - freedreno (for Adreno GPUs), Thierry Reding - grate (for Tegra GPUs), Scott Mansell - Videocore (for Broadcom GPUs), and Wladimir J. van der Laan - etna_viv (for Vivante GPUs).

Comments (3 posted)

Output redirection vulnerabilities in recent kernels

Andy Lutomirski has posted a description of a set of security vulnerabilities fixed in recent stable updates. One is a fairly severe user namespace vulnerability that appeared in the 3.8 kernel; another dates back to 2.6.36. Exploit code is included.

Full Story (comments: 12)

Three Outreach Program for Women kernel internships available

The Linux Foundation has announced that it will be supporting three kernel internships for the upcoming Outreach Program for Women cycle. "The official deadline for applying to OPW is May 1st. However, the kernel project joined late, so that deadline is flexible. Please fill out your initial application, and then update by May 17th with your initial patch." Acceptance in the program brings a $5000 stipend plus $500 in travel funding.

Comments (1 posted)

Kernel development news

What's coming in 3.10, part 1

By Jonathan Corbet
May 1, 2013
As of this writing, nearly 5,100 non-merge changesets have been pulled into the mainline repository for the 3.10 development cycle. That is a pretty good pace, given that this merge window only opened on April 29. A number of interesting new features have been added, with more to come.

User-visible changes merged at the beginning of the 3.10 cycle include:

  • The "ftrace" tracing facility has seen a number of improvements. At the top of the list, arguably, is the ability to establish multiple buffers for tracing information and to direct specific events to different buffers. Additionally, it is now possible to set up a trigger that will enable or disable specific events when a given kernel function is called, and it is possible to get a stack trace when a given function is called. See the improved ftrace.txt document for lots of details.

  • Control groups have seen a fair amount of work in this release, some of which is implementing the improvements planned over one year ago. The device and perf_event groups now offer full hierarchical support. There is a new mount option (with the unwieldy name "__DEVEL__sane_behavior") that tries to ensure more consistent hierarchy behavior across all groups; it is obviously meant for development rather than production, but it gives some clues about what is coming.

  • Applications within control groups can now request memory pressure notifications when the system is running low on available memory. See the new section 11 at the end of Documentation/cgroups/memory.txt for details on how to use this feature.

  • "Return probes" are now supported in user-space probing; they allow the activation of a breakpoint when a target function returns to its caller.

  • The generation of POSIX timer IDs has changed; IDs are no longer guaranteed to be unique across the system. A process's timers can now be queried by reading /proc/PID/timers. These changes make it possible for the checkpoint/restart feature to restore active timers without changing their IDs.

  • POSIX and high-resolution timers support a new clock (CLOCK_TAI) which operates in international atomic time.

  • The perf command (along with the perf_events subsystem) can now do memory access profiling.

  • The iSCSI subsystem has gained support for the iSCSI extensions for RDMA (iSER) protocol.

  • CPU frequency governors can now exist in multiple instances with different tuning parameters. This feature is needed in heterogeneous multiprocessing systems (ARM big.LITTLE, for example) to allow different CPUs to run with different parameters.

  • The ability to run scripts as executables (with the interpreter specified using the "#!" sequence) can now be built as a module — or left out altogether for systems that run no scripts.

  • New hardware support includes:

    • Display and graphics: Microsoft Hyper-V synthetic video devices and ILI Technology ILI9221/ILI9222 backlight controllers.

    • Hardware monitoring: Analog Devices ADT7310/ADT7320 temperature monitors, Nuvoton NCT6779D hardware monitoring chips, National Semiconductor LM95234 temperature sensors, and ST-Ericsson AB8500 thermal monitors.

    • Input: Apple infrared receivers.

    • Miscellaneous: Advantech PCI-1724U analog output cards, Analog Devices AD7923 analog to digital interfaces, Qualcomm single-wire serial bus interfaces, Broadcom BCM2835 SPI controllers, Aeroflex Gaisler GRLIB SPI controllers, NVIDIA Tegra114 SPI controllers, Silicon Labs 5351A/B/C programmable clock generators, on-chip static RAM regions, TI TPS65090 battery chargers, Samsung EXYNOS5440 CPU frequency controllers, and ARM big.LITTLE CPU frequency controllers.

    • Networking: Netlogic XLR/XLS network interfaces.

    • USB: DesignWare USB2 USB controllers.

    • Video4Linux: Rafael Micro R820T silicon tuners, ITE Tech IT913x silicon tuners, OmniVision OV7640 sensors, Philips UDA1342 audio codecs, Techwell TW9903, TW9906 and TW2804 video decoders, Silicon Laboratories 4761/64/68 AM/FM radios, Silicon Laboratories Si476x I2C FM radios, and Samsung EXYNOS4x12 FIMC-IS imaging subsystems.

    Note also that the Android "configurable composite gadget" driver has been removed from the staging tree. It is apparently difficult to maintain and no current hardware uses it.

Changes visible to kernel developers include:

  • The devtmpfs filesystem now provides drivers with the ability to specify which user and group ID should own a given device. This capability is somewhat controversial — there is resistance to encoding user and group ID policy in the kernel — but it will be useful for systems like Android.

  • The staging tree has gained a new "sync" driver (from Android) that can be used for synchronization between other drivers.

  • There is a new "dummy-irq" driver that does nothing other than register an interrupt handler. It exists to debug IRQ sharing problems by forcing the enabling of a specific interrupt line.

  • A lot of the low-level USB PHY access functions have been changed to GPL-only exports.

  • The new idr_alloc_cyclic() function allocates ID numbers in a cyclic fashion: when the given range is exhausted, allocations will start again at the beginning of that range.

  • The workqueue subsystem has seen some substantial reworking which, among other things, should make it perform better on NUMA systems. There is also a new sysfs interface that can be used to tweak some workqueue parameters.
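The cyclic ID allocation mentioned above can be sketched briefly. In this example (which is illustrative only; the session structure and the [0,1000) range are made up for the purpose), a driver hands out session IDs in round-robin order rather than always reusing the lowest free number:

```c
static DEFINE_IDR(session_idr);

struct session {
	int id;
	/* ... */
};

static int register_session(struct session *s)
{
	/*
	 * Allocate the next ID after the most recently assigned one,
	 * wrapping around to the start of [0, 1000) when the end of
	 * the range is reached.
	 */
	int id = idr_alloc_cyclic(&session_idr, s, 0, 1000, GFP_KERNEL);

	if (id < 0)
		return id;	/* -ENOMEM or -ENOSPC */
	s->id = id;
	return 0;
}
```

Cyclic allocation is useful when quick reuse of a just-freed ID could confuse user space into confusing a new object with a recently-deleted one.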

If the usual pattern holds, this merge window should remain open until around May 12 and the 3.10 kernel can be expected in early July. As usual, LWN will follow the mainline as the merge window progresses.

Comments (none posted)

Wait/wound mutexes

By Jonathan Corbet
May 1, 2013
Developers wanting to add new locking primitives to the kernel tend to be received with a certain amount of skepticism. The kernel is already well equipped with locking mechanisms, and experience shows that new mechanisms tend to be both unnecessary and hard to get right. The "wait/wound mutex mechanism" proposed by Maarten Lankhorst may well get that kind of response. But it is an interesting approach to a specific locking problem that merits a closer look.

A conceptual overview

Situations where multiple locks must be held simultaneously pose a risk of deadlocks: if the order in which those locks are acquired is not always the same, there will eventually come a time when two threads find themselves blocked, each waiting for the other to release a lock. Kernel code tends to be careful about lock ordering, and the "lockdep" checking tool has gotten quite good about finding code that violates the rules. So deadlocks are quite rare, despite the huge number of locks used by the kernel.

But what about situations where the ordering of lock acquisition cannot be specified in advance, or, even worse, is controlled by user space? Maarten's patch describes one such scenario: a chain of buffers used with the system's graphical processing unit (GPU). These buffers must, at various times, be "owned" by the GPU itself, the GPU driver, user space, and, possibly, another driver completely, such as for a video frame grabber. User space can submit the buffers for processing in an arbitrary order, and the GPU may complete them in a different order. If locking is used to control the ownership of the buffers, and if multiple buffers must be manipulated at once, avoiding deadlocks could become difficult.

Imagine a simple situation where there are two buffers of interest:

[buffers]

Imagine further that we have two threads (we'll call them T1 and T2) that attempt to lock both buffers in the opposite order: T1 starts with Buffer A, while T2 starts with Buffer B. As long as they do not both try to grab the buffers at the same time, things will work. But, someday, each will succeed in locking one buffer and a situation like this will develop:

[locked buffers]

The kernel's existing locking primitives have no answer to a situation like this other than "don't do that." The wait/wound mutex, instead, is designed for just this case. In general terms, what will happen in this situation is:

  • The thread that "got there first" will simply sleep until the remaining buffer becomes available. If T1 started the process of locking the buffers first, it will be the thread that waits.

  • The other thread will be "wounded," meaning that it will be told it must release any locks it holds and start over from scratch.

If T2 is wounded, the deadlock will be resolved by telling T2 to release Buffer B; it must then wait until that buffer becomes available again and start over. The situation will then look something like this:

[locked buffers]

Once T1 has released the buffers, T2 will be able to retry and, presumably, make forward progress on its task.

The details

The first step toward using a set of locks within the wait/wound mechanism is to define a "class"; this class is essentially a context within which the locks are to be acquired. When multiple threads contend for the same locks, they must do so using the same class. A wait/wound class is defined with:

    #include <linux/mutex.h>

    static DEFINE_WW_CLASS(my_class);

As far as users of the system are concerned, the class needs to exist, but it is otherwise opaque; there is no explicit initialization required. Internally, the main purpose for the class's existence is to hold a sequence number (an atomic counter) used to answer the "who got there first" question; it also contains some information used by lockdep to verify correct use of wait/wound locks.

The acquisition of a specific set of locks must be done within a "context" that tracks the specific locks held. Before acquiring the first lock, a call should be made to:

    void ww_acquire_init(struct ww_acquire_ctx *ctx, struct ww_class *ww_class);

This call will assign a sequence number to the context and do a bit of record keeping. Once that has been done, it is possible to start acquiring locks:

    int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx);

If the lock has been successfully acquired, the return value will be zero. When all goes well, the thread will manage to acquire all of the locks it needs. Once that process is complete, that fact should be signaled with:

    void ww_acquire_done(struct ww_acquire_ctx *ctx);

This function is actually a no-op in the current implementation, but that could change in the future. After this call, the processing of the locked data can proceed normally. Once the job is done, it is time to release the locks and clean up:

    void ww_mutex_unlock(struct ww_mutex *lock);
    void ww_acquire_fini(struct ww_acquire_ctx *ctx);

Each held lock should be released with ww_mutex_unlock(); once all locks have been released, the context should be cleaned up with ww_acquire_fini().
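Putting those calls together, the no-contention path for locking two buffers might look like the following sketch (struct buffer, its embedded ww_mutex, and buf_class are hypothetical names invented for illustration; handling of contention is deferred to the discussion that follows):

```c
static DEFINE_WW_CLASS(buf_class);

struct buffer {
	struct ww_mutex lock;
	/* ... buffer data ... */
};

static void process_buffers(struct buffer *a, struct buffer *b)
{
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &buf_class);
	ww_mutex_lock(&a->lock, &ctx);	/* return values checked below */
	ww_mutex_lock(&b->lock, &ctx);
	ww_acquire_done(&ctx);

	/* ... work with both buffers ... */

	ww_mutex_unlock(&b->lock);
	ww_mutex_unlock(&a->lock);
	ww_acquire_fini(&ctx);
}
```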

The description above covers the case where all goes well, but it leaves out an important case that all wait/wound mutex users must handle: the detection of a potential deadlock. That case comes about whenever an attempt is made to lock a ww_mutex that is already locked; in this case, there are three possible outcomes.

The first of these comes about if the locking thread already holds that ww_mutex and is attempting to lock it for a second time. With ordinary mutexes, this would be an error, but the wait/wound mechanism is designed for this case. Evidently, sometimes, the ordering of the locking is so poorly defined that multiple locking attempts can happen. In such cases, ww_mutex_lock() will return -EALREADY. The locking thread, assuming it knows how to respond to -EALREADY, can continue about its business.

The second possibility is that the sequence number in the context for the locking process is higher than the number associated with the thread already holding the lock. In this case, the new caller gets "wounded"; ww_mutex_lock() will return -EDEADLK to signal that fact. The wounded thread is expected to clean up and get out of the way. "Cleaning up" means releasing all locks held under the relevant context with calls to ww_mutex_unlock(). Once all of the locks are free, the wounded thread can try again, but only when the contended lock is released by the victorious thread; waiting for that to happen is done with:

    void ww_mutex_lock_slow(struct ww_mutex *lock, struct ww_acquire_ctx *ctx);

This function will block the calling thread until lock becomes free; once it returns, the thread can try again to acquire all of the other locks it needs. It is entirely possible that this thread could, once again, fail to acquire all of the needed locks. But, since the sequence number increases monotonically, a once-wounded thread must eventually reach a point where it has the highest priority and will win out.

The final case comes about when the new thread's sequence number is lower than that of the thread currently holding the lock. In this case, the new thread will simply block in ww_mutex_lock() until the lock is freed. If the thread holding the contended lock attempts to acquire another lock that is already held by the new thread, it will get the -EDEADLK status at that point; it will then release the contended lock and let the new thread proceed. Going back to the example above:

[locked buffers]

Thread T1, holding the lower sequence number, will wait for Buffer B to be unlocked, while thread T2 will see -EDEADLK when it attempts to lock Buffer A.
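The backoff-and-retry protocol can be sketched in code. This is an illustrative pattern, not code taken from the patch set; struct buffer (with an embedded ww_mutex) and buf_class are hypothetical names:

```c
static void lock_buffers(struct buffer *a, struct buffer *b)
{
	struct ww_acquire_ctx ctx;
	struct ww_mutex *first = &a->lock, *second = &b->lock;
	int ret;

	ww_acquire_init(&ctx, &buf_class);

	ww_mutex_lock(first, &ctx);
	ret = ww_mutex_lock(second, &ctx);
	while (ret == -EDEADLK) {
		/* Wounded: release what we hold, then wait for the
		 * contended lock to become free. */
		ww_mutex_unlock(first);
		ww_mutex_lock_slow(second, &ctx);
		/* Now holding the once-contended lock; retry the other,
		 * which may itself report -EDEADLK against an even
		 * older context. */
		swap(first, second);
		ret = ww_mutex_lock(second, &ctx);
	}
	ww_acquire_done(&ctx);
}
```

Because the sequence number assigned by ww_acquire_init() never changes across retries, each trip around the loop can only improve this thread's relative age, so the loop must eventually terminate.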

The documentation in the patch does not describe what happens if the thread holding the contended lock never calls ww_mutex_lock() again, and thus never learns that it is supposed to back off. But such a holder must necessarily have already acquired all of the locks it needs, so there should be no reason why it cannot simply finish its work and release the locks normally; the end result will be the same.

Conclusion

Needless to say, there are plenty of details that have not been covered here; see the ww-mutex-design.txt document included with the patch set for more information.

In that document, there are code examples for three different ways of working with wait/wound mutexes. One need not read for long to conclude that the API looks a bit complex and tricky to use; it will be far harder to write correct locking code using this facility than it would be with normal mutexes. Perhaps that complexity is necessary, and it seems certain that this mechanism will not be needed in many places in the kernel, so the complexity should not spread too far. But an API like this can be expected to raise some eyebrows.

What is missing at this point is any real code that uses wait/wound mutexes. Kernel developers will certainly want to see some examples of where this kind of locking mechanism is needed. After all, the kernel has made it through its first two decades without this kind of complex locking; convincing the community that this feature is now necessary is going to take a strong sales effort. That is best done by showing how wait/wound mutexes solve a problem that cannot be easily addressed otherwise. Until that is done, wait/wound mutexes are likely to remain an interesting bit of code on the sidelines.

Comments (15 posted)

LSFMM Summit coverage complete

By Jonathan Corbet
May 1, 2013
Since the first notes from the 2013 Linux Storage, Filesystem, and Memory Management Summit were posted, we have been busy filling in notes from the remaining sessions. We now have notes from every scheduled session at the summit.

Since the initial posting, the following sessions have been added:

Combined Filesystem/Storage sessions

  • dm-cache and bcache: the future of two storage-caching technologies for Linux.

  • Error returns: filesystems could use better error information from the storage layer.

  • Storage management: how do we ease the task of creating and managing filesystems on Linux systems?

  • O_DIRECT: the kernel's direct I/O code is complicated, fragile, and hard to change. Is it time to start over?

  • Reducing io_submit() latency: submitting asynchronous I/O operations can potentially block for long periods of time, which is not what callers want. Various ways of addressing this problem were discussed, but easy solutions are not readily at hand.

Filesystem-only sessions

  • NFS status: what is going on in the NFS subsystem.

  • Btrfs status: what has happened with the next-generation Linux filesystem, and when will it be ready for production use?

  • User-space filesystem servers: what can the kernel do to support user-space servers like Samba and NFS-GANESHA?

  • Range locking: a proposal to lock portions of files within the kernel.

  • FedFS: where things stand with the creation of a Federated Filesystem implementation for Linux.

Storage-only sessions

  • Reducing SCSI latency. The SCSI stack is having a hard time keeping up with the fastest drives; what can be done to speed things up?

  • SCSI testing. It would be nice to have a test suite for SCSI devices; after this session, one may well be in the works.

  • Error handling and firmware updates: some current problems with handling failing drives, and how can we do online firmware updates on SATA devices?

Thanks are due to those who helped with the creation of these writeups. We would like to thank Elena Zannoni in particular for providing comprehensive notes from the Storage track.

Comments (none posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.9 released
Alexandre Oliva GNU Linux-libre 3.9-gnu
Greg KH Linux 3.8.11
Greg KH Linux 3.8.10
Sebastian Andrzej Siewior 3.8.10-rt6
Greg KH Linux 3.8.9
Sebastian Andrzej Siewior 3.8.9-rt4
Greg KH Linux 3.4.43
Greg KH Linux 3.4.42
Steven Rostedt 3.4.42-rt56
Steven Rostedt 3.4.41-rt55-feat3
Ben Hutchings Linux 3.2.44
Steven Rostedt 3.2.44-rt64
Steven Rostedt 3.2.43-rt63-feat2
Greg KH Linux 3.0.76
Greg KH Linux 3.0.75
Steven Rostedt 3.0.75-rt102

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Security-related

Miscellaneous

Karel Zak util-linux 2.23
Stephen Hemminger iproute2 3.9.0

Page editor: Jonathan Corbet


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds