
Kernel development

Brief items

Kernel release status

The current development kernel is 3.3-rc2, released on January 31 - a little later than would have ordinarily been expected. "The diffstat is pretty flat - indicative of mostly small changes spread out. Which is what I like seeing, and we don't always see at this point. There's some file movement (8250-based serial and the arm mx5 -> imx merge), but otherwise really not a lot of excitement. Good." That said, there are quite a few changes in this prepatch; see the short-form changelog in the announcement for details. Thirteen of those changes are reverts for patches that didn't work out.

Stable updates: there have been no stable updates in the last week. The 2.6.32.56, 3.0.19, and 3.2.3 stable updates are in the review process as of this writing; they can be expected on or after February 3.


Quotes of the week

It wouldn't be the first time lockdep & ftrace live locked the system. Or made it so unbearably slow. Lockdep and ftrace do not play well together. They both are very intrusive. The two remind me of the United States congress. Where there is two parties trying to take control of everything, but nothing ever gets done. We end up with a grid/live lock in the country/computer.
-- Steven Rostedt

I can see some vindictive programmer doing that, while thinking "I'll show these people who pointed out this bug in my code, mhwhahahahaa! I'll fix their test-case while still leaving the real problem unaddressed", but I don't think compiler people are quite *that* evil. Yes, they are evil people who are trying to trip us up, but still..
-- Linus Torvalds

In that way, my philosophy of ext4 is that it should be like the Linux kernel; it's an evolutionary process and central planning is often overrated. People contribute to ext4 for many different reasons, and that means they optimize ext4 for their particular workloads. Like Linus for Linux, we're not trying to architect for "world domination" by saying, "hmm, in order to 'take out' reiserfs4, we'd better implement features foo and bar".
-- Ted Ts'o

You're making the assumption that users are informed and knowledgable, and all filesystem developers should know this is simply not true. Users repeatedly demonstrate that they don't know how filesystems work, don't understand the knobs that are provided, don't understand what their applications do in terms of filesystem operations and don't really understand their data sets. Education takes time and effort, but still users make the same mistakes over and over again.
-- Dave Chinner

Looks like there are more dragons and hidden trapdoors in the drm release path than actual lines of code.
-- Daniel Vetter


Greg Kroah-Hartman moves to the Linux Foundation

The Linux Foundation has announced that Greg Kroah-Hartman has joined the organization as a fellow. "In his role as Linux Foundation Fellow, Kroah-Hartman will continue his work as the maintainer for the Linux stable kernel branch and a variety of subsystems while working in a fully neutral environment. He will also work more closely with Linux Foundation members, workgroups, Labs projects, and staff on key initiatives to advance Linux."


LSF/MM summit deadline approaching

The deadline for requests to attend the 2012 storage, filesystem, and memory management summit is February 5 (the event happens April 1-2 in San Francisco). Any developers who would like to be there and have not expressed their interest should do so in the very near future.


Kernel development news

What happened to disk performance in 2.6.39

By Jonathan Corbet
January 31, 2012
Herbert Poetzl recently reported an interesting performance problem. His SSD-equipped laptop could read data at about 250MB/s with the 2.6.38 kernel, but performance dropped to 25-50MB/s on anything more recent. An order-of-magnitude performance drop is just not the sort of benefit that most people look forward to when upgrading their kernel, so this report quickly gained the attention of a number of developers. The resolution of the problem turned out to be simple, but it offers an interesting view of how high-performance disk I/O works in the kernel.

An explanation of the problem requires just a bit of background, and, in particular, the definition of a couple of terms. "Readahead" is the process of speculatively reading file data into memory with the idea that an application is likely to want it soon. Reasonable performance when reading a file sequentially depends on proper readahead; that is the only way to ensure that reading and consuming the data can be done in parallel. Without readahead, applications will spend more time than necessary waiting for data to be read from disk.

"Plugging," instead, is the process of stopping I/O request submissions to the low-level device for a period of time. The motivation for plugging is to allow a number of I/O requests to accumulate; that lets the I/O scheduler sort them, merge adjacent requests, and apply any sort of fairness policy that might be in effect. Without plugging, I/O requests would tend to be smaller and more scattered across the device, reducing performance even on solid-state disks.

Now imagine that we have a process about to start reading through a long file, as indicated by your editor's unartistic rendering here:

[Bad art]

Once the application starts reading from the beginning of the file, the kernel will set about filling the first readahead window (which is 128KB for larger files) and submit I/O for the second window, so the situation will look something like this:

[Reading begins]

Once the application reads past 128KB into the file, the data it needs will hopefully be in memory. The readahead machinery starts up again, initiating I/O for the window starting at 256KB; that yields a situation that looks something like this:

[Next window]

This process continues indefinitely, with the kernel working to stay ahead of the application so that the data is already in memory by the time the application gets around to reading it.

The 2.6.39 kernel saw some significant changes to how plugging is handled, with the result that the plugging and unplugging of queues is now explicitly managed in the I/O submission code. So, starting with 2.6.39, the readahead code will plug the request queue before it submits a batch of read operations, then unplug the queue at the end. The function that handles basic buffered file I/O (generic_file_aio_read()) also now does its own plugging. And that is where the problems begin.

Imagine a process that is doing large (1MB) reads. As the first large read gets into generic_file_aio_read(), that function will plug the request queue and start working through the file pages already in memory. When it gets to the end of the first readahead window (at 128KB), the readahead code will be invoked as described above. But there's a problem: the queue is still plugged by generic_file_aio_read(), which is still working on that 1MB read request, so the I/O operations submitted by the readahead code are not passed on to the hardware; they just sit in the queue.
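In kernel-sketch terms, the problematic nesting looked roughly like this (a simplified pseudocode rendering, not the literal 2.6.39 code, though blk_start_plug(), blk_finish_plug(), and page_cache_sync_readahead() are the real interfaces involved):

```c
ssize_t generic_file_aio_read(...)
{
	struct blk_plug plug;

	blk_start_plug(&plug);          /* queue is now plugged */
	for (each page in the 1MB request) {
		if (page is not in the page cache)
			/* Readahead submits reads here, but they only
			 * queue up behind the plug - no I/O starts. */
			page_cache_sync_readahead(...);
		copy page to user space;
	}
	blk_finish_plug(&plug);         /* only now does I/O reach the device */
}
```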

So, when the application gets to the end of the second readahead window, we see a situation like this:

[Bummer]

At this point, the application stalls waiting for data that has not yet been read. That stall causes the queue to be unplugged, allowing the readahead I/O requests to be executed at last, but it is too late: the application must now wait for the I/O to complete. That wait is enough to hammer performance, even on solid-state devices.

The fix is to simply remove the top-level plugging in generic_file_aio_read() so that readahead-originated requests can get through to the hardware. Developers who have been able to reproduce the slowdown report that this patch makes the problem go away, so this issue can be considered solved. Look for this fix to appear in a stable kernel release sometime soon.


Preparing for user-space checkpoint/restore

By Jonathan Corbet
January 31, 2012
The addition of checkpoint/restore functionality to Linux has been an ongoing topic of discussion and development for some years now. After the poor reception given to the in-kernel C/R implementation at the end of 2010, that particular project seems to have faded into the background. Instead, most of the interest seems to be in solutions that operate mostly in user space. Depending on the approach taken, most or all of the support needed to implement this functionality in user space already exists. But a complete solution is not yet there.

CRIU

Cyrill Gorcunov has been working to fill in some of the gaps with a preparatory patch set for user-space checkpointing/restore with the "CRIU" tool set. There are a number of small additions to the kernel ABI to be found here:

  • A new children entry in a thread's /proc directory provides a list of that thread's immediate children. This information allows a user-space checkpoint utility to find those child processes without needing to walk through the entire process tree.

  • /proc/pid/stat is extended to provide the bounds of the process's argument and environment arrays, along with the exit code. That allows this information to be reliably captured at checkpoint time.

  • A number of new prctl() options allow the argument and environment arrays to be restored in a way matching what was there at checkpoint time. The desired end result is that ps shows the same information about a process after a checkpoint/restore cycle as it did before.

Perhaps the most significant new feature, though, is the addition of a new system call:

    long kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2);

Checkpoint/restore is meant to work as well on a tree of processes as on a single process. One challenge in the way of meeting that goal is that some of those processes may share resources - files, say, or, perhaps, a whole lot more. Replicating that sharing at restore time is relatively easy; the clone() system call provides a nice set of flags controlling the sharing of resources. The harder part is knowing, at checkpoint time, whether that sharing is taking place.

One way for user space to determine whether, for example, two processes are sharing the same open file would be to query the kernel for the address of the associated struct file and see if they are the same in both processes. That kind of functionality sets off alarms among those concerned about security, though; learning where data structures live in kernel space is often an important precondition to an attack. There was talk for a while of "obfuscating" the pointers - through an exclusive-OR with a random value, for example - but the risk was still seen as being too high. So the compromise is kcmp(), which simply answers the question of whether resources found in two processes are the same or not.

kcmp() takes two process ID parameters, indicating the processes of interest; both processes must be in the same PID namespace as the calling process. The type parameter tells the kernel the specific item that is being compared:

  • KCMP_FILE: determines whether a file descriptor idx1 in the first process is the same as another descriptor (idx2) in the second process.

  • KCMP_FILES: compares the file descriptor arrays to see whether the processes share all files.

  • KCMP_FS: compares fs_struct structures (which hold the current umask, working directory, namespace root, etc.).

  • KCMP_IO: compares the I/O context, used mainly for block I/O scheduling.

  • KCMP_SIGHAND: compares the two processes' signal handler arrays.

  • KCMP_SYSVSEM: compares the list of undo operations associated with SYSV semaphores.

  • KCMP_VM: compares each process's address space.

The return value from kcmp() is zero if the two items are equal, one if the first item is "less" than the second, or two if the first is "greater" than the second. The ordered comparison may seem a little strange, especially when one looks at the implementation and sees that the pointers are obfuscated before comparison within the kernel. The result is, thus, an ordering that (by design) does not match the ordering of the relevant data structures in kernel space. It turns out that even a reshuffled (but consistent) "ordering" is useful for optimizing comparisons in user space when large numbers of open files are present.

This patch set has been through a few cycles of review and seems to have addressed most of the concerns raised by reviewers. It may just find its way in through the next merge window. Meanwhile, people who want to see how the user-space side works can find the relevant code at criu.org.

DMTCP

CRIU is not the only user-space checkpoint/restore implementation out there; the DMTCP (Distributed MultiThreaded CheckPointing) project has been busy since about 2.6.9. DMTCP differs somewhat from CRIU, though; in particular, it is able to checkpoint groups of processes connected by sockets - even across different machines - and it requires no changes to the kernel at all. These features come with a couple of limitations, though.

Checkpoint/restore with DMTCP requires that the target process(es) be started with a special script; it is not possible to checkpoint arbitrary processes on the system. That script uses the LD_PRELOAD mechanism to place wrappers around a number of libc and (especially) system call implementations. As a result, DMTCP has no need to ask the kernel whether two processes are sharing a specific resource; it has been watching the relevant system calls and knows how the processes were created. The disadvantage to this approach - beyond having to run checkpointable processes in a special environment - is that, as can be seen in the table of supported applications, not all programs can be checkpointed.

The recent 1.2.4 release improves support, though, to the point that most applications a wide range of users care about should be checkpointable. The system has been integrated with Open MPI and is able to respond to MPI-generated checkpoint and restore requests. DMTCP is available with the openSUSE, Debian Testing, and Ubuntu distributions. DMTCP may offer something good enough today for many users, who may not need to wait for one of the other projects to be ready sometime in the future.


Betrayed by a bitfield

By Jonathan Corbet
February 1, 2012
Developers tend to fear compiler bugs, and for good reason: such bugs can be hard to find and hard to work around. They can leave traps in a compiled program that spring on users at bad times. Things can get even worse if one person's compiler bug is seen by the compiler's developer as a feature - such issues have a tendency to never get fixed. It is possible that just this kind of feature has turned up in GCC, with unknown impact on the kernel.

One of the many structures used by the btrfs filesystem, defined in fs/btrfs/ctree.h, is:

    struct btrfs_block_rsv {
	u64 size;
	u64 reserved;
	struct btrfs_space_info *space_info;
	spinlock_t lock;
	unsigned int full:1;
    };

Jan Kara recently reported that, on the ia64 architecture, the lock field was occasionally becoming corrupted. Some investigation revealed that GCC was doing a surprising thing when the bitfield full is changed: it generates a 64-bit read-modify-write cycle that reads both lock and full, modifies full, then writes both fields back to memory. If lock had been modified by another processor during this operation, that modification will be lost when lock is written back. The chances of good things resulting from this sequence of events are quite small.

One can imagine that quite a bit of work was required to track down this particular surprise. It is also not hard to imagine the dismay that results from a conversation like this:

I've raised the issue with our GCC guys and they said to me that: "C does not provide such guarantee, nor can you reliably lock different structure fields with different locks if they share naturally aligned word-size memory regions. The C++11 memory model would guarantee this, but that's not implemented nor do you build the kernel with a C++11 compiler."

Unsurprisingly, Linus was less than impressed by this response. Language standards are not written for the unique needs of kernels, he said, and can never "guarantee" the behavior that a kernel needs:

So C/gcc has never "promised" anything in that sense, and we've always had to make assumptions about what is reasonable code generation. Most of the time, our assumptions are correct, simply because it would be *stupid* for a C compiler to do anything but what we assume it does.

But sometimes compilers do stupid things. Using 8-byte accesses to a 4-byte entity is *stupid*, when it's not even faster, and when the base type has been specified to be 4 bytes!

As it happens, the problem is a bit worse than non-specified behavior. Linus suggested running a test with a structure like:

    struct example {
	volatile int a;
	int b:1;
    };

In this case, if an assignment to b causes a write to a, the behavior is clearly buggy: the volatile keyword makes it explicit that a may be accessed from elsewhere. Jiri Kosina gave it a try and reported that GCC is still generating 64-bit operations in this case. So, while the original problem is technically compliant behavior, it almost certainly results from the same decision-making that makes the second example go wrong.

Knowing that may give the kernel community more ammunition to flame the GCC developers with, but it is not necessarily all that helpful. Regardless of the source of the problem, this behavior exists in versions of the compiler that, almost certainly, are being used outside of the development community to build the kernel. So some sort of workaround is likely to be necessary even if GCC's behavior is fixed. That could be a bit of a challenge; auditing the entire kernel for 32-bit-wide bitfield variables in structures that may be accessed concurrently will not be a small job. But, then, nobody said that kernel development was easy.


Patches and updates

Kernel trees

Linus Torvalds: Linux 3.3-rc2
Steven Rostedt: 3.0.18-rt34


Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds