Kernel development

Brief items

Kernel release status

The current development kernel is 3.12-rc1, released on September 16. Linus said: "I personally particularly like the scalability improvements that got merged this time around. The tty layer locking got cleaned up and in the process a lot of locking became per-tty, which actually shows up on some (admittedly odd) loads. And the dentry refcount scalability work means that the filename caches now scale very well indeed, even for the case where you look up the same directory or file (which could historically result in contention on the per-dentry d_lock)."

Stable updates: 3.0.96, 3.4.62, 3.10.12, and 3.11.1 were all released on September 14.

Quote of the week

Yo Dawg, I heard you like kernel compiles, so I put a kernel compile in your kernel compile so that you can compile the kernel while you compile the kernel.
Linus Torvalds (Thanks to Josh Triplett)

The Linux Foundation's kernel development report is out

The Linux Foundation has announced the release of its roughly annual report on the kernel development community; this report is written by Greg Kroah-Hartman, Amanda McPherson, and LWN editor Jonathan Corbet. There won't be much new there for those who follow the development statistics on LWN, but it does take a somewhat longer-term perspective.

The OpenZFS project launches

The OpenZFS project has announced its existence. "ZFS is the world's most advanced filesystem, in active development for over a decade. Recent development has continued in the open, and OpenZFS is the new formal name for this open community of developers, users, and companies improving, using, and building on ZFS. Founded by members of the Linux, FreeBSD, Mac OS X, and illumos communities, including Matt Ahrens, one of the two original authors of ZFS, the OpenZFS community brings together over a hundred software developers from these platforms."

Kernel development news

The end of the 3.12 merge window

By Jonathan Corbet
September 17, 2013
Despite toying with the idea of closing the merge window rather earlier than expected, Linus did, in the end, keep it open until September 16. He repeated past grumbles about maintainers who send their pull requests at the very end of the merge window, though; increasingly, it seems that wise maintainers should behave as if the merge window were a single week in length. Pull requests that are sent too late run a high risk of being deferred until the next development cycle.

In the end, 9,479 non-merge changesets were pulled into the mainline repository for the 3.12 merge window; about 1,000 of those came in after the writing of last week's summary. Few of the changes merged in the final days of the merge window were hugely exciting, but there have been a number of new features and improvements. Some of the more significant, user-visible changes include:

  • Unlike its predecessor, the 3.12 kernel will not be known as "Linux for Workgroups." Instead, for reasons that are not entirely clear, the new code name was "Suicidal Squirrel" for a few days; it was then changed to "One giant leap for frogkind."

  • It is now possible to provide block device partition tables on the kernel command line; see Documentation/block/cmdline-partition.txt for details and the brief example after this list.

  • The memory management subsystem has gained the ability to migrate huge pages between NUMA nodes.

  • The Btrfs filesystem has the beginning of support for offline deduplication of data blocks. A new ioctl() command (BTRFS_IOC_FILE_EXTENT_SAME) can be used by a user-space program to inform the kernel of extents in two different files that contain the same data. The kernel will, after checking that the data is indeed the same, cause the two files to share a single copy of that data; a sketch of how the ioctl() might be invoked appears after this list.

  • The HFS+ filesystem now supports POSIX access control lists.

  • The reliable out-of-memory killer patches have been merged. This work should make OOM handling more robust, but it could possibly confuse user-space applications by returning "out of memory" errors in situations where such errors were not seen before.

  • The evdev input layer has gained a new EVIOCREVOKE ioctl() command that revokes all access to a given file descriptor. It can be used to ensure that no evil processes lurk on an input device across sessions. See this patch for an example of how this functionality can be used; a minimal sketch appears after this list.

  • New hardware support includes:

    • Miscellaneous: MOXA ART real-time clocks, Freescale i.MX SoC temperature sensors, Allwinner A10/A13 watchdog devices, Freescale PAMU I/O memory management units, TI LP8501 LED controllers, Cavium OCTEON GPIO controllers, and Mediatek/Ralink RT3883 PCI controllers,

    • Networking: Intel i40e Ethernet interfaces.
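
For a rough sense of the command-line partition syntax, a table following the format described in Documentation/block/cmdline-partition.txt looks something like this: partition sizes followed by parenthesized names, with "-" meaning "the rest of the device" (the device name here is just an example):

    blkdevparts=mmcblk0:1m(boot),512k(env),-(data)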
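
To give a feel for the deduplication interface, here is a minimal sketch of a BTRFS_IOC_FILE_EXTENT_SAME call, assuming the structure definitions land in <linux/btrfs.h> as in the merged patches; the helper name is invented and error handling is minimal:

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    /* Ask btrfs to share len identical bytes between src_fd and dst_fd;
       dst_fd should be open for writing. Returns the ioctl() result. */
    static int dedupe_range(int src_fd, __u64 src_off,
                            int dst_fd, __u64 dst_off, __u64 len)
    {
        struct btrfs_ioctl_same_args *args;
        int ret;

        args = calloc(1, sizeof(*args) + sizeof(args->info[0]));
        if (!args)
            return -1;
        args->logical_offset = src_off;   /* extent in the source file */
        args->length = len;
        args->dest_count = 1;             /* one destination extent */
        args->info[0].fd = dst_fd;
        args->info[0].logical_offset = dst_off;

        ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
        /* On success, info[0].status and info[0].bytes_deduped report
           whether the kernel verified and shared the data. */
        free(args);
        return ret;
    }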
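
The revocation call itself is about as simple as ioctl() usage gets; a session manager holding an evdev descriptor might do something like the following (a sketch; the argument is currently required to be NULL):

    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <linux/input.h>

    /* Revoke all further access through this evdev file descriptor;
       subsequent reads and ioctl() calls on it will fail. */
    static int revoke_evdev(int fd)
    {
        return ioctl(fd, EVIOCREVOKE, NULL);
    }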

Changes visible to kernel developers include:

  • The seqlock locking primitive has gained a new "locking reader" type. Normally, seqlocks allow for data structures to be changed while being accessed by readers; the readers are supposed to detect the change (by checking the sequence number) and retry if need be. Some types of readers cannot tolerate changes to the structure, though; in current kernels, they take an expensive write lock instead. The "locking reader" lock will block writers and other locking readers, but allow normal readers through. Note that locking readers could share access with each other; the fact that this sharing does not happen now is an implementation limitation. The functions for working with this type of lock are:

        void read_seqlock_excl(seqlock_t *sl);
        void read_sequnlock_excl(seqlock_t *sl);
    

    There are also the usual variants for blocking hardware and software interrupts; the full set can be found in <linux/seqlock.h>. A short usage sketch appears after this list.

  • The new shrinker API has been merged. Most code using this API needed to be changed; the result should be better performance and a better-defined, more robust API. The new "LRU list" mechanism that was a part of that patch set has also been merged.

  • The per-CPU IDA ID allocator patch set has been merged.
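
As a quick illustration of how the new seqlock calls sit alongside the classic retry-based pattern, consider this fragment (my_lock, my_data, and traverse_in_place() are invented for the example):

    seqlock_t my_lock;
    struct my_data shared, snapshot;
    unsigned int seq;

    /* Normal reader: may run concurrently with writers, so it must
       retry if the sequence count changed during the read. */
    do {
        seq = read_seqbegin(&my_lock);
        snapshot = shared;
    } while (read_seqretry(&my_lock, seq));

    /* Locking reader: blocks writers (and other locking readers)
       instead of retrying; for code that cannot tolerate the
       structure changing underneath it. */
    read_seqlock_excl(&my_lock);
    traverse_in_place(&shared);
    read_sequnlock_excl(&my_lock);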

Now begins the stabilization phase for the 3.12 kernel. If the usual pattern holds, the final release can be expected on or shortly after Halloween; whether it turns out to be a "trick" or a "treat" depends on how well the testing goes between now and then.

The search for truly random numbers in the kernel

By Jonathan Corbet
September 18, 2013
The ongoing disclosures of governmental attempts to weaken communications security have caused a great deal of concern. Thus far, the evidence would seem to suggest that the core principles behind cryptography remain sound, and that properly encrypted communications can be secure. But the "properly encrypted" part is a place where many things can go wrong. One of those things is the generation of random numbers; without true randomness (or, at least, unpredictability), encryption algorithms can be far easier to break than their users might believe. For this reason, quite a bit of attention has been paid to the integrity of random number generation mechanisms, including the random number generator (RNG) in the kernel.

Random number generation in Linux seems to have been fairly well thought out with no obvious mistakes. But that does not mean that all is perfect, or that improvements are not possible. The kernel's random number generator has been the subject of a few different conversations recently, some of which will be summarized here.

Hardware random number generators

A program running on a computer is a deterministic state machine that cannot, on its own, generate truly random numbers. In the absence of a source of randomness from the outside world, the kernel is reduced to the use of a pseudo-random number generator (PRNG) algorithm that, in theory, will produce numbers that could be guessed by an attacker. In practice, guessing the results of the kernel's PRNG will not be an easy task, but those who concern themselves with these issues still believe that it is better to incorporate outside entropy (randomness) whenever it is possible.

One obvious source of such randomness would be a random number generator built into the hardware. By sampling quantum noise, such hardware could create truly random data. So it is not surprising that some processors come with RNGs built in; the RDRAND instruction provided by some Intel processors is one example. The problem with hardware RNGs is that they are almost entirely impossible to audit; should some country's spy agency manage to compromise a hardware RNG, this tampering would be nearly impossible to detect. As a result, people who are concerned about randomness tend to look at the output of hardware RNGs with a certain level of distrust.
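
RDRAND is directly accessible from user space as well; as a minimal illustration (built with gcc -mrdrnd, on a processor advertising the rdrand feature flag):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long long value;

        /* _rdrand64_step() returns 1 on success, 0 if the hardware
           could not supply a value; retrying is the usual response. */
        if (_rdrand64_step(&value))
            printf("rdrand: %llx\n", value);
        return 0;
    }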

Some recently posted research [PDF] can only reinforce that distrust. The researchers (Georg T. Becker, Francesco Regazzoni, Christof Paar, and Wayne P. Burleson) have documented a way to corrupt a hardware RNG by changing the dopant polarity in just a few transistors on a chip. The resulting numbers still pass tests of randomness, and, more importantly, the hardware still looks the same at almost every level, whether one examines the masks used or the chip itself under an electron microscope. This type of hardware compromise is thus nearly impossible to detect; it is also relatively easy to carry out. The clear conclusion is that hostile hardware is a real possibility, and the corruption of a relatively simple, low-level component like an RNG is an especially plausible form of it. Distrust of hardware RNGs would thus appear to be a healthy tendency.

The kernel's use of data from hardware RNGs has been somewhat controversial from the beginning, with some developers wanting to avoid such sources of entropy altogether. The kernel's approach, though, is that using all available sources of entropy is a good thing, as long as it is properly done. In the case of a hardware RNG, the random data is carefully mixed into the buffer known as the "entropy pool" before being used to generate kernel-level random numbers. In theory, even if the data from the hardware RNG is entirely hostile, it cannot cause the state of the entropy pool to become known and, thus, it cannot cause the kernel's random numbers to be predictable.

Given the importance of this mixing algorithm, it was a little surprising to see, earlier this month, a patch that would allow the user to request that the hardware RNG be used exclusively by the kernel. The argument for the patch was based on performance: depending entirely on RDRAND is faster than running the kernel's full mixing algorithm. But the RNG is rarely a performance bottleneck in the kernel, and the perceived risk of relying entirely on the hardware RNG was seen as being far too high. So the patch was not received warmly and had no real chance of being merged; sometimes it is simply better not to tempt users to compromise their security in the name of performance.

H. Peter Anvin raised a related question: what about hardware RNGs found in other components, and, in particular, in trusted platform module (TPM) chips? Some TPMs may have true RNGs in them; others are known to use a PRNG and, thus, are fully deterministic. What should the kernel's policy be with regard to these devices, which, for the most part, are ignored currently? The consensus seemed to be that no particular trust should be put into TPM RNGs, but that using some data from the TPM to seed the kernel's entropy pool at boot could be beneficial. Many systems have almost no entropy to offer at boot time, so even suspect random data from the TPM would be helpful early in the system's lifetime.

Overestimated entropy

As noted above, the kernel attempts to pick up entropy from the outside world whenever possible. One source of entropy is the timing of device interrupts; that random data is obtained by (among other things) reading the time stamp counter (TSC) with a call to get_cycles() and using the least significant bits. In this way, each interrupt adds a little entropy to the pool. There is just one little problem, pointed out by Stephan Mueller: on a number of architectures, the TSC does not exist and get_cycles() returns zero. The amount of entropy found in a constant stream of zeroes is rather less than one might wish for; the natural consequence is that the kernel's entropy pool may contain less entropy than had been thought.
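
Conceptually, the interrupt-path sampling looks something like the following much-simplified sketch (mix_into_pool() is invented here; the real logic lives in add_interrupt_randomness() in drivers/char/random.c):

    /* Called on each interrupt: fold the low bits of a fast cycle
       counter into the entropy pool. If get_cycles() always returns
       zero, nothing unpredictable is contributed. */
    static void sample_interrupt_timing(void)
    {
        cycles_t t = get_cycles();

        mix_into_pool((u32)t);    /* only the low bits matter */
    }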

The most heavily used architectures do not suffer from this problem; on the list of those that do, the most significant may be MIPS, which is used in a wide range of home network routers and other embedded products. As it turned out, Ted Ts'o had already been working with the MIPS maintainers to find a solution to this problem. He didn't like Stephan's proposed solution — reading a hardware clock if get_cycles() is not available — due to the expense; hardware clocks can take a surprisingly long time to read. Instead, he is hoping that each architecture can, somehow, provide some sort of rapidly increasing counter that can be used to contribute entropy to the pool. In the case of MIPS, there is a small counter that is incremented each clock cycle; it doesn't hold enough bits to work as a TSC, but it's sufficient for entropy generation.

In the end, a full solution to this issue will take a while, but, Ted said, that is not necessarily a problem:

If we believed that /dev/random was actually returning numbers which are exploitable, because of this, I might agree with the "we must do SOMETHING" attitude. But I don't believe this to be the case. Also note that we're talking about embedded platforms, where upgrade cycles are measured in years --- if you're lucky. There are probably home routers still stuck on 2.6 [...]

So, he said, it is better to take some time and solve the problem properly.

Meanwhile, Peter came to another conclusion about the entropy pool: when the kernel writes to that pool, it doesn't account for the fact that it will be overwriting some of the entropy that already exists there. Thus, he said, the kernel's estimate for the amount of entropy in the pool is almost certainly too high. He put together a patch set to deal with this problem, but got little response. Perhaps that's because, as Ted noted in a different conversation, estimating the amount of entropy in the pool is a hard problem that cannot be solved without knowing a lot about where the incoming entropy comes from.

The kernel tries to deal with this problem by being conservative in its accounting for entropy. Quite a few sources of unpredictable data are mixed into the pool with no entropy credit at all. So, with luck, the kernel will have a vague handle on the amount of entropy in the pool, and its mixing techniques and PRNG should help to make its random numbers thoroughly unpredictable. The end result should be that anybody wanting to attack the communications security of Linux users will not see poor random numbers as the easiest approach; in this world, one cannot do a whole lot better than that.

Copy offloading with splice()

By Jonathan Corbet
September 18, 2013
One of the most common things to do on a computer is to copy a file, but operating systems have traditionally offered little in the way of mechanisms to accelerate that task. The cp program can replicate a filesystem hierarchy using links — most useful for somebody wanting to work with multiple kernel trees — but that trick speeds things up by not actually making copies of the data; the linked files cannot be modified independently of each other. When it is necessary to make an independent copy of a file, there is little alternative to reading the whole thing through the page cache and writing it back out. It often seems like there should be a better way, and indeed, there might just be.

Contemporary systems often have storage mechanisms that could speed copy operations. Consider a filesystem mounted over the network using a protocol like NFS, for example; if a file is to be copied to another location on the same server, doing the copy on the server would avoid a lot of work on the client and a fair amount of network traffic as well. Storage arrays often operate at the file level and can offload copy operations in a similar way. Filesystems like Btrfs can "copy" a file by sharing a single copy of the data between the original and the copy; since that sharing is done in a copy-on-write mode, there is no way for user space to know that the two files are not completely independent. In each of these cases, all that is needed is a way for the kernel to support this kind of accelerated copy operation.

Zach Brown has recently posted a patch showing how such a mechanism could be added to the splice() system call. This system call looks like:

    ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
                   size_t len, unsigned int flags);

Its job is to copy len bytes from the open file represented by fd_in to fd_out, starting at the given offsets for each. One of the key restrictions, though, is that one of the two file descriptors must be a pipe. Thus, splice() works for feeding data into a pipe or for capturing piped data to a file, but it does not perform the simple task of copying one file to another.
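
In practice, that restriction means a file-to-file copy with splice() takes two calls per chunk, shuttling the data through an intermediate pipe; a simplified sketch (error handling mostly omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Copy len bytes from in_fd to out_fd at the current file
       positions by splicing through a pipe. */
    static void splice_copy(int in_fd, int out_fd, size_t len)
    {
        int pipefd[2];
        ssize_t n;

        if (pipe(pipefd))
            return;
        while (len > 0) {
            n = splice(in_fd, NULL, pipefd[1], NULL, len, 0);
            if (n <= 0)
                break;
            splice(pipefd[0], NULL, out_fd, NULL, n, 0);
            len -= n;
        }
        close(pipefd[0]);
        close(pipefd[1]);
    }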

As it happens, the machinery that implements splice() does not force that limitation; instead, the "one side must be a pipe" rule comes from the history of how the splice() system call came about. Indeed, it already does file-to-file copies when it is invoked behind the scenes from the sendfile() system call. So there should be no real reason why splice() would be unable to do accelerated file-to-file copies. And that is exactly what Zach's patch causes it to do.

That patch set comes in three parts. The first of those adds a new flag (SPLICE_F_DIRECT) allowing users to request a direct file-to-file copy. When this flag is present, it is legal to provide values for both off_in and off_out (normally, the offset corresponding to a pipe must be NULL); when an offset is provided, the file will be positioned to that offset before the copying begins. After this patch, the file copy will happen without the need to copy any data in memory and without filling up the page cache, but it will not be optimized in any other way.
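
Under the proposed interface, the same copy collapses to a single call; something like this fragment (hypothetical usage of the SPLICE_F_DIRECT flag from the patch set, which is not in any mainline kernel):

    /* Request a direct, in-kernel copy of the whole file, starting
       at offset zero in both source and destination. */
    loff_t off_in = 0, off_out = 0;

    splice(src_fd, &off_in, dst_fd, &off_out, file_size, SPLICE_F_DIRECT);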

The second patch adds a new entry to the ever-expanding file_operations structure:

    ssize_t (*splice_direct)(struct file *in, loff_t off_in, struct file *out,
                             loff_t off_out, size_t len, unsigned int flags);

This optional method can be implemented by filesystems to provide an optimized implementation of SPLICE_F_DIRECT. It is allowed to fail, in which case the splice() code will fall back to copying within the kernel in the usual manner.

Here, Zach worries a bit in the comments about how the SPLICE_F_DIRECT flag works: it is used to request both direct file-to-file copying and filesystem-level optimization. He suggests that the two requests should be separated, though it is hard to imagine a situation where a developer who went to the effort to use splice() for a file-copy operation would not want it to be optimized. A better question, perhaps, is why SPLICE_F_DIRECT is required at all; a call to splice() with two regular files as arguments would already appear to be an unambiguous request for a file-to-file copy.

The last patch in the series adds support for optimized copying to the Btrfs filesystem. In truth, that support already exists in the form of the BTRFS_IOC_CLONE ioctl() command; Zach's patch simply extends that support to splice(), allowing it to be used in a filesystem-independent manner. No other filesystems are supported at this point; that work can be done once the interfaces have been nailed down and the core work accepted as the right way forward.

Relatively few comments on this work have been posted as of this writing; whether that means that nobody objects or nobody cares about this functionality is not entirely clear. But there is an ongoing level of interest in the idea of optimized copy operations in general; see the lengthy discussion of the proposed reflink() system call for an example from past years. So, sooner or later, one of these mechanisms needs to make it into the mainline. splice() seems like it could be a natural home for this type of functionality.
