Kernel development

Brief items

Kernel release status

The current development kernel is 3.4-rc7, released on May 12. Linus says: "This is almost certainly the last -rc in this series - things really have calmed down, and I even considered just cutting 3.4 this weekend, but felt that another week wouldn't hurt." Expect a 3.4 final release in the near future.

Stable updates: 3.3.6 was released on May 12, as was 3.2.17. The 2.6.34.12 update is in the review process as of this writing; it can be expected on or after May 17.

Quotes of the week

Also, when Van [Jacobson] says something, you can be fairly sure its right, and if it's not, then you didn't understand what Van said.
Eric Dumazet (thanks to Dave Täht)

The thrust of the argument seems to be that by establishing good habits from the very beginning you can avoid the need for change. That may well be true, but it isn't particularly "user friendly". We should make things simple and safe so that people don't *need* to carefully form good habits.
Neil Brown

The end of the token ring era?

By Jonathan Corbet
May 16, 2012

Paul Gortmaker was recently doing some cleanup work when he found the token ring networking code getting in the way, which led him to wonder: was anybody still using that code? He concluded that the answer was "no":

A search on the internet for users tends to show that even the die hard enthusiasts who cared to poke at MCA/TR just for hobby sake have pretty much all given up somewhere in the 2003-2005 "pre-git" timeframe, and never really moved off their 2.4.x kernels.

In response, he put together a patch to remove the token ring subsystem altogether. The patch was presented as a demonstration, without a lot of hope that it would be applied in the near future. Paul's real goal was to get comments and see if he could build a consensus for the removal of the code at some more distant time.

Thus far, there has been one objection. But, that notwithstanding, David Miller has accepted the patch and fast-tracked it directly into the net-next repository. Barring some sort of reversion prior to the merge window, it looks like the 3.5 kernel will be missing support for token ring networking.

Kernel development news

User and group mount options for ext filesystems

By Jake Edge
May 16, 2012

When transporting files between systems using USB sticks or other removable media, one can run into an annoying problem: the UIDs or GIDs of the files on the media don't match those on the system. In most situations, those kinds of devices have a VFAT filesystem that avoids the problem entirely by not storing UID/GID information. But if a user wants to use a "real" filesystem on the device, one of the ext* family for example, it might be useful to specify the local owner of the files. Ludwig Nussel's patch set would do just that for ext2, ext3, and ext4 filesystems.

The patch comes from some work Nussel did "years ago", he said, when re-introducing it. It simply adds two new mount options for ext filesystems. Following in the footsteps of the VFAT filesystem, the patch would add uid= and gid= options that would treat all files in the filesystem as being owned by that UID/GID combination. When a filesystem is mounted using these options, files retain their ownership on disk, but they appear to be owned by the specified user and group. Existing files cannot have their ownership changed, but new files will be created with the user and group given at mount time. If a different UID/GID combination is desired for new files (to match the UID/GID on the device, for example), a second value can be added to each mount option:

    uid=m:n
    gid=x:y
which would make the files appear to be owned by m.x and would create new files as n.y.
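
The two forms of the option can be illustrated with a small user-space sketch. This is not Nussel's code: the function names `parse_override` and `parse_mount_options` and the returned tuples are assumptions made for illustration. The sketch only models the semantics described above, where the first number is the ID that files appear to have and the optional second number is the ID stored on disk for new files.

```python
def parse_override(value):
    """Parse the value of a uid= or gid= option as described above.

    "m"   -> files appear as m, new files are created as m
    "m:n" -> files appear as m, new files are created as n
    Returns a (shown_id, id_for_new_files) tuple.
    """
    shown, sep, created = value.partition(":")
    shown_id = int(shown)
    created_id = int(created) if sep else shown_id
    return shown_id, created_id


def parse_mount_options(options):
    """Pick the uid=/gid= overrides out of a comma-separated option string."""
    overrides = {}
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        if sep and key in ("uid", "gid"):
            overrides[key] = parse_override(value)
    return overrides
```

With an option string such as `rw,uid=1000:500,gid=100`, this model reports files as owned by 1000.100 and would create new files as 500.100.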

One of the first questions to greet Nussel's patch was about putting the code into specific filesystems, rather than the VFS layer. While the VFS seems like the right place, Ted Ts'o points out that there is no easy way to do it all there:

The problem is that there will need to be at least some support in the individual file system, since there isn't a good place for the VFS to intercept the internal file system iget() function to patch in the override uid/gid values.

So the question at this point is whether it's cleaner to have the functionality split between the VFS and the file system layers (i.e., with the options parsing and storing the override uid/gid values in the super_block structure) or keeping it all in the file system layer, and accepting the duplication of code across multiple file systems.

Ts'o leaned toward the first approach in that message, but later reluctantly accepted the code duplication. From what he could see, there wasn't enough of a win to put it into the VFS.

There was a little more discussion when Nussel resent the patch on May 10. First off, Jan Kara and Ts'o both wanted to see the patch split into three parts (one for each of ext2, 3, and 4), which Nussel did, posting the result the next day. But Roland Eggner and Boaz Harrosh were both concerned about the underlying idea of the patch. Circumventing the access restrictions on the files via a mount option is not a sensible way to address what is, really, an administrative problem, they said.

Eggner described how he "solves" the problem for systems he administers by essentially creating and using a static list of UIDs and GIDs. His position is: "If UIDs differ on machines FORESEEN for file exchange, this is an administrator error, not a kernel deficit." Furthermore, exchanging files with unexpected systems requires root privileges, he said, so there is no need for the mount option override.

Like Eggner, Harrosh is concerned about security issues with the proposed change. He also doesn't see anything particularly special about the ext filesystems in terms of removable media, noting that VFAT is the dominant choice. Beyond that, he questions the definition of "removable media", and notes that the problem is common in the NFS world: "we constantly encounter multiple domain uid/gid views, and it does not mean we blow a hole in POSIX security rules."

But Neil Brown sees things a little differently. He notes that VFAT suffers from limitations including a 4GB file size limit and an inability to handle some special characters in file names. That aside, when someone has physical access to a device, it is essentially "removable" in some sense, so that someone may want to easily access the data:

[...] if I "own" a filesystem - whether because I hold the physical non-encrypted devices or because I know the encryption key - then I want to be able to leverage that "ownership" to full access rights to the contents of the filesystem. By typing in a key or plugging in a device I want to get full "root" access to the filesystem on the device. Not giving that to me is just getting in my way.

When users insert a VFAT-formatted USB stick or disk, suitably configured systems will give full access to the user by using the VFAT uid/gid options. Nussel's patches essentially just give that same power for ext-formatted devices. While it could certainly lead to problems, those problems are already latent, as Brown pointed out:

You cannot prevent data destruction on such devices if you lose physical control, and the only workable data privacy option is encryption. Trying to pretend that file permission bits mean anything is extremely naive.

While Harrosh is concerned that automounters will start using the options, Brown believes that makes sense for removable devices. In the patch, Nussel mentions that it could be done statically in /etc/fstab or be handled dynamically through udev rules. The alternative suggested by Harrosh is that root can mount the device and then chmod (or chown, presumably) the files appropriately. That seems like a pointless exercise that will just have to be repeated, potentially every time the device is plugged into a new system. Eggner's method is certainly workable, at some level, but makes things more difficult and less "user friendly", Brown said.

In the end, it is a convenience feature. Anyone with physical access to an unencrypted removable device already has the tools available to read the data on it or to put malware onto it. It's a little hard to see how making it easier for legitimate owners of removable USB storage to access their data somehow opens the floodgates for attackers of various sorts. Those of a malicious bent can find any number of ways (live CD, their own Linux system, ...) to access the device as root if they wish.

It is unclear how prevalent ext-formatted removable devices are, so there may be an argument against adding the feature on those grounds. On the other hand, making the ext family work better may encourage people to use those filesystems more often for removable media. The patches do duplicate code in the three separate filesystems, but the total number of lines changed is only around 100 for each. Moving some of that into the VFS (like parsing the mount options and storing the flags in the superblock) might reduce that a bit, but it's not much code overall. Administrators who are worried about the feature will be able to avoid it entirely, though they may need to keep an eye on their distribution's udev rules. Given that it brings the same convenience as VFAT to ext-formatted devices, it seems like a feature worth having.

Various tweaks to printk()

By Jonathan Corbet
May 16, 2012

For the most part, the logging reliability patches covered here in April have been quietly stabilizing and appear to be set for merging for 3.5. But printk() is a heavily-used function, so there are a lot of people with strong opinions on how it should work. Thus the discussion on how printk() can be improved has stretched out for some time. The result, so far, is a better understanding of how continuation lines should be handled and, possibly, a new format for timestamps.

Messages are sent to the system log with printk(), but that function has an interesting bit of historical behavior: like printf() in user space, printk() can be used to send partial lines to the log. Multiple printk() calls can be used to produce a single line in the log stream, piece by piece. The patches for 3.5 make printk() much more record-oriented internally, but the API does not change. So there is a bit of an impedance mismatch between a record-oriented logging system and its stream-oriented API. That mismatch has been there since the beginning, but it has become more clear over time.

The mixed nature of kernel logging leads to a bit of an ambiguity, because any message can be either of two things: (1) a new message to be logged or (2) a continuation of a previous log message. The kernel decides which of the two situations holds by remembering whether the previous log message ended with a newline or not. If there was no trailing newline, a new message will be appended to the previous line.

This approach works much of the time, but it is not without its hazards. In particular, there is nothing that guarantees that two successive printk() calls will be executed one right after the other. Even on a uniprocessor system, interrupt handlers can emit messages between two printk() calls that are supposed to produce a single line of output. Adding more processors to the system clearly makes the situation worse; there is only one log buffer containing messages from all processors, so it is easy for one processor to jump into the middle of a sequence of printk() calls being executed on another. What happens then is not especially pretty: messages get mashed together and corrupted. The result is a log that is harder for humans to read, and which can totally confuse automated log-processing tools.
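
The hazard can be reproduced with a toy user-space model of the trailing-newline rule. This is a sketch, not kernel code: the class `StreamLog`, its `printk()` method, and the sample messages are all invented for illustration.

```python
class StreamLog:
    """Toy model of the trailing-newline rule; not kernel code."""

    def __init__(self):
        self.lines = []          # log lines, the last one possibly unfinished
        self.open_line = False   # True if the previous text lacked a newline

    def printk(self, text):
        if self.open_line:
            # Previous message had no trailing newline: treat as continuation.
            self.lines[-1] += text.rstrip("\n")
        else:
            self.lines.append(text.rstrip("\n"))
        self.open_line = not text.endswith("\n")


log = StreamLog()
log.printk("eth0: link ")            # one context starts a line
log.printk("disk sda: I/O error\n")  # an interrupt or another CPU jumps in
log.printk("up\n")                   # the first context finishes its line
```

In this model the log ends up holding "eth0: link disk sda: I/O error" followed by a stray "up": two unrelated messages joined, and one message split.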

This patch set was supposed to be about increasing logging reliability, so that sort of message corruption is not welcome. The original plan devised by developer Kay Sievers was to require an explicit KERN_CONT "log level" marker for continuations. In this scheme, every printk() call would generate a new log line unless merging had been explicitly requested with KERN_CONT. There is a little problem in that most continuation lines are not so marked in current kernels, which would lead to lines being split up; Kay's plan was to audit the kernel and fix all of those calls to work properly in the new scheme.

Linus didn't like that idea, saying that things work well as they are now; to him, adding all those KERN_CONT markers just represented unnecessary noise. After some back-and-forth, Kay came around to Linus's point of view, but he still wanted to avoid the corruption of messages whenever possible. The result was a new patch that tries to explicitly remember partial printk() calls and associate them with a specific process. Lines passed to printk() will be merged only if they both come from the same process and only if the second line is clearly not the start of a new log message. The end result is not perfect: if two processors try to output partial lines at the same time, at least one of them will be split. But there will be no more joining of unrelated messages, and that seems like a good thing.
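
The merging rule described above can be sketched in the same toy style. Again, this is not Kay's patch: the class `RecordLog` is hypothetical, tasks are identified by arbitrary strings, and the sketch assumes that a leading "<N>" log-level prefix is what marks a line as clearly starting a new message.

```python
import re

LEVEL_PREFIX = re.compile(r"^<[0-7]>")  # assumed marker for "new message"


class RecordLog:
    """Toy model: merge partial lines only within one task; not kernel code."""

    def __init__(self):
        self.records = []    # finished log records
        self.partial = None  # text of the pending partial line, if any
        self.owner = None    # task that wrote the pending partial line

    def _flush(self):
        if self.partial is not None:
            self.records.append(self.partial)
            self.partial = None
            self.owner = None

    def printk(self, task, text):
        starts_new = bool(LEVEL_PREFIX.match(text))
        body = LEVEL_PREFIX.sub("", text)
        can_merge = (self.partial is not None
                     and self.owner == task
                     and not starts_new)
        if not can_merge:
            self._flush()        # emit any pending partial line as its own record
            self.partial = ""
            self.owner = task
        self.partial += body.rstrip("\n")
        if body.endswith("\n"):
            self._flush()


interrupted = RecordLog()
interrupted.printk("A", "<6>eth0: link ")
interrupted.printk("B", "<3>disk sda: I/O error\n")
interrupted.printk("A", "up\n")

quiet = RecordLog()
quiet.printk("A", "<6>eth0: link ")
quiet.printk("A", "up\n")
```

When task A is interrupted, its line is split into "eth0: link " and "up", but nothing from task B is joined onto it; when A runs undisturbed, the two pieces merge into "eth0: link up".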

A different branch of the same discussion got into the formatting of timestamps, which will always be present in the new scheme. In current kernels, that timestamp comes in the form of seconds and microseconds since the system booted. But what developers often really want to see is some combination of the absolute time of an event and the relative time from previous events. After some discussion with Sasha Levin, Linus requested a format that looks like this:

    [May12 11:27] foo
    [May12 11:28] bar
    [  +5.077527] zoot
    [ +10.235225] foo
    [  +0.002971] bar
    [May12 11:29] zoot
    [  +0.003081] foo

In other words, events that are relatively far apart in time would be marked with the absolute time with one-minute precision. When things happen more closely in time, the elapsed time between successive events would be printed instead. For any driver developer trying to figure out the relative timing of device-related events, this kind of output format would help to save a lot of mental arithmetic.
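
As a rough illustration of how such stamps might be chosen, here is a sketch under stated assumptions. No patch producing this format had been posted, so the cutoff `GAP_SECONDS`, the function name `stamp_lines`, and the rule "print the absolute time whenever the gap since the previous event exceeds the cutoff" are guesses made for this example rather than the requested behavior.

```python
from datetime import datetime, timedelta

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

GAP_SECONDS = 30.0  # assumed cutoff between "far apart" and "close together"


def stamp_lines(events, gap=GAP_SECONDS):
    """Format (datetime, text) pairs, given in time order, as log lines."""
    lines = []
    prev = None
    for when, text in events:
        delta = None if prev is None else (when - prev).total_seconds()
        if delta is None or delta > gap:
            # Far apart: absolute time with one-minute precision.
            stamp = "%s%02d %s" % (MONTHS[when.month - 1], when.day,
                                   when.strftime("%H:%M"))
        else:
            # Close together: elapsed time since the previous event.
            stamp = ("+%.6f" % delta).rjust(11)
        lines.append("[%s] %s" % (stamp, text))
        prev = when
    return lines


start = datetime(2012, 5, 12, 11, 27, 10)
events = [
    (start, "foo"),
    (start + timedelta(seconds=60), "bar"),
    (start + timedelta(seconds=65, microseconds=77527), "zoot"),
]
lines = stamp_lines(events)
```

Both stamp forms are padded to the same eleven-character width, so the message text stays aligned whichever form is used.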

The patches to produce this format have not yet been posted, so it is looking likely that we will not see it in the 3.5 kernel. The rest of the logging work should be there for 3.5, though, taking Linux one small step closer to the sort of structured and reliable logging that many users and developers would like to see.

A bcache update

By Jonathan Corbet
May 14, 2012

Flash-based solid-state storage devices (SSDs) have a lot to recommend them; in particular, they can be quite fast even when faced with highly non-sequential I/O patterns. But SSDs are also relatively small and expensive; for that reason, for all their virtues, they will not be fully replacing rotating storage devices for a long time. It would be nice to have a storage device that provided the best features of both SSDs and rotating devices: the speed of flash combined with the cheap storage capacity of traditional drives. Such a device could simultaneously reduce the performance pain that comes with rotating storage and the financial pain associated with solid-state storage.

The classic computer science response to such a problem is to add another level of indirection in the form of another layer of caching. In this case, a large array of drives could be hidden behind a much smaller SSD-based cache that provides quick access to frequently-accessed data and turns random access patterns into something closer to sequential access. Hybrid drives and high-end storage arrays have provided this kind of feature for some time, but Linux does not currently have the ability to construct such two-level drives from independent components. That situation could change, though, if the bcache patch set finds its way into the mainline.

LWN last looked at bcache almost two years ago. Since then, the project has been relatively quiet, but development has continued. With the current v13 patch set, bcache creator Kent Overstreet says:

Bcache is solid, production ready code. There are still bugs being found that affect specific configurations, but there haven't been any major issues found in awhile - it's well past time I started working on getting it into mainline.

The idea behind bcache is relatively straightforward: given an SSD and one or more storage devices, bcache will interpose the SSD between the kernel and those devices, using the SSD to speed I/O operations to and from the underlying "backing store" devices. If a read request can be satisfied from the SSD, the backing store need not be involved at all. Depending on its configuration, bcache can also buffer write operations; in this mode, it serves as a sort of extended I/O scheduler, reordering operations so that they can be sent to the backing device in a more seek-friendly manner. Once one gets into the details, though, the problem starts to become more complex than one might imagine.

Consider the buffering and reordering of write operations, for example. Some users may be uncomfortable with anything that delays the arrival of data on the backing device; for such situations, bcache can be run in a write-through caching mode. When write-through behavior is selected, no write operation is considered to be complete until it has made it to the backing device. Clearly, in this case, the SSD cache is not going to improve write performance at all, though it may still improve performance overall if that data is read while it remains in the cache.

If, instead, writeback caching is enabled, bcache will mark the completion of writes once they make it to the SSD. It can then flush those dirty blocks out to the backing device at its leisure. Writeback caching can allow the system to coalesce multiple writes to the same blocks and to achieve better on-disk locality when the writes are eventually flushed out; both of those should improve performance. Obviously, writeback caching also carries the risk of losing data if the system is struck by a meteorite before the writeback operation is complete. Bcache includes a fair amount of code meant to address this concern; the SSD contains an index as well as the cached data, so dirty blocks can be located and written back after the system comes back up. Providing meteorite-proof drives is beyond the scope of the bcache patch set, though.
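
The difference between the two modes can be seen in a toy model. This is a sketch, not bcache code: the class `ToyCache` and its attributes are invented for illustration, Python dictionaries stand in for the SSD and the backing device, and flushing in ascending block order is only a stand-in for "better on-disk locality."

```python
class ToyCache:
    """Toy model of write-through and writeback caching; not bcache code."""

    def __init__(self, writeback=False):
        self.writeback = writeback
        self.ssd = {}             # block number -> data held on the SSD
        self.dirty = set()        # blocks on the SSD not yet on the backing device
        self.backing = {}         # block number -> data on the backing device
        self.backing_writes = []  # order in which writes reach the backing device

    def _write_backing(self, block, data):
        self.backing[block] = data
        self.backing_writes.append(block)

    def write(self, block, data):
        self.ssd[block] = data
        if self.writeback:
            self.dirty.add(block)             # complete once the SSD has it
        else:
            self._write_backing(block, data)  # complete only after the backing write

    def read(self, block):
        if block in self.ssd:                 # hit: backing device not involved
            return self.ssd[block]
        data = self.backing[block]
        self.ssd[block] = data
        return data

    def flush(self):
        for block in sorted(self.dirty):      # ascending order, coalesced per block
            self._write_backing(block, self.ssd[block])
        self.dirty.clear()


writes = [(7, "a"), (2, "b"), (7, "c")]

through = ToyCache(writeback=False)
back = ToyCache(writeback=True)
for block, data in writes:
    through.write(block, data)
    back.write(block, data)
back.flush()
```

The write-through cache sends three writes to the backing device in arrival order (7, 2, 7). The writeback cache sends two, in order (2, 7), because the two writes to block 7 were coalesced on the SSD; until `flush()` runs, though, the backing device holds none of that data.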

Of course, maintaining this index on the SSD has some performance costs of its own, especially since bcache takes pains to only write full erase blocks at a time. One write operation from the kernel can turn into several operations at the SSD level to ensure that the on-SSD data structures are consistent at all times. To mitigate this cost, bcache provides an optional journaling layer that can speed up operations at the SSD level.

Another interesting problem that comes with writeback caching is the implementation of barrier operations. Filesystems use barriers (implemented as synchronous "force to media" operations in contemporary kernels) to ensure that the on-disk filesystem structure is consistent at all times. If bcache does not recognize and implement those barriers, it runs the risk of wrecking the filesystem's careful ordering of operations and corrupting things on the backing device. Unfortunately, bcache does indeed lack such support at the moment, leading to a strong recommendation to mount filesystems with barriers disabled for now.

Multi-layer solutions like bcache must face another hazard: what happens if somebody accesses the underlying backing device directly, routing around bcache? Such access could result in filesystem corruption. Bcache handles this possibility by requiring exclusive access to the backing device. That device is formatted with a special marker, and its leading blocks are hidden when accessing the device by way of bcache. Thus, the beginning of the device under bcache is not the same as the beginning when the device is accessed directly. That means that a filesystem created through bcache will not be recognized by the filesystem code if an attempt is made to mount the backing device directly. Simple attempts to shoot one's own feet should be defeated by this mechanism; as always, there is little point in doing more to protect those who are really determined to injure themselves.
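
The effect of hiding the leading blocks can be shown with a small sketch. Everything here is hypothetical: the size of the hidden region (`HIDDEN_BLOCKS`), the marker and signature byte strings, and the helper names are invented for illustration and do not reflect bcache's actual on-disk format.

```python
HIDDEN_BLOCKS = 8             # hypothetical size of the hidden leading region
MARKER = b"cache-marker"      # hypothetical stand-in for the special marker
FS_MAGIC = b"fs-superblock"   # hypothetical filesystem signature


def to_backing_block(block, hidden=HIDDEN_BLOCKS):
    """Translate a block number seen through the cache to the raw device."""
    return block + hidden


def probe_filesystem(read_block):
    """Stand-in for filesystem code looking for its signature at block 0."""
    return read_block(0) == FS_MAGIC


raw = {0: MARKER}                    # raw backing device, formatted for caching
raw[to_backing_block(0)] = FS_MAGIC  # "mkfs" run through the cached device


def read_raw(block):
    return raw.get(block, b"")


def read_cached(block):
    return raw.get(to_backing_block(block), b"")
```

In this model, `probe_filesystem(read_cached)` finds the filesystem, while `probe_filesystem(read_raw)` sees only the marker at block 0 and declines to recognize it.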

There seems to be a reasonable level of consensus that bcache would be a useful feature to add to the kernel. There are some obstacles to overcome before this code can be merged, though. One of those is that bcache adds its own management interface involving a set of dedicated tools and a complex sysfs structure. There is resistance to adding another API for block device management, so Kent has been encouraged to integrate bcache into the device mapper code. Nobody seems to be working on that project at the moment, but Dan Williams has posted a set of patches integrating bcache into the MD RAID layer. With these patches, a simple mdadm command is sufficient to set up an array with SSD caching added on top. Once that code gets into shape, presumably the user-space interface concerns will be somewhat lessened.

A harder problem to get around may be the simple fact that the bcache patch set is large, adding over 15,000 lines of code to the kernel. Included therein is a fair amount of tricky data structure work such as a complex btree implementation and "closures," described as "asynchronous refcounty things based on workqueues." The complexity of the code will make it hard to review, but, given the potential for trouble when adding a new stage to the block I/O path, developers will want this code to be well reviewed indeed. Getting enough eyeballs directed toward this code could be a challenge, but the benefit, in the form of faster storage devices, could well be worth the trouble.

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds