
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel was released on August 23. It is a relatively large patch set with fixes for a number of important bugs. Before that, a release on August 22 brought three security fixes: one for a privilege escalation problem in the SCTP code, one for a UDF filesystem memory corruption bug, and one for a crash which can only be triggered by a privileged user. An earlier release, on August 18, contained a single, PowerPC-specific security fix.

The current 2.6 prepatch remains 2.6.18-rc4. Linus surfaced from his vacation long enough to merge 100 or so fixes into the mainline repository, but little more than that has happened on that front.

The current -mm tree is 2.6.18-rc4-mm2. Recent changes to -mm include a kernel stack protector patch set (labeled "security placebo"), the ability to filter core dumps through a helper application, and a lot of fixes.

The current stable 2.4 kernel was released on August 22. It contains a small number of fixes, including one for the latest SCTP vulnerability. Previously, a release on August 19 brought another pair of security fixes.

The 2.4.34 process has begun with 2.4.34-pre1, with another small set of fixes (but see below as well).


Kernel development news

Old kernels and new compilers

Under the long-lasting maintainership of Marcelo Tosatti, the 2.4 kernel went into a deep maintenance mode, with only important fixes being considered for merging. For some people, perhaps, it was a little too deep - Marcelo clearly had other tasks besides 2.4 maintenance keeping him busy. Even so, few expected major changes when Willy Tarreau took over 2.4 maintenance after the 2.4.33 release. Why mess with 2.4 at this point?

So Willy's 2.4.34-pre1 announcement raised a few eyebrows. The prepatch itself contains a relatively small number of patches of the type one would expect. But the announcement itself notes that Willy is considering merging a set of patches to allow 2.4 kernels to be built with current gcc 4.x compilers. This is not a trivial set of changes; gcc 4.x is sufficiently different that a fairly wide-ranging set of fixes is required. The gcc 4.x transition for 2.6 was not an overnight affair.

A clear question comes immediately to mind: why would somebody who is not interested in running a current kernel be bothering with contemporary compilers? One answer is to be found in the announcement itself: there are administrators who deploy 2.4 kernels on ultra-stable systems, but who build those kernels on their desktops. It is getting increasingly hard to find a current distribution with a compiler old enough to build 2.4 kernels, so these administrators are finding themselves in a bit of a bind. A 2.4 kernel which could be compiled with a current gcc would allow current systems to be used to build kernels for deployment on stable, production systems, many of which may not have their own compilers installed at all.

Solar Designer has also noted that Openwall GNU/*/Linux is planning to upgrade to gcc 4.x and would really rather not have to change to the 2.6 kernel at the same time.

For an interesting read, see Willy's description of the user base, as he sees it, for the 2.4 kernel. In his view, the major users are those setting up very high-reliability sites. These people prefer 2.4 kernels for this job:

Simply because we already know from collective experience that these versions can achieve very long uptimes (while we don't know this yet for a fresh new version which got 5700 patches in the last 3 months), and because with the addition of very few patches, you can make a bet on security: as long as newly discovered vulnerabilities don't affect you or are covered by your additional patches, you win. If you need to update and induce excessive downtime, you lose and pay penalties.

The idea is to keep these people happy - by enabling the use of current compilers, among other things - until a 2.6 kernel comes along which is able to provide the same sort of stability guarantees. The 2.6 development model makes that sort of guarantee harder, however, because older 2.6.x kernels go out of general maintenance relatively quickly (though distributors can and do maintain them for longer). It is hard to find a 2.6 kernel with a multi-year track record of reliability, security, and ongoing fixes.

Willy's hope is that the current 2.6.16 kernel, which Adrian Bunk has stepped forward to maintain for the long term, will help in this regard. Once 2.6.16 has received a year or two of fixes (and nothing else), it might reach a point where high-reliability people might trust it in deployed systems. Time will tell if this kernel is able to reach that point.

As an aside, it's worth mentioning that a small number of developers (well, OK, one developer) have expressed some discontent about the 2.6.16 long-term process. This developer has said that it would have been better to elect an extra-stable tree maintainer through some sort of popular vote, and, perhaps, to move on to a 2.7 development series as well. This complaint ignores the fact that volunteers to maintain 2.6 kernels over the long term have been in relatively short supply; in fact, Adrian would appear to be about the only one. It does not appear that Adrian's appointment as the long-term 2.6.16 maintainer has deprived anybody else of their lifetime dreams. So maintainer elections - other than those of the "vote with your feet" variety - seem unlikely to happen in the near future.


Kevents and review of new APIs

The proposed kevent interface has been covered here before - see this article and this one too. Kevents appear to have gained significant momentum over the last few weeks, to the point that inclusion in 2.6.19 is not entirely out of the question. Most developers who have reviewed the code seem to like the core idea (a unified interface for applications to get information on all events of interest) and the implementation within the kernel. Only now, however, is significant attention being paid to the user-space API which comes with kevents. But the definition of that API is of crucial importance. This article will look at it from two perspectives - first technical, then political.

The discussion of the proposed API has been hampered somewhat by the lack of associated documentation - and the fact that said API is still changing quickly. In an attempt to pull together some of the available information, Stephen Hemminger has put up a page at OSDL describing the system call API. That page misses one important aspect of kevents, however: the ability to receive events via a shared memory interface. In an attempt to fill that gap, we'll look at the August 23 version of the memory-mapped kevent API.

One of the goals behind kevents is to make the processing of events as fast as possible - the idea being that a high-performance network server (say) can work through vast numbers of events per second without appreciable system overhead. One way to achieve this is to avoid system calls altogether whenever possible. That is why there is interest in mapping kevents directly into user space; this approach will allow the application to consume them without calling into the kernel for each one.

To use the mmap interface, the application obtains a kevent file descriptor, as usual. A simple call to mmap() will then create the shared buffer for kevent communication. The size of this buffer is currently determined by an in-kernel parameter - the maximum number of kevents which will be stored there. Presumably there will eventually be a KEVENT_MMAP_PAGES macro (or some such) to free the application from trying to figure out how many pages it should map, but that is not yet provided.

The resulting memory array is treated as a big circular buffer by the kernel. There is a single index only, however - where the next event will be written by the kernel. In other words, the kernel has no way to know which events have been consumed by the application; if that application falls too far behind, the kernel will begin to overwrite unprocessed events. For this reason, perhaps, the buffer is made relatively large - 4096 events fit there in the current version of the patch.

The events stored in the buffer are not the same ukevent structures used by the system call interface. There is, instead, a shortened version in the form of struct mukevent:

    struct kevent_id
    {
	union {
	    __u32	raw[2];
	    __u64	raw_u64 __attribute__((aligned(8)));
	};
    };

    struct mukevent
    {
	struct kevent_id	id;
	__u32			ret_flags;
    };

The id field contains some information about what happened: the relevant file descriptor, for example. The actual event code itself is not present, however.

The event ring is not quite a pure circular buffer. It is formatted with a four-byte field at the beginning of each page, followed by as many mukevent structures as will fit within the page. The four-byte field in the first page contains the current event index - where the kernel will write the next event. The application will, presumably, keep track of the last event it read from the buffer, moving that counter forward until it catches up with the kernel. The application must take care, however, to notice every time it crosses a page boundary so it can skip the count field.
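The page-skipping arithmetic described above can be sketched in a few lines of C. This is an illustrative reconstruction, not code from the patch: the structure layout follows the article (with standard stdint types standing in for the kernel's __u32/__u64), while the page size, the EVENTS_PER_PAGE macro, and the mukevent_slot() helper are invented names for the sake of the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout described in the text: each page of the mapped ring begins
 * with a 4-byte count field, followed by as many mukevent structures
 * as will fit in the rest of the page.  4096-byte pages are assumed. */
#define PAGE_SIZE_BYTES 4096

struct kevent_id {
	union {
		uint32_t raw[2];
		uint64_t raw_u64 __attribute__((aligned(8)));
	};
};

struct mukevent {
	struct kevent_id	id;
	uint32_t		ret_flags;
};

#define EVENTS_PER_PAGE ((PAGE_SIZE_BYTES - 4) / sizeof(struct mukevent))

/* Translate a logical event index into a pointer within the mapped
 * ring, skipping the 4-byte field at the start of each page. */
static struct mukevent *mukevent_slot(void *ring, unsigned int idx)
{
	size_t page = idx / EVENTS_PER_PAGE;
	size_t slot = idx % EVENTS_PER_PAGE;
	char *base = (char *)ring + page * PAGE_SIZE_BYTES;

	return (struct mukevent *)(base + 4 + slot * sizeof(struct mukevent));
}
```

An application reading the ring would advance its private index through mukevent_slot() until it catches up with the kernel's index stored at the front of the first page.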

Since there is no way to inform the kernel that events have been consumed from the memory-mapped ring, and since the full event information is not available via that ring, the application must still call into the kernel for events. Otherwise, if nothing else, they will accumulate there until they reach their maximum allowed number. So the advantage of the memory-mapped approach will be hard to obtain with the current code. As was noted above, however, this API is very young. One assumes that these little problems will be ironed out in the near future.

Meanwhile, kevents have created a separate discussion on how new APIs go into the kernel. One Nicholas Miell requested that some documentation for this interface be written:

Is any of this documented anywhere? I'd think that any new userspace interfaces should have man pages explaining their use and some example code before getting merged into the kernel to shake out any interface problems.

The response he got was "Get real". Others suggested that, if Mr. Miell really wanted documentation, he could sit down and write it himself. It must be said that, through the discussion, Mr. Miell has comported himself in a way which is highly unlikely to inspire cooperation from anybody. He seems to carry a certain contempt for the interface, the process, and the people involved in it.

But it must also be said that he has a point. The creation of user-space APIs differs from how most kernel code is written. Much is made of the evolutionary nature of the kernel itself - things continually evolve as better solutions to problems are found. User-space interfaces, however, cannot evolve - once they are shipped as part of a mainline kernel, they are set in stone and must be maintained forever. They must be right from the outset. So it is not unreasonable to expect that the level of review for new user-space APIs would be higher, and that documentation of proposed APIs, which can be expected to help the review process, should be provided. It is true, however, that the original developer is not always the best person to provide that documentation.

One question which has been raised about this interface has to do with its similarity to the FreeBSD kqueue mechanism. The intent of the interface is the same, but no attempt to emulate the kqueue API has been made. Andrew Morton has said:

I mean, if there's nothing wrong with kqueue then let's minimise app developer pain and copy it exactly. If there _is_ something wrong with kqueue then let us identify those weaknesses and then diverge. Doing something which looks the same and works the same and does the same thing but has a different API doesn't benefit anyone.

There are, evidently, real reasons for not replicating the kqueue interface, but those reasons have not, yet, been made clear.

Kevents will, it is hoped, be a major improvement for people writing applications for Linux. This new API should bring together all information of interest into a single place, provide significant performance benefits, and ease porting of applications from other operating systems. But, if this API is going to meet the high expectations being placed on it, it will require a high level of review from a number of interested parties. That review is now starting to happen, so expect this API to remain in flux for some time yet.


KHB: A Filesystems reading list

August 21, 2006

This article was contributed by Valerie Henson

We've all been there - you're wandering around a party at some Linux event clutching your drink and looking for someone to talk to, but everyone is having some obscure technical conversation full of unfamiliar jargon. Then, as you slide past a cluster of important-looking people, you overhear the word "superblock" and think, "Superblock, that's a file system thing... I read about file systems in operating systems class once." Gratefully, you join the conversation, only to discover that, while you know some of the terms - cylinder group, indirect block, inode - you're still unable to come up with stunning ripostes like, "Aha, but that's really just another version of soft updates, and it doesn't solve the nlinks problem." (Admiring silence ensues.) Now what? You want to be able to make witty remarks about the pros and cons of journaling while throwing back the last of your martini, but you don't know where to start.

Fortunately, you can get a decent grasp of modern file systems without reading a whole book on file systems. (I haven't yet read a book on file systems I would recommend, anyway.) After reading these file systems papers (or at least their abstracts), you'll be able to at least fake a working knowledge of file systems - as long as everyone is drinking and it's too loud to hear anyone clearly. Enjoy!

The Basics

These papers are oldies but goodies. While the systems they describe are fairly obsolete and have been heavily improved since these initial descriptions, they make a good introduction to file systems structure and terminology.

A Fast File System for UNIX by Marshall Kirk McKusick, William Joy, Samuel Leffler and Robert Fabry. This paper describes the first version of the original UNIX file system that was suitable for production use. It became known as FFS (Fast File System) or UFS (UNIX File System). The "fast" part of the name comes from the fact that the original UNIX file system maxed out at about 5% of disk bandwidth, whereas the first iteration of FFS could use about 50% - a huge improvement. This paper is absolutely foundational, as the majority of production UNIX file systems are FFS-style file systems. While some parts of this paper are obsolete (check out the section on rotational delay), it's a simple, readable explanation of basic file system architecture that you can refer back to time and again. Also, it's pretty fun to read a paper describing the first implementation of, for example, symbolic links for a UNIX file system.

For extra credit, you can read the original file system checker paper, Fsck - the UNIX file system check program, by Marshall Kirk McKusick and T. J. Kowalski. It describes the major issues in checking and repairing file system metadata consistency. Improving fsck is a hot topic in file systems right now, so reading this paper might be worthwhile.

Vnodes: An Architecture for Multiple File System Types in Sun UNIX by Steve Kleiman. The original UNIX file system interface had been designed to support exactly one kind of file system. With the advent of FFS and other file systems, operating systems now needed to support several different file systems. Several solutions were proposed, but the dominant solution ended up being the VFS (Virtual File System) interface, first proposed and implemented by Sun. This paper explains the rationale behind VFS and vnodes.

Design and Implementation of the Sun Network Filesystem by Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Once upon a time (1985, specifically), people weren't really clear on why you would want a network file system (as opposed to, for example, a network disk or copying around files via rcp). This paper explains the needs and requirements that resulted in the invention of NFS, the network file system everyone loves to hate but uses all the time anyway. It also discusses the design of the VFS. A fun quote from the paper: "One of the advantages of the NFS was immediately obvious: as the df output below shows, a diskless workstation can have access to more than a Gigabyte of disk!"

Slaying the fsck dragon

One of the major problems in file systems is keeping the on-disk data consistent in the event that a file system is interrupted in the middle of an update (for example, if the system loses power). Original FFS solved this problem by running fsck on the file system after a crash or other unclean unmount, but this took a really long time and could lose data. Many smart people thought about this problem and came up with four major approaches: journaling, log-structured file systems, soft updates, and copy-on-write. Each method provided a way of quickly recovering the file system after a crash. The most popular approach was journaling, since it was both relatively simple and easy to "bolt-on" to existing FFS-style file systems.

Journaling file systems solve the fsck problem by first writing an entry describing an update to the file system to an on-disk journal - a record of file system operations. Once the journal entry is complete, the main file system is updated; if the operation is interrupted, the journal entry is replayed on the next mount, completing any half-finished operations in progress at the time of the crash. Most production file systems (including ext3, XFS, VxFS, logging UFS, and reiserfs) use journaling to avoid fsck after a crash. No canonical journaling paper exists outside the database literature (from whence the idea was lifted wholesale), but Journaling the Linux ext2fs Filesystem by Stephen Tweedie is a good choice for learning both journaling techniques in general and the details of ext3 in particular.
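The write-ordering discipline described above - journal first, main file system second, replay on recovery - can be shown with a toy in-memory model. Everything here (the structure, the function names, the "disk" itself) is invented for illustration; a real journal also deals with checksums, transactions spanning many blocks, and careful flushing to stable storage.

```c
#include <string.h>

#define NBLOCKS 8

/* One logged operation: "write this value to this block". */
struct update {
	int block;
	int value;
};

/* Toy "disk": a main area plus an append-only journal. */
struct disk {
	int main_area[NBLOCKS];
	struct update journal[16];
	int journal_committed[16];	/* entry fully written to the journal? */
	int njournal;
};

/* Step 1: record the update in the journal and mark it complete. */
static int journal_log(struct disk *d, int block, int value)
{
	int idx = d->njournal++;

	d->journal[idx].block = block;
	d->journal[idx].value = value;
	d->journal_committed[idx] = 1;	/* only set once the entry is whole */
	return idx;
}

/* Step 2: apply the logged update to the main file system area. */
static void apply_update(struct disk *d, int idx)
{
	d->main_area[d->journal[idx].block] = d->journal[idx].value;
}

/* Recovery after a crash: replay every complete journal entry.
 * Entries interrupted mid-write (not committed) are simply ignored. */
static void journal_replay(struct disk *d)
{
	for (int i = 0; i < d->njournal; i++)
		if (d->journal_committed[i])
			apply_update(d, i);
}
```

If a crash lands between journal_log() and apply_update(), journal_replay() finishes the half-done operation on the next mount; if it lands mid-journal-write, the incomplete entry is discarded and the main area is untouched either way.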

The Design and Implementation of a Log-Structured File System by Mendel Rosenblum and John K. Ousterhout. Journaling file systems have to write each operation to disk twice: once in the log, and once in the final location. What would happen if we only wrote the data to disk once - in the journal? While the log-structured architecture was an electrifying new idea, it ultimately turned out to be impractical for production use, despite the concerted efforts of many computer science researchers. Today, no major production file system is log-structured. (Note that a log-structured file system is not the same as a logging file system - logging is another name for journaling.)

If you're looking for cocktail party gossip, Margot Seltzer and several colleagues published papers critiquing and comparing log-structured file systems to variations of FFS-style file systems, in which LFS usually came out rather the worse for the wear. This led to a semi-famous flame war in the form of web pages, archived here.

Soft Updates: A Technique for Eliminating Most Synchronous Writes in the Fast Filesystem by Marshall Kirk McKusick and Greg Ganger. Soft updates carefully orders writes to a file system such that in the event of a crash, the only inconsistencies are relatively harmless ones - leaked blocks and inodes. After a crash, the file system is mounted immediately and fsck runs in the background. The performance of soft updates is excellent, but the complexity is very high - as in, soft updates has been implemented only once (on BSD) to my knowledge. Personally, it took me about 5 years to thoroughly understand soft updates and I haven't met anyone other than the authors who claimed to understand it well enough to implement it. The paper is pretty understandable up to about page 5, at which point your head will explode. Don't feel bad about this, it happens to everyone.

File System Design for an NFS File Server Appliance by Dave Hitz, James Lau, and Michael Malcolm. This paper describes the file system used inside NetApp file servers, Write-Anywhere File Layout (WAFL), as of 1994 (it's been improved in many ways since then). WAFL was the first major use of a copy-on-write file system - one in which "live" (in use) metadata is never overwritten in place but copied elsewhere on disk. Once a consistent set of updates has been written to disk, the "superblock" is re-written to point to the new set of metadata. Copy-on-write has an interesting set of trade-offs all its own, but has been implemented in a production file system twice now; Solaris's ZFS is also a copy-on-write file system.
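The copy-on-write pattern just described - never overwrite live metadata, write a shadow copy, then atomically repoint the "superblock" - can be reduced to a toy sketch. All names here are invented for illustration; real copy-on-write file systems manage trees of blocks and free-space accounting, not a pair of fixed arrays.

```c
#include <string.h>

#define NMETA 4

/* Toy model: two complete copies of the metadata, plus a "superblock"
 * field recording which copy is currently live. */
struct cow_fs {
	int meta[2][NMETA];
	int live;		/* index of the live metadata copy */
};

/* Update one metadata slot copy-on-write style: clone the live copy
 * into the shadow area, modify only the shadow, then flip the
 * "superblock" pointer.  The old live copy is never touched, so a
 * crash at any point leaves a consistent copy reachable. */
static void cow_update(struct cow_fs *fs, int slot, int value)
{
	int next = 1 - fs->live;

	memcpy(fs->meta[next], fs->meta[fs->live], sizeof(fs->meta[next]));
	fs->meta[next][slot] = value;	/* write only the shadow copy */
	fs->live = next;		/* atomic "superblock" rewrite */
}
```

The interesting property is that the final pointer flip is the only step that must be atomic; everything before it is invisible to a reader of the old live copy.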

File system performance

Each of these papers focuses on file system performance, but also introduces more than one interesting idea and makes a good starting point for exploring several areas of file system design and implementation.

Extent-like Performance from a UNIX File System by Larry McVoy and Steve Kleiman. This 1991 paper describes optimizations to FFS that doubled file system bandwidth for sequential I/O workloads. While the optimizations described in this paper are considered old hat these days (ever heard of readahead?), it's a good introduction to file system performance.

Sidebar: Where are they now?

You might have recognized some of the names in the author lists of the papers in this article - and chances are, you aren't recognizing their names because of their file system work. What else did these people do? Here's a totally non-scientific selection.

  • Bill Joy - co-founded Sun Microsystems
  • Larry McVoy - wrote BitKeeper, co-founded BitMover
  • Steve Kleiman - CTO, Network Appliance
  • Mendel Rosenblum - co-founder, VMware
  • John Ousterhout - wrote Tcl/Tk, co-founded several companies
  • Margot Seltzer - co-founder, Sleepycat Software
  • Dave Hitz - co-founder, Network Appliance
Obviously, anyone wanting to found a successful company and make millions of dollars should consider writing a file system first.

Scalability in the XFS File System by Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. This paper describes the motivation and implementation of XFS, a 64-bit file system using extents, B+ trees, dynamically allocated inodes, and journaling. XFS is not by any means an FFS-style file system and reading this paper will give you the basics on most extent-based file systems. It also describes quite a few useful optimizations for avoiding fragmentation, scaling to multiple threads, and the like.

The Utility of File Names by Daniel Ellard, Jonathan Ledlie, and Margot Seltzer. File system performance and on-disk layout can be vastly improved if the file system can predict (with reasonable accuracy) the size and access pattern of a file before it writes it to disk. The obvious solution is to add a new set of file system interfaces allowing the application to give explicit hints about the size and properties of a new file. Unfortunately, the history of file systems is littered with unused per-file interfaces like this (how often do you set the noatime flag on a file?). However, it turns out that applications are already giving these hints - in the form of file names, permissions, and other per-file properties. This paper is the first in a series demonstrating that a file system can make useful predictions about the future of a file based on the file name and other properties.

Further reading and acknowledgments

If you are interested in learning more about file systems, check out the Linux file systems wiki, especially the reading list. If you have a good file systems paper or book, please add it to the list, which is publicly editable (look for the password on the front page of the wiki). Note that I will ignore any comments of the form "You should have included paper XYZ!" unless it is also added to the reading list on the file systems wiki - WITH a short summary of the paper. With any luck, we'll have a fairly complete list of Linux file systems papers in the next few days.

If you are interested in working on file systems, or any other area of systems programming, you should contact the author at val dot henson at gmail dot com.

Thanks to Nikita Danilov, Zach Brown, and Kristen Accardi for paper suggestions and encouragement to write this article. Thanks to Theodore Y. Ts'o for actually saying something very similar to the stunning riposte in the first paragraph (which was, by the way, a completely accurate and very incisive criticism of what I was working on at the moment).


Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds