Brief items
The current stable 2.6 kernel is 2.6.17.11,
released on August 23. It is a
relatively large patch set with fixes for a number of important bugs.
Before that, 2.6.17.10 was
released on August 22.
This one has three security fixes: one for a privilege escalation problem
in the SCTP code, one for a UDF filesystem memory corruption bug, and one
for a crash which can only be triggered by a privileged user.
2.6.17.9 was released on
August 18 with a single, PowerPC-specific security fix.
The current 2.6 prepatch remains 2.6.18-rc4.
Linus surfaced from his vacation long enough to merge 100 or so fixes into
the mainline repository, but little more than that has happened on that
front.
The current -mm tree is 2.6.18-rc4-mm2. Recent changes
to -mm include a kernel stack protector patch set (labeled "security
placebo"), the ability to filter core dumps through a helper application,
and a lot of fixes.
The current stable 2.4 kernel is 2.4.33.2, released on August 22. It
contains a small number of fixes, including one for the latest SCTP
vulnerability. Previously, 2.4.33.1 came out on
August 19 with another pair of security fixes.
The 2.4.34 process has begun with 2.4.34-pre1, with another small
set of fixes (but see below as well).
Kernel development news
Under the long-lasting maintainership of Marcelo Tosatti, the 2.4 kernel
went into a deep maintenance mode, with only important fixes being
considered for merging. For some people, perhaps, it was a little too deep
- Marcelo clearly had other tasks besides 2.4 maintenance keeping him
busy. Even so, few expected major changes when Willy Tarreau took over 2.4
maintenance after the 2.4.33 release. Why mess with 2.4 at this point?
So Willy's 2.4.34-pre1
announcement raised a few eyebrows. The prepatch itself contains a
relatively small number of patches of the type one would expect. But the
announcement itself notes that Willy is considering merging a set of
patches to allow 2.4 kernels to be built with current gcc 4.x
compilers. This is not a trivial set of changes; gcc 4.x is
sufficiently different that a fairly wide-ranging set of fixes is
required. The gcc 4.x transition for 2.6 was not an overnight
affair.
A clear question comes immediately to mind: why would somebody who is not
interested in running a current kernel be bothering with contemporary
compilers? One answer is to be found in the announcement itself: there are
administrators who deploy 2.4 kernels on ultra-stable systems, but who build
those kernels on their desktops. It is getting increasingly hard to find a
current distribution with a compiler old enough to build 2.4 kernels, so
these administrators are finding themselves in a bit of a bind. A 2.4
kernel which could be compiled with a current gcc would allow current
systems to be used to build kernels for deployment on stable, production
systems, many of which may not have their own compilers installed at all.
Solar Designer has also noted that Openwall GNU/*/Linux is planning to
upgrade to gcc 4.x and would really rather not have to change to the
2.6 kernel at the same time.
For an interesting read, see Willy's
description of the user base, as he sees it, for the 2.4 kernel. In
his view, the major users are those setting up very high-reliability
sites. These people prefer 2.4 kernels for this job:
Simply because we already know from collective experience that
these versions can achieve very long uptimes (while we don't know
this yet for a fresh new version which got 5700 patches in the last
3 months), and because with the addition of very few patches, you
can make a bet on security: as long as newly discovered
vulnerabilities don't affect you or are covered by your additional
patches, you win. If you need to update and induce excessive
downtime, you lose and pay penalties.
The idea is to keep these people happy - by enabling the use of current
compilers, among other things - until a 2.6 kernel comes along which is
able to provide the same sort of stability guarantees. The 2.6 development
model makes that sort of guarantee harder, however, because older 2.6.x
kernels go out of general maintenance relatively quickly (though
distributors can and do maintain them for longer). It is hard to find a
2.6 kernel with a multi-year track record of reliability, security, and
ongoing fixes.
Willy's hope is that the current 2.6.16 kernel, which Adrian Bunk has
stepped forward to maintain for the long term, will help in this regard.
Once 2.6.16 has received a year or two of fixes (and nothing else), it
might reach a point where high-reliability people might trust it in
deployed systems. Time will tell if this kernel is able to reach that
point.
As an aside, it's worth mentioning that a small number of developers (well,
OK, one developer) have expressed some discontent about the 2.6.16
long-term process. This developer has said
that it would have been better to elect an extra-stable tree maintainer
through some sort of popular vote, and, perhaps, to move on to a 2.7
development series as well. This complaint ignores the fact that
volunteers to maintain 2.6 kernels over the long term have been in
relatively short supply; in fact, Adrian would appear to be about the only
one. It does not appear that Adrian's appointment as the long-term 2.6.16
maintainer has deprived anybody else of their lifetime dreams. So
maintainer elections - other than those of the "vote with your feet"
variety - seem unlikely to happen in the near future.
The proposed kevent interface has been covered here before - see
this article and
this one too. Kevents appear to
have gained significant momentum over the last few weeks, to the point that
inclusion in 2.6.19 is not entirely out of the question. Most developers
who have reviewed the code seem
to like the core idea (a unified interface for applications to get
information on all events of interest) and the implementation within the
kernel. Only now, however, is significant attention being paid to the user-space API
which comes with kevents. But the definition of that API is of crucial
importance. This article will look at it from two perspectives - first
technical, then political.
The discussion of the proposed API has been hampered somewhat by the lack
of associated documentation - and the fact that said API is still changing
quickly. In an attempt to pull together some of the available information,
Stephen Hemminger has put up a page at OSDL
describing the system call API. That page misses one important aspect of
kevents, however: the ability to receive events via a shared memory
interface. In an attempt to fill that gap, we'll look at the
August 23 version of the memory-mapped kevent API.
One of the goals behind kevents is to make the processing of events as fast
as possible - the idea being that a high-performance network server (say)
can work through vast numbers of events per second without appreciable
system overhead. One way to achieve this is to avoid system calls
altogether whenever possible. That is why there is interest in mapping
kevents directly into user space; this approach will allow the application
to consume them without calling into the kernel for each one.
To use the mmap interface, the application obtains a kevent file
descriptor, as usual. A simple call to mmap() will then create
the shared buffer for kevent communication. The size of this buffer is
currently determined by an in-kernel parameter - the maximum number of
kevents which will be stored there. Presumably there will eventually be a
KEVENT_MMAP_PAGES macro (or some such) to free the application
from trying to figure out how many pages it should map, but that is not yet
provided.
The resulting memory array is treated as a big circular buffer by the
kernel. There is a single index only, however - where the next event will
be written by the kernel. In other words, the kernel has no way to know
which events have been consumed by the application; if that application
falls too far behind, the kernel will begin to overwrite unprocessed
events. For this reason, perhaps, the buffer is made relatively large -
4096 events fit there in the current version of the patch.
The events stored in the buffer are not the same ukevent
structures used by the system call interface. There is, instead, a
shortened version in the form of struct mukevent:
    struct kevent_id
    {
        union {
            __u32 raw[2];
            __u64 raw_u64 __attribute__((aligned(8)));
        };
    };

    struct mukevent
    {
        struct kevent_id id;
        __u32 ret_flags;
    };
The id field contains some information about what happened: the
relevant file descriptor, for example. The actual event code itself is not
present, however.
The event ring is not quite a pure circular buffer. It is formatted with a
four-byte field at the beginning of each page, followed by as many
mukevent structures as will fit within the page. The four-byte
field in the first page contains the current event index - where the kernel
will write the next event. The application will, presumably, keep track of
the last event it read from the buffer, moving that counter forward until
it catches up with the kernel. The application must take care, however, to
notice every time it crosses a page boundary so it can skip the count
field.
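The page-by-page layout described above comes down to a small index calculation, sketched here. This is only an illustration and is not taken from the kevent patch: the names ring_layout, ring_events_per_page(), ring_event_offset(), and ring_next() are hypothetical, and the page size, header size, and event size are passed in as parameters rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the ring geometry; none of these names
 * come from the kevent patch. */
struct ring_layout {
    size_t page_size;    /* bytes per mapped page */
    size_t header_size;  /* bytes reserved at the start of each page */
    size_t event_size;   /* bytes occupied by one event slot */
    size_t total_events; /* capacity of the whole ring, in events */
};

/* Number of whole events that fit in one page after its header. */
size_t ring_events_per_page(const struct ring_layout *l)
{
    assert(l->event_size > 0 && l->page_size > l->header_size);
    return (l->page_size - l->header_size) / l->event_size;
}

/* Byte offset, from the start of the mapping, of the event at 'index'.
 * The per-page header is skipped on every page, not just the first. */
size_t ring_event_offset(const struct ring_layout *l, size_t index)
{
    size_t per_page = ring_events_per_page(l);
    size_t page, slot;

    assert(per_page > 0 && l->total_events > 0);
    index %= l->total_events;      /* the ring wraps around */
    page = index / per_page;       /* which page holds the event */
    slot = index % per_page;       /* position within that page */
    return page * l->page_size + l->header_size + slot * l->event_size;
}

/* Advance the application's private read index by one event. */
size_t ring_next(const struct ring_layout *l, size_t index)
{
    assert(l->total_events > 0);
    return (index + 1) % l->total_events;
}
```

With illustrative values of 4096-byte pages, a four-byte header, and 16-byte event slots, 255 events fit in each page, so event 255 sits at byte 4100 of the mapping rather than at byte 4084. Note that nothing in this calculation lets the reader detect that the kernel has lapped it.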
Since there is no way to inform the kernel that events have been consumed
from the memory-mapped ring, and since the full event information is not
available via that ring, the application must still call into the kernel
for events. Otherwise, if nothing else, they will accumulate there until
they reach their maximum allowed number. So the advantage of the
memory-mapped approach will be hard to obtain with the current code. As
was noted above, however, this API is very young. One assumes that these
little problems will be ironed out in the near future.
Meanwhile, kevents have created a separate discussion on how new APIs go
into the kernel. One Nicholas Miell requested that some documentation for this
interface be written:
Is any of this documented anywhere? I'd think that any new
userspace interfaces should have man pages explaining their use and
some example code before getting merged into the kernel to shake
out any interface problems.
The response he got was "Get
real". Others suggested that, if Mr. Miell really wanted
documentation, he could sit down and write it himself. It must be said
that, through the discussion, Mr. Miell has comported himself in a way
which is highly unlikely to inspire cooperation from anybody. He seems to
carry a certain contempt for the interface, the process, and the people
involved in it.
But it must also be said that he has a point. The creation of user-space
APIs differs from how most kernel code is written. Much is made of the
evolutionary nature of the kernel itself - things continually evolve as
better solutions to problems are found. User-space interfaces, however,
cannot evolve - once they are shipped as part of a mainline kernel, they
are set in stone and must be maintained forever. They must be right from
the outset. So it is not
unreasonable to expect that the level of review for new user-space APIs
would be higher, and that documentation of proposed APIs, which can be
expected to help the review process, should be provided. It is true,
however, that the original developer is not always the best person to
provide that documentation.
One question which has been raised about this interface has to do with its
similarity to the FreeBSD kqueue
mechanism. The intent of the interface is the same, but no attempt to
emulate the kqueue API has been made. Andrew Morton has said:
I mean, if there's nothing wrong with kqueue then let's minimise
app developer pain and copy it exactly. If there _is_ something
wrong with kqueue then let us identify those weaknesses and then
diverge. Doing something which looks the same and works the same
and does the same thing but has a different API doesn't benefit
anyone.
There are, evidently, real reasons for not replicating the kqueue
interface, but those reasons have not, yet, been made clear.
Kevents will, it is hoped, be a major improvement for people writing
applications for Linux. This new API should bring together all information
of interest into a single place, provide significant performance benefits,
and ease porting of applications from other operating systems. But, if
this API is going to meet the high expectations being placed on it, it will
require a high level of review from a number of interested parties. That
review is now starting to happen, so expect this API to remain in flux for
some time yet.
August 21, 2006
This article was contributed by Valerie Henson
We've all been there - you're wandering around a party at some Linux
event clutching your drink and looking for someone to talk to, but
everyone is having some obscure technical conversation full of
unfamiliar jargon. Then, as you slide past a cluster of
important-looking people, you overhear the word "superblock" and
think, "Superblock, that's a file system thing... I read about file
systems in operating systems class once." Gratefully, you join the
conversation, only to discover that, while you know some of the terms -
cylinder group, indirect block, inode - you're still unable to come up
with stunning ripostes like, "Aha, but that's really just another
version of soft updates, and it doesn't solve the nlinks problem."
(Admiring silence ensues.) Now what? You want to be able to make
witty remarks about the pros and cons of journaling while throwing
back the last of your martini, but you don't know where to start.
Fortunately, you can get a decent grasp of modern file systems without
reading a whole book on file systems. (I haven't yet read a book on
file systems I would recommend, anyway.) After reading these file
systems papers (or at least their abstracts), you'll be able to at
least fake a working knowledge of file systems - as long as everyone
is drinking and it's too loud to hear anyone clearly. Enjoy!
The Basics
These papers are oldies but goodies. While the systems they describe
are fairly obsolete and have been heavily improved since these initial
descriptions, they make a good introduction to file systems structure
and terminology.
A Fast File
System for UNIX by Marshall Kirk McKusick, William Joy, Samuel
Leffler and Robert Fabry. This paper describes the first version of
the original UNIX file system that was suitable for production use.
It became known as FFS (Fast File System) or UFS (UNIX File System).
The "fast" part of the name comes from the fact that the original UNIX
file system maxed out at about 5% of disk bandwidth, whereas the first
iteration of FFS could use about 50% - a huge improvement. This paper
is absolutely foundational, as the majority of production UNIX file
systems are FFS-style file systems. While some parts of this paper
are obsolete (check out the section on rotational delay), it's a
simple, readable explanation of basic file system architecture that
you can refer back to time and again. Also, it's pretty fun to read a
paper describing the first implementation of, for example, symbolic
links for a UNIX file system.
For extra credit, you can read the original file system checker paper,
Fsck
- the UNIX file system check program, by Marshall Kirk McKusick
and T. J. Kowalski. It describes the major issues in checking and
repairing file system metadata consistency. Improving fsck is a hot topic in file systems
right now, so reading this paper might be worthwhile.
Vnodes:
An Architecture for Multiple File System Types in Sun UNIX by
Steve Kleiman. The original UNIX file system interface had been
designed to support exactly one kind of file system. With the advent
of FFS and other file systems, operating systems now needed to support
several different file systems. Several solutions were proposed, but
the dominant solution ended up being the VFS (Virtual File System)
interface, first proposed and implemented by Sun. This paper explains
the rationale behind VFS and vnodes.
Design
and Implementation of the Sun Network Filesystem by Russel
Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon.
Once upon a time (1985, specifically), people weren't really clear on
why you would want a network file system (as opposed to, for example,
a network disk or copying around files via rcp). This paper explains
the needs and requirements that resulted in the invention of NFS, the
network file system everyone loves to hate but uses all the time
anyway. It also discusses the design of the VFS. A fun quote from
the paper: "One of the advantages of the NFS was immediately obvious:
as the df output below shows, a diskless workstation can have access
to more than a Gigabyte of disk!"
Slaying the fsck dragon
One of the major problems in file systems is keeping the on-disk data
consistent in the event that a file system is interrupted in the
middle of an update (for example, if the system loses power). Original
FFS solved this problem by running fsck on the file system after a
crash or other unclean unmount, but this took a really long time and
could lose data. Many smart people thought about this problem and
came up with four major approaches: journaling, log-structured file
systems, soft updates, and copy-on-write. Each method provided a way
of quickly recovering the file system after a crash. The most popular
approach was journaling, since it was both relatively simple and easy
to "bolt-on" to existing FFS-style file systems.
Journaling file systems solve the fsck problem by first writing an
entry describing an update to the file system to an on-disk journal - a
record of file system operations. Once the journal entry is complete,
the main file system is updated; if the operation is interrupted, the
journal entry is replayed on the next mount, completing any
half-finished operations in progress at the time of the crash. Most
production file systems (including ext3, XFS, VxFS, logging UFS, and
reiserfs) use journaling to avoid fsck after a crash. No canonical
journaling paper exists outside the database literature (from whence
the idea was lifted wholesale), but Journaling
the Linux ext2fs Filesystem by Stephen Tweedie is a good choice
for learning both journaling techniques in general and the details of
ext3 in particular.
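The journal-then-update sequence described above can be illustrated with a toy, in-memory model. This is a minimal sketch under stated assumptions, not the ext3 design: every name here (toy_fs, toy_log(), toy_commit(), toy_checkpoint(), toy_replay()) is hypothetical, a "block" is a single integer, and a crash is simulated by stopping the checkpoint early. Discarding a journal that was never marked complete is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS     8   /* size of the toy "main file system" */
#define JOURNAL_MAX 4   /* maximum entries in one transaction */

struct journal_entry {
    int block;          /* which block the update targets */
    int value;          /* the new contents of that block */
};

struct toy_fs {
    int blocks[NBLOCKS];                       /* main file system area */
    struct journal_entry journal[JOURNAL_MAX]; /* the on-"disk" journal */
    int journal_len;                           /* entries logged so far */
    int committed;                             /* 1 once the entry is complete */
};

void toy_init(struct toy_fs *fs)
{
    memset(fs, 0, sizeof(*fs));
}

/* Step 1: describe the update in the journal; the main area is untouched. */
void toy_log(struct toy_fs *fs, int block, int value)
{
    assert(fs->journal_len < JOURNAL_MAX);
    assert(block >= 0 && block < NBLOCKS);
    fs->journal[fs->journal_len].block = block;
    fs->journal[fs->journal_len].value = value;
    fs->journal_len++;
}

/* Step 2: mark the journal entry as complete. */
void toy_commit(struct toy_fs *fs)
{
    fs->committed = 1;
}

/* Step 3: copy journaled updates into the main area. Stopping after
 * 'max_writes' writes simulates a crash partway through. Returns the
 * number of writes performed; the journal is cleared only if all of
 * its entries were applied. */
int toy_checkpoint(struct toy_fs *fs, int max_writes)
{
    int i, done = 0;

    if (!fs->committed)
        return 0;
    for (i = 0; i < fs->journal_len && done < max_writes; i++) {
        fs->blocks[fs->journal[i].block] = fs->journal[i].value;
        done++;
    }
    if (i == fs->journal_len) {
        fs->journal_len = 0;
        fs->committed = 0;
    }
    return done;
}

/* On the next mount: replay a complete journal, discard an incomplete one. */
void toy_replay(struct toy_fs *fs)
{
    if (fs->committed)
        toy_checkpoint(fs, fs->journal_len);
    else
        fs->journal_len = 0;
}
```

Replay is safe to run after a partial checkpoint because each entry simply rewrites a block with its final value; applying it a second time changes nothing.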
The
Design and Implementation of a Log-Structured File System by
Mendel Rosenblum and John K. Ousterhout. Journaling file systems have
to write each operation to disk twice: once in the log, and once in
the final location. What would happen if we only wrote the data to
disk once - in the journal? While the log-structured architecture was an
electrifying new idea, it ultimately turned out to be impractical for
production use, despite the concerted efforts of many computer science
researchers. Today, no major production file system is
log-structured. (Note that a log-structured file system is
not the same as a logging file system - logging is another
name for journaling.)
If you're looking for cocktail party gossip, Margo Seltzer
and several colleagues published papers critiquing and comparing
log-structured file systems to variations of FFS-style file systems,
in which LFS usually came out rather the worse for the wear. This led
to a semi-famous flame war in the form of web pages, archived
here.
Soft
Updates: A Technique for Eliminating Most Synchronous Writes in the
Fast Filesystem by Marshall Kirk McKusick and Greg Ganger. Soft
updates carefully orders writes to a file system such that in the
event of a crash, the only inconsistencies are relatively harmless
ones - leaked blocks and inodes. After a crash, the file system is
mounted immediately and fsck runs in the background. The performance
of soft updates is excellent, but the complexity is very high - as in,
soft updates has been implemented only once (on BSD) to my knowledge.
Personally, it took me about 5 years to thoroughly understand soft
updates and I haven't met anyone other than the authors who claimed to
understand it well enough to implement it. The paper is pretty
understandable up to about page 5, at which point your head will
explode. Don't feel bad about this, it happens to everyone.
File System Design
for an NFS File Server Appliance by Dave Hitz, James Lau, and
Michael Malcolm. This paper describes the file system used inside
NetApp file servers, Write-Anywhere File Layout (WAFL), as of 1994
(it's been improved in many ways since then). WAFL was the first
major use of a copy-on-write file system - one in which "live" (in
use) metadata is never overwritten in place but copied elsewhere on
disk. Once a consistent set of updates has been written to disk, the
"superblock" is re-written to point to the new set of metadata.
Copy-on-write has an interesting set of trade-offs all its own, but
has been implemented in a production file system twice now; Solaris's ZFS
is also a copy-on-write file system.
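The copy-on-write idea (write the new metadata elsewhere, then switch the "superblock" pointer in a single step) can be sketched with a toy model. This is an illustration only and says nothing about how WAFL or ZFS are actually built: the names cow_fs, cow_init(), cow_read(), and cow_update() are hypothetical, the metadata is a single integer per block, and a crash is simulated with a flag.

```c
#include <assert.h>

#define COW_BLOCKS 4    /* size of the toy metadata area */

struct cow_fs {
    int data[COW_BLOCKS];   /* contents of each metadata block */
    int live[COW_BLOCKS];   /* 1 if the superblock references the block */
    int super;              /* index of the block the superblock points to */
};

void cow_init(struct cow_fs *fs, int value)
{
    int i;

    for (i = 0; i < COW_BLOCKS; i++) {
        fs->data[i] = 0;
        fs->live[i] = 0;
    }
    fs->data[0] = value;
    fs->live[0] = 1;
    fs->super = 0;
}

/* Readers always follow the superblock to the live copy. */
int cow_read(const struct cow_fs *fs)
{
    assert(fs->super >= 0 && fs->super < COW_BLOCKS);
    return fs->data[fs->super];
}

/* Write 'value' into a block that is not live, then repoint the
 * superblock. If 'crash_before_switch' is set, stop after the first
 * step: the live copy was never touched, so the old state survives.
 * Returns the new live block index, or -1 if nothing was switched. */
int cow_update(struct cow_fs *fs, int value, int crash_before_switch)
{
    int i, target = -1;

    for (i = 0; i < COW_BLOCKS; i++) {
        if (!fs->live[i]) {
            target = i;
            break;
        }
    }
    if (target < 0)
        return -1;                  /* no free block available */

    fs->data[target] = value;       /* new copy, written elsewhere */
    if (crash_before_switch)
        return -1;                  /* superblock still points at old copy */

    fs->live[fs->super] = 0;        /* old copy becomes reusable */
    fs->live[target] = 1;
    fs->super = target;             /* the single switching step */
    return target;
}
```

The point of the sketch is that there is no moment at which the live copy is half-written: an interrupted update leaves the old state fully readable, and a completed one exposes the new state all at once.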
File system performance
Each of these papers focuses on file system performance, but also
introduces more than one interesting idea and makes a good starting
point for exploring several areas of file system design and
implementation.
Extent-like
Performance from a UNIX File System by Larry McVoy and Steve
Kleiman. This 1991 paper describes optimizations to FFS that doubled
file system bandwidth for sequential I/O workloads. While the
optimizations described in this paper are considered old hat these
days (ever heard of readahead?), it's a good introduction to file
system performance.
Sidebar: Where are they now?
You might have recognized some of the names in the author lists of the
papers in this article - and chances are, you aren't recognizing their
names because of their file system work. What else did these people
do? Here's a totally non-scientific selection.
- Bill Joy - co-founded Sun Microsystems
- Larry McVoy - wrote BitKeeper, co-founded BitMover
- Steve Kleiman - CTO, Network Appliance
- Mendel Rosenblum - co-founder, VMware
- John Ousterhout - wrote Tcl/Tk, co-founded several companies
- Margo Seltzer - co-founder, Sleepycat Software
- Dave Hitz - co-founder, Network Appliance
Obviously, anyone wanting to found a successful company and make
millions of dollars should consider writing a file system first.
Scalability
in the XFS File System by Adam Sweeney, Doug Doucette, Wei Hu,
Curtis Anderson, Mike Nishimoto, and Geoff Peck. This paper describes
the motivation and implementation of XFS, a 64-bit file system using
extents, B+ trees, dynamically allocated inodes, and journaling. XFS
is not by any means an FFS-style file system and reading this paper
will give you the basics on most extent-based file systems. It also
describes quite a few useful optimizations for avoiding fragmentation,
scaling to multiple threads, and the like.
The
Utility of File Names by Daniel Ellard, Jonathan Ledlie, and
Margo Seltzer. File system performance and on-disk layout can be
vastly improved if the file system can predict (with reasonable
accuracy) the size and access pattern of a file before it writes it to
disk. The obvious solution is to add a new set of file system
interfaces allowing the application to give explicit hints about the
size and properties of a new file. Unfortunately, the history of file
systems is littered with unused per-file interfaces like this (how
often do you set the noatime flag on a file?). However, it turns out
that applications are already giving these hints - in the form of file
names, permissions, and other per-file properties. This paper is the
first in a series demonstrating that a file system can
make useful predictions about the future of a file based on the file
name and other properties.
Further reading and acknowledgments
If you are interested in learning more about file systems, check out
the
Linux file systems wiki,
especially the
reading
list. If you have a good file systems paper or book, please add
it to the
list, which is publicly editable (look for the password on the
front page of the wiki). Note that I will ignore any comments of the
form "You should have included paper XYZ!" unless it is also added to
the reading list on the file systems wiki - WITH a short summary of
the paper. With any luck, we'll have a fairly complete list of Linux
file systems papers in the next few days.
If you are interested in working on file systems, or any other area of
systems programming, you should contact the author at val dot henson
at gmail dot com.
Thanks to Nikita Danilov, Zach Brown, and Kristen Accardi for paper
suggestions and encouragement to write this article. Thanks to
Theodore Y. Ts'o for actually saying something very similar to the
stunning riposte in the first paragraph (which was, by the way, a
completely accurate and very incisive criticism of what I was working
on at the moment).
Page editor: Jonathan Corbet