Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.31-rc3; no kernel prepatches have been released in the last week. Patches continue to flow into the mainline repository, though, and the -rc4 release may be out by the time you read this.

The current stable kernel is 2.6.30.2, released (along with 2.6.27.27) on July 19. Both updates contain a number of security-relevant fixes, including some inspired by the recent NULL-pointer exploit. Note that users have reported boot-time problems with 2.6.27.27; it seems that this kernel ran afoul of a GCC 4.2 bug which causes it to be miscompiled.

2.4.37.3, the first 2.4 update in some time, was released on July 19. This release, too, was motivated by the NULL pointer exploit; it also fixes some serious problems with the r8169 network driver.

Comments (none posted)

Kernel development news

Quotes of the week

[T]he official policy of intentionally omitting security relevant information in modifications to the Linux codebase is a disgrace and a disservice to yourselves, to other vendors, and to your customers. It demonstrates a lack of integrity and trust in your own products, though I know you have no intention of changing this policy as you're currently enjoying a reputation for security that is ill-gotten and has no basis in reality. Truthfully representing the seriousness of vulnerabilities in your software would tarnish that image, and that's not good for business.

-- Brad Spengler (comments in the exploit code)

Trying to 'confine' the 'unconfined' user with SELinux is an open problem which we don't currently even reasonably attempt address on a broad scale. It's like besieging the user in a gentle mist of water hoping they won't try to escape.

-- Eric Paris

Ok, so this is getting ridiculous. Do we have _three_ different kernel issues going on at the same time, all subtly related to tools issues rather than the kernel source tree itself?

-- Linus Torvalds

As for writing code defensively. Doing that routinely doubles your number of bugs.

-- Eric Biederman

Comments (22 posted)

In brief

By Jonathan Corbet
July 22, 2009

Hyper-V. Very few kernel submissions draw as much attention as Microsoft's contribution of its Hyper-V drivers to the staging tree. The drivers enable the virtualization of Linux under Windows, a feature which some find useful. In general, reactions included surprise and concern, and at least one prediction of immediate and utter doom. Much of the development community, though, treated it like just another patch submission. The quality of the code is not held to be great, but fixing such things up is what the staging tree is for.

For the curious, a little bit of history behind this submission can be found in this weblog posting by Stephen Hemminger.

VFAT. Andrew "Tridge" Tridgell is back with a new set of VFAT patches aimed at working around the patents being asserted against that filesystem. He has made progress in addressing the interoperability problems reported by testers, though a few small issues remain. As always, he's looking for testers who can identify any remaining problems with the patch.

Checkpoint/restart. Oren Laadan has posted a new set of checkpoint/restart patches which, he says, is "already suitable for many types of batch jobs." The patch adds a new clone_with_pids() system call allowing restored processes to be created with the same process ID they had at checkpoint time; it's not clear whether the security concerns with that capability have been addressed or not. There are still plenty of open issues with checkpoint/restart, including pending signals, FIFO devices, pseudo terminals, and more. It's a messy problem to try to solve, but this patch set seems to be getting closer. There's instructions in the patch for those who would like to experiment with it.

Flexible arrays. Kernel developers often find themselves needing to allocate multi-page chunks of contiguous memory. Typically such allocations are done with vmalloc(), but that solution is not ideal. The address space for vmalloc() allocations is restricted (on 32-bit systems, at least), and these allocations are rather less efficient than normal kernel memory allocations.

Responding to a request from Andrew Morton, Dave Hansen has proposed the addition of a flexible array API to the kernel. Flexible arrays would handle large allocations, but, under the hood, they use single-page chunks which can be allocated in a normal (and reliable) fashion. In brief, a flexible array is created with:

    struct flex_array *flex_array_alloc(int element_size, int total, 
    	   	      			gfp_t flags);

Once the array is created, data can be moved into and out of it with:

    int flex_array_put(struct flex_array *fa, int element_nr, void *src, 
    		       gfp_t flags);
    void *flex_array_get(struct flex_array *fa, int element_nr);

There's a number of other functions for freeing parts of an array, preallocating memory, etc.; see the patch posting for the full API.

Coarse clocks. Some applications want to get access to the system time as quickly as possible, but they are not concerned about obtaining absolute accuracy. To fill this need, John Stultz has proposed a couple of new clock types: CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE. In essence, these clocks work by returning the system's latest idea of the current time without actually asking any hardware. The idea was reasonably well received, with one concern: developers would hate to see this feature become one more obstacle to removing the periodic clock tick (and jiffies) in the future. This removal is far from imminent - there's a lot of work to be done first - but it remains desirable for a number of reasons.

Comments (6 posted)

Fun with NULL pointers, part 2

By Jonathan Corbet
July 21, 2009

Fun with NULL pointers, part 1 took a detailed look at the long chain of failures which allowed the kernel to be compromised by way of a NULL pointer dereference. Eliminating that particular bug was a straightforward fix; it was, in fact, fixed before the nature of the vulnerability was widely understood. The importance of this particular problem is, in one sense, relatively small; there are very few distributions which shipped vulnerable versions of the kernel. But this exploit suggests that there could be a whole class of related problems in the kernel; there is a definite chance that similar vulnerabilities could be discovered - if, indeed, they have not already been found.

One obvious problem is that ~~when the security module mechanism is configured into the kernel, the administrator-specified limits on the lowest valid user-space virtual address are ignored~~ security modules are allowed to override the administrator-specified limit (mmap_min_addr) on the lowest valid user-space address. This behavior is a violation of the understanding by which security modules operate: they are supposed to be able to restrict privileges, but never increase them. In this case, the mere presence of SELinux increased privilege, and the policy enforced by most SELinux deployments failed to close that hole (comments in the exploit code suggest that AppArmor fared no better).

Additionally, with security modules configured out entirely, mmap_min_addr was not enforced at all. The mainline now has a patch which causes the map_min_addr sysctl knob to always be in effect; this patch has also been put into the 2.6.27.27 and 2.6.30.2 updates (as have many of the others described here).

Things are also being fixed at the SELinux level. Future versions of Red Hat's SELinux policy will no longer allow unconfined (but otherwise unprivileged) processes to map pages into the bottom of the address space. There are still some open problems, though, especially when programs like WINE are thrown into the mix. It's not yet clear how the system can securely support a small number of programs needing the ability to map the zero page. Ideas like running WINE with root privilege - thus, perhaps, carrying Windows-like behavior a little too far - have garnered little enthusiasm.

There is another way around map_min_addr which also must be addressed: a privileged process which is run under the SVR4 personality will, at exec() time, have a read-only page mapped at the zero address. Evidently some old SVR4 programs expect that page to be there, but its presence helps to make null-pointer exploits possible. So another patch merged into mainline and the stable updates resets the SVR4 personality (or, at least, the part that maps the zero page) whenever a setuid program is run. This patch is enough to defeat the pulseaudio-based trick which was used to gain access to a zero-mapped page.

This change is not enough for some users, who have requested the ability to turn off the personality feature altogether. The ability to run binaries from 386-based Unix systems just lacks the importance it had in, say, 1995, so some question whether the personality feature makes any sense given its costs. Linus answered:

We could probably get rid of that idiotic feature. It's simply not important enough any more. Does anybody really care? At the same time, over years we've grown _other_ personality flags, and some of them are still relevant.

In particular, it seems that the ability to disable address-space randomization (which is a personality feature) is useful in a number of situations. So personality() is likely to stay, but its zero-page mapping feature might go away.

Yet another link in the chain of failure is the removal of the null-pointer check by the compiler. This check would have stopped the attack, but GCC optimized it out on the theory that the pointer could not (by virtue of already having been dereferenced) be NULL. GCC (naturally) has a flag which disables that particular optimization; so, from now on, kernels will, by default, be compiled with the -fno-delete-null-pointer-checks flag. Given that NULL might truly be a valid pointer value in the kernel, it probably makes sense to disable this particular optimization indefinitely.

One could well argue, though, that while all of the above changes are good, they also partly miss the point: a quality kernel would not be dereferencing NULL pointers in the first place. It's those dereferences which are the real bug, so they should really be the place where the problem is fixed. There is some interesting history here, though, in that kernel developers have often been advised to omit checks for NULL pointers. In particular, code like:

    BUG_ON(some_pointer == NULL);
    /* dereference some_pointer */

has often seen the BUG_ON() line removed with a comment like:

If we dereference NULL then the kernel will display basically the same information as would a BUG, and it takes the same action. So adding a BUG_ON here really doesn't gain us anything.

This reasoning is based on the idea that dereferencing a NULL pointer will cause a kernel oops. On its face, it makes sense: if the hardware will detect a NULL-pointer dereference, there is little point in adding the overhead of a software check too. But that reasoning is demonstrably faulty, as shown by this exploit. There are even legitimate reasons for mapping page zero, so it will never be true that a NULL pointer is necessarily invalid. One assumes that the relevant developers understand this now, but there may be a lot of places in the kernel where necessary pointer checks were removed from the code.

Most of the NULL pointer problems in the kernel are probably just oversights, though. Most of those, in turn, are not exploitable; if there is no way to cause the kernel to actually encounter a NULL pointer in the relevant code, the lack of a check does not change anything. Still, it would be nice to fix all of those up.

One way of finding these problems may be the Smatch static analysis tool. Smatch went quiet for some years, but it appears that Dan Carpenter is working on it again; he recently posted a NULL pointer bug that Smatch found for him. If Smatch could be turned into a general-purpose tool that could find this sort of problem, the result should be a more secure kernel. It is unfortunate that checkers like this do not seem to attract very many interested developers; free software is very much behind the state of the art in this area and it hurts us.

Another approach is being taken by Julia Lawall, who has put together a Coccinelle "semantic patch" to find and fix check-after-dereference bugs like the one found in the TUN driver. A series of patches (example) has been posted to fix a number of these bugs. Cases where a pointer is checked after the first dereference are probably a small subset of all the NULL pointer problems in the kernel, but each one indicates a situation where the programmer thought that a NULL pointer was possible and problematic. So they are all certainly worth fixing.

All told, the posting of this exploit has served as a sort of wakeup call for the kernel community; it will, with luck, result in the cleaning up of a lot of code and the closing of a number of security problems. Brad Spengler, the author of the exploit, is clearly hoping for a little more, though: he has often expressed concerns that serious kernel security bugs are silently fixed or dismissed as being denial-of-service problems at worst. Whether that will change remains to be seen; in the kernel environment, many bugs can have security implications which are not immediately obvious when the bug is fixed. So we may not see more bugs explicitly advertised as security issues, but, with luck, we will see more bugs fixed.

Comments (63 posted)

A short history of btrfs

July 22, 2009

This article was contributed by Valerie Aurora

You probably have heard of the cool new kid on the file system block, btrfs (pronounced "butter-eff-ess") - after all, Linus Torvalds is using it as his root file system on one of his laptops. But you might not know much about it beyond a few high-level keywords - copy-on-write, checksums, writable snapshots - and a few sensational rumors and stories - the Phoronix benchmarks, btrfs is a ZFS ripoff, btrfs is a secret plan for Oracle domination of Linux, etc. When it comes to file systems, it's hard to tell truth from rumor from vile slander: the code is so complex, the personalities are so exaggerated, and the users are so angry when they lose their data. You can't even settle things with a battle of the benchmarks: file system workloads vary so wildly that you can make a plausible argument for why any benchmark is either totally irrelevant or crucially important.

In this article, we'll take a behind-the-scenes look at the design and development of btrfs on many levels - technical, political, personal - and trace it from its origins at a workshop to its current position as Linus's root file system. Knowing the background and motivation for each step will help you understand why btrfs was started, how it works, and where it's going in the future. By the end, you should be able to hand-wave your way through a description of btrfs's on-disk format.

Disclaimer: I have two huge disclaimers to make: One, I worked on ZFS for several years while at Sun. Two, I have already been subpoenaed and deposed for the various Sun/NetApp patent lawsuits and I'd like to avoid giving them any excuse to subpoena me again. I'll do my best to be fair, honest, and scrupulously correct.

btrfs: Pre-history

Imagine you are a Linux file system developer. It's 2007, and you are at the Linux Storage and File systems workshop. Things are looking dim for Linux file systems: Reiserfs, plagued with quality issues and an unsustainable funding model, has just lost all credibility with the arrest of Hans Reiser a few months ago. ext4 is still in development; in fact, it isn't even called ext4 yet. Fundamentally, ext4 is just a straightforward extension of a 30-year-old format and is light-years behind the competition in terms of features. At the same time, companies are clamping down on funding for Linux development; IBM's Linux division is coming to the end of its grace period and needs to show profitability now. Other companies are catching wind of an upcoming recession and are cutting research across the board. They want projects with time to results measured in months, not years.

Ever hopeful, the file systems developers are meeting anyway. Since the workshop is co-located with USENIX FAST '07, several researchers from academia and industry are presenting their ideas to the workshop. One of them is Ohad Rodeh. He's invented a kind of btree that is copy-on-write (COW) friendly [PDF]. To start with, btrees in their native form are wildly incompatible with COW. The leaves of the tree are linked together, so when the location of one leaf changes (via a write - which implies a copy to a new block), the link in the adjacent leaf changes, which triggers another copy-on-write and location change, which changes the link in the next leaf... The result is that the entire btree, from top to bottom, has to be rewritten every time one leaf is changed.

Rodeh's btrees are different: first, he got rid of the links between leaves of the tree - which also "throws out a lot of the existing b-tree literature", as he says in his slides [PDF] - but keeps enough btree traits to be useful. (This is a fairly standard form of btrees in file systems, sometimes called "B+trees".) He added some algorithms for traversing the btree that take advantage of reference counts to limit the amount of the tree that has to be traversed when deleting a snapshot, as well as a few other things, like proactive split and merge of interior nodes so that inserts and deletes don't require any backtracking. The result is a simple, robust, generic data structure which very efficiently tracks extents (groups of contiguous data blocks) in a COW file system. Rodeh successfully prototyped the system some years ago, but he's done with that area of research and just wants someone to take his COW-friendly btrees and put them to good use.

btrfs: The beginning

Chris Mason took these COW-friendly btrees and ran with them. Back in the day, Chris worked on Reiserfs, where he learned a lot about what to do and what not to do in a file system. Reiserfs had some cool features - small file packing, btrees for fast lookup, flexible layout - but the implementation tended to be haphazard and ad hoc. Code paths proliferated wildly, and along with them potential bugs.

Chris had an insight: What if everything in the file system - inodes, file data, directory entries, bitmaps, the works - was an item in a copy-on-write btree? All reads and writes to storage would go through the same code path, one that packed the items into btree nodes and leaves without knowing or caring about the item type. Then you only have to write the code once and you get checksums, reference counting (for snapshots), compression, fragmentation, etc., for anything in the file system.

Chris came up with the following basic structure for btrfs ("btrfs" comes from "btree file system"). Btrfs consists of three types of on-disk structures: block headers, keys, and items, currently defined as follows:

struct btrfs_header {
    u8 csum[32];
    u8 fsid[16];
    __le64 blocknr;
    __le64 flags;

    u8 chunk_tree_uid[16];
    __le64 generation;
    __le64 owner;
    __le32 nritems;
    u8 level;
}

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
}

struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;
    __le32 size;
}

Inside the btree (that is, the "branches" of the tree, as opposed to the leaves at the bottom of the tree), nodes consist only of keys and block headers. The keys tell you where to go looking for the item you want, and the block headers tell you where the next node or leaf in the btree is located on disk.

[Btrfs block structure] The leaves of the btree contain items, which are a combination of keys and data. Similarly to reiserfs, the items and data are packed in extremely space-efficient way: the item headers (that is, the item structure described above) are packed together starting at the beginning of the block, and the data associated with each item is packed together starting at the end of the block. So item headers and data grow towards each other, as shown in the diagram to the right.

Besides being code efficient, this scheme is space and time efficient as well. Normally, file systems put only one kind of data - bitmaps, or inodes, or directory entries - in any given file system block. This wastes disk space, since unused space in one kind of block can't be used for any other purpose, and it wastes time, since getting to one particular piece of file data requires reading several different kinds of metadata, all located in different blocks in the file system. In btrfs, items are packed together (or pushed out to leaves) in arrangements that optimize both access time and disk space. You can see the difference in these (very schematic, very simplified) diagrams. Old-school filesystems tend to organize data like this:

Btrfs, instead, creates a disk layout which looks more like:

In both diagrams, red blocks denote wasted disk space and red arrows denote seeks.

Each kind of metadata and data in the file system - a directory entry, an inode, an extended attribute, file data itself - is stored as a particular type of item. If we go back to the definition of an item, we see that its first element is a key:

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
}

Let's start with the objectid field. Each object in the file system - generally an inode - has a unique objectid. This is fairly standard practice - it's the equivalent of inode numbers. What makes btrfs interesting is that the objectid makes up the most significant bits of the item key - what we use to look up an item in the btree - and the lower bits are different kinds of items related to that objectid. This results in grouping together all the information associated with a particular objectid. If you allocate adjacent objectids, then all the items from those objectids are also allocated close together. The <objectid, type> pair automatically groups related data close to each other regardless of the actual content of the data, as opposed to the classical file system approach, which writes separate optimized allocators for each kind of file system data.

The type field tells you what kind of data is stored in the item. Is it the inode? Is it a directory entry? Is it an extent telling you where the file data is on disk? Is it the file data itself? With the combination of objectid and the type, you can look up any file system data you need in the btree.

We should take a quick look at the structure of the btree nodes and leaves themselves. Each node and leaf is an extent in the btree - nodes are extents full of <key, block header> pairs, and leaves contain items. Large file data is stored outside of the btree leaves, with the item describing the extent kept in the leaf itself. (What constitutes a "large" file is tunable based on the workload.) Each extent describing part of the btree has a checksum and a reference count, which permits writable snapshots. Each extent also includes an explicit back reference to each of the extents that refer to it.

Back references give btrfs a major advantage over every other file system in its class because now we can quickly and efficiently migrate data, incrementally check and repair the file system, and check the correctness of reference counts during normal operation. The proof is that btrfs already supports fast, efficient device removal and shrinking of the available storage for a file system. Many other file systems list "shrink file system" as a feature, but it usually ends up implemented inefficiently and slowly and several years late - or not at all. For example, ext3/4 can shrink a file system - by traversing the entire file system searching for data located in the area of the device being removed. It's a slow, fraught, bug-prone process. ZFS still can't shrink a file system.

The result is beautifully generic and elegant: Everything on disk is a btree containing reference counted, checksummed extents of items, organized by <objectid, type> keys. A great deal of the btrfs code doesn't care at all what is stored in the items, it just knows how to add or remove them from the btree. Optimizing disk layout is simple: allocate things with similar keys close together.

btrfs: The politics

At the same time that Chris was figuring out the technical design of btrfs, he was also figuring out how to fund the development of btrfs in both the short and the long term. Chris had recently moved from SUSE to a special Linux group at Oracle, one that employs several high-level Linux storage developers, including Martin K. Petersen, Zach Brown, and Jens Axboe. Oracle funds a lot of Linux development, some of it obviously connected to the Oracle database (OCFS2, DIF/DIX), and some of it less so (generic block layer work, syslets). Here's how Chris put it in a recent interview with Amanda McPherson from the Linux Foundation:

Amanda: Why did you start this project? Why is Oracle supporting this project so prominently?

Chris: I started Btrfs soon after joining Oracle. I had a unique opportunity to take a detailed look at the features missing from Linux, and felt that Btrfs was the best way to solve them.

Linux is a very important platform for Oracle. We use it heavily for our internal operations, and it has a broad customer base for us. We want to keep Linux strong as a data center operating system, and innovating in storage is a natural way for Oracle to contribute.

In other words, Oracle likes having Linux as a platform, and is willing to invest development effort in it even if it's not directly related to Oracle database performance. Look at it this way: how many operating systems are written and funded in large part by your competitors? While it is tempting to have an operating system entirely under your control - like Solaris - it also means that you have to pay for most of the development on that platform. In the end, Oracle believes it is in its own interest to use its in-house expertise to help keep Linux strong.

After a few months of hacking and design discussions with Zach Brown and many others, Chris posted btrfs for review. From there on out, you can trace the history of btrfs like any other open source project through the mailing lists and source code history. Btrfs is now in the mainline kernel and developers from Red Hat, SUSE, Intel, IBM, HP, Fujitsu, etc. are all working on it. Btrfs is a true open source project - not just in the license, but also in the community.

btrfs: A brief comparison with ZFS

People often ask about the relationship between btrfs and ZFS. From one point of view, the two file systems are very similar: they are copy-on-write checksummed file systems with multi-device support and writable snapshots. From other points of view, they are wildly different: file system architecture, development model, maturity, license, and host operating system, among other things. Rather than answer individual questions, I'll give a short history of ZFS development and compare and contrast btrfs and ZFS on a few key items.

When ZFS first got started, the outlook for file systems in Solaris was rather dim as well. Logging UFS was already nearing the end of its rope in terms of file system size and performance. UFS was so far behind that many Solaris customers paid substantial sums of money to Veritas to run VxFS instead. Solaris needed a new file system, and it needed it soon.

Jeff Bonwick decided to solve the problem and started the ZFS project inside Sun. His organizing metaphor was that of the virtual memory subsystem - why can't disk be as easy to administer and use as memory? The central on-disk data structure was the slab - a chunk of disk divided up into the same size blocks, like that in the SLAB kernel memory allocator, which he also created. Instead of extents, ZFS would use one block pointer per block, but each object would use a different block size - e.g., 512 bytes, or 128KB - depending on the size of the object. Block addresses would be translated through a virtual-memory-like mechanism, so that blocks could be relocated without the knowledge of upper layers. All file system data and metadata would be kept in objects. And all changes to the file system would be described in terms of changes to objects, which would be written in a copy-on-write fashion.

In summary, btrfs organizes everything on disk into a btree of extents containing items and data. ZFS organizes everything on disk into a tree of block pointers, with different block sizes depending on the object size. btrfs checksums and reference-counts extents, ZFS checksums and reference-counts variable-sized blocks. Both file systems write out changes to disk using copy-on-write - extents or blocks in use are never overwritten in place, they are always copied somewhere else first.

So, while the feature list of the two file systems looks quite similar, the implementations are completely different. It's a bit like convergent evolution between marsupials and placental mammals - a marsupial mouse and a placental mouse look nearly identical on the outside, but their internal implementations are quite a bit different!

In my opinion, the basic architecture of btrfs is more suitable to storage than that of ZFS. One of the major problems with the ZFS approach - "slabs" of blocks of a particular size - is fragmentation. Each object can contain blocks of only one size, and each slab can only contain blocks of one size. You can easily end up with, for example, a file of 64K blocks that needs to grow one more block, but no 64K blocks are available, even if the file system is full off nearly empty slabs of 512 byte blocks, 4K blocks, 128K blocks, etc. To solve this problem, we (the ZFS developers) invented ways to create big blocks out of little blocks ("gang blocks") and other unpleasant workarounds. In our defense, at the time btrees and extents seemed fundamentally incompatible with copy-on-write, and the virtual memory metaphor served us well in many other respects.

In contrast, the items-in-a-btree approach is extremely space efficient and flexible. Defragmentation is an ongoing process - repacking the items efficiently is part of the normal code path preparing extents to be written to disk. Doing checksums, reference counting, and other assorted metadata busy-work on a per-extent basis reduces overhead and makes new features (such as fast reverse mapping from an extent to everything that references it) possible.

Now for some personal predictions (based purely on public information - I don't have any insider knowledge). Btrfs will be the default file system on Linux within two years. Btrfs as a project won't (and can't, at this point) be canceled by Oracle. If all the intellectual property issues are worked out (a big if), ZFS will be ported to Linux, but it will have less than a few percent of the installed base of btrfs. Check back in two years and see if I got any of these predictions right!

Btrfs: What's next?

Btrfs is heading for 1.0, a little more than 2 years since the first announcement. This is much faster than many file systems veterans - including myself - expected, especially given that during most of that time, btrfs had only one full-time developer. Btrfs is not ready for production use - that is, storing and serving data you would be upset about losing - but it is ready for widespread testing - e.g., on your backed-up-nightly laptop, or your experimental netbook that you reinstall every few weeks anyway.

Be aware that there was a recent flag day in the btrfs on-disk format: A commit shortly after the 2.6.30 release changed the on disk format in a way that isn't compatible with older kernels. If you create your btrfs file system using the old, 2.6.30 or earlier kernel and tools, and boot into a newer kernel with the new format, you won't be able to use your file system with a 2.6.30 or older kernel any longer. Linus Torvalds found this out the hard way. But if this does happen to you, don't panic - you can find rescue images and other helpful information on the the btrfs wiki.

Comments (53 posted)

A kernel.org update

By Jonathan Corbet
July 22, 2009

Your editor made a brief visit to the 2009 Linux Symposium, held in Montreal for the first time. One of the talks which could be seen during that short time was an update on kernel.org, presented by John Hawley. It was an interesting look into a bit of infrastructure that many of us rely upon, but which we tend to take for granted.

The "state of the server address" started off with the traditional display of bizarre email sent to kernel.org. Suffice to say, the kernel.org administrators get a lot of strange mail. They also have no qualms about displaying that mail (lightly sanitized) for amusement value.

The board of kernel.org is currently made up of five people: H. Peter Anvin, Jeff Uphoff, Chris Wright, Kees Cook, and Linus Torvalds. Linus, it is said, never attends the board meetings; John assumes that he's busy doing something related to the kernel. Peter continues to serve as the president of the organization, doing the work required to keep it as a nonprofit corporation in good standing. Much of the rest of the work is done by John, who was hired in September, 2008, to be the first full-time system administrator for kernel.org. He is employed by the Linux Foundation to do this job.

Over the last year, kernel.org has handled the mirroring a of a number of major distribution releases. They have added two new distributions (Gentoo and Moblin) to the mirror network, and Slackware is being added into the mix now. A number of new wiki instances have been added to wiki.kernel.org. John says that wikis are easy to create; he encourages relevant projects to ask for a kernel.org wiki if it would be helpful.

Internally, kernel.org runs on ten "disgustingly nice" machines donated by HP. John was strong in his praise of HP and ISC (which provides the bulk of the considerable bandwidth used by kernel.org); without them, kernel.org would not function the way it does. Beyond ISC, there are a couple of machines hosted at the OSU open source lab and one at Umeå University in Sweden. A lengthy process has finally gotten all of these machines upgraded to Fedora 9 - just in time, John noted wryly, for Fedora to end support for that distribution. So another round of upgrades in in the works for the near future.

Another significant change over the last year is the adoption of GeoDNS for the kernel.org domains. GeoDNS enables the DNS server to take the location of the requesting system into account and return the addresses of an appropriate set of servers. So kernel.org users now use local kernel.org mirrors, even if they do not explicitly ask for one using a country-specific host name.

One upcoming initiative is archive.kernel.org. This site is intended to be a permanent archive for older distribution updates. Should somebody find the urge to, say, install Red Hat Linux 5 on a system, it can be satisfied by a visit to archive.kernel.org. Filling in the archive is a work in progress; a number of older distribution releases seem to have fallen off the net. But, experience shows, many of the older releases will be located over time.

Another work in progress is "boot.kernel.org". This site is intended to be a repository of network-bootable distributions. The distributor can create a tiny boot image which does little more than setting up the network and downloading the next stage from boot.kernel.org. The idea here is that it will become easy to boot rescue or live CD distributions from the net. Distributions which support network installation can also be hosted on boot.kernel.org. This feature should be ready for a public launch sometime in the near future.

John closed with more amusing email. But, silliness aside, it seems clear that kernel.org is on a solid foundation. It is supporting our community areas going well beyond the kernel itself, and it looks well set to continue doing so for some time.

Comments (21 posted)

Patches and updates

Kernel trees

Greg KH Linux 2.6.30.2 ?

Greg KH Linux 2.6.27.27 ?

Willy Tarreau Linux 2.4.37.3 ?

Architecture-specific

Gurudatt, Sreenidhi B x86: IPC driver patch for Intel Moorestown platform Jul 20

Core kernel code

john stultz Introduce CLOCK_REALTIME_COARSE ?

Arjan van de Ven sched: Provide iowait counters Jul 20

Dave Hansen flexible array implementation ?

Dave Hansen flexible array implementation v3 ?

Oren Laadan Kernel based checkpoint/restart ?

Dan Smith Add Checkpoint/Restart support for UNIX sockets ?

Development tools

Masami Hiramatsu [PATCH -tip -v12 00/11] tracing: kprobe-based event tracer and x86 instruction decoder ?

Xiao Guangrong ftrace: add tracepoint for timer event ?

Christoph Hellwig xfs: event tracing support ?

Steven Rostedt trace-cmd - command line reader for ftrace ?

Mathieu Desnoyers LTTng 0.149 adds experimental ascii output module ?

Frederic Weisbecker hw-breakpoints: Make the API generic + support for perfcounters ?

Device drivers

Michael S. Tsirkin uio: add generic driver for PCI 2.3 devices ?

Matthew Garrett scsi: Allow hosts to be flagged as hotpluggable ?

Marek Vasut SPI driver for LMS283GF05 LCD Jul 16

Tiago Vignatti VGA arbiter implementation (v2) ?

Pavel Machek HTC Dream: touchscreen driver for staging ?

Wan ZongShun Add mac driver for w90p910 ?

Ben Dooks net: Micrel KS8851 SPI network driver ?

Greg Kroah-Hartman [Announce] Microsoft Hyper-V drivers for Linux ?

Mike Rapoport add SBC-FITPC2 watchdog driver ?

Rafael J. Wysocki PM: Introduce core framework for run-time PM of I/O devices (rev. 10) ?

Mark Brown regulator: Add regulator_get_exclusive() API ?

Rémi Denis-Courmont USB host CDC Phonet network interface driver ?

Filesystems and block I/O

Sage Weil ceph: Ceph distributed file system client v0.10 ?

Ryo Tsuruta I/O bandwidth controller and BIO tracking ?

tridge@samba.org FAT: Add FAT_NO_83NAME flag ?

Memory management

KAMEZAWA Hiroyuki ZERO PAGE again v4. ?

Izik Eidus ksm resend ?

Tejun Heo implement and use sparse embedding first chunk allocator ?

Roland Dreier ummunot: Userspace support for MMU notifications ?

Networking

Steffen Klassert [PATCH 0/7] IPsec: convert to ahash ?

Security-related

Siarhei Liakh x86: NX protection for kernel data ?

Eric Paris [PATCH -v2 1/2] VM/SELinux: require CAP_SYS_RAWIO for all mmap_zero operations ?

Virtualization and containers

Gregory Haskins [KVM PATCH] KVM: introduce "xinterface" API for external interaction with guests ?

Paul Menage CGroup: Support for named and empty hierarchies ?

Miscellaneous

Karel Zak util-linux-ng v2.16 ?

Pablo Neira Ayuso libnetfilter_conntrack 0.0.100 release ?

Pablo Neira Ayuso conntrack-tools 0.9.13 released ?

Page editor: Jonathan Corbet
Next page: Distributions>>