Kernel development

Brief items

Kernel release status

The current stable 2.6 release is, released on April 7. It contains several fixes, including the BIC collision window fix discussed on last week's Kernel Page.

The current 2.6 prepatch remains 2.6.12-rc2. Kernel development has slowed significantly while the source code management issues are being worked out - see below.

The current -mm tree is 2.6.12-rc2-mm3. Recent changes to -mm include a big x86-64 update, an NFSv4 update, some scheduler tweaks, the removal of the last user of the deprecated inter_module functions, and lots of fixes.

The current 2.4 kernel remains 2.4.30; no 2.4.31 prepatches have been released.

Kernel development news

Quotes of the week

My second plan is to make somebody else so fired up about the problem that I can just sit back and take patches. That's really what I'm best at. Sitting here, in the (rain) on the patio, drinking a foofy tropical drink, and pressing the "apply" button. Then I take all the credit for my incredible work.
-- Linus Torvalds

There are a number of very good Linux kernel developers, but they tend to get outshouted by a large crowd of arrogant fools. Trying to communicate user requirements to these people is a waste of time. They are much too "intelligent" to listen to lesser mortals.
-- Jack O'Quin

The guts of git

Now that BitKeeper is no more, how will the kernel development process function? In the short term, the answer is "painfully." The rest of the 2.6.12 process looks like the good old days: patches emailed to Linus, who will apply them (hopefully) and occasionally release a snapshot tree. That mode might work for the short term, since only bug fixes should be merged before 2.6.12 comes out, but nobody wants to try to run the process that way for any period of time. The kernel team needs much better patch and workflow support if it is going to sustain a reasonable development pace. So a replacement for BitKeeper will have to come from somewhere.

For a while, the leading contender appeared to be monotone, which supports the distributed development model used with the kernel. There are some issues with monotone, however, with performance being at the top of the list: monotone simply does not scale to a project as large as the kernel. So Linus has, in classic form, gone off to create something of his own. The first version of the tool called "git" was announced on April 7. Since then, the tool has progressed rapidly. It is, however, a little difficult to understand from the documentation which is available at this point. Here's an attempt to clarify things.

Git is not a source code management (SCM) system. It is, instead, a set of low-level utilities (Linus compares it to a special-purpose filesystem) which can be used to construct an SCM system. Much of the higher-level work is yet to be done, so the interface that most developers will work with remains unclear.

At the lower levels, Git implements two data structures: an object database, and a directory cache. The object database can contain three types of objects:

  • Blobs are simply chunks of binary data - they are the contents of files. One blob exists in the object database for every revision of every file that git knows about. There is no direct connection between a blob and the name (or location) of the file which contains that blob. If a file is renamed, its blob in the object database remains unchanged.

  • Trees are collections of blobs, along with their file names and permissions. A tree object describes the state of a directory hierarchy at a particular time.

  • Commits (or "changesets") mark points in the history of a tree; they contain a log message, a tree object, and pointers to one or more "parent" commits (the first commit will have no parent).

The object database relies heavily on SHA hashes to function. When an object is to be added to the database, it is hashed, and the resulting checksum (in its ASCII representation) is used as its name in the database (almost - the first two characters of that ASCII name are used to spread the files across a set of directories for efficiency). Some developers have expressed concerns about hash collisions, but that possibility does not seem to worry the majority. The object itself is compressed before being checksummed and stored.
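As a rough illustration of this naming scheme, the sketch below turns the ASCII form of a checksum into a path within the object database. The helper name object_path(), the 40-character example checksum, and the error conventions are assumptions made for the example; only the two-character directory split and the default .dircache/objects location come from the description in this article.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of git): build the on-disk path for an
 * object whose name is the ASCII (hex) form of its checksum.  The first
 * two characters select a subdirectory; the remainder is the file name.
 * Returns 0 on success, -1 if the name is too short or the buffer is
 * too small to hold the result.
 */
int object_path(const char *db_root, const char *hex, char *out, size_t outlen)
{
    int n;

    assert(db_root && hex && out);

    /* Need two characters for the directory and at least one for the file */
    if (strlen(hex) < 3)
        return -1;

    n = snprintf(out, outlen, "%s/%.2s/%s", db_root, hex, hex + 2);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

Calling object_path(".dircache/objects", name, buf, sizeof(buf)) with a checksum that begins "01" would yield a path under .dircache/objects/01/, so no single directory has to hold every object.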

It's worth repeating that git stores every revision of an object separately in the database, addressed by the SHA checksum of its contents. There is no obvious connection between two versions of a file; that connection is made by following the commit objects and looking at what objects were contained in the relevant trees. Git might thus be expected to consume a fair amount of disk space; unlike many source code management systems, it stores whole files, rather than the differences between revisions. It is, however, quite fast, and disk space is considered to be cheap.

The directory cache is a single, binary file containing a tree object; it captures the state of the directory tree at a given time. The state as seen by the cache might not match the actual directory's contents; it could differ as a result of local changes, or of a "pull" of a repository from elsewhere.

If a developer wishes to create a repository from scratch, the first step is to run init-db in the top level of the source tree. People running PostgreSQL want to be sure not to omit the hyphen, or they may not get the results they were hoping for. init-db will create the directory cache file (.dircache/index); it will also, by default, create the object database in .dircache/objects. It is possible for the object database to be elsewhere, however, and possibly shared among users. The object database will initially be empty.

Source files can be added with the update-cache program. update-cache --add will add blobs to the object database for new files and create new blobs (leaving the old ones in place) for any files which have changed. This command will also update the directory cache with entries associating the current files' blobs with their current names, locations, and permissions.

What update-cache will not do is capture the state of the tree in any permanent way. That task is done by write-tree, which will generate a new tree object from the current directory cache and enter that object into the database. write-tree writes the SHA checksum associated with the new tree object to its standard output; the user is well-advised to capture that checksum, or the newly-created tree will be hard to access in the future.

The usual thing to do with a new tree object will be to bind it into a commit object; that is done with the commit-tree command. commit-tree takes a tree ID (the output from write-tree) and a set of parent commits, combines them with the changelog entry, and stores the whole thing as a commit object. That object, in essence, becomes the head of the current version of the source tree. Since each commit points to its parents, the entire commit history of the tree can be traversed by starting at the head. Just don't lose the SHA checksum for the last commit. Since each commit contains a tree object, the state of the source tree at commit time can be reconstructed at any point.
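The history traversal described above can be pictured with a small sketch. The struct commit layout, the MAX_PARENTS limit, and the walk_history() helper are all hypothetical; real commit objects live in the object database and refer to their parents by checksum rather than by pointer. The sketch follows only the first parent of each commit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 2   /* arbitrary limit chosen for this sketch */

/* Hypothetical in-memory picture of a commit object */
struct commit {
    const char *tree_id;                        /* checksum of the tree object */
    const char *message;                        /* changelog entry */
    const struct commit *parents[MAX_PARENTS];  /* parent commits */
    int nr_parents;                             /* 0 for the first commit */
};

/*
 * Walk back from the head along first parents, calling visit() (if it is
 * not NULL) on each commit.  Returns the number of commits visited.
 */
int walk_history(const struct commit *head,
                 void (*visit)(const struct commit *c))
{
    const struct commit *c = head;
    int count = 0;

    while (c != NULL) {
        assert(c->nr_parents >= 0 && c->nr_parents <= MAX_PARENTS);
        if (visit)
            visit(c);
        count++;
        /* The first commit has no parent, which ends the walk */
        c = c->nr_parents > 0 ? c->parents[0] : NULL;
    }
    return count;
}
```

Because every step starts from a commit and moves to its parent, the only thing that must be remembered from the outside is the head; everything older is reachable from there.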

The directory cache can be set to a given version of the tree by using read-tree; this operation reads a tree object from the object database and stores it in the directory cache, but does not actually change any files outside of the cache. From there, checkout-cache can be used to make the actual source tree look like the cached tree object. The show-diff tool prints the differences between the directory cache and what's actually in the directory tree currently. There is also a diff-tree tool which can generate the differences between any two trees.

An early example of what can be done with these tools can be had by playing with the git-pasky distribution by Petr Baudis. Petr has layered a set of scripts over the git tools to create something resembling a source management system. The git-pasky distribution itself is available as a network repository; running "git pull" will update to the current version.

A "pull" operation, as implemented in git-pasky, performs these steps:

  • The current "head" commit for the local repository is found; git-pasky keeps the SHA checksum for the current commit in .dircache/HEAD.

  • The current head is obtained from the remote repository (using rsync) and compared with the local head. If the two are the same, no changes have been made and the job is done.

  • The remote object database is downloaded, again with rsync. This operation will add any new objects to the database.

  • Using diff-tree, a patch from the previous (local) version to the current (remote) version is generated. That patch is then applied to the current directory's contents. The patch technique is used to help preserve, if possible, any local changes to the files.

  • A call to read-tree updates the directory cache to match the current revision as obtained from the remote repository.

Petr's version of git adds a number of other features as well. It is a far cry from a full-blown source code management system, since it lacks little details like release tagging, merging, graphical interfaces, etc. A basic structure is beginning to emerge, however.

When this work was begun, it was seen as a sort of insurance policy to be used until a real source management system could be found. There is a good chance, however, that git will evolve into something with staying power. It provides the needed low-level functionality in a reasonably simple way, and it is blindingly fast. Linus places a premium on speed:

If it takes half a minute to apply a patch and remember the changeset boundary etc (and quite frankly, that's _fast_ for most SCM's around for a project the size of Linux), then a series of 250 emails (which is not unheard of at all when I sync with Andrew, for example) takes two hours.

As if on cue, Andrew announced a set of 198 patches to be merged for 2.6.12:

This is the first live test of Linus's git-importing ability. I'm about to disappear for 1.5 weeks - hope we'll still have a kernel left when I get back.

If this test (and the ones that come after) goes well, and the resulting system evolves to where it meets Linus's needs, he may be unlikely to switch to yet another system in the future. So git is worth watching; it could develop into a powerful system in a hurry.

Some git updates

Since LWN's look at git was published, development has continued at a rapid pace. A number of features and capabilities have been added to the system. Look for an updated article at some future point when things stabilize somewhat.

A mailing list has been set up to take discussion of git off linux-kernel. The list is called "git," and it is hosted on; sending a message containing "subscribe git" to will get you onto the list. As of this writing, the traffic is not small.

A couple of quotes from that list which didn't quite make the "quotes of the week":

Trust me, not needing locking is a huge boon. I don't think people realize just how much thought I've put into my database selection and what the implications are.

It's perfect, I tell you.

-- Linus Torvalds

Sooner or later we'll find a flaw in it. Really! I mean, you've started this OS thing 10+ years ago and we are still busy fixing it! ;)
-- Ingo Molnar

Linus has an experimental kernel repository on, and has committed Andrew Morton's initial 200-patch bomb to it. It's in:


for those who are interested. Commits to this repository are also being broadcast to the same "commits" list that tracked the BitKeeper repository. Here's an example patch for those interested in what a git commit looks like, or in the ioread/iowrite API change that your editor has not yet managed to cover on this page.

Extending netlink

The netlink mechanism implements a special sort of datagram socket for communication between the kernel and user space. Most of the users of netlink are currently in the networking subsystem itself - netlink protocols exist, for example, for the management of routing table entries and firewall rules. Netlink is also used by SELinux and the kernel event notification mechanism.

Use of netlink is relatively straightforward - for kernel developers who have some familiarity with the networking subsystem. To be able to communicate via netlink, a kernel subsystem must first create an in-kernel socket:

    struct sock *netlink_kernel_create(int unit, 
                         void (*input)(struct sock *sk, int len));

Here, unit is the netlink protocol number (as defined in <linux/netlink.h>), and input() is a function to be called when data arrives on the given socket. The naming of unit dates back to an early netlink implementation, which worked with virtual devices; unit was the minor number of the relevant device. The input() callback can be NULL, in which case user space will not be able to write to the socket.

If there is an input() callback, it will be called whenever data arrives. That data will be represented in one or more sk_buff structures (SKBs) queued to the socket itself. So the core of a typical input() function will look something like:

    struct sk_buff *skb;

    while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
        /* Process the message carried in skb, then release it */
        kfree_skb(skb);
    }

Sending data to user space involves allocating an SKB, filling it with the data, and writing it to the netlink socket. Here is how the kernel events mechanism does it:

    static int send_uevent(const char *signal, const char *obj,
                           char **envp, int gfp_mask)
    {
        struct sk_buff *skb;
        char *pos;
        int len;

        len = strlen(signal) + 1;
        len += strlen(obj) + 1;

        /* allocate buffer with the maximum possible message size */
        skb = alloc_skb(len + BUFFER_SIZE, gfp_mask);
        pos = skb_put(skb, len);
        sprintf(pos, "%s@%s", signal, obj);

        /* copy the environment key by key to our continuous buffer */
        if (envp) {
            int i;

            for (i = 2; envp[i]; i++) {
                len = strlen(envp[i]) + 1;
                pos = skb_put(skb, len);
                strcpy(pos, envp[i]);
            }
        }

        return netlink_broadcast(uevent_sock, skb, 0, 1, gfp_mask);
    }

(Some error handling has been removed for brevity; see lib/kernel_uevent.c for the full version). The call to netlink_broadcast() sends the data in the SKB to every user-space process which is currently connected to the netlink socket. There is also netlink_unicast(), which takes a process ID and sends only to that process. Netlink writes can be restricted to specific "groups," allowing user-space processes to sign up for an interesting subset of the data written to a given socket.

There is more to the netlink interface than has been presented here; see <linux/netlink.h> for the rest.

Evgeniy Polyakov thinks that the netlink protocol is too complicated; it should not be necessary to understand the networking layer just to communicate with user space. His response is connector, a layer on top of netlink which is designed to make things simpler.

The connector code multiplexes all possible message types over a single netlink socket number. Individual messages are distinguished by way of a cb_id structure:

    struct cb_id {
        __u32 idx;
        __u32 val;
    };

idx can be thought of as a protocol type, and val as a message type within the given protocol. A kernel subsystem which is prepared to receive messages of a given type sets up a callback with:

    int cn_add_callback(struct cb_id *id, char *name,
                        void (*callback)(void *msg));

That callback will be invoked every time a message with the given id is received from user space. The msg parameter to the callback function, despite its void * type, is always a pointer to a structure of this type:

    struct cn_msg {
        struct cb_id    id;
        __u32           len;    /* Length of the following data */
        __u8            data[0];
        /* Some fields omitted */
    };

The callback can process the given message data and return.
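To make the multiplexing idea concrete, here is a user-space sketch of a callback table keyed by cb_id. It is not the connector implementation: demo_add_callback(), demo_dispatch(), demo_callback(), the fixed-size table, and the use of <stdint.h> types in place of __u32 and __u8 are all assumptions made for illustration, and the structures are cut down to the fields shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cb_id {
    uint32_t idx;       /* protocol type */
    uint32_t val;       /* message type within the protocol */
};

struct cn_msg {
    struct cb_id id;
    uint32_t len;       /* length of the following data */
    uint8_t data[];     /* payload follows the header */
};

#define MAX_CALLBACKS 16    /* arbitrary table size for this sketch */

struct cb_entry {
    struct cb_id id;
    void (*callback)(void *msg);
};

static struct cb_entry cb_table[MAX_CALLBACKS];
static int cb_count;

/* Hypothetical stand-in for cn_add_callback(): 0 on success,
 * -1 if the id is already registered or the table is full. */
int demo_add_callback(const struct cb_id *id, void (*callback)(void *msg))
{
    int i;

    assert(id && callback);
    for (i = 0; i < cb_count; i++)
        if (cb_table[i].id.idx == id->idx && cb_table[i].id.val == id->val)
            return -1;
    if (cb_count == MAX_CALLBACKS)
        return -1;
    cb_table[cb_count].id = *id;
    cb_table[cb_count].callback = callback;
    cb_count++;
    return 0;
}

/* Hand msg to the callback registered for its id.
 * Returns 1 if a callback was found, 0 otherwise. */
int demo_dispatch(struct cn_msg *msg)
{
    int i;

    for (i = 0; i < cb_count; i++) {
        if (cb_table[i].id.idx == msg->id.idx &&
            cb_table[i].id.val == msg->id.val) {
            cb_table[i].callback(msg);
            return 1;
        }
    }
    return 0;
}

/* Sample callback: the void pointer is really a struct cn_msg */
uint32_t demo_last_len;

void demo_callback(void *msg)
{
    struct cn_msg *m = msg;

    demo_last_len = m->len;
}
```

The point of the sketch is only that a single entry point can fan messages out to many subsystems by looking at the (idx, val) pair, which is what lets connector share one netlink socket number among all of its users.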

Writing to a socket via connector is done with:

    void cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask);

The msg contains the cb_id structure describing the message; __groups can be used to restrict the list of recipients, and gfp_mask controls how memory allocation is done. This call can fail (netlink is an unreliable service), but it returns no indication of whether it succeeded or not.

For kernel code which needs to send significant amounts of data to user space, perhaps from hot paths, there is also a "CBUS" layer over the connector. That layer exports one function:

    int cbus_insert(struct cn_msg *msg, int gfp_flags);

This function does not send the message immediately; it simply adds it to a per-CPU queue. A separate worker thread will eventually come along, find the message, and send it on to user space.
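The deferred-send pattern can be sketched in user space as follows. This is a simplification under stated assumptions: a single fixed-size queue stands in for the per-CPU queues, the worker thread is reduced to a drain function that the caller invokes, and the names demo_queue, demo_insert(), and demo_drain() are hypothetical rather than part of the CBUS code.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SLOTS 8   /* arbitrary capacity for this sketch */

struct demo_queue {
    const void *slots[QUEUE_SLOTS];
    size_t head;        /* index of the oldest queued message */
    size_t count;       /* number of messages waiting */
};

/* Fast path: remember the message and return at once.
 * Returns 0 on success, -1 if the queue is full. */
int demo_insert(struct demo_queue *q, const void *msg)
{
    assert(q && msg);
    if (q->count == QUEUE_SLOTS)
        return -1;
    q->slots[(q->head + q->count) % QUEUE_SLOTS] = msg;
    q->count++;
    return 0;
}

/* Worker side: pass every queued message to send() (if it is not NULL),
 * oldest first, and return how many messages were removed. */
size_t demo_drain(struct demo_queue *q, void (*send)(const void *msg))
{
    size_t sent = 0;

    assert(q);
    while (q->count > 0) {
        if (send)
            send(q->slots[q->head]);
        q->head = (q->head + 1) % QUEUE_SLOTS;
        q->count--;
        sent++;
    }
    return sent;
}
```

The caller on the hot path pays only for the insertion; the cost of actually delivering the message is moved to whoever drains the queue later.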

The code seems to work, though some concerns have been raised about the implementation. Not everybody feels that the connector solution is necessary, however. The core netlink API is not all that hard to use, so it is not clear that another layer needs to be wrapped around it. Those who do think that netlink could be made easier do not agree on how it should be done; some developers would like to see the netlink API itself changed rather than having another layer put on top of it. Various user-space needs (auditing, accounting, desktop functionality, etc.) are all creating pressure for more communication channels with the kernel. Some way of making that communication easier on the kernel side may well get added, eventually, but it is far from clear what form that code will take.

FUSE hits a snag

The Filesystem in Userspace project (FUSE - covered here in January, 2004) provides a kernel interface and library which make it easy to implement filesystems with a user-space process. With FUSE, a user can mount a filesystem contained within a tar archive, implemented via an FTP session, or "tunneled" from a remote system via ssh. It is a powerful tool with many users, and its authors have been pushing for inclusion into the mainline kernel for some time now. That merge has been delayed pending a review of the patch by a few interested developers.

That review has happened, and it has turned up a problem; it seems that FUSE, in some situations, implements some rather strange filesystem semantics.

Consider the case of a filesystem hosted in a tar archive. FUSE will present files within the archive with the owners and permission modes specified inside that archive. The owner and permissions of the files, in other words, do not necessarily have anything to do with the owner of the archive or the user who mounted it as a filesystem. To allow that user to actually work with files in the archive, the "tarfs" FUSE module disables ordinary permissions checking. A file may, according to a tool like ls, be owned by another user and inaccessible, but the user who mounted the filesystem has full access anyway. FUSE also ensures that no other user has any access to the mounted filesystem - not even root.

This twisting of filesystem semantics does not sit well with some kernel developers, who tend to think that Linux systems should behave like Linux. The FUSE semantics have the potential to confuse programs which think that the advertised file permissions actually mean something (though, evidently, that tends not to be a problem in real use) and it makes it impossible to mount a filesystem for use by more than one user. So these developers have asked that the FUSE semantics be removed, and that a FUSE filesystem behave more like the VFAT-style systems; the user mounting the filesystem should own the files, and reasonable permissions should be applied.

In fact, FUSE does provide an option ("allow_other") which causes it to behave in this way. But that approach goes against what FUSE is trying to provide, and raises some security issues of its own. FUSE hacker Miklos Szeredi sees the issue this way:

I want the tar filesystem to be analogous to running tar. When I run tar, other users are not notified of the output, it's only for me. If they want to run tar, they can too. The same can be true for tarfs. I mount it for my purpose, others can mount it for theirs. Since the daemon providing the filesystem always runs with the same capabilities as the user who did the mount, I and others will always get the permissions that we have on the actual tar file.

In this view, a FUSE filesystem is very much a single-user thing. In some cases, it really should be that way; consider a remote filesystem implemented via an ssh connection. The user mounting the filesystem presumably has the right to access the remote system, on the remote system's terms. The local FUSE filesystem should not be trying to figure out what the permissions on remote files should be. Other users on the local system - even the root user - may have no right to access the remote system, and should not be able to use the FUSE filesystem to do so.

It's not clear where this discussion will go. There are some clear reasons behind the behavior implemented by FUSE, and it may remain available, though, perhaps, not as a default, and possibly implemented in a different way. The little-used Linux namespace capability has been mentioned as a way of hiding single-user FUSE filesystems, though there may be some practical difficulties in making namespaces actually work with FUSE. Until the core filesystem hackers are happy, however, FUSE is likely to have a rough path into the mainline.

Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds