Brief items
The current stable 2.6 release is 2.6.11.7,
released on April 7. It contains several
fixes, including the BIC collision window fix discussed
on last week's Kernel Page.
The current 2.6 prepatch remains 2.6.12-rc2. Kernel development has slowed
significantly while the source code management issues are being worked out
- see below.
The current -mm tree is 2.6.12-rc2-mm3. Recent changes
to -mm include a big x86-64 update, an NFSv4 update, some scheduler tweaks,
the removal of the last user of the deprecated inter_module functions, and
lots of fixes.
The current 2.4 kernel remains 2.4.30; no 2.4.31 prepatches have
been released.
Kernel development news
My second plan is to make somebody else so fired up about the
problem that I can just sit back and take patches. That's really
what I'm best at. Sitting here, in the (rain) on the patio,
drinking a foofy tropical drink, and pressing the "apply"
button. Then I take all the credit for my incredible work.
-- Linus Torvalds
There are a number of very good Linux kernel developers, but they
tend to get outshouted by a large crowd of arrogant fools. Trying
to communicate user requirements to these people is a waste of
time. They are much too "intelligent" to listen to lesser mortals.
-- Jack O'Quin
Now that BitKeeper is no more, how will the kernel development process
function? In the short term, the answer is "painfully." The rest of the
2.6.12 process looks like the good old days: patches emailed to Linus, who
will apply them (hopefully) and occasionally release a snapshot tree. That
mode might work for the short term, since only bug fixes should be merged
before 2.6.12 comes out, but nobody wants to try to run the process that
way for any period of time. The kernel team needs much better patch and
workflow support if it is going to sustain a reasonable development pace.
So a replacement for BitKeeper will have to come from somewhere.
For a while, the leading contender appeared to be monotone, which supports the
distributed development model used with the kernel. There are some issues
with monotone, however, with performance being at the top of the list:
monotone simply does not scale to a project as large as the kernel. So
Linus has, in classic form, gone off to create something of his own. The
first version of the tool called "git" was announced on April 7. Since then, the
tool has progressed rapidly. It is, however, a little difficult to
understand from the documentation which is available at this point. Here's
an attempt to clarify things.
Git is not a source code management (SCM) system. It is, instead, a
set of low-level utilities (Linus compares it to a special-purpose
filesystem) which can be used to construct an SCM system. Much of the
higher-level work is yet to be done, so the interface that most developers
will work with remains unclear.
At the lower levels, git implements two data structures: an object
database and a directory cache. The object database can contain three
types of objects:
- Blobs are simply chunks of binary data - they are the contents
of files. One blob exists in the object database for every revision
of every file that git knows about. There is no direct connection
between a blob and the name (or location) of the file which contains
that blob. If a file is renamed, its blob in the object database
remains unchanged.
- Trees are a collection of blobs, along with their file names
and permissions. A tree object describes the state of a directory
hierarchy at a given time.
- Commits (or "changesets") mark points in the history of a tree;
they contain a log message, a tree object, and pointers to one or more
"parent" commits (the first commit will have no parent).
The object database relies heavily on SHA hashes to function. When an
object is to be added to the database, it is hashed, and the resulting
checksum (in its ASCII representation) is used as its name in the database
(almost - the first two characters of the checksum are used to spread the files
across a set of directories for efficiency). Some developers have
expressed concerns about hash collisions,
but that possibility does not seem to worry the majority. The object itself is
compressed before being checksummed and stored.
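The naming scheme can be illustrated with a small user-space sketch. This is not git's own code: the function name object_path() and the exact "directory/two characters/remainder" layout are assumptions made for illustration, and the checksum is taken as an already-computed, 40-character hexadecimal string rather than being calculated here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HEX_LEN 40  /* a 160-bit checksum written as hex digits */

/*
 * Build "<db_dir>/<first two hex digits>/<remaining hex digits>".
 * Returns 0 on success, -1 if the checksum is malformed or the
 * output buffer is too small.  The layout is an assumption made
 * for this sketch, not a description of git's real code.
 */
int object_path(const char *db_dir, const char *hex,
                char *out, size_t out_len)
{
    size_t i;
    int n;

    if (strlen(hex) != HEX_LEN)
        return -1;
    for (i = 0; i < HEX_LEN; i++) {
        if (!strchr("0123456789abcdef", hex[i]))
            return -1;
    }
    /* The first two characters pick the subdirectory. */
    n = snprintf(out, out_len, "%s/%.2s/%s", db_dir, hex, hex + 2);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    return 0;
}
```

Given a database directory of .dircache/objects, an object whose checksum begins with "01" would land in the .dircache/objects/01 subdirectory, which keeps any single directory from growing too large.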
It's worth repeating that git stores every revision of an object separately
in the database, addressed by the SHA checksum of its contents. There is
no obvious connection between two versions of a file; that connection is
made by following the commit objects and looking at what objects were
contained in the relevant trees. Git might thus be expected to consume a
fair amount of disk space; unlike many source code management systems, it
stores whole files, rather than the differences between revisions. It is,
however, quite fast, and disk space is considered to be cheap.
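The separation between file names and file contents can be sketched in a few lines of user-space C. Nothing here reflects git's actual on-disk format: the structure names (tree_entry, tree), the field sizes, and the helper functions are all invented for illustration. The point is only that a rename changes a tree entry while the checksum naming the blob stays the same.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry in a tree object. */
struct tree_entry {
    char name[64];       /* path within the tree */
    unsigned int mode;   /* permission bits */
    char blob_id[41];    /* ASCII checksum naming the blob */
};

/* Hypothetical stand-in for a (very small) tree object. */
struct tree {
    struct tree_entry entries[8];
    size_t nr;
};

/* Find an entry by name; returns NULL if it is not in the tree. */
struct tree_entry *tree_lookup(struct tree *t, const char *name)
{
    size_t i;

    for (i = 0; i < t->nr; i++)
        if (strcmp(t->entries[i].name, name) == 0)
            return &t->entries[i];
    return NULL;
}

/* Rename an entry: only the name changes, blob_id is left alone. */
int tree_rename(struct tree *t, const char *from, const char *to)
{
    struct tree_entry *e = tree_lookup(t, from);

    if (!e || strlen(to) >= sizeof(e->name) || tree_lookup(t, to))
        return -1;
    strcpy(e->name, to);
    return 0;
}
```

Finding the previous version of a file then becomes a matter of looking up the same name in the tree attached to an earlier commit and comparing the two blob checksums.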
The directory cache is a single, binary file containing a tree object; it
captures the state of the directory tree at a given time. The state as
seen by the cache might not match the actual directory's contents; it could
differ as a result of local changes, or of a "pull" of a repository from
elsewhere.
If a developer wishes to create a repository from scratch, the first step
is to run init-db in the top level of the source tree.
People running PostgreSQL want to be sure not to omit the hyphen, or they
may not get the results they were hoping for. init-db will create
the directory cache file (.dircache/index); it will also, by
default, create the object database in .dircache/objects. It is
possible for the object database to be elsewhere, however, and possibly
shared among users. The object database will initially be empty.
Source files can be added with the update-cache program.
update-cache --add will add blobs to the object database for new
files and create new blobs (leaving the old ones in place) for any files which have changed.
This command will also update the directory cache with entries associating
the current files' blobs with their current names, locations, and
permissions.
What update-cache will not do is capture the state of the
tree in any permanent way. That task is done by write-tree, which
will generate a new tree object from the current directory cache and enter
that object into the database. write-tree writes the SHA checksum
associated with the new tree object to its standard output; the user is
well-advised to capture that checksum, or the newly-created tree will be
hard to access in the future.
The usual thing to do with a new tree object will be to bind it into a
commit object; that is done with the commit-tree command.
commit-tree takes a tree ID (the output from
write-tree) and a set of parent commits,
combines them with the changelog entry, and stores the whole thing as a
commit object. That object, in essence, becomes the head of the current
version of the source tree. Since each commit points to its parents, the
entire commit history of the tree can be traversed by starting at the
head. Just don't lose the SHA
checksum for the last commit.
Since each commit contains a tree object, the state of the source tree
at commit time can be reconstructed at any point.
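The way history hangs off the head commit can be modeled in user-space C. The sketch below is an assumption-laden illustration, not git's implementation: struct commit, MAX_PARENTS, and walk_history() are hypothetical names, commits live in memory rather than in the object database, and only the first parent of each commit is followed, so merges with several parents would need a fuller traversal.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 4  /* arbitrary limit for this sketch */

/* Hypothetical in-memory stand-in for a commit object. */
struct commit {
    const char *tree_id;                        /* checksum of the tree */
    const char *log;                            /* changelog entry */
    const struct commit *parents[MAX_PARENTS];  /* parent commits */
    size_t nr_parents;                          /* 0 for the first commit */
};

/*
 * Walk back from the head along first parents, calling visit()
 * (if non-NULL) on each commit.  Returns the number of commits seen.
 */
size_t walk_history(const struct commit *head,
                    void (*visit)(const struct commit *c, void *ctx),
                    void *ctx)
{
    const struct commit *c = head;
    size_t n = 0;

    while (c) {
        if (visit)
            visit(c, ctx);
        n++;
        c = c->nr_parents ? c->parents[0] : NULL;
    }
    return n;
}
```

Each commit visited carries its own tree_id, which is why any past state of the source tree remains reachable as long as the head is known.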
The directory cache can be set to a given version of the tree by using
read-tree; this operation reads a tree object from the object
database and stores it in the directory cache, but does not actually change any files
outside of the cache. From there, checkout-cache can be used to make
the actual source tree look like the cached tree object. The
show-diff tool prints the differences between the directory cache
and what's actually in the directory tree currently. There is also a
diff-tree tool which can generate the differences between any two
trees.
An early example of what can be done with these tools can be had by playing
with the git-pasky distribution by Petr
Baudis. Petr has layered a set of scripts over the git tools to create
something resembling a source management system. The git-pasky
distribution itself is available as a network repository; running
"git pull" will update to the current version.
A "pull"
operation, as implemented in git-pasky, performs these steps:
- The current "head" commit for the local repository
is found; git-pasky keeps the SHA checksum
for the current commit in .dircache/HEAD.
- The current head is obtained from the remote repository (using
rsync) and compared with the local head. If the two are the
same, no changes have been made and the job is done.
- The remote object database is downloaded, again with rsync.
This operation will add any new objects to the database.
- Using diff-tree, a patch from the previous (local) version to the
current (remote) version is generated. That patch is then applied to the
current directory's contents. The patch technique is used to help
preserve, if possible, any local changes to the files.
- A call to read-tree updates the directory cache to match the
current revision as obtained from the remote repository.
Petr's version of git adds a number of other features as well. It is a far
cry from a full-blown source code management system, since it lacks little
details like release tagging, merging, graphical interfaces, etc. A
basic structure is beginning to emerge, however.
When this work was begun, it was seen as a sort of insurance policy to be
used until a real
source management system could be found. There is a good chance, however,
that git will evolve into something with staying power. It provides the
needed low-level functionality in a reasonably simple way, and it is
blindingly fast. Linus places a premium on
speed:
If it takes half a minute to apply a patch and remember the
changeset boundary etc (and quite frankly, that's _fast_ for most
SCM's around for a project the size of Linux), then a series of 250
emails (which is not unheard of at all when I sync with Andrew, for
example) takes two hours.
As if on cue, Andrew announced a set of 198
patches to be merged for 2.6.12:
This is the first live test of Linus's git-importing ability. I'm
about to disappear for 1.5 weeks - hope we'll still have a kernel
left when I get back.
If this test (and the ones that come after) goes well, and the resulting
system evolves to where it meets Linus's needs, he may be unlikely to
switch to yet another system in the future. So git is worth watching; it
could develop into a powerful system in a hurry.
Since
LWN's look at git was
published, development has continued at a rapid pace. A number of features
and capabilities have been added to the system. Look for an updated
article at some future point when things stabilize somewhat.
A mailing list has been set up to take discussion of git off linux-kernel.
The list is called "git," and it is hosted on vger.kernel.org; sending a
message containing "subscribe git" to
majordomo@vger.kernel.org will get you onto the list. As of this
writing, the traffic is not small.
A couple of quotes from that list which didn't quite make the "quotes of
the week":
Trust me, not needing locking is a huge boon. I don't think people realize
just how much thought I've put into my database selection and what the
implications are.
It's perfect, I tell you.
-- Linus Torvalds
Sooner or later we'll find a flaw in it. Really! I mean, you've started
this OS thing 10+ years ago and we are still busy fixing it! ;)
-- Ingo Molnar
Linus has an experimental kernel repository on kernel.org, and has
committed Andrew Morton's initial 200-patch bomb to it. It's in:
pub/linux/kernel/people/torvalds/kernel-test.git
for those who are
interested. Commits to this repository are also being broadcast to the
same "commits" list that tracked the BitKeeper repository. Here's an example patch for those interested in what
a git commit looks like, or in the ioread/iowrite API change that your
editor has not yet managed to cover on this page.
The netlink mechanism implements a special sort of datagram socket for
communication between the kernel and user space. Most of the users of
netlink are currently in the networking subsystem itself - netlink
protocols exist, for example, for the management of routing table entries
and firewall rules. Netlink is also used by SELinux and the kernel event
notification mechanism.
Use of netlink is relatively straightforward - for kernel developers who
have some familiarity with the networking subsystem. To be able to
communicate via netlink, a kernel subsystem must first create an in-kernel
socket:
    struct sock *netlink_kernel_create(int unit,
                    void (*input)(struct sock *sk, int len));
Here, unit is the netlink protocol number (as defined in
<linux/netlink.h>), and input() is a function to be
called when data arrives on the given socket. The naming of unit
dates back to an early netlink implementation, which worked with virtual
devices; unit was the minor number of the relevant device. The
input() callback can be NULL, in which case user space
will not be able to write to the socket.
If there is an input() callback, it will be called whenever data
arrives. That data will be represented in one or more sk_buff
structures (SKBs) queued to the socket itself. So the core of a typical
input() function will look something like:
    struct sk_buff *skb;
    /* skb_dequeue() takes a pointer to the queue head */
    while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
        deal_with_incoming_data(skb);
        kfree_skb(skb);
    }
Sending data to user space involves allocating an SKB, filling it with the
data, and writing it to the netlink socket. Here is how the kernel events
mechanism does it:
    static int send_uevent(const char *signal, const char *obj,
                           char **envp, int gfp_mask)
    {
        struct sk_buff *skb;
        char *pos;
        int len;

        len = strlen(signal) + 1;
        len += strlen(obj) + 1;
        /* allocate buffer with the maximum possible message size */
        skb = alloc_skb(len + BUFFER_SIZE, gfp_mask);
        pos = skb_put(skb, len);
        sprintf(pos, "%s@%s", signal, obj);
        /* copy the environment key by key to our continuous buffer */
        if (envp) {
            int i;

            for (i = 2; envp[i]; i++) {
                len = strlen(envp[i]) + 1;
                pos = skb_put(skb, len);
                strcpy(pos, envp[i]);
            }
        }
        return netlink_broadcast(uevent_sock, skb, 0, 1, gfp_mask);
    }
(Some error handling has been removed for brevity; see
lib/kobject_uevent.c for the full version). The call to
netlink_broadcast() sends the data in the SKB to every user-space
process which is currently connected to the netlink socket. There is also
netlink_unicast(), which takes a process ID and sends only to that
process. Netlink writes can be restricted to specific "groups," allowing
user-space processes to sign up for an interesting subset of the data
written to a given socket.
There is more to the netlink interface than has been presented here; see
<linux/netlink.h> for the rest.
Evgeniy Polyakov thinks that the netlink protocol is too complicated; it
should not be necessary to understand the networking layer just to
communicate with user space. His response is connector, a layer on top of netlink which is
designed to make things simpler.
The connector code multiplexes all possible message types over a single
netlink socket number. Individual messages are distinguished by way of a
cb_id structure:
    struct cb_id
    {
        __u32 idx;
        __u32 val;
    };
idx can be thought of as a protocol type, and val as a
message type within the given protocol. A kernel subsystem which is
prepared to receive messages of a given type sets up a callback with:
    int cn_add_callback(struct cb_id *id, char *name,
                        void (*callback)(void *msg));
That callback will be invoked every time a message with the given
id is received from user space. The msg parameter to the
callback function, despite its void * type, is always a
pointer to a structure of this type:
    struct cn_msg
    {
        struct cb_id id;
        /* Some fields omitted */
        __u32 len;      /* Length of the following data */
        __u8 data[0];
    };
The callback can process the given message data and return.
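The multiplexing idea can be modeled outside the kernel. What follows is a user-space sketch, not the connector code: model_add_callback(), model_dispatch(), struct cn_msg_model, the fixed-size table, and the count_bytes() example callback are all hypothetical, and the message carries a data pointer instead of a trailing array to keep the example short. It shows only how an (idx, val) pair can select a registered callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cb_id {
    uint32_t idx;   /* "protocol" */
    uint32_t val;   /* message type within that protocol */
};

/* Simplified message: a pointer stands in for the trailing data. */
struct cn_msg_model {
    struct cb_id id;
    uint32_t len;
    const uint8_t *data;
};

#define MAX_CALLBACKS 16

struct cb_entry {
    struct cb_id id;
    void (*callback)(void *msg);
};

static struct cb_entry table[MAX_CALLBACKS];
static size_t nr_entries;

/* Register a callback for one id; fails on duplicates or a full table. */
int model_add_callback(const struct cb_id *id, void (*callback)(void *msg))
{
    size_t i;

    if (!id || !callback || nr_entries == MAX_CALLBACKS)
        return -1;
    for (i = 0; i < nr_entries; i++)
        if (table[i].id.idx == id->idx && table[i].id.val == id->val)
            return -1;
    table[nr_entries].id = *id;
    table[nr_entries].callback = callback;
    nr_entries++;
    return 0;
}

/* Hand a message to the callback registered for its id, if any. */
int model_dispatch(struct cn_msg_model *msg)
{
    size_t i;

    for (i = 0; i < nr_entries; i++) {
        if (table[i].id.idx == msg->id.idx &&
            table[i].id.val == msg->id.val) {
            table[i].callback(msg);
            return 0;
        }
    }
    return -1;   /* nobody is listening for this id */
}

/* Example callback: add up the payload sizes it has been handed. */
uint32_t bytes_seen;

void count_bytes(void *msg)
{
    struct cn_msg_model *m = msg;

    bytes_seen += m->len;
}
```

As in the real interface, the callback receives a void pointer and must know for itself what structure lies behind it.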
Writing to a socket via connector is done with:
    void cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask);
The msg contains the cb_id structure describing the
message; __groups can be used to restrict the list of recipients,
and gfp_mask controls how memory allocation is done. This call
can fail (netlink is an unreliable service), but it returns no indication
of whether it succeeded or not.
For kernel code which needs to send significant amounts of data to user
space, perhaps from hot paths, there is also a "CBUS" layer over the
connector. That layer exports one function:
    int cbus_insert(struct cn_msg *msg, int gfp_flags);
This function does not send the message immediately; it simply adds it to a
per-CPU queue. A separate worker thread will eventually come along, find
the message, and send it on to user space.
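The deferred-send pattern can be shown with a toy queue in user-space C. This is a sketch of the general technique only, under several assumptions: it uses a single fixed-size ring rather than per-CPU queues, it has no locking and no worker thread (the "worker" is just a function the caller runs later), and the names deferred_queue, queue_insert(), and queue_drain() are invented here.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SLOTS 8

/* A small ring buffer of pending messages. */
struct deferred_queue {
    const void *slots[QUEUE_SLOTS];
    size_t head;    /* next slot to drain */
    size_t count;   /* number of queued messages */
};

/* Fast path: remember the message and return at once. */
int queue_insert(struct deferred_queue *q, const void *msg)
{
    if (q->count == QUEUE_SLOTS)
        return -1;   /* queue full; the caller must cope */
    q->slots[(q->head + q->count) % QUEUE_SLOTS] = msg;
    q->count++;
    return 0;
}

/*
 * Slow path, run later by the "worker": pass every queued message
 * to send() (if non-NULL).  Returns the number of messages drained.
 */
size_t queue_drain(struct deferred_queue *q, void (*send)(const void *msg))
{
    size_t sent = 0;

    while (q->count) {
        const void *msg = q->slots[q->head];

        q->head = (q->head + 1) % QUEUE_SLOTS;
        q->count--;
        if (send)
            send(msg);
        sent++;
    }
    return sent;
}
```

The attraction for hot paths is that the insert step does a fixed, small amount of work; the cost of actually delivering the data is paid elsewhere.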
The code seems to work, though some concerns have been raised about the
implementation. Not everybody feels that the connector solution
is necessary, however. The core netlink
API is not all that hard to use, so it is not clear that another layer
needs to be wrapped around it. Those who do think that netlink could be
made easier do not agree on how it should be done; some developers would
like to see the netlink API itself changed rather than having another layer
put on top of it. Various user-space needs
(auditing, accounting, desktop functionality, etc.) are all creating
pressure for more communication channels with the kernel. Some way of
making that communication easier on the kernel side may well get added,
eventually, but
it is far from clear what form that code will take.
Filesystem in Userspace (FUSE - covered here
in January, 2004) provides a kernel interface
and library which make it easy to implement filesystems with a user-space
process. With FUSE, a user can mount a filesystem contained within a tar
archive, implemented via an FTP session, or "tunneled" from a remote system
via ssh. It is a powerful tool with many users, and its authors have been
pushing for inclusion into the mainline kernel for some time now. That
merge has been delayed pending a review of the patch by a few interested
developers.
That review has happened, and it has turned up a problem; it seems that
FUSE, in some situations, implements some rather strange filesystem
semantics.
Consider the case of a filesystem hosted in a tar archive. FUSE will
present files within the archive with the owners and permission modes
specified inside that archive. The owner and permissions of the files, in
other words, do not
necessarily have anything to do with the owner of the archive or the user
who mounted it as a filesystem. To allow that user to actually work with
files in the archive, the "tarfs" FUSE module disables ordinary permissions
checking. A file may, according to a tool like ls, be owned by
another user and inaccessible, but the user who mounted the filesystem has
full access anyway. FUSE also ensures that no other user has any
access to the mounted filesystem - not even root.
This twisting of filesystem semantics does not sit well with some kernel
developers, who tend to think that Linux systems should behave like Linux.
The FUSE semantics have the potential to confuse programs which think that
the advertised file permissions actually mean something (though, evidently,
that tends not to be a problem in real use) and it makes it impossible to
mount a filesystem for use by more than one user. So these developers have
asked that the FUSE semantics be removed, and that a FUSE filesystem behave
more like the VFAT-style systems; the user mounting the filesystem should
own the files, and reasonable permissions should be applied.
In fact, FUSE does provide an option ("allow_other") which causes
it to behave in this way. But that approach goes against what FUSE is
trying to provide, and raises some security issues of its own. FUSE hacker
Miklos Szeredi sees the issue this way:
I want the tar filesystem to be analogous to running tar. When I
run tar, other users are not notified of the output, it's only for
me. If they want to run tar, they can too. The same can be true
for tarfs. I mount it for my purpose, others can mount it for
theirs. Since the daemon providing the filesystem always runs with
the same capabilities as the user who did the mount, I and others
will always get the permissions that we have on the actual tar
file.
In this view, a FUSE filesystem is very much a single-user thing. In some
cases, it really should be that way; consider a remote filesystem
implemented via an ssh connection. The user mounting the
filesystem presumably has the right to access the remote system, on the
remote system's terms. The local FUSE filesystem should not be trying to
figure out what the permissions on remote files should be. Other users on
the local system - even the root user - may have no right to access the
remote system, and should not be able to use the FUSE filesystem to do so.
It's not clear where this discussion will go. There are some clear reasons
behind the behavior implemented by FUSE, and it may remain available,
though, perhaps, not as a default, and possibly implemented in a different
way. The little-used Linux namespace capability has been mentioned as a
way of hiding single-user FUSE filesystems, though there may be some
practical difficulties in making namespaces actually work with FUSE. Until
the core filesystem hackers are happy, however, FUSE is likely to have a
rough path into the mainline.
Page editor: Jonathan Corbet