Brief items

The current stable 2.6 kernel was released on November 8. This kernel contains a single patch for a sysctl-related oops. There was some unhappiness that the patch for the "zero-length datagrams get dropped" bug, which breaks bind and tcpdump, was not included. That patch will turn up in the next stable update, which should be released around November 12.
There is still no 2.6.15 prepatch as of this writing. The merge window for this cycle is about to close, however, so 2.6.15-rc1 may be out by the time you read this. An impressive pile of patches has been merged into the mainline git repository; see the article below for a list of significant additions since last week.
The current -mm tree is 2.6.14-mm1. Recent changes to -mm include 64KB page support for the ppc64 architecture, the swap migration patches, and the lean-and-mean "slob" allocator. The -mm tree has slimmed down considerably as patches have been merged into the mainline.
The current 2.4 prepatch is 2.4.32-rc3, released by Marcelo on November 9. This release candidate adds exactly two patches for serious problems; the final 2.4.32 release will likely happen soon.
Kernel development news
There is also the usual big pile of fixes, and a number of architecture updates.
It should be noted that the patches merged into the mainline are not a complete solution for either of the above problems, but they are a step in that direction. The per-user namespaces example will be used in what follows to illustrate how the various subtree options work.
Every filesystem in Linux is mounted within a specific namespace. The kernel has long supported the creation of multiple namespaces, but, in most situations, that feature is not used. So the typical Linux system has a single namespace which is shared between all processes on the system. When separate namespaces are used, they are usually in the context of sandboxing and isolation. There would be advantages, however, to making more extensive use of namespaces.
Imagine, for starters, a simple filesystem hierarchy which looks something like the diagram at the right. Clearly, a few directories have been left out for simplicity. The only unusual thing is that a couple of directories have been created under /subtree for users "alice" and "bob". We would like to use those directories as the root for each user's own private view of the filesystem.
The first step is to create a copy of the root filesystem under each user's subtree directory using bind mounts. The result of such an operation will look like the diagram below.
This loss of isolation and explosion of mount points can be avoided through the use of "unbindable" mounts, a new feature added by the sharable subtrees patch. Said mounts cannot be bound into other places, and will not be propagated into new subtrees. So the administrator could execute a series of commands like:
    mount --bind /subtree /subtree
    mount --make-unbindable /subtree
This incantation turns /subtree into a magic point which cannot be rebound. If, after this has been done, the administrator makes the per-user bind mounts of the root filesystem, the portion under /subtree will be pruned, with a result which looks like this:
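As an aside, the whole sequence (make /subtree unbindable, then create the per-user copies of the root) can be collected into a small script. What follows is a minimal sketch, not a tested recipe: `plan_user_views` is a hypothetical helper name, it only prints the commands rather than running them, and the use of `--rbind` (a recursive bind, so that filesystems mounted below the root are copied too) is an assumption; the text above says only "bind mounts".

```shell
#!/bin/sh
# Sketch only: prints the mount commands instead of running them.
# plan_user_views is a hypothetical helper name, not an existing tool.
plan_user_views() {
    # Make /subtree a mount point of its own, then mark it unbindable so
    # that it is pruned from every copy of the root made below.
    printf '%s\n' \
        "mount --bind /subtree /subtree" \
        "mount --make-unbindable /subtree"
    for user in "$@"; do
        # Assumption: --rbind (recursive bind) is used so that filesystems
        # mounted below / are copied into the per-user view as well.
        printf 'mount --rbind / /subtree/%s\n' "$user"
    done
}
```

Calling `plan_user_views alice bob` prints four commands; an administrator who is happy with the list could pipe it to a root shell to apply it.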
Now imagine that the system administrator mounts a CDROM under /mnt. The result will look like:
Note that the CDROM mount is not visible in the per-user namespaces, so bob and alice will be unable to look at the contents of the CD. That might be the intended result, but imagine it's not, that the administrator wants all users to be able to see things mounted on /mnt. The answer is a "sharable" mount, one which is automatically propagated into every place where the original mount appears. So, the administrator need only perform another new incantation:
    mount --bind /mnt /mnt
    mount --make-shared /mnt

After this, /mnt is a sharable mount. Any changes made there will appear in any namespace where /mnt appears. The resulting tree would look something like this:
Many administrators might rather just make the entire filesystem tree sharable, rather than try to anticipate where changes could be made. If the root is made sharable in this way, any new filesystems which are mounted will propagate throughout the tree. This propagation works all ways; if alice mounts the CD within her subtree, it will still appear in all of the subtrees.
Of course, this behavior might not always be desirable. If, for example, bob is using FUSE to mount an "ssh filesystem" from a remote host, he would prefer that this filesystem not be visible to other users at all. But bob would still like to see filesystems mounted elsewhere, and does not want to give up the advantages of a shared subtree. The answer is yet another type of mount, called a "slave" mount. Slave mounts are selfish: they remain tied to their parent mount, and receive new mounts from there. Anything mounted underneath the slave mount, however, will not be propagated elsewhere. So each user can have his or her own filesystems which are not part of the global hierarchy:
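No commands for slave mounts are given above, so here is a minimal sketch of how the pieces might fit together, under stated assumptions. The helper name `plan_slave_views` is hypothetical, the function only prints commands, and the recursive variants (`--make-rshared`, `--rbind`, `--make-rslave`) are assumed so that every mount in the tree is covered. The ordering is deliberate: a recursive change applied at the root reaches every mount below it, /subtree included, so /subtree is marked unbindable only after the root has been made shared.

```shell
#!/bin/sh
# Sketch only: prints the mount commands instead of running them.
# plan_slave_views is a hypothetical helper name.
plan_slave_views() {
    # 1. Make /subtree its own mount point.
    # 2. Make the whole tree shared; this recursive change is applied to
    #    every mount below /, /subtree included.
    # 3. Only then mark /subtree unbindable, so that step 2 cannot undo it.
    printf '%s\n' \
        "mount --bind /subtree /subtree" \
        "mount --make-rshared /" \
        "mount --make-unbindable /subtree"
    for user in "$@"; do
        # The per-user copy starts out as a peer of the shared root;
        # turning it into a slave keeps mounts flowing in from the root
        # while mounts made inside the copy stay local to it.
        printf '%s\n' \
            "mount --rbind / /subtree/$user" \
            "mount --make-rslave /subtree/$user"
    done
}
```

With this arrangement, a CD mounted on /mnt in the root tree should show up under /subtree/bob/mnt, while a filesystem bob mounts inside his own view should not appear anywhere else.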
The shared subtrees patch also adds a "private" mount type, which is essentially how mounts in 2.6.14 and prior kernels work. A private mount will not be propagated to any other mounts, but it can (unlike an unbindable mount) be explicitly propagated via a bind operation.
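The four mount types described so far each map onto a `mount` option of the same shape. The sketch below summarizes that mapping; `propagation_cmd` is a hypothetical helper which prints the command for a given type and mount point rather than running it, and it rejects anything that is not one of the four types.

```shell
#!/bin/sh
# Sketch only: prints the command that would set a propagation type.
# propagation_cmd is a hypothetical helper name.
propagation_cmd() {
    kind=$1
    target=$2
    case $kind in
        shared|slave|private|unbindable)
            # shared:     mount events propagate to and from peers
            # slave:      receives events from its master, sends none
            # private:    no propagation, but can still be bind-mounted
            # unbindable: no propagation, and cannot be bind-mounted
            printf 'mount --make-%s %s\n' "$kind" "$target"
            ;;
        *)
            echo "unknown propagation type: $kind" >&2
            return 1
            ;;
    esac
}
```

For example, `propagation_cmd private /mnt` prints `mount --make-private /mnt`, which would return /mnt to the behavior of 2.6.14 and earlier kernels.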
Internally, the patches create the concept of a "peer group," among which mount events are propagated. A new mnt_share field (a list of peers) has been added to the vfsmount structure for this purpose. A couple of other lists (mnt_slave_list and mnt_slave) have been added for keeping track of slave mount relationships. A new MNT_UNBINDABLE flag marks unbindable mounts. And, of course, a great deal of locking work has been done to make all of this work in a safe manner. Al Viro has worked with a few iterations of the shared subtrees patch, with the result that it is now considered to be ready for the mainline.
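Peer groups can be made visible from user space on later kernels (not the one described here), which list each mount's propagation state in the optional fields of /proc/self/mountinfo: `shared:N` names the peer group, `master:N` names the group a slave receives events from, and `unbindable` marks unbindable mounts. The sketch below pulls those tags out of mountinfo-format lines; `propagation_of` is a hypothetical helper name, and a mount with no tags is reported as private.

```shell
#!/bin/sh
# Sketch only: summarize propagation tags from mountinfo-format input.
# propagation_of is a hypothetical helper name.
# Usage on a kernel that provides the file:
#     propagation_of < /proc/self/mountinfo
propagation_of() {
    awk '{
        # Field 5 is the mount point. The optional fields start at
        # field 7 and run up to a lone "-" separator.
        tags = ""
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags == "" ? "" : ",") $i
        if (tags == "")
            tags = "private"
        print $5, tags
    }'
}
```

Mounts which report the same `shared:N` value belong to the same peer group, so a mount event at one of them is propagated to the others.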
The shared subtrees patch is a big step forward: it is a fundamental change to the virtual filesystem layer which greatly increases the flexibility in how namespaces can be populated and presented to users. What remains, at this point, is some work on the namespace side of things. Namespaces are still unnamed objects which can only be inherited from a parent process; there is no easy way to create and attach to a per-user namespace. Finishing the job will take some work, but, chances are, the hardest part of the problem has been solved.
For more information, see the extensive documentation file shipped with the patch.

The seq_file mechanism is a helper for kernel subsystems wanting to create lengthy virtual files, usually in /proc. 2.6.15 will include a small enhancement which may prove helpful for some users.
When user space opens a virtual file, the kernel must, in turn, call seq_open() to set things up. On return, the file structure passed to seq_open() will have, in its private_data field, a pointer to the seq_file structure created at open time. That is the same structure which will be passed to the seq_file iterator functions, and which must be used when actually generating output.
Traditionally, seq_open() has always allocated the seq_file structure itself. In 2.6.15, however, it will examine the private_data field first, and, if that field is non-NULL, it will assume that the seq_file has already been allocated by the caller. This change allows seq_file users to embed the structure within something larger. It is worth noting, though, that seq_release() still frees the seq_file structure regardless of who created it. Among other things, that implies that, if the caller allocates a seq_file structure within a larger structure, the seq_file structure must appear at the beginning, so that the pointer freed by seq_release() is also the start of the enclosing allocation.

Last week's article on fragmentation avoidance concluded with these famous last words:
One thing which can keep a patch out of the kernel, however, is opposition from Linus, and that is what has happened in this case. His position is that fragmentation avoidance is "totally useless," and he concludes:
The right solution, according to Linus, is to create a special memory zone on the (rare) systems which need to be able to free up large, contiguous blocks of memory. Kernel memory allocations would not be allowed in that zone, so it would only contain user-space pages. Those pages are relatively easy to move when the need arises, so most needs would be satisfied. A certain amount of kernel tuning would be required, but that is the price to be paid for running highly-specialized applications.
This approach is not pleasing to everybody involved. Andi Kleen noted:
Others have noted that it can be hard to tune a machine for all workloads, especially on systems with a large number of users. Objections notwithstanding, it begins to look like active fragmentation avoidance is not likely to go into the 2.6 kernel anytime soon.
Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds