Brief items
The current stable 2.6 kernel is 2.6.14.1,
released on November 8. This kernel
contains a single patch for a sysctl-related oops. There was some
unhappiness that the patch for the "zero-length datagrams get dropped" bug,
which breaks bind and tcpdump, was not included. That patch will
turn up in 2.6.14.2, which should be released around November 12.
There is still no 2.6.15 prepatch as of this writing. The merge
window for this cycle is about to close, however, so 2.6.15-rc1 may be out
by the time you read this. An impressive pile of patches has been merged
into the mainline git repository; see the article below for a list of
significant additions since last week.
The current -mm tree is 2.6.14-mm1. Recent changes to
-mm include 64KB page support for the ppc64 architecture, the swap migration patches, and the
lean-and-mean "slob" allocator. The -mm tree has slimmed down considerably
as patches have been merged into the mainline.
The current 2.4 prepatch is 2.4.32-rc3, released by Marcelo on
November 9. This release candidate adds exactly two patches for
serious problems; the final 2.4.32 release will likely happen soon.
Kernel development news
When you hear voices in your head that tell you to shoot the pope,
do you do what they say? Same thing goes for customers and
managers. They are the crazy voices in your head, and you need to
set them right, not just blindly do what they ask for.
-- Linus Torvalds
Last week's "what's going into 2.6.15" article had a long list of changes
merged into the mainline.
The kernel developers weren't done, however. Here is a list of changes
merged since that article was written:
- A big XFS update (including barrier support).
- A SCSI RDMA protocol initiator for InfiniBand.
- The open-iSCSI patches.
- The removal of the (broken) Compaq fibre channel driver.
- RapidIO bus support.
- The netlink connector patch, with the
process events connector on top.
- A number of packet scheduler improvements.
- An ALSA update.
- The un-exporting of a number of kernel symbols
(clear_page_dirty_for_io,
console_unblank,
cpu_core_id,
hugetlb_total_pages,
idle_cpu,
nr_swap_pages,
phys_proc_id,
reprogram_timer,
swapper_space,
sysctl_overcommit_memory,
sysctl_overcommit_ratio,
sysctl_max_map_count,
total_swap_pages,
user_get_super,
uts_sem,
vm_acct_memory, and
vm_committed_space).
- A big purge of code which checks pointers for NULL prior to
passing them to kfree().
- A big reorganization of the block subsystem code (it has its own top-level
block directory now).
- A memory technology devices update, including support for OneNAND,
Sibley, and resident flash disk devices.
- The shared subtrees patches.
- An MPPE encryption module for PPP.
- The removal of all Bluetooth-related files from /proc (they are in
/sys/class/bluetooth now).
- Some significant reworking and cleanup of the software suspend code.
- Big changes to the DVB and Video4Linux subsystems, including support
for a number of new devices.
- A number of Open Sound System (OSS) drivers are now explicitly scheduled
for removal in January (probably 2.6.16, in other words).
- Version 1 of the Video4Linux API has also been scheduled for removal
(in July 2006).
- Support for rotation of the console screen (for mobile devices whose
natural orientation is not zero degrees).
- A number of scheduler tweaks to improve efficiency and resource usage
on larger systems.
- Big updates to the ipw2100 and ipw2200 drivers.
There is also the usual big pile of fixes, and a number of architecture
updates.
The shared subtrees patch set, written primarily by Ram Pai, has been in
circulation for some time, but without a whole lot of discussion. Those
patches have now been merged into the pre-2.6.15 mainline, so the time has
come for a closer look.
In short, shared subtrees allow a system administrator to configure, in
great detail, how various filesystem mounts should appear in the tree, how
they relate to each other, and how they propagate between namespaces.
There are two motivations for this work:
- The "files as directories" feature of the reiser4 filesystem allows
a user to create, via hard links, a directory which appears in
multiple places in the filesystem. That feature has long been
disabled due to the deadlock issues which it raised. Shared subtrees
are a step toward implementing "files as directories" in a safe
manner.
- The merging of the filesystem in user space (FUSE) patch, and some of the
permissions issues associated with it, have increased the desire to be
able to run users in their own filesystem namespaces. Per-user
namespaces are currently awkward at best; shared subtrees will help
make them easier to manage.
It should be noted that the patches merged into the mainline are not a
complete solution for either of the above problems, but they are a step in
that direction. The per-user namespaces example will be used in what
follows to illustrate how the various subtree options work.
Every filesystem in Linux is mounted within a specific namespace. The
kernel has long supported the creation of multiple namespaces, but, in most
situations, that feature is not used. So the typical Linux system has a
single namespace which is shared between all processes on the system.
When separate namespaces are used, they are usually in the context of
sandboxing and isolation. There would be advantages, however, to making
more extensive use of namespaces.
Imagine, for starters, a simple filesystem hierarchy which looks something
like the diagram at the right. Clearly, a few directories have been left
out for simplicity. The only unusual thing is that a couple of directories
have been created under /subtree for users "alice" and "bob". We
would like to use those directories as the root for each user's own private
view of the filesystem.
The first step is to create a copy of the root filesystem under each user's
subtree directory using bind mounts. The result of such an operation will
look like the diagram below.
Note that the /subtree tree has been bound into each user's namespace as well.
This propagation cuts down on the isolation between users, since they can
see each others' subtrees. As the number of users grows, it also
complicates the namespaces considerably, as each set of subtrees must be
replicated over and over.
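
The growth described above can be made concrete with a little arithmetic. The following Python sketch is a toy count, not anything taken from the patch set. It assumes that each per-user copy is made with a recursive bind of the root (something like "mount --rbind / /subtree/<user>"), one user after another, and that the system starts with a given number of mounts. The function name and its parameters are invented for illustration; the skip_subtree option simply shows, for comparison, what the count would be if the copies living under /subtree were left out of each new copy.

```python
def mounts_after_user_copies(initial_mounts, users, skip_subtree=False):
    """Count mounts after giving each user a recursive bind copy of the root.

    Toy arithmetic only: assumes one recursive bind of / per user, done in
    sequence, starting from initial_mounts mounts.
    """
    total = initial_mounts
    for _ in range(users):
        if skip_subtree:
            # Only the original mounts are copied; earlier users' copies
            # living under /subtree are left out.
            total += initial_mounts
        else:
            # Everything visible under the root is copied, including the
            # copies already made for earlier users, so the count doubles.
            total += total
    return total


if __name__ == "__main__":
    for users in (1, 2, 4, 8):
        print(users,
              mounts_after_user_copies(3, users),
              mounts_after_user_copies(3, users, skip_subtree=True))
```

Under these assumptions the first count doubles with every user added, while the second grows by a fixed amount per user.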
This loss of isolation and explosion of mount points can be avoided through
the use of "unbindable" mounts, a new feature added by the shared
subtrees patch. Said mounts cannot be
bound into other places, and will not be propagated into new subtrees. So
the administrator could execute a series of commands like:
    mount --bind /subtree /subtree
    mount --make-unbindable /subtree
This incantation turns /subtree into a magic point which cannot be
rebound. If, after this has been done, the administrator makes the
per-user bind mounts of the root filesystem, the portion under
/subtree will be pruned, with a result which looks like this:
Now imagine that the system administrator mounts a CDROM under
/mnt. The result will look like:
Note that the CDROM mount is not visible in the per-user namespaces, so bob
and alice will be unable to look at the contents of the CD. That might be the
intended result, but imagine it's not, that the administrator wants all
users to be able to see things mounted on /mnt. The answer is a
"sharable" mount, one which is automatically propagated into every place
where the original mount appears. So, the administrator need only perform
another new incantation:
    mount --bind /mnt /mnt
    mount --make-shared /mnt
After this, /mnt is a sharable mount. Any changes made there will
appear in any namespace where /mnt appears. The resulting tree
would look something like this:
Many administrators might rather just make the entire filesystem tree
sharable, rather than try to anticipate where changes could be made. If
the root is made sharable in this way, any new filesystems which are
mounted will propagate throughout the tree. This propagation works in every
direction; if alice mounts the CD within her subtree, it will still appear in
all of the subtrees.
Of course, this behavior might not always be desirable. If, for example, bob is
using FUSE to mount an "ssh filesystem" from a remote host, he would prefer
that this filesystem not be visible to other users at all. But bob would
still like to see filesystems mounted elsewhere, and does not want to give
up the advantages of a shared subtree. The answer is yet another type of
mount, called a "slave" mount. Slave mounts are selfish: they remain tied
to their parent mount, and receive new mounts from there. Anything mounted
underneath the slave mount, however, will not be propagated elsewhere. So
each user can have his or her own filesystems which are not part of the
global hierarchy:
The shared subtrees patch also adds a "private" mount type, which is
essentially how mounts in 2.6.14 and prior kernels work. A private mount
will not be propagated to any other mounts, but it can (unlike an
unbindable mount) be explicitly propagated via a bind operation.
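
The four mount types described so far can be summarized as three yes-or-no properties. The Python sketch below is only a restatement of the prose above as a lookup; the enum and function names are invented for illustration and do not correspond to kernel identifiers.

```python
from enum import Enum


class MountType(Enum):
    SHARED = "shared"
    SLAVE = "slave"
    PRIVATE = "private"
    UNBINDABLE = "unbindable"


def sends_mount_events(kind):
    """Do mounts made underneath propagate to other copies of this mount?"""
    return kind is MountType.SHARED


def receives_mount_events(kind):
    """Do mounts made elsewhere show up underneath this mount?"""
    return kind in (MountType.SHARED, MountType.SLAVE)


def can_be_bind_mounted(kind):
    """May this mount be explicitly bound somewhere else?"""
    return kind is not MountType.UNBINDABLE


if __name__ == "__main__":
    for kind in MountType:
        print(f"{kind.value:11} sends={sends_mount_events(kind)!s:5} "
              f"receives={receives_mount_events(kind)!s:5} "
              f"bindable={can_be_bind_mounted(kind)}")
```

Seen this way, a slave mount is a shared mount with the outgoing direction switched off, and an unbindable mount is a private mount which additionally refuses bind operations.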
Internally, the patches create the concept of a "peer group," among which
mount events are propagated. A new mnt_share field (a list of
peers) has been added to the vfsmount structure for this purpose.
A couple of other lists (mnt_slave_list and mnt_slave)
have been added for keeping track of slave mount relationships. A new
MNT_UNBINDABLE flag marks unbindable mounts. And, of course, a
great deal of locking work has been done to make all of this work in a safe
manner. Al Viro has worked with a few iterations of the shared subtrees
patch, with the result that it is now considered to be ready for the
mainline.
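
As a rough illustration of how peer groups and slave lists could drive propagation, here is a toy user-space model in Python. It borrows the field names mnt_share and mnt_slave_list from the description above, but everything else is an assumption made for the sake of the example: the kernel uses linked lists and careful locking rather than Python lists, and the helper functions (make_peer, make_slave, propagate) are hypothetical.

```python
class Mount:
    """Toy stand-in for a mount; not the kernel's vfsmount."""

    def __init__(self, label):
        self.label = label
        self.mnt_share = [self]     # peer group, including this mount
        self.mnt_slave_list = []    # mounts which are slaves of this one
        self.submounts = []         # names of things mounted underneath


def make_peer(a, b):
    """Put b into a's peer group (both then share one list)."""
    a.mnt_share.append(b)
    b.mnt_share = a.mnt_share


def make_slave(master, slave):
    """Record slave as receiving mount events from master."""
    master.mnt_slave_list.append(slave)


def propagate(origin, what):
    """Mount 'what' under origin; return labels of every mount receiving it."""
    seen = []
    pending = list(origin.mnt_share)
    while pending:
        mnt = pending.pop()
        if mnt in seen:
            continue
        seen.append(mnt)
        mnt.submounts.append(what)
        # Events flow from a mount down to its slaves (and their peers),
        # never from a slave back up to its master.
        for slave in mnt.mnt_slave_list:
            pending.extend(slave.mnt_share)
    return sorted(m.label for m in seen)


if __name__ == "__main__":
    root, alice, bob = Mount("root"), Mount("alice"), Mount("bob")
    make_slave(root, alice)
    make_slave(root, bob)
    print(propagate(root, "cdrom"))   # the CD mounted in the global tree
    print(propagate(bob, "sshfs"))    # bob's own FUSE mount
```

In this model a mount made in the global tree reaches both users' slave mounts, while bob's own mount stays with bob.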
The shared subtrees patch is a big step forward: it is a fundamental change
to the virtual filesystem layer which greatly increases the flexibility in
how namespaces can be populated and presented to users. What remains, at
this point, is some work on the namespace side of things. Namespaces are
still unnamed objects which can only be inherited from a parent process;
there is no easy way to create and attach to a per-user namespace.
Finishing the job will take some work, but, chances are, the hardest part
of the problem has been solved.
For more information, see the extensive documentation file shipped with the
patch (Documentation/sharedsubtree.txt).
The seq_file mechanism is a helper for kernel subsystems wanting to create
lengthy virtual files, usually in /proc. The 2.6.15 kernel will include a
small enhancement which may prove helpful for some users.
When user space opens a virtual file, the kernel must, in turn, call
seq_open() to set things up. On return, the file
structure passed to seq_open() will have, in its
private_data field, a pointer to the seq_file structure
created at open time. That is the same structure which will be passed to
the seq_file iterator functions, and which must be used when actually
generating output.
Traditionally, seq_open() has always allocated the
seq_file structure itself. In 2.6.15, however, it will examine
the private_data field first, and, if that field is
non-NULL, it will assume that the seq_file has already
been allocated by the caller. This change allows seq_file users to embed
the structure within something larger. It is worth noting, though, that
seq_release() still frees the seq_file structure
regardless of who created it. Among other things, that implies that, if
the caller allocates a seq_file structure within a larger
structure, the seq_file structure must appear at the beginning.
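
To see why the embedded structure has to come first, consider what seq_release() is working with: a pointer to the seq_file, which it frees. That pointer is the start of the caller's allocation only if the seq_file sits at offset zero within the larger structure. The Python sketch below uses ctypes to lay out C-style structures and check that offset. The structure names and fields are made-up stand-ins, not the kernel's definitions, and the helper function is hypothetical.

```python
import ctypes


class SeqFileModel(ctypes.Structure):
    # Stand-in for struct seq_file; the real structure has other fields.
    _fields_ = [("count", ctypes.c_size_t), ("index", ctypes.c_long)]


class GoodWrapper(ctypes.Structure):
    # seq_file first: the wrapper and its seq member share one address.
    _fields_ = [("seq", SeqFileModel), ("cookie", ctypes.c_int)]


class BadWrapper(ctypes.Structure):
    # seq_file second: a pointer to seq points into the middle of the
    # allocation, so freeing it would not free the wrapper correctly.
    _fields_ = [("cookie", ctypes.c_int), ("seq", SeqFileModel)]


def safe_to_free_via_member(wrapper_type):
    """True if a pointer to .seq is also a pointer to the whole wrapper."""
    return wrapper_type.seq.offset == 0


if __name__ == "__main__":
    print("GoodWrapper:", safe_to_free_via_member(GoodWrapper))
    print("BadWrapper: ", safe_to_free_via_member(BadWrapper))
```

A real user of this feature would, in the same spirit, place its struct seq_file as the first member of its own structure before storing the pointer in private_data.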
Last week's article on
fragmentation avoidance concluded with these famous last words:
But there are legitimate reasons for wanting this capability in the
kernel, and the issue is unlikely to go away. Unless somebody comes
up with a better solution, it could be hard to keep Mel's patch out
forever.
One thing which can keep a patch out of the kernel, however, is
opposition from Linus, and that is what has happened in this case. His position is that fragmentation avoidance is
"totally useless," and he concludes:
Don't do it. We've never done it, and we've been fine.
The right solution, according to Linus, is to create a special memory zone
on the (rare) systems which need to be able to free up large, contiguous
blocks of memory. Kernel memory allocations would not be allowed in that
zone, so it would only contain user-space pages. Those pages are
relatively easy to move when the need arises, so most needs would be
satisfied. A certain amount of kernel tuning would be required, but that
is the price to be paid for running highly-specialized applications.
This approach is not pleasing to everybody involved. Andi Kleen noted:
You have two choices if a workload runs out of the kernel
allocatable pages. Either you spill into the reclaimable zone or
you fail the allocation. The first means that the huge pages thing
is unreliable, the second would mean that all the many problems of
limited lowmem would be back.
Others have noted that it can be hard to tune a machine for all workloads,
especially on systems with a large number of users. Objections
notwithstanding, it begins to look like active fragmentation avoidance is
not likely to go into the 2.6 kernel anytime soon.
Page editor: Jonathan Corbet