Brief items
The current 2.6 prepatch is 2.6.24-rc8, released by Linus on January 15.
It contains a fair number of fixes but not much else. "So I'm pretty sure
this is the last -rc, and the final 2.6.24 will probably be out next
weekend or so. But in the meantime, let's give this a final shakedown, and
see if we can fix any last regressions still." See the long-format
changelog for the details.
As of this writing, a very small number of fixes has been merged post-rc8.
There have been no -mm releases over the last week.
The current stable 2.6 kernel is 2.6.23.14, released (along with 2.6.22.16) on January 14.
These releases contain a single patch: a fix for the filesystem security
vulnerability discussed on this
week's Security Page.
For older kernels: 2.6.16.58 was released on
January 16 with several fixes.
Kernel development news
I wonder what a tiny, SANE register-based bytecode
interface might look like. Have a single page shared between kernel and
userland, for each thread. Userland fills that page with bytecode, for a
virtual machine with 256 registers -- where instructions roughly equate to
syscalls.
The common case -- a single syscall like open(2) -- would be a single byte
bytecode, plus a couple VM register stores. The result is stored in
another VM register.
But this format enables more complex cases, where userland programs can
pass strings of syscalls into the kernel, and let them execute until some
exceptional condition occurs. Results would be stored in VM registers (or
userland addresses stored in VM registers...).
-- Jeff Garzik
By Jonathan Corbet
January 15, 2008
Chris Mason has recently released Btrfs v0.10, which contains a number of
interesting new features. In general, Btrfs has come a long way since LWN
first wrote about it last June. Btrfs may, in some years, be the filesystem
most of us
are using - at least, for those of us who will still be using rotating
storage then. So it bears watching.
Btrfs, remember, is an entirely new filesystem being developed by Chris
Mason. It is a copy-on-write system which is capable of quickly creating
snapshots of the state of the filesystem at any time. The snapshotting is
so fast, in fact, that it is used as the Btrfs transactional mechanism,
eliminating the need for a separate journal. It supports subvolumes -
essentially the existence of multiple, independent filesystems on the same
device. Btrfs is designed for speed, and also provides checksumming for
all stored data.
Some kernel patches show up and quickly find their way into production
use. For example, one year ago, nobody (outside of the -ck list, perhaps) was talking
about fair scheduling; but, as of this writing, the CFS scheduler has been
shipping for a few months. KVM also went from initial posting to merged
over the course of about two kernel release cycles.
Filesystems do not work that way, though.
Filesystem developers tend to be a cautious, conservative bunch; those who
aren't that way tend not to survive their first few encounters with users
who have lost data. This is all a way of saying that, even though Btrfs is
advancing quickly, one should not plan on using it in any sort of
production role for a while yet. As if to drive that point home, Btrfs
still crashes the system when the filesystem runs out of space. The v0.10
patch, like its predecessors, also changes the on-disk format.
The on-disk format change is one of the key features in this version of the
Btrfs patch. The format now includes back references on almost all objects
in the filesystem. As a result, it is now easy to answer questions like
"to which file does this block belong?" Back references have a few uses,
not the least of which is the addition of some redundant information which
can be used to check the integrity of the filesystem. If a file claims to
own a set of blocks which, in turn, claim to belong to a different file,
then something is clearly wrong. Back references can also be used to
quickly determine which files are affected when disk blocks turn bad.
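The cross-check described above can be sketched in a few lines. This is a toy model and not Btrfs code: the dictionaries file_extents and block_owner and both function names are hypothetical, chosen only to show how forward references and back references can be compared against each other.

```python
def find_inconsistencies(file_extents, block_owner):
    """Return (file, block, claimed_owner) for every forward/back mismatch.

    file_extents: forward references, file name -> list of block numbers.
    block_owner:  back references, block number -> owning file name.
    """
    problems = []
    for name, blocks in sorted(file_extents.items()):
        for block in blocks:
            owner = block_owner.get(block)
            if owner != name:
                problems.append((name, block, owner))
    return problems


def files_affected_by(bad_blocks, block_owner):
    """Use back references to map failing blocks to the files that own them."""
    return sorted({block_owner[b] for b in bad_blocks if b in block_owner})


# Hypothetical sample data: b.txt claims block 13, but block 13 says it
# belongs to a.txt, so something is clearly wrong.
file_extents = {"a.txt": [10, 11], "b.txt": [12, 13]}
block_owner = {10: "a.txt", 11: "a.txt", 12: "b.txt", 13: "a.txt"}

print(find_inconsistencies(file_extents, block_owner))
print(files_affected_by([11, 12], block_owner))
```

Without the back references, answering the second question would require walking every file in the filesystem looking for the bad blocks.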
Most users, however, will be more interested in another new feature which
has been enabled by the existence of back references: online resizing. It
is now possible to change the size of a Btrfs filesystem while it is
mounted and busy - this includes shrinking the filesystem. If the Btrfs
code has to give up some space, it can now quickly find the affected files
and move the necessary blocks out of the way. So Btrfs should work nicely
with the device mapper code, growing or shrinking filesystems as conditions
require.
Another interesting feature in v0.10 is the associated in-place ext3
converter. It is now possible to non-destructively convert an existing
ext3 filesystem to Btrfs - and to go back if need be. The converter works
by stashing a copy of the ext3 metadata found at the beginning of the disk, then
creating a parallel directory tree in the free space on the filesystem. So
the entire ext3 filesystem remains on the disk, taking up some space but
preserving a fallback should Btrfs not work out. The actual file data is
shared between the two filesystems; since Btrfs does copy-on-write, the
original ext3 filesystem remains even after the Btrfs filesystem has been
changed. Switching to Btrfs forevermore is a simple matter of deleting the
ext3 subvolume, recovering the extra disk space in the process.
Finally, the copy-on-write mechanism can be turned off now with a mount option. For
certain types of workloads, copy-on-write just slows things down without
providing any real advantages. Since (1) one of those workloads is
relational database management, and (2) Chris works for Oracle, the
only surprise here is that this option took as long as it did to arrive.
If multiple snapshots reference a given file, though, copy-on-write is
still performed; otherwise it would not be possible to keep the snapshots
independent of each other.
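The rule in the paragraph above (overwrite in place only when copy-on-write is disabled and nothing else shares the block) can be illustrated with a toy model. Everything here is an assumption made for illustration: the ToyVolume class, its nodatacow argument, and the per-block reference counts are hypothetical and do not reflect how Btrfs is actually implemented.

```python
class ToyVolume:
    """Toy block store; every name here is hypothetical, not a Btrfs interface."""

    def __init__(self, nodatacow=False):
        self.nodatacow = nodatacow
        self.blocks = {}      # block number -> data
        self.refs = {}        # block number -> number of trees referencing it
        self.next_block = 0

    def _alloc(self, data):
        n = self.next_block
        self.next_block += 1
        self.blocks[n] = data
        self.refs[n] = 1
        return n

    def create(self, data):
        """Create a one-block file; a 'file' is just a list of block numbers."""
        return [self._alloc(data)]

    def snapshot(self, file_map):
        """A snapshot shares every block with the original."""
        for n in file_map:
            self.refs[n] += 1
        return list(file_map)

    def write(self, file_map, index, data):
        old = file_map[index]
        if self.nodatacow and self.refs[old] == 1:
            self.blocks[old] = data       # unshared block: overwrite in place
            return
        # Shared block, or copy-on-write enabled: write the data elsewhere.
        # (A real filesystem would free the old block once its count hit zero.)
        self.refs[old] -= 1
        file_map[index] = self._alloc(data)


vol = ToyVolume(nodatacow=True)
f = vol.create("v1")
vol.write(f, 0, "v2")            # no snapshot yet: block 0 is reused
in_place = (f == [0])
snap = vol.snapshot(f)
vol.write(f, 0, "v3")            # block 0 is shared now, so it is copied
print(in_place, f, vol.blocks[snap[0]], vol.blocks[f[0]])
```

The second write has to go to a new block even though copy-on-write was turned off; otherwise the snapshot would silently see the new contents.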
For those who are curious about where Btrfs will go from here, Chris has
posted a
timeline describing what he plans to accomplish over the coming year.
Next on the list would appear to be "storage pools," allowing a Btrfs
filesystem to span multiple devices. Once that's in place, striping and
mirroring will be implemented within the filesystem. Longer-term projects
include per-directory snapshots, fine-grained locking (the filesystem
currently uses a single, global lock), built-in incremental backup support,
and online filesystem checking. Fixing that pesky out-of-space problem
isn't on the list, but one assumes Chris has it in the back of his mind
somewhere.
By Jonathan Corbet
January 15, 2008
There are a number of filesystem-related patches aimed at the upcoming
2.6.25 merge window; one of those is the
unprivileged mount patch by
Miklos Szeredi. This patch enables an unprivileged user process to call the
mount() system call and - in certain circumstances - have that
call actually succeed. It could eventually lead to a situation where users
have more flexibility to create their own environments and the setuid
mount utility is no longer needed.
This patch adds a new field (uid) to the vfsmount
structure, allowing the kernel to keep track of the owner of a specific
filesystem mount. The system administrator can give ownership of a
specific mount to a user with the new MNT_SETUSER flag. A common
pattern might be to bind-mount a user's home directory on top of itself,
giving the user the ownership of that mount. Once that
has been done, the user is allowed to freely mount other filesystems below
that mount point - with a couple of conditions:
- There is a system-wide limit on the number of allowed user mounts;
once that limit is hit, no more unprivileged mounts will be allowed
until somebody unmounts something. The current patch has no provision
for per-user or per-group mount limits, but such a feature would not
be particularly hard to add should the need arise.
- The filesystem type must be marked as being safe for unprivileged
mounts. Miklos notes that a filesystem must go through "a thorough
audit" before this flag can be set with any confidence. The patch, as
posted, marks the fuse filesystem (which allows for the creation of
filesystems implemented in user space) as being safe; fuse was
designed for this mode of operation in the first place. Bind mounts
are also allowed, with some additional conditions.
If the system allows the mount, the flags allowing for setuid and device
files will be forcibly cleared - unless the user has the requisite
capabilities anyway. Users are allowed to unmount filesystems they own,
again without privilege, but cannot unmount any others. Another new mount
flag (MNT_NOMNT) marks a specific filesystem as being the end of
the line - no unprivileged submounts are allowed below it.
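The conditions above amount to a short decision procedure for an unprivileged caller. The following is a rough sketch of that logic only; it is not the code from the patch. The Mount class, the check_user_mount() function, the SAFE_FSTYPES set, and the use of "nosuid" and "nodev" strings to stand for the cleared setuid and device permissions are all hypothetical. Bind mounts and callers holding the requisite capabilities are left out.

```python
from dataclasses import dataclass


@dataclass
class Mount:
    owner_uid: int          # stands in for the new uid field in vfsmount
    nomnt: bool = False     # stands in for MNT_NOMNT


# Filesystem types marked safe for unprivileged mounts (hypothetical set;
# the posted patch marks only fuse).
SAFE_FSTYPES = {"fuse"}


def check_user_mount(uid, parent, fstype, flags, user_mounts, max_user_mounts):
    """Return the effective flag set for the mount, or raise PermissionError."""
    if parent.owner_uid != uid:
        raise PermissionError("caller does not own the parent mount")
    if parent.nomnt:
        raise PermissionError("parent mount forbids unprivileged submounts")
    if fstype not in SAFE_FSTYPES:
        raise PermissionError("filesystem type not marked safe")
    if user_mounts >= max_user_mounts:
        raise PermissionError("system-wide user mount limit reached")
    # Setuid binaries and device files are forcibly disabled.
    return set(flags) | {"nosuid", "nodev"}


home = Mount(owner_uid=1000)
print(sorted(check_user_mount(1000, home, "fuse", {"rw"}, 3, 1024)))
```

A request from any other uid, for an unaudited filesystem type, or below a mount marked as the end of the line fails before the flags are even considered.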
The end result of
all this should be a mechanism by which users can organize their filesystem
hierarchies without any need for administrative privileges, and without the
risk of compromising system security.
One might well wonder why this change to the mount() system call
is called for, given that users have been able to do unprivileged mounts
for years. The answer is that the current mechanism has a couple of
shortcomings. Every potential unprivileged mount must be explicitly
enabled via a line in /etc/fstab. That works well for simple
situations, such as allowing a user to mount a CD or a USB storage device.
When users start wanting to do more complicated things, like mounting their
own special fuse filesystems, the /etc/fstab mechanism breaks
down. There is a separate, setuid program which grants the right to make
unprivileged fuse mounts, but it represents a workaround rather than a
proper solution.
The current user mount mechanism also requires that the mount
utility be installed setuid root. Every setuid binary is a potential
security hole, so there is value in eliminating privileged programs when
possible. The unprivileged mount patch offers the possibility of
eliminating the setuid mount program while simultaneously leaving policy
control in the hands of the system administrator. So, unless something
surprising comes up, chances are good that this capability will appear in
the 2.6.25 kernel.
By Jonathan Corbet
January 16, 2008
The ext3 filesystem uses the classic Unix block pointer method for keeping
track of the blocks in each file. For a given file, the on-disk inode
structure contains space for twelve block numbers; they point to the first
twelve blocks in the file - the first 48KB of space (with 4K blocks). If
the file is larger than that, a 13th pointer contains the address of the
first indirect block; this block contains another 1024 (on a 4K block filesystem)
block pointers. Should that not suffice, there's a 14th pointer for the
double-indirect block - each entry in that block is the address of an
indirect block. And if even that is not enough, there's a 15th entry
pointing to a triple-indirect block full of pointers to double-indirect
blocks.
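To get a feel for the overhead, the arithmetic above can be worked through in code. This is a back-of-the-envelope sketch: the function name indirect_blocks_needed is hypothetical, and the 4-byte pointer size is inferred from the 1024 pointers per 4K block quoted above.

```python
BLOCK_SIZE = 4096                               # 4K blocks, as in the text
POINTER_SIZE = 4                                # 1024 pointers per 4K block
PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE     # 1024

DIRECT = 12                                     # pointers held in the inode
SINGLE = PTRS_PER_BLOCK                         # blocks via the 13th pointer
DOUBLE = PTRS_PER_BLOCK ** 2                    # blocks via the 14th pointer
TRIPLE = PTRS_PER_BLOCK ** 3                    # blocks via the 15th pointer


def ceil_div(a, b):
    return -(-a // b)


def indirect_blocks_needed(data_blocks):
    """Count the indirect blocks (all levels) mapping a file of this size."""
    n = data_blocks - DIRECT
    if n <= 0:
        return 0
    total = 1                                   # the single-indirect block
    n -= SINGLE
    if n <= 0:
        return total
    covered = min(n, DOUBLE)
    total += 1 + ceil_div(covered, PTRS_PER_BLOCK)
    n -= DOUBLE
    if n <= 0:
        return total
    covered = min(n, TRIPLE)
    indirect = ceil_div(covered, PTRS_PER_BLOCK)
    total += 1 + ceil_div(indirect, PTRS_PER_BLOCK) + indirect
    return total


four_gib = (4 * 1024 ** 3) // BLOCK_SIZE        # data blocks in a 4GB file
print(DIRECT * BLOCK_SIZE // 1024, "KB reachable through direct pointers")
print(indirect_blocks_needed(four_gib), "indirect blocks for a 4GB file")
```

By this count a single 4GB file carries 1025 indirect blocks, each of which must be read back when the file is removed or checked.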
This is a very efficient representation for small files - the kinds of
files Unix systems typically held, once upon a time. In current times, when one can forget
about that directory full of DVD images and never even notice the lost
space, it does not work quite as well - there is a lot of overhead for all
of those individual block pointers, and a large data structure to manage.
That is why removing a large file on an ext3 filesystem can take a long
time - the system has to chase down all of those indirect blocks, which, in
turn, forces a lot of disk activity and head seeks. For this reason,
contemporary filesystems tend to use extent-based mechanisms to associate
blocks with files, but that is not really an option for ext3.
An additional problem with all those indirect blocks is that filesystem
checkers must locate and verify them all. That, again, causes a lot of
head seeking and makes fsck run slowly. Slow filesystem checking was the
motivation behind this patch from
Abhishek Rai which attempts to improve performance on filesystems with
a lot of indirect blocks.
The approach taken is relatively simple: the patch just tries to group
indirect block allocations together on the disk. The current ext3 code
will allocate indirect blocks when they are needed to account for data
blocks being added to the file; they are usually placed adjacent to those
data blocks. One might think that this placement would speed subsequent
accesses to the file, but that is not necessarily so; the reading or
writing of the indirect block will tend to happen at a different time than
operations on the data blocks. What this placement does accomplish,
though, is the distribution of the indirect blocks all over the disk. So a
process which must examine all of the indirect blocks associated with a
file must cause the disk to do a lot of head seeks.
The "metaclustering" approach works by reserving a set of contiguous
blocks at the end of each block group. Whenever an indirect block is
needed, the filesystem tries to get one from this dedicated area first.
The end result is that all of the indirect blocks are located next to each
other. Should somebody need to read a number of those blocks without being
interested in the contents of the data blocks, they can grab them all
quickly with minimal seeking. Filesystem checkers, as it happens, need to
do exactly that - as does the file removal process. The patch did not come
with benchmarks, but the speedup that comes from the elimination of all
those seeks should be significant.
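The difference between the two placement policies can be seen in a toy allocator. This sketch is only a model of the idea, not the patch itself: the BlockGroup class, its method names, and the tiny sizes are hypothetical, and direct blocks are ignored so that every group of ptrs_per_block data blocks gets its own indirect block.

```python
class BlockGroup:
    """Toy block group with a reserved area at its end (hypothetical model)."""

    def __init__(self, size, reserved):
        self.data_free = list(range(0, size - reserved))
        self.meta_free = list(range(size - reserved, size))

    def alloc_data(self):
        return self.data_free.pop(0)

    def alloc_indirect(self, metacluster=True):
        # Metaclustering: try the dedicated area at the end of the group first.
        if metacluster and self.meta_free:
            return self.meta_free.pop(0)
        # Otherwise the indirect block lands next to the data it maps.
        return self.data_free.pop(0)


def write_file(group, data_blocks, ptrs_per_block, metacluster):
    """Allocate a file; return the block numbers of its indirect blocks."""
    indirect = []
    for i in range(data_blocks):
        if i % ptrs_per_block == 0:
            indirect.append(group.alloc_indirect(metacluster))
        group.alloc_data()
    return indirect


scattered = write_file(BlockGroup(64, 8), 12, 4, metacluster=False)
clustered = write_file(BlockGroup(64, 8), 12, 4, metacluster=True)
print(scattered, clustered)
```

In the first layout the indirect blocks are interleaved with the data; in the second they sit in one contiguous run that can be read in a single pass.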
Even so, Andrew Morton questioned the need
for this patch, worrying that its benefits do not justify the risks that
come with modifying an established, heavily-used filesystem:
In any decent environment, people will fsck their ext3 filesystems
during planned downtime, and the benefit of reducing that downtime
from 6 hours/machine to 2 hours/machine is probably fairly small,
given that there is no service interruption.
Others disagreed, though, noting that it's the unplanned filesystem
checks which are often the most time-critical. That includes the
delightful "maximal mount count" boot-time check which, in your editor's
experience, always happens when one is trying to get set up to give a talk
somewhere. So this patch might just find eventual acceptance - it should
be relatively low-risk and does not require any on-disk format changes.
This is a filesystem patch, though, so nobody will be in any hurry to get
it into the mainline before a lot of testing and review has been done.
By Jonathan Corbet
January 15, 2008
LWN last looked at the unionfs filesystem almost exactly one year ago.
Things have been relatively
quiet on the unionfs front during much of that time, but unionfs has not
gone away. Now the unionfs developers are back with an improved version
and a determined push to get the code into 2.6.25. So another look seems
indicated.
The core idea behind unionfs is to allow multiple, independent filesystems
to be merged into a single, coherent whole. As an example, consider a user
with a distribution install DVD full of packages, a small disk, and
painfully slow bandwidth. It would be nice to keep the DVD-stored packages
around for future installation. What is also nice, though, is to be able
to keep a directory full of updates from the distributor and use those,
when they exist, in favor of the read-only DVD version. Using unionfs,
this user could mount the DVD read-only, then mount a writable filesystem
(for the updates) on top of the DVD. Updated packages go into the writable
filesystem, but all of the available packages are visible, together, in the
unified view. To avoid confusion, the user could delete obsoleted
packages, at which point they would no longer be visible in the unionfs
filesystem, even though they cannot actually be deleted from the underlying
DVD. Thus unionfs allows the creation of an apparently writable filesystem
on a read-only base; many other applications are possible as well.
If a user rewrites a file which is stored on a read-only "branch" of a
union filesystem, the response is relatively straightforward: the
newly-written file is stored on a higher-priority, writable branch. If no
such branch exists, the operation fails. Dealing with the deletion of a
file from a read-only branch is trickier, though. In this case, unionfs
will create a "whiteout" in the form of a special file (starting with
.wh.) on a writable branch. Some reviewers have disliked this
approach since it will clutter the upper branch with those special files
over time. But it is hard to come up with another way to handle deletion,
especially if (as is the case here) your goal is to keep core VFS changes
to an absolute minimum.
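The lookup, rewrite, and whiteout behavior described above can be modeled in a few dozen lines. This is a simplified sketch and not unionfs code: branches are plain dictionaries ordered from highest to lowest priority, and the Branch class and the lookup(), write() and delete() helpers are hypothetical. Only the .wh. prefix comes from the description above.

```python
WHITEOUT_PREFIX = ".wh."


class Branch:
    def __init__(self, files, writable):
        self.files = dict(files)     # file name -> contents
        self.writable = writable


def lookup(branches, name):
    """Search branches from highest to lowest priority."""
    for branch in branches:
        if WHITEOUT_PREFIX + name in branch.files:
            return None              # hidden by a whiteout
        if name in branch.files:
            return branch.files[name]
    return None


def _writable_above(branches, name):
    """First writable branch not below a read-only copy of the file."""
    for branch in branches:
        if branch.writable:
            return branch
        if name in branch.files:
            break                    # read-only copy with nothing writable above
    return None


def write(branches, name, data):
    target = _writable_above(branches, name)
    if target is None:
        raise PermissionError("no suitable writable branch")
    target.files[name] = data
    target.files.pop(WHITEOUT_PREFIX + name, None)


def delete(branches, name):
    target = _writable_above(branches, name)
    if target is None:
        raise PermissionError("no suitable writable branch")
    target.files.pop(name, None)
    if any(name in b.files for b in branches):
        target.files[WHITEOUT_PREFIX + name] = ""   # mask the lower copies


dvd = Branch({"foo.rpm": "v1", "bar.rpm": "v1"}, writable=False)
updates = Branch({}, writable=True)
union = [updates, dvd]

write(union, "foo.rpm", "v2")
delete(union, "bar.rpm")
print(lookup(union, "foo.rpm"), lookup(union, "bar.rpm"), sorted(updates.files))
```

The deleted package still exists on the DVD branch; only the .wh.bar.rpm entry on the writable branch keeps it out of the unified view, which is exactly the clutter the reviewers object to.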
That hasn't kept the unionfs developers from trying, though. Off to the
side, they have a version of unionfs which maintains a small,
special-purpose partition of its own (on writable storage). Metadata
(whiteouts, in particular) is stored to this special unionfs partition and no
longer clutters the component filesystems. There are other advantages to
the dedicated partition scheme, including the ability to include one
unionfs as a branch in a second union; see the unionfs ODF
document for more information on this approach, which the developers
hope to slowly migrate into the version they are currently proposing for
the mainline.
Another persistent problem with unionfs has been coping with modifications
made directly to the component branches without going through the union. The
January 2007 version of the patch came packaged with some dire warnings:
direct modification of unionfs branches could lead to system crashes and
data loss. Given that filesystems which have been bundled into a union
still exist independently, they will always present a tempting target for
modification, even when there is not a specific reason (wanting to put
files onto a specific component filesystem, for example). So a unionfs
implementation which cannot handle such modifications sets a trap for every
user who uses it.
The developers claim to have solved this problem in the current version of the
patch. Now, almost every entry into the unionfs code causes it to check the
modification times for the relevant file in all layers of the union. If
the file turns out to have been changed, unionfs will forget about the file
and reload the information from scratch, causing the most current version
of the file (or directory) to be visible to the user. This approach solves
the problem in a relatively efficient manner, with one exception: unionfs
cannot tell when a process modifies a file which it has mapped into its
address space with mmap(). So, in that case, changes may not be
visible to processes accessing the affected file through the unionfs.
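The revalidation scheme can be sketched as a cache keyed on modification times. Again, this is a toy model under stated assumptions and not the unionfs implementation: the UnionCache class is hypothetical, each branch is a dictionary mapping a name to an (mtime, data) pair, and the reloads counter exists only to show when the cached state is thrown away.

```python
class UnionCache:
    """Toy cache that revalidates against branch mtimes on every access."""

    def __init__(self, branches):
        self.branches = branches     # highest priority first
        self.cache = {}              # name -> (mtimes seen at load, data)
        self.reloads = 0

    def _mtimes(self, name):
        return tuple(b[name][0] if name in b else None for b in self.branches)

    def _load(self, name):
        self.reloads += 1
        for b in self.branches:
            if name in b:
                return b[name][1]
        return None

    def read(self, name):
        current = self._mtimes(name)
        entry = self.cache.get(name)
        if entry is None or entry[0] != current:
            # Something changed underneath us: forget and reload from scratch.
            entry = (current, self._load(name))
            self.cache[name] = entry
        return entry[1]


lower = {"f": (100, "old")}
union = UnionCache([{}, lower])

first = union.read("f")
union.read("f")                       # unchanged mtimes: served from cache
lower["f"] = (200, "new")             # direct modification of the branch
second = union.read("f")
print(first, second, union.reloads)
```

The mmap() exception falls out of the model as well: a change that does not show up in the recorded modification time at the moment of the check leaves the stale entry in place.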
In both cases, the unionfs developers would really prefer to have better
support from the VFS. Some operating systems have provided native support
for whiteouts, but Linux lacks that support. There is also no way for a
filesystem at the bottom of a stack of filesystems to notify the higher
layers that something has been changed. Fixing either of these would
require significant VFS modifications, though, and the changes might
propagate down into the individual filesystem implementations as well. So
nobody is expecting them to happen anytime soon.
Another significant change in unionfs is the elimination of the
ioctl() interface for the management of branches. All changes to
an existing unionfs are now done using the remount option of the
mount command. This change eliminates the need for a separate
utility for unionfs configuration and makes it possible to do complicated
changes in an atomic manner.
The end result of all this is that the unionfs hackers think that the time
has come to put the code into the mainline. There, it would become the
second supported stacking filesystem (the first being eCryptfs), and would
help toward the long-term goal of making the VFS layer work better with
stacking. Some people speak as if the merging of unionfs into 2.6.25 is a
done deal, but that is not yet guaranteed. Christoph Hellwig, whose
opinion on such things carries considerable weight, is opposed to the unionfs idea:
I think we made it pretty clear that unionfs is not the way to go,
and that we'll get the union mount patches clear once the
per-mountpoint r/o and unprivileged mount patches series are in
and stable.
Unionfs hacker Erez Zadok responds that
unionfs is working - and used - now, while getting union support into the
VFS is a distant prospect. So he recommends:
I think a better approach would be to start with Unionfs (a
standalone file system that doesn't touch the rest of the kernel).
And as Linux gradually starts supporting more and more features
that help unioning/stacking in general, to change Unionfs to use
those features (e.g., native whiteout support). Eventually there
could be basic unioning support at the VFS level, and concurrently
a file-system which offers the extra features (e.g., persistency).
When one looks at a recent posting of the union mount patches, it's hard
to see them as a near-term solution. As described by its author (Bharata
Rao), this work is in an early, exploratory state; there are a number of
problems for which solutions are not really in sight. The union mount
approach, which does the hard work in the VFS layer, may well be the right
long-term approach, but it will not be in a state where it can be shipped
to users anytime soon.
In the end, the problem is a hard one, and unionfs has a considerable lead
toward being a real solution. That, alone, is not enough to guarantee that
unionfs will make it into the 2.6.25 kernel, but it does help that cause
considerably. Anybody opposing the merger of unionfs will have to explain
why the union filesystem capability should not be available to Linux users
in 2008.
Page editor: Jonathan Corbet