Kernel development
Brief items
Kernel release status
The current development kernel remains 3.1-rc9; no prepatches have been released over the last week. The final 3.1 release can be expected in the near future, once Linus returns from vacation. Numerous subsystem trees have returned to kernel.org over the last week; things should be back to something resembling normal by the time the 3.2 merge window opens.
Stable updates: none have been released in the last week. The 3.0.7 update is in the review process as of this writing; the release can be expected sometime on or after October 13.
Quotes of the week
A Plumber's Wish List for Linux
Kay Sievers, Lennart Poettering, and Harald Hoyer have put out a "wish list" of features they (and other plumbers) would like to see added to Linux. The list covers wishes for a bunch of different areas including filesystems, capabilities, control groups, module loading, Unix domain sockets, and more.
Acknowledging that this wish list of ours only gets longer and not shorter, even though we have implemented a number of other features on our own in the previous years, we are posting this list here, in the hope to find some help.
Kernel development news
Running distributions in containers
One of the requests in the recently posted "Plumber's Wish List" was for a way for a process to reliably detect that it isn't in the root PID namespace (i.e. is in a "container", at least by some definition). That wish sparked an interesting discussion on linux-kernel about the nature of containers and what people might use them for. Some would like to be able to run standard Linux distributions inside a container, but others are not so sure that is a useful goal.
A container is a way to isolate a group of processes from the rest of a running Linux system. By using namespaces, that group can have its own private view of the OS—though, crucially, sharing the same kernel with whatever else is running—with its own PID space, filesystems, networking devices, and so on. Containers are, in some ways, conceptually similar to virtualization, with the separate vs. shared kernel being the obvious user-visible difference between the two. But there are straightforward ways to detect that you are running under virtualization and that is not true for containers/namespaces.
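That asymmetry is easy to demonstrate on the virtualization side: on x86 guests, the hypervisor announces itself through a CPUID bit, and Linux surfaces it as the "hypervisor" flag in /proc/cpuinfo. A minimal sketch (Linux- and x86-specific; there is no analogous flag for containers, which is the whole problem):

```python
# Detect full virtualization by looking for the "hypervisor" CPU flag
# that Linux exposes in /proc/cpuinfo on x86 guests. No such flag
# exists for containers, which share the host kernel outright.

def running_under_hypervisor(cpuinfo_path="/proc/cpuinfo"):
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return "hypervisor" in line.split()
    except OSError:
        pass
    return False

if __name__ == "__main__":
    print("virtualized" if running_under_hypervisor()
          else "bare metal (or a container)")
```

Note that a False result is ambiguous: it could mean bare metal, a non-x86 system, or a container, which is precisely why a separate container-detection mechanism is being asked for.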
Lennart Poettering—one of the wishing plumbers—outlined the need for detecting whether a process is running in a child PID namespace:
He goes on to list a number of different things that are not "virtualized" by namespaces, including sysfs, /proc/sys, SELinux, udev, and more. Standard Linux distributions currently assume that they have full control of the system, and the init process will do a wide variety of unpleasant things when it runs inside a container. Distributions could make use of a reliable way of detecting containerization to avoid (or change) actions with effects outside the container.
Poettering went on to point out that "having a way to detect execution in a container is a minimum requirement to get general purpose distribution makers to officially support and care for execution in container environments".
Eric W. Biederman, who is one of the namespace developers, agreed with the idea: "I agree getting to the point where we can run a standard distribution unmodified in a container sounds like a reasonable goal." He suggested two possible solutions for a straightforward detection scheme (either putting a file into the container's root directory or modifying the output of uname), but also started looking at all of the different areas that will need to be addressed to make it possible to run distributions inside containers. Much of that depends on finishing up the work on user (i.e. UID) namespaces.
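The uname-based approach builds on machinery the kernel already has: a container manager can unshare the UTS namespace and set a distinctive node name there without touching the host. A rough sketch using ctypes (the CLONE_NEWUTS value comes from linux/sched.h; unshare() needs CAP_SYS_ADMIN, so failure is handled rather than raised):

```python
# Sketch: give the current process its own UTS namespace so that
# uname() output (the nodename) can be changed without affecting the
# host. Requires CAP_SYS_ADMIN; without it, unshare() fails with EPERM.
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # from <linux/sched.h>

def enter_private_uts(nodename=b"container"):
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUTS) != 0:
        return False  # typically EPERM without CAP_SYS_ADMIN
    # The new hostname is visible only within this UTS namespace.
    return libc.sethostname(nodename, len(nodename)) == 0

if __name__ == "__main__":
    if enter_private_uts():
        print("private UTS namespace; nodename is now", os.uname()[1])
    else:
        print("could not unshare the UTS namespace (need CAP_SYS_ADMIN)")
```

A process inspecting uname output for a well-known marker string would then have its detection scheme, at the cost of establishing that marker as a convention.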
But Ted Ts'o is a bit skeptical of the need to run full distributions inside a container. The advantage that containers have over virtual machines (VMs) is that they are lighter weight, he said, and adding multiple copies of system services (he mentions udev and D-Bus) starts to remove that advantage. He wonders if it makes more sense to just use a VM:
In a second message, Ts'o expands on his thinking, particularly regarding security. He is not optimistic about using containers that way: "given that kernel is shared, trying to use containers to provide better security isolation between mutually suspicious users is hopeless". The likelihood that an "isolated" user can find a local privilege escalation is just too high, and that will allow the user to escape the container and compromise the system as a whole. He is concerned that adding in more kernel complexity to allow distributions to run unchanged in containers may be wasted effort:
Biederman, though, thinks that there are situations where it would be convenient to be able to run distribution images "just like I find it [convenient] to loopback mount an iso image to see what is on a disk image". But firing up KVM to run the distribution may be just as easy, and works today, as Ts'o pointed out. There are more platforms out there than just those that KVM supports, however, so Biederman believes there is a place for supporting containerized distributions:
In the end, Biederman is not convinced that there is a "good reason to have a design that doesn't allow you to run a full userspace". He also notes that with the current implementation of containers (i.e. without UID namespaces), all users in the container are the same as their counterparts outside the container, and that includes the root user. Adding UID namespaces would allow a container to partition its users from those of the "external" system, so that root inside the container can't make changes that affect the entire system:
UID namespaces are still a ways out, Biederman said, so problems with global sysctl settings from within containers can still cause weirdness, but "once the user namespaces are in place accessing a truly global sysctl will result in EPERM when you are in a container and everyone will be happy. ;)". There are some interesting implications of UID namespaces that may eventually need to be addressed, he said, including persistent UIDs in filesystems:
We already have things like user mapping in 9p and nfsv4 so it isn't wholly uncharted territory. But it could get interesting.
Interesting indeed. One might wonder whether there will be some pushback from other kernel hackers about adding mapping layers to filesystems (presumably in the VFS code so that it works for all of them). Since virtualization can solve many of the problems that are still being worked on in containers (at least for some hardware platforms), there may be questions about adding further kernel complexity to support full-scale containerization as envisioned by Biederman (and others). That is essentially the argument that Ts'o is making, and one might guess that others have similar feelings.
In any case, no patches have yet appeared for detecting that a process is running in a container, but it may not require any changes to the kernel. Poettering mentioned that LXC containers set an environment variable that processes can use for that purpose, and Biederman seemed to think that might be a reasonable solution (and wouldn't require kernel changes as it is just a user-space convention). Making a new UTS namespace (and changing the output of uname) as Biederman suggested would be another way to handle the problem from user space. That part seems like it will get solved in short order, but the more general questions of containers and security isolation are likely to be with us for some time to come.
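The LXC convention is exactly the kind of user-space-only check described above: the container manager sets an environment variable for the container's init (LXC uses one named "container"), and any interested process simply looks for it. A minimal sketch of the consumer side:

```python
# Sketch of the user-space convention: a container manager (LXC here)
# sets the "container" environment variable for the container's init
# process, and any process can check it to decide whether to skip work
# that only makes sense on a full host.
import os

def container_type(environ=os.environ):
    """Return the container technology name, or None on bare systems."""
    return environ.get("container")  # e.g. "lxc" when started by LXC

if __name__ == "__main__":
    kind = container_type()
    if kind:
        print("running in a container:", kind)
    else:
        print("no container detected (or the manager did not say so)")
```

The weakness, of course, is the parenthetical in the last line: an environment variable is only as reliable as the chain of processes that preserves it, which is part of why some would prefer a kernel-backed mechanism.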
Securely deleting files from ext4 filesystems
Deleting a file is sufficient to make it go away as far as the directory structure is concerned; that important, not-backed-up document is gone before the careless user has even removed a repentant finger from the "enter" key. On the underlying storage device, though, part or all of the doomed file can live on indefinitely. A suitably determined person can often recover data from a file that was thought to be deleted and gone. In some situations, this persistence of data can be most unwelcome. Paper shredders exist for situations where recovery of a "deleted" paper document is undesirable; there is clear value in similar functionality for filesystem-based files.
A look at the chattr man page indicates that "secure delete" functionality can be enabled for a file by setting the right attribute flag. The only problem is that most filesystems do not actually honor that flag; it's a sort of empty security theater. Even the TSA does a better job. The ext4 secure delete patch set from Allison Henderson may fill that particular gap for the ext4 filesystem, but a few things will need to be fixed up first.
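Setting that attribute is what "chattr +s" does under the hood: a pair of ioctl() calls on the file. A sketch of the mechanism (the ioctl numbers below assume x86_64, and many filesystems reject the ioctl entirely, which is reported rather than raised):

```python
# Sketch: set the "secure delete" attribute (chattr +s) on a file via
# the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls. Whether the filesystem
# then honors the flag is a separate question - most, including ext4
# before this patch set, do not. Ioctl numbers assume x86_64.
import array
import fcntl
import os

FS_IOC_GETFLAGS = 0x80086601  # _IOR('f', 1, long) on x86_64
FS_IOC_SETFLAGS = 0x40086602  # _IOW('f', 2, long) on x86_64
FS_SECRM_FL = 0x00000001      # secure deletion (the chattr 's' flag)

def set_secure_delete(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        flags = array.array("l", [0])
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, flags, True)
        flags[0] |= FS_SECRM_FL
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, flags, True)
        return True
    except OSError:
        return False  # filesystem does not support attribute flags
    finally:
        os.close(fd)
```

On a filesystem such as tmpfs the ioctl fails with ENOTTY; on ext4 it succeeds today, and the flag is simply ignored, which is the gap the patch set aims to close.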
The core part of the patch is relatively straightforward: if part or all of a file is being deleted, the relevant blocks will be overwritten synchronously as part of the operation. By default, the blocks are overwritten with zeroes, but there is an option to use random bytes from the kernel's entropy pool instead. A bit of care must be taken when the file involved contains holes - it wouldn't do to overwrite blocks that don't exist. There is also some logic to take advantage of the "secure discard" feature supported by some devices (solid-state disks, primarily) if that feature is available. Secure discard handles the deletion internally to the device - perhaps just by marking the relevant blocks unreadable until something else overwrites them - eliminating the need to perform extra I/O from the kernel.
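Until the kernel does this work itself, the patch's default mode can be approximated, imperfectly, from user space by zero-filling a file before unlinking it. A sketch of that best-effort approach:

```python
# User-space approximation of the patch's default behavior: overwrite
# a file's blocks with zeros (synchronously, via fsync) before
# unlinking it. This is best-effort only - it cannot reach copies in
# the journal or blocks remapped by flash devices, which is exactly
# why in-kernel support (and "secure discard") matters.
import os

def shred_and_unlink(path, chunk=64 * 1024):
    size = os.path.getsize(path)
    fd = os.open(path, os.O_WRONLY)
    try:
        zeros = b"\0" * chunk
        remaining = size
        while remaining > 0:
            n = min(remaining, chunk)
            os.write(fd, zeros[:n])
            remaining -= n
        os.fsync(fd)  # force the zeros out to stable storage
    finally:
        os.close(fd)
    os.unlink(path)
```

The shortcomings listed in the comment - journal copies, holes, device-level remapping - are the very details the kernel patch has to handle, as described below.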
The job does not stop there, though. The very existence of the file - or information contained in its name - could be sensitive as well. So the secure delete code must also clear out the directory entry associated with the file (if it is being deleted, as opposed to just truncated, obviously). Associated metadata - extended attributes, access control lists, etc - must also be cleaned out in this way.
Then there is the little issue of the journal. At any given time, the journal could contain any number of blocks from the deleted file, possibly in several versions. Clearly, sanitizing the journal is also required, but it must be done carefully: clearing journal blocks before the associated transaction has been committed could result in a corrupted filesystem and/or the inability to recover properly from a crash. So the first thing that must happen is a synchronous journal flush.
Once any outstanding transactions have been cleared, the (now old and unneeded) data in the journal can be cleaned up. The only problem is that the journal does not maintain any association between its internal blocks and the files they belong to in the filesystem. The patch addresses that problem by adding a new data structure mapping between journal blocks and file blocks; the secure deletion code then traverses that structure in search of blocks in need of overwriting.
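Conceptually, the structure the patch adds is just a reverse map from journal blocks back to the file blocks they shadow. A toy model (all names and layout here are illustrative, not the patch's actual code):

```python
# Toy model of the journal-to-file block map the patch introduces: as
# file blocks are copied into the journal, record where each journal
# block came from; at secure-delete time, look up every journal copy
# of the doomed file's blocks so they can be overwritten too.
# All names are illustrative - this is not the actual kernel code.

class JournalBlockMap:
    def __init__(self):
        self._map = {}  # journal block number -> (inode, file block number)

    def record(self, journal_block, inode, file_block):
        self._map[journal_block] = (inode, file_block)

    def blocks_to_scrub(self, inode):
        """Journal blocks holding copies of the given inode's data."""
        return [jb for jb, (ino, _fb) in self._map.items() if ino == inode]

if __name__ == "__main__":
    m = JournalBlockMap()
    m.record(1000, inode=42, file_block=7)
    m.record(1001, inode=42, file_block=8)
    m.record(1002, inode=99, file_block=3)
    print(sorted(m.blocks_to_scrub(42)))  # journal copies of inode 42
```

Maintaining such a map in the filesystem, rather than the journal layer, is the layering complaint discussed next.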
The journal cleanup code drew some complaints from developers; the first problem is that directory and metadata blocks are not cleared from the journal. The deeper complaint, though, was that it represented an excessive mixing of the two layers; the filesystem really should not have an overly deep understanding of how the journal works. The end result is that some of this cleanup work is likely to move into the jbd2 journaling layer; as an added benefit, other filesystems should then be able to take advantage of it as well.
Darrick Wong also pointed out that this block mapping is not, itself, written to the journal; if the system were to crash before the journal could be cleaned, that cleanup would not happen after a reboot. He suggested that filesystems should mark blocks in need of secure deletion when they are first added to a journal transaction; the journal code could take care of things from then on. It seems like a cleaner solution, but there is, naturally, a catch.
That catch is that there is no way to create a file with the "secure delete" attribute set; that attribute must be added after the fact. One would assume that users would ask for secure delete before actually writing any data to the file, but associated information (the file name, for example) must exist before the file can be so marked. So the journal may well contain blocks for "secure delete" files that will not be marked for overwriting. There is no way to fix that within the existing POSIX API.
Darrick suggested a few ways to deal with this problem. One is to just give up and tell developers that, if even the name of the file is important, they should create that file under a temporary name, set the "secure delete" attribute, then rename the file to its real name. An alternative would be to mark the blocks for all newly-created files as needing overwriting; that would eliminate the race to get the attribute set at the cost of slowing down all file creation operations. Or, he said, the journal could track the mapping between blocks and files, much as Allison's patch does. One other option would be to mark the entire filesystem as needing secure delete; that feature, evidently, is already in the works, but enabling it will have an obvious performance cost.
The end result is that there are a few details to be worked out yet. The feature seems useful, though, for users who are concerned about data from their files hanging around after the files themselves have been deleted. Perhaps, around the 3.3 kernel or thereafter, "chattr +s" on ext4 will be more than bad security theater.
Whither btrfsck?
The btrfs filesystem was merged into the mainline in January, 2009 for the 2.6.29 kernel release. Since then, development on the filesystem has accelerated to the point that many consider it ready for production use and some distributions are considering using it by default. The filesystem itself is nearly functionally complete and increasingly stable, but there is still one big hole: there is no working filesystem checker for Btrfs. As user frustration over the lack of this essential utility grows, an interesting question arises: is some software too dangerous to be released early?
This tool (called "btrfsck") has been under development for some time but, despite occasional hints to the contrary, it has never escaped from Chris Mason's laptop into the wild. This delay has had repercussions elsewhere; Fedora's plan to move to btrfs by default, for example, cannot go forward without a working filesystem checker. Most recently, Chris said that he hoped to be able to demonstrate the program at the upcoming LinuxCon Europe event. That, however, was not enough for some vocal users who have started to let it be known that their patience has run out. Thus we've seen accusations that Oracle really intends to keep btrfsck as a private, proprietary tool, and statements that "It's really time for Chris Mason to stop disgracing the open source community and tarnishing Oracle's name." Those are strong words directed at somebody who has done a lot to create a next-generation filesystem for Linux.
Your editor would like to be the first to say that both the open source community and Oracle benefit greatly from Chris's presence. The cynical might add that Oracle has delegated the task of "tarnishing its name" to employees who are more skilled in that area. That said, it is worth examining why btrfsck remains under wraps; had the tool been put out in the open - the way the filesystem itself was - chances are good that others would have helped with its development. One could argue that the failure to release btrfsck in any form has almost certainly retarded its development and, thus, the adoption of btrfs as a whole.
According to Chris, the early merging of btrfs was important for the creation of the filesystem's development community:
But, he says, the filesystem checker ("fsck") is a bit different, and is not ready yet even for the braver users:
Josef Bacik expressed the fears that keep btrfsck out of the community more clearly:
He went on to say "Release early and release often is nice for web browsers and desktop environments, it's not so nice with things that could result in data loss". This is a claim that raises some interesting questions, to say the least.
One could start by questioning the wisdom of running a new filesystem like btrfs in production with no backups and no working filesystem repair tool. How is it that releasing the filesystem itself is OK, but releasing the repair tool presents too much of a risk for users? How does that tool really differ from a web browser, especially given that the browser is exposed to all the net can throw at it and bugs can easily lead to exposure of users' credentials or the compromise of their systems? There is no shortage of software out there that can badly bite its users when things go wrong.
That said, there are some unique aspects to the development of filesystem repair tools. They are invoked when things have already gone wrong, so the usual rules of how the filesystem should be structured are out the window. They must perform deep surgery on the filesystem structure to recover from corruptions that may be hard to anticipate and correct; one could paraphrase Tolstoy and say that happy filesystems are all alike, but every corrupted filesystem is unhappy in its own way. As the checker tries to cope with a messed-up filesystem, it works in an environment where any change it makes could turn a broken-but-recoverable filesystem into one that is a total loss. In summary, btrfsck will not be an easy tool to write; it is a job that is almost certainly best left to developers with a lot of filesystem experience and who understand btrfs to its core. That narrows the development pool to a rather small and select group.
And, in the end, no responsible developer wants to release a tool which, in his or her opinion, could create misery for its users. Those users will run btrfsck on their filesystems regardless of any blood-curdling warnings that it may put up first; if it proceeds to destroy their data, they will not blame themselves for their loss. If Chris does not yet believe that he can responsibly release btrfsck for wider use, it is not really our place to second-guess his reasoning or to tell him that he should release it anyway. Anybody who feels they cannot trust him to make that decision probably should not be running the filesystem he designed to begin with.
Releasing software early and often is, in general, good practice for free software development; keeping code out of the public eye often does not benefit it in the long run. Perhaps btrfsck has been withheld for too long, but that is not our call to make. The need for the tool is clear - if nothing else, Oracle has decided to go with btrfs by default in the near future. There can be no doubt that this need is creating a fair amount of pressure. The LinuxCon demonstration may or may not happen, but btrfsck seems likely to make its much-delayed debut before too much longer.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Page editor: Jonathan Corbet
