The current development kernel remains 3.1-rc9; no prepatches have
been released over the last week. The final 3.1 release can be expected in
the near future, once Linus returns from vacation.
Numerous subsystem trees have returned to kernel.org over the last week;
things should be back to something resembling normal by the time the 3.2
merge window opens.
Stable updates: none have been released in the last week. The 3.0.7
update is in the review process as of this
writing; the release can be expected sometime on or after October 13.
My PhD was in biology, working on fruitflies. They're a poorly
documented set of layering violations which only work because of
side-effects at the quantum level, and they tend to die at
inconvenient times. They're made up of 165 million bases of a byte
code language that's almost impossible to bootstrap and which
passes through an intermediate representation before it does
anything useful. It's an awful field to try to do rigorous work
in because your attempts to impose any kind of meaningful order on
what you're looking at are pretty much guaranteed to be
sufficiently naive that your results bear a resemblance to reality
more by accident than design.
-- Matthew Garrett
What a strange function. I look forward to seeing the documentation.
-- Andrew Morton
Kay Sievers, Lennart Poettering, and Harald Hoyer have put out a "wish list" of features they (and other plumbers) would like to see added to Linux. The list covers wishes for a bunch of different areas including filesystems, capabilities, control groups, module loading, Unix domain sockets, and more.
We'd like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.
Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in the previous years, we are posting this list here, in the
hope to find some help.
Kernel development news
One of the requests in the recently posted "Plumber's Wish List" was for a way for a
process to reliably detect that it isn't in the root PID namespace (i.e. is
in a "container", at least by some definition). That wish sparked an
interesting discussion on linux-kernel about the nature of containers and
what people might use them for. Some would like to be able to run standard
Linux distributions inside a container, but others are not so sure that is
a useful goal.
A container is a way to isolate a group of processes from the rest of a
running Linux system. By using namespaces, that group can have its own
private view of the OS—though, crucially, sharing the same kernel
with whatever else is running—with its own PID space, filesystems,
networking devices, and so on. Containers are, in some ways, conceptually
similar to virtualization, with the separate vs. shared kernel being the
obvious user-visible difference between the two. But there are
straightforward ways to detect that you are running under virtualization
and that is not true for containers/namespaces.
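On x86, for instance, hardware hypervisors set the CPUID "hypervisor present" bit, which the kernel surfaces as a flag in /proc/cpuinfo. A minimal sketch of such a check (Python; this assumes a Linux x86 /proc layout, and there is no comparable kernel-provided signal for containers):

```python
def running_under_hypervisor(cpuinfo_text):
    # The CPUID "hypervisor present" bit shows up as the "hypervisor"
    # flag in /proc/cpuinfo on x86 Linux; containers have no equivalent.
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "hypervisor" in line.split(":", 1)[1].split()
    return False

if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(running_under_hypervisor(f.read()))
```

No such one-line test exists for namespaces, which is precisely the gap the plumbers' wish list points at.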
Lennart Poettering—one of the wishing plumbers—outlined the need for detecting whether a
process is running in a child PID namespace:
To make a standard distribution run nicely in a Linux container you
usually have to make quite a number of modifications to it and disable
certain things from the boot process. Ideally however, one could simply
boot the same image on a real machine and in a container and would just
do the right thing, fully stateless. And for that you need to be able to
detect containers, and currently you can't.
He goes on to list a number of different things that are not "virtualized"
by namespaces, including sysfs, /proc/sys, SELinux, udev, and
more. Standard Linux distributions currently assume that they have full
control of the system and the init process will do a wide variety of
unpleasant things when it runs inside a container. An init system could
make use of a reliable way of detecting containerization to avoid (or
change) actions with effects outside the container.
Poettering went on to point out that
"having a way to detect execution in a container is
a minimum requirement to get general purpose distribution makers to
officially support and care for execution in container environments".
Eric W. Biederman, who is one of the namespace developers, agreed with the idea: "I agree getting to the point where we can run a standard distribution
unmodified in a container sounds like a reasonable goal." He
suggested two possible solutions for a straightforward detection scheme
(either putting a file into the container's root directory or modifying the
output of uname), but also started looking at all of the different
areas that will need to be addressed to make it possible to run
distributions inside containers. Much of that depends on finishing up the
work on user (i.e. UID) namespaces.
But Ted Ts'o is a bit skeptical of the need
to run full distributions inside a container. The advantage that
containers have over virtual machines (VMs) is that they are lighter weight, he
said, and adding multiple copies of system services (he mentions udev and
D-Bus) starts to remove that advantage. He wonders if it makes more sense
to just use a VM:
If you end up [with] so much overhead to provide the desired security and/or
performance isolation, then it becomes fair to ask the question
whether you might as well pay a tad bit more and get even better
security and isolation by using a VM solution....
In a second message, Ts'o expands on his
thinking, particularly regarding security. He is not optimistic about
using containers that way: "given that kernel is shared, trying to use
containers to provide better security isolation between mutually
suspicious users is hopeless". The likelihood that an "isolated"
user can find a local privilege escalation is just too high, and that will
allow the user to escape the container and compromise the system as a whole. He is concerned that adding in more kernel complexity to allow
distributions to run unchanged in containers may be wasted effort:
So if you want that kind of security isolation, you shouldn't be using
containers in the first place. You should be using KVM or Xen, and
then only after spending a huge amount of effort fuzz testing the
KVM/Xen paravirtualization interfaces. So at least in my mind, adding
vast amounts of complexities to try to provide security isolation via
containers is really not worth it.
Biederman, though, thinks that there are
situations where it would be convenient to be able to run a distribution
in a container, "like I find it [convenient] to loopback mount an iso
image to see what is on a disk image". But, firing up KVM to run the distribution
may be just as easy, and works today, as Ts'o pointed out.
There are more platforms out there than just those that KVM supports, however,
so Biederman believes there is a place for
supporting containerized distributions:
You can test a lot more logical machines interacting
with containers than you can with vms. And you can test on all the
[architectures] and platforms linux supports not just the handful that are
well supported by hardware virtualization.
In the end, Biederman is not convinced that there is a "good reason to have a design that doesn't allow you to run a full
userspace". He also notes that with the current implementation of
containers (i.e. without UID namespaces), all users in the container are
the same as their counterparts outside the container, and that includes the
root user. Adding UID namespaces would allow a container to partition its
users from those of the "external" system, so that root inside the
container can't make changes that affect the entire system:
With user namespaces what we get is that the global root user is not the
container root user and we have been working our way through the
permission checks in the kernel to ensure we get them right in the
context of the user namespace. This trivially means that the things
that we allow the global root user to do in /proc/ and /sysfs and
the like simply won't be allowed as a container root user. Which
makes doing something stupid and affecting other people much more
difficult.
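The translation Biederman describes can be modeled as a set of (inside, outside, length) ranges mapping container UIDs onto a disjoint band of global UIDs, the format the /proc/&lt;pid&gt;/uid_map interface eventually adopted. A purely illustrative sketch (the in-kernel implementation differs):

```python
def map_uid(container_uid, uid_map):
    # uid_map: list of (inside, outside, length) triples; a container uid
    # falling in an "inside" range is translated to the global "outside"
    # range. Container root (uid 0) thus need not be global root.
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped: the kernel reports an overflow uid instead

# container root maps to the unprivileged global uid 100000
uid_map = [(0, 100000, 65536)]
```

Under such a map, a container's root user holds no privilege over objects owned by the global root user, which is what makes the permission-check audit Biederman mentions worthwhile.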
UID namespaces are still a ways out, Biederman said, so problems with
global sysctl settings from within containers can still cause weirdness,
but "once the
user namespaces are in place accessing a truly global sysctl will
result in EPERM when you are in a container and everyone will be
happy. ;)". There are some interesting implications of UID
namespaces that may
eventually need to be addressed, he said, including persistent UIDs in filesystems:
once we have all of the permission checks in the kernel tweaked to care
about user namespaces we next look at the filesystems. The easy
initial implementation is going to be just associating a user namespace
with a super block. But farther out being able to store uids from
different user namespaces on the same filesystem becomes an interesting problem.
We already have things like user mapping in 9p and nfsv4 so it isn't
wholly uncharted territory. But it could get interesting.
Interesting indeed. One might wonder whether there will be some pushback
from other kernel hackers about adding mapping layers to filesystems
(presumably in the VFS code so that it works for all of them). Since
virtualization can solve many of the problems that are still being worked
on in containers (at least for some hardware platforms), there may be
questions about adding further kernel complexity to support full-scale
containerization as envisioned by Biederman (and others). That is the
argument that Ts'o is making, and one might guess that others share it.
In any case, no patches have yet appeared for detecting that a process is
running in a container, but it may not require any changes to the kernel.
Poettering mentioned that LXC containers set an
environment variable that processes can use for that purpose, and Biederman
seemed to think that might be a reasonable solution (and wouldn't require
kernel changes as it is just a user-space convention). Making a new UTS
namespace (and changing the output of uname) as Biederman
suggested would be another way to handle the problem from user space. That
part seems like it will get solved in short order, but the more general
questions of containers and security isolation are likely to be with us for
some time to come.
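The LXC convention can be checked from user space by reading init's environment. A minimal sketch (Python; the "container=" variable name is a user-space convention, not a kernel guarantee, and /proc/1/environ may be unreadable without privilege):

```python
def parse_container_var(environ_bytes):
    # /proc/1/environ is NUL-separated; LXC sets container=lxc there
    # by convention. Returns the container type, or None if absent.
    for entry in environ_bytes.split(b"\0"):
        if entry.startswith(b"container="):
            return entry.split(b"=", 1)[1].decode()
    return None

def detect_container():
    try:
        with open("/proc/1/environ", "rb") as f:
            return parse_container_var(f.read())
    except PermissionError:
        return None  # not readable by unprivileged processes here
```

Because this is pure convention, it only works for container managers that agree to set the variable, which is why some still favor a kernel-provided answer.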
Deleting a file is sufficient to make it go away as far as the directory
structure is concerned; that important, not-backed-up document is gone
before the careless user has even removed a repentant finger from the
"enter" key. On the underlying storage device, though, parts or all of the
doomed file can live on indefinitely. A suitably determined person
can often recover data from a file that was thought to be deleted and
gone. In some situations, this persistence of data can be most unwelcome.
Paper shredders exist for situations where recovery of a "deleted" paper
document is undesirable; there is clear value in a similar functionality
for filesystem-based files.
A look at the chattr man page indicates that "secure delete"
functionality can be enabled for a file by setting the right attribute
flag. The only problem is that most filesystems do not actually honor that
flag; it's a sort of empty security theater. Even the TSA does a better job. The
secure delete patch set from Allison Henderson may fill that particular
gap for the ext4 filesystem, but a few things will need to be fixed up first.
The core part of the patch is relatively straightforward: if part or all of
a file is being deleted, the relevant blocks will be overwritten
synchronously as part of the operation. By default, the blocks are
overwritten with zeroes, but there is an option to use random bytes from
the kernel's entropy pool instead. A bit of care must be taken when the
file involved contains holes - it wouldn't do to overwrite blocks that
don't exist. There is also some logic to take advantage of the "secure
discard" feature supported by some devices (solid-state disks, primarily)
if that feature is available. Secure discard handles the deletion
internally to the device - perhaps just by marking the relevant blocks
unreadable until something else overwrites them - eliminating the need to
perform extra I/O from the kernel.
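The patch does this overwriting inside the filesystem, below the file abstraction. The general idea can be illustrated with a rough user-space analogue (a sketch only; unlike the in-kernel version, it cannot reach journal copies, holes are not an issue at this level, and blocks the filesystem has relocated are out of reach):

```python
import os

def shred(path, use_random=False):
    # Overwrite the file's contents in place, force the data to the
    # device, then unlink. Zeroes by default, random bytes on request,
    # mirroring the two modes offered by the secure delete patch.
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        data = os.urandom(size) if use_random else b"\0" * size
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # synchronous, as in the patch
    os.unlink(path)
```

The in-kernel implementation can additionally hand the work to the device via secure discard when the hardware supports it, avoiding the extra write I/O entirely.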
The job does not stop there, though. The very existence of the file - or
information contained in its name - could be sensitive as well. So the
secure delete code must also clear out the directory entry associated with
the file (if it is being deleted, as opposed to just truncated,
obviously). Associated metadata - extended attributes, access control
lists, etc - must also be cleaned out in this way.
Then there is the little issue of the journal. At any given time, the
journal could contain any number of blocks from the deleted file, possibly
in several versions. Clearly, sanitizing the journal is also required, but
it must be done carefully: clearing journal blocks before the associated
transaction has been committed could result in a corrupted filesystem
and/or the inability to recover properly from a crash. So the first thing
that must happen is a synchronous journal flush.
Once any outstanding transactions have been cleared, the (now old and
unneeded) data in the journal can be cleaned up. The only problem is that
the journal does not maintain any association between its internal blocks
and the files they belong to in the filesystem. The patch addresses that
problem by adding a new data structure mapping between journal blocks and
file blocks; the secure deletion code then traverses that structure in
search of blocks in need of overwriting.
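That bookkeeping can be modeled as a toy structure (illustrative only; the actual patch ties this to jbd2 transactions rather than Python dictionaries):

```python
class JournalMap:
    # Toy model of the mapping the patch adds:
    # remember which journal blocks carry data for which inode.
    def __init__(self):
        self.by_inode = {}  # inode -> set of journal block numbers

    def record(self, journal_block, inode):
        # Called as file data for `inode` is copied into the journal.
        self.by_inode.setdefault(inode, set()).add(journal_block)

    def blocks_to_scrub(self, inode):
        # On secure deletion of `inode`, these journal blocks must be
        # overwritten (only after the transactions touching them commit).
        return sorted(self.by_inode.pop(inode, set()))
```

The traversal at deletion time is then just a lookup per inode instead of a scan of the whole journal.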
The journal cleanup code drew some complaints from developers; the first
problem is that directory and metadata blocks are not cleared from the
journal. The deeper complaint, though, was that it represented an
excessive mixing of the two layers; the filesystem really should not have
an overly deep understanding of how the journal works. The end result is
that some of this cleanup work is likely to move into the jbd2 journaling
layer; as an added benefit, other filesystems should then be able to take
advantage of it as well.
Darrick Wong also pointed out that this
block mapping is not, itself, written to the journal; if the system were to
crash before the journal could be cleaned, that cleanup would not happen
after a reboot. He suggested that filesystems should mark blocks in need
of secure deletion when they are first added to a journal transaction; the
journal code could take care of things from then on. It seems like a
cleaner solution, but there is, naturally, a catch.
That catch is that there is no way to create a file with the "secure
delete" attribute set; that attribute must be added after the fact. One
would assume that users would ask for secure delete before actually writing
any data to the file, but associated information (the file name, for
example) must exist before the file can be so marked. So the journal may
well contain blocks for "secure delete" files that will not be marked for
overwriting. There is no way to fix that within the existing POSIX API.
Darrick suggested a few ways to deal with this problem. One is to just
give up and tell developers that, if even the name of the file is
important, they should create that file under a temporary name, set the
"secure delete" attribute, then rename the file to its real name. An
alternative would be to mark the blocks for all newly-created files
as needing overwriting; that would eliminate the race to get the attribute
set at the cost of slowing down all file creation operations. Or, he said,
the journal could track the mapping between blocks and files, much as
Allison's patch does. One other option would be to mark the entire
filesystem as needing secure delete; that feature, evidently, is already in
the works, but enabling it will have an obvious performance cost.
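The create-then-rename workaround looks like this in practice (a sketch; chattr +s has an effect only on filesystems that actually honor the flag, and the file name "confidential-notes.txt" is just an example):

```shell
# Create the file under a meaningless temporary name, mark it for
# secure deletion, then rename it: the sensitive name never reaches
# the disk without the attribute already set.
tmp=$(mktemp secret.XXXXXX)
chattr +s "$tmp" 2>/dev/null || echo "secure-delete flag not supported here"
mv "$tmp" confidential-notes.txt
```

The data blocks, by contrast, are covered as long as nothing is written before the chattr step, which is the part of the race applications can control.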
The end result is that there are a few details to be worked out yet. The
feature seems useful, though, for users who are concerned about data from
their files hanging around after the files themselves have been deleted.
Perhaps, around the 3.3 kernel or thereafter, "chattr +s" on
ext4 will be more than bad security theater.
The btrfs filesystem was merged into the mainline in January, 2009 for the
2.6.29 kernel release. Since then, development on the filesystem has
accelerated to the point that many consider it ready for production use and
some distributions are considering using it by default. The filesystem
itself is nearly functionally complete and increasingly stable, but there
is still one big hole: there is no working filesystem checker for Btrfs.
As user frustration over the lack of this essential utility grows, an
interesting question arises: is some software too dangerous to be released?
This tool (called "btrfsck") has been under development for some time, but,
despite occasional hints to the contrary, it has never escaped from Chris
Mason's laptop into the wild. This delay has had repercussions elsewhere;
Fedora's plan to move to btrfs by default, for example, cannot go forward
without a working filesystem checker. Most recently, Chris said that he hoped to be able to demonstrate
the program at the upcoming LinuxCon
Europe event. That, however, was not enough for some vocal users who
have started to let it be known that their patience has run out. Thus
we've seen accusations that Oracle really
intends to keep btrfs as a private, proprietary tool and statements that "It's really time for
Chris Mason to stop disgracing the open source community and tarnishing
Oracle's name." Those are strong words directed at somebody who has
done a lot to create a next-generation filesystem for Linux.
Your editor would like to be the first to say that both the open source
community and Oracle benefit greatly from Chris's presence. The cynical
might add that Oracle has delegated the task of "tarnishing its name" to
employees who are more skilled in that area. That said, it is worth
examining why btrfsck remains under wraps; had the tool been put out in the
open - the way the filesystem itself was - chances are good that others
would have helped with its development. One could argue that the failure
to release btrfsck in any form has almost certainly retarded its
development and, thus, the adoption of btrfs as a whole.
According to Chris, the early merging of
btrfs was important for the creation of the filesystem's development community:
Keep in mind that btrfs was released and ran for a long time while
intentionally crashing when we ran out of space. This was a really
important part of our development because we attracted a huge
number of contributors, and some very brave users.
But, he says, the filesystem checker ("fsck") is a bit different, and is
not ready yet even for the braver users:
For fsck, even the stuff I have here does have a way to go before
it is at the level of an e2fsck or xfs_repair. But I do want to
make sure that I'm surprised by any bugs before I send it out, and
that's just not the case today. The release has been delayed
because I've alternated between a few different ways of repairing,
and because I got distracted by some important features in the
filesystem itself.
Josef Bacik expressed the fears that keep
btrfsck out of the community more clearly:
Fsck has the potential to make any users problems worse, and given
the increasing number of people putting production systems on btrfs
with no backups the idea of releasing a unpolished and not fully
tested fsck into the world is terrifying, and would likely cause
long term "I heard that file system's fsck tool eats babies" sort of thing.
He went on to say "Release early and release often is nice for web
browsers and desktop environments, it's not so nice with things that could
result in data loss." This is a claim that raises some interesting
questions, to say the least.
One could start by questioning the wisdom of running a new filesystem like
btrfs in production with no backups and no working filesystem repair
tool. How is it that releasing the filesystem itself is OK, but releasing
the repair tool presents too much of a risk for users? How does that tool
really differ from a web browser, especially given that the browser is
exposed to all the net can throw at it and bugs can easily lead to exposure
of users' credentials or the compromise of their systems? There is no
shortage of software out there that can badly bite its users when things go wrong.
That said, there are some unique aspects to the development of filesystem
repair tools. They are invoked when things have already gone wrong, so the
usual rules of how the filesystem should be structured are out the window.
They must perform deep surgery on the filesystem structure to recover from
corruptions that may be hard to anticipate and correct; one could
paraphrase Tolstoy and say that happy filesystems are all alike, but every
corrupted filesystem is unhappy in its own way. As the checker tries to
cope with a messed-up filesystem, it works in an environment where any
change it makes could turn a broken-but-recoverable filesystem into one
that is a total loss. In summary, btrfsck will not be an easy tool to
write; it is a job that is almost certainly best left to developers with a
lot of filesystem experience and who understand btrfs to its core. That
narrows the development pool to a rather small and select group.
And, in the end, no responsible developer wants to release a tool which, in
his or her opinion, could create misery for its users. Those users
will run btrfsck on their filesystems regardless of any
blood-curdling warnings that it may put up first; if it proceeds to destroy
their data, they will not blame themselves for their loss. If Chris does
not yet believe that he can responsibly release btrfsck for wider use, it
is not really our place to second-guess his reasoning or to tell him that
he should release it anyway. Anybody who feels they cannot trust him to
make that decision probably should not be running the filesystem he
designed to begin with.
Releasing software early and often is, in general, good practice for free
software development; keeping code out of the public eye often does not
benefit it in the long run. Perhaps btrfsck has been withheld for too
long, but that is not our call to make. The need for the tool is clear -
if nothing else, Oracle has decided to go with btrfs by default in the near
future. There can be no doubt that this need is creating a fair amount of
pressure. The LinuxCon demonstration may or may not happen, but btrfsck
seems likely to make its much-delayed debut before too much longer.
Page editor: Jonathan Corbet