Brief items
The current stable 2.6 kernel is 2.6.17.4,
released on July 6. It
contains a single fix for a locally-exploitable vulnerability in the
prctl() system call.
2.6.16.24 was also released with
the same fix.
The current 2.6 prepatch remains 2.6.18-rc1.
Almost 200 patches have gone into the mainline since -rc1 was released;
they are almost all fixes, but the "TCP Compound" congestion control
algorithm was also removed due to doubts about the code's origin.
The current -mm tree is 2.6.18-rc1-mm1. Recent changes
to -mm include a vast number of new warnings for unchecked return values, a
set of software suspend updates, and a new version of the vectored I/O operation patch
set.
Kernel development news
As has been reported on LWN recently, Andrew Morton has been heard to worry
that bugs are being added
to the kernel more quickly than they are being fixed. But it is hard to
know for sure. In an attempt to obtain a little more data on the problem,
Andrew has asked LWN to run a survey of its subscribers. The results will,
hopefully, shed some light on how a wider part of the community sees the
kernel quality issue; they will be discussed at the upcoming kernel summit.
This opportunity is an honor for LWN subscribers, who are seen as being
more than sufficiently knowledgeable to provide good answers while being
unlikely to attempt to skew the results. It is a chance for all of us to
help with the development process. If you are an LWN subscriber, please
take a few minutes, proceed to
the survey and help out.
All these functions return error codes, and we're not checking
them. We should. So there's a patch which marks all these things
as __must_check, which causes around 1,500 new warnings.
These are all bugs and they all need to be fixed.
-- Andrew Morton releases
2.6.18-rc1-mm1
[Y]ou seem to be quite self-confident. That is a nice thing to have
for say a pro boxer, but it can be a disadvantage when dealing with
a complex OS.
-- Ingo Molnar
The initramfs mechanism was added to the 2.5.46 kernel. With initramfs, a
boot-time filesystem can be created (in cpio format) and appended to the
kernel image file.
When the system boots, it will have access to the filesystem from the very
beginning of the bootstrap process - long before it reaches the point of
being able to mount disks. Initramfs works much like the venerable initrd
facility, but, unlike initrd, initramfs does not require the system to be
able to mount a disk and find the filesystem image.
Initramfs is increasingly useful as hardware becomes more complex. Often,
simply finding the root filesystem can involve complex hardware setup,
conversations across the network, getting cryptographic keys, piecing
together RAID or LVM volumes, and more. Currently, much of this work is
done inside the kernel itself, leading to kernel code which duplicates
user-space tools - but with less review and maintenance. Moving this work
into a user-space boot-time filesystem promises to shrink the kernel, make
the boot process more reliable, and allow distributors (and users) to
customize the early bootstrap process in interesting ways.
Thus far, however, use of initramfs has been limited; in particular, all of
the early boot code remains in the kernel. One of the blocking points has
been the need for a minimal C library which would work in that
environment. This library (klibc) has been under development, slowly, for
years. That work has recently culminated in a set of klibc patches posted by
H. Peter Anvin. Klibc is now in a position to help rework the Linux
bootstrap process - and to force discussion of just how the kernel should
interact with tightly-coupled utilities.
The core klibc patch includes replacements for a long list of C library
functions and system call wrappers. It is sufficient, for example, to
support a minimal shell called "dash" and a port of the gzip utility.
There is a root filesystem mounting utility which can handle several
filesystem types, obtaining an IP address using bootp or DHCP, NFS mounts,
assembly of RAID volumes, resuming of suspended systems, and more. Much of
the code which performs those functions can then be removed from the kernel
itself. Klibc and the kinit program which comes with it appear to be
getting close to ready for real use.
This code, like other efforts to move core kernel features into user space,
raises a number of questions. Some of these are likely to come up at the
kernel summit in Ottawa, but a real solution is likely to be rather longer
in coming.
The fundamental question is this: are klibc and kinit part of the kernel?
They consist of code which used to be part of the kernel itself, and which
is a necessary part of the kernel bootstrap process - if the related code
is removed from the kernel, the kernel will not be able to run
without kinit. Both components are tightly tied to the kernel, to the
point that a kernel upgrade may often require upgrading kinit and klibc as
well. A system where the kernel and kinit go out of sync may well fail to
boot.
To many developers, these reasons are more than adequate to justify
packaging (and building) kinit and klibc with the kernel itself. If the
code is kept and built together, it has a much higher chance of continuing
to function as a coherent whole. Every kernel/kinit combination will have
been tested together and will be known to work. If, instead, the two are
separated, the resulting kinit will be, in essence, a large body of kernel
code which is not reviewed and maintained with the rest of the system. The
quality of kinit could be expected to suffer, complaints from users could
grow, and differences between distributions could increase.
On the other hand, if kinit must be part of the kernel, one could well ask
just where the line should be drawn. Should udev, which has
suffered from (rare) kernel version incompatibilities, be included? How
about the user-space software suspend code? Cluster membership utilities?
Filesystem checkers? Wireless network authentication daemons? Unless
Linux is going to head toward a more BSD-like organization (an unlikely
prospect), we will not see all of the above tools included in the kernel
tarball anytime soon. And so, according to some, kinit and klibc should be
maintained as out-of-kernel packages like any other user-space code.
There is another important issue here, however: compatibility between
distributions and between kernel versions. Earlier this year, your editor
had a system running a development distribution fail to boot; that
distribution's maintainers had concluded that, since the
distribution-specific initrd image mounted /proc and
/sys, there was no reason for the initialization scripts to do so
as well. Your editor, who has never had much use for initrd, was left with
a system which was unable to run a vanilla kernel.org kernel. That
particular change was (after your editor complained) backed out, but the
issue remains: distribution-specific initialization code can make it
impossible to run kernels obtained from elsewhere. Ted Ts'o has also pointed out an initialization problem which
makes RHEL4 unable to run current kernels on some systems. He says:
Kinit SHOULD be merged into the kernel, and the responsibility of
creating the initrd/initramfs image should be moved from the
distribution into the kernel build process. There can and should
be a way for distro's to add their own "value add specials" into
the initrd/initramfs image, but we have to take over creating the
base initial userspace environment.
This is a discussion which could go on for some time; it could become one
of the more contentious issues at the kernel summit. There is a subset of
the kernel development community which has a strong desire to move as much
code as possible into user space. Not everybody agrees that this is the
right approach, but, to the extent that code is shoved out of kernel space,
there must be a vision describing how all of the pieces will continue to
work well together into the future. That vision does not yet appear to
exist.
The developers behind a whole range of virtualization and containerization
projects are continuing to work on ways to get the isolation features they
need into the mainline kernel. Much of that work is centered around the
elimination of global namespaces and additions to the
unshare()
system call so that interested processes can retreat into their own,
private namespaces. For example, on mainline Linux systems today, the
process ID namespace is global - a given process ID identifies the same
process for every other process on the system. The container developers
would like to move away from a global PID namespace so that containers can
present their own process IDs to the processes trapped inside. Many other
kernel namespaces are receiving the same sort of treatment.
Cedric Le Goater has posted a
patch set which takes this work forward in an interesting way by
de-globalizing another namespace and adding a different interface for
creating new namespaces.
The new namespace type added by the patch is the "user" namespace - the
system's view of user ID values. For the most part, the kernel just uses
user IDs for the enforcement of permissions; it does not really care if one
set of processes interprets user ID values differently than another. So,
if processes within one container cannot see resources
(processes, SYSV IPC, filesystems) belonging to another container, there is
little opportunity for processes to interfere with each other, even if they
are running with the same numeric user ID value. That user ID can map to
two entirely different accounts in the different containers, and the
isolation provided by those containers will keep them separate.
The one little exception is the user_struct structure maintained
in kernel/user.c. This structure exists to allow the kernel to
enforce per-user resource limits; to that end, one is allocated for each
user ID currently active on the system. The function responsible for
looking up one of these structures (find_user()) implements a
global user ID namespace, so processes sharing a user ID number in
different containers will affect each other's resource limits.
Cedric's patch fixes this problem by creating a new namespace type for user
IDs, allowing resource limits to be isolated within containers. The
implementation of this namespace is simple, but allowing processes to move
into a new user namespace with unshare(), as it turns out, is
not. When a process gets around to calling unshare(), it may have
a long list of resources which are reflected in the user_struct
structure. Disconnecting from the old structure will require the system to
somehow disassociate the process's current resource usage from that
structure and add it to the new one instead. This process is detailed
and error-prone; even if it works once, keeping it maintained and
functional into the future could be a challenge.
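The difference between a global lookup and a per-namespace one can be
sketched in user space. The names below (struct user_acct, struct user_ns,
find_user_in_ns(), charge_process() and the fixed-size table) are
hypothetical; they are not taken from Cedric's patch or from
kernel/user.c. The sketch only illustrates why keying the accounting
record on a namespace as well as a user ID keeps the limits of two
containers apart.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_USERS 16   /* arbitrary table size for the sketch */

/* Hypothetical per-user accounting record: one per active user ID. */
struct user_acct {
    unsigned int uid;
    int in_use;
    int processes;
};

/* Hypothetical namespace: each one owns a private table of records. */
struct user_ns {
    struct user_acct users[MAX_USERS];
};

/* Look up (or create) the record for uid inside one namespace.
 * Returns NULL if the table is full. */
struct user_acct *find_user_in_ns(struct user_ns *ns, unsigned int uid)
{
    struct user_acct *free_slot = NULL;

    assert(ns != NULL);
    for (size_t i = 0; i < MAX_USERS; i++) {
        struct user_acct *u = &ns->users[i];

        if (u->in_use && u->uid == uid)
            return u;
        if (!u->in_use && free_slot == NULL)
            free_slot = u;
    }
    if (free_slot != NULL) {
        free_slot->uid = uid;
        free_slot->in_use = 1;
        free_slot->processes = 0;
    }
    return free_slot;
}

/* Charge one process to uid in the given namespace. Returns the new
 * count, or -1 if no record could be allocated. */
int charge_process(struct user_ns *ns, unsigned int uid)
{
    struct user_acct *u = find_user_in_ns(ns, uid);

    return u ? ++u->processes : -1;
}
```

With two struct user_ns instances, charging user ID 1000 twice in one of
them leaves the count for user ID 1000 in the other untouched, which is
the isolation a container would want.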
The same challenge applies to SYSV IPC namespaces. A process which holds
references to a SYSV semaphore, for example, must have those references
taken away, any undo information handled properly, and so on.
Rather than try to fix up unshare() to handle all of these issues,
Cedric has taken a different approach: only allow a process to disconnect
from namespaces when all of its references to those namespaces are being
shut down anyway. That time is when the process calls a form of
exec() to run a new program. So Cedric has created a new form of
the execve() call:
int execns(int unshare_flags, char *filename, char **argv, char **envp);
This call will function like execve, in that it will cause the
process to run the program found in filename with the given
arguments and environment. The new unshare_flags argument,
however, allows the caller to specify a set of namespaces to be unshared at
the same time. As a result, the new program starts fresh with its new
namespaces and no dangling references into the older ones. To help ensure
that things happen this way, execns() closes all open
files, regardless of whether they are marked "close on exec."
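The proposed execns() is not available in mainline kernels, so it cannot
be called directly. As a rough user-space approximation only, the sequence
it describes can be written with calls that do exist: unshare(), close()
and execve(). The function names execns_sketch() and run_execns_sketch(),
and the FD_SCAN_LIMIT cap, are assumptions made for this illustration.
The sketch does not remove the kernel-side difficulty described above,
since unshare() is still called by a running process; it only shows the
order of operations that the new call would fold into one step.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Cap on the descriptor scan: an assumption made to keep the sketch
 * fast on systems with a very large open-file limit. */
#define FD_SCAN_LIMIT 4096

/* Approximate the proposed execns() with existing calls.
 * Like execve(), it only returns (with -1) on failure. */
int execns_sketch(int unshare_flags, const char *filename,
                  char *const argv[], char *const envp[])
{
    long max_fd = sysconf(_SC_OPEN_MAX);

    assert(filename != NULL);
    /* Nothing to unshare when no flags are given. */
    if (unshare_flags != 0 && unshare(unshare_flags) != 0)
        return -1;
    if (max_fd < 0 || max_fd > FD_SCAN_LIMIT)
        max_fd = FD_SCAN_LIMIT;
    /* Close every descriptor, whether or not it is close-on-exec. */
    for (int fd = 0; fd < max_fd; fd++)
        close(fd);
    execve(filename, argv, envp);
    return -1;
}

/* Run a program through execns_sketch() in a child process and return
 * its exit status, or -1 if the child did not exit normally. */
int run_execns_sketch(int unshare_flags, const char *filename,
                      char *const argv[], char *const envp[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execns_sketch(unshare_flags, filename, argv, envp);
        _exit(127);   /* unshare() or execve() failed */
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A container launcher would pass a flag such as CLONE_NEWNS, which
normally requires privilege; passing 0 simply runs the program with all
inherited descriptors closed.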
Moving namespace creation into exec() would seem to make some
sense. The creation of namespaces is a rare act, done as part of the
establishment of a new container; it's not something that running processes
just occasionally decide to do. The execns() call will allow a
container's init-like process to start with a clean slate while,
with luck, simplifying the unsharing logic within the kernel.
July 12, 2006
This article was contributed by Valerie Henson
Next time your Linux laptop crashes, pull out your watch (or your cell
phone) and time how long it takes to boot up. More than likely,
you're running a journaling file system, and not only did your system
boot up quickly, but it didn't lose any data that you cared
about. (Maybe you lost the last few bytes of your DHCP client's log
file, darn.) Now, keep your timekeeping device of choice handy and
execute a normal shutdown and reboot. More than likely, you will find
that it took longer to reboot "normally" than it did to crash your
system and recover it - and for no perceivable benefit.
George Candea and Armando Fox noticed that, counter-intuitively, many
software systems can crash and recover more quickly than they can be
shut down and restarted. They reported the following measurements in
their paper, Crash-only
Software (published in Hot Topics in Operating
Systems IX in 2003):
| System | Clean reboot | Crash reboot | Speedup |
| RedHat 8 (ext3) | 104 sec | 75 sec | 1.4x |
| JBoss 3.0 app server | 47 sec | 39 sec | 1.2x |
| Windows XP | 61 sec | 48 sec | 1.3x |
In their experiments, no important data was lost. This is not
surprising as, after all, good software is designed to safely handle
crashes. Software that loses or ruins your data when it crashes isn't
very popular in today's computing environment - remember how
frustrating it was to use word processors without an auto-save
feature? What is surprising is that most systems have two methods of
shutting down - cleanly or by crashing - and two methods of starting
up - normal start up or recovery - and that frequently the
crash/recover method is, by all objective measures, a better choice.
Given this, why support the extra code (and associated bugs) to do a
clean start up and shutdown? In other words, why should I ever type
"halt" instead of hitting the power button?
The main reason to support explicit shutdown and start-up is simple:
performance. Often, designers must trade off higher steady state
performance (when the application is running normally) against
performance during a restart - and against acceptable data loss. File
systems are a good example of this trade-off: ext2 runs very quickly
while in use but takes a long time to recover and makes no guarantees
about when data hits disk, while ext3 has somewhat lower performance
while in use but is very quick to recover and makes explicit
guarantees about when data hits disk. When overall system
availability and acceptable data loss in the event of a crash are
factored into the performance equation, ext3 or any other journaling
file system is the winner for many systems, including, more than
likely, the laptop you are using to read this article.
Crash-only software is software that crashes safely and recovers
quickly. The only way to stop it is to crash it, and the only way to
start it is to recover. A crash-only system is composed of crash-only
components which communicate with retryable requests; faults are
handled by crashing and restarting the faulty component and retrying
any requests which have timed out. The resulting system is often more
robust and reliable because crash recovery is a first-class citizen in
the development process, rather than an afterthought, and you no
longer need the extra code (and associated interfaces and bugs) for
explicit shutdown. All software ought to be able to crash safely and
recover quickly, but crash-only software must have these qualities, or
their lack becomes quickly evident.
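The interaction between retryable requests and crash/restart can be
sketched with a toy model. Everything here is hypothetical: struct
component, send_request(), crash_and_restart() and request_with_retry()
are stand-ins invented for the illustration, and the "timeout" is
simulated with a flag rather than a real deadline. The point is only the
control flow: a request that times out causes the component to be crashed
and restarted, and the same request is then sent again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical component: it serves requests until it wedges, after
 * which every request times out until the component is restarted. */
struct component {
    bool wedged;
    int restarts;
};

enum { REQ_OK = 0, REQ_TIMEOUT = -1 };

/* Stand-in for sending a request and waiting up to a deadline. */
int send_request(struct component *c, int arg, int *reply)
{
    if (c->wedged)
        return REQ_TIMEOUT;
    *reply = arg * 2;   /* placeholder for real work */
    return REQ_OK;
}

/* Stand-in for an external crash followed by recovery to a known state. */
void crash_and_restart(struct component *c)
{
    c->wedged = false;
    c->restarts++;
}

/* Retry a request, restarting the component after each timeout.
 * The request must be safe to repeat. */
int request_with_retry(struct component *c, int arg, int *reply,
                       int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (send_request(c, arg, reply) == REQ_OK)
            return REQ_OK;
        crash_and_restart(c);
    }
    return REQ_TIMEOUT;
}
```

The caller never needs to know why the component stopped answering; it
only needs requests that can be repeated without harm.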
The concept of crash-only software has received quite a lot of
attention since its publication. Besides several well-received
research papers demonstrating useful implementations of crash-only
software, crash-only software has been covered in several popular
articles in publications as diverse as Scientific American, Salon.com,
and CIO Today. It was cited as one of the reasons Armando Fox was
named to Scientific American's list of top 50 scientists for 2003
and George Candea was named one of MIT Technology Review's Top 35 Young
Innovators for 2005. Crash-only software has made its mark outside
the press room as well; for example, Google's distributed file system,
GoogleFS, is implemented as crash-only software, all the way through
to the metadata server. The term "crash-only" is now regularly
bandied about in design discussions for production software. I myself
wrote a blog
entry on crash-only software back in 2004. Why bother writing
about it again? Quite simply, the crash-only software meme became so
popular that, inevitably, mutations arose and flourished, sometimes to
the detriment of allegedly crash-only software systems. In this
article, we will review some of the more common misunderstandings
about designing and implementing crash-only software.
Misconceptions about crash-only software
The first major misunderstanding is that crash-only software is a form
of free lunch: you can be lazy and not write shutdown code, not handle
errors (just crash it! whee!), or not save state. Just pull up your
favorite application in an editor, delete the code for normal start up
and shutdown, and voila! instant crash-only software. In fact,
crash-only software involves greater discipline and more careful
design, because if your checkpointing and recovery code doesn't work,
you will find out right away. Crash-only design helps you produce
more robust, reliable software; it doesn't exempt you from writing
robust, reliable software in the first place.
Another mistake is overuse of the crash/restart "hammer." One of the
ideas in crash-only software is that if a component is behaving
strangely or suffering some bug, you can just crash it and restart it,
and more than likely it will start functioning again. This will often
be faster than diagnosing and fixing the problem by hand, and so is a
good technique for high-availability services. Some programmers
overuse the technique by deliberately writing code to crash the
program whenever something goes wrong, when the correct solution is to
handle all the errors you can think of correctly, and then rely on
crash/restart for unforeseen error conditions. Another overuse of
crash/restart is the belief that, when things go wrong, you should crash
and restart the whole system. One tenet of crash-only system
design is the idea that crash/restart is cheap - because you are only
crashing and recovering small, self-contained parts of the system (see
the paper on
microreboots). Try telling your users that your whole web browser
crashes and restarts every 2 minutes because it is crash-only software
and see how well that goes over. If instead the browser quietly crashes and
recovers only the thread that is misbehaving,
you will have much happier users.
On the face of it, the simplest part of crash-only software would be
implementing the "crash" part. How hard is it to hit the power
button? There is a subtle implementation point that is easy to miss,
though: the crash mechanism has to be entirely outside and independent
of the crash-only system - hardware power switch, kill -9, shutting
down the virtual machine. If it is implemented through internal code,
it takes away a valuable part of crash-only software: that you have an
all-powerful, reliable method to take any misbehaving component of the
system and crash/restart it into a known state.
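A minimal sketch of an external crash mechanism, assuming the component
runs as a separate process: a supervisor sends SIGKILL (the signal behind
kill -9), which the target cannot catch, block or ignore, and then starts
a fresh copy. The function names start_worker(), crash_worker() and
restart_worker() are invented for this illustration, and the worker is a
placeholder that simply waits forever, standing in for a hung component.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start a worker that never exits on its own (a stand-in for a hung
 * component). Returns its pid in the parent, or -1 on error. */
pid_t start_worker(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();   /* wait forever; only a fatal signal ends this */
    }
    return pid;
}

/* Crash a worker from outside with SIGKILL and reap it. Returns the
 * signal that terminated it, or -1 on error. */
int crash_worker(pid_t pid)
{
    int status;

    assert(pid > 0);
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}

/* Crash the old worker and start a replacement. Returns the new pid,
 * or -1 on error. */
pid_t restart_worker(pid_t old)
{
    if (crash_worker(old) != SIGKILL)
        return -1;
    return start_worker();
}
```

Nothing in the worker's own code is involved in stopping it, so a bug or
deadlock inside the worker cannot prevent the crash from happening.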
I heard of one
"crash-only" system in which the shutdown code was replaced with a call
to abort() as part of a "crash-only" design. There were two
problems with this approach. One, it relied on the system to not have
any bugs in the code path leading to the abort() call or any
deadlocks which would prevent it being executed. Two, shutting down
the system in this manner only exercised a subset of the total
possible crash space, since it was only testing what happened when the
system successfully received and handled a request to shut down. For
example, a single-threaded program that handled requests in an event
loop would never be crashed in the middle of handling another request,
and so the recovery code would not be tested for this case. One more
example of a badly implemented "crash" is a database that, when it ran
out of disk space for its event logging, could not be safely shut down
because it wanted to write a log entry before shutting down, but it
was out of disk space, so...
Another common pattern is to ignore the trade-offs of performance
vs. recovery time vs. reliability and take an absolutist approach to
optimizing for one quality while maintaining superficial allegiance to
crash-only design. The major trade-off is that checkpointing your
application's state improves recovery time and reliability but reduces
steady state performance. The two extremes are checkpointing or
saving state far too often and checkpointing not at all; like
Goldilocks, you need to find the checkpoint frequency that is Just
Right for your application.
What frequency of checkpointing will give
you acceptable recovery time, acceptable performance, and acceptable
data loss? I once used a web browser which only saved preferences and
browsing history on a clean shutdown of the browser. Saving the
history every millisecond is clearly overkill, but saving changed
items every minute would be quite reasonable. The chosen strategy,
"save only on shutdown," turned out to be equivalent to "save never" -
how often do people close their browsers, compared to how often they
crash? I ended up solving this problem by explicitly starting up the
browser for the sole purpose of changing the settings and immediately
closing it again after the third or fourth time I lost my
settings. (This is a good example of how all software should be written
to crash safely but does not.) Most implementations of bash I have
used take the same approach to saving the command history; as a result
I now explicitly "exit" out of running shells (all 13 or so of them)
whenever I shut down my computer so I don't lose my command history.
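One common way to save state periodically without risking a corrupt file
is to write a temporary file, flush it, and rename it over the old copy.
The sketch below shows that technique; it is not how any particular
browser or shell does it, and the names checkpoint_save() and
checkpoint_load() and the ".tmp" suffix are assumptions made for the
example. A fuller implementation would typically also flush the
containing directory and retry interrupted writes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Save len bytes of state to path so that a crash at any point leaves
 * either the old file or the new one, never a partial mix.
 * Returns 0 on success, -1 on error. */
int checkpoint_save(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    const char *p = data;
    int n, fd;

    assert(path != NULL && data != NULL);
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof(tmp))
        return -1;
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t written = write(fd, p, len);

        if (written <= 0) {   /* error, for example a full disk */
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += written;
        len -= (size_t)written;
    }
    /* Flush the data before making it visible under the real name. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    /* rename() replaces the old checkpoint in a single step. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Read a checkpoint back. Returns bytes read, or -1 if none exists. */
ssize_t checkpoint_load(const char *path, void *buf, size_t cap)
{
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, buf, cap);
    close(fd);
    return n;
}
```

Calling checkpoint_save() from a timer, or whenever an item changes, means
that a crash costs at most the changes made since the last call.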
Shutdown code should be viewed as, fundamentally, only of use to
optimize the next start up sequence and should not be used to do
anything required for correctness. One way to approach shutdown code
is to add a big comment at the top of the code saying "WISHFUL
THINKING: This code may never be executed. But it sure would be
nice."
Another class of misunderstanding is about what kind of systems are
suitable for crash-only design. Some people think crash-only software
must be stateless, since any part of the system might crash and
restart, and lose any uncommitted state in the process. While this
means you must carefully distinguish between volatile and non-volatile
state, it certainly doesn't mean your system must be stateless!
Crash-only software only says that any non-volatile state your system
needs must itself be stored in a crash-only system, such as a database
or session state store. Usually, it is far easier to use a special
purpose system to store state, rather than rolling your own. Writing
a crash-safe, quick-recovery state store is an extremely difficult
task and should be left to the experts (and will make your system
easier to implement).
Crash-only software makes explicit the trade-off between optimizing
for steady-state performance and optimizing for recovery. Sometimes
this is taken to mean that you can't use crash-only design for high
performance systems. As usual, it depends on your system, but many
systems suffer bugs and crashes often enough that crash-only design is
a win when you consider overall up time and performance, rather than
performance only when the system is up and running. Perhaps your
system is robust enough that you can optimize for steady state
performance and disregard recovery time... but it's unlikely.
Because it must be possible to crash and restart components, some
people think that a multi-threaded system using locks can't be
crash-only - after all, what happens if you crash while holding a
lock? The answer is that locks can be used inside a crash-only
component, but all interfaces between components need to allow for the
unexpected crash of components. Interfaces between components need to
strongly enforce fault boundaries, put timeouts on all requests, and
carefully formulate requests so that they don't rely on uncommitted
state that could be lost. As an example, consider how the recently-merged
robust futex facility makes
crash recovery explicit.
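On Linux, the C library's robust mutexes are built on robust futexes, and
they give a small picture of what "explicit" means here. The sketch below
uses the POSIX names (pthread_mutexattr_setrobust(), PTHREAD_MUTEX_ROBUST,
pthread_mutex_consistent()), which may differ from what was available
when the facility was first merged; robust_lock_demo() and
die_holding_lock() are names invented for the example. A thread exits
while holding the lock, and the next locker is told so with EOWNERDEAD
instead of hanging forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock;

/* A thread that "crashes" while holding the lock: it exits without
 * ever calling pthread_mutex_unlock(). */
static void *die_holding_lock(void *arg)
{
    int rc = pthread_mutex_lock(&lock);

    (void)arg;
    assert(rc == 0);
    (void)rc;
    return NULL;
}

/* Returns the result of the first lock attempt made after the owner
 * died, or -1 if setup or recovery failed. */
int robust_lock_demo(void)
{
    pthread_mutexattr_t attr;
    pthread_t owner;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0 ||
        pthread_mutex_init(&lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    if (pthread_create(&owner, NULL, die_holding_lock, NULL) != 0)
        return -1;
    pthread_join(owner, NULL);

    /* The owner is gone; a robust mutex reports that to the caller. */
    rc = pthread_mutex_lock(&lock);
    if (rc == EOWNERDEAD) {
        /* Repair whatever state the lock protects here, then mark the
         * mutex usable again before unlocking it. */
        if (pthread_mutex_consistent(&lock) != 0)
            return -1;
    }
    if (rc == 0 || rc == EOWNERDEAD) {
        if (pthread_mutex_unlock(&lock) != 0)
            return -1;
    }
    pthread_mutex_destroy(&lock);
    return rc;
}
```

The survivor is forced to acknowledge the crash and decide whether the
protected state can be trusted, which is the kind of fault boundary a
crash-only interface needs.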
Some people end up with the impression that crash-only software is
less reliable and unsuitable for important "mission-critical"
applications because the design explicitly admits that crashes are
inevitable. Crash-only software is actually more reliable because it
takes into account from the beginning an unavoidable fact of computing
- unexpected crashes.
A criticism often leveled at systems designed to improve reliability
by handling errors in some way other than complete system crash is
that they will hide or encourage software bugs by masking their
effects. First, crash-only software in many ways exposes previously
hidden bugs, by explicitly testing recovery code in normal use.
Second, explicitly crashing and restarting components as a workaround
for bugs does not preclude taking a crash dump or otherwise recording
data that can be used to solve the bug.
How can we apply crash-only design to operating systems? One example
is file systems, and the design of chunkfs (discussed in last week's
LWN article on the 2006
Linux file systems workshop and in more detail here). We are trying to
improve reliability and data availability by separating the on-disk
data into individually checkable components with strong fault
isolation. Each chunk must be able to be individually "crashed" -
unmounted - and recovered - fsck'd - without bringing down the other
chunks. The code itself must be designed to allow the failure of
individual chunks without holding locks or other resources
indefinitely, which could cause system-wide deadlocks and
unavailability. Updates within each chunk must be crash-safe and
quickly recoverable. Splitting the file system up into smaller,
restartable, crash-only components creates a more reliable, easier to
repair crash-only system.
The conclusion
Properly implemented, crash-only software produces higher quality,
more reliable code; poorly understood, it results in lazy programming.
Probably the most common misconception is the idea that writing
crash-only software allows you to take shortcuts when
writing and designing your code. Wake up, Sleeping Beauty, there
ain't no such thing as a free lunch. But you can get a more reliable,
easier to debug system if you rigorously apply the principles of
crash-only design.
[Thanks to Brian Warner for
inspiring this article, George Candea and Armando Fox for comments and
for codifying crash-only design in general, and the implementer(s) of
the Emacs auto-save feature, which has saved my work too many times to
count.]
Page editor: Jonathan Corbet