The current 2.6 development kernel is 2.6.26-rc3, released
by Linus on
May 18. Lots of fixes, of course, and things are stabilizing (though
the list of regressions
remains long). Linus also notes that the kernel developers have now been
using git for as long as they used BitKeeper - but there are a lot more
developers now. As always, the long-format changelog
has the details.
As of this writing, almost 300 changesets have been merged into the
mainline git repository since 2.6.26-rc3. They include a new test driver
for MMC memory cards, a new device_create_drvdata() function
(intended to fix a race condition caused by the previous separation of
device_create() and dev_set_drvdata()), a USB wireless
device management driver, and a lot of fixes.
The current stable 2.6 kernel is 2.6.25.4, released on May 15. It
contains a fairly long list of fixes, a couple of which are security-related.
Kernel development news
So you can either try to drink from the firehose and inevitably be
bitched about because you're holding something up or not giving
something the attention it deserves, or you can try to make sure
that you can let others help you. And you'd better select the "let
other people help you", because otherwise you _will_ burn out. It's
not a matter of "if", but of "when".
-- Linus Torvalds
on git workflows (worth
reading in its entirety)
I have spoken with engineers both individual and within companies
who have developed and who plan to develop substantial kernel
features. I'm forever explaining to people why they should work to
get that code merged up. One reason for their not yet having done
so which comes up again and again is apprehension at the reception
they will receive. In public. This problem appears to be
especially strong in Asian countries. You have just made the situation worse.
But it's not just a self-interest thing. It is inevitably and
unavoidably the case that when one senior kernel developer acts
like an arrogant hostile dickhead, we will all be increasingly
regarded as arrogant hostile dickheads.
-- Andrew Morton
I suppose alternately I could send another patch to remove
"remember that ext3/4 by default offers higher data integrity
guarantees than most." from Documentation/filesystems/ext4.txt ;)
-- Eric Sandeen
A steady stream of random events allows the kernel to
keep its entropy pool stocked up, which in turn allows processes to use the
strongest random numbers that Linux can provide. Exactly which events
qualify as random—and just how much randomness they
provide—is sometimes difficult to decide. A recent move to eliminate
a source of
contributions to the entropy pool has worried some, especially in the embedded world.
The kernel samples unpredictable events for use in generating random
numbers, storing that data in the entropy pool. Entropy is a measure of
the unpredictability or randomness of a data set, so the kernel estimates
the amount of entropy each of those events contributes to the pool.
Many systems run on hardware that lacks some of the
traditional sources of entropy. In those cases, the timing of interrupts
from network devices has been used as a source of entropy, but it has always been
controversial, so it was recently proposed for removal.
Two of the best sources of random data for the entropy pool—user interaction via a
keyboard or mouse and disk interrupts—are often not present in embedded
devices. In addition, some disk interfaces, notably ATA, do not add
entropy, which extends the problem to many "headless" servers. But network
interrupts are seen as a dubious source of entropy because they may be able
to be observed, or manipulated, by an attacker. In addition, as network
traffic rises, many network drivers turn off receive interrupts from the
hardware, allowing the kernel to poll periodically for incoming packets.
This would reduce entropy collection just at the time when it might be needed for
encrypting the traffic.
This is not the first time eliminating the IRQF_SAMPLE_RANDOM flag
from network drivers has come up; we looked at the issue two years
ago (though the flag was called SA_SAMPLE_RANDOM at that time).
It has come up again, starting with a query on linux-kernel from
Chris Peterson: "Should network devices be allowed to contribute
entropy to /dev/random?" Jeff Garzik, kernel network device driver
maintainer, answered: "I tend to push people to /not/ add
IRQF_SAMPLE_RANDOM to new drivers,
but I'm not interested in going on a pogrom with existing code."
For anyone who is interested in such a pogrom, Peterson proposed a patch to
eliminate the flag from the twelve network drivers that still use it.
This sparked a long discussion on how to provide entropy for those devices
that do not have anything else to use. While the actual contribution of
entropy from network devices is questionable, mixing that data into the
pool does not harm it, as long as no entropy credit—the current
estimate of entropy in the pool—is awarded.
Alan Cox proposed a new flag to track sources of dubious entropy:
A more interesting alternative might be to mark things like network
drivers with a new flag say IRQF_SAMPLE_DUBIOUS so that users can be
given a switch to enable/disable their use depending upon the environment.
Some were in favor of an approach like this, but Adrian Bunk notes that:
If he can live with dubious data he can simply use /dev/urandom .
If a customer wants to use /dev/random and demands to get dubious data
there if nothing better is available fulfilling his wish only moves
the security bug from his crappy application to the Linux kernel.
Part of the problem stems from a misconception about random numbers
gotten from /dev/random versus those that are read from
/dev/urandom, which we described in a Security page
article last December. In general, applications should read from
/dev/urandom. Only the most sensitive uses of random
numbers—keys for GPG for example—need the entropy guarantee
that /dev/random provides. In a system that is getting regular
entropy updates, the quality of the random numbers from both sources is the same.
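The distinction is easy to see from user space; a minimal sketch (assuming a Linux system, written in Python for brevity):

```python
import os

# os.urandom() draws from the kernel's non-blocking pool (/dev/urandom):
# it never blocks and, once the pool has been seeded, its output is
# cryptographically strong -- the right choice for almost all applications.
session_key = os.urandom(32)

# Only the most sensitive uses (long-lived GPG keys, for example) justify
# /dev/random, which may block until the kernel has accumulated enough
# entropy credit to satisfy the request.
def read_blocking_random(nbytes):
    with open("/dev/random", "rb") as f:
        return f.read(nbytes)
```

On a system receiving regular entropy updates the two sources return bytes of the same quality; the difference is only whether the read can block.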
There is still an initialization problem for some systems, though, as Ted
Ts'o points out:
Hence, if you don't think the system hasn't run long enough to collect
significant entropy, you need to distinguish between "has run long
enough to collect entropy which is causes the entropy credits using a
somewhat estimation system where we try to be conservative such that
/dev/random will let you extract the number of bits you need", and
"has run long enough to collect entropy which is unpredictable by an
outside attacker such that host keys generated by /dev/urandom really are secure".
A potential entropy source, even for embedded systems, is to sample
other kernel and system parameters that are not predictable externally.
EGD (http://egd.sourceforge.net/) demonstrates this, for example; it looks
at snmp, w, last, uptime, iostats, vmstats, etc.
And there are plenty of untapped entropy sources even so, such as reading
temperature sensors, fan speed sensors on variable-speed fans, etc.
Heck, "smartctl -d ata -a /dev/FOO" produces output that could be hashed
and added as entropy.
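That style of collector is simple to sketch. In this hedged Python example the sampled /proc files are merely plausible choices; note that a plain write() to /dev/random mixes the bytes into the pool without awarding any entropy credit, which is exactly the conservative mixing discussed above:

```python
import hashlib
import time

def gather_sample():
    # Hash together weakly-predictable system state, EGD-style.
    pieces = []
    for path in ("/proc/loadavg", "/proc/uptime", "/proc/interrupts"):
        try:
            with open(path, "rb") as f:
                pieces.append(f.read())
        except OSError:
            pass  # not every file exists everywhere; skip quietly
    pieces.append(str(time.monotonic_ns()).encode())
    return hashlib.sha256(b"".join(pieces)).digest()

def mix_into_pool(sample):
    # Writing to /dev/random stirs the bytes in but credits no entropy.
    with open("/dev/random", "wb") as f:
        f.write(sample)
```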
Another source is from hardware random number generators. The kernel
already has support for some, including the VIA
Padlock that seems to be well thought of. Not all processors have such
support, however. The Trusted
Platform Module (TPM) does have random number generation and is
becoming more widespread, especially in laptops, but there is no kernel
hw_random driver for TPM.
Garzik advocates adding a kernel driver for what he calls the "Treacherous
Platform Module", but as others pointed out, it can all be done in user
space using the TrouSerS
library. Even for the hardware random number generators that are supported
in the kernel there is no automatic entropy collection, as it is left up to
user space to decide whether to do that. This is done to try and keep
policy decisions about the quality of the random data out of kernel code.
Systems that wish to sample that data should use rngd to feed the
kernel entropy pool. rngd will apply FIPS 140-2 tests to
verify the randomness of the data before passing it to the kernel. Andi
Kleen is not in favor of that approach:
Just think a little bit: system has no randomness source except the
hardware RNG. you do your strange randomness verification. if it fails
what do you do? You don't feed anything into your entropy pool and all
your random output is predictable (just boot time) If you add anything
predictable from another source it's still predictable, no difference.
There is concern that some of the hardware random number generators are
poorly implemented or could malfunction, so it would be dangerous to
automatically add that data into the pool. Doing the FIPS testing in the
kernel is not an option, leaving it up to user space applications to make
the decision. There is nothing stopping any superuser process from adding bits
to the entropy pool—no matter how weak—but the consensus is that the
kernel itself must use sources it knows it can trust.
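The difference between "mixing" and "crediting" shows up directly in the interfaces: a plain write to /dev/random only mixes, while the root-only RNDADDENTROPY ioctl both mixes and increases the credit counter. A sketch, assuming Linux and the usual encoding of _IOW('R', 0x03, int[2]):

```python
import fcntl
import struct

# 0x40085203 is _IOW('R', 0x03, int[2]) on common architectures.
RNDADDENTROPY = 0x40085203

def pack_rand_pool_info(entropy_bits, data):
    # struct rand_pool_info { int entropy_count; int buf_size; __u32 buf[]; }
    return struct.pack("ii", entropy_bits, len(data)) + data

def add_entropy(data, entropy_bits):
    # Requires CAP_SYS_ADMIN; this ioctl is what rngd uses after its
    # FIPS 140-2 tests pass.
    with open("/dev/random", "wb") as f:
        fcntl.ioctl(f, RNDADDENTROPY, pack_rand_pool_info(entropy_bits, data))
```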
Another instance of this problem—in a different guise—appears in a discussion about random numbers for virtualized I/O, with Garzik asking: "Has anyone yet written a "hw" RNG
module for virt, that reads the host's
random number pool?" Rusty Russell responded with a patch for a virtio "hardware"
random number generator as well as one that adds it into his lguest
hypervisor. The lguest patch reads data from the host's /dev/urandom,
which is not where H. Peter Anvin thinks it
should come from:
There is no point in feeding the host /dev/urandom to the guest (except
for seeding, which can be handled through other means); it will do its
own mixing anyway. The reason to provide anything at all from the host
is to give it "golden" entropy bits.
The virtio implementation only provides the hw_random
implementation, thus it requires user space help to get entropy data into
the kernel. Much like any process that can read /dev/random,
lguest could exhaust the host entropy pool, so there was some discussion of
limiting how much random data guests can request from the device. A guest
implementation could then use a small pool of entropy read from the host to
seed its own random number generator for the simulated hardware device.
Removing the last remaining uses of IRQF_SAMPLE_RANDOM in network
drivers seems likely, though some way to mix that data into the entropy
pool without giving it any credit is still a possibility. With luck, that
will encourage more effort into incorporating new sources of entropy using
tools like EGD or, for systems that have it available, random number
hardware. For systems that lack the traditional entropy sources, this
should lead to a better initialized and fuller pool, while eliminating a
potential attack by way of network packet manipulation.
Last week's big kernel lock
article discussed a BKL-related performance regression and concluded
that we would likely see a new interest in its elimination. In the
intervening week, that interest has indeed come to the fore. There are now
a couple of different efforts afoot to get rid of this long-lasting lock.
One might well wonder why the BKL is so persistent. Over the last
(approximately) fifteen years, thousands of locks have been added to the
kernel, pushing the BKL into increasingly obscure corners. But there are a
lot of those corners, including a great many explicit
lock_kernel() calls, the open() method for every char
device, most ioctl() implementations, all fasync()
implementations, and more. The BKL can be found throughout the kernel, and
doesn't appear ready to go without a fight.
Part of the problem is simply that locking is hard. So going in and
changing the locking of some crufty, old driver is not at the top of the
list for a lot of developers, who would generally rather be creating crufty
new drivers. Beyond that, though, the BKL is special. It was originally
created to be more than just a locking primitive; its purpose is to allow
BKL-covered code to pretend that it is still running on an old,
uniprocessor system. So its semantics are very different from any other
lock in the Linux kernel.
For example, the BKL nests, so programmers can add lock_kernel()
calls anywhere without worrying about whether the BKL might already have
been acquired elsewhere. As with a mutex, code holding the BKL can sleep;
however, the scheduler will magically release the BKL until the holding
thread wakes up again. So there can be various threads in kernel space,
all of which think they hold the BKL, but only one of them will actually be
running at any given time. The end result is that it is hard to get a
handle on what is happening with the BKL at any given time; code can depend
on it without ever really being aware of its existence.
As Ingo Molnar put it in his kill
the BKL tree announcement:
Furthermore, the BKL is not covered by lockdep, so its dependencies
are largely unknown and invisible, and it is all lost in the haze
of the past ~15 years of code changes. All this has built up to a
kind of Fear, Uncertainty and Doubt about the BKL: nobody really
knows it, nobody really dares to touch it and code can break
silently and subtly if BKL locking is wrong.
That doesn't mean that people aren't willing to try; Ingo's tree - to which
we will return shortly - is a major
effort in that direction. But first,
consider another initiative which, somewhat accidentally, turned up an
example of just how subtle BKL-related issues can be. As was mentioned
above, the kernel grabs the BKL whenever a process opens a char device; the
BKL is held while the associated driver's open() function runs.
To eliminate the BKL, one must remove this particular use of it; one cannot
just take it out, however, without breaking every driver which does not
have proper locking internally. So, in fact, this lock_kernel()
call cannot be removed until every driver's open() function has
been audited and, if necessary, fixed. That's a big flag day.
An alternative, which your editor rashly jumped into doing, is to push the
acquisition of the BKL down one level. Every open() function is
forced to be correct through the addition of explicit
lock_kernel() and unlock_kernel() calls; once all of the
in-tree drivers have been fixed in this way, the higher-level call in
chrdev_open() can be removed. This work may seem like a step
backward, in that it replaces a single lock_kernel() call with
approximately 100 others. But it's actually a big step forward, in that
each driver can now be audited and fixed independently. This work has now
been done, the resulting tree is in linux-next, and, if all goes well, it
should be ready for 2.6.27.
While doing this work, though, your editor noticed quite a few drivers with
open functions that were either completely empty (all they do is
"return 0") or they do something relatively trivial. These
functions, one would think, do not need to acquire the BKL; they touch no
global resources and cannot possibly race with any other part of the
kernel. In fact, as was suggested by others, the empty open()
functions could just be removed altogether.
It was Alan Cox who pointed out that life
is not quite so simple. Under the current regime, an open function which
looks like this:
static int empty_open(struct inode *inode, struct file *filp)
{
        return 0;
}
is really better modeled as this:
static int empty_open(struct inode *inode, struct file *filp)
{
        lock_kernel();
        unlock_kernel();
        return 0;
}
These two may seem the same, but there is a crucial difference: in the
second form, empty_open() will not return until it can acquire the
BKL. In other words, after empty_open() runs, one knows that the BKL became available
at least once. And this matters: a classic device driver error is to
(1) register a device with the kernel, then (2) initialize all of
the internal data structures needed to manage that device. Should some
other process attempt to open and use the device between those two steps,
unpleasant things can happen. The lock_kernel() call in the
open() function, despite protecting no critical section directly,
serializes the opening of the device with the driver's initialization, and
thus prevents mayhem. So, says Alan,
I think it would be best to make them lock/unlock kernel in the
first pass and then work through them. The BKL can be subtle and
evil, but as I brought it into the world I guess I must banish it
Alan will not be alone in that effort, though, and Ingo Molnar's "kill the
BKL" tree is likely to help this work considerably. Ingo's approach is to
get rid of most of the features which make the BKL special. So, with his
patches, the BKL becomes just another mutex which, crucially, can be
tracked with the lock
validator. It is no longer released when a thread calls
schedule(), a change which forced the addition of a few explicit
"release, schedule, and reacquire" changes in code which would otherwise
deadlock. There are a number of warnings added to point out calls made
while holding the BKL which should not be. And so on.
This patch set, in essence, removes the BKL entirely, replacing it with
just another big lock which happens to do nesting. And the nesting might
go too at some point. So the BKL becomes more visible and easier to
understand. And, presumably, easier to eliminate.
Linus likes this approach, though he would
like to see it reworked to the point that it can be merged into the
mainline relatively soon. Doing that would require putting most of the
changes behind a configuration option decorated with a sufficient number of
scary warnings; then people who wanted to test this code could turn it on
and see what explodes. The number of explosions would probably be
relatively small - but probably not zero.
This set of changes, along with the other work being done, suggests that
significant progress toward the elimination of the BKL can be expected over
the next few kernel development cycles. Once it's gone, we'll have a
kernel without legacy locking issues, and without the unpleasant
performance issues that the BKL can bring. That will still take a while,
though; there is simply no substitute for actually looking at all the
BKL-covered code and ensuring that it will run safely in the absence of
that protection. It's a painstaking job requiring moderate skills which
can only be rushed so much.
Journaling filesystems come with a big promise: they free system
administrators from the need to worry about disk corruption resulting from
system crashes. It is, in fact, not even necessary to run a filesystem
integrity checker in such situations. The real world, of course, is a
little messier than that. As a recent discussion shows, it may be even
messier than many of us thought, with the integrity promises of
journaling filesystems being traded off against performance.
A filesystem like ext3 works by maintaining a journal on a dedicated
portion of the disk. Whenever a set of filesystem metadata changes are to
be made, they are first written to the journal - without changing the rest
of the filesystem. Once all of those changes have been journaled, a
"commit record" is added to the journal to indicate that everything else
there is valid. Only after the journal transaction has been committed in
this fashion can the kernel do the real metadata writes at its leisure;
should the system crash in the middle, the information needed to safely
finish the job can be found in the journal. There will be no filesystem
corruption caused by a partial metadata update.
There is a hitch, though: the filesystem code must, before writing the
commit record, be absolutely sure that all of the transaction's information
has made it to the journal. Just doing the writes in the proper order is
insufficient; contemporary drives maintain large internal caches and will
reorder operations for better performance. So the filesystem must
explicitly instruct the disk to get all of the journal data onto the media
before writing the commit record; if the commit record gets written first,
the journal may be corrupted. The kernel's block I/O subsystem makes this
capability available through the use of barriers; in essence, a barrier forbids the
writing of any blocks after the barrier until all blocks written before the
barrier are committed to the media. By using barriers, filesystems can
make sure that their on-disk structures remain consistent at all times.
There is another hitch: the ext3 and ext4 filesystems, by default, do not
use barriers. The option is there, but, unless the administrator has
explicitly requested the use of barriers, these filesystems operate
without them - though some distributions (notably SUSE) change that default.
Eric Sandeen recently decided that this was not the best situation, so he
submitted a patch changing
the default for ext3 and ext4. That's when the discussion started.
Andrew Morton's response tells a lot about
why this default is set the way it is:
Last time this came up lots of workloads slowed down by 30% so I
dropped the patches in horror. I just don't think we can quietly
go and slow everyone's machines down by this much...
There are no happy solutions here, and I'm inclined to let this dog
remain asleep and continue to leave it up to distributors to decide
what their default should be.
So barriers are disabled by default because they have a serious impact on
performance. And, beyond that, the fact is that people get away with
running their filesystems without using barriers. Reports of ext3
filesystem corruption are few and far between.
It turns out that the "getting away with it" factor is not just luck. Ted
Ts'o explains what's going on: the journal
on ext3/ext4 filesystems is normally contiguous on the physical media. The
filesystem code tries to create it that way, and, since the journal is
normally created at the same time as the filesystem itself, contiguous
space is easy to come by. Keeping the journal together will be good for
performance, but it also helps to prevent reordering. In normal usage, the
commit record will land on the block just after the rest of the journal
data, so there is no reason for the drive to reorder things. The commit
record will naturally be written just after all of the other journal log
data has made it to the media.
That said, nobody is foolish enough to claim that things will always happen
that way. Disk drives have a certain well-documented tendency to stop
cooperating at inopportune times. Beyond that, the journal is essentially
a circular buffer; when a transaction wraps off the end, the commit record
may be on an earlier block than some of the journal data. And so on. So
the potential for corruption is always there; in fact, Chris Mason has a torture-test program which can make it happen
fairly reliably. There can be no doubt that running without barriers is
less safe than using them.
Anybody can turn on barriers if they are willing to take the performance
hit. Unless, of course, their filesystem is based on an LVM volume (as
certain distributions do by default); it turns out that the device mapper
code does not pass through or honor barriers. But, for everybody else, it
would be nice if that
performance cost could be reduced somewhat. And it seems that might be possible.
The current ext3 code - when barriers are enabled - performs a sequence of
operations like this for each transaction:
1. The log blocks are written to the journal.
2. A barrier operation is performed.
3. The commit record is written.
4. Another barrier is executed.
5. Metadata writes begin at some later point.
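A user-space analogue of that per-transaction sequence can be sketched with fsync() standing in for the barrier (the file name and record format are invented for illustration):

```python
import os

def commit_transaction(journal_path, log_blocks, commit_record):
    # Append the transaction to the journal file, flushing the log blocks
    # before the commit record so the commit can never reach stable
    # storage ahead of the data it validates.
    fd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for block in log_blocks:
            os.write(fd, block)
        os.fsync(fd)             # barrier: log blocks reach the media
        os.write(fd, commit_record)
        os.fsync(fd)             # barrier: commit record reaches the media
    finally:
        os.close(fd)
```

The metadata writes of the final step would then follow at the filesystem's leisure.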
On ext4, the first barrier (step 2) can be omitted because the ext4
filesystem supports checksums on the journal. If the journal log data and
the commit record are reordered, and if the operation is interrupted by a
crash, the journal's checksum will not match the one stored in the commit
record and the transaction will be discarded. Chris Mason suggests that it would be "mostly safe" to
omit that barrier with ext3 as well, with a possible exception when the
journal wraps around.
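The checksum trick is easy to model. In this sketch the commit-record layout is invented (jbd2 actually stores a crc32 in its commit block), but the logic is the same: replay discards any transaction whose stored checksum does not match the log blocks that actually reached the disk:

```python
import zlib

def make_commit_record(log_blocks):
    # The commit record carries a checksum over the whole transaction.
    return {"type": "commit", "checksum": zlib.crc32(b"".join(log_blocks))}

def replay_transaction(log_blocks, commit_record):
    # After a crash: if a reordered write let the commit record land
    # before all of its log blocks, the checksum will not match and the
    # incomplete transaction is simply discarded.
    if zlib.crc32(b"".join(log_blocks)) != commit_record["checksum"]:
        return None
    return log_blocks  # safe to apply these metadata updates
```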
Another idea for making things faster is to defer barrier operations when
possible. If there is no pressing need to flush things out, a few
transactions can be built up in the journal and all shoved out with a
single barrier. There is also some potential for improvement by carefully
ordering operations so that barriers (which are normally implemented as
"flush all outstanding operations to media" requests) do not force the
writing of blocks which do not have specific ordering requirements.
In summary: it looks like the time has come to figure out how to make the
cost of barriers palatable. Ted Ts'o seems to
feel that way:
I think we have to enable barriers for ext3/4, and then work to
improve the overhead in ext4/jbd2. It's probably true that the
vast majority of systems don't run under conditions similar to what
Chris used to demonstrate the problem, but the default has to be
Your editor's sense is that this particular
dog is now wide awake and is likely to bark for some time. That may
disturb some of the neighbors, but it's better than letting somebody get
bitten later on.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet