The current development kernel is 2.6.31-rc2, released on July 4. It
contains a long list of bug fixes. The short-form changelog is in the
announcement, or see the full changelog for all the details.
The current stable kernel is 2.6.30.1, released on July 2. It
contains a long list of fixes (147 files changed) for serious problems in a
number of areas. The 2.6.27.25 and 2.6.29.6 updates came out at
the same time; 2.6.29.6 is the final stable update for the 2.6.29 series.
Kernel development news
We put in workarounds for broken hardware, even though we
rightfully dislike it. Similarly, in this case a minor change
prevents a real problem experienced by our users (yes, I know we
only have one public bug report).
It's bad. Yes, let's bitch about it. And workaround it anyway.
-- Rusty Russell
When the vendor of an operating system doesn't even bother to
display a clean "sorry, you don't get updates any more" page for
their OS then I think it is safe to say that the operating system
is dead and buried.
-- Andrew Tridgell
How is shipping something labelled as vfat that doesn't work
properly "fighting back". It's "trying to make the users think our
code is crap".
-- Alan Cox
The VFAT patent workaround discussion continues, focused
primarily on two points. The first is whether it is appropriate to put
patent workarounds into the kernel. The main opposition comes from Alan
Cox, who asserts that individual companies need to find ways to navigate
around country-specific legal landmines; workarounds for these problems
should not find their way into the upstream code. Alan also claims
that vendors who are worried about patent suits will not be satisfied with
a kernel configuration option; they will hack out the code entirely.
Meanwhile, Andrew Tridgell continues to push for the use of workarounds - a
strategy which has worked well for Samba. Tridge says:
The only weapon we have to fight patents right now is our
collective ability to find smart solutions to problems. The "every
vendor for themselves" approach that you seem to be so keen on
makes that collective approach pretty much impossible.
How this side of things will play out remains to be seen. Meanwhile, there
has also been an extensive technical discussion focused on interoperability
problems caused by the patch. Some of these problems, as explained by Tridge, are really the result of
various existing VFAT mount options; in some cases, even without the patch
under discussion, Linux will create long name entries for names which
appear to fit the 8.3 format. This kind of interoperability problem has
existed for quite some time.
Mount options do not explain all of the problems, though. It seems that
there are some music players out there which understand long names, but
which still require that valid 8.3 names exist. There are also
difficulties with Windows98 which may (or may not) be resolved by changes
in how the patch fills the 8.3 information. Tridge suggests that Windows98
is old enough to not be worth supporting, but not all agree on that.
The checkpatch police. Alan Cox recently proposed the addition of a new
event interface for the virtual terminal driver. Ingo Molnar responded with a list of errors from the
checkpatch.pl script, requesting that they be fixed. Alan's reply was:
Given the code doesn't currently *work* yet but is a first draft
its hardly important and there are better uses of time than playing
What followed was an extensive discussion on the value of checkpatch.pl,
whether code should be checkpatch-clean even before first submission,
whether coding style problems should be fixed piece-by-piece as other
work is done, and so on. Ingo would like to see coding style issues dealt
with early on; among other things, consistent coding style makes reviewing
the code much easier. Alan sees that sort of cleanup as a distraction, to
be done after more substantial issues (which are not lacking in the TTY
code) have been dealt with.
Consensus was not to be had in this discussion; expect this to be one of
those themes which returns regularly to linux-kernel.
Computers use a lot of zeroes. Early in your editor's programming career,
he worked on a machine that provided a special hardware register containing
zero; programmers on this system knew they could use all the zeroes they
needed with no fear of running out. Meanwhile, in this century, the Linux
kernel sets aside a page full of zeroes. It's called empty_zero_page
on the x86 architecture, and it's even exported to
modules. Interestingly, this special page is not used as heavily as it was
prior to the 2.6.24 kernel, but that may be about to change.
In the good old days, the kernel would use the zero page in situations where it knew
it needed a page full of zeroes. So, for example, if a process incurred a
read fault on a page it had never used, the kernel would simply map the
zero page into that address. A copy-on-write mapping would be used, of
course; if the process subsequently modified the page, it would end up with
its own copy. But deferring the creation of a new, zero-filled page helped
to conserve zeroes, keeping the kernel from running out. Incidentally, it
also saved memory, reduced cache pressure, and eliminated the need to clear
the new page.
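From user space, the semantics are the same either way: a private anonymous mapping reads back as zeroes until written. A minimal Linux-specific sketch (illustrative only; the kernel-side difference shows up in the process's memory footprint, not in the values read):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* A private anonymous mapping reads back as zeroes until written.
 * On pre-2.6.24 kernels, the read faults below would all map the
 * shared zero page copy-on-write; later kernels allocate a real
 * zero-filled page per fault. Either way this function returns 1:
 * the region starts out all zeroes, plus the single byte written. */
static long sum_region(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];        /* read faults: zero-filled pages */

    p[0] = 1;               /* write fault: a private copy for sure */
    sum += p[0];

    munmap(p, len);
    return sum;
}
```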
Memory management changes made back in 2007 had the effect of adding reference
counting to the zero page. And that turned out to be a problem on
multiprocessor machines. Since all processors shared the same zero page
(per-CPU differences being unlikely), they also all manipulated the
same reference count. That led to serious problems with cache line
bouncing, with a measurable performance impact. In response, Nick Piggin
evaluated a number of possible fixes, including special hacks to avoid
reference-counting the zero page or adding per-CPU zero pages. The patch
that got merged, though, simply eliminated most use of the zero page
altogether. The change was justified this way:
Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it
would not be doing many read faults of new memory, or at least it
could be expected to write to that memory soon afterwards. If cache
or memory use is critical, it should not be working with a
significant number of ZERO_PAGEs anyway (a more compact
representation of zeroes should be used).
There was some nervousness about the patch at the time; Linus grumbled
about the changes which created the problem in the first place, saying:
The kernel has *always* (since pretty much day 1) done that
ZERO_PAGE thing. This means that I would not be at all surprised if
some application basically depends on it. I've written
test-programs that depends on it - maybe people have written other
code that basically has been written for and tested with a kernel
that has basically always made read-only zero pages extra cheap.
Despite his misgivings, Linus merged the patch for 2.6.24 to see what sort
of problems might come to the surface. For the next 18 months, it
appeared that such problems were scarce indeed; most people forgot about
the zero page altogether. In early June, though, Julian Phillips reported a problem he had observed:
I have a program which creates a reasonably large private anonymous
map. The program then writes into a few places in the map, but ends
up reading from all of them.
When I run this program on a system running [a pre-2.6.24 kernel] the process
only ever seems to use enough memory to hold the data that has
actually been written (well - in units of PAGE_SIZE). When I run
the program on a system running [2.6.24 or later] then as it reads the map
the amount of memory used continues to increase until the complete
map has actually been allocated (and since the total size is
greater than the physically available RAM causes swapping).
Basically I seem to be seeing copy-on-read instead of copy-on-write
What Julian was seeing, of course, was the effects from the removal of the
zero page. On older kernels, all of the unwritten pages in the data
structure would be mapped to the zero page, using no additional physical
memory at all. As of 2.6.24, each of those pages gets an actual physical
page - containing nothing but zeroes - assigned to it, increasing memory use accordingly.
Hiroyuki Kamezawa reports that he has seen zero-page-dependent workloads at
other sites. Many of those sites, he says, are running enterprise Linux
distributions which have not yet shipped kernels new enough to lack zero
page support. He worries that these users will encounter the same sort of
unpleasant surprise Julian found when they upgrade to newer kernels. In
response, he has posted a
patch which restores zero page support to the kernel.
Hiroyuki's zero page support isn't quite the same as what came before,
though. It avoids reference counting for the zero page, a change which
should eliminate the worst of the performance problems. It does add some
interesting special cases, though, where virtual memory code has to be
careful to test for zero pages; the bulk of those cases are handled with
the addition of a get_user_pages_nonzero() function which removes
any zero pages from the indicated range. Linus dislikes the special cases, thinking that they
are unnecessary. Instead, he has proposed an alternative implementation
using the relatively new PTE_SPECIAL flag to mark zero pages. As
of this writing, an updated version of the patch using this approach has not
yet been posted.
Nick Piggin, who wrote the patch removing zero page support in the first
place, would rather not see it return.
With regard to the affected users, he asks:
Can we just try to wean them off it? Using zero page for huge
sparse matrices is probably not ideal anyway because it needs to
still be faulted in and it occupies TLB space. They might see
better performance by using a better algorithm.
Linus, however, would like to see this feature
restored if it can be done in a clean way. So the return of zero page
support seems fairly likely, assuming the patch can be worked into
sufficiently good shape. Whether that will bring comfort to
enterprise kernel users remains to be seen, though; the next generation of
enterprise Linux releases look set to use kernels around 2.6.27. Unless
distributors backport the zero page patch, enterprise Linux users will
still be stuck with the current, zero-wasting behavior.
One of the outcomes from
last year's Kernel Summit and Linux Plumbers Conference was a plan to create a low-level
ring-buffer implementation that could be shared among the various
kernel and user-space tracing solutions available for Linux. One
implementation of the common ring-buffer was released as part of 2.6.28,
but it was somewhat lock-heavy, which impacted its performance. Recently,
Steven Rostedt has proposed a
lockless ring-buffer algorithm, which would eliminate locking on
writes - the fast path for tracing.
As tracing information is gathered in the kernel, it needs to be stored
somewhere very quickly, so that the impact on the timing of the events
observed—and system performance overall—is fairly minimal.
Reading the data is done from user space, though, so it
is generally not performance-sensitive. The current ring-buffer
implementation creates a circular, doubly-linked list of pages, along with
a head and
tail pointer, so writes are done at the tail, while reads are done from the head.
If the ring-buffer gets full, or nearly so, there is the potential for
writers to overwrite data in the head page, which could corrupt data that
is being read. For this reason, there is a separate reader page, which has
been removed from the list entirely, that reader
processes can use
without being concerned about corruption from writers. But, having that
separate page requires that there be a bit of a dance whenever the reader
is done with the page and needs a new one. The reader page must be placed
back into the list somewhere after the tail, while the current head page
be removed as the new reader page, and the head page must be pushed
forward. That requires locking.
The diagram below, from Rostedt's ring-buffer-design.txt
document, gives an idea of how the ring-buffer would look. Observant
readers will note the H pointer, which marks the head page:
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Writers can be interrupted by other writers, so long as the interrupting
writer completes its write before the interrupted writer can continue.
This is in keeping with the way interrupts stack, and it is important that
it be enforced for the integrity of the ring-buffer structure. When a
write is initiated, space is reserved after the tail pointer to hold the
event. This moves the tail pointer, so another pointer, called the commit pointer, is needed to
track the latest complete write.
In nearly empty ring-buffers, it is possible for the reader page to also be
the commit and tail pages. While the reader page has been removed from the
ring-buffer, its next pointer still leads to the next ring-buffer
entry. Once enough writes are done, the commit and tail pointers will
simply follow the next pointer as they normally do.
In order to remove the locking for writers, which currently need to use
locks to synchronize updates of the head, tail, and commit pointers, Rostedt
leverages the cmpxchg() atomic operation available on some
architectures. It works as follows:
R = cmpxchg(A, C, B)
- Assign A = B if A == C
- Return A at the time of the call, unconditionally
The success of the exchange can be determined by checking whether R is
equal to C; if so, the exchange was done.
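As a plain-C model of those semantics (the real kernel primitive is a single atomic instruction; this sketch shows only the value logic, not the atomicity):

```c
/* Non-atomic model of cmpxchg(): assign *A = B only if *A == C,
 * and return the old value of *A unconditionally. The caller
 * knows the exchange happened when the return value equals C. */
static unsigned long cmpxchg_sim(unsigned long *A, unsigned long C,
                                 unsigned long B)
{
    unsigned long old = *A;

    if (old == C)
        *A = B;
    return old;
}
```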
The algorithm requires that the pointers to the linked-list structures be
4-byte aligned so that it can reserve the bottom 2 bits for flags. The two flags are:
- HEADER - the pointer is to the current head page
- UPDATE - the pointer is to a page that is currently being
written and either is, or is about to be, the head page
These flags, along with the cmpxchg()
operation, are what allow
lockless writing to the buffer.
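The flag encoding can be sketched as follows. The helper names here are illustrative rather than the actual kernel identifiers, but the trick is the one described: 4-byte alignment frees the bottom two bits of each next pointer.

```c
#include <stdint.h>

#define RB_FLAG_MASK   3UL  /* bottom two bits of a 4-byte-aligned ptr */
#define RB_PAGE_HEAD   1UL  /* "HEADER": pointer leads to the head page */
#define RB_PAGE_UPDATE 2UL  /* "UPDATE": page is currently being written */

/* Tag a page pointer with a flag value. */
static uintptr_t rb_tag(uintptr_t page, uintptr_t flag)
{
    return (page & ~RB_FLAG_MASK) | flag;
}

/* Strip the flags to recover the real pointer for list traversal. */
static uintptr_t rb_page(uintptr_t tagged)
{
    return tagged & ~RB_FLAG_MASK;
}

/* Extract just the flag bits. */
static uintptr_t rb_flags(uintptr_t tagged)
{
    return tagged & RB_FLAG_MASK;
}
```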
When the reader page has been exhausted, the current head page needs to be
detached from the ring-buffer as the new reader page. By using the
HEADER flag on the next pointer that
points to the head page, writers can keep readers from interfering without
taking a lock. When trying to change the next pointer as part of
the swapping process, readers use cmpxchg() to require that the
HEADER flag be present. Writers can prevent that from happening
by setting the flag to UPDATE or clearing the flags entirely.
When the reader's cmpxchg() fails, it means that writers have
changed the state of the ring, so the reader must look for a new head page
and start the process all over.
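Putting the pieces together, the reader's attempt to detach the head page can be sketched like this (a single-threaded model in which a non-atomic compare-and-swap stands in for the real cmpxchg()):

```c
#include <stdint.h>

#define RB_FLAG_MASK   3UL
#define RB_PAGE_HEAD   1UL   /* the "HEADER" flag */
#define RB_PAGE_UPDATE 2UL   /* the "UPDATE" flag */

/* Non-atomic stand-in for cmpxchg() on a tagged pointer. */
static uintptr_t cas(uintptr_t *a, uintptr_t expect, uintptr_t set)
{
    uintptr_t old = *a;

    if (old == expect)
        *a = set;
    return old;
}

/* The reader tries to swap its spent reader page in for the head
 * page. It insists that the link still carry the HEADER flag; if a
 * writer has switched it to UPDATE (or cleared the flags), the
 * compare fails and the reader must find the new head and retry. */
static int reader_claim_head(uintptr_t *link, uintptr_t head_page,
                             uintptr_t reader_page)
{
    uintptr_t expect = (head_page & ~RB_FLAG_MASK) | RB_PAGE_HEAD;

    return cas(link, expect, reader_page) == expect;
}
```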
When writers change to a new tail page, as they fill the buffer, they check
the next pointer of the new page for the HEADER flag. If
it is present, it is changed to UPDATE. That indicates that the
page is volatile, as writers are currently using it, and will cause the
cmpxchg() of a reader to fail, should it try to detach the head
page. This is an indication that the buffer is close to full; only one
page (i.e. the new tail page) remains for storing events.
The ring-buffer can operate in two modes, overwrite (aka "flight recorder")
mode, where new events overwrite older events when the buffer fills up, or
producer/consumer mode where writing to a full buffer causes the write to
fail. In producer/consumer mode, the head page only changes at the behest
of a reader, but in overwrite mode, once the tail page reaches the head,
the head must be pushed forward one page, which is why the UPDATE
flag must be used.
The basic function of the algorithm is relatively straightforward—if
a bit head-exploding—but there are a number of more complex scenarios
to consider. One is the possibility that nested writes cause the buffer to
fill, such that the tail page reaches the commit page. There is no
choice but to drop writes at that point, but it is possible that the commit
page is on the reader page (as shown below). Naïvely pushing the head page forward,
past the entry that the reader page points to,
would break the ability for the commit page to move from the reader page
back into the ring-buffer. So writers must check for this condition and
start dropping writes if it is detected.
reader page commit page
+---+ +---+ +---+ +---+
<---| |--->| |-H->| |--->| |--->
--->| |<---| |<---| |<---| |<---
+---+ +---+ +---+ +---+
Other complex scenarios are possible. Interested readers are directed to
Rostedt's design document for more information. It is quite detailed and
chock full of ASCII artwork depicting ring-buffer operations. The algorithm itself is
the subject of a patent application by Rostedt for Red Hat. If granted, it
will be available for free software implementations under Red Hat's patent policy.
Mathieu Desnoyers, developer of the Linux Trace
Toolkit Next Generation (LTTng), has been following the ring-buffer
submission closely, as LTTng would be one of the tracing solutions expected
to use the common ring-buffer. The proposed algorithm is complex, "near that
of RCU mechanisms", he said, but unlike RCU (or the LTTng lockless
buffer algorithm), no formal proof of the algorithm has been done.
He agrees that lockless buffers for tracing are
desirable and achievable, but he is concerned that the lack of formal
verification of Rostedt's algorithm could lead to an extended period of bug
hunting. That complexity has a bit of a silver lining, though, as he noted
in a review of the design: "The great news to me is that no one can
say LTTng's lockless buffering
algorithm is complex compared to this. ;)"
Two other concerns he mentioned are performance and fast user-space
tracing. Rostedt's algorithm depends on being able to disable preemption, which is not
possible for user-space tracing. Desnoyers said that LTTng has more
compact events which he
believes will allow the LTTng version to be able to handle more events
per second than Rostedt's, but no real performance comparisons have,
as yet, been done. Desnoyers is hopeful that he will be able to
propose an alternative
lockless ring-buffer implementation based on the LTTng code sometime soon, but
there is the small matter of a Ph.D. dissertation to
complete before that can happen.
Rostedt is targeting the 2.6.32 kernel for merging the lockless
ring-buffer; it remains to be seen whether there will be objections to its
inclusion. It may also have to fend off alternatives. Sooner or later,
though, some kind of lockless buffering for trace events seems likely to
make it into the kernel.
Making the best use of available memory is one of the biggest challenges
for any operating system. Throwing virtualization into the mix adds both
new challenges (balancing memory use between guests, for example) and
opportunities (sharing pages between guests). Developers have responded
with technologies like hot-plug memory and KSM, but nobody seems to think
that the problem is fully solved. Transcendent memory is a new
memory-management technique which, it is hoped, will improve the system's
use of scarce RAM, regardless of whether virtualization is being used.
In his linux-kernel
introduction, Dan Magenheimer asks:
What if there was a class of memory that is of unknown and
dynamically variable size, is addressable only indirectly by the
kernel, can be configured either as persistent or as "ephemeral"
(meaning it will be around for awhile, but might disappear without
warning), and is still fast enough to be synchronously accessible?
Dan (along with a list of other kernel developers) is exploring this
concept, which he calls "transcendental memory." In short, transcendental
memory can be thought of as a sort of RAM disk with some interesting
characteristics: nobody knows how big it is, writes to the disk may not
succeed, and, potentially, data written to the disk may vanish before being
read back again. At first blush, it may seem like a relatively useless
sort of device, but it is hoped that transcendental memory will be able to
improve performance in a few situations.
There is an
API specification [PDF] available; there is also a related C API found
in the patch itself. This discussion will focus on the latter, which
suffers from less EXCESSIVE CAPITAL USE and is generally easier to read.
Transcendental memory operates on the concept of page pools; once a pool is
created, data can be stored to pages within the pool. The call for
creating a pool looks like this:
u32 pool_id = tmem_new_pool(struct tmem_pool_uuid uuid, u32 flags)
Pools are identified by the uuid value, though the identification
really only matters for pools which might be shared among multiple users.
A fair amount of information is stored in the flags field, including:
- An "ephemeral" bit, which controls whether data successfully written
to the pool is allowed to disappear at a random future time.
- A "shared" bit indicating whether the pool is to be shared with other users.
- The size of pages to use in the pool, expressed as a kernel "order" value.
- A specification version number, used to ensure that both sides of the
conversation know how to understand each other.
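The actual bit layout of the flags word is defined by the tmem specification; purely for illustration, packing those four fields might look like this (the field positions below are made up, not taken from the spec):

```c
#include <stdint.h>

/* Hypothetical packing of the flags argument described above.
 * The real bit layout is defined by the tmem spec; these field
 * positions are invented purely to illustrate the idea. */
#define TMEM_EPHEMERAL (1u << 0)
#define TMEM_SHARED    (1u << 1)

static uint32_t tmem_flags(int ephemeral, int shared,
                           unsigned page_order, unsigned spec_version)
{
    return (ephemeral ? TMEM_EPHEMERAL : 0) |
           (shared ? TMEM_SHARED : 0) |
           (page_order << 2) |      /* kernel "order": 0 = one 4KB page */
           (spec_version << 8);     /* protocol version for both sides */
}
```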
While users are expected to specify a page size, there is no way
to specify the size of the pool as a whole. Determining the proper sizing
for a pool (which almost certainly changes over time) is left to the
hypervisor or whatever other software component is managing the pool.
As suggested by the above interface, transcendental memory is very much
page-based. Beyond that, it also can never be referenced directly; users
are required to copy data into and out of the pool explicitly. The
functions used for moving data between normal and transcendental memory are:
int tmem_put_page(u32 pool_id, u64 object_id, u32 page_id, unsigned long pfn);
int tmem_get_page(u32 pool_id, u64 object_id, u32 page_id, unsigned long pfn);
For both of these calls, pool_id specifies an existing pool. The
object_id and page_id values, together, form a unique
identifier for the page within the pool. If the pool is being used to
cache file pages, for example, the object_id would identify the
file, while page_id would be the offset within the file.
pfn (a page frame number) identifies the page which is the source
of the data (for
tmem_put_page()) or the destination (tmem_get_page()).
Note that either call might fail. Since the size of the pool is not known,
callers can never know in advance whether tmem_put_page() will
succeed. So any transcendental memory user must have a backup plan ready
in case the call fails. For pools marked as "ephemeral,"
tmem_get_page() is allowed to fail even if
tmem_put_page() on the same ID succeeded; in other words, the
implementation is allowed to drop pages from ephemeral pools if it decides
that the memory can be put to better use elsewhere. It's also worth noting
that, with private, ephemeral pools, tmem_get_page() will remove
the indicated page from the pool.
As an example of how this feature might be used, consider the Linux page
cache, which maintains copies of pages from disk files. When memory gets
tight, the page cache will start forgetting pages which are clean, but
which have not been referenced in the recent past. With transcendental
memory, the page cache could, before dropping the pages, attempt to store
them into an ephemeral transcendental memory pool. At some future time,
when one of those pages is needed again, the page cache would first attempt to fetch
it from the pool. If the tmem_get_page() call succeeds, a disk
I/O operation will have been avoided and everybody benefits; otherwise the
page is read from disk as usual.
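That pattern can be sketched against a toy in-memory stand-in for the proposed API. The stub below is purely illustrative: a tiny fixed-capacity pool whose puts may fail when full and whose gets, per the private-ephemeral semantics described above, remove the page; the `_stub` functions are local stand-ins, not the real implementation.

```c
#include <string.h>

/* Toy stand-in for a private, ephemeral tmem pool. Keys are the
 * (object_id, page_id) pair; "pages" are short strings here. */
#define POOL_SLOTS 2
#define PAGE_BYTES 32

struct tmem_slot {
    int used;
    unsigned long long object_id;
    unsigned page_id;
    char data[PAGE_BYTES];
};
static struct tmem_slot pool[POOL_SLOTS];

static int tmem_put_page_stub(unsigned long long obj, unsigned pg,
                              const char *data)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!pool[i].used) {
            pool[i].used = 1;
            pool[i].object_id = obj;
            pool[i].page_id = pg;
            strncpy(pool[i].data, data, PAGE_BYTES - 1);
            return 0;
        }
    }
    return -1;              /* pool full: the caller must cope */
}

static int tmem_get_page_stub(unsigned long long obj, unsigned pg,
                              char *out)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (pool[i].used && pool[i].object_id == obj &&
            pool[i].page_id == pg) {
            strcpy(out, pool[i].data);
            pool[i].used = 0;   /* private ephemeral: get removes */
            return 0;
        }
    }
    return -1;              /* dropped or never stored: hit the disk */
}
```

A page cache built on this would try the get first and fall back to a real disk read only on failure, which is exactly the control flow described above.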
Persistent (non-ephemeral) pools could be used as a sort of swap device.
If the swapping
code succeeds in writing a page to the pool, it can avoid writing it to the
real swap device. The result is saved I/O at both swap-out and swap-in
times. If the pool lacks space for the swapped page, it will be written to
the real swap device in the usual way.
Meanwhile, the transcendental memory implementation can try to optimize its
management of the memory pools. Guests which are more active (or which
have been given a higher priority) might be allowed to allocate more pages
from the pools. Duplicate pages can be coalesced; KSM-like techniques
could be used, but the use of object IDs could make it easier to detect
duplicates in a number of situations. And so on.
The API specifies a number of other operations. There are a couple of
calls to flush pages from the pool; one of them can remove all pages with a
given object ID. Sub-page-size reads and writes are supported; there is
also a tmem_xchg() call to atomically exchange data within a
transcendental memory page. See the API specification for the full list.
A number of concerns were raised in the subsequent discussion; as a result,
the above API is likely to change a bit. The biggest concern, though,
appears to be security. The potential for hostile code to tap into shared
pools and read out pages has developers worried; the need to guess a
128-bit UUID first has proved not to be sufficiently reassuring. Even with
legitimate users only, a shared pool has the potential to contain data
which should not, in reality, be shared between guests. As a result, any
transcendental memory user will have to be written to take high-level
security issues into account in low-level code.
Dan seemingly doesn't see the security problems as being as worrisome as
others do. Even so, he eventually announced that the next transcendental memory
patch would not include support for shared pools, and, indeed, version 2 lacks that feature. That feature will
probably not come back until the security issues have been thought through
and the concerns have been addressed.
Beyond that, transcendental memory will need some convincing evidence that
it improves performance before it can make it into the mainline. The
potential for improvements is clearly there; it is essentially a way for
the system to take higher-level information into account when managing its
virtual memory resources. If transcendental memory is able to fulfill that
potential in a secure way, there may well be a place for it in the mainline kernel.
Page editor: Jonathan Corbet