The current 2.6 development kernel is 2.6.29-rc1, released by Linus on
January 10. Since then, the flow of patches into the mainline git
repository has been relatively slow.
The current stable 2.6 kernel is 2.6.28; there have been no stable
updates released against this kernel yet. For 2.6.27 users, 2.6.27.11 was released, with a fair number of
fixes, on January 14.
Kernel development news
In particular, block forking is a new and untried technique for the
kernel, and far more difficult than in user space because of multiple
blocks in different, asynchronously changing states sharing the
same underlying data page. We have to rip the data page out from
underneath a bunch of buffers and slip in a new one without any of
them noticing. Kind of like the trick where you pull the table
cloth out from underneath the dinner plates, so fast that nothing
crashes to the floor. Except that we also have to copy the table
cloth and slip the copy back underneath the dinnerware before it
settles back onto the table.
-- Daniel Phillips
As usual, git is actually smarter and get things more correct than
people realize. What you found "surprising" is actually a "profound
truth". Git is like a great indian mystic. It sees past the veil
of the trivial, to find the true connections in life.
Or at least in source code.
-- Linus Torvalds
(Thanks to Nicolas Pitre)
As far as I'm concerned, digital cameras have been more useful than
kernel dumps to kernel debugging.
-- Linus Torvalds
We've long needed a filesystem named after a vegetable.
-- Andrew Morton
Linus is going to take a wholesale conversion of mutexes to
adaptive mutexes? He's gone soft. I put on my asbestos underwear
for no reason, then.
-- Nick Piggin
I'm scratching my head wondering about this `data_ptr' thing. Is
it a disk offset? Is it really a pointer to kernel memory?
According to this code it is indeed a kernel pointer, but it then
gets stuffed into an unsigned long (wtf?) and then passed to the
<reviewer throws in the towel on this part of the code>
<wonders what the -1 does>
<goes to the btrfs_lookup_xattr() definition site>
<towel goes flying again>
<gets interested in btrfs_path.reada>
<greps for a while>
It's snowing towels in here!
-- Andrew Morton
I had this strange dream that google airlines was bombing my house
-- Chris Mason
Many people complain about the problem of binary firmware blobs; the folks
at the OpenFWWF project are
doing something about it. They have just released an early implementation
of a free firmware load for Broadcom 802.11b and 802.11g boards.
"Although the base firmware is not fully 802.11 compliant, e.g., it
does not support RTS/CTS procedure or QoS, we believe that someone could be
interested in testing it. The firmware does not require the kernel to be
modified and it uses the same shared memory layout and global registers
usage of the original stuff from broadcom to ease loading by the b43 driver
(and ease our writing...).
" (Thanks to Luis Rodriguez).
Linus Torvalds released 2.6.29-rc1
and closed the 2.6.29 merge window on January 10.
A little over 2000 changesets were merged after the writing of last week's
merge window summary; this article completes the summary for this development
cycle.
Before getting into the details, though, it is worth pointing out that the
2.6.29-rc1 kernel has a couple of unusual traps for developers and
testers. If you are playing with this kernel, you should be aware of them.
So what else was merged for 2.6.29? User-visible changes include:
- At the top of the list, of course, is the merge of the Btrfs
filesystem. It cannot be repeated too many times, though, that Btrfs
is still a development filesystem. Things are changing
quickly, and it still will panic the system if you run out of space.
Now is a good time for people to play with Btrfs - especially those
who are willing to report bugs or submit enhancements. But it is not,
yet, time to entrust your Valuable Intellectual Property to this
filesystem.
- Also merged was the squashfs compressed,
read-only filesystem. Squashfs has been packaged by distributors for
years; its merger into the mainline was certainly overdue.
- There is now kernel support for WiMAX networking. The current code
supports Intel's Wireless Wimax Connection 2400m devices, but others
are expected for the future. See this
documentation file for a bit of information on the WiMAX stack.
- There are new drivers for Atmel AVR32-based Hammerhead boards,
Linear Technology LTC4245 Multiple Supply Hot Swap Controller I2C devices,
Oxford OXU210HP USB host/OTG/device controllers,
MIPS CI13412 USB controllers,
Freescale IMX USB peripheral controllers,
TI TWL4030 USB transceivers,
Dell-specific laptop backlight and rfkill devices,
ALIX.2 and ALIX.3 series LED controllers,
PIKA FPGA watchdog devices,
GE Fanuc watchdog timers, and
NXP PCF50633 multifunction chips (as seen in OpenMoko devices).
- The Blackfin architecture has gained symmetric multiprocessing
support. Also added is support for the BF51x family of processors.
- The memory controller has been extended to control swap usage as
well. Previously, it was possible for a memory-controlled group
to exhaust the system's swap space.
- The new "xenfs" virtual filesystem allows for information sharing and
control between Xen domains, the hypervisor, and the host system.
- It is now possible to create and run ext4 filesystems without a
journal. One loses the benefits of journaling, obviously, but there
is a notable increase in performance.
- The filesystem freeze
feature, allowing a suitably-privileged user to suspend changes to a
filesystem (for backup purposes, perhaps) has been merged.
Changes visible to kernel developers include:
- The exclusive I/O memory
allocation functions have been merged.
- The exports for a number of SUNRPC functions have been changed to
be GPL-only.
- The internal MTD (memory technology device) API has seen significant
changes aimed at supporting larger devices (those requiring 64-bit
offsets).
- An infrastructure for
asynchronous function calls has been merged. This code is still a
work in progress, though, and, for 2.6.29, it will not be activated in
the absence of the fastboot command-line parameter.
And that completes the set of major changes added for 2.6.29 - with one
possible exception. Linus has indicated
that he would be willing to slip in an updated version of the spinning
mutex code (as described in this
Btrfs article) if it passes review in the near future.
Arjan van de Ven's fast boot project
will be familiar to most LWN readers by now. Most of Arjan's
work has not yet found its way into the mainline, though, so most of us
still have to wait for our systems to boot the slow way. That said,
the 2.6.29 kernel will contain one piece of the fast boot work, in the form
of the asynchronous function call infrastructure. Users will need to know
where to find it, though, before making use of it.
There are many aspects to the job of making a system boot quickly. Some of
the lowest-hanging fruit can be found in the area of device probing.
Figuring out what hardware exists on the system tends to be a slow task at
best; if it involves physical actions (such as spinning up a disk) it gets
even worse. Kernel developers have long understood that they could gain a
lot of time if this device probing could, at least, be done in a parallel
manner: while the kernel is waiting for one device to respond, it can be
talking to another. Attempts at parallelizing this work over the years
have foundered, though. Problems with device ordering, concurrent access,
and more have adversely affected system stability, with the inevitable
result that the parallel code is taken back out. So early system
initialization remains almost entirely sequential.
Arjan hopes to succeed where others have failed by (1) taking a
carefully-controlled approach to parallelization which doesn't try to
parallelize everything at once, and (2) an API which attempts to hide
the effects of parallelization (other than improved speed) from the rest of
the system. For (1), Arjan has limited himself to making parts of the SCSI
and libata subsystems asynchronous, without addressing much of the rest of
the system. The API work ensures that device registration happens in the
same order as it would in a strictly sequential system. That eliminates
the irritating problems which result when one's hardware changes names from
one boot to the next.
The API is relatively simple. The code needs to include
<async.h> and create an asynchronous worker function matching
typedef void (async_func_ptr) (void *data, async_cookie_t cookie);
Here, data will be a typical private data pointer, and
cookie is an opaque synchronization value passed in by the
kernel. An asynchronous function call is made with a call to:
async_cookie_t async_schedule(async_func_ptr *ptr, void *data);
The call to the function identified by ptr will happen sometime
during or after the call to async_schedule(); in some
circumstances, it may happen synchronously. The return value is a cookie
identifying this particular asynchronous call.
Code which calls asynchronous functions will eventually want to ensure that
those functions have completed. The way to do that is through a call to:
void async_synchronize_cookie(async_cookie_t cookie);
After this call completes, all asynchronous functions called prior to the
one identified by cookie are guaranteed to have completed. Code
which makes globally-visible changes (registering devices, for example)
should synchronize in this manner first. In so doing, they ensure that any
global changes which would have happened first in a strictly-sequential
system will happen first in the asynchronous mode as well.
Code wanting to wait for all asynchronous functions to complete can call:
void async_synchronize_full(void);
This function returns when there are no asynchronous function calls in the
system. Of course, another one could always be submitted immediately
thereafter.
Internally, the implementation of asynchronous functions is reasonably
simple. There is a pair of linked lists - async_pending and
async_running - containing pending and running
function calls, respectively. A call to async_schedule() puts the
call onto the pending list and, possibly, starts a kernel thread to get the
job done. In general, there will be as many threads as there are
outstanding asynchronous function calls, within a hard-coded maximum
(currently 256). If a thread completes a function call and finds the
pending list to be empty, it will exit.
There is a special-purpose variation of this API:
async_cookie_t async_schedule_special(async_func_ptr *ptr, void *data,
struct list_head *running);
void async_synchronize_cookie_special(async_cookie_t cookie,
struct list_head *running);
void async_synchronize_full_special(struct list_head *list);
These functions allow the caller to provide a separate list to be used in
place of the async_running list. That, in turn, allows them to be
synchronized independently of any other asynchronous functions running in
the system. In 2.6.29-rc1, there is one prospective user of this API, which is, in fact,
not part of the bootstrap process: the inode deletion code in the virtual
filesystem layer. Making deletion asynchronous speeds up the process of
deleting large numbers of files. It's worth noting that, in 2.6.29, this
API also does not work quite as advertised - a shortcoming which,
presumably, will be fixed soon.
In fact, asynchronous function calls in general don't work as well as one
might have liked at the moment. This code was merged for 2.6.29-rc1, but users
immediately started reporting problems. One of those (which your editor
stumbled across) is that the process of enumerating SATA disks can be
"synchronized" while the partition enumerating process is still running,
leading to systems which fail to boot. As a result of this problem and
some other concerns, Arjan asked Linus to
disable most of the code so that it could be stabilized for 2.6.30. In the
end, the code remains in place, but it is not activated in the absence of
the new fastboot kernel parameter. So adventurous developers can
give asynchronous function calls a try; the rest of us can wait for this
feature to cook just a little longer.
The inline keyword provided by GCC has always been a bit of a
dangerous temptation for kernel programmers. In many cases, making a
function inline can help performance. In some, it is mandatory; this is
especially true for functions which encapsulate specific CPU instructions.
But, in other cases, inlining becomes a classic example of premature
optimization; at best, it does not help, while, at worst, it can
significantly bloat the size of the kernel and harm performance. Since
performance matters to kernel developers, the proper way of inlining
functions has often been a topic of discussion. The most recent debate on
the subject has made it clear, though, that there is still no real
consensus on the issue.
The discussion began as an offshoot of the spinning mutex topic when Linus
noticed that a posted kernel oops listing
showed that the __cmpxchg() function had not been inlined.
This function provides access to the x86 cmpxchg* instructions; it
should expand to a single instruction. Clearly it makes sense to inline a
single-instruction function, but, for whatever reason, GCC had decided not
to do that.
Linus quickly concluded that the fault lies with the (non-default)
CONFIG_OPTIMIZE_INLINING configuration option. This option, when
selected, makes inline into a suggestion which GCC is free to
ignore. At that point, GCC makes its own decisions, based on a set of
built-in heuristics. In this case, it decided that __cmpxchg()
was too complex to inline, so it made it into a separate function. Linus,
in disgust, asked Ingo Molnar to remove CONFIG_OPTIMIZE_INLINING
and force the compiler to honor the inline keyword.
Some other developers agreed with this request - but not all. GCC will
certainly still make mistakes, but there is also a growing feeling that,
with more recent versions of the compiler, GCC is able to make good
decisions most of the time. If GCC is also given the power to inline
functions which have not been explicitly marked by the developer, the
results can be even better. There are hazards, though, to giving GCC an
overly free hand: excessive inlining can create stack usage problems and
make debugging harder. But these are problems that some developers are
willing to accept if the benefits are strong enough.
Ingo ran a long series of tests to see what
happens when GCC is given free rein over the inlining of functions. His
results were fairly clear: recent GCC, when allowed to make its own
inlining decisions, produces a kernel that is 1-7% smaller than the kernel
which results from strictly following inline declarations. From
that data, Ingo concludes that the best
solution is to use the inlining features built into the compiler:
Today we have in excess of thirty thousand 'inline' keyword uses in
the kernel, and in excess of one hundred thousand kernel
functions. We had a decade of hundreds of inline-tuning patches
that flipped inline attributes on and off, with the goal of doing
that job better than the compiler.
Still a sucky compiler who was never faced with this level of
inlining complexity before (up to a few short months ago when we
released the first kernel with non-CONFIG_BROKEN-marked
CONFIG_OPTIMIZE_INLINING feature in it) manages to do a better job
at judging inlining than a decade of human optimizations managed to
do. (If you accept that 1% - 3% - 7.5% code size reduction in
important areas of the kernel is an improvement.)
Linus, however, is unimpressed. In his
point of view, the kernel size reduction provided by automated inlining
does not outweigh the drawbacks:
It's not about size - or necessarily even performance - at
all. It's about abstraction, and a way of writing code.
And the thing is, as long as gcc does what we ask, we can notice
when _we_ did something wrong. We can say "ok, we should just
remove the inline" etc. But when gcc then essentially flips a coin,
and inlines things we don't want to, it dilutes the whole value of
inlining - because now gcc does things that actually does hurt us.
We get oopses that have a nice symbolic back-trace, and it reports
an error IN TOTALLY THE WRONG FUNCTION, because gcc "helpfully"
inlined things to the point that only an expert can realize "oh,
the bug was actually five hundred lines up, in that other function
that was just called once, so gcc inlined it even though it is
See? THIS is the problem with gcc heuristics. It's not about
quality of code, it's about RELIABILITY of code.
The reason people use C for system programming is because the
language is a reasonably portable way to get the expected end
results WITHOUT the compiler making a lot of semantic changes
behind your back.
Linus would rather that the inline keyword be considered mandatory
by the compiler. Then, if there are too many inline functions in the
kernel (and 30,000 of them does seem like a fairly high number), the
unnecessary inline keywords should be removed. There was some
talk of adding some sort of inline_hint keyword for cases where
inlining is just a suggestion, but there is not much enthusiasm for that
idea.
The problem with the all-manual approach - even assuming that it can yield
the best results - was perhaps best
expressed by Ingo:
In this cycle alone, in the past ~2 weeks we added another 1300 inlines
to the kernel. Do we really want periodic postings of:
[PATCH 0/135] inline removal cleanups
... in the next 10 years? We have about 20% of all functions in the
kernel marked with 'inline'. It is a _very_ strong habit. Is it worth
fighting against it?
Solving excessive use of inline functions by diluting the meaning of the
inline keyword may look like a misdirected solution. But the
alternative would require much more attentive review of kernel patches
before they go into the mainline. History suggests that getting that level
of review is an uphill battle at best. History also shows that compilers
tend to be better than programmers at making this kind of decision,
especially when behavior over an entire body of code (as opposed to in a
single function) is considered. But it may be a while, yet, before the
development community as a whole is willing to put that level of trust into
the compiler.
Page editor: Jonathan Corbet