
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.29-rc1, released by Linus on January 10. Since then, the flow of patches into the mainline git repository has been relatively slow.

The current stable 2.6 kernel is 2.6.28; there have been no stable updates released against this kernel yet. For 2.6.27 users, 2.6.27.11 was released, with a fair number of fixes, on January 14.


Kernel development news

Quotes of the week

In particular, block forking is a new and untried technique for kernel, and far more difficult than user space because of multiple blocks in different, asynchronously changing states sharing the same underlying data page. We have to rip the data page out from underneath a bunch of buffers and slip in a new one without any of them noticing. Kind of like the trick where you pull the table cloth out from underneath the dinner plates, so fast that nothing crashes to the floor. Except that we also have to copy the table cloth and slip the copy back underneath the dinnerware before it settles back onto the table.
-- Daniel Phillips - filesystems really are tricky code.

As usual, git is actually smarter and get things more correct than people realize. What you found "surprising" is actually a "profound truth". Git is like a great indian mystic. It sees past the veil of the trivial, to find the true connections in life.

Or at least in source code.

-- Linus Torvalds (Thanks to Nicolas Pitre)

As far as I'm concerned, digital cameras have been more useful than kernel dumps to kernel debugging.
-- Linus Torvalds

We've long needed a filesystem named after a vegetable.
-- Andrew Morton

Linus is going to take a wholesale conversion of mutexes to adaptive mutexes? He's gone soft. I put on my asbestos underwear for no reason, then.
-- Nick Piggin


QOTW II: patch review special

I'm scratching my head wondering about this `data_ptr' thing. Is it a disk offset? Is it really a pointer to kernel memory? According to this code it is indeed a kernel pointer, but it then gets stuffed into an unsigned long (wtf?) and then passed to the mysterious read_extent_buffer().

<reviewer throws in the towel on this part of the code>

...

<wonders what the -1 does>

<goes to the btrfs_lookup_xattr() definition site>

<towel goes flying again>

...

<gets interested in btrfs_path.reada>

<greps for a while>

It's snowing towels in here!

-- Andrew Morton

I had this strange dream that google airlines was bombing my house with towels....
-- Chris Mason


Open source firmware for Broadcom wireless adapters

Many people complain about the problem of binary firmware blobs; the folks at the OpenFWWF project are doing something about it. They have just released an early implementation of a free firmware load for Broadcom 802.11b and 802.11g boards. "Although the base firmware is not fully 802.11 compliant, e.g., it does not support RTS/CTS procedure or QoS, we believe that someone could be interested in testing it. The firmware does not require the kernel to be modified and it uses the same shared memory layout and global registers usage of the original stuff from broadcom to ease loading by the b43 driver (and ease our writing...)." (Thanks to Luis Rodriguez).


2.6.29 merge window, part 2

By Jonathan Corbet
January 14, 2009
Linus Torvalds released 2.6.29-rc1 and closed the 2.6.29 merge window on January 10. A little over 2000 changesets were merged after the writing of last week's merge window summary; this article completes the summary for this development cycle.

Before getting into the details, though, it is worth pointing out that the 2.6.29-rc1 kernel has a couple of unusual traps for developers and testers. If you are playing with this kernel, you should be aware of the following:

  • The Btrfs merge brought with it the entire development history for that project. One interesting result is that, if one uses git to check out a tree within that development history, the result will be a tree containing only Btrfs. In particular, this can happen in the middle of a bisection process, yielding a tree which cannot be built or tested - almost certainly not the desired result. The solution is easy, though; simply run:

        git bisect good
    

    and continue with the bisection process as usual.

  • There is a portion of the kernel history which contains a badly broken version of reiserfs. Again, only developers running kernels from arbitrary points in the history will be affected by this problem; if you run reiserfs, though, read the summary and take care.

So what else was merged for 2.6.29? User-visible changes include:

  • At the top of the list, of course, is the merge of the Btrfs filesystem. It cannot be repeated too many times, though, that Btrfs is still a development filesystem. Things are changing quickly, and it still will panic the system if you run out of space. Now is a good time for people to play with Btrfs - especially those who are willing to report bugs or submit enhancements. But it is not, yet, time to entrust your Valuable Intellectual Property to this filesystem.

  • Also merged was the squashfs compressed, read-only filesystem. Squashfs has been packaged by distributors for years; its merger into the mainline was certainly overdue.

  • There is now kernel support for WiMAX networking. The current code supports Intel's Wireless WiMAX Connection 2400m devices, but support for others is expected in the future. See this documentation file for a bit of information on the WiMAX stack.

  • There are new drivers for Atmel AVR32-based Hammerhead boards, Linear Technology LTC4245 Multiple Supply Hot Swap Controller I2C interfaces, Oxford OXU210HP USB host/OTG/device controllers, MIPS CI13412 USB controllers, Freescale IMX USB peripheral controllers, TI TWL4030 USB transceivers, Dell-specific laptop backlight and rfkill devices, ALIX.2 and ALIX.3 series LED controllers, PIKA FPGA watchdog devices, GE Fanuc watchdog timers, and NXP PCF50633 multifunction chips (as seen in OpenMoko devices).

  • The Blackfin architecture has gained symmetric multiprocessing support. Also added is support for the BF51x family of processors.

  • The memory controller has been extended to control swap usage as well. Previously, it was possible for a memory-controlled group to exhaust swap space.

  • The new "xenfs" virtual filesystem allows for information sharing and control between Xen domains, the hypervisor, and the host system.

  • It is now possible to create and run ext4 filesystems without a journal. One loses the benefits of journaling, obviously, but there is a notable increase in performance.

  • The filesystem freeze feature, allowing a suitably-privileged user to suspend changes to a filesystem (for backup purposes, perhaps), has been merged.

Changes visible to kernel developers include:

  • The exclusive I/O memory allocation functions have been merged.

  • The exports for a number of SUNRPC functions have been changed to GPL-only.

  • The internal MTD (memory technology device) API has seen significant changes aimed at supporting larger devices (those requiring 64-bit sizes).

  • An infrastructure for asynchronous function calls has been merged. This code is still a work in progress, though, and, for 2.6.29, it will not be activated in the absence of the fastboot command-line parameter.

And that completes the set of major changes added for 2.6.29 - with one possible exception. Linus has indicated that he would be willing to slip in an updated version of the spinning mutex code (as described in this Btrfs article) if it passes review in the near future.


An asynchronous function call infrastructure

By Jonathan Corbet
January 13, 2009
Arjan van de Ven's fast boot project will be familiar to most LWN readers by now. Most of Arjan's work has not yet found its way into the mainline, though, so most of us still have to wait for our systems to boot the slow way. That said, the 2.6.29 kernel will contain one piece of the fast boot work, in the form of the asynchronous function call infrastructure. Users will need to know where to find it, though, before making use of it.

There are many aspects to the job of making a system boot quickly. Some of the lowest-hanging fruit can be found in the area of device probing. Figuring out what hardware exists on the system tends to be a slow task at best; if it involves physical actions (such as spinning up a disk) it gets even worse. Kernel developers have long understood that they could gain a lot of time if this device probing could, at least, be done in a parallel manner: while the kernel is waiting for one device to respond, it can be talking to another. Attempts at parallelizing this work over the years have foundered, though. Problems with device ordering, concurrent access, and more have adversely affected system stability, with the inevitable result that the parallel code is taken back out. So early system initialization remains almost entirely sequential.

Arjan hopes to succeed where others have failed by (1) taking a carefully-controlled approach to parallelization which doesn't try to parallelize everything at once, and (2) providing an API which attempts to hide the effects of parallelization (other than improved speed) from the rest of the system. For (1), Arjan has limited himself to making parts of the SCSI and libata subsystems asynchronous, without addressing much of the rest of the system. The API work ensures that device registration happens in the same order as it would in a strictly sequential system. That eliminates the irritating problems which result when one's hardware changes names from one boot to the next.

The API is relatively simple. The code needs to include <linux/async.h> and create an asynchronous worker function matching this prototype:

    typedef void (async_func_ptr) (void *data, async_cookie_t cookie);

Here, data will be a typical private data pointer, and cookie is an opaque synchronization value passed in by the kernel. An asynchronous function call is made with a call to:

    async_cookie_t async_schedule(async_func_ptr *ptr, void *data);

The call to the function identified by ptr will happen sometime during or after the call to async_schedule(); in some circumstances, it may happen synchronously. The return value is a cookie identifying this particular asynchronous call.

Code which calls asynchronous functions will eventually want to ensure that those functions have completed. The way to do that is through a call to:

    void async_synchronize_cookie(async_cookie_t cookie);

After this call completes, all asynchronous functions called prior to the one identified by cookie are guaranteed to have completed. Code which makes globally-visible changes (registering devices, for example) should synchronize in this manner first. In so doing, it ensures that any global changes which would have happened first in a strictly-sequential system will happen first in the asynchronous mode as well.
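The ordering guarantee can be illustrated outside the kernel. What follows is a minimal userspace sketch, assuming POSIX threads, which models async_schedule() and async_synchronize_cookie() with one thread per call. It is not the kernel's implementation; struct call, probe_disk(), run_demo(), and MAX_CALLS are names invented for the illustration. Each worker does its slow probing in parallel, then waits on its own cookie before registering, so registration order matches scheduling order:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; not the kernel implementation. */
typedef unsigned long async_cookie_t;
typedef void (async_func_ptr)(void *data, async_cookie_t cookie);

#define MAX_CALLS 16                        /* hypothetical limit for this model */

struct call {
    async_func_ptr *fn;
    void *data;
    async_cookie_t cookie;
    pthread_t thread;
    int finished;                           /* set once fn has returned */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
static struct call calls[MAX_CALLS + 1];    /* indexed by cookie; slot 0 unused */
static async_cookie_t next_cookie = 1;

static void *trampoline(void *arg)
{
    struct call *c = arg;

    c->fn(c->data, c->cookie);
    pthread_mutex_lock(&lock);
    c->finished = 1;
    pthread_cond_broadcast(&done_cv);       /* wake anybody waiting on a cookie */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Model of async_schedule(): one thread per call, cookies issued in order. */
static async_cookie_t async_schedule(async_func_ptr *fn, void *data)
{
    async_cookie_t cookie = next_cookie++;  /* only one scheduling thread here */
    struct call *c;
    int err;

    assert(cookie <= MAX_CALLS);
    c = &calls[cookie];
    c->fn = fn; c->data = data; c->cookie = cookie; c->finished = 0;
    err = pthread_create(&c->thread, NULL, trampoline, c);
    assert(err == 0);
    (void)err;
    return cookie;
}

/* Model of async_synchronize_cookie(): wait for every call scheduled earlier. */
static void async_synchronize_cookie(async_cookie_t cookie)
{
    pthread_mutex_lock(&lock);
    for (async_cookie_t c = 1; c < cookie; c++)
        while (!calls[c].finished)
            pthread_cond_wait(&done_cv, &lock);
    pthread_mutex_unlock(&lock);
}

static int registered[MAX_CALLS];           /* device ids in registration order */
static int nregistered;

/* Worker: probe in parallel, then register in scheduling order. */
static void probe_disk(void *data, async_cookie_t cookie)
{
    int id = *(int *)data;

    /* Simulated probe; lower ids spin longer, so they typically finish last. */
    for (volatile long spin = 0; spin < (3 - id) * 200000L; spin++)
        ;
    async_synchronize_cookie(cookie);       /* all earlier calls have returned */
    pthread_mutex_lock(&lock);
    registered[nregistered++] = id;         /* the "globally-visible change" */
    pthread_mutex_unlock(&lock);
}

/* Returns 1 if the three probes registered as 0, 1, 2. */
static int run_demo(void)
{
    static int ids[3] = { 0, 1, 2 };
    async_cookie_t cookies[3];
    int base = nregistered, ok = 1;

    for (int i = 0; i < 3; i++)
        cookies[i] = async_schedule(probe_disk, &ids[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(calls[cookies[i]].thread, NULL);
    for (int i = 0; i < 3; i++)
        ok = ok && registered[base + i] == i;
    return ok;
}
```

A call to run_demo() from main() should return 1 no matter which probe finishes first, since each registration waits for all earlier calls to return. Some toolchains will need -pthread on the compiler command line.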

Code wanting to wait for all asynchronous functions to complete can call:

    void async_synchronize_full(void);

This function returns when there are no outstanding asynchronous function calls in the system. Of course, another one could always be submitted immediately thereafter.

Internally, the implementation of asynchronous functions is reasonably simple. There is a pair of linked lists - async_pending and async_running - containing pending and running function calls, respectively. A call to async_schedule() puts the call onto the pending list and, possibly, starts a kernel thread to get the job done. In general, there will be as many threads as there are outstanding asynchronous function calls, within a hard-coded maximum (currently 256). If a thread completes a function call and finds the pending list to be empty, it will exit.
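The two-list bookkeeping can be modeled in a few lines. The following is a single-threaded toy, not the kernel code: struct model, struct entry, and the model_*() helpers are hypothetical names, and the rule used here (a cookie is synchronized once no call with a smaller cookie is pending or running) is taken from the API description above rather than from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long cookie_t;

struct entry {
    cookie_t cookie;
    struct entry *next;
};

struct model {
    struct entry *pending;      /* scheduled, not yet picked up by a thread */
    struct entry *running;      /* currently executing */
    cookie_t next_cookie;
};

static void push_tail(struct entry **head, struct entry *e)
{
    e->next = NULL;
    while (*head)
        head = &(*head)->next;
    *head = e;
}

static void unlink_entry(struct entry **head, struct entry *e)
{
    while (*head && *head != e)
        head = &(*head)->next;
    if (*head)
        *head = e->next;
}

/* Scheduling: hand out the next cookie and queue the call as pending. */
static cookie_t model_schedule(struct model *m, struct entry *e)
{
    assert(m && e);
    e->cookie = m->next_cookie++;
    push_tail(&m->pending, e);
    return e->cookie;
}

/* A worker thread picks up the oldest pending call and starts running it. */
static struct entry *model_start(struct model *m)
{
    struct entry *e = m->pending;

    if (e) {
        m->pending = e->next;
        push_tail(&m->running, e);
    }
    return e;
}

/* The call has returned: drop it from the running list. */
static void model_finish(struct model *m, struct entry *e)
{
    unlink_entry(&m->running, e);
}

/* Smallest cookie still pending or running; next_cookie if nothing is. */
static cookie_t lowest_in_flight(const struct model *m)
{
    cookie_t low = m->next_cookie;

    for (const struct entry *e = m->running; e; e = e->next)
        if (e->cookie < low)
            low = e->cookie;
    for (const struct entry *e = m->pending; e; e = e->next)
        if (e->cookie < low)
            low = e->cookie;
    return low;
}

/* Would a wait on this cookie return immediately? */
static int model_synchronized(const struct model *m, cookie_t cookie)
{
    return lowest_in_flight(m) >= cookie;
}
```

With three calls scheduled and the first two started, finishing the second call alone does not satisfy a wait on the third cookie; the first call must also finish. The third call may still be sitting on the pending list at that point, since a cookie only covers calls scheduled before it.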

There is a special-purpose variation of this API:

    async_cookie_t async_schedule_special(async_func_ptr *ptr, void *data,
                                          struct list_head *running);
    void async_synchronize_cookie_special(async_cookie_t cookie,
                                          struct list_head *running);
    void async_synchronize_full_special(struct list_head *list);

These functions allow the caller to provide a separate list to be used in place of the async_running list. That, in turn, allows them to be synchronized independently of any other asynchronous functions running in the system. In 2.6.29-rc1, there is one prospective user of this API, which is, in fact, not part of the bootstrap process: the inode deletion code in the virtual filesystem layer. Making deletion asynchronous speeds up the process of deleting large numbers of files. It's worth noting that, in 2.6.29, this API also does not work quite as advertised - a shortcoming which, presumably, will be fixed soon.
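The effect of a caller-supplied list can be shown with another small, single-threaded model. It is only a sketch under stated assumptions: struct domain stands in for the separate running list, schedule_in(), finish_in(), and domain_idle() are hypothetical helpers, and the cookie counter is assumed to be shared by all domains. The point is simply that a full synchronization on one domain does not have to wait for work in another.

```c
#include <assert.h>

#define MAX_IN_FLIGHT 8                 /* hypothetical limit for this model */

/* Hypothetical stand-in for a caller-supplied "running" list. */
struct domain {
    unsigned long in_flight[MAX_IN_FLIGHT];   /* cookies of unfinished calls */
    int count;
};

static unsigned long next_cookie = 1;   /* assumed shared across all domains */

/* Schedule a call whose completion is tracked only in domain d. */
static unsigned long schedule_in(struct domain *d)
{
    assert(d->count < MAX_IN_FLIGHT);
    d->in_flight[d->count++] = next_cookie;
    return next_cookie++;
}

/* The call identified by cookie has returned. */
static void finish_in(struct domain *d, unsigned long cookie)
{
    for (int i = 0; i < d->count; i++) {
        if (d->in_flight[i] == cookie) {
            d->in_flight[i] = d->in_flight[--d->count];
            return;
        }
    }
}

/* Would a "synchronize full" on this domain return immediately? */
static int domain_idle(const struct domain *d)
{
    return d->count == 0;
}
```

If one call is scheduled in a boot-time domain and another in an inode-deletion domain, finishing the latter leaves its domain idle while the boot-time domain still has work outstanding.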

In fact, asynchronous function calls in general don't work as well as one might have liked at the moment. This code was merged for 2.6.29-rc1, but users immediately started reporting problems. One of those (which your editor stumbled across) is that the process of enumerating SATA disks can be "synchronized" while the partition enumerating process is still running, leading to systems which fail to boot. As a result of this problem and some other concerns, Arjan asked Linus to disable most of the code so that it could be stabilized for 2.6.30. In the end, the code remains in place, but it is not activated in the absence of the new fastboot kernel parameter. So adventurous developers can give asynchronous function calls a try; the rest of us can wait for this feature to cook just a little longer.


Who is the best inliner of all?

By Jonathan Corbet
January 14, 2009
The inline keyword provided by GCC has always been a bit of a dangerous temptation for kernel programmers. In many cases, making a function inline can help performance. In some, it is mandatory; this is especially true for functions which encapsulate specific CPU instructions. But, in other cases, inlining becomes a classic example of premature optimization; at best, it does not help, while, at worst, it can significantly bloat the size of the kernel and harm performance. Since performance matters to kernel developers, the proper way of inlining functions has often been a topic of discussion. The most recent debate on the subject has made it clear, though, that there is still no real consensus on the issue.

The discussion began as an offshoot of the spinning mutex topic when Linus noticed that a posted kernel oops listing showed that the __cmpxchg() function had not been inlined. This function provides access to the x86 cmpxchg* instructions; it should expand to a single instruction. Clearly it makes sense to inline a single-instruction function, but, for whatever reason, GCC had decided not to do that.

Linus quickly concluded that the fault lies with the (non-default) CONFIG_OPTIMIZE_INLINING configuration option. This option, when selected, makes inline into a suggestion which GCC is free to ignore. At that point, GCC makes its own decisions, based on a set of built-in heuristics. In this case, it decided that __cmpxchg() was too complex to inline, so it made it into a separate function. Linus, in disgust, asked Ingo Molnar to remove CONFIG_OPTIMIZE_INLINING and force the compiler to honor the inline keyword.
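The difference between inline as a hint and inline as a requirement can be seen in userspace with GCC's function attributes. This is a sketch only: hint_inline, forced_inline, my_cmpxchg(), and inc_return() are names made up for the example, and the compare-and-swap is done with GCC's __sync_val_compare_and_swap() builtin rather than the kernel's own __cmpxchg() code. The always_inline attribute removes the compiler's discretion for the tiny wrapper, while the larger helper is left to GCC's heuristics.

```c
#include <assert.h>

/* Hypothetical macro names; the GCC attributes themselves are real. */
#define hint_inline   inline                                 /* GCC may decline */
#define forced_inline inline __attribute__((always_inline))  /* GCC must inline */

/* Userspace stand-in for a single-instruction wrapper such as __cmpxchg(). */
static forced_inline unsigned long
my_cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new_val)
{
    /* GCC builtin: store new_val only if *ptr == old; return the prior value. */
    return __sync_val_compare_and_swap(ptr, old, new_val);
}

/* A larger helper where leaving the decision to the compiler is harmless. */
static hint_inline unsigned long
inc_return(volatile unsigned long *ptr)
{
    unsigned long seen;

    assert(ptr);
    do {
        seen = *ptr;            /* snapshot, then try to swap in seen + 1 */
    } while (my_cmpxchg(ptr, seen, seen + 1) != seen);
    return seen + 1;
}
```

Whether GCC actually inlines inc_return() depends on the optimization level and its heuristics, which is exactly the uncertainty under discussion; my_cmpxchg(), in contrast, is required to be expanded at each call site.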

Some other developers agreed with this request - but not all. GCC will still certainly make mistakes, but there is also a growing feeling that, with more recent versions of the compiler, GCC is able to make good decisions most of the time. If GCC is also given the power to inline functions which have not been explicitly marked by the developer, the results can be even better. There are hazards, though, to giving GCC an overly free hand: excessive inlining can create stack usage problems and make debugging harder. But these are problems that some developers are willing to accept if the benefits are strong enough.

Ingo ran a long series of tests to see what happens when GCC is given free rein over the inlining of functions. His results were fairly clear: recent GCC, when allowed to make its own inlining decisions, produces a kernel that is 1-7% smaller than the kernel which results from strictly following inline declarations. From that data, Ingo concludes that the best solution is to use the inlining features built into the compiler:

Today we have in excess of thirty thousand 'inline' keyword uses in the kernel, and in excess of one hundred thousand kernel functions. We had a decade of hundreds of inline-tuning patches that flipped inline attributes on and off, with the goal of doing that job better than the compiler.

Still a sucky compiler who was never faced with this level of inlining complexity before (up to a few short months ago when we released the first kernel with non-CONFIG_BROKEN-marked CONFIG_OPTIMIZE_INLINING feature in it) manages to do a better job at judging inlining than a decade of human optimizations managed to do. (If you accept that 1% - 3% - 7.5% code size reduction in important areas of the kernel is an improvement.)

Linus, however, is unimpressed. In his view, the kernel size reduction provided by automated inlining does not outweigh the drawbacks:

It's not about size - or necessarily even performance - at all. It's about abstraction, and a way of writing code.

And the thing is, as long as gcc does what we ask, we can notice when _we_ did something wrong. We can say "ok, we should just remove the inline" etc. But when gcc then essentially flips a coin, and inlines things we don't want to, it dilutes the whole value of inlining - because now gcc does things that actually does hurt us.

We get oopses that have a nice symbolic back-trace, and it reports an error IN TOTALLY THE WRONG FUNCTION, because gcc "helpfully" inlined things to the point that only an expert can realize "oh, the bug was actually five hundred lines up, in that other function that was just called once, so gcc inlined it even though it is huge".

See? THIS is the problem with gcc heuristics. It's not about quality of code, it's about RELIABILITY of code.

The reason people use C for system programming is because the language is a reasonably portable way to get the expected end results WITHOUT the compiler making a lot of semantic changes behind your back.

Linus would rather that the inline keyword be considered mandatory by the compiler. Then, if there are too many inline functions in the kernel (and 30,000 of them does seem like a fairly high number), the unnecessary inline keywords should be removed. There was some talk of adding some sort of inline_hint keyword for cases where inlining is just a suggestion, but there is not much enthusiasm for that approach.

The problem with the all-manual approach - even assuming that it can yield the best results - was perhaps best expressed by Ingo:

In this cycle alone, in the past ~2 weeks we added another 1300 inlines to the kernel. Do we really want periodic postings of:

[PATCH 0/135] inline removal cleanups

... in the next 10 years? We have about 20% of all functions in the kernel marked with 'inline'. It is a _very_ strong habit. Is it worth fighting against it?

Solving excessive use of inline functions by diluting the meaning of the inline keyword may look like a misdirected solution. But the alternative would require much more attentive review of kernel patches before they go into the mainline. History suggests that getting that level of review is an uphill battle at best. History also shows that compilers tend to be better than programmers at making this kind of decision, especially when behavior over an entire body of code (as opposed to in a single function) is considered. But it may be a while, yet, before the development community as a whole is willing to put that level of trust into its tools.



Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds