
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.29-rc5, released on February 13. It has some driver updates and a lot of fixes. "So go out and test the heck out of it, because I'm going to spend the three-day weekend drunk at the beach. Because somebody has to do it." See the full changelog for all the details.

The current stable 2.6 kernel is 2.6.28.6, released (along with 2.6.27.18) on February 17. Both contain a long list of fixes for a variety of problems.

Previously, 2.6.28.5 and 2.6.27.16 were released on February 12. 2.6.27.17 was rushed out moments afterward with a fix to an "instant oops" problem on some laptops.

Comments (none posted)

Kernel development news

Quotes of the week

For example iSCSI: blew its early promise, pulled a bunch of unnecessary networking into the protocol and ended up too big to fit in disk firmware (thus destroying the ability to have a simple network tap to replace storage fabric). It's been slowly fading until Virtualisation came along. Now all the other solutions to getting storage into virtual machines are so horrible and arcane that iSCSI looks like a winner (if the alternative is Frankenstein's monster, Grendel's mother suddenly looks more attractive as a partner).
-- James Bottomley

I'm a few days backlogged at present, sorry. Probably because of the rain - I really should move the computer indoors.
-- Andrew Morton

When the logical extension to an answer to a problem is "Add a configuration option to almost every driver", you might want to rethink.
-- Matthew Garrett

Comments (none posted)

From wakelocks to a real solution

By Jonathan Corbet
February 18, 2009
Last week's article on wakelocks described a suspend-inhibiting interface derived from the Android project, along with the hostile reaction that interface received. Since then, the discussion has continued in two separate threads. Kernel developers, like engineers everywhere, are problem solvers, so the discussion has shifted away from criticism of wakelocks and toward the search for an acceptable solution. As of this writing, that solution does not exist, but we have learned some interesting things about the problem space.

Getting Linux power management to work well has been a long, drawn-out process, much of which involves fixing device drivers and applications, one at a time. There is also a lot of work which has gone into ensuring that the CPU remains in an idle state as much as possible. One of the reasons that some developers found the wakelock interface jarring was that the Android developers chose a different approach to power management. Rather than minimize power consumption at any given time, the Android code simply tries to suspend the entire device whenever possible. There are a couple of reasons for this approach, one of which we will get to below.

But we'll start with a very simple reason why Android goes for the "suspend the entire world" solution: because they can. The hardware that Android runs on, like many embedded systems (but unlike most x86-based systems), has been designed to suspend and resume quickly. So the Android developers see no reason to do things any other way. But that leads to comments like this one from Matthew Garrett:

Part of the reason you're getting pushback is that your solution to the problem of shutting down unused hardware is tied to embedded-style systems with very low resume latencies. You can afford to handle the problem by entering an explicit suspend state. In the x86 mobile world, we don't have that option. It's simply too slow and disruptive to the user experience. As a consequence we're far more interested in hardware power management that doesn't require an explicit system-wide suspend.

A solution that's focused on powering down as much unused hardware as possible regardless of the system state benefits the x86 world as well as the embedded world, so I think there's a fairly strong argument that it's a better solution than one requiring an explicit system state change.

Matthew also notes that it's possible to solve the power management problem without fully suspending the system; he gives the Nokia tablets as an example of a successful implementation which uses finer-grained power management.

That said, it seems clear that the full-suspend approach to power management is not going to go away. Some hardware is designed to work best that way, so Linux needs to support that mode of operation. There has thus been some talk about how to design wakelocks in a way which fits better into the kernel as a whole. On the kernel side, there is some dispute as to whether the wakelock mechanism is needed at all; drivers can already inhibit an attempt by the kernel to suspend the system by failing their suspend callbacks. But there is some merit to the claim that it's better if the kernel knows in advance that it can't suspend the system, without having to poll every driver.
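
That veto mechanism looks roughly like the following in a platform driver of this era; the device structure and flag here are hypothetical, but the convention - a nonzero return from the suspend callback aborts the operation - is real:

    #include <linux/platform_device.h>

    /* Hypothetical driver state - illustration only. */
    struct example_dev {
        int transfer_in_progress;
    };

    static int example_suspend(struct platform_device *pdev, pm_message_t state)
    {
        struct example_dev *edev = platform_get_drvdata(pdev);

        /* A nonzero return vetoes the suspend attempt; the PM core
         * will resume the devices it has already suspended. */
        if (edev->transfer_in_progress)
            return -EBUSY;

        /* ... otherwise quiesce the hardware ... */
        return 0;
    }

The weakness is just what the paragraph above describes: the kernel learns of the veto only after it has already begun the suspend attempt and polled every driver.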

One solution, proposed by Matthew, would be a simple pair of functions: inhibit_suspend() and uninhibit_suspend(). On production systems, they would manipulate an atomic counter; when the counter is zero, the system can be suspended. These functions could take a device structure as an argument; debugging versions could then track which devices are blocking a suspend at any given time. The user-space equivalent could be a file like /dev/inhibit_suspend; as long as at least one process holds that file open, the system will continue to run. All told, it looks like a simple API without many of the problems seen in the wakelock code.
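
A minimal sketch of how that might look; the two function names and the atomic counter come from the proposal, while everything else is guesswork:

    #include <linux/device.h>
    #include <asm/atomic.h>

    static atomic_t suspend_inhibit_count = ATOMIC_INIT(0);

    void inhibit_suspend(struct device *dev)
    {
        /* A debugging version would record 'dev' here so that the
         * devices currently blocking suspend can be listed. */
        atomic_inc(&suspend_inhibit_count);
    }

    void uninhibit_suspend(struct device *dev)
    {
        atomic_dec(&suspend_inhibit_count);
    }

    /* The suspend path then needs only a single, cheap test: */
    int suspend_allowed(void)
    {
        return atomic_read(&suspend_inhibit_count) == 0;
    }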

There were a few complaints from the Android side, but the biggest sticking point appears to be over timeouts. The wakelock API implements an automatic timeout which causes the "lock" to go away after a given time. There appear to be a few reasons for the existence of the timeouts:

  • Since not all drivers use the wakelock API, timeouts are required to prevent suspending the system while those drivers are running. The proposed solution to this one is to instrument all of the drivers which need to keep the system running. Once an acceptable API is merged into the kernel, drivers can be modified as needed.

  • If a process holding a wakelock dies unexpectedly, the timeout will keep the system running while the watchdog code restarts the faulting process. The problem here is that timeouts encode a recovery policy in the kernel and do little to ensure that operation is actually correct. What has been proposed instead is that the user-space "inhibit suspend" policy be encapsulated into a separate daemon which would make the decisions on when to keep the system awake; a file-based interface makes crash recovery automatic, as the sketch after this list shows.

  • User-space applications may simply screw up and forget to allow the system to suspend.
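
The user-space side of that daemon proposal is easy to visualize. Holding /dev/inhibit_suspend open keeps the system awake; if the process dies, the kernel closes the descriptor and the "lock" evaporates with no timeout needed. A minimal sketch (the device name comes from the proposal above; error handling is omitted):

    #include <fcntl.h>
    #include <unistd.h>

    static void run_while_awake(void (*work)(void))
    {
        int fd = open("/dev/inhibit_suspend", O_RDONLY);

        work();    /* the system will not suspend while fd is open */

        /* Suspend is allowed again once the descriptor goes away -
         * whether by this close() or by the process dying first. */
        close(fd);
    }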

The final case in that list is also used as an argument for the full-suspend approach to power management. Even if an ill-behaved application goes into a loop and refuses to quit, the system will eventually suspend and save its battery anyway. This is an argument which does not fly particularly well with a lot of kernel developers, who respond that, rather than coding the kernel to protect against poor applications, one should simply fix those applications. Arjan van de Ven points out that, since the advent of PowerTop, the bulk of the problems with open-source applications have been fixed.

In this space, though, it is harder to get a handle on all of these problems. Brian Swetland describes the situation this way:

  • carrier deploys a device
  • carrier agrees to allow installation of arbitrary third party apps without some horrible certification program requiring app authors to jump through hoops, wait ages for approval, etc
  • users rejoice and install all kinds of apps
  • some apps are poorly written and impact battery life
  • users complain to carrier about battery life

Matthew also acknowledges the problem:

Remember that Android has an open marketplace designed to appeal to Java programmers - users are going to end up downloading code from there and then blaming the platform if their battery life heads towards zero. I think "We can't trust our userland not to be dumb" is a valid concern.

It is a real problem, but it still is not at all clear that attempts to fix such problems in the kernel are advisable - or that they will be successful in the end. Ben Herrenschmidt offers a different solution: a daemon which monitors application behavior and warns the user when a given application is seen to be behaving badly. That would at least let users know where the real problem is. But it is, of course, no substitute for the real solution: run open-source applications on the phone so that poor behavior can be fixed by users if need be.

The Android platform is explicitly designed to enable proprietary applications, though. It may prove to be able to attract those applications in a way which standard desktop Linux has never quite managed to do. So some sort of solution to the problem of power management in the face of badly-written applications will need to be found. The Android developers like wakelocks as that solution for now, but they also appear to be interested in working with the community to find a more globally-acceptable solution. What that solution will look like, though, is unlikely to become clear without a lot more discussion.

Comments (16 posted)

Getting the measure of ksize()

By Jonathan Corbet
February 17, 2009
One of the lesser-known functions supported by the kernel's memory management code is ksize(); given a pointer to an object allocated with kmalloc(), ksize() will return the size of that object. This function is not often needed; callers of kmalloc() usually know how much memory they requested. It can be useful, though, in situations where a function needs to know the size of an object and does not have that information handy. As it happens, there are other potential uses for ksize(), but there are traps as well.
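
A quick sketch of the one legitimate usage pattern - applying ksize() to memory that is known to have come from kmalloc():

    #include <linux/kernel.h>
    #include <linux/slab.h>

    static void ksize_demo(void)
    {
        char *buf = kmalloc(100, GFP_KERNEL);

        if (!buf)
            return;

        /* May report more than the 100 bytes requested; see below. */
        printk(KERN_INFO "usable size: %zu\n", ksize(buf));
        kfree(buf);
    }

Passing anything else - an object from kmem_cache_alloc(), say - is exactly the trap discussed next.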

Users of ksize() in the mainline kernel are rare. Until 2008, the main user was the nommu architecture code, which was found to be using ksize() in a number of situations where that use was not appropriate. The result was a cleanup of the nommu code and the un-exporting of ksize() in an attempt to prevent that sort of situation from coming about again.

Happiness prevailed until recently; the 2.6.29-rc5 kernel includes a patch to the crypto code which makes use of ksize() to ensure that crypto_tfm structures are completely wiped of sensitive data before being returned to the system. The lack of an export for ksize() caused the crypto code to fail when built as a module, so Kirill Shutemov posted a patch to export it. That's when the discussion got interesting.

There was resistance to restoring the export for ksize(); the biggest problem would appear to be that it is an easy function to use incorrectly. It is only really correct to call ksize() with a pointer obtained from kmalloc(), but programmers seem to find themselves tempted to use it on other types of objects as well. This situation is not helped by the fact that the SLAB and SLUB memory allocators work just fine if any slab-allocated memory object is passed to ksize(). The SLOB allocator, however, is not so accommodating. An explanation of this situation led to some complaints from Andrew Morton:

OK. This is really bad, isn't it? People will write code which happily works under slab and slub, only to have it crash for those small number of people who (very much later) test with slob?

[...]

Gee this sucks. Biggest mistake I ever made. Are we working hard enough to remove some of these sl?b implementations? Would it help if I randomly deleted a couple?

Thus far, no implementations have been deleted; indeed, it appears that the SLQB allocator is headed for inclusion in 2.6.30. The idea of restricting access to ksize() has also not gotten very far; the export of this function was restored for 2.6.29-rc5. In the end, the kernel is full of dangerous functions - such is the nature of kernel code - and it is not possible to defend against every mistake which a kernel developer might make. As Matt Mackall put it, this kind of misuse is simply a category error:

And it -is- a category error. The fact that kmalloc is implemented on top of kmem_cache_alloc is an implementation detail that callers should not assume. They shouldn't call kfree() on kmem_cache_alloc objects (even though it might just happen to work), nor should they call ksize().

There is another potential reason to keep this function available: ksize() may prove to have a use beyond freeing developers from the need to track the size of allocated objects. One poorly-kept secret about kmalloc() is that it tends to allocate objects which are larger than the caller requests. A quick look at /proc/slabinfo will (with the right memory allocator) reveal a number of caches with names like kmalloc-256. Whenever a call to kmalloc() is made, the requested size will be rounded up to the next slab size, and an object of that size will be returned. (Again, this is true for the SLAB and SLUB allocators; SLOB is a special case).
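
For example (a sketch; the exact size classes depend on the allocator and its configuration):

    #include <linux/kernel.h>
    #include <linux/slab.h>

    static void rounding_demo(void)
    {
        void *p = kmalloc(200, GFP_KERNEL);    /* falls into kmalloc-256 */

        if (!p)
            return;

        /* With SLAB or SLUB this typically prints 256: the 200-byte
         * request was rounded up to the next size class, leaving
         * 56 bytes of slack beyond what was asked for. */
        printk(KERN_INFO "usable size: %zu\n", ksize(p));
        kfree(p);
    }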

This rounding-up results in a simpler and faster allocator, but those benefits are gained at the cost of some wasted memory. That is one of the reasons why it makes sense to create a dedicated slab for frequently-allocated objects. There is one interesting allocation case which is stuck with kmalloc(), though, for DMA-compatibility reasons: SKB (network packet buffer) allocations.

An SKB is typically sized to match the maximum transfer size for the intended network interface. In an Ethernet-dominated world, that size tends to be 1500 bytes. A 1500-byte object requested from kmalloc() will typically result in the allocation of a 2048-byte chunk of memory; that is over 500 bytes of wasted RAM per packet. As it happens, though, the network developers really need the SKB buffer not to cross page boundaries, so there is generally no way to avoid that waste.

But there may be a way to take advantage of it. Occasionally, the network layer needs to store some extra data associated with a packet; IPSec, it seems, is especially likely to create this type of situation. The networking layer could allocate more memory for that data, or it could use krealloc() to expand the existing buffer allocation, but both will slow down the highly-tuned networking core. What would be a lot nicer would be to just use some extra space that happened to be lying around. With a buffer from kmalloc(), that space might just be there. The way to find out, of course, is to use ksize(). And that's exactly what the networking developers intend to do.
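
A sketch of how such code might check for that space - the helper here is hypothetical, but the pattern (measure the real allocation with ksize(), then use the slack) is the one under discussion:

    #include <linux/slab.h>

    /* Can the slack kmalloc() left behind hold 'extra' more bytes
     * beyond the 'used' portion of the buffer?  Legitimate only
     * because 'buf' is known to come from kmalloc(). */
    static int has_room_for_extra(const void *buf, size_t used, size_t extra)
    {
        return ksize(buf) >= used + extra;
    }

Only when such a check fails would the networking code need to fall back to a separate allocation or krealloc().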

Not everybody is convinced that this kind of trick is worth the trouble. Some argue that the extra space should be allocated explicitly if it will be needed later. Others would like to see some benchmarks demonstrating that there is a real-world benefit from this technique. But, in the end, kernel developers do appreciate a good trick. So ksize() will be there should this kind of code head for the mainline in the future.

Comments (5 posted)

Interview: the return of the realtime preemption tree

By Jonathan Corbet
February 16, 2009
The realtime preemption project is a longstanding effort to provide deterministic response times in a general-purpose kernel. Much code resulting from this work has been merged into the mainline kernel over the last few years, and a number of vendors are shipping commercial products based upon it. But, for the last year or so, progress toward getting the rest of the realtime work into the mainline has slowed.

On February 11, realtime developers Thomas Gleixner and Ingo Molnar resurfaced with the announcement of a new realtime preemption tree and a newly reinvigorated development effort. Your editor asked them if they would be willing to answer a few questions about this work; their response went well beyond the call of duty. Read on for a detailed look at where the realtime preemption tree stands and what's likely to happen in the near future.

LWN: The 2.6.29-rc4-rt1 announcement notes that you're coming off a 1.5-year sabbatical. Why did you step away from the RT patches for so long; have you been hanging out on the beach in the meantime? :)

Thomas: We spent a marvelous time at the x86 lagoon, a place with an extreme contrast of antiquities and modern art. :)

Seriously, we underestimated the amount of work which was necessary to bring the unified x86 architecture into shape. Nothing to complain about; it definitely was and still is a worthwhile effort and I would not hesitate longer than a fraction of a second to do it again.

Ingo: Yeah, hanging out on the beach for almost two years was well-deserved for both of us. We met Linus there and it was all fun and laughter, with free beach cocktails, pretty sunsets and camp fires. [ All paid for by the nice folks from Microsoft btw., - those guys sure know how to please a Linux kernel hacker! ;-) ]

So what has brought you back to the realtime work at this time?

Thomas: Boredom and nostalgia :) In fact I never lost track of the realtime work after we took over x86 maintenance, but my effort was restricted to decoding hard-to-solve problems and making sure that the patches were kept in usable shape. Right now I have the feeling that we need to put more development effort into preempt-rt again to keep up its upstream visibility and make progress on merging the remaining parts.

The most important reason for returning was, of course, our editor's challenge in The Grumpy Editor's guide to 2009: "The realtime patch set will be mostly merged by the end of the year..."

Ingo: When we left for the x86 land more than 1.5 years ago, the -rt patch queue was a huge pile of patches that changed hundreds of critical kernel files and introduced or touched ten thousand new lines of code. Fast-forward 1.5 years and the -rt patch queue is a humongous pile of patches that changes nearly a thousand critical kernel files and introduces or touches twenty to thirty thousand lines of code. So we thought that, while the project was growing nicely - it is useful and obviously people love it - the direction of growth was a bit off, and this particular area needed some help.

Initially it started as a thought experiment of ours: how much time and effort would it take to port the most stable -rt patch (.26-rt15) to the .29-tip tree and could we get it to boot? Turns out we are very poor at thought experiments (just like we are pretty bad at keeping patch queues small), so we had to go and settle the argument via some hands-on hacking. Porting the queue was serious fun, it even booted after a few dozen fixes, and the result was the .29-rt1 release.

Maintaining the x86 tree for such a long time and doing many difficult conceptual modernizations in that area was also very helpful when porting the -rt patch-queue to latest mainline.

Most of the code it touched and most of the conflicts that came up looked strangely familiar to us, as if those upstream changes went through our trees =;-)

(It's certainly nothing compared to the beach experience though, so we are still looking at returning for a few months to a Hawaii cruise.)

How well does the realtime code work at this point? What do you think are the largest remaining issues to be tackled?

Thomas: The realtime code has reached quite a stable state. The 2.6.24/26-based versions can definitely be considered production-ready. I spent a lot of time sorting out a huge number of details in those versions to make them production-stable. Still, we need to refactor a lot of the patches and look for mainline-acceptable solutions for some of the realtime-related changes.

Ingo: To me, what settled quite a bit of the "do we need -rt in mainline" question were the spin-mutex enhancements it got. Prior to that, there were a handful of pretty pathological workload scenarios where -rt performance tanked relative to mainline. With those enhancements it's all pretty comparable.

The patch splitup and patch quality has improved too, and the queue we ported actually builds and boots at just about every bisection point, so it's pretty usable. A fair deal of patches fell out of the .26 queue because they went upstream meanwhile: tracing patches, scheduler patches, dyntick/hrtimer patches, etc.

It all looks a lot less scary now than it looked 1.5 years ago - albeit the total size is still considerable, so there's definitely still a ton of work with it.

What are your current thoughts with regard to merging this work into the mainline?

Thomas: First of all we want to integrate the -rt patches into our -tip git repository, which makes it easier to keep -rt in sync with ongoing mainline development. The next steps are to gradually refactor the patches, either by rewriting them or, preferably, by pulling in the work which was done in Steven Rostedt's git-rt tree, then to split out parts which are ready and merge them upstream step by step.

Ingo: IMO the key thought here is to move the -rt tree 'ahead of the upstream development curve' again, and to make it the frontier of Linux R&D. With a 2.6.26 basis that was arguably hard to do. With a partly-2.6.30 basis (which the -tip tree really is) it's a lot more ahead of the curve, and there are a lot more opportunities to merge -rt bits into upstream bits wherever there's incidental upstream activity that we could hang -rt related cleanups and changes onto. We jumped almost four full kernel releases; that moves -rt across a year's worth of upstream development and keeps it at that leading edge.

Another factor is that most of the top -rt contributors are also -tip contributors so there's strong synergy.

The -tip tree also undergoes serious automated stabilization and productization efforts, so it's a good basis for development _and_ for practical daily use. For example there were no build failures reported against .29-rt1, and most of the other failures that were reported were non-fatal as well and were quickly fixed. One of the main things we learned in the past 1.5 years was how to keep a tree stable against a wild, dangerous looking flux of modifications.

YMMV ;-)

Thomas once told me about a scheme to patch rtmutex locks into/out of the kernel at boot time, allowing distributors to ship a single kernel which can run in either realtime or "normal" mode. Is that still something that you're working on?

Thomas: We are not working on that right now, but it is still on the list of things which need to be investigated.

Ingo: That still sounds like an interesting feature, but it's pretty hard to pull it off. We used to have something rather close to that, a few years ago: a runtime switch that turned the rtmutex code back into spinning code. It was fragile and hard to maintain and eventually we dropped it.

Ideally it should be done not at boot time but runtime - via the stop-machine-run mechanism or so. [extended perhaps with hibernation bits that force each task into hitting user-mode, so that all locks in the system are released]

It's really hard to implement it, and it is definitely not for the faint hearted.

The RT-preempt code would appear to be one of the biggest exceptions to the "upstream first" rule, which urges code to be merged into the mainline before being shipped to customers. How has that worked out in this case? Are there times when it is good to keep shipping code out of the mainline for such a long time?

Thomas: It is an exception which was only acceptable because preempt-rt does not introduce new user space APIs. It just changes the run time behaviour of the kernel to a deterministic mode.

All changes which are user space API related (e.g. PI futexes) were merged into mainline before they got shipped to customers via preempt-rt and all bug fixes and improvements of mainline code were sent upstream immediately. Preempt-rt was never a detached project which did not care about mainline.

When we started preempt-rt there was huge demand on the customer side - both enterprise and embedded - for an in-kernel realtime solution. The dual-kernel approaches of RTAI, RTLinux, and Xenomai never had a chance of being accepted into the mainline, and handling a dual-kernel environment has never been an easy task. With preempt-rt you just switch the kernel under a stock mainline user-space environment and, voila, your application behaves as you would expect - most of the time :) Dual-kernel environments require different libraries and different APIs, and you cannot run the same binary on a non-rt-enabled kernel. Debugging preempt-rt-based realtime applications is exactly the same as debugging non-realtime applications.

While we never had doubts that it would be possible to turn Linux into a realtime OS, it was clear from the very beginning that it would be a long way until the last bits and pieces got merged. The first question Ingo asked me when I contacted him in the early days of preempt-rt was: "Are you sure that you want to touch every part of the kernel while working on preempt-rt?" This question was absolutely legitimate; in the first days of preempt-rt we really did touch every part of the kernel due to problems which were mostly locking- and preemption-related. The fixes have been merged upstream, and especially in the locking area mainline has improved hugely thanks to lock debugging, the conversion to mutexes, etc., and a generally better awareness of locking and preemption semantics.

preempt-rt was always a great breeding ground for fundamental changes in the kernel, and quite a large part of the preempt-rt development has already been integrated into the mainline: PI-futexes, high-resolution timers ... I hope we can keep that up and soon provide more interesting technological changes which originally emerged from the preempt-rt effort.

Ingo: Preempt-rt turns the kernel's scheduling, lock handling and interrupt handling code upside down, so there was no realistic way to merge it all upstream without having had some actual field feedback. It is also unique in that you need _all_ those changes to have the new kernel behavior - there's no real gradual approach to the -rt concept itself. That adds up to a bit of a catch-22: you don't get it upstream without field use, and you don't get field use without it being upstream.

Deterministic execution is a major niche, one which was not effectively covered by the mainstream kernel before. It's perhaps the last major technological niche in existence that the stock upstream kernel does not handle yet, and it's no wonder that the last one out is in that precise position for conceptually hard reasons.

In short: all the easy technologies are upstream already ;-)

Nevertheless we strictly got all user-ABI changes upstream first: PI-futexes in particular. The rest of -rt is "just" a new kernel option that magically turns kernel execution into deterministic mode.

Where would be the best starting point for a developer who wishes to contribute to this effort?

Thomas: Nothing special with the realtime patches. Just kernel development as usual: get the patches, apply them, run them on your machine and test. If problems arise, provide bug reports or try to fix them yourself and send patches. Read through the code and start providing improvements, cleanups ...

Ingo: Beyond the "try it yourself, follow the discussions, and go wherever your heart tells you to go" suggestion, there's a few random areas that might need more attention:

  • Big Kernel Lock removal. It's critical for -rt. We still have the tip:core/kill-the-BKL branch, and if someone is interested it would be nice to drive that effort forward. A lot of nice help-zap-the-BKL patches went upstream recently (such as the device-open patches), so we are in a pretty good position to try the kill-the-BKL final hammer approach too.

    [I have just done a (raw!) refresh and conflict resolution merge of that tree to v2.6.29-rc5. Interested people can find it at:

          git pull \
            git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git \
            core/kill-the-BKL
    
    Warning: it might not even build. ]

  • Look at Steve's git-rt tree and split out and gradually merge bits. A fair deal of stuff has been cleaned up there and it would be nice to preserve that work.

  • Latency measurements and tooling. Go try the latency tracer, the function graph tracer and ftrace in general. Try to find delays in apps caused by the kernel (or caused by the app itself), and think about whether the kernel's built-in tools could be improved.

  • Try Thomas's cyclictest utility and try to trace and improve those worst-case latencies. A nice target would be to push the worst-case latencies on a contemporary PC below 10 microseconds. We were down to about 13 microseconds in .29-rt1 with a hack that threaded the timer IRQ, so going below 10 microseconds should be possible, I think.

  • And of course: just try to improve the mainline kernel - that will improve the -rt kernel too, by definition :-)

But as usual, follow your own path. Independent, critical thinking is a lot more valuable than follow-the-crowd behavior. [As long as it ends up producing patches (not flamewars) that is ;-)]

And by all means, start small and seek feedback on lkml early and often. Being a good and useful kernel developer is not an attribute but a process, and good processes always need time, many gradual steps and a feedback loop to thrive.

Many thanks to Thomas and Ingo for taking the time to answer (in detail!) this long list of questions.

Comments (19 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux v2.6.29-rc5
Thomas Gleixner 2.6.29-rc4-rt1
Thomas Gleixner 2.6.29-rc4-rt2
Greg KH Linux 2.6.28.6
Greg KH Linux 2.6.28.5
Greg KH Linux 2.6.27.18
Greg KH Linux 2.6.27.17
Greg KH Linux 2.6.27.16

Architecture-specific

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Johannes Weiner kzfree()
Johannes Weiner kzfree() v2

Networking

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds