Brief items
The current 2.6 development kernel is 2.6.29-rc5,
released on February 13.
It has some driver updates and a lot of fixes. "
So go out and test
the heck out of it, because I'm going to spend the three-day weekend drunk
at the beach. Because somebody has to do it." See
the
full changelog for all the details.
The current stable 2.6 kernel is 2.6.28.6, released (along with 2.6.27.18) on February 17.
Both contain a long list of fixes for a variety of problems.
Previously, 2.6.28.5 and 2.6.27.16 were released on
February 12. 2.6.27.17
was rushed out moments afterward with a fix to an "instant oops" problem
on some laptops.
Comments (none posted)
Kernel development news
For example iSCSI: blew its early promise, pulled a bunch of
unnecessary networking into the protocol and ended up too big to
fit in disk firmware (thus destroying the ability to have a simple
network tap to replace storage fabric). It's been slowly fading
until Virtualisation came along. Now all the other solutions to
getting storage into virtual machines are so horrible and arcane
that iSCSI looks like a winner (if the alternative is
Frankenstein's monster, Grendel's mother suddenly looks more
attractive as a partner).
--
James Bottomley
I'm a few days backlogged at present, sorry. Probably because of
the rain - I really should move the computer indoors.
--
Andrew Morton
When the logical extension to an answer to a problem is "Add a
configuration option to almost every driver", you might want to
rethink.
--
Matthew Garrett
Comments (none posted)
By Jonathan Corbet
February 18, 2009
Last week's article on
wakelocks described a suspend-inhibiting interface which derives
from the Android project, along with the hostile reaction that interface received.
Since then, the discussion has continued in two separate threads. Kernel
developers, like engineers everywhere, are problem solvers, so the
discussion has shifted away from criticism of wakelocks and toward the
search for an acceptable solution. As of this writing, that solution does
not exist, but we have learned some interesting things about the problem
space.
Getting Linux power management to work well has been a long, drawn-out
process, much of which involves fixing device drivers and applications, one
at a time. There is also a lot of work which has gone into ensuring that
the CPU remains in an idle state as much as possible. One of the reasons
that some developers found the wakelock interface jarring was that the
Android developers chose a different approach to power management. Rather
than minimize power consumption at any given time, the Android code simply
tries to suspend the entire device whenever possible. There are a couple
of reasons for this approach, one of which we will get to below.
But we'll start with a very simple reason why Android goes for the "suspend the entire
world" solution: because they can. The hardware that Android runs on, like
many embedded systems (but unlike most x86-based systems), has been
designed to suspend and resume quickly. So
the Android developers see no reason to do things any other way. But that
leads to comments like this one from Matthew
Garrett:
Part of the reason you're getting pushback is that your solution to
the problem of shutting down unused hardware is tied to
embedded-style systems with very low resume latencies. You can
afford to handle the problem by entering an explicit suspend
state. In the x86 mobile world, we don't have that option. It's
simply too slow and disruptive to the user experience. As a
consequence we're far more interested in hardware power management
that doesn't require an explicit system-wide suspend.
A solution that's focused on powering down as much unused hardware
as possible regardless of the system state benefits the x86 world
as well as the embedded world, so I think there's a fairly strong
argument that it's a better solution than one requiring an explicit
system state change.
Matthew also notes that it's possible to solve the power management problem
without fully suspending the system; he gives the Nokia tablets as an
example of a successful implementation which uses finer-grained power
management.
That said, it seems clear that the full-suspend approach to power
management is not going to go away. Some hardware is designed to work best
that way, so Linux needs to support that mode of operation. So there has
been some talk about how to design wakelocks in a way which fits better
into the kernel as a whole. On the kernel side, there is some dispute as
to whether the wakelock mechanism is needed at all; drivers can already
inhibit an attempt by the kernel to suspend the system. But there is some
justice to the claim that it's better if the kernel knows it can't suspend
the system without having to poll every driver.
One simple solution, proposed by Matthew,
would be a simple pair of functions: inhibit_suspend() and
uninhibit_suspend(). On production systems, they would
manipulate an atomic counter; when the counter is zero, the system can be
suspended. These functions could take a device structure as an
argument; debugging versions could then track which devices are blocking a
suspend at any given time. The user-space equivalent could be a file like
/dev/inhibit_suspend; as long as at least one process holds that
file open, the system will continue to run. All told, it looks like a
simple API without many of the problems seen in the wakelock code.
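To make the shape of that proposal a bit more concrete, here is a minimal sketch - purely illustrative, not code from any posted patch - of how such a pair of functions might be built around an atomic counter. The function names come from the discussion; everything else is assumed:

    /* Illustrative sketch only - not from any kernel tree. */
    #include <linux/types.h>
    #include <linux/device.h>
    #include <asm/atomic.h>

    static atomic_t suspend_inhibit_count = ATOMIC_INIT(0);

    void inhibit_suspend(struct device *dev)
    {
        /* a debugging version could record dev to show who blocks suspend */
        atomic_inc(&suspend_inhibit_count);
    }

    void uninhibit_suspend(struct device *dev)
    {
        atomic_dec(&suspend_inhibit_count);
    }

    /* The suspend path then needs one cheap check rather than polling drivers */
    static bool suspend_allowed(void)
    {
        return atomic_read(&suspend_inhibit_count) == 0;
    }

The /dev/inhibit_suspend file could presumably be implemented as a trivial character device whose open() and release() methods call the two functions above.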
There were a few complaints from the Android side, but the biggest sticking
point appears to be over timeouts. The wakelock API implements an
automatic timeout which causes the "lock" to go away after a given time.
There appear to be a few reasons for the existence of the timeouts:
- Since not all drivers use the wakelock API, timeouts are required to
prevent suspending the system while those drivers are running. The
proposed solution to this one is to instrument all of the drivers
which need to keep the system running. Once an acceptable API is
merged into the kernel, drivers can be modified as needed.
- If a process holding a wakelock dies unexpectedly, the timeout will
keep the system running while the watchdog code restarts the faulting
process. The problem here is that timeouts encode a recovery policy
in the kernel and do little to ensure that operation is actually
correct. What has been proposed instead is that the user-space
"inhibit suspend" policy be encapsulated into a separate daemon which
would make the decisions on when to keep the system awake.
- User-space applications may simply screw up and forget to allow the
system to suspend.
The final case above is also used as an argument for the full-suspend
approach to power management. Even if an ill-behaved application goes into
a loop and refuses to quit, the system will eventually suspend and save its
battery anyway. This is an argument which does not fly particularly well
with a lot of kernel developers, who respond that, rather than coding the
kernel to protect against poor applications, one should simply fix those
applications. Arjan van de Ven points out
that, since the advent of PowerTop,
the bulk of the problems with open-source applications have been fixed.
In this space, though, it is harder to get a handle on all of these
problems. Brian Swetland describes the
situation this way:
- carrier deploys a device
- carrier agrees to allow installation of arbitrary third party apps
without some horrible certification program requiring app authors
to jump through hoops, wait ages for approval, etc
- users rejoice and install all kinds of apps
- some apps are poorly written and impact battery life
- users complain to carrier about battery life
Matthew also acknowledges the problem:
Remember that Android has an open marketplace designed to appeal to
Java programmers - users are going to end up downloading code from
there and then blaming the platform if their battery life heads
towards zero. I think "We can't trust our userland not to be dumb"
is a valid concern.
It is a real problem, but it still is not at all clear that attempts to fix
such problems in the kernel are advisable - or that they will be successful
in the end. Ben Herrenschmidt offers a
different solution: a daemon which monitors application behavior and warns
the user when a given application is seen to be behaving badly. That would
at least let users know where the real problem is. But it is, of course,
no substitute for the real solution: run open-source applications on the
phone so that poor behavior can be fixed by users if need be.
The Android platform is explicitly designed to enable proprietary
applications, though. It may prove to be able to attract those
applications in a way which standard desktop Linux has never quite managed
to do. So some sort of solution to the problem of power management in the
face of badly-written applications will need to be found. The Android
developers like wakelocks as that solution for now, but they also appear to
be interested in working with the community to find a more
globally-acceptable solution. What that solution will look like, though,
is unlikely to become clear without a lot more discussion.
Comments (16 posted)
By Jonathan Corbet
February 17, 2009
One of the lesser-known functions supported by the kernel's memory
management code is
ksize(); given a pointer to an object allocated
with
kmalloc(),
ksize() will return the size of that
object. This function is not often needed; callers to
kmalloc()
usually know what they allocated. It can be useful, though, in situations
where a function needs to know the size of an object and does not have that
information handy. As it happens, there are other potential uses for
ksize(), but there are traps as well.
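As a purely illustrative example (not taken from the kernel source), consider a hypothetical helper which is handed a kmalloc()-allocated buffer but not its length; ksize() lets it discover how much room the allocation actually provides:

    #include <linux/errno.h>
    #include <linux/slab.h>
    #include <linux/string.h>

    /* Hypothetical helper: append to a kmalloc()-allocated buffer only
       when the underlying allocation has room for the new data. */
    static int append_if_room(void *buf, size_t used,
                              const void *data, size_t len)
    {
        if (used + len > ksize(buf))
            return -ENOMEM;    /* caller must reallocate */
        memcpy((char *)buf + used, data, len);
        return 0;
    }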
Users of ksize() in the mainline kernel are rare. Until 2008, the
main user was the nommu architecture code, which was found to be using
ksize() in a number of situations where that use was not
appropriate. The result was a cleanup of the nommu code and the
un-exporting of ksize() in an attempt to prevent that sort of
situation from coming about again.
Happiness prevailed until recently; the 2.6.29-rc5 kernel includes a patch to the crypto code which makes use of
ksize() to ensure that crypto_tfm structures are
completely wiped of sensitive data before being returned to the system.
The lack of an export for ksize() caused the crypto code to fail
when built as a module, so Kirill Shutemov posted a patch to export it. That's when the
discussion got interesting.
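The pattern the crypto code wants is roughly the following - a sketch, not the actual crypto code: clear the entire underlying allocation, not just the bytes that were requested, so that no sensitive data survives in the slack space.

    #include <linux/slab.h>
    #include <linux/string.h>

    /* Sketch of the wipe-before-free pattern; the kzfree() patch listed
       under "Patches and updates" below packages essentially the same idea. */
    static void wipe_and_free(void *obj)
    {
        if (!obj)
            return;
        memset(obj, 0, ksize(obj));
        kfree(obj);
    }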
There was resistance to restoring the export for ksize(); the
biggest problem would appear to be that it's an easy function to use
incorrectly. It is only really correct to call ksize() with a
pointer obtained from kmalloc(), but programmers seem to find
themselves tempted to use it on other types of objects as well. This
situation is not helped by the fact that the SLAB and SLUB memory
allocators work just fine if any slab-allocated memory object is passed to
ksize(). The SLOB allocator, instead, is not so accommodating.
An explanation of this situation led to some
complaints from Andrew Morton:
OK. This is really bad, isn't it? People will write code which
happily works under slab and slub, only to have it crash for those
small number of people who (very much later) test with slob?
[...]
Gee this sucks. Biggest mistake I ever made. Are we working hard
enough to remove some of these sl?b implementations? Would it help
if I randomly deleted a couple?
Thus far, no implementations have been deleted; indeed, it appears that the
SLQB allocator is headed for
inclusion in 2.6.30. The idea of restricting access to ksize()
has also not gotten very far; the export of this function was restored for
2.6.29-rc5. In the end, the kernel is full of dangerous functions - such
is the nature of kernel code - and it is not possible to defend against every
mistake which could be made by kernel developers. As Matt Mackall put it, this is just another basic mistake:
And it -is- a category error. The fact that kmalloc is implemented
on top of kmem_cache_alloc is an implementation detail that callers
should not assume. They shouldn't call kfree() on kmem_cache_alloc
objects (even though it might just happen to work), nor should they
call ksize().
There is another potential reason to keep this function available: ksize()
may prove to have a use beyond freeing
developers from the need to track the size of allocated objects. One
poorly-kept secret about kmalloc() is that it tends to allocate
objects which are larger than the caller requests. A quick look at
/proc/slabinfo will (with the right memory allocator) reveal a
number of caches with names like kmalloc-256. Whenever a call to
kmalloc() is made, the requested size will be rounded up to the
next slab size, and an object of that size will be returned. (Again, this
is true for the SLAB and SLUB allocators; SLOB is a special case).
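A quick, hypothetical way to see that rounding from a test module follows; the exact numbers depend on the allocator and configuration.

    #include <linux/kernel.h>
    #include <linux/slab.h>

    /* Hypothetical probe of kmalloc() rounding; with SLAB or SLUB the sizes
       reported by ksize() will be the slab sizes (e.g. 1500 -> 2048). */
    static void show_rounding(void)
    {
        static const size_t sizes[] = { 30, 100, 1500 };
        int i;

        for (i = 0; i < ARRAY_SIZE(sizes); i++) {
            void *p = kmalloc(sizes[i], GFP_KERNEL);

            if (!p)
                continue;
            printk(KERN_INFO "requested %zu, ksize() = %zu\n",
                   sizes[i], ksize(p));
            kfree(p);
        }
    }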
This rounding-up results in a simpler and faster allocator, but those
benefits are gained at the cost of some wasted memory. That is one of the
reasons why it makes sense to create a dedicated slab for
frequently-allocated objects. There is one interesting allocation case
which is stuck with kmalloc(), though, for DMA-compatibility
reasons: SKB (network packet buffer) allocations.
An SKB is typically sized to match the maximum transfer size for the
intended network interface. In an Ethernet-dominated world, that size
tends to be 1500 bytes. A 1500-byte object requested from
kmalloc() will typically result in the allocation of a 2048-byte
chunk of memory; that's a significant amount of wasted RAM. As it happens,
though, the network developers really need the SKB buffer to not cross page
boundaries, so there is generally no way to avoid that waste.
But there may be a way to take advantage of it. Occasionally, the network
layer needs to store some extra data associated with a packet; IPSec, it
seems, is especially likely to create this type of situation. The
networking layer could allocate more memory for that data, or it could use
krealloc() to expand the existing buffer allocation, but both will
slow down the highly-tuned networking core. What would be a lot nicer
would be to just use some extra space that happened to be lying around.
With a buffer from kmalloc(), that space might just be there.
The way to find out, of course, is to use ksize(). And that's
exactly what the networking developers intend to do.
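Reduced to a sketch - with hypothetical names; the real SKB code is considerably more involved - the idea looks something like this:

    #include <linux/slab.h>

    /* Hypothetical illustration of the "use the slack" trick.  A 1500-byte
       data request typically comes out of a 2048-byte slab object, and
       ksize() reveals how much of that is actually usable. */
    struct packet_buf {
        size_t  len;        /* bytes the caller asked for */
        size_t  usable;     /* bytes actually available in the allocation */
        char    data[];
    };

    static struct packet_buf *packet_buf_alloc(size_t len)
    {
        struct packet_buf *pb = kmalloc(sizeof(*pb) + len, GFP_KERNEL);

        if (!pb)
            return NULL;
        pb->len = len;
        pb->usable = ksize(pb) - sizeof(*pb);
        return pb;
    }

Any difference between usable and len is space that later code (IPSec, for example) could use without going back to the allocator.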
Not everybody is convinced that this kind of trick is worth the trouble.
Some argue that the extra space should be allocated explicitly if it will
be needed later. Others would like to see some benchmarks demonstrating
that there is a real-world benefit from this technique. But, in the end,
kernel developers do appreciate a good trick. So ksize() will be
there should this kind of code head for the mainline in the future.
Comments (5 posted)
By Jonathan Corbet
February 16, 2009
The realtime preemption project is a longstanding effort to provide
deterministic response times in a general-purpose kernel. Much code
resulting from this work has been merged into the mainline kernel over the
last few years, and a number of vendors are shipping commercial products
based upon it. But, for the last year or so, progress toward getting the
rest of the realtime work into the mainline has slowed.
On February 11, realtime developers Thomas Gleixner and Ingo Molnar
resurfaced with the announcement of a new realtime preemption tree
and a newly reinvigorated development effort. Your editor asked them if
they would be willing to answer a few questions about this work; their
response went well beyond the call of duty. Read on for a detailed look at
where the realtime preemption tree stands and what's likely to happen in
the near future.
LWN: The 2.6.29-rc4-rt1 announcement notes that you're coming off a
1.5-year sabbatical. Why did you step away from the RT patches for so
long? Have you been hanging out on the beach in the meantime? :)
Thomas:
We spent a marvelous time at the x86 lagoon, a place with an extreme
contrast of antiquities and modern art. :)
Seriously, we underestimated the amount of work which was necessary to
bring the unified x86 architecture into shape. Nothing to complain
about; it definitely was and still is a worthwhile effort and I would
not hesitate longer than a fraction of a second to do it again.
Ingo:
Yeah, hanging out on the beach for almost two years was well-deserved
for both of us. We met Linus there and it was all fun and laughter, with
free beach cocktails, pretty sunsets and camp fires. [ All paid for by
the nice folks from Microsoft btw., - those guys sure know how to please
a Linux kernel hacker! ;-) ]
So what has brought you back to the realtime work at this time?
Thomas:
Boredom and nostalgia :) In fact I never lost track of the real time
work since we took over x86 maintenance, but my effort was restricted to
decoding hard-to-solve problems and making sure that the patches were kept
in a usable shape. Right now I have the feeling that we need to put more
development effort into preempt-rt again to keep its upstream visibility
and make progress on merging the remaining parts.
The most important reason for returning was of course our editor's
challenge in The Grumpy Editor's
guide to 2009:
"The realtime patch set will be mostly merged by the end of the year..."
Ingo:
When we left for the x86 land more than 1.5 years ago, the -rt
patch-queue was a huge pile of patches that changed hundreds of critical
kernel files and introduced/touched ten thousand new lines of code.
Fast-forward 1.5 years and the -rt patchqueue is a humungous pile of
patches that changes nearly a thousand critical kernel files and
introduces/touches twenty-thirty thousand lines of code.
So we thought that while the project is growing nicely, it is useful and
obviously people love it - the direction of growth was a bit off and
that this particular area needs some help.
Initially it started as a thought experiment of ours: how much time and
effort would it take to port the most stable -rt patch (.26-rt15) to the
.29-tip tree and could we get it to boot?
Turns out we are very poor at thought experiments (just like we are
pretty bad at keeping patch queues small), so we had to go and settle
the argument via some hands-on hacking.
Porting the queue was serious fun, it even booted after a few dozen
fixes, and the result was the .29-rt1 release.
Maintaining the x86 tree for such a long time and doing many difficult
conceptual modernizations in that area was also very helpful when
porting the -rt patch-queue to latest mainline.
Most of the code it touched and most of the conflicts that came up
looked strangely familiar to us, as if those upstream changes went
through our trees =;-)
(It's certainly nothing compared to the beach experience though, so we
are still looking at returning for a few months to a Hawaii cruise.)
How well does the realtime code work at this point? What do you think
are the largest remaining issues to be tackled?
Thomas:
The realtime code has reached quite a stable state. The 2.6.24/26
based versions can definitely be considered production ready. I spent
a lot of time to sort out a huge amount of details in those versions
to make them production stable. Still we need to refactor a lot of the
patches and look for mainline acceptable solutions for some of the
real time related changes.
Ingo:
To me what settled quite a bit of "do we need -rt in mainline" questions
were the spin-mutex enhancements it got. Prior to that there were a handful
of pretty pathological workload scenarios where -rt performance tanked
over mainline. With that it's all pretty comparable.
The patch splitup and patch quality has improved too, and the queue we
ported actually builds and boots at just about every bisection point, so
it's pretty usable. A fair deal of patches fell out of the .26 queue
because they went upstream meanwhile: tracing patches, scheduler
patches, dyntick/hrtimer patches, etc.
It all looks a lot less scary now than it looked 1.5 years ago - albeit
the total size is still considerable, so there's definitely still a ton
of work with it.
What are your current thoughts with regard to merging this work into
the mainline?
Thomas:
First of all we want to integrate the -rt patches into our -tip git
repository which makes it easier to keep -rt in sync with the ongoing
mainline development. The next steps are to gradually refactor the
patches either by rewriting or preferably by pulling in the work which
was done in Steven's git-rt tree, split out parts which are ready and
merge them upstream step by step.
Ingo:
IMO the key thought here is to move the -rt tree 'ahead of the upstream
development curve' again, and to make it the frontier of Linux R&D.
With a 2.6.26 basis that was arguably hard to do.
With a partly-2.6.30 basis (which the -tip tree really is) it's a lot
more ahead of the curve, and there are a lot more opportunities to merge
-rt bits into upstream bits wherever there's accidental upstream
activity that we could hang -rt related cleanups and changes onto.
We jumped almost 4 full kernel releases, that moves -rt across 1 year
worth of upstream development - and keeps it at that leading edge.
Another factor is that most of the top -rt contributors are also -tip
contributors so there's strong synergy.
The -tip tree also undergoes serious automated stabilization and
productization efforts, so it's a good basis for development _and_ for
practical daily use.
For example there were no build failures reported against .29-rt1, and
most of the other failures that were reported were non-fatal as well and
were quickly fixed. One of the main things we learned in the past 1.5
years was how to keep a tree stable against a wild, dangerous looking
flux of modifications.
YMMV ;-)
Thomas once told me about a scheme to patch rtmutex locks into/out of
the kernel at boot time, allowing distributors to ship a single kernel
which can run in either realtime or "normal" mode. Is that still
something that you're working on?
Thomas:
We are not working on that right now, but it is still on the list of
things which need to be investigated.
Ingo:
That still sounds like an interesting feature, but it's pretty hard to
pull it off. We used to have something rather close to that, a few years
ago: a runtime switch that turned the rtmutex code back into spinning
code. It was fragile and hard to maintain and eventually we dropped it.
Ideally it should be done not at boot time but at runtime - via the
stop-machine-run mechanism or so. [extended perhaps with hibernation
bits that force each task into hitting user-mode, so that all locks in
the system are released]
It's really hard to implement it, and it is definitely not for the faint
hearted.
The RT-preempt code would appear to be one of the biggest exceptions
to the "upstream first" rule, which urges code to be merged into the
mainline before being shipped to customers. How has that worked out
in this case? Are there times when it is good to keep shipping code
out of the mainline for such a long time?
Thomas:
It is an exception which was only acceptable because preempt-rt does not
introduce new user space APIs. It just changes the run time behaviour of
the kernel to a deterministic mode.
All changes which are user space API related (e.g. PI futexes) were
merged into mainline before they got shipped to customers via
preempt-rt and all bug fixes and improvements of mainline code were
sent upstream immediately. Preempt-rt was never a detached project
which did not care about mainline.
When we started preempt-rt there was huge demand on the customer side
- both enterprise and embedded - for an in kernel realtime solution. The
dual kernel approaches of RTAI, RT-Linux and Xenomai had no chance to
get ever accepted into the mainline and the handling of the dual kernel
environment has never been an easy task. With preempt-rt you just switch
the kernel under a stock mainline user space environment and voila your
application behaves as you would expect - most of the time :) Dual
kernel environments require different libraries, different APIs and you
can not run the same binary on a non -rt enabled kernel. Debugging
preempt-rt based real time applications is exactly the same as debugging
non real time applications.
While we never had doubts that it would be possible to turn Linux into a
real time OS, it was clear from the very beginning that it would be a
long way until the last bits and pieces got merged. The first question
Ingo asked me when I contacted him in the early days of preempt-rt was:
"Are you sure that you want to touch every part of the kernel while
working on preempt-rt?". This question was absolutely legitimate; in the
first days of preempt-rt we really touched every part of the kernel due
to problems which were mostly locking and preemption related. The fixes
have been merged upstream and especially in the locking area we got a
huge improvement in mainline due to lock debugging, conversion to
mutexes, etc. and a general better awareness of locking and preemption
semantics.
preempt-rt was always a great breeding ground for fundamental changes in
the kernel and so far quite a large part of the preempt-rt development
has been integrated into the mainline: PI-futexes, high-resolution
timers ... I hope we can keep that up and provide soon more interesting
technological changes which emerged originally from the preempt-rt
efforts.
Ingo:
Preempt-rt turns the kernel's scheduling, lock handling and interrupt
handling code upside down, so there was no realistic way to merge it all
upstream without having had some actual field feedback.
It is also unique in that you need _all_ those changes to have the new
kernel behavior - there's no real gradual approach to the -rt concept
itself.
That adds up to a bit of a catch-22: you don't get it upstream without
field use, and you don't get field use without it being upstream.
Deterministic execution is a major niche, one which was not
effectively covered by the mainstream kernel before. It's perhaps the
last major technological niche in existence that the stock upstream
kernel does not handle yet, and it's no wonder that the last one out is
in that precise position for conceptually hard reasons.
In short: all the easy technologies are upstream already ;-)
Nevertheless we strictly got all user-ABI changes upstream first:
PI-futexes in particular. The rest of -rt is "just" a new kernel option
that magically turns kernel execution into deterministic mode.
Where would be the best starting point for a developer who wishes to
contribute to this effort?
Thomas:
Nothing special with the realtime patches. Just kernel development as
usual: get the patches, apply them, run them on your machine and test.
If problems arise, provide bug reports or try to fix them yourself and
send patches. Read through the code and start providing improvements,
cleanups ...
Ingo:
Beyond the "try it yourself, follow the discussions, and go wherever
your heart tells you to go" suggestion, there's a few random areas that
might need more attention:
- Big Kernel Lock removal. It's critical for -rt. We still have the
tip:core/kill-the-BKL branch, and if someone is interested it would
be nice to drive that effort forward. A lot of nice help-zap-the-BKL
patches went upstream recently (such as the device-open patches), so
we are in a pretty good position to try the kill-the-BKL final hammer
approach too.
[I have just done a (raw!) refresh and conflict resolution merge of
that tree to v2.6.29-rc5. Interested people can find it at:
git pull \
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git \
core/kill-the-BKL
Warning: it might not even build. ]
- Look at Steve's git-rt tree and split out and gradually merge bits. A
fair deal of stuff has been cleaned up there and it would be nice to
preserve that work.
- Latency measurements and tooling. Go try the latency tracer, the
function graph tracer and ftrace in general. Try to find delays in
apps caused by the kernel (or caused by the app itself), and think
about whether the kernel's built-in tools could be improved.
- Try Thomas's cyclictest utility and try to trace and improve those
worst-case latencies. A nice target would be to push the worst-case
latencies on a contemporary PC below 10 microseconds. We were down to
about 13 microseconds with a hack that threaded the timer IRQ with
.29-rt1, so it's possible to go below 10 microseconds, I think.
- And of course: just try to improve the mainline kernel - that will
improve the -rt kernel too, by definition :-)
But as usual, follow your own path. Independent, critical thinking is a
lot more valuable than follow-the-crowd behavior. [As long as it ends
up producing patches (not flamewars) that is ;-)]
And by all means, start small and seek feedback on lkml early and often.
Being a good and useful kernel developer is not an attribute but a
process, and good processes always need time, many gradual steps and a
feedback loop to thrive.
Many thanks to Thomas and Ingo for taking the time to answer (in detail!)
this long list of questions.
Comments (19 posted)
Patches and updates
Kernel trees
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
- Johannes Weiner: kzfree().
(February 16, 2009)
Networking
Architecture-specific
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet