Brief items
The current 2.6 development kernel is 2.6.29-rc6,
released on February 22. The list
of changes is still pretty long, but, with luck, the problems are getting
fixed. See the announcement for the short-form changelog, or see the
full changelog for all the details.
As of this writing, a few dozen post-rc6 patches have found their way into
the mainline repository. They include more fixes, but also new drivers for
Atheros L1C gigabit Ethernet adapters and FireDTV IEEE1394 adapters.
The current stable 2.6 kernel is 2.6.28.7, released (without
announcement) on February 20. It contains the usual long list of
fixes, many of which are for the ext4 filesystem; the
changelog has the details. 2.6.27.19 was also released on the 20th without an
announcement; see the
changelog for the list of patches included there.
Kernel development news
Especially for developers who are just starting out with submitting
patches to a project, it's rare that a patch is of sufficiently
high quality that it can be applied directly into the repository
without needing fixups of one kind or another. The patch might
not have the right coding style compared to the surrounding code,
or it might be fundamentally buggy because the patch submitter
didn't understand the code completely. Indeed, more often than
not, when someone submits a patch to me, it is more useful for
indicating the location of the bug more than anything else, and I
often have to completely rewrite the patch before it enters into
the e2fsprogs mainline repository.
--
Ted Ts'o
I personally find it reprehensible that the attitude that network
communications ought to be exempt from access controls is so
pervasive, but I bend to the will of the people.
--
Casey Schaufler
A better approach would be to design simple, robust kernel
interfaces which make sense and which aren't made all complex by
putting the user interface in kernel space. And to maintain
corresponding userspace tools which manipulate and present the IO
from those kernel interfaces.
But we don't do that, because userspace is hard, because we don't have
a delivery process. But nobody has even tried!
--
Andrew Morton
By Jonathan Corbet
February 25, 2009
It is a rare kernel operation that does not involve the allocation and
freeing of memory. Beyond all of the memory-management requirements that
would normally come with a complex system, kernel code must be written with
extremely tight stack limits in mind. As a result, variables which would
be declared as automatic (stack) variables in user-space code require
dynamic allocation in the kernel. So the efficiency of the memory
management subsystem has a pronounced effect on the performance of the
system as a whole. That is why the kernel currently has three slab-level
allocators (the original slab allocator, SLOB, and SLUB), with another one
(SLQB) waiting for the 2.6.30 merge window to open. Thus far, nobody has
been able to create a single
slab allocator which provides the best performance in all situations, and
the stakes are high enough to make it worthwhile to keep trying.
While many kernel memory allocations are done at the slab level (using
kmem_cache_alloc() or kmalloc()), there is another layer
of memory management below the slab allocators. In the end, all dynamic
memory management comes down to the page allocator, which hands out memory
in units of full pages. The page allocator must manage memory without
allowing it to become overly fragmented; it also must deal with details
like CPU and NUMA node affinity, DMA accessibility, and high memory. It
also clearly needs to be fast; if it is slowing things down, there is
little that the higher levels can do to make things better. So one might
do well to be concerned when memory management hacker Mel Gorman writes:
The complexity of the page allocator has been increasing for some
time and it has now reached the point where the SLUB allocator is
doing strange tricks to avoid the page allocator. This is obviously
bad as it may encourage other subsystems to try avoiding the page
allocator as well.
As might be expected, Mel has come up with a set of patches designed to
speed up the page allocator and do away with the temptation to try to work
around it. The result appears to be a significant cleaning-up of the code
and a real improvement in performance; it also shows the kind of work which
is necessary to keep this sort of vital subsystem in top shape.
Mel's 20-part patch (linked with the quote, above) attacks the problem in a
number of ways. Many of them are small tweaks; for example, the core page
allocation function (alloc_pages_node()) includes the following
test:
    if (unlikely(order >= MAX_ORDER))
        return NULL;
But, as Mel puts it, no proper user of the page allocator should be
allocating something larger than MAX_ORDER in any case. So his
patch set removes this test from the fast path of the allocator, replacing
it with a rather more attention-getting test (VM_BUG_ON) in the
slow path. The fast allocation path gets a little faster, and misuse of
the interface should eventually be caught (and complained about) anyway.
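The trade-off can be sketched in user space. What follows is only an illustration under stated assumptions, not Mel's patch: the value of MAX_ORDER, the sketch_* names, the single-entry cache, and the use of malloc() as a backing store are all invented for the example, and VM_BUG_ON() is approximated with assert().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_ORDER 11        /* assumed value, for illustration only */
#define PAGE_SIZE 4096UL    /* assumed page size */

/* User-space stand-in for VM_BUG_ON(): complain loudly on misuse. */
#define VM_BUG_ON(cond) assert(!(cond))

/* Hypothetical one-entry cache of a free order-0 block. */
static void *cached_page;

/* Slow path: the order sanity check now lives here. */
void *sketch_alloc_slowpath(unsigned int order)
{
    VM_BUG_ON(order >= MAX_ORDER);
    return malloc(PAGE_SIZE << order);
}

/* Fast path: no order check at all, just the cheap cache lookup. */
void *sketch_alloc_pages(unsigned int order)
{
    if (order == 0 && cached_page != NULL) {
        void *page = cached_page;

        cached_page = NULL;
        return page;
    }
    return sketch_alloc_slowpath(order);
}

/* Freeing refills the cache when it is empty. */
void sketch_free_pages(void *page, unsigned int order)
{
    if (order == 0 && cached_page == NULL) {
        cached_page = page;
        return;
    }
    free(page);
}
```

A valid request served from the cache never pays for the check; an out-of-range order falls through to the slow path, where the assertion fires.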
Then, there is the little function gfp_zone(), which takes the
flags passed to the allocation request and decides which memory zone to try
to allocate from. Different requests must be satisfied from different
regions of memory, depending on factors like whether the memory will be
used for DMA, whether high memory is acceptable, or whether the memory can
be relocated if needed for defragmentation purposes. The current code
accomplishes this test with a series of four if tests, but lots of
jumps can be expensive in fast-path code. So Mel's patch replaces the
tests with a table lookup.
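The idea of trading branches for a table can be shown with a small sketch. The flag bits, zone names, and sk_* functions below are hypothetical stand-ins chosen for the example; they are not the kernel's actual GFP values, and the precedence among flags is an assumption as well.

```c
#include <assert.h>

/* Hypothetical zone-selection bits; not the kernel's real values. */
#define SK_GFP_DMA      0x1u
#define SK_GFP_HIGHMEM  0x2u
#define SK_GFP_DMA32    0x4u
#define SK_GFP_MOVABLE  0x8u
#define SK_GFP_ZONEMASK 0xfu

enum sk_zone {
    SK_ZONE_DMA, SK_ZONE_DMA32, SK_ZONE_NORMAL,
    SK_ZONE_HIGHMEM, SK_ZONE_MOVABLE
};

/* The branchy version: four tests, evaluated in priority order. */
enum sk_zone sk_zone_by_branches(unsigned int flags)
{
    if (flags & SK_GFP_DMA)
        return SK_ZONE_DMA;
    if (flags & SK_GFP_DMA32)
        return SK_ZONE_DMA32;
    if ((flags & (SK_GFP_HIGHMEM | SK_GFP_MOVABLE)) ==
        (SK_GFP_HIGHMEM | SK_GFP_MOVABLE))
        return SK_ZONE_MOVABLE;
    if (flags & SK_GFP_HIGHMEM)
        return SK_ZONE_HIGHMEM;
    return SK_ZONE_NORMAL;
}

/* The same decisions, precomputed for all 16 flag combinations. */
static const enum sk_zone sk_zone_table[16] = {
    SK_ZONE_NORMAL,  SK_ZONE_DMA, SK_ZONE_HIGHMEM, SK_ZONE_DMA,
    SK_ZONE_DMA32,   SK_ZONE_DMA, SK_ZONE_DMA32,   SK_ZONE_DMA,
    SK_ZONE_NORMAL,  SK_ZONE_DMA, SK_ZONE_MOVABLE, SK_ZONE_DMA,
    SK_ZONE_DMA32,   SK_ZONE_DMA, SK_ZONE_DMA32,   SK_ZONE_DMA,
};

/* The table version: one mask and one indexed load, no jumps. */
enum sk_zone sk_zone_by_table(unsigned int flags)
{
    return sk_zone_table[flags & SK_GFP_ZONEMASK];
}

/* Returns 1 if both versions agree for every flag combination. */
int sk_lookups_agree(void)
{
    unsigned int flags;

    for (flags = 0; flags <= SK_GFP_ZONEMASK; flags++)
        if (sk_zone_by_branches(flags) != sk_zone_by_table(flags))
            return 0;
    return 1;
}
```

The table costs a few bytes of read-only data, but the lookup has no conditional jumps for the processor to mispredict.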
There are a number of other changes along these lines - seeming
micro-optimizations that one would not normally bother with. But, in
fast-path code deep within the system, this level of optimization can be
worth doing. The patch set also reorganizes things to make the fast path
more explicit and contiguous; that, too, can speed things up, but it also
helps ensure that developers know when they are working with
performance-critical code.
The change which provoked the most discussion, though, was the removal of
the distinction between hot and cold pages. This feature, merged for 2.5.45, attempts to
track which pages are most likely to be present in the processor's caches.
If the memory allocator can give cache-warm pages to requesters, memory
performance should improve. But, notes Mel, it turns out that very few
pages are being freed as "cold," and that, in general, the decisions on
whether to tag specific pages as being hot or cold are questionable. This
feature adds some complexity to the page allocator and doesn't seem to
improve performance, so Mel decided to take it out. After running some benchmarks, though, he concluded
that, in fact, he has no idea whether the feature helps or not. So the
second version of the patch has left out the hot/cold removal, but this
topic will be revisited in the future.
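For readers unfamiliar with the mechanism, one way to picture hot/cold tracking is a single free list with two ends: pages freed as hot go on the head, pages freed as cold go on the tail, and allocations pull from whichever end matches the request. The sketch below assumes that scheme; the sk_* names, the fixed-size ring buffer, and the use of plain integers as page frame numbers are all inventions for the example rather than kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define SK_LIST_MAX 8   /* arbitrary capacity for the sketch */

/* Hypothetical per-CPU free list, kept as a small ring buffer. */
struct sk_pcp_list {
    int pages[SK_LIST_MAX];   /* page frame numbers */
    size_t head;              /* index of the hottest entry */
    size_t count;             /* number of entries in use */
};

/* Free a page: hot pages go to the head, cold pages to the tail.
 * Returns 0 on success, -1 if the list is full. */
int sk_free_page(struct sk_pcp_list *l, int pfn, int cold)
{
    if (l->count == SK_LIST_MAX)
        return -1;
    if (cold) {
        l->pages[(l->head + l->count) % SK_LIST_MAX] = pfn;
    } else {
        l->head = (l->head + SK_LIST_MAX - 1) % SK_LIST_MAX;
        l->pages[l->head] = pfn;
    }
    l->count++;
    return 0;
}

/* Allocate a page: hot requests take the head, cold requests the tail.
 * Returns the page frame number, or -1 if the list is empty. */
int sk_alloc_page(struct sk_pcp_list *l, int cold)
{
    int pfn;

    if (l->count == 0)
        return -1;
    if (cold) {
        pfn = l->pages[(l->head + l->count - 1) % SK_LIST_MAX];
    } else {
        pfn = l->pages[l->head];
        l->head = (l->head + 1) % SK_LIST_MAX;
    }
    l->count--;
    return pfn;
}
```

If almost nothing is ever freed as cold, as Mel observes, the tail end of such a list sees little use, which is part of why the feature's value is hard to measure.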
Mel claims some good results:
Running all of these through a profiler shows me the cost of page
allocation and freeing is reduced by a nice amount without
drastically altering how the allocator actually works. Excluding
the cost of zeroing pages, the cost of allocation is reduced by 25%
and the cost of freeing by 12%. Again excluding zeroing a page,
much of the remaining cost is due to counters, debugging checks and
interrupt disabling. Of course when a page has to be zeroed, the
dominant cost of a page allocation is zeroing it.
A number of standard user-space benchmarks also show improvements with this
patch set. The reviews are generally good, so the chances are that these
changes could avoid the lengthy delays that characterize memory management
patches and head for the mainline in the relatively near future. Then
there should be no excuse for trying to avoid the page allocator.
By Jake Edge
February 25, 2009
In kernel development, there is always tension between the needs of
a new feature versus the needs of the kernel as a whole. Projects
generally want to get their code merged as early as possible, for a variety
of reasons, while the
rest of the kernel community needs to be comfortable that the feature is
sensible, desirable, and, perhaps most importantly, maintainable. The
current push for inclusion of a feature to checkpoint and restart processes
highlights this tension.
In late January, Oren Laadan posted the latest version of his
kernel-based checkpoint and restart code with the notation: "Aiming
for -mm". There are many possible uses for checkpoints, but it is
an extremely complex problem. Laadan's current version is quite
minimal, implementing only a fairly small subset of the features
envisioned, but he would like to get the kind of review and testing that
goes along with pushing it towards the mainline.
After two weeks without much in the way of comments, another proponent,
Dave Hansen, asked what, if anything, was
holding the patchset back from -mm inclusion. Andrew Morton replied that he had raised some concerns which
were "inconclusively waffled at" a few months back.
Morton's opinion carries a fair amount of weight—not least because he
runs the targeted tree. He is looking to the future and trying to ensure
that the patches make sense:
I am concerned that this implementation is a bit of a toy, and that we
don't know what a sufficiently complete implementation will look like.
There is a risk that if we merge the toy we either:
a) end up having to merge unacceptably-expensive-to-maintain code to
make it a non-toy or
b) decide not to merge the unacceptably-expensive-to-maintain code,
leaving us with a toy or
c) simply cannot work out how to implement the missing functionality.
Morton asked for answers to several questions regarding what features are
available in the current implementation, as well as information on what
needs to be added. He also asked for indications that Laadan and Hansen
had some thoughts on the design for required, but not
yet implemented, features. In short, he wants to avoid any of the
scenarios he outlined. In response to further questions from Ingo Molnar,
Hansen outlined
some of the shortcomings of the current implementation:
Right now, it is good for very little. An app has to basically be
either specifically designed to work, or be pretty puny in its
capabilities. Any fds that are open can only be restored if a simple
open();lseek(); would have been sufficient to get it back into a good
state. The process must be single-threaded. Shared memory, hugetlbfs,
VM_NONLINEAR are not supported.
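The open();lseek(); restriction Hansen mentions can be illustrated from user space. The sketch below is not taken from the patch set; it assumes the checkpoint records only a path, the open flags, and the file offset, and every sk_* name is hypothetical. A user-space program cannot portably recover a descriptor's path, so the caller supplies it here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical checkpoint record for one open file descriptor. */
struct sk_fd_record {
    char path[256];
    int flags;
    off_t offset;
};

/* Record what a later open()+lseek() would need.
 * Returns 0 on success, -1 if the descriptor cannot be described this way. */
int sk_checkpoint_fd(int fd, const char *path, int flags,
                     struct sk_fd_record *rec)
{
    off_t pos = lseek(fd, 0, SEEK_CUR);

    if (pos == (off_t)-1)
        return -1;                  /* not seekable: no offset to save */
    if (strlen(path) >= sizeof(rec->path))
        return -1;
    strcpy(rec->path, path);
    rec->flags = flags;
    rec->offset = pos;
    return 0;
}

/* Reopen the file and seek back to the saved offset.
 * Returns the new descriptor, or -1 on failure. */
int sk_restore_fd(const struct sk_fd_record *rec)
{
    /* Never create or truncate the file while restoring. */
    int fd = open(rec->path, rec->flags & ~(O_CREAT | O_EXCL | O_TRUNC));

    if (fd < 0)
        return -1;
    if (lseek(fd, rec->offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Anything whose state is not captured by a path and an offset, such as a socket with data in flight, falls outside what this approach can restore.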
Hansen also had a more detailed answer to
Morton's questions, which showed a lot of work still to be done. The
current code only works for x86 architectures, for example, and only for
basic file types, essentially just pipes and regular files. He likened the
progress of checkpoint/restart to that of kernel scalability; it is a work
in progress, not something that will ever be complete:
We intend to make core kernel
functionality checkpointable first. We'll move outwards from there as
we (and our users) deem things important, but we'll certainly never be
done.
One of the main concerns is not that there is a lot still to be done, but
that there may be lurking problems that either don't have solutions or can
only be solved by very intrusive kernel changes. Matt Mackall looked at
Hansen's list of additional features needing to be implemented and summed up the worries this way:
I think the real questions is: where are the dragons hiding? Some of
these are known to be hard. And some of them are critical [for] checkpointing
typical applications. If you have plans or theories for implementing all
of the above, then great. But this list doesn't really give any sense of
whether we should be scared of what lurks behind those doors.
There is, however, a free out-of-tree implementation of checkpoint/restart
in the OpenVZ project. OpenVZ is a
virtualization scheme using its own implementation of
containers—different from that
in more recent kernels—that supports checkpointing and migrating those
containers. But it is a large patch; Morton looked at it several years
ago and concluded that it would not be welcome in the mainline. Hansen
sees OpenVZ as a useful example, but
"with all the input from the OpenVZ folks
and at least three other projects, I bet we can come up with something
better".
An incremental approach to implementing checkpoints is reasonable, but
Morton is concerned that by merging the
current patches, the kernel developers will be
committed to merging something that looks a lot like—and is as
intrusive as—the OpenVZ patches. Molnar is more upbeat: he sees it as an important
feature without "many long-term dragons". He does see one
potential problem area in the incremental approach, though:
There is _one_ interim runtime cost: the "can we checkpoint or not"
decision that the kernel has to make while the feature is not complete.
That, if this feature takes off, is just a short-term worry - as
basically everything will be checkpointable in the long run.
That is one of the technical issues still to be resolved with the current
patchset: how does a process programmatically determine whether it is able
to be checkpointed? If the process has performed some action whose
resulting state the running kernel cannot yet checkpoint, there needs to be
a way to detect that. Molnar suggested overloading the LSM security checks such
that performing those actions sets a one-way "not checkpointable" flag as
appropriate. That flag
could be checked by the process or by some other program that was
interested. Overloading the LSM hooks is not completely uncontroversial, but
it does hook the kernel in many of the right places—adding an
additional call to those same places for checkpointing is not likely to fly.
There was also some question about whether the "not checkpointable" flag
needs to be a one-way flag, as it could be cleared once the process has
returned to a state that is able to be checkpointed. Molnar argued that
the one-way flag is desirable: "uncheckpointable
functionality should be as
painful as possible, to make sure it's getting fixed". Users who
run into problems checkpointing their applications will then apply pressure to
get the requisite state added to checkpoints. As a starting point,
Hansen has posted a patch that would add a
one-way flag based on the kinds of files a process had opened.
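The behavior of such a one-way flag can be sketched as follows. This is a guess at the general shape, not Hansen's patch: the sk_* names, the set of file kinds, and the choice of which kinds count as supported (regular files and pipes, per the description above) are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of what a process may open. */
enum sk_file_kind {
    SK_FILE_REGULAR, SK_FILE_PIPE, SK_FILE_SOCKET, SK_FILE_CHARDEV
};

/* Hypothetical per-process checkpoint state. */
struct sk_task {
    bool uncheckpointable;   /* one-way: set once, never cleared */
    int open_unsupported;    /* informational count only */
};

/* Assumed policy: only regular files and pipes can be checkpointed. */
bool sk_kind_supported(enum sk_file_kind kind)
{
    return kind == SK_FILE_REGULAR || kind == SK_FILE_PIPE;
}

/* Opening an unsupported kind of file taints the process. */
void sk_task_open(struct sk_task *t, enum sk_file_kind kind)
{
    if (!sk_kind_supported(kind)) {
        t->open_unsupported++;
        t->uncheckpointable = true;
    }
}

/* Closing the file updates the count but leaves the flag alone. */
void sk_task_close(struct sk_task *t, enum sk_file_kind kind)
{
    if (!sk_kind_supported(kind) && t->open_unsupported > 0)
        t->open_unsupported--;
    /* Deliberately no reset of t->uncheckpointable here. */
}

bool sk_task_may_checkpoint(const struct sk_task *t)
{
    return !t->uncheckpointable;
}
```

A clearable variant would reset the flag in sk_task_close() once the count reaches zero; the one-way version keeps the limitation visible, which is the pressure Molnar argues for.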
Checkpoints are a useful feature that could be used for migrating processes
to different machines, protecting long-running processes against kernel
crashes or upgrades, system hibernation, and more. It is a difficult
problem that may never really be completely finished and it touches a lot
of core kernel code. For these reasons, caution is certainly justified,
but one gets the sense that some kind of checkpoint/restart feature will
eventually make its way into the mainline. Whether it is Laadan's version,
something derived from OpenVZ, or some other mechanism entirely remains to
be seen.
By Jonathan Corbet
February 24, 2009
Once upon a time, the Video4Linux (V4L) development community was seen as a
discordant group which hung out in its own playpen and which had not
managed to implement support for much of the available hardware. Times
have changed; the V4L community is energetic and productive, disruptive
flame wars have all but disappeared from the V4L mailing lists, and Linux
now supports a large majority of the hardware which can be found on the
market. As this community moves forward, it is reorganizing things on many
fronts; among other things, they are working on the creation of the first
true framework for video capture devices. The V4L developers are also
having to look at their code management practices; in the process they are
encountering a number of issues which have been faced by other subsystems
as well.
The discussion started with this RFC from Hans
Verkuil. Hans points out that the size of the V4L subsystem (as found
under drivers/media in the kernel source) has grown significantly
in recent years - it is 2.5 times larger now than it was in the 2.6.16
kernel. This growth is a sign of success: V4L has added features and
support for a vast array of new hardware in this time. But it has its
costs as well - that is a lot of code to maintain.
As it happens, the V4L developers make that maintenance even harder by
incorporating backward compatibility into their tree. The tree run by V4L
maintainer Mauro Carvalho Chehab does not support just the current mainline
kernel; instead, it can be built on any kernel from 2.6.16 forward. This
is not a small trick, considering that the majority of that code did not
exist when 2.6.16 was released. There have been some major internal kernel API
changes over that time; supporting all those kernels requires a complicated
array of #ifdefs, compatibility headers, and more. It takes a lot
of work to keep this compatibility structure in place. Additionally, this
kind of compatibility code is not welcome in the mainline kernel, so it
must all be stripped out prior to sending code upstream.
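A compatibility shim of this kind typically looks something like the sketch below. It is written to build outside a kernel tree, so the version macro is redefined under a hypothetical SK_ name (using the same encoding as the KERNEL_VERSION() macro in 2.6-era kernels), and the two registration functions are placeholders rather than real driver APIs. The 2.6.22 boundary is borrowed from the i2c changes discussed later in this article.

```c
#include <assert.h>
#include <string.h>

/* Same arithmetic as KERNEL_VERSION() in 2.6-era kernels, under a
 * hypothetical name so the sketch compiles in user space. */
#define SK_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical build setting: the kernel version being built against. */
#ifndef SK_TARGET_VERSION
#define SK_TARGET_VERSION SK_KERNEL_VERSION(2, 6, 29)
#endif

/* Placeholder for code written against the current internal API. */
const char *sk_register_new_style(void)
{
    return "new-style";
}

/* Placeholder for the fallback needed on older kernels. */
const char *sk_register_legacy(void)
{
    return "legacy";
}

/* The driver calls one name; the preprocessor picks the implementation. */
const char *sk_register_device(void)
{
#if SK_TARGET_VERSION >= SK_KERNEL_VERSION(2, 6, 22)
    return sk_register_new_style();
#else
    return sk_register_legacy();
#endif
}
```

Each internal API change across the supported range adds another conditional like this one, and every such block has to be removed again before the code goes upstream.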
The reason for this practice is relatively straightforward: the V4L
developers would like to make it possible for testers to try out new
drivers without forcing them to install a leading-edge mainline kernel.
This is the same reasoning that the DRM developers gave at the 2008 Kernel Summit: allowing
testers to build modules for older kernels makes life easier for them. And
that, in turn, leads to more testing of current code. But the cost of this
compatibility is high, so Hans is proposing a few changes.
One of those would be in how the subsystem tree is managed. Currently,
this tree is maintained in a Mercurial repository which represents only the
V4L subsystem (it is not a full kernel tree), and which contains the
backward compatibility patches. This organization makes interaction with
the kernel development process harder in a number of ways. Beyond the
effort required to maintain backward compatibility, the separate tree makes
it harder to integrate patches written against the mainline kernel, and
there is no way for this tree to contain patches which affect kernel code
outside of drivers/media. Life would be easier if developers
could simply work against an ordinary mainline kernel tree.
So Hans suggests moving to a tree organization modeled on the techniques
developed by the ALSA project. The ALSA maintainers (who also keep
backward compatibility patches) use as their primary tree a clone of the
mainline git repository. Backward compatibility changes are then
retrofitted into a separate tree which exists just for that purpose. By
working against a mainline tree, the ALSA developers interact more smoothly
with the rest of the kernel development process. The down side is that
creating the backward-compatible tree requires more work; a team of V4L
developers would have to commit to putting time toward that goal.
And that leads, of course, to the biggest question: what is the real value
of the backward compatibility work, and how far back should the project go?
There seems to be little interest in dropping compatibility with older
kernels altogether; the value to testers and developers both seems to be
too high. But it is not clear that it is really necessary to support
kernels all the way back to 2.6.16. So, asks Hans, what is the oldest
kernel that the project should support?
Hans has a clear objective here: the i2c changes which were merged for
2.6.22 create a boundary beyond which backward compatibility gets
significantly harder. If kernels before 2.6.22 could be dropped, a lot of
backward compatibility hassles would go away. But convenience is not the
only thing to bear in mind when dropping support; one must also consider
whether that change will significantly reduce the number of testers who can
try out the code. It would also be good to have some sort of objective
policy on backward compatibility support so that older kernels could be
dropped in the future without the need for extensive discussions.
The proposed policy is this: V4L backward compatibility should support the
oldest kernels supported by "the three major distros" (Fedora, openSUSE,
and Ubuntu). For the moment, that kernel, conveniently, happens to be
2.6.22, which will be supported by Ubuntu 7.10 until April, 2009.
(Interestingly, Hans seems to have skipped over the 6.06 "Dapper Drake"
release - supported until June, 2009 - which runs a bleeding-edge 2.6.15
kernel.) A quick poll run by Hans suggests
that there is little opposition to removing support for kernels prior to
2.6.22.
There is some, though: John Pilkington points
out:
I think you should be aware that the mythtv and ATrpms communities
include a significant number of people who have chosen to use the
CentOS_5 series in the hope of getting systems that do not need to
be reinstalled every few months. I hope you won't disappoint them.
CentOS 5 (like the RHEL5 distribution it is built from) shipped with a
2.6.18 kernel. It seems, though, that there is
little sympathy for CentOS (or any other "enterprise" distribution) in
the development community. Running a distribution designed to be held
stable for several years and wanting the latest hardware support are seen
to be contradictory goals. So it seems unlikely that the V4L tree will be
managed with the needs of enterprise distributions in mind.
Thus far, no actual decisions have been made. Mauro, who as the subsystem
maintainer would be expected to have a strong voice in any such decision,
has not yet shown up in the discussion. Given the lack of any strong
opposition to the proposals, though, it would be surprising if those
proposals were not adopted in some form.
Page editor: Jonathan Corbet