Brief items
The current 2.6 prepatch is 2.6.23-rc1,
released by Linus on
July 22. The 2.6.23 merge window is now closed. See the article
below for features merged since last week; for a complete view of what's in
2.6.23-rc1 see
the short-form
changelog or the
full
changelog if you have a lot of time.
Somewhat more than 100 patches have gone into the mainline repository since -rc1
as of this writing. They are mostly fixes, but there was also a patch
removing the request_queue_t typedef - though it was later
restored with a "deprecated" tag.
The current -mm tree is 2.6.23-rc1-mm1. This tree has
slimmed considerably as patches flowed into the mainline; other changes
include a set of IDE updates, the USB device authorization
patches, the Linux security
non-modules patch, a new file capabilities patch, some new ext4
features, and process-ID namespaces.
For older kernels: 2.6.16.53-rc1 was released on
July 23 - the first 2.6.16 update in a while.
2.4.34.6 was released on
July 22 with a couple of fixes. 2.4.35-rc1 is also out with a
larger set of fixes; the final 2.4.35 release should happen shortly.
Kernel development news
Stupid bugs only appear endearing in retrospect.
--
Linus Torvalds
In Linux we reject _lots_ of code, and that's the only way to
create a quality kernel. It's a bit like evolutionary selection:
breathtakingly wasteful and incredibly efficient at the same time.
--
Ingo Molnar
Apologies to those of you looking for selections from the ill-advised run
of limericks recently posted on linux-kernel; interested readers can find
most of them in this
thread.
Linus has closed the 2.6.23 merge window. Before that happened, however, a
few more patches slipped through:
- New drivers for LM93 hardware monitoring chips, SMSC DME1737 hardware
monitoring chips, AMD5536 UDC USB controllers, OpenMoko Neo1973 audio
controllers, Renesas SH7760 audio controllers, SEGA Dreamcast Yamaha
AICA PCM sound devices, Cyrix Geode 5530 audio controllers, PS3 audio
controllers, Xbox 360 pad LEDs, Fujitsu serial touch screens, Simtek
STK17TA8 timekeeping chips, and GPIO-connected LEDs.
- The UIO API for the
creation of simple device drivers in user space has been merged.
- Japanese and Chinese versions of Documentation/HOWTO and
stable_api_nonsense.txt have been added to the tree. There is
resistance to carrying translated versions of kernel documents in
general, but it is hoped that translations of some of the introductory
documents will help new developers to join the process.
- The Lguest
virtualization mechanism has been merged. Puppies for everybody!
- Process entries in /proc now have a coredump_filter
file which controls which memory areas will be written out should a
core dump become necessary.
- The on-demand readahead
patches have finally found their way into the mainline.
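As a rough illustration of the coredump_filter file mentioned above, here is a minimal user-space sketch of how a process might set its filter. The helper name set_coredump_filter() and the DUMP_* constants are invented for this example; the bit meanings follow the core(5) manual page and should be checked against the kernel actually in use.

```c
#include <assert.h>
#include <stdio.h>

/* Bit meanings as described in core(5); the names are made up for this sketch. */
enum {
    DUMP_ANON_PRIVATE = 1u << 0,   /* anonymous private mappings */
    DUMP_ANON_SHARED  = 1u << 1    /* anonymous shared mappings */
};

/*
 * Write `mask` in hexadecimal to the coredump_filter file at `path`
 * (normally "/proc/self/coredump_filter").
 * Returns 0 on success, -1 if the file could not be opened or written.
 */
int set_coredump_filter(const char *path, unsigned int mask)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fprintf(f, "0x%x\n", mask) > 0;
    if (fclose(f) != 0)        /* buffered write errors show up here */
        ok = 0;
    return ok ? 0 : -1;
}
```

A process wanting to keep shared memory segments out of its core dumps might call set_coredump_filter("/proc/self/coredump_filter", DUMP_ANON_PRIVATE) early in its startup; on kernels without this feature the call simply returns -1.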
Changes visible to kernel developers include:
- unregister_chrdev() now returns void.
- There is a new notifier chain which can be used (by calling
register_pm_notifier()) to obtain notification before and
after suspend and hibernate operations.
- The new "lockstat" infrastructure provides statistics on the amount of
time threads spend waiting for and holding locks.
- The new fault() VMA operation replaces nopage() and
populate(). See this article
for a description of the current fault() API.
- The generic netlink API now has the ability to register (and
unregister) multicast groups on the fly.
- The destructor argument has been removed from
kmem_cache_create(), as destructors are no longer supported.
All in-kernel callers have been updated.
- There is now support for profiling Cell SPU usage in oprofile.
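For readers unfamiliar with notifier chains, the pattern behind register_pm_notifier() can be modeled in user space. The sketch below is not kernel code: every demo_* name and both event codes are hypothetical stand-ins, and the chain is a bare linked list without the priorities and locking that a real notifier chain would have.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes; the kernel defines its own constants. */
enum demo_pm_event { DEMO_SUSPEND_PREPARE, DEMO_POST_SUSPEND };

struct demo_notifier {
    int (*call)(struct demo_notifier *self, enum demo_pm_event event);
    struct demo_notifier *next;
};

static struct demo_notifier *demo_chain;    /* head of the chain */

/* Add a notifier to the front of the chain. */
void demo_register_notifier(struct demo_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = demo_chain;
    demo_chain = nb;
}

/* Invoke every registered notifier; returns the number called. */
int demo_call_chain(enum demo_pm_event event)
{
    int called = 0;
    for (struct demo_notifier *nb = demo_chain; nb; nb = nb->next) {
        nb->call(nb, event);
        called++;
    }
    return called;
}

/* Example subscriber: a driver which quiesces itself around a suspend. */
struct demo_driver {
    struct demo_notifier nb;    /* first member, so the cast below is valid */
    int quiesced;
};

int demo_driver_event(struct demo_notifier *self, enum demo_pm_event event)
{
    struct demo_driver *drv = (struct demo_driver *)self;
    drv->quiesced = (event == DEMO_SUSPEND_PREPARE);
    return 0;
}
```

The point of the pattern is that the suspend code need not know who is listening; it calls the chain once before the operation and once after, and each subscriber decides what to do with the event.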
Since the merge window is now closed, that should be the end of new
features for this development cycle. There could be an exception or two,
though: a few developers appear to have missed the window and are hoping to
slip in a few post -rc1 changes.
The Secure Digital Input/Output (SDIO) specification enables the creation of SD
cards which handle tasks beyond the simple storage of bits that
SD has traditionally been used for. The
SD Association SDIO page
shows some cute pictures with SDIO network adapters, cameras, GPS
receivers, fingerprint recognizers, and a strangely disturbing image of a
scanner glued directly to an SD card. As small gadgets with SD slots
become more prevalent, one can imagine a number of uses for peripherals
which can be attached to those slots. Since many of those gadgets run
Linux, it would be nice to have proper support for SDIO devices in the
mainline kernel. Unfortunately, like much of the SD Association's work,
SDIO has been a realm of proprietary specifications and implementations.
That would appear to be about to change, however: Pierre Ossman has sent
out an announcement of interest:
I am happy to announce that SDIO support will soon be a standard
feature in Linux. No more proprietary stacks with all the troubles
(legal and technical) that go with them.
The new SDIO stack, written by Pierre and Nicolas Pitre, is in a fairly
complete state, with the sort of bus-level support
that driver writers have come to expect. There is one driver (for GPS
interfaces) available now; it is expected that others will show up
shortly. If all goes well, expect the new SDIO stack to be ready for
2.6.24.
Back in October, 2006, LWN
covered the proposed
fault() method for virtual memory areas. This API change was
put forward as part of a fix for an obscure (but real) race condition
within the kernel. Such a fix would seem important, but, even so, it took
the better part of a year for
fault() to make it into the
mainline. Now that the patch has been merged for 2.6.23, it is worth
taking a look at the API which was adopted.
A virtual memory area (VMA) in the kernel represents a piece of a process's
virtual address space. Each VMA is mapped in its own way; most VMAs are
mapped to files on the disk, but there are also anonymous VMAs (mapped to
swap space, for all practical purposes), device memory mappings, and more.
Each VMA must provide a handler for situations where a specific page in
that VMA is not resident in main memory; the handler must rectify the
situation or let the kernel know that it cannot be done. In most cases,
the nopfn() or older (but more heavily used) nopage()
methods fill that bill. They are called with the address of the missing
page and are expected to return a pointer to the
page structure for the missing page (or, for nopfn(), its page frame
number). For more complicated cases,
nonlinear VMAs in particular, the populate() method is invoked
instead.
The existence of three functions to perform the same task suggests that
requirements have changed over time and that a cleanup is overdue. When
none of those interfaces can be extended to prevent a race
condition, the pressure for a new approach can only get stronger. That new
approach, as created by Nick Piggin, is the fault() method, which
should, eventually, replace all three of the others. The prototype for
fault() is:
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
Most of the information of interest can be found in the new
vm_fault structure, which looks like this:
struct vm_fault {
    unsigned int flags;
    pgoff_t pgoff;
    void __user *virtual_address;
    struct page *page;
};
The fault() method should, like its predecessors, arrange for the
missing page to exist and return its address to the kernel. The interface
used is rather more flexible, though.
The offset of the missing page can be found in the pgoff field.
Fault handlers can also find the corresponding user-space address in
virtual_address, but anybody who is tempted to use that field
should be prepared to justify that use to a crowd of skeptical kernel
developers. Most handlers should not care where the page lives in user
space, and use of virtual_address will make it impossible to
support nonlinear VMAs. So, if at all possible, virtual_address
should be ignored. If your code only uses pgoff, it should also
set the VM_CAN_NONLINEAR flag in the VMA's vm_flags field
to let the kernel know that it is playing by the rules.
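To see why pgoff is enough for a well-behaved handler, it helps to look at how the offset relates to the faulting address in an ordinary (linear) mapping. The sketch below is a user-space model only: struct demo_vma carries just two of the many fields found in the real vm_area_struct, demo_linear_pgoff() is a name made up for this example, and a 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12    /* assumes 4096-byte pages */

/* Just the fields needed here; the real structure has many more. */
struct demo_vma {
    uintptr_t vm_start;    /* first user-space address covered by the VMA */
    uint64_t  vm_pgoff;    /* offset of the VMA in its backing object, in pages */
};

/*
 * Page offset, within the backing object, of the page containing
 * `address`. Valid for linear mappings only.
 */
uint64_t demo_linear_pgoff(const struct demo_vma *vma, uintptr_t address)
{
    assert(address >= vma->vm_start);
    return ((address - vma->vm_start) >> DEMO_PAGE_SHIFT) + vma->vm_pgoff;
}
```

In a nonlinear VMA this arithmetic no longer holds, since any page of the file may appear at any address within the mapping. A handler which works backward from virtual_address will thus find the wrong page, while one which trusts pgoff keeps working.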
Two flags are defined for the flags field:
- FAULT_FLAG_WRITE indicates that the page fault happened
on a write access.
- FAULT_FLAG_NONLINEAR says that the given VMA is a nonlinear
mapping.
After fault() has done its work, it should store a pointer to the
page structure for the faulted-in page in the page field
- but see below for an exception. The return value from fault()
is a set of flags which can indicate a number of things:
- VM_FAULT_OOM: the fault could not be handled because
the handler was unable to allocate the required memory.
- VM_FAULT_SIGBUS: the page offset is out of
range, so the fault could not be handled.
- VM_FAULT_MAJOR: marks a "major" page fault - usually one which
required reading data from disk.
- VM_FAULT_WRITE: a copy-on-write mapping was
broken to satisfy the fault.
- VM_FAULT_NOPAGE: set if the handler has installed the page
table entry directly. In this case, the page field returned
in the vm_fault structure has no meaning. Among other uses,
this flag allows fault() to be used with mappings that have
no associated page structures - mappings of device memory,
for example.
- VM_FAULT_LOCKED: the returned page has been locked
by the handler and should be unlocked by the caller. It is used with
file-backed mappings to prevent races with other parts of the kernel
which may be trying to access the same page.
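The overall shape of a simple fault() handler can be sketched in user space as follows. This is a model, not kernel code: the demo_* structures stand in for struct vm_fault, struct page, and a hypothetical driver's private data, the DEMO_VM_FAULT_* values are arbitrary, and the sketch assumes that returning no flags at all signals ordinary success. A real handler would receive the VMA as its first argument and would normally find its private data through that.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in flag values, chosen for this sketch only. */
#define DEMO_VM_FAULT_OOM    0x01
#define DEMO_VM_FAULT_SIGBUS 0x02

#define DEMO_PAGE_SIZE 4096
#define DEMO_NR_PAGES  4

struct demo_page { unsigned char data[DEMO_PAGE_SIZE]; };

/* Mirrors the fields of struct vm_fault which this handler touches. */
struct demo_vm_fault {
    unsigned int flags;
    unsigned long pgoff;
    struct demo_page *page;
};

/* A hypothetical device backed by a few lazily-allocated pages. */
struct demo_device { struct demo_page *pages[DEMO_NR_PAGES]; };

int demo_fault(struct demo_device *dev, struct demo_vm_fault *vmf)
{
    assert(dev != NULL && vmf != NULL);

    if (vmf->pgoff >= DEMO_NR_PAGES)
        return DEMO_VM_FAULT_SIGBUS;      /* offset out of range */

    if (!dev->pages[vmf->pgoff]) {
        dev->pages[vmf->pgoff] = calloc(1, sizeof(struct demo_page));
        if (!dev->pages[vmf->pgoff])
            return DEMO_VM_FAULT_OOM;     /* could not allocate the page */
    }

    vmf->page = dev->pages[vmf->pgoff];   /* hand the page back to the caller */
    return 0;                             /* no flags set */
}
```

Note that the handler looks only at pgoff, never at the faulting address, and that the page comes back through the structure rather than as the return value; that separation is what leaves the return value free to carry flags.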
All callers of the populate() VMA operation have been changed, and
that method no longer exists. There is an entry in the feature removal
schedule for nopage() indicating that it will go away "as soon as
possible." The kernel still has a number of nopage()
implementations, though, so getting rid of it may take a little while yet.
Longer-term plans call for the removal of nopfn() as well, though
no date has been set for this change. Certainly any new code which
implements mmap() should be written to handle faults with
fault() rather than one of the older functions.
It has been almost two years since LWN
covered the swap prefetch
patch. This work, done by Con Kolivas, is based on the idea that if a
system is idle, and it has pushed user data out to swap, perhaps it should
spend a little time speculatively fetching that swapped data back into
any free memory that might be sitting around. Then, when some application
wants that memory in the future, it
will already be available and the time-consuming process of fetching it
from disk can be avoided.
The classic use case for this feature is a
desktop system which runs memory-intensive daemons (updatedb, say, or a
backup process) during the night. Those daemons may shove a lot of useful
data to swap, where it will languish until the system's user arrives,
coffee in hand, the next morning. Said user's coffee may well grow cold by
the time the various open applications have managed to fault in enough
memory to function again. Swap prefetch is intended to allow users to
enjoy their computers and hot coffee at the same time.
There is a vocal set of users out there who will attest that swap prefetch
has made their systems work better. Even so, the swap prefetch patch has
languished in the -mm tree for almost all of those two years with no path
to the mainline in sight. Con has given up
on the patch (and on kernel development in general):
The window for 2.6.23 has now closed and your position on this is
clear. I've been supporting this code in -mm for 21 months since
16-Oct-2005 without any obvious decision for this code forwards or
backwards.
I am no longer part of your operating system's kernel's world; thus
I cannot support this code any longer. Unless someone takes over
the code base for swap prefetch you have to assume it is now
unmaintained and should delete it.
It is an unfortunate thing when a talented and well-meaning developer runs
afoul of the kernel development process and walks away. We cannot afford
to lose such people. So it is worth the trouble to try to understand what
went wrong.
Problem #1 is that Con chose to work in some of the trickiest parts of the
kernel. Swap prefetch is a memory management patch, and those patches
always have a long and difficult path into the kernel. It's not just Con
who has run into this: Nick Piggin's lockless pagecache patches have
been
knocking on the door for just as long. The LWN article on Wu Fengguang's
adaptive readahead patches appeared at about the same time as the swap
prefetch article - and that was after your editor had stared at them for
weeks trying to work up the courage to write something. Those patches
were only merged earlier this month, and, even then, only after many of the
features were stripped out. Memory management is not an area for
programmers looking for instant gratification.
There is a reason for this. Device drivers either work or they do not, but
the virtual memory subsystem behaves a little differently for every
workload which is put to it. Tweaking the heuristics which drive memory
management is a difficult process; a change which makes one workload run
better can, unpredictably, destroy performance somewhere else. And that
"somewhere else" might not surface until some large financial institution
somewhere tries to deploy a new kernel release. The core kernel
maintainers have seen this sort of thing happen often enough to become
quite conservative with memory management changes. Without convincing
evidence that the change makes things better (or at least does no harm) in
all situations, it will be hard to get a significant change merged.
In a
recent interview Con stated:
Then along came swap prefetch. I spent a long time maintaining and
improving it. It was merged into the -mm kernel 18 months ago and
I've been supporting it since. Andrew [Morton] to this day remains
unconvinced it helps and that it 'might' have negative consequences
elsewhere. No bug report or performance complaint has been
forthcoming in the last 9 months. I even wrote a benchmark that
showed how it worked, which managed to quantify it!
The problem is that, as any developer knows, "no bug reports" is not the same
as "no bugs." What is needed in a situation like this is not just
testimonials from happy desktop users; there also needs to be some sort of
sense that the patch has been tried out in a wide variety of situations.
The relatively self-selecting nature of Con's testing community (more on
this shortly) makes that wider testing harder to achieve.
A patch like swap prefetch will require a certain amount of support from
the other developers working in memory management before it can be merged.
These developers have, as a whole, not quite been ready to jump onto the
prefetch bandwagon. A concern which has been raised a few times is that
the morning swap-in problem may well be a sign of a larger issue within the
virtual memory subsystem, and that prefetch mostly serves as a way of
papering over that problem. And it fails to even paper things over completely,
since it brings back some pages from swap, but doesn't (and really can't)
address file-backed pages which will also have been pushed out. The
conclusion that this reasoning leads to is that it would be better to find
and fix the real problem rather than hiding it behind prefetch.
The way to address this concern is to try to get a better handle on what
workloads are having problems so that the root cause can be addressed.
That's why Andrew Morton says:
To attack the second question we could start out with bug reports:
system A with workload B produces result C. I think result C is
wrong for <reasons> and would prefer to see result D.
and why Nick Piggin complains:
Not talking about swap prefetch itself, but everytime I have asked
anyone to instrument or produce some workload where swap prefetch
helps, they never do.
Fair enough if swap prefetch helps them, but I also want to look at
why that is the case and try to improve page reclaim in some of
these situations (for example standard overnight cron jobs
shouldn't need swap prefetch on a 1 or 2GB system, I would hope).
There have been a few attempts to characterize workloads which are improved
by swap prefetch, but the descriptions tend toward the vague and hard to
reproduce. This is not an easy situation to write a simple benchmark for
(though Con has tried), so demonstrating the problem is a hard thing to
do. Still, if the prefetch proponents are serious about wanting this code
in the mainline, they will need to find ways to better communicate
information about the problems solved by prefetch to the development
community.
Communications with the community have been an occasional problem with
Con's patches. Almost uniquely among kernel developers, Con
chose to do most of his work on his own mailing list. That has resulted in
a self-selected community of users which is nearly uniformly supportive of Con's work,
but which, in general, is not participating much in the development of that
work. It is rare to see patches posted to the ck-list which were not
written by Con himself. The result was the formation of a sort of
cheerleading squad which would occasionally spill over onto linux-kernel
demanding the merging of Con's patches. This sort of one-way communication
was not particularly helpful for anybody involved. It failed to convince
developers outside of ck-list, and it failed to make the patches better.
This dynamic became actively harmful when ck-list members (and Con)
continued to push for inclusion of patches in the face of real problems.
This behavior came to the fore after Con posted the RSDL scheduler. RSDL
restarted the whole CPU scheduling discussion and ended up leading to some
good work. But some users were reporting real regressions with RSDL and
were being told that those regressions were to be expected and would not be
fixed. This behavior soured
Linus on RSDL and set the stage for Ingo Molnar's CFS scheduler. Some
(not all) people are convinced that Con's scheduler was the better design,
but refusal to engage with negative feedback doomed the whole exercise.
Some of Con's ideas made it into the mainline, but his code did not.
The swap prefetch patches appear to lack any obvious problems; nobody is
reporting that prefetch makes things worse. But the ck-list members
pushing for its inclusion (often with Con's encouragement) have not been
providing the sort of information that the kernel developers want to see.
Even so, while a consensus in favor of merging this patch has
not formed, there are some important developers who support its inclusion.
They include Ingo Molnar and David Miller, who says:
There is a point at which it might be wise to just step back and
let the river run it's course and see what happens. Initially,
it's good to play games of "what if", but after several months it's
not a productive thing and slows down progress for no good reason.
If a better mechanism gets implemented, great! We'll can easily
replace the swap prefetch stuff at such time. But until then swap
prefetch is what we have and it's sat long enough in -mm with no
major problems to merge it.
So swap prefetch may yet make it into the mainline - that discussion is
not yet done. If we are especially lucky, Con will find a way to get back into
kernel development, where his talents and user focus are very much in need.
But this sort of situation will certainly come up again. Getting major
changes into the core kernel is not an easy thing to do, and, arguably,
that is how it should be. If the process must make mistakes, it should
probably err on the side of being conservative, even if the occasional
result is the exclusion of patches that end up being helpful.
Page editor: Jonathan Corbet