Brief items
The current 2.6 prepatch remains 2.6.14-rc4. The final 2.6.14
kernel was supposed to be out by now, but, as of this writing, it has not
been released. Once the swiotlb problem (see below) has been worked out,
2.6.14 should follow shortly.
The current -mm tree is 2.6.14-rc4-mm1. Recent changes
to -mm include a fair number of VM scalability patches, the nested class
devices patch set (see below), a big x86-64 update, the removal of the
PageReserved() flag, the swap prefetching patches, some
kernel keyring enhancements, the error detection and correction patch set,
a RAID update, and lots of fixes.
Kernel development news
I'm with Roman on this one - the old "show me the code" trick which
people use to quash other people's objections is rather poor form -
we should simply address the objections as raised.
-- Andrew Morton
For those wanting to know more about how the 2.6 virtual memory subsystem
works: Rik van Riel has put together a detailed article on how page faults
are handled on the i386 architecture. This document is apparently the first
of many, all of which should show up on the Linux MM Internals page.
Linus was set on releasing the 2.6.14 kernel on October 17, when a
little issue came up. Serge Belyshev
discovered that it is easy to cause the system
to stop opening files for user-space applications. He posted a program
which, in essence, does the following:
    while (1) {
        int fd = open("/dev/null", O_RDONLY);
        close(fd);
    }
After some 50,000 iterations, the open fails with a "too many open files in
system" message. This behavior can be problematic in more realistic
situations; it evidently can cause highly-parallel kernel builds to fail,
and it also exposes the system to local denial of service attacks. So it
is worth tracking down.
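For readers who want to try this at home, here is a minimal, bounded sketch of the same loop. It is not Serge's program; the function name open_close_loop() and the iteration cap are inventions for this example. It stops at the first failed open() and reports the error, which on an affected kernel would be expected to be ENFILE.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Open and close /dev/null up to max_iters times (hypothetical helper,
 * not the program posted to the list). Returns the number of iterations
 * which succeeded; stops early and reports errno if open() fails.
 */
long open_close_loop(long max_iters)
{
    long i;

    assert(max_iters >= 0);
    for (i = 0; i < max_iters; i++) {
        int fd = open("/dev/null", O_RDONLY);

        if (fd < 0) {
            fprintf(stderr, "open failed after %ld iterations: %s\n",
                    i, strerror(errno));
            break;
        }
        close(fd);
    }
    return i;
}
```

On a kernel without the problem, a call like open_close_loop(100000) should simply return 100000; a smaller return value means that the open failed along the way.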
The kernel places a limit on the number of files which are allowed to be
open simultaneously. That limit is not normally expected to include files
which have been closed, however. The problem, as it turns out, is a
virtual filesystem scalability patch which was merged in September.
That patch eliminates some locking around file structures in the
kernel, and, to that end, defers certain tasks (such as file cleanup) to
the read-copy-update
mechanism. For this particular case, file
structures corresponding to closed files are building up in the RCU
callback list, and RCU is not getting around to freeing them in time.
Initially, it was thought that the culprit was another patch which put a
limit on the processing of the RCU callback lists. Those lists can get
quite long, and lengthy callback processing was causing latency problems
elsewhere in the kernel. So a "batch size" of ten was imposed; after ten
callbacks have been processed, the RCU subsystem defers the rest until
later. It seemed that this limit was causing the freeing of file
structures to languish. Raising the batch limit to 10,000 seemed to
improve the situation, so Linus merged a patch to that effect.
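To see why a small batch limit looked like the culprit, consider a toy user-space model (not kernel code) in which a fixed number of callbacks is queued on every tick and at most one batch is processed. The function name and the arrival rates used with it are assumptions made for illustration only.

```c
#include <assert.h>

/*
 * Toy model of batch-limited callback processing: on each tick,
 * `arrivals` callbacks are queued and at most `batch` of them are run.
 * Returns the number of callbacks still queued after `ticks` ticks.
 */
long backlog_after(long ticks, long arrivals, long batch)
{
    long queued = 0;

    assert(ticks >= 0 && arrivals >= 0 && batch >= 0);
    for (long t = 0; t < ticks; t++) {
        long done;

        queued += arrivals;
        done = queued < batch ? queued : batch;
        queued -= done;
    }
    return queued;
}
```

With 50 arrivals per tick and a batch size of ten, the backlog in this model grows by 40 entries on every tick; with a batch size of 10,000 it stays at zero. The model deliberately ignores the grace period, which is why it tells only part of the story.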
But, in fact, the higher batch limit did not solve the problem for real.
RCU callbacks cannot be called immediately after being queued. They must,
instead, wait until every processor on the system has scheduled at least
once. This "quiescence" requirement is the kernel's way of ensuring that no
references to the freed structure remain; it's a key part of how RCU
works. If a process chews through file structures quickly enough,
they will accumulate while the kernel waits for the grace period to run
out, and no changes to the batch limits will help. The only way to be able
to process those callbacks - and free the associated structures - is to
force every processor to schedule.
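The grace period rule can be sketched as a simple piece of bookkeeping. What follows is a toy model under stated assumptions, not the kernel's RCU implementation: the structure, the function names, and the four-CPU count are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4   /* arbitrary CPU count for this model */

/* One flag per CPU: has it scheduled since the grace period began? */
struct grace_period {
    bool scheduled[NCPUS];
};

/* Begin a new grace period: no CPU has scheduled yet. */
void gp_start(struct grace_period *gp)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        gp->scheduled[cpu] = false;
}

/* Record that the given CPU has scheduled at least once. */
void gp_note_schedule(struct grace_period *gp, int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    gp->scheduled[cpu] = true;
}

/* Queued callbacks may run only when every CPU has scheduled. */
bool gp_elapsed(const struct grace_period *gp)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (!gp->scheduled[cpu])
            return false;
    return true;
}
```

In this model a single CPU which never schedules holds up every queued callback, no matter how large the batch limit is.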
A couple of patches have been posted in an attempt to deal with this
problem. One of them simply changes the way file structures are
accounted for - they are removed from the count of open files when the RCU
callback is queued, rather than when it is executed. This patch stops
programs from running into the maximum open file limit, but does nothing to
stop the growth of the RCU callback queues. So the patch which got merged,
instead, is this one from Eric Dumazet,
which keeps track of the length of the callback list. Should the list get
to be too long (where "too long" is wired at 10,000 entries), a reschedule
is forced so that the callbacks can be processed. This patch appears to
have dealt with the problem well enough to allow 2.6.14 to come out, though
more refinement may be required afterward.
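The idea behind the merged fix, as described above, can be modeled in a few lines. This is a user-space sketch and not Eric's patch: the structure and function names are hypothetical, and the exact comparison used against the 10,000-entry threshold is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define QLEN_LIMIT 10000   /* the "too long" threshold described above */

/* Hypothetical model of a callback list which tracks its own length. */
struct cb_queue {
    long len;                 /* callbacks currently queued */
    bool resched_requested;   /* set when a reschedule should be forced */
};

/* Queue one callback; request a reschedule once the list is too long. */
void cb_enqueue(struct cb_queue *q)
{
    assert(q && q->len >= 0);
    q->len++;
    if (q->len > QLEN_LIMIT)
        q->resched_requested = true;
}

/*
 * In this model, called once the grace period has elapsed: run every
 * queued callback and clear the request. Returns how many were run.
 */
long cb_run_all(struct cb_queue *q)
{
    long ran;

    assert(q && q->len >= 0);
    ran = q->len;
    q->len = 0;
    q->resched_requested = false;
    return ran;
}
```

The point of the design is that the queue length, rather than the passage of time, decides when the system pushes for the end of the grace period.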
Unfortunately for those who are waiting for 2.6.14, another problem turned
up. Some 64-bit architectures which
lack I/O memory management units must be very careful in setting up DMA
areas. A number of devices can only reliably deal with 32-bit DMA
addresses, so DMA areas must be allocated in the lower part of memory. To
that end, the x86-64 and ia64 architectures use a mechanism called the
"software I/O translation buffer", or swiotlb. It is simply a large chunk
of low memory, allocated at boot time, which is used as a bounce buffer for
DMA operations involving 64-bit-challenged devices.
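The bounce buffer idea can be illustrated with a small user-space model. None of this is the kernel's swiotlb code: the dma_buf structure, the simulated physical addresses, and the function names are assumptions made for the example, and only the device-bound (write) direction is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_32BIT_LIMIT 0x100000000ULL   /* 4GB: end of 32-bit addressing */

/* Hypothetical buffer description with a simulated physical address. */
struct dma_buf {
    uint64_t phys;
    unsigned char *data;
    size_t len;
};

/* True if [addr, addr + len) lies entirely below the 4GB boundary. */
bool dma32_reachable(uint64_t addr, uint64_t len)
{
    return len <= DMA_32BIT_LIMIT && addr <= DMA_32BIT_LIMIT - len;
}

/*
 * Pick the buffer a 32-bit-only device should be pointed at. If the
 * source is out of reach, copy its contents into the low bounce buffer
 * and hand that to the device instead.
 */
const struct dma_buf *map_for_device(const struct dma_buf *src,
                                     struct dma_buf *bounce)
{
    if (dma32_reachable(src->phys, src->len))
        return src;

    assert(src->len <= bounce->len);
    assert(dma32_reachable(bounce->phys, bounce->len));
    memcpy(bounce->data, src->data, src->len);
    return bounce;
}
```

A buffer at a simulated address of 8GB would be copied into the bounce buffer in this model, while a buffer already in low memory is used directly.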
It was noted that the 2.6.14-rc4 kernel can
allocate the swiotlb area in high memory, which defeats the entire
purpose. This revelation led to a long discussion of how swiotlb memory
should be allocated. It turns out that there is no easy way of finding the
low memory on the system. Once upon a time, that memory would belong to
CPU 0, but on some contemporary NUMA
systems, the low memory might be elsewhere. So the real solution
appears to be to iterate through all CPUs on the system, try to allocate from
each of them, and test to see if the resulting memory is within the DMAable
range. If not, the memory is freed and the next processor is tried. A
couple of patches implementing this approach are circulating; none has been
merged as of this writing.
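The allocate-and-test approach might look something like the following sketch. It is not taken from any of the circulating patches: the allocator callbacks, the simulated addresses, and the "fake machine" at the bottom (where only CPU 2 owns low memory) are all hypothetical and exist only to make the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_32BIT_LIMIT 0x100000000ULL   /* 4GB */

/* Hypothetical per-CPU allocator hooks; alloc returns 0 on failure. */
typedef uint64_t (*cpu_alloc_fn)(int cpu, uint64_t size);
typedef void (*cpu_free_fn)(int cpu, uint64_t addr, uint64_t size);

static int fits_below_4g(uint64_t addr, uint64_t size)
{
    return size <= DMA_32BIT_LIMIT && addr <= DMA_32BIT_LIMIT - size;
}

/*
 * Try each CPU in turn; keep the first allocation which lands in the
 * DMA-able range, and free any which does not. Returns 0 if no CPU
 * can supply suitable memory.
 */
uint64_t alloc_low_buffer(int nr_cpus, uint64_t size,
                          cpu_alloc_fn alloc, cpu_free_fn release)
{
    assert(alloc && release);
    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        uint64_t addr = alloc(cpu, size);

        if (addr == 0)
            continue;               /* this CPU had nothing to give */
        if (fits_below_4g(addr, size))
            return addr;
        release(cpu, addr, size);   /* too high; try the next CPU */
    }
    return 0;
}

/* Fake machine for demonstration: only CPU 2 owns memory below 4GB. */
int fake_frees;

uint64_t fake_alloc(int cpu, uint64_t size)
{
    (void)size;
    if (cpu == 2)
        return 0x01000000ULL;
    return 0x200000000ULL + (uint64_t)cpu * 0x40000000ULL;
}

void fake_free(int cpu, uint64_t addr, uint64_t size)
{
    (void)cpu; (void)addr; (void)size;
    fake_frees++;
}
```

On the fake machine, alloc_low_buffer(4, 1 << 20, fake_alloc, fake_free) rejects the allocations from CPUs 0 and 1 before settling on the one from CPU 2.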
Two weeks ago, this page looked at nested classes in sysfs as a way of
representing the input subsystem device hierarchy to user space. This week,
Greg Kroah-Hartman posted a set of patches with the latest version of
class_device nesting; the selling feature this time around was that the
patches "actually work." With this patch set, it is possible to create a
hierarchy under /sys/class which represents the known input devices on the
system and their relationship to the actual system hardware. Greg also
notes that this patch set makes possible the long-anticipated move of
/sys/block into the class hierarchy.
So all would seem to be well in sysfs land. But Greg finished his
announcement with the following:
Oh, one final thing. I really don't think that input should be a
class. It looks like a "bus" and acts like a "bus" (you have
different devices that have different drivers bind to them, and you
want to load those drivers with the hotplug mechanism.)
This note opened the floodgates to a wider discussion; it seems that a
number of people are not entirely happy with the /sys/class
hierarchy. Udev hacker Kay Sievers complained:
The nesting classes implement a fraction of a device hierarchy in
/sys/class. It moves arbitrary relation information into the class
directory, where nothing else than device classification belongs.
What is the rationale behind sticking device trees into class?
What seems to have happened here is that a number of devices, mostly of the
virtual variety, have found their home in the class hierarchy rather than
with the other devices. As a result, the class tree has grown more
complicated, and it has moved away from its original purpose, which was to
be a way of grouping devices which share the same interface and function.
So Kay (among others) has proposed that much of what is currently in the
class tree be moved over to /sys/devices with the rest of the
device information. The idea is that user space does not really care about
the distinction between "real" and "virtual" devices, and the kernel
interface should not either.
Greg, who holds a big vote on device model issues, has responded thusly:
Ok, I've spent a while thinking about this proposal and originally
I thought it was the same thing we had heard years ago. But I was
wrong, moving the class stuff into the device tree is the right
thing to do, as long as we keep them as new "things" in the tree...
So it would seem that big changes are in store for the Linux device model.
This code has grown and evolved considerably since its introduction in 2.5;
it may be time for a big rework. Actually changing things without causing
major pain for users could be a bit of a challenge, however. It will have
to be approached carefully.
The plan under consideration for now is to simply try to solve the input
subsystem problem for 2.6.15. That most likely involves the nested
class_device patches, perhaps with some changes to avoid breaking
things in user space (and udev in particular). Things look more
ambitious in the longer term:
Then, we move the class stuff into real devices. There was always
a lot of duplication with the class and device code, and this shows
that there is a commonality there. At the same time, I'll work on
making the attribute stuff easier and possibly merge the kobject
and device structures together a bit (possibly I said, I don't know
quite how much yet...)
The end result is that there is likely to be some significant churn in the
device model code in the coming months. There will almost certainly be
consequences for the driver API, and for user space as well. If it all
works out, however, we should end up with a device model which is easier to
understand and work with in both kernel and user space.
LWN looked at the ktimers patch about one month ago. Work continues on the
new kernel timer mechanism; the latest version of the patch includes a new
"clockevents" abstraction intended to make
high-resolution timer support easier to implement in an
architecture-independent way. The patch appears to be coming together
well, and there has been little in the way of criticism.
...with the exception of one observer, who has kept up a steady stream of
complaints about the new mechanism. His objections include the name (he
would rather see "process timers" than "ktimers"), the use of
high-resolution time within the kernel, and various "unnecessary
complexities." The discussion has been mostly unfruitful, to the point that
the normally even-keeled Ingo Molnar tried to end it with a "shut up and
show me the code" challenge. That
led Andrew Morton to state that "show me the code" is no longer an
acceptable arguing point for kernel discussions, and that the objections
should be addressed regardless.
Getting a handle on the objections has proved hard; it is not clear that
the person in question (Roman Zippel) truly understands the patches. One
bit of the
discussion is worth a look, however. It has been repeatedly pointed out
that the existing kernel timer mechanism is optimized for timeouts which
rarely actually expire, while ktimers are expected to expire.
Roman claimed:
Whether the timer event is delivered or not is completely
unimportant, as at some point the event has to be removed anyway,
so that optimizing a timer for (non)delivery is complete nonsense.
This claim led to a required-reading response
from Ingo on the history of the kernel timer mechanism and why
optimizing for delivery (or the lack thereof) is not nonsense. That
particular branch of
the discussion, at least, should not need to go much further.
Andrew Morton has, in the past, stated that he would be highly reluctant to
merge new code over the objections of a developer. The need to address all
objections can be highly frustrating to kernel hackers, especially when new
complaints seem to keep turning up as the old ones are resolved. The
result of this process, when it works well, can be a stronger kernel. But
it can also be the delaying of useful
code which few people have problems with. It is starting to look like that
may be the outcome in the ktimers case; the code will almost certainly be
merged in the end, perhaps with almost no changes resulting from the
current discussion.
Page editor: Jonathan Corbet