Kernel release status
The current 2.6 prepatch remains 2.6.16-rc5; no new -rc releases
have been made over the last week. A slow trickle of patches continues to
find its way into the mainline git repository as bugs are tracked down and
fixed.
The current -mm release is 2.6.16-rc5-mm3. Recent changes
to -mm include a patch to allow NFS mounts from a common server to share
superblocks, CPU hotplug support for the x86-64 architecture, a
continuation of the /proc rework, and some device mapper work.
The current stable 2.6 kernel is 2.6.15.6, released on March 5,
following shortly after 2.6.15.5. The two updates carry
a few dozen patches, a number of which address security-related issues.
Kernel development news
Quote of the week
Users of Suspend2 can rest assured that I will not allow the patches to suffer
bitrot. I will be continuing to use them myself, and will therefore have the
best of incentives to keep them up-to-date.
Now for the downside: I won't, however, be making any sort of concerted effort
at getting them merged into the vanilla kernel after my move, and am not
inclined to make a big effort beforehand.
-- Nigel Cunningham
Double kfree() errors
Less than 24 hours after Coverity announced the availability of a new set
of machine-detected potential kernel bugs, Dave Jones started posting
fixes. Judging from these fixes, a number of the problems detected this
time around are double-free errors - passing the same pointer to
kfree() twice. Freeing memory twice is a sure way to corrupt core
kernel data structures, leading to trouble in unpredictable places far from
where the real bug is to be found. Avoiding this kind of error would make
life easier for everybody involved.
To that end, Dave tossed out a simple idea:
have kfree() poison pointers so that a second call can be detected
immediately. His first proposal looked like this:
    #define kfree(foo) \
        __kfree(foo); \
        foo = KFREE_POISON;
This code was not meant to be incorporated as-is; for starters, it probably
needs a pair of braces. But there were a couple of other problems which
popped up. One of them is that, since passing a NULL pointer to
kfree() is legal, passing it twice is also legal. But this code
would break that case. Whether that would be a problem for real code is
unclear. Al Viro pointed out a more
serious issue: the pointer passed to kfree() is not always an
lvalue which can be assigned to. So simply redefining kfree() in
this way would lead to compilation errors.
The end result is that a transparent, in-place replacement for
kfree() may be hard to implement. An alternative might be the creation of a
safe_kfree() variant, combined with some serious pressure to use
that variant. Then, perhaps, double-free errors could be caught when they
happen.
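
The objections above can be made concrete with a minimal user-space sketch of such a variant. This is an illustration only, not code from the discussion: free() stands in for __kfree(), and the names checked_free(), double_frees_caught, and safe_kfree_demo(), as well as the poison value, are hypothetical. The do/while wrapper supplies the missing braces, NULL pointers are left alone so that passing one twice stays legal, and the argument must still be an lvalue.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical poison value; any address that is never valid would do. */
#define KFREE_POISON ((void *)(uintptr_t)0x6b6b6b6b)

/* Number of double frees caught so far (for illustration only). */
static int double_frees_caught;

/* Stand-in for __kfree(): refuse to free an already-poisoned pointer. */
static void checked_free(void *p)
{
    if (p == KFREE_POISON) {
        double_frees_caught++;
        return;
    }
    free(p);    /* free(NULL) is a no-op, like kfree(NULL) */
}

/*
 * do/while(0) makes the macro behave as a single statement.
 * A NULL pointer is not poisoned, so freeing it twice remains legal.
 * Note that ptr must be an lvalue and is evaluated more than once.
 */
#define safe_kfree(ptr)                 \
    do {                                \
        checked_free(ptr);              \
        if ((ptr) != NULL)              \
            (ptr) = KFREE_POISON;       \
    } while (0)

/* Example use: one real double free, one harmless double free of NULL. */
void safe_kfree_demo(void)
{
    char *p = malloc(16);
    char *q = NULL;

    safe_kfree(p);              /* frees the buffer, poisons p */
    safe_kfree(p);              /* caught: p holds the poison value */
    assert(double_frees_caught == 1);

    safe_kfree(q);
    safe_kfree(q);              /* still legal, nothing is counted */
    assert(q == NULL && double_frees_caught == 1);
}
```

A call like safe_kfree(get_ptr()) would fail to compile because the argument cannot be assigned to, which is the issue Al Viro raised about redefining kfree() itself.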
Or, instead, one could use the double-free checking already built into the
kernel. The slab allocator, which is (among other things) the engine
behind kmalloc() and kfree(), has options for poisoning
(writing special values to) all memory which it handles. One value
(0x5a in every byte) marks uninitialized memory, while another
(0x6b) is written into memory when it is freed. The resulting
patterns jump out nicely in oops listings, often making the cause of the
problem immediately obvious. But the use-after-free value can also enable
the detection of double-free errors - assuming that the memory is not
reallocated between kfree() calls.
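
The idea can be sketched with a toy user-space object pool. This is not the slab allocator's implementation; the pool, its sizes, and every function name here are hypothetical, and only the two poison byte values come from the text. A free of an object which already carries the freed pattern is reported, while a stale free after the object has been handed out again goes unnoticed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_INUSE 0x5a   /* marks allocated but uninitialized memory */
#define POISON_FREE  0x6b   /* written into memory when it is freed */

#define OBJ_SIZE 32         /* hypothetical object size */
#define NUM_OBJS 4          /* hypothetical pool size */

static unsigned char pool[NUM_OBJS][OBJ_SIZE];

/* Return 1 if every byte of the object holds val, 0 otherwise. */
static int filled_with(const unsigned char *obj, unsigned char val)
{
    size_t i;

    for (i = 0; i < OBJ_SIZE; i++)
        if (obj[i] != val)
            return 0;
    return 1;
}

/* Mark the whole pool as free. */
void toy_pool_init(void)
{
    memset(pool, POISON_FREE, sizeof(pool));
}

/* Hand out the first object still carrying the freed pattern. */
void *toy_alloc(void)
{
    int i;

    for (i = 0; i < NUM_OBJS; i++) {
        if (filled_with(pool[i], POISON_FREE)) {
            memset(pool[i], POISON_INUSE, OBJ_SIZE);
            return pool[i];
        }
    }
    return NULL;
}

/* Return 0 on success, -1 if the object already looks freed. */
int toy_free(void *obj)
{
    if (filled_with(obj, POISON_FREE))
        return -1;
    memset(obj, POISON_FREE, OBJ_SIZE);
    return 0;
}

/* Walk through the two cases described in the text. */
void toy_poison_demo(void)
{
    unsigned char *a, *b;

    toy_pool_init();
    a = toy_alloc();
    assert(a != NULL && a[0] == POISON_INUSE);
    assert(toy_free(a) == 0);
    assert(toy_free(a) == -1);  /* second free: caught */

    a = toy_alloc();
    assert(toy_free(a) == 0);
    b = toy_alloc();            /* reuses the slot a pointed to */
    assert(b == a);
    assert(toy_free(a) == 0);   /* stale free after reallocation: missed */
}
```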
The problem, it seems, is that not a whole lot of developers are running
with slab poisoning enabled. As a result, they are working without a
valuable debugging tool and allowing certain kinds of bugs to persist in
the code base. So a part of the solution to the problem may well be a
stronger effort to get developers to turn the slab poisoning option on.
Beyond that, any sort of checking added to kfree() (or a variant)
should be harder to disable than the existing debugging options.
RCU and open file accounting
David Miller has been making great progress in his port of the Linux kernel
to Sun's new "Niagara" (SPARC) CPU architecture. He has
run into one little problem, however:
I just wanted to report that I am hitting the "VFS: file-max limit
xxx reached" problem quite easily on my 32-cpu Niagara machine with
16GB of ram with current 2.6.x GIT. It seems far too easy to get a
box into this state due to SLAB fragmentation and RCU. And once
you get a machine into this state it is totally unusable.
Our test case is usually a "make -j8192" kernel build along with a
parallel bootstrap of gcc. That puts about 256 processes on each
cpu's runqueue, I doubt ksoftirqd can run much at all.
The file limit problem was last discussed here in October, when it delayed the
release of the 2.6.14 kernel. A fix merged at that time made the problem
harder to trigger, but, as David's experience shows, the problem has not
been solved altogether. One might argue that a relatively small number of
users run the sort of workload that David is playing with. But the point
remains: with current kernels, including the upcoming 2.6.16 release, it is
possible for a suitably-written program to run the open file count to its
maximum, thus denying any sort of service to other users. This seems like
a problem which one might want to fix.
One piece of the puzzle here is the way that the open file count is
managed. Currently, that count is decremented in the slab destructor set
up for file structures. This method works, but it can cause the
decrement to be delayed by an arbitrary amount of time, with the result
that the open file count overstates the number of files which are actually
held open by processes in the system. Moving that operation out of the
slab destructor can help to keep the count more in sync with reality.
The core of the problem, however, is the use of the read-copy-update (RCU)
mechanism for management of file structures. When a file is
closed, the task of freeing the structure is queued in RCU. Using RCU lets
the kernel ensure that the structure is not freed while references to it
remain, but without the sort of locking overhead that comes with other
techniques. As a result, performance is measurably improved on SMP
systems.
When there is a lot of opening and closing of files going on (such as, say,
when a wild-eyed developer starts an 8192-process kernel build), the length of
the RCU callback queue can get quite long. By the time that the RCU code
decides that the system has quiesced and it is safe to invoke the RCU
callbacks, the queue might have thousands of entries. Working through the
entire callback queue led to latency problems elsewhere in the system, so
2.6.14 included a patch which put an upper limit on the number of callbacks
which would be processed in any single iteration.
The limit helped with the latency problem. But, if the generation of RCU
callbacks continues at a high rate, the length of the callback queue can
only grow. Every entry in the queue represents memory which could be
returned to the system, but which has not yet been made available. So, as
the queue grows, memory gets fragmented and the system heads towards the
dreaded out-of-memory state.
An attempt at a solution can be found in this
patch by Dipankar Sarma, which has been sitting in the -mm tree for a
while. Dipankar's patch puts a configurable upper limit on the number of
RCU callbacks which will be processed in any single batch; that allows
system administrators to tune the batch size to their particular needs. On
a server which is dealing with a large number of file requests, and on which
latency is not a crucial issue, the batch size can be set to a large
number.
The patch also adds a high-water limit. If the length of the RCU callback
queue ever exceeds that limit, the RCU code will (1) set the batch
limit to infinity (or the integer representation thereof) and (2) send
out an inter-processor interrupt forcing every CPU on the system to
schedule. The combination of these actions will cause the system to work
through the entire RCU queue at the soonest possible time. Once the queue
length goes below a low-water limit, the old batch limit will be restored.
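
The interplay between the batch limit and the two watermarks can be modeled in a few lines of user-space C. This is a simplified sketch of the behavior described above, not the patch itself: the structure, the function names, and the numbers used in the demo are hypothetical, and the inter-processor interrupt is reduced to a counter.

```c
#include <assert.h>
#include <limits.h>

/* Toy model of one CPU's RCU callback queue (all names hypothetical). */
struct toy_rcu {
    long qlen;            /* callbacks currently queued */
    int  blimit;          /* batch limit now in force */
    int  default_blimit;  /* administrator-configured batch limit */
    long qhimark;         /* high-water limit */
    long qlowmark;        /* low-water limit */
    int  kicks;           /* times the "everybody schedule" kick was sent */
};

void toy_rcu_init(struct toy_rcu *r, int blimit, long qhimark, long qlowmark)
{
    r->qlen = 0;
    r->blimit = blimit;
    r->default_blimit = blimit;
    r->qhimark = qhimark;
    r->qlowmark = qlowmark;
    r->kicks = 0;
}

/* Queue one callback; past the high-water limit, lift the batch limit. */
void toy_call_rcu(struct toy_rcu *r)
{
    if (++r->qlen > r->qhimark && r->blimit != INT_MAX) {
        r->blimit = INT_MAX;    /* "infinity" as an integer */
        r->kicks++;             /* stands in for the IPI */
    }
}

/* Process one batch and return the number of callbacks invoked. */
long toy_do_batch(struct toy_rcu *r)
{
    long n = r->qlen < r->blimit ? r->qlen : r->blimit;

    r->qlen -= n;
    /* Once the queue has drained far enough, restore the old limit. */
    if (r->blimit == INT_MAX && r->qlen <= r->qlowmark)
        r->blimit = r->default_blimit;
    return n;
}

/* Batch limit 10, high-water limit 100, low-water limit 5. */
void toy_rcu_demo(void)
{
    struct toy_rcu r;
    int i;

    toy_rcu_init(&r, 10, 100, 5);
    for (i = 0; i < 100; i++)
        toy_call_rcu(&r);
    assert(toy_do_batch(&r) == 10);     /* normal batch, 90 remain */

    for (i = 0; i < 11; i++)
        toy_call_rcu(&r);               /* 101 queued: over the limit */
    assert(r.blimit == INT_MAX && r.kicks == 1);

    assert(toy_do_batch(&r) == 101);    /* whole queue in one pass */
    assert(r.qlen == 0 && r.blimit == 10);
}
```

With only the fixed batch limit, a queue that gains more than ten entries between batches would grow without bound in this model; the high-water check is what lets it catch up.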
It is, in other words, a somewhat unsubtle approach; the system is given a
kick in the rear and told to go clean up its mess. But, it seems, that is
exactly what the system needs at such a time. The cleanup task can only be
deferred for so long; the work eventually needs to be done regardless.
David has reported that the patches fix the problem on his Niagara system,
and suggests that they should be merged into 2.6.16. It is a fairly
significant patch to merge at this late point in the cycle, but there seems
to be a reasonably high level of confidence in its stability. So, chances
are that it will be included as a preferable alternative to shipping 2.6.16
with a known problem.
Some upcoming sysfs enhancements
A glance at Greg Kroah-Hartman's
state of the driver core and sysfs
message shows that a number of changes are queued up for future kernel
cycles. A couple of those add new features to sysfs, and seem worth a
mention.
Attribute files in sysfs serve as a channel for sharing information between
the kernel and user space. As more of the information interface moves to
sysfs, an increasing number of user-space programs will be making use of
sysfs attributes. Often, these programs will want to respond when the
value of a sysfs attribute changes. In current kernels, however, there is
no easy way for an application to know when an attribute has changed; the
only option is to repeatedly re-read the file and check for new values.
The current -mm kernels include a patch by Neil Brown which makes it
possible to create pollable attributes. With such attributes, user space
need only open the attribute of interest and pass it to poll() with
the POLLERR and POLLPRI events selected. When
poll() returns, the file can be reopened and reread to obtain the
new value.
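
On the user-space side, that sequence might look something like the following sketch. The function name wait_and_reread() is hypothetical, as is any path passed to it; the code simply follows the steps given above using the standard open(), poll(), and read() calls, and will only see a wakeup on an attribute whose subsystem supports polling.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a pollable sysfs attribute to change, then reopen and reread it.
 * Returns the number of bytes read into buf, 0 if the timeout expired
 * first, or -1 on error. A negative timeout_ms waits indefinitely.
 */
ssize_t wait_and_reread(const char *path, char *buf, size_t len,
                        int timeout_ms)
{
    struct pollfd pfd;
    ssize_t n;
    int fd, ret;

    assert(path != NULL && buf != NULL);

    pfd.fd = open(path, O_RDONLY);
    if (pfd.fd < 0)
        return -1;
    pfd.events = POLLERR | POLLPRI;     /* the events named in the text */
    pfd.revents = 0;

    ret = poll(&pfd, 1, timeout_ms);
    close(pfd.fd);
    if (ret <= 0)
        return ret;                     /* 0: timeout, -1: error */

    /* The attribute has changed: reopen it and read the new value. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, len);
    close(fd);
    return n;
}
```

A caller would typically invoke this in a loop, treating a zero return as a timeout.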
Internally, the patch adds a wait queue head to every kobject on the
system; that queue is inserted into a poll table in response to a
poll() call. The sysfs code has no way of knowing, however, when
the value of any given sysfs attribute has changed, so the subsystem
implementing a pollable attribute must make explicit calls to:
    void sysfs_notify(struct kobject *kobj, char *dir, char *attr);
Here, kobj and attr describe the attribute whose value
has been changed. The dir argument need only be supplied when the
given kobject has a special subdirectory (and the attribute is in that
directory). This call will cause any polling process to wake up and see
that a new value is available.
With the current code, there is no way to mark attributes which can be
polled. Any process which calls poll() on an attribute which does
not support polling will end up waiting rather longer than the developer
intended.
While sysfs attributes are normally low-bandwidth items - holding generally
a single value - the relayfs subsystem (added in 2.6.14) is meant to be a
high-bandwidth pipe from the kernel to user space. Relayfs is often used
for debugging tasks, such as relaying large amounts of kernel trace data
for later analysis. User space gets at that data stream by opening a
channel file created in the special-purpose relayfs filesystem.
As it turns out, relayfs contains a fairly nice internal
abstraction for its file operations, making it possible to create entries
for relay channels in other filesystems. Paul Mundt recently put together a patch taking advantage of this
feature to allow kernel code to
create relayfs channels in sysfs. The reaction to this capability was
positive; indeed, it was seen as a better interface to the relay code than
relayfs itself. So Paul's patches have grown into a full reworking of the
relay interface, with the separate relayfs filesystem going away entirely.
Most of the interfaces remain unchanged; in particular, almost the entire
kernel API (as described in the documentation
file) remains as it was before. But now there is a pair of new
functions:
    int sysfs_create_relay_file(struct kobject *kobj,
                                struct relay_attribute *attr);
    void sysfs_remove_relay_file(struct kobject *kobj,
                                 struct relay_attribute *attr);
A simple call to sysfs_create_relay_file() will add a relay
channel attribute to the given kobject. The relay_attribute
structure must be filled in with information about the actual channel. On
the user-space side, the only change is that the application must look in a
different place to find the relay channel. All of the supported operations
(mmap() in particular) work as before.
Barring last-minute objections, both of these patches seem likely to be
merged for 2.6.17.
Page editor: Jonathan Corbet