Brief items
The current 2.6 prepatch is 2.6.24-rc5,
released by Linus on
December 10. He says:
Things _have_ slowed down, although
I'd obviously be lying if I said we've got all the regressions handled and
under control. They are being worked on, and the list is shrinking, but at
a guess, we're definitely not going to have a final 2.6.24 out before xmas
unless santa puts some more elves to work on those regressions.
The list of fixes is still fairly long; there is also a significant
FireWire stack update. The short-form changelog is included in Linus's
announcement; see the
long-format changelog for all the details.
A handful of patches have found their way into the mainline git repository
since the -rc5 release.
Kernel development news
while i dont want to jump to conclusions without looking at some
profiles, i think the SLUB performance regression is indicative of the
following fallacy: "SLAB can be done significantly simpler while keeping
the same performance".
I couldnt point to any particular aspect of SLAB that i could
characterise as "needless bloat".
--
Ingo Molnar
I suppose if the NSA had 20,000 2Ghz processors in the basement
cranking for 10 years, then 50% of the time *after* they did a black
bag job to crack the random pool state, they could get the last 80
bits generated from /dev/random, but it just seems that if you are
assuming the power to grab the pool plus add_ptr, there would be much
more useful things you could --- like for example having the black bag
job trojaning the software to grab the private key directly.
--
Ted Ts'o
Nothing is beyond my skills. My mad k0der skillz are unbeatable.
--
Linus Torvalds
By Jonathan Corbet
December 10, 2007
Syslets are a proposed mechanism which would allow any system call to be
invoked in an asynchronous manner; this technique promises a more
comprehensive and simpler asynchronous I/O mechanism and much more - once
all of the pesky little details can be worked out. A while back, Zach
Brown let it be known that he had taken over the ongoing development of the
syslets patch set; things have been relatively quiet since then. But Zach
has just returned with
a new
syslets patch which shows where this idea is going.
This version of the patch removes much of the functionality seen in
previous postings. The ability to load simple programs into the kernel
for asynchronous execution is now gone, as is the "threadlet" mechanism for
asynchronous execution of user-space functions. Instead, syslets have gone
back to their roots: a mechanism for running a single system call without
blocking.
As had been foreshadowed in other discussions, syslets now use the indirect() system call
mechanism. An application wanting to perform an asynchronous system call
fills in a syslet_args structure describing how the asynchronous
execution is to be handled; the application then calls indirect() to make it
happen. If the system call can run without blocking, indirect()
simply returns with the final status. If blocking is required, the kernel
will (as with previous versions of this patch) return to user space in a
separate process while the original process waits for things to complete.
Upon completion, the final status is stored in user-space memory and the
application is notified in an interesting way.
The syslet_args structure looks like this:
    struct syslet_args {
        u64 completion_ring_ptr;
        u64 caller_data;
        struct syslet_frame frame;
    };
The completion_ring_ptr field contains a pointer to a circular
buffer stored in user space. The head of the buffer is defined this way:
    struct syslet_ring {
        u32 kernel_head;
        u32 user_tail;
        u32 elements;
        u32 wait_group;
        struct syslet_completion comp[0];
    };
Here, kernel_head is the index of the next completion ring entry
to be filled in by the kernel, and user_tail is the next entry to
be consumed by the application. If the two are equal, the ring is empty.
The elements field says how many entries can be stored in the
ring; it must be a power of two. The kernel uses wait_group as a
way of locating a wait queue internally when the application waits on
syslet completion; your editor suspects that this part of the API may not
survive into the final version.
Finally, the completion status values themselves live in the array of
syslet_completion structures, which look like this:
    struct syslet_completion {
        u64 status;
        u64 caller_data;
    };
When a syslet completes, the final return code is stored in
status, while the caller_data field is set with the value
provided in the field by the same name in the syslet_args
structure when the syslet was first started.
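To make the ring semantics concrete, here is a minimal user-space sketch of how an application might drain completions. The structure layouts follow the ones shown above, with fixed-width C types standing in for u32 and u64 and a C99 flexible array member in place of comp[0]. The helpers ring_alloc(), ring_consume() and fake_kernel_complete() are hypothetical names invented for this illustration; they are not part of the patch. The sketch also assumes that both indexes are stored already wrapped to the ring size, which the patch may or may not do.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Layouts as described in the article; uint32_t/uint64_t stand in for u32/u64. */
struct syslet_completion {
    uint64_t status;
    uint64_t caller_data;
};

struct syslet_ring {
    uint32_t kernel_head;   /* next entry the kernel will fill in */
    uint32_t user_tail;     /* next entry the application will consume */
    uint32_t elements;      /* capacity; must be a power of two */
    uint32_t wait_group;
    struct syslet_completion comp[];
};

/* Hypothetical helper: allocate a zeroed (and thus empty) ring. */
static struct syslet_ring *ring_alloc(uint32_t elements)
{
    /* The ring size must be a nonzero power of two. */
    assert(elements != 0 && (elements & (elements - 1)) == 0);

    struct syslet_ring *ring =
        calloc(1, sizeof(*ring) + elements * sizeof(ring->comp[0]));
    if (ring)
        ring->elements = elements;
    return ring;
}

/*
 * Hypothetical helper: copy out one completion if any is pending.
 * Returns 1 if an entry was consumed, 0 if the ring is empty.
 * Assumption: indexes wrap with a mask. Real code would also need
 * memory barriers, since the kernel updates kernel_head concurrently.
 */
static int ring_consume(struct syslet_ring *ring,
                        struct syslet_completion *out)
{
    if (ring->kernel_head == ring->user_tail)
        return 0;                       /* head == tail: ring is empty */

    *out = ring->comp[ring->user_tail];
    ring->user_tail = (ring->user_tail + 1) & (ring->elements - 1);
    return 1;
}

/* Stand-in for the kernel side, so the sketch can be exercised. */
static void fake_kernel_complete(struct syslet_ring *ring,
                                 uint64_t status, uint64_t caller_data)
{
    ring->comp[ring->kernel_head].status = status;
    ring->comp[ring->kernel_head].caller_data = caller_data;
    ring->kernel_head = (ring->kernel_head + 1) & (ring->elements - 1);
}
```

The power-of-two requirement is what makes the mask-based wraparound work. An application would typically call ring_consume() in a loop until it returns zero, using caller_data to match each status with the request that produced it.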
There is one field of syslet_args which has not been discussed
yet: frame. The definition of this structure is
architecture-dependent; for the x86 architecture it is:
    struct syslet_frame {
        u64 ip;
        u64 sp;
    };
These values are used when the syslet completes. After the kernel stores
the completion status in the ring buffer, it will call the function whose
address is stored in ip, using the stack pointer found in
sp. This call serves as a sort of instant, asynchronous
notification to the application that the syslet is done. It's worth noting
that this call is performed in the original process - the one in which the
syslet was executed - rather than in the new process used to return to user
space when the syslet blocked. This function also has nothing to return
to, so, after doing its job, it should simply exit.
So, to review, here is how a user-space application will use syslets to
invoke a system call asynchronously:
- The completion ring is established and initialized in user space.
- A stack is allocated for the notification function, and the
syslet_args structure is filled in with the relevant
information.
- A call is made to indirect() to get the syslet going.
- If the system call of interest is able to complete without blocking,
the return value is passed directly back to user space from
indirect() and the call is complete.
- Otherwise, once the system call blocks, execution switches to a new
process which returns to user space. An ESYSLETPENDING
error is returned in this case.
- Once the system call completes, the kernel stores the return value in
the completion ring and calls the notification function in the
original process.
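The setup step might be sketched as follows. This covers only the filling-in of syslet_args; the indirect() call itself is omitted because its final calling convention is still in flux. The helper syslet_args_prepare() and the placeholder on_complete() are hypothetical names, the signature of the notification function is an assumption, and the sketch assumes a downward-growing stack (as on x86), so sp is pointed at the top of the allocated region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layouts as described in the article (x86 version of syslet_frame). */
struct syslet_frame {
    uint64_t ip;
    uint64_t sp;
};

struct syslet_args {
    uint64_t completion_ring_ptr;
    uint64_t caller_data;
    struct syslet_frame frame;
};

/*
 * Placeholder notification function. Its signature is an assumption.
 * A real one has nothing to return to and must exit rather than return;
 * the body is empty here because this sketch never actually calls it.
 */
static void on_complete(void)
{
}

/*
 * Hypothetical helper: fill in syslet_args for one asynchronous call.
 * `cookie` comes back in the caller_data field of the completion entry.
 */
static void syslet_args_prepare(struct syslet_args *args, void *ring,
                                uint64_t cookie, void (*notify)(void),
                                void *stack_base, size_t stack_size)
{
    assert(stack_size > 0);

    args->completion_ring_ptr = (uint64_t)(uintptr_t)ring;
    args->caller_data = cookie;
    args->frame.ip = (uint64_t)(uintptr_t)notify;
    /* Assumption: the stack grows down, so start at the highest address. */
    args->frame.sp = (uint64_t)(uintptr_t)((char *)stack_base + stack_size);
}
```

Each outstanding syslet presumably needs its own notification stack, since several completions could be delivered before any of the notification functions has finished.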
Should the application wish to stop and wait for any outstanding syslets to
complete, it can make use of a new system call:
int syslet_ring_wait(struct syslet_ring *ring, unsigned long user_idx);
Here, ring is the pointer to the completion ring, and
user_idx is the value of the user_tail index as seen by
the process. Providing the tail as an argument to
syslet_ring_wait() prevents problems with race conditions which
might come about if a
syslet completes after the application has decided to wait. This call will
return once there is at least one completion in the ring.
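The reasoning behind the user_idx argument can be modeled in a few lines. This is only a model of the check the kernel would presumably make, not code from the patch; wait_would_sleep() and demo_lost_wakeup_avoided() are hypothetical names. The idea is that the kernel compares its own head index against the tail value the application saw: if they differ, a completion has arrived in the meantime and the call can return at once instead of sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed kernel-side check: sleeping is only safe if the ring is still
 * empty relative to the tail index the application observed.
 */
static bool wait_would_sleep(uint32_t kernel_head, uint32_t user_idx)
{
    return kernel_head == user_idx;
}

/* Walks through the race described above. */
static void demo_lost_wakeup_avoided(void)
{
    uint32_t kernel_head = 2;
    uint32_t user_tail = 2;

    /* The application sees an empty ring and decides to wait. */
    uint32_t seen_tail = user_tail;
    assert(wait_would_sleep(kernel_head, seen_tail));

    /* A syslet completes before the wait call actually happens. */
    kernel_head = 3;

    /* The stale tail no longer matches, so the call returns immediately. */
    assert(!wait_would_sleep(kernel_head, seen_tail));
}
```

Without the index argument, the kernel would have no way to tell that the application's decision to wait was based on out-of-date information, and the process could sleep with a completion already sitting in the ring.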
The real purpose of this set of patches is to try to nail down the
user-space API for syslets; it is clear that there is still some work to be
done. For
example, there is no way, currently, for an application to use
indirect() to simultaneously launch a syslet and (as was the
original purpose for indirect()) provide additional arguments to
the target system call. In fact, the means for determining which of the
two is being done looks dangerously brittle. As Zach has already noted,
the calling convention needs
to be changed to make the syslet functionality and the addition of
arguments orthogonal.
There are a number of other questions which need to be answered - Zach has
supplied a few of them with the patch. Interaction with ptrace()
is unclear, resource management issues abound, and so on. Zach is clearly
looking for feedback on these issues:
I'm particularly interested in hearing from people who are trying
to use syslets in their applications. This will involve awkward
wrappers instead of glibc calls for now, and your machine may
explode, but hopefully the chance to influence the design of
syslets would make it worth the effort.
So, the message is clear: anybody who is interested in how this interface
will look would be well advised to pay attention to it now.
By Jonathan Corbet
December 11, 2007
The avoidance of writeout deadlocks is a topic which occasionally pops up
on the mailing lists. Most Linux systems are able to handle the writeout
of dirty pages to disk without a great deal of trouble. Every now and
then, however, the system can get itself into a state where it is out of
memory and it must write some pages to disk before any more memory can be
allocated. If the act of writing those pages, itself, requires memory
allocations, the system can deadlock. Systems with complicated block I/O
setups - those using the device mapper, network-based storage, user-space
filesystems, etc. - are
the most susceptible to this problem.
There has been a steady stream of patches aimed at solving this problem;
the write throttling patch
discussed here last August is one of them. The problem is inherently hard
to solve, though; it looks like it may be with us for a long time. Or
maybe not, if Daniel Phillips's new and rather aggressively promoted writeout throttling patch lives
up to its hype.
Daniel's patch is quite simple at its core. His approach for eliminating
writeout-related deadlocks comes down to this:
- Establish a memory reserve from which (only) code performing writeout
can allocate pages. In fact, this reserve already exists, in that
some memory is reserved for the use of processes marked with the
PF_MEMALLOC flag.
- Place an upper limit on the amount of memory which can be used for writeout
to each device at any given time.
The patch does not try to directly track the amount of memory which will be
used by each writeout request; instead, it tasks block-level drivers with
accounting for the number of "units" which will be used. To that end, it
adds an atomic_t variable (called available) and a
function pointer (metric()) to each
request queue. When an outgoing request finds its way to
__generic_make_request(), it is passed to metric() to get
an estimate of the amount of resource which will be required to handle that
request. If the estimated resource requirement exceeds the value of
available, the process will simply block until a request completes
and available is incremented to a sufficiently high level.
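The accounting scheme can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the patch itself: it uses C11 atomics in place of the kernel's atomic_t, the types throttle_queue and fake_request are hypothetical stand-ins for the request queue and the I/O request, and the reservation function reports failure where the kernel would put the process to sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an I/O request. */
struct fake_request {
    unsigned int pages;
};

/* Hypothetical stand-in for the two fields added to each request queue. */
struct throttle_queue {
    atomic_int available;                               /* units still free */
    unsigned int (*metric)(const struct fake_request *); /* cost estimate */
};

/* Example metric: charge one unit per page in the request. */
static unsigned int pages_metric(const struct fake_request *req)
{
    return req->pages;
}

/*
 * Try to charge the request against the queue's budget. Returns true if
 * the units were reserved; false where the kernel would block the caller
 * until enough completions have replenished `available`.
 */
static bool throttle_try_reserve(struct throttle_queue *q,
                                 const struct fake_request *req)
{
    assert(q->metric != NULL);

    int cost = (int)q->metric(req);
    int avail = atomic_load(&q->available);

    while (avail >= cost) {
        /* On failure, `avail` is reloaded and the budget is rechecked. */
        if (atomic_compare_exchange_weak(&q->available, &avail, avail - cost))
            return true;
    }
    return false;
}

/* Called on completion: give the request's units back to the queue. */
static void throttle_release(struct throttle_queue *q,
                             const struct fake_request *req)
{
    atomic_fetch_add(&q->available, (int)q->metric(req));
}
```

The compare-and-exchange loop ensures that two submitters cannot both claim the same units. Note that the scheme depends on metric() returning the same value at submission and completion time; otherwise the budget would drift.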
The metric() function is to be supplied by the highest-level block
driver responsible for the request queue. If that block driver is, itself,
responsible for getting the data to the physical media, estimating the
resource requirements will be relatively easy. The deadlock problems,
however, tend to come up when I/O requests have to go through multiple
layers of drivers; imagine a RAID array built on top of network-based
storage devices, for example. In that case the top level will have to get
resource requirement estimates from the lower levels, a problem which has
not been addressed in this patch set.
Andrew Morton suggested an alternative
approach wherein the actual memory use by each block device would be
tracked. A few hooks into the page allocation code would give a reasonable
estimate of how much memory is dedicated to outstanding I/O requests at any
given time; these
hooks could also be used to make a guess at how much memory each new
request can be expected to need. Then, the block layer could use that
guess and the current usage to ensure that the device does not exceed its
maximum allowable memory usage. Daniel eventually rejected this approach, saying that looking at
current memory use is risky. It may well be that a given device is
committed to serving I/O requests which will, before they are done, require
quite a bit more memory than has been allocated so far. In that case,
memory usage could eventually exceed the cap in a big way. It's better,
says Daniel, to do a conservative accounting at the beginning.
The patch does not address the memory reserve issue at all; instead, it
relies on the current PF_MEMALLOC mechanism. It was necessary,
says Daniel, to give the PF_MEMALLOC "privilege" to some system
processes which assist in the writeout process, but nothing more than that
was needed. He also claims that, for best results, much of the current
code aimed at preventing writeout deadlocks needs to be removed from the
kernel. He concludes:
Let me close with perhaps the most relevant remarks: the attached
code has been in heavy testing and in production for months now.
Thus there is nothing theoretical when I say it works, and the
patch speaks for itself in terms of obvious correctness. What I
hope to add to this in the not too distant future is the news that
we have removed hundreds of lines of existing kernel code,
maintaining stability and improving performance.
Since then, a couple of reviewers have pointed out problems in the code,
dimming its aura of obvious correctness slightly. But nobody has found
serious fault with the core idea. Determining its true effectiveness and
making it work for a larger selection of storage configurations will take
some time and effort. But, if the idea pans out, it could herald the end
of a perennial and unpleasant problem for the Linux kernel.
By Jonathan Corbet
December 12, 2007
As the 2.6.24 release slowly gets closer, the desire to shrink the list of
known regressions grows. As can be seen from
the current list (as of just before
2.6.24-rc5), there is still some work yet to be done. That list is long
enough that, as Linus pointed out in the -rc5 announcement, the traditional
holiday release may not happen this year.
One of those regressions is a failure of a certain model of DVD drive to
work with the 2.6.24-rc kernels; this drive works fine with 2.6.23. A look
at the
corresponding bugzilla entry shows that quite a bit of effort has been
expended (by both developers and testers) toward tracking this one down,
but, as of this writing, its exact cause remains unknown.
So there is not (again, as of this writing) a well-defined fix for the problem.
What is known is which patch broke the device. Tejun Heo describes it this way: "It's introduced
by setting ATAPI transfer chunk size to actual transfer size which is the
right thing to do generally." The current development code
(destined for 2.6.25) works just fine with this device, but that would be
far too big a patch to put into the 2.6.24 kernel at this stage in the
cycle. So Tejun (along with others) continues to look for a simpler fix.
He also has a backup plan:
If we fail to find out the solution in time, we always have the
alternative of backing out the ATAPI transfer chunk size update.
This will break some other cases which were fixed by the change but
those won't be regressions at least and we can add transfer chunk
size update with other changes to 2.6.25.
This plan drew an immediate complaint from
Alan Cox, who notes that backing out this fix will break quite a few
devices which had finally been made to work while fixing only one which is
known to have problems with the new
code. This change, he says, "...is nonsensical and not in the
general good". Alan would rather take the hit of breaking one
device for the benefit of making a larger number of others work properly
for the first time. If need be, the failing drive could be handled via a
special blacklist in 2.6.24.
That idea, however, was firmly shot down by
Linus:
"The one off regression" is likely the tip of an iceberg. If
something regresses for one person, for that one person who tested
and noticed and made a bug-report, there's probably a thousand
people who haven't even tested the development kernel, or who had
problems and just went back to the previous version.
In contrast, reverting something will be guaranteed to not have
those kinds of issues, since the only people who could notice are
people for who it never worked in the first place. There's no
"silent mass of people" that can be affected.
In recent years, as the complexity of the kernel (and concerns about its
quality) have grown, the development community has taken an increasingly
hard line against regressions. As Linus points out above, regressions cause
visible problems for people whose systems were once working; that is a
clear way to lose testers and (eventually) users. On the other hand,
something which has never worked, and which still does not work,
does not make life worse for Linux users. For this reason, the avoidance
of regressions has become one of the highest development priorities.
There is another, related reason: the aforementioned kernel quality
concerns. One can easily ask whether the quality of the kernel is
improving or not, but truly answering that question is not an easy thing to
do. A better kernel may, by attracting additional users, actually result
in more bug reports; similarly, a buggier kernel may drive testers away,
with the result that the number of reported bugs goes down. One cannot
simply look at the lists of known problems and come to a reasonably
defensible conclusion as to whether a given kernel is better than another
or not.
What one can do, however, is ensure that everything which works now
continues to work in future versions. If working things do not break,
then, on the assumption that other problems are occasionally being fixed,
it is reasonable to conclude that the kernel is getting better. If
regressions are allowed, instead, then one never really knows. Regressions
thus are the closest thing we have to an objective measurement of the
quality of a given kernel release, and fixing regressions is an unambiguous
way of improving that quality. So it's no wonder that the higher priority
placed on improving kernel quality has led to a stronger focus on
regressions.
Anybody who has watched Alan Cox's work knows that he cares deeply about
the quality of the kernel. But he thinks that the anti-regression policy
is being taken a little too far this time
around:
To blindly argue regressions are critical is sometimes (as in this
case) to argue that "this freeway is no longer compatible with a
horse and cart" means the freeway should be turned back into a dirt
road.
It may yet be that a proper fix for this problem will be found for 2.6.24,
at which point the larger change can go through. Failing that, though, it
appears that the horses and carts will win the day for now. Those needing
the full freeway will have to wait until the horse-compatible version
becomes available in 2.6.25.
(Update: it appears that
the problem has now been fixed.)
Page editor: Jonathan Corbet