The current development kernel is 2.5.36, which was released by Linus on
September 17. The big news was, of course, the merge of the XFS journaling
filesystem. There's also the x86 "huge page" patch, an IEEE-1394
("Firewire") update, a big USB update (converting the code to the new
driver model scheme), an IDE update, and various other fixes. See the
long-format changelog for the details.
Linus has had a busy week; 2.5.35 was released on September 15.
This (large) patch included, among other things, the merge of User-mode Linux, a large IDE
update, various memory management improvements, more threading
improvements, a bunch of NFS server patches and PPC64 and SPARC updates.
Again, the long-format changelog has the details.
Linus's BitKeeper tree, which will become 2.5.37, has some block I/O work,
some RPC fixes, a bit of memory management work, and Linus's simple
solution to the get_pid() problem (see below).
The current 2.5 Status Summary from
Guillaume Boissere is dated September 17.
The current stable kernel is 2.4.19; Marcelo released 2.4.20-pre7 on September 12. Big MIPS and
IA-64 updates make up the bulk of the patch this time around, along with a
relatively small set of other fixes.
Alan Cox's current prepatch is 2.4.20-pre7-ac2. The IDE work continues; this
patch also contains a number of other, unrelated fixes.
The current ancient kernel is 2.2.22, which was released by Alan Cox on September 16. It
contains a few security fixes, so people still running 2.2 will probably
want to have a look at this update.
Kernel development news
A quick look through the kernel source will turn up no end of examples of
code which tests a condition, then calls sleep_on() (or one of its
variants) to wait for that condition to change.
The idea, of course, is to put the process to sleep until something of
interest has happened. The problem with this kind of code is that if the
condition changes (and the wakeup happens) between the condition test and
the sleep_on() call, the process will miss the wakeup and could sleep for
far longer than intended. Because of this inherent race condition, the
elimination of sleep_on() and its variants has been on the kernel hackers'
todo list for some time.
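The racy pattern looks something like the following sketch (the wait queue and condition names here are illustrative, not taken from any particular driver):

```c
/* Schematic of the classic racy sleep; "condition" and "wait_queue"
 * are illustrative names.  If the condition becomes true (and the
 * wakeup is delivered) between the test and the sleep_on() call,
 * the wakeup is lost and the process sleeps anyway. */
while (!condition)
        sleep_on(&wait_queue);
```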
There is a macro (wait_event()) which can be used to sleep safely,
but most code implementing race-free sleeps does so manually, with
approximately the following steps:
- Create a wait queue entry (usually with DECLARE_WAITQUEUE).
- Change the process to a state (usually TASK_INTERRUPTIBLE)
which indicates that it is asleep - even though the process is still
running in kernel code.
- Add the current process to a wait queue which will be awakened when
the condition is met.
- Test the condition of interest; if no sleep is necessary, reset the
process state to TASK_RUNNING, remove the wait queue entry,
and get on with the job at hand.
- Otherwise call the scheduler to let some other process run until
somebody wakes the current process up.
- On wakeup, go back to the top and do it all again.
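Open-coded, those steps look roughly like this (a sketch using the usual wait-queue primitives; "wq" and "condition" are illustrative names):

```c
/* Sketch of a manual, race-free sleep; "wq" and "condition" are
 * illustrative.  Setting the task state *before* testing the
 * condition is what closes the race window: a wakeup arriving
 * after the test simply resets the state to TASK_RUNNING. */
DECLARE_WAITQUEUE(wait, current);

add_wait_queue(&wq, &wait);
for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (condition)
                break;                  /* no sleep needed */
        schedule();                     /* sleep until a wakeup arrives */
}
set_current_state(TASK_RUNNING);
remove_wait_queue(&wq, &wait);
```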
This sequence works because a wakeup will reset the task state to
TASK_RUNNING; this "shorts out" the sleep should the process test
its condition at the wrong time and call the scheduler after the wakeup has
already happened.
In many places, the above steps are complicated by the need to release
locks or other resources before invoking the scheduler. The result is a
lot of duplicated (and error-prone) code throughout the kernel - and this
is the "safe" way of doing things.
As part of his 2.5.35-mm1 patch, Andrew
Morton has included a new interface designed to simplify the coding of safe
sleeps. Code using the new API looks like:
    prepare_to_wait(&wait_queue, &queue_entry, TASK_INTERRUPTIBLE);
    if (!condition)
        schedule();
    finish_wait(&wait_queue, &queue_entry);
The actual series of events that occur has not really changed; things have
just been packaged inside the new prepare_to_wait() and
finish_wait() functions. The result, though, is code which is
cleaner and more likely to be correct. Now it's just a matter of converting
those hundreds of sleep_on() calls still in the 2.5 kernel source...
Ingo Molnar, in his project to give Linux "world-class threading support,"
has set his sights on another Linux performance problem: the allocation of
process ID (PID) numbers for new processes. This does not seem like it
should be a difficult problem, but the current kernel's get_pid()
implementation shows quadratic behavior when the number of processes gets large.
Essentially, the algorithm looks like this:
for each possible PID
    for each task in the system
        if task_pid == pid
            move on to the next PID
The above is an oversimplification, since the get_pid() code tries to
find a range of usable PIDs, not just one. Look here for
the real get_pid() implementation. The point is that, with very
large numbers of processes (i.e. on the order of 100,000),
process ID allocation can lock up the system for long periods of time.
Ingo's solution starts with some work done
by William Lee Irwin. William's "idtag" infrastructure adds hash tables
for managing things with numeric ID tags; it is used in this patch to
manage PID-related things like process groups and session IDs. The idtags
help to eliminate many iterations over the whole process space done in the
kernel, but do not solve the PID allocation problem.
Ingo handles PID allocation through a new allocator that he wrote from
scratch. This allocator maintains an array of pages (allocated as needed)
which are used as PID bitmaps; allocating a new PID becomes a matter of
finding a page with a free PID available, then finding and clearing the
first set bit. It all happens with no locking required. Ingo claims:
Ie. even in the most hopeless situation, if there are 999,999 PIDs
allocated already, it takes less than 10 usecs to find and allocate
the remaining one PID. The common fastpath is a couple of instructions.
So it's fast - though a few extra features
have been requested. But this patch has stirred up a bit of a debate.
Rather than put in a complicated new PID allocator, some developers ask, why
not just make the maximum PID very large? Then, in theory, the quadratic part of
get_pid() will never run so the performance problems go away, and
the code stays simpler. Linus prefers this
approach, as do a number of other developers; he has put a simple patch
along these lines into his pre-2.5.37 BitKeeper tree.
Ingo disagrees, pointing out that any
reasonable maximum PID size can be exceeded eventually. He would rather
fix the problem than try to hide it behind a large process ID space. In the
absence of real-world examples that show people being bitten by
get_pid()'s behavior in a larger PID space, though, Linus appears
unlikely to accept any more complicated fix.
There has been little (visible) progress with the asynchronous I/O code
since the AIO core was merged into the 2.5.32 kernel. AIO author Ben
LaHaise has not been idle, however. Slowly the other pieces of the AIO
package are beginning to show up for the 2.5 tree.
One piece is this patch which adds
"synchronous IOCBs" to the mix. One might wonder why an asynchronous I/O
infrastructure needs I/O control blocks which have a synchronous option.
The answer is that the synchronous IOCB is needed to achieve the goal of
making most or all low-level I/O operations in the kernel be asynchronous.
Once the I/O primitives expect an IOCB, and they work in an asynchronous
mode, it is easy to layer the older, synchronous versions on top through
the use of a synchronous IOCB. For now, synchronous IOCBs are only used in
the generic_file_read() function.
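The layering idea can be sketched like this (purely illustrative: sync_iocb_init(), aio_read_inner(), and sync_iocb_wait() are hypothetical names, not the actual 2.5 AIO interfaces):

```c
/* Hypothetical sketch of a synchronous read layered on an
 * asynchronous primitive; the helper names are invented for
 * illustration only. */
ssize_t example_sync_read(struct file *filp, char *buf,
                          size_t len, loff_t *pos)
{
        struct kiocb iocb;
        ssize_t ret;

        sync_iocb_init(&iocb);          /* mark the IOCB synchronous */
        ret = aio_read_inner(&iocb, filp, buf, len, pos);
        if (ret == -EIOCBQUEUED)        /* operation was queued async */
                ret = sync_iocb_wait(&iocb);    /* block until done */
        return ret;
}
```

The point is simply that the synchronous case becomes a thin wrapper: submit through the asynchronous path, then wait for completion if the operation was queued.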
The next step, perhaps, is this patch from
Badari Pulavarty, which reworks the direct I/O (DIO) infrastructure. The
DIO code handles direct operations on block devices - such as when a
"raw" device is used, or when a file is opened with the O_DIRECT
option. The DIO operations, with this patch, are all asynchronous, with
synchronous IOCBs used when synchronous behavior is required. With this
change, the task of making the block I/O subsystem be asynchronous
internally is nearly complete. Other subsystems (e.g. char devices,
networking) remain to be converted to the AIO scheme, however.
HP has recently been trumpeting its results running the TPC-C benchmark
with Oracle on Linux. Slightly better performance than that achieved with
Windows is claimed. What may be more interesting is this note,
posted to the linux-kernel list, describing
what HP did to its kernel to achieve those results. The kernel that ran
the benchmark had a few patches:
- Asynchronous I/O. Apparently using AIO improved performance by
- Large pages. Going to 2MB pages (i.e. using the large page patch
that went into 2.5.36) improved performance by 8%.
The benchmark also made extensive use of high memory (16GB worth), direct
I/O, and a number of other recent kernel features.
Patches and updates
Core kernel code
Filesystems and block I/O
- Rik van Riel: rmap 14b. (September 18, 2002)
Benchmarks and bugs
Page editor: Jonathan Corbet