Brief items
The current development kernel is 2.6.0-test1, which was
released by Linus on
July 13. As is appropriate in this stage of development, this patch
consists (almost) entirely of fixes. See
the
long-format changelog for the details.
The last of the 2.5 kernels was 2.5.75,
released on July 10. This patch merged the anticipatory I/O
scheduler (covered here last January), a new
set of "kblockd" kernel threads (designed to handle block I/O operations
without creating more such operations themselves), a scary new
"nointegrity" JFS mount option, some software suspend tweaks, and, of
course, lots of fixes and updates. See the
long-format changelog for more.
Linus's BitKeeper tree contains a handful of small fixes, as of this
writing.
Alan Cox has gotten back into the 2.6 prepatch business; his latest is 2.6.0-test1-ac2. This patch is made up almost
entirely of fixes which have not yet made their way to Linus. Andrew
Morton's 2.6.0-test1-mm1 is a much more
bleeding-edge affair; it contains the latest ACPI code, the SELinux
security module, a bunch of asynchronous I/O work, the 64-bit
dev_t type, and much more. The -mm tree is also
where the bulk of the scheduler interactivity work is being done.
The current stable kernel is 2.4.21. The 2.4.22 process continues
to move relatively quickly; 2.4.22-pre6
(consisting almost entirely of fixes) was
released on July 14.
Kernel development news
On July 13, Linus
began the 2.6.0-test
series of development kernels. The move to the -test naming scheme
indicates that the 2.5 development period is truly done, and that the focus
is now strongly on stabilization. To that end, the -test1 release
restricted itself to fixes and updates - except for the addition of Andries
Brouwer's cryptoloop driver.
This sort of announcement usually results in a flurry of "but X hasn't been
merged yet" postings. Things are much quieter this time around. It would
seem that the features that the developers want to see in the kernel are
mostly in place. There are a few remaining loose ends,
however:
- The expanded dev_t type. Most of the ground work has been
done, but the size of dev_t has not yet been changed in
Linus's tree. It is widely expected that this work will be completed
before 2.6.0 goes out.
- Power management still needs some work. Much of that work has been
done, but it has not yet been packaged up and submitted to Linus.
- The NSA SELinux security module is being proposed for inclusion.
Linus has not made his feelings known on this patch, but, since it
does not affect anything outside of the module itself, adding SELinux
should be relatively easy to justify. Andrew Morton has indicated
that SELinux will show up in his -mm tree shortly.
- Support for many (or most) non-x86 architectures is not current in the
mainline kernel. This is a pretty standard state of affairs; the
official 2.6.0 kernel will certainly lack functioning support for
several architectures.
- There is some continuing unease over the state of the 2.5 scheduler,
which shows problems with certain kinds of loads.
In the past, Linus has not always been successful in making this kind of
freeze stick. This time around, however, Andrew Morton will be involved in
the stabilization process. Since Andrew will also be maintaining the
resulting 2.6 kernel, he'll have a strong incentive to keep a lid on things
during the test phase.
Now, of course, is the time for people with an interest in 2.6 to try out
the -test releases. Before trying out a 2.6-test kernel for the first
time, however, a reading of Dave Jones's "what
to expect" document is highly recommended (Joe Pranevich's Wonderful World of Linux 2.6
is also worth a look). Also note that putting a
2.6-test kernel on a production system is a risky thing to do; there are
still known bugs and security issues to be dealt with.
Once upon a time - not that long ago - the Linux kernel was unable to work
with more than 1GB of physical memory (actually, just a little bit less).
This limit was imposed by a couple of fundamental design decisions in the
kernel:
- All physical memory was directly reachable via a kernel virtual
address. When the kernel has direct access to all memory,
manipulating that memory is easy. But, to operate in this mode, the
system cannot have more memory than the kernel is able to address.
- The virtual address space was split into two large pieces: the
bottommost 3GB for user-space addresses, and the top 1GB for kernel
addresses.
The 3/1 split was not imposed by any particular external factor; instead,
it was a compromise chosen to balance two limits. The portion of the address space
given over to user addresses limits the maximum size of any individual
process on the system, while the kernel's portion limits the maximum
amount of physical memory which can be supported. Allowing the kernel to
address more memory would reduce the maximum size of every process in the
system, to the chagrin of Lisp programmers and Mozilla users worldwide.
There were, however, patches in
circulation to change the address space split for specific needs.
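The trade-off behind the split can be made concrete with a little arithmetic. The following is only an illustrative sketch: it assumes a 4GB (32-bit) virtual address space, and it ignores the part of the kernel's share that is reserved for other purposes (which is why the real limit was "just a little bit less" than 1GB). The function name split() is invented for this example.

```python
GIB = 1 << 30
ADDRESS_SPACE = 4 * GIB  # assumed 32-bit virtual address space


def split(kernel_gib):
    """Return (user_bytes, kernel_bytes) for a given kernel share in GiB."""
    if not 0 < kernel_gib < 4:
        raise ValueError("kernel share must be between 0 and 4 GiB")
    kernel_bytes = kernel_gib * GIB
    return ADDRESS_SPACE - kernel_bytes, kernel_bytes


# Every gigabyte given to the kernel is taken away from user space.
for kernel_gib in (1, 2, 3):
    user, kernel = split(kernel_gib)
    print(f"{4 - kernel_gib}/{kernel_gib} split: "
          f"max process size {user // GIB}GB, "
          f"max directly mapped memory {kernel // GIB}GB")
```

Whatever the chosen split, the two numbers always add up to the same 4GB; that is the compromise described above.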
The 2.3 development series added the concept of "high memory," which is not
directly addressable by the kernel. High memory complicated kernel
programming a bit - kernel code cannot access an arbitrary page in the
system without setting up an explicit page-table mapping first. But the
payoff that comes with high memory is that much larger amounts of physical
memory can now be supported. Multi-gigabyte Linux systems are now common.
High memory has not solved the problem entirely, however. The kernel is
still limited to 1GB of directly-addressable low memory. Any kernel data
structure which is frequently accessed must live in low memory, or system
performance will be hurt. Increasingly, low memory is becoming the new
limiting factor on system scalability.
Consider, for example, the system memory map, which consists of a
struct page structure for every page of physical memory in the
system. The memory map is a fundamental kernel data structure which must
be placed in low memory. It takes up 40 bytes for every (4096-byte) page
in the system; that overhead may seem small until you consider that, if you
want to put 64GB of memory into an x86 box, the memory map will grow to
some 640 megabytes. This structure thus takes most of low memory by
itself. Low memory must also be used for every other important data
structure, free memory, and the kernel code itself. For a 64GB system, 1GB
of low memory is insufficient to even allow the system to boot, much less
do the sort of serious processing that such machines are bought for.
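The arithmetic is easy to check. This sketch uses only the figures quoted above (40 bytes of struct page per 4096-byte page, 1GB of low memory); the function names are invented for illustration and do not correspond to anything in the kernel.

```python
PAGE_SIZE = 4096        # bytes per page, as stated above
STRUCT_PAGE_SIZE = 40   # bytes per struct page, as stated above
MIB = 1 << 20
GIB = 1 << 30


def mem_map_bytes(physical_bytes):
    """Size of the memory map needed to describe physical_bytes of RAM."""
    pages = physical_bytes // PAGE_SIZE
    return pages * STRUCT_PAGE_SIZE


def low_memory_share(physical_bytes, low_memory_bytes=GIB):
    """Fraction of low memory consumed by the memory map alone."""
    return mem_map_bytes(physical_bytes) / low_memory_bytes


for gib in (1, 4, 16, 64):
    size = mem_map_bytes(gib * GIB)
    print(f"{gib:2d}GB RAM: memory map {size // MIB}MB "
          f"({low_memory_share(gib * GIB):.1%} of 1GB low memory)")
```

For the 64GB case this yields the 640MB figure given above, or 62.5% of low memory before anything else has been allocated.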
One approach to solving this problem is page clustering - grouping physical
pages into larger virtual pages. Among other things, this technique
reduces the size of the memory map. Page clustering was covered here back in February.
Recently, Ingo Molnar posted a patch which
takes a very different approach. Rather than try to squeeze more into 1GB
of low memory, Ingo's patch makes low memory bigger. This is done by
creating separate page tables to be used by user-space and kernel code,
eliminating the need to split the virtual address space between the two realms.
With this patch, a user-space process has a page table which gives it
access to (almost) the full 4GB virtual address space. When the system
goes into kernel mode (via a system call or interrupt), it switches over to
the kernel page tables. Since none of the kernel page table space must be
given to user processes, the kernel, too, can use the full 4GB address
space. The maximum amount of addressable low memory thus quadruples.
There are, of course, costs to this approach, or it would have been adopted
a long time ago. The biggest problem is that the processor's translation
lookaside buffer (the TLB, a hardware cache which stores the results of page
table lookups) must be flushed when the page tables are changed. Flushing the TLB hurts
because subsequent memory accesses will be slowed by the need to do a full,
multi-level page table lookup. And, as it turns out, the TLB flush is,
itself, a slow operation on x86 processors. The additional overhead is
enough to cause a significant slowdown, especially for certain kinds of
loads.
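A toy model shows why the flushes matter. This is not a measurement of Ingo's patch, and it is far simpler than real hardware: the ToyTLB class and run() function are invented for this sketch, the cache is assumed to have unlimited capacity, and the cost of the flush operation itself is not modeled at all. It only counts how many lookups miss the cache and would therefore need a page table walk.

```python
class ToyTLB:
    """A toy TLB: remembers translations until it is flushed."""

    def __init__(self):
        self.entries = set()
        self.misses = 0

    def access(self, page):
        if page not in self.entries:
            self.misses += 1       # would require a page table walk
            self.entries.add(page)

    def flush(self):
        self.entries.clear()


def run(syscalls, user_pages, kernel_pages, separate_tables):
    """Count TLB misses for a loop of user work followed by a system call."""
    tlb = ToyTLB()
    for _ in range(syscalls):
        for page in user_pages:
            tlb.access(("user", page))
        if separate_tables:
            tlb.flush()            # entering the kernel switches page tables
        for page in kernel_pages:
            tlb.access(("kernel", page))
        if separate_tables:
            tlb.flush()            # returning to user space switches back
    return tlb.misses


shared = run(1000, range(8), range(4), separate_tables=False)
separate = run(1000, range(8), range(4), separate_tables=True)
print(f"shared page tables:   {shared} misses")
print(f"separate page tables: {separate} misses")
```

With shared page tables, the translations are looked up once and then stay cached; with separate tables, every trip through the kernel starts both sides over with an empty cache. Loads which make many system calls are, as a result, the ones hurt the most.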
The cost imposed by the separated page tables is more than
most users will want to pay. But for those who have applications requiring
large amounts of memory - and who, for whatever reason, cannot just get a
64-bit system - this patch may well be the piece that makes everything
work. Of course, the chances of such a patch getting into the mainline
kernel before 2.7 are about zero. But it would not be surprising to see it
show up in certain vendors' distributions as an option.
The
Kernel Bug Tracker ("bugme") is a
BugZilla system run by the Open Source Development Lab. It currently holds
information on over 300 reported bugs in the 2.5 kernel. The Tracker is
seen by many as a useful tool that brings some organization and discipline
to the task of stabilizing the kernel. So it came as a surprise to many
when David Miller, maintainer of the networking subsystem,
requested that networking bugs not be entered
into the Tracker. It is, he says, the wrong way of solving the problem.
David's complaint about bug tracking systems is that they try to centralize what
is otherwise an inherently distributed process. Bugs accumulate in the
database, and a single person gets the job of managing all the bugs for a
particular subsystem. If that person does not devote a significant amount
of time to the task, the tracking system quickly clogs up with outdated
reports, duplicated entries, and generally useless stuff. The time that
goes into maintaining the bug tracker is, of course, time that is not
available to actually fix the bugs.
The proper way of dealing with bugs, according to David, is to simply
report them to the relevant mailing list. The report will be seen by the
developers who can fix the bug, others who have been affected by the bug can
contribute additional information, and fixes can be publicly discussed.
And people who, for whatever reason, do not want to deal with a particular
bug report can simply hit "delete" and the message goes away.
Of course, the "goes away" part is not always popular with those who report
bugs; they would rather see the report hang around and annoy people until
one of them deals with the problem. But anybody who has sent a few bug
reports to a public list knows that those reports can simply vanish without
a trace - a rather unsatisfying result. Why bother to report bugs if the
reports can simply be ignored?
According to David (and others), the lossy nature of mailing list bug
reporting is actually a feature. Bug reporting, it is said, is a process
similar to patch submission. Users who do not get satisfaction from a bug
report should resubmit it. If the bug is not important enough for the user
to "maintain" the report, it's not worth a whole lot of effort to fix.
The "submit and retry" approach does have some advantages. Since it puts
more of the responsibility for bug reports on the users submitting those
reports, it scales more reliably as the number of users increases.
Unimportant or "operator error" bugs vanish automatically without anybody
having to shovel them out of a bug tracking system. Bugs which are fixed
by (seemingly) unrelated patches also fade away automatically. The whole
thing works in a scalable way without the need for central managers.
This approach is foreign and scary, however, to those who feel the need to
track every bug and keep a firm hand on the development process. It
provides FUD fodder for those who would portray free software development
as immature and untrustworthy. It's also frustrating to those who want to retain bug
report information for statistical or data mining purposes. It is,
however, typical of how the kernel development process works in general.
And that process, for all its faults, has produced excellent results over
the years as the kernel (and its development team) has grown.
Patches and updates
Kernel trees
Core kernel code
- Rusty Russell: local_t.
(July 16, 2003)
Device drivers
Documentation
Filesystems and block I/O
- Andries.Brouwer@cwi.nl: cryptoloop.
(July 11, 2003)
- Tom Zanussi: relayfs.
(July 15, 2003)
Memory management
Networking
Architecture-specific
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet