Brief items
The current 2.6 development kernel is 2.6.25-rc8,
released on April 1. This
one has a number of fixes and takes care of some of the most obnoxious
remaining regressions; it could, conceivably, be the last -rc before the
final 2.6.25 release. See the announcement for details, or
the
long-format changelog for lots of details.
The current -mm tree is 2.6.25-rc8-mm1. Recent changes
to -mm include a big set of IDE patches, a new "special" page flag (see
below), the device whitelist control group, writable mmap()
support for the FUSE filesystem, the object debugging infrastructure
patches, and lots of fixes.
Comments (none posted)
Kernel development news
This is "high" priority because the wife will kill me if she
doesn't have her videos. And the adobe player won't install on
current rawhide due to some library issues.
--
Linus
Torvalds files a bug report.
Argh. My apologies for sloppiness of flame. Conclusion still
stands.
-- Politeness,
Al Viro style.
Comments (5 posted)
By Jonathan Corbet
March 31, 2008
Linux enthusiasts like to point out just how scalable the system is; Linux
runs on everything from pocket-size devices to supercomputers with several
thousand processors. What they talk about a little bit less is that, at
the high end, the true scalability of the system is limited by the sort of
workload which is run. CPU-intensive scientific computing tasks can make
good use of very large systems, but database-heavy workloads do not scale
nearly as well. There is a lot of interest in making big database systems
work better, but it has been a challenging task. Nick Piggin appears to
have come up with a logical next step in that direction, though, with a
relatively straightforward set of core memory management changes.
For some time, Linux has supported direct I/O from user space. This, too, is a scalability
technology: the idea is to save processor time and memory by avoiding the
need to copy data through the kernel as it moves between the application
and the disks. With sufficient programming effort, the application should
be able to make use of its superior knowledge of its own data access patterns
to cache data more effectively than the kernel can; direct I/O allows that caching
to happen without additional overhead. Large database management systems
have had just that kind of programming effort applied to them, with the
result that they use direct I/O heavily. To a significant extent, these
systems use direct I/O to replace the kernel's paging algorithms with their
own, specialized code.
When the kernel is asked to carry out a direct I/O operation, one of the
first things it must do is to pin all of the relevant user-space pages into
memory and locate their physical addresses. The function which performs
this task is get_user_pages():
int get_user_pages(struct task_struct *tsk,
struct mm_struct *mm,
unsigned long start,
int len,
int write,
int force,
struct page **pages,
struct vm_area_struct **vmas);
A successful call to get_user_pages() will pin len pages
into memory, those pages starting at the user-space address start
as seen in the given mm. The addresses of the relevant struct
page pointers will be stored in pages, and the associated VMA
pointers in vmas if it is not NULL.
This function works, but it has a problem (beyond the fact that it is a
long, twisted, complex mess to read): it requires that the caller hold
mm->mmap_sem. If two processes are performing direct I/O on
within the same address space - a common scenario for large database
management systems - they will contend for that semaphore. This kind of
lock contention quickly kills scalability; as soon as processors have to
wait for each other, there is little to be gained by adding more of them.
There are two common approaches to take when faced with this sort of
scalability problem. One is to go with more fine-grained locking, where each
lock covers a smaller part of the kernel. Splitting up locks has been
happening since the initial creation of the Big Kernel Lock, which is the
definitive example of coarse-grained locking. There are limits to how much
fine-grained locking can help, though, and the addition of more locks comes
at the cost of more complexity and more opportunities to create deadlocks.
The other approach is to do away with locking altogether; this has been the
preferred way of improving scalability in recent years. That is, for
example, what all of the work around read-copy-update has been doing. And
this is the direction Nick has chosen to improve get_user_pages().
Nick's core observation is that, when get_user_pages() is called
on a normal user-space page which is already present in memory, that page's
reference count can be increased without needing to hold any locks first.
As it happens, this is the most common use case. Behind that observation,
though, are a few conditions. One is that it is not possible to traverse
the page tables if those tables are being modified at the same time. To be
guaranteed that this will not happen, the kernel must, before heading into
the page table tree, disable interrupts in the current processor. Even
then, the kernel can only traverse the currently-running process's page
tables without holding mmap_sem.
Lockless operation also will not work whenever pages which are not "normal"
are involved. Some cases - non-present pages, for example - are easily
detected from the information found in the page tables themselves. But
others, such as situations where the relevant part of the address space has
been mapped onto device memory with mmap(), are not readily
apparent by looking at the associated page table entries. In this case,
the kernel must look back at the controlling vm_area_struct (VMA)
structure to see what is going on - and that cannot be done without holding
mmap_sem. So it looks like there is no way to find out whether
lockless operation is possible without first taking the lock.
The solution here is to grab a free bit in the page table entry. The PTE
for a page which is present in memory holds the physical page frame
address. In such addresses, the bottom 12 bits (for architectures using
4096-byte pages) will always be zero, so they can be dedicated to other
purposes. One of them is used to indicate whether the page is present in
memory at all; others indicate writability, whether it's a user-space page,
whether it is dirty, etc. Nick's patch grabs one of the few remaining bits
and calls it "PAGE_BIT_SPECIAL," indicating "special" pages.
These are pages which, for whatever reason, do not have a
readily-accessible struct page associated with them. Marking
"special" pages in the page tables can help in a number of places; one of
those is making it possible to determine whether lockless
get_user_pages() is possible on a given page.
Once these pages are properly marked in the page tables, it is possible
to write a function which makes a good attempt at a lockless
get_user_pages(). Nick's proposal is called
fast_gup():
int fast_gup(unsigned long start, int nr_pages,
int write, struct page **pages);
This function has a much simpler interface than get_user_pages()
because it does not handle many of the cases that get_user_pages()
can deal with. It only works with the current process's address space, and
it cannot return pointers to VMA structures. But it can iterate
through a set of page tables, testing each page for presence, writability,
and "non-specialness," and incrementing each page's reference count (thus
pinning it into physical memory) in the process. If it works, it's very
fast. If not, it undoes things then falls back to
get_user_pages() to do things the slow, old-fashioned way.
How much is this worth? Nick claims a 10% performance improvement running
"an OLTP workload" (one of those unnameable benchmark suites, perhaps) using
IBM's DB2 DBMS system on a two-processor (eight-core) system. The
performance improvement, he says, may be greater on larger systems. But
even if it remains at "only" 10%, this work is a clear step in the right
direction for this kind of workload.
[Update: this interface was merged for the 2.6.27 kernel; the name
was changed to get_user_pages_fast() but it is otherwise the
same.]
Comments (none posted)
By Jonathan Corbet
April 2, 2008
The steady growth in flash-based memory devices looks set to transform
parts of the storage industry. Flash has a number of advantages over
rotating magnetic storage: it is smaller, has no moving parts, requires
less power, makes less noise, is truly random access, and it has the
potential to be faster.
But flash is not without its own idiosyncrasies. Flash-based devices
operate on much larger blocks of data: 32KB or more. Rewriting a portion
of a block requires running an erase cycle on the entire block (which can
be quite slow) and writing the entire block's contents. There is a limit
to the number of times a block can be erased before it begins to corrupt
the data stored there; that limit is low enough that it can bring a
premature end to a flash-based device's life, especially if the same block
is repeatedly rewritten. And so on.
A number of approaches exist for making flash-based devices work well.
Many devices, such as USB drives, include a software "flash translation
layer" (FTL); this layer performs the necessary impedance matching to make
a flash device look like an ordinary block device with small sectors.
Internally, the FTL maintains a mapping between logical blocks and physical
erase blocks which allows it to perform wear leveling - distributing
rewrite operations across the device so that no specific erase block wears
out before its time - though some observers question whether low-end flash
devices bother to do that. The use of FTL layers makes life easy for the rest of
the system, but it is not necessarily the way to get the best performance
out of the hardware.
If you can get to the device directly, without an FTL getting in the way,
it is possible to create filesystems which embody an awareness of how flash
works. Most of our contemporary filesystems are designed around rotating
storage, with the result that they work hard to minimize time-consuming
operations like head seeks. A flash-based filesystem need not worry about
such issues, but it must be concerned about things like erase blocks
instead. So making the best use of flash requires a filesystem written
with flash in mind.
The main filesystem for flash-based devices on Linux is the venerable
JFFS2. This filesystem works, but it was designed for devices which are
rather smaller than those available today. Since JFFS2 must do things like
rebuild the entire directory tree at mount time, it can be quite slow on
large devices - for relatively small values of "large" by 2008 standards.
JFFS2 is widely seen as reaching the end of its time.
A more contemporary alternative is LogFS, which has been discussed on these pages in the
past. This work remains unfinished, though, and development has been
relatively slow in recent times; LogFS has not yet been seriously
considered for merging into the mainline. A more recent contender is UBIFS; this code is in a state
of relative completion and its developers are asking for serious review.
UBIFS depends on the UBI layer, which was merged for 2.6.22. UBI
("unsorted block images") is not, technically, an FTL, but it performs a
number of the same functions. At the heart of UBI is a translation table
which maps logical erase blocks (LEBs) onto physical erase blocks (PEBs).
So software using UBI to access flash sees a device providing a simple set
of sequential blocks which apparently do not move. In fact, when an LEB is
rewritten, the new data will be placed into a different location on the
physical device, but the upper layers know nothing about it. So UBI makes
problems like wear leveling and bad block avoidance go away for the upper
layers. UBI also takes care of running time-consuming erase operations in
the background when possible so that upper layers need not wait when
writing a block.
One little problem with UBI is that the logical-to-physical mapping
information is stored in the header of each erase block. So when the UBI
layer initializes a flash device, it must read the header from every block
to build the mapping table in memory; this operation clearly takes time.
For 1GB flash devices, this initialization overhead is tolerable; in the
future, when we'll be booting our laptops with terabyte-sized flash drives
in them, the linear scan will be a problem. The UBIFS developers are aware
of this issue, but believe that it can be solved at the UBI level without
affecting the higher-level filesystem code.
By using UBI, the UBIFS developers are able to stop worrying about some
aspects of flash-based filesystem design. Other problems remain, though.
For example, the large erase blocks provided by flash devices require
filesystems to track data at the sub-block level and to perform occasional
garbage collection: coalescing useful information into new blocks so that
the remaining "dead" space can be reclaimed. Garbage collection, along
with the
potential for blocks to turn bad, makes space management on flash
devices tricky: freeing space may require using more space first, and there
is no way to know how much space will actually become available until the
work has been done.
In the case of UBIFS, space management is an even trickier problem for a
couple of reasons. One is that, like a number of other flash filesystems,
UBIFS performs transparent compression of the data. The other is that,
unlike JFFS2, UBIFS provides full writeback support, allowing data to be
cached in memory for some time before being written to the physical media.
Writeback gives large performance improvements and reduces wear on the
device, but it can lead to big trouble if the filesystem commits to writing
back more data than it actually has the space to store. To deal with this
problem, UBIFS includes a complex "budgeting" layer which manages
outstanding writes with pessimistic assumptions on what will be possible.
Like LogFS, UBIFS uses a "wandering tree" structure to percolate changes up
through the filesystem in an atomic manner. UBIFS also uses a journal,
though, to minimize the number of rewrites to the upper-level nodes in the
tree.
The latest UBIFS posting raised questions about how it compares with
LogFS. The resulting discussion was ... not entirely technical, but a few
clear points came out. UBIFS is in a more complete state and appears to
perform quite a bit better at this time. LogFS is a lot less code, avoids
the boot-time linear scan of the device, and is able to work (with some
flash awareness) through an FTL. Which is better is not a question your
editor is prepared to answer at this time; what does seem clear is that the
growing competition between the two projects has the potential to inspire
big improvements on both sides in the near future.
Comments (16 posted)
By Jonathan Corbet
April 2, 2008
The Linux Foundation has just published
a
white paper, written by Greg Kroah-Hartman, Amanda McPherson, and your
editor, reviewing the origins of the code merged into the kernel from
2.6.11 through 2.6.24. As LWN readers know, the 2.6.2
5 kernel is
getting close to release. So this seems like as good a time as any to look
at what happened with the process in this release cycle.
As of this writing, 12,269 individual changesets have been merged for
2.6.25 - a new record. That beats the previous record (2.6.24, with a mere
10,353 changesets) by almost 2,000. There were 1,174 individual developers
involved with 2.6.25, 419 of whom contributed one single patch. All told,
those developers worked for 159 employers (that your editor could
identify). The changes added 766,979 lines of code and removed 399,791, for
a total growth of 367,188 lines.
Here is an updated version of a plot that your editor has been fond of
showing during talks in recent years:
This plot shows a cumulative count of lines changed over time, with kernel
release dates added in. The effects of the merge window policy can be seen
in the stair-step appearance of the plot. The steps appear to be getting
bigger, but the time between releases has also increased slightly, so the
overall rate of change remains roughly constant. It is a high rate, with
over five million lines changed - well over half the total - in the last
two years.
So who did this work? Here is the traditional table of the most active
developers in the 2.6.25 series:
| Most active 2.6.25 developers |
| By changesets |
| Bartlomiej Zolnierkiewicz | 304 | 2.5% |
| Patrick McHardy | 219 | 1.8% |
| Adrian Bunk | 212 | 1.7% |
| Ingo Molnar | 207 | 1.7% |
| Paul Mundt | 204 | 1.7% |
| Greg Kroah-Hartman | 171 | 1.4% |
| Jesper Nilsson | 166 | 1.4% |
| Thomas Gleixner | 164 | 1.3% |
| Pavel Emelyanov | 155 | 1.3% |
| Harvey Harrison | 148 | 1.2% |
| Herbert Xu | 136 | 1.1% |
| Mauro Carvalho Chehab | 136 | 1.1% |
| Roland McGrath | 134 | 1.1% |
| David Woodhouse | 134 | 1.1% |
| Al Viro | 132 | 1.1% |
| Michael Krufky | 128 | 1.0% |
| Glauber Costa | 127 | 1.0% |
| David S. Miller | 112 | 0.9% |
| Andrew Morton | 109 | 0.9% |
| Takashi Iwai | 104 | 0.8% |
|
| By changed lines |
| Jesper Nilsson | 34407 | 3.7% |
| David Howells | 29733 | 3.2% |
| Eliezer Tamir | 26153 | 2.9% |
| Adrian Bunk | 21998 | 2.4% |
| Kumar Gala | 19753 | 2.2% |
| Paul Mundt | 18918 | 2.1% |
| Jiri Slaby | 18002 | 2.0% |
| Glenn Streiff | 16597 | 1.8% |
| Auke Kok | 13939 | 1.5% |
| David Gibson | 11255 | 1.2% |
| Michael Chan | 11254 | 1.2% |
| Ingo Molnar | 10679 | 1.2% |
| James Bottomley | 9907 | 1.1% |
| Christoph Hellwig | 9784 | 1.1% |
| Mauro Carvalho Chehab | 9332 | 1.0% |
| Bartlomiej Zolnierkiewicz | 9108 | 1.0% |
| Thomas Gleixner | 9104 | 1.0% |
| Patrick McHardy | 8563 | 0.9% |
| Michael Krufky | 8195 | 0.9% |
| Takashi Iwai | 7825 | 0.9% |
|
There are some familiar names on this list, but also some new ones.
Bartlomiej Zolnierkiewicz contributed more changesets than any other
developer; his work is contained entirely within the IDE subsystem.
Patrick McHardy works in the networking area, mostly (but not exclusively)
with the netfilter subsystem. Adrian Bunk continues to make small fixes
all over the tree and to relentlessly hunt down unused code for removal.
Ingo Molnar remains busy in his new role as one of the x86 maintainers;
scheduler work also accounts for a number of his changes. Paul Mundt
maintains the SuperH architecture.
The picture is a little different when one considers how many lines of code
were changed. Jesper Nillson's work was done within the CRIS
architecture. David Howells works all over the tree; his largest
contribution was the addition of the MN10300 architecture code. Eliezer
Tamir contributed the bnx2x (Broadcom Everest) network driver, and Kumar
Gala works with the PowerPC architecture.
There is relatively little change in the lists of employers associated with
all of this work (please remember that the numbers associated with
employers are necessarily approximate):
| Most active 2.6.25 employers |
| By changesets |
| (None) | 1918 | 15.6% |
| Red Hat | 1562 | 12.7% |
| (Unknown) | 1232 | 10.0% |
| Novell | 826 | 6.7% |
| IBM | 758 | 6.2% |
| Intel | 566 | 4.6% |
| SWsoft | 266 | 2.2% |
| Oracle | 250 | 2.0% |
| Astaro | 219 | 1.8% |
| (Academia) | 218 | 1.8% |
| Renesas Technology | 217 | 1.8% |
| Movial | 213 | 1.7% |
| Axis Communications | 166 | 1.3% |
| linutronix | 166 | 1.3% |
| Freescale | 132 | 1.1% |
| Qumranet | 127 | 1.0% |
| Google | 124 | 1.0% |
| Analog Devices | 121 | 1.0% |
| SGI | 118 | 1.0% |
| (Consultant) | 111 | 0.9% |
|
| By lines changed |
| (None) | 132117 | 14.4% |
| (Unknown) | 117993 | 12.8% |
| Red Hat | 103188 | 11.2% |
| IBM | 59249 | 6.4% |
| Freescale | 52336 | 5.7% |
| Intel | 46466 | 5.1% |
| Novell | 41790 | 4.5% |
| Axis Communications | 39382 | 4.3% |
| Broadcom | 37789 | 4.1% |
| Renesas Technology | 23704 | 2.6% |
| Movial | 22327 | 2.4% |
| Hansen Partnership | 12076 | 1.3% |
| Marvell | 11661 | 1.3% |
| Oracle | 11214 | 1.2% |
| linutronix | 10649 | 1.2% |
| Astaro | 10167 | 1.1% |
| (Consultant) | 9342 | 1.0% |
| SWsoft | 7849 | 0.9% |
| MontaVista | 7517 | 0.8% |
| (Academia) | 7353 | 0.8% |
|
As usual, one can also look at who applies a Signed-off-by header to code
for which they are not the author. These headers illustrate the chain of
trust which gets code into the kernel. For 2.6.25, the top approvers of
patches are:
| Sign-offs in the 2.6.25 kernel |
| By developer |
| Andrew Morton | 1513 | 12.2% |
| David S. Miller | 1444 | 11.7% |
| Ingo Molnar | 1153 | 9.3% |
| Thomas Gleixner | 991 | 8.0% |
| John W. Linville | 614 | 5.0% |
| Jeff Garzik | 468 | 3.8% |
| Mauro Carvalho Chehab | 447 | 3.6% |
| Greg Kroah-Hartman | 345 | 2.8% |
| Paul Mackerras | 307 | 2.5% |
| James Bottomley | 306 | 2.5% |
| Jaroslav Kysela | 292 | 2.4% |
| Linus Torvalds | 249 | 2.0% |
| Len Brown | 220 | 1.8% |
| Russell King | 197 | 1.6% |
| Takashi Iwai | 170 | 1.4% |
| Avi Kivity | 167 | 1.4% |
| Bryan Wu | 132 | 1.1% |
| Herbert Xu | 123 | 1.0% |
| Roland Dreier | 121 | 1.0% |
| Kumar Gala | 107 | 0.9% |
|
| By employer |
| Red Hat | 4185 | 33.8% |
| Google | 1516 | 12.2% |
| linutronix | 994 | 8.0% |
| (None) | 883 | 7.1% |
| IBM | 689 | 5.6% |
| Novell | 611 | 4.9% |
| (Unknown) | 534 | 4.3% |
| Intel | 468 | 3.8% |
| Hansen Partnership | 306 | 2.5% |
| Linux Foundation | 254 | 2.1% |
| (Consultant) | 242 | 2.0% |
| Qumranet | 170 | 1.4% |
| Oracle | 126 | 1.0% |
| SGI | 126 | 1.0% |
| Freescale | 121 | 1.0% |
| Cisco | 121 | 1.0% |
| Analog Devices | 115 | 0.9% |
| Astaro | 107 | 0.9% |
| Renesas Technology | 82 | 0.7% |
| Movial | 78 | 0.6% |
|
Some of these developers are quite busy; Andrew Morton is signing off
more than twenty patches every day - weekends included. The gatekeepers to
the kernel continue to work for a relatively small number of companies,
with the top ten employers accounting for over 75% of all non-author
signoffs.
All told, all these numbers paint a picture of a development process which
is healthy and continues to set a fast pace. It incorporates work from an
increasingly large community of developers who are able to work in a highly
cooperative manner despite the fact that their employers are fierce
competitors. There are very few projects like it.
(Thanks to Greg Kroah-Hartman for his help in the creation of these
statistics).
Comments (6 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Architecture-specific
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>