LWN.net Logo

Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.37-rc5; there have been no 2.6.37 prepatches in the last week. Linus has returned from his travels and has resumed pulling patches, so a new release can be expected in the near future.

Stable updates: the 2.6.27.57, 2.6.32.27, and 2.6.36.2 updates were released on December 9. Greg Kroah-Hartman notes that this is the last 2.6.27 kernel he will be releasing as he is turning that kernel over to Willy Tarreau for maintenance as a long-term kernel (in keeping with the stable tree changes he announced on December 3). "It was a good run, [encompassing] 57 releases, over 791 days, and included 1596 patches (not the largest of any stable release series, .32 has that beat already). [...] Speaking of .32, I'd _strongly_ encourage all existing .27 users to seriously look into migrating to the .32 kernel tree at this point in time as well."

The 2.6.35.10 update was released - with over 200 patches - on December 15 by new maintainer Andi Kleen. Says Andi: "This release contains security fixes and all 2.6.35 users are encouraged to update."

Comments (none posted)

Quotes of the week

Right now I'm traveling without access to my laptop to actually do pulls. That was unintentional, but my laptop got co-opted for more important stuff (child needed it for school an my other laptop has a broken harddisk), so I've just been reading email with my cellphone.
-- Linus Torvalds needs an Android git port

Abstracting something tends to make it permanent. When you have an ugly, special-case temporary hack, there is merit to having it sitting there in the middle of the code staring you in the face. It's very explicit and we won't forget about it.
-- Andrew Morton

Comments (3 posted)

Gettys: Mitigations and Solutions of Bufferbloat

Jim Gettys continues his discussion of bufferbloat with a posting on how to improve the situation on home systems. "Once tuned, Linux's latency (and the router's latency) can be really nice even under high load (even if I've not tried hard to get to the theoretical minimums). But un-tuned, I can get many second latency out of both Linux home routers and my laptop, just by heading to some part of my house where my wireless signal strength is low (I have several chimneys that makes this trivial). By walking around or obstructing your wireless router, you should be easily able to reproduce bufferbloat in either your router or in your laptop, depending on which direction you saturate."

Comments (23 posted)

Directed yield

By Jonathan Corbet
December 15, 2010
Contemporary CPUs have an interesting feature: they can detect when a virtualized guest is spinning on a lock and trap back into the host kernel. The idea is that the host can probably find better things to do with a processor than dedicate it to spinning in a single guest. KVM will currently respond to such a trap by sleeping briefly, allowing some other process to run outside of the virtualized system in question. But, as Rik van Riel pointed out, that is not necessarily the right thing to do.

If one thread in a virtualized system is spinning on a lock, another thread within that system must be holding the lock. Rather than pause the entire guest, it is better to run the lock-holding thread so that the lock can be released. Pausing the guest just delays the release of the lock, causing the virtual machine as a whole to be penalized; that, says Rik, "results in eg. a 64 VCPU Windows guest taking forever and a bit to boot up." Tempting as it may be to just blame Windows, it's probably better to fix the problem.

Rik's fix is to change how the trap handler behaves; rather than yield the CPU entirely, it will take a spinning thread's remaining CPU time slice and give it to a process on another CPU. The hope is that the recipient of this gift (which is essentially a priority boost) will be the one holding the lock, but there is no real guarantee of that currently. This functionality is implemented with a new yield_to() function which, Rik says, could maybe be turned into a system call if it turns out to be useful in that context.

The patch has been through a couple of rounds of review and may find its way into 2.6.38.

Comments (none posted)

Likely unlikely()s

By Jake Edge
December 15, 2010

The kernel has two macros to assist the compiler and CPU in doing branch prediction: likely() and unlikely(). As their names imply, they are meant to annotate tests in the code based on the likelihood that they will evaluate to true. But, getting it wrong such that something marked likely is actually unlikely—or vice versa—can impact performance as the CPU may prefetch the wrong instructions.

Steven Rostedt has been looking at the problem using the annotated branch profiler and found ten places "that really do not need to have an annotated branch, or the annotation is simply the opposite of what it should be". So, he created a series of patches that either switched the sense of the annotation or removed the likely()/unlikely() entirely.

As an example, page_mapping() had an unlikely() annotation that Rostedt reported as being "correct a total of 1909540379 times and incorrect 1270533123 times, with a 39% being incorrect". Those numbers come from his main workstation which runs a variety of standard programs (Firefox, XChat, etc.) as well as participating in his build farm, so it should represent a reasonably "normal" workload. Being wrong 39% of the time was pretty obviously too much and led to the removal of the annotation for that test.

The changes are various subsystems including the scheduler, memory management, and VFS. So far, there have been no complaints, though there have been several requests to completely remove annotations that had just been changed in order to allow the compiler's and CPU's branch prediction logic make the decision. That, and breaking the patches up into separate sets for each subsystem, caused Rostedt to respin them. It would seem likely() that we'll see them make their way into 2.6.38.

Comments (14 posted)

Kernel development news

Dcache scalability and RCU-walk

By Jonathan Corbet
December 14, 2010
The Linux directory entry ("dentry") cache is a key part of the kernel's filename lookup mechanism. The dentry cache speeds the process of looking up files considerably. On systems with large numbers of cores, though, the dentry cache has become a scalability problem for workloads which perform a lot of lookups. Nick Piggin's dcache scalability work looks like it may be set to be merged for 2.6.38, but, according to Nick, this work "has had nowhere near enough review." The code is complicated and tricky, which, certainly makes review harder. Your editor, never afraid to make a total fool of himself, will now attempt to describe how this patch set works just in case it helps.

A dentry's core job is to represent a directory in the filesystem and to cache the mapping between a name found within that directory and its associated inode. To that end, dentries are organized into a hierarchy which mirrors the on-disk hierarchy found in the filesystem. Each dentry looks vaguely like this:

[dentry]

Every dentry keeps a list of children (names contained within the directory) which can be looked up by name; each dentry also points to the inode it represents, its parent dentry, and a number of other things. Note that the real situation is a bit more complicated than shown here; children are kept in hashed lists for faster lookup, an inode with more than one link may have more than one dentry pointing to it, and so on. But, in a conceptual sense, the above diagram shows what is going on.

When the VFS layer needs to turn a path provided by a program into a pointer to the associated inode, it will perform this lookup through the dentry cache. The first step is to get to the starting point - which could be the root of the filesystem (for absolute paths), the current working directory (for relative paths), or a program-specified directory. In the first two cases, the associated dentries can be found by way of the process's task structure. The first component of the path is looked up in the starting dentry, yielding a pointer to the next dentry in the path; that process is repeated until the end of the path is reached.

It may be that some of the dentries in the path are not present in the cache; there is not enough room in memory to cache the entire filesystem hierarchy. When that happens, the lookup code must ask the filesystem (by way of the parent's inode operations) to perform the lookup; a dentry structure can then be created with the result and inserted into the cache. Of course, a lookup will fail if the file (or a component in the path) does not exist; in that case, the VFS may insert a "negative dentry" into the cache to speed up future failed lookups. That optimization is important - just run a simple command under strace to see how common failed lookups really are.

The dentry cache is a dynamic data structure; dentries will come and go frequently in response to filesystem changes and the simple need to clean out old dentries on occasion. Clearly, some sort of protocol is needed to prevent changes from colliding with each other; if one processor removes a dentry while another is using it to look up a name, good things will almost certainly not result. Once upon a time, the global dcache_lock was used to protect the cache during lookup operations. The global lock scaled poorly, so it has not been used for lookups for some time - though it still does protect many other things.

Current kernels use read-copy-update to manage the removal of dentries from the cache, so a lookup operation need not worry about a specific dentry going away as long as that operation does not block. If a lookup has to call into the filesystem code, though, things might indeed block; to ensure that a dentry will stay around in this situation, a reference count is kept. So, as a lookup works its way down the hierarchy, it will increment the reference count of every dentry it passes through. Most of those references are eventually dropped, though the reference on the final dentry must be kept as long as the file is held open.

Making path lookup truly scalable in a highly parallel situation requires making no changes to the dentry structures themselves. The core of Nick's patch set is a new lookup algorithm called "RCU-walk." The key to RCU-walk is the idea that, on workloads where pathname lookup is likely to present scalability problems, the chances are good that most dentries will already be present in the cache. One might think that the current algorithm would perform well in such situations, but there is a catch: the constant incrementing and decrementing of dentry reference counts creates a great deal of cacheline bouncing - enough to slow things considerably. Making path lookup truly scalable in a highly parallel situation requires making no changes to the dentry structures themselves - turning the cache into a read-only data structure during the lookup process, essentially.

So, when the new code needs to perform a path lookup, it starts with an rcu_read_lock() call. The dentry cache should then remain stable enough for the lookup to get to the end of the path and increment the reference count for the final dentry (only). So lookups can be done without locks - and, crucially, without changing any information in the dentries passed through on the way. That makes the lookup scalability problems go away.

Of course, there are a few catches. The most obvious of these is the situation where one of the dentries in the path is not resident in the cache. At that point, the RCU-walk process must stop; the code will attempt to obtain a reference on the final dentry it found, then return to the older, reference-count-based method ("ref-walk") for the rest of the lookup. Sometimes, getting that mid-point reference will not be possible; that situation will force the lookup to restart from the beginning using ref-walk for the entire path.

More challenging, perhaps, is what happens if one of the dentries changes while the lookup is passing through. By normal RCU standards, that should never happen; changing a structure requires making a copy and making the changes there. The dentry cache bends those rules, though; RCU is mostly used to protect against dentry deletion there. So, in particular, a rename can cause a dentry to change attributes - and hashed lookup lists - while another process is using it for a lookup.

Naturally, using a lock to protect against this possibility would wipe out any scalability gains that would otherwise come from all of this work. So the RCU-walk code uses a seqlock instead. Seqlocks do not prevent concurrent changes, but they do allow code to detect when such a change has occurred. Nick has added a seqlock to every dentry which must be incremented whenever the associated name, parent, or inode changes. If the sequence number changes while a lookup is using a dentry, the lookup will be restarted (with the seqlock write-locked, to prevent heavy renaming from starving other operations indefinitely). For more information on the rename problem and how it's handled, see path-lookup.txt, which is included in the patch set.

The use of RCU has one other implication which forces a significant change. Directory permissions must be checked as part of every step in a lookup operation to ensure that users don't access parts of the filesystem which should not be available to them. Permission checking is done by the filesystem, by way of the permission() inode operation. If this checking is to happen safely during the RCU-walk process, the inode structure must not go away while the lookup is in progress. So, in other words, inodes must now be freed with RCU rather than being released directly. The change is relatively straightforward, but it requires tweaking every filesystem implementation in the tree, so the patch is large.

A number of other filesystem callbacks (d_compare() and d_hash(), for example) have also had to be changed to support RCU-walk. Anybody maintaining an out-of-tree filesystem will have some work to do if and when this patch set is merged.

While RCU-walk is perhaps the most significant change in this patch series, it's far from the only one. Many of the other patches are aimed at the elimination of the global dcache_lock altogether. There is a new dcache_hash_lock to protect hashing operations, dcache_lru_lock for modifications to the dentry LRU list, and dcache_inode_lock to protect inode dentry lists. The scope of the dentry d_lock spinlock has been expanded to cover changes to much of the structure; the reference count (formerly an atomic_t) is also covered by the lock now. All told, it's a large set of patches making fundamental changes to some of the core VFS code.

Interestingly, Nick also changed the dentry d_validate() callback, which, he says, "has been broken for a long time." That function depended on a truly scary routine called kmem_ptr_validate(), described this way:

This verifies that the untrusted pointer looks sane; it is _not_ a guarantee that the pointer is actually part of the slab cache in question, but it at least validates that the pointer can be dereferenced and looks half-way sane.

It is hard to imagine that such a function could be put to any sort of safe use. The only user in current kernels is d_validate(); Nick's patch fixes that usage and gets rid of the function. It seems unlikely to be missed.

Given the complexity of this patch set, it is not surprising that reviews have been relatively scarce. Review time for VFS-related patches has always been hard to come by, and these are trickier than most. The ongoing name-calling match between Nick and Dave Chinner, who has also been working in this area, has not helped; neither has the fact that Al Viro appears to have gone into one of his quiet periods. Given that Linus seems fairly well determined to merge this work, it would be good if at least some of the inevitable glitches could be found before the 2.6.38 merge window. Hopefully enough developers will find the time to review and/or test these patches to make some progress in that direction.

Comments (8 posted)

Unmapped page cache control

By Jonathan Corbet
December 13, 2010
Virtualization places some interesting demands on the host system, many of which are related to memory management. When two agents within the system both believe that they are in charge of memory, interesting conflicts are bound to arise. A recent patch from Balbir Singh shows the kind of effort which is being made to address these conflicts, but it also gives a hint at how a more ambitious effort might really solve the problem.

The Linux page cache keeps copies of pages from on-disk files in main memory in the hopes of avoiding I/O operations when those pages are accessed. At any given time, the page cache can easily account for more than half of the system's total memory usage. The actual size of the page cache varies over time; as other types of memory use (kernel memory and anonymous pages) grow, the page cache shrinks to make room. Balancing the demands of the page cache with other memory users can be a challenge, but Linux seems to get it close to right most of the time.

Balbir's patch is intended to give the system administrator a bit more control over page cache usage; to that end, it provides a new boot-time parameter (unmapped_page_control) which sets an upper bound on the number of unmapped pages in the cache. "Unmapped" pages are those which are not mapped into any process's address space - they do not appear in any page tables. Unmapped pages arguably have a lower chance of being needed in the near future; they are also a bit easier for the system to get rid of. This patch thus gives the system administrator a relatively easy way to minimize page cache memory usage.

The obvious question is: why? Page cache pages will be reclaimed anyway if the system has other needs for the memory, so there would seem to be little point in shrinking it prematurely. The problem, it seems, is virtualization. When a process on a virtualized system reads a page from a file, the guest operating system will dutifully store a copy of that page in its page cache. The actual read operation, though, will be passed through to (and executed by) the host, which will also store a copy in its page cache. So the page gets cached twice - perhaps even more times if it is used by multiple virtual machines. Caching a page can be a good thing, but caching multiple copies is likely to be too much of a good thing.

So what Balbir's patch is doing can be explained this way: it is forcibly cleaning copies of pages out of guest page caches to minimize duplicate copies. The memory freed in this way can be captured by a balloon driver and returned to the host, making it available for more productive use elsewhere in the system.

This technique should clearly improve the situation. Less duplication is good, and, if the guest ends up needing some of the freed pages, those pages stand a good chance of being found in the host's page cache. But one can't help but wonder if it might not be an overly indirect approach. Rather than forcibly reclaim pages from the guests' caches, might it be better to have all of the systems share the same page cache? A single, unified page cache could be managed to maximize the performance of the system as a whole; that should yield better results than managing a number of seemingly independent page caches.

Virtualization based on containers has exactly this type of unified page cache since all of the containers are running on the same kernel. That may be one of the reasons why containers are seen to perform better than fully virtualized systems. Bringing a shared page cache to the virtualized world would be a bit of a challenge, though, which probably has a lot to do with why it has not already been done.

To begin with, there would be some clear security issues. A virtualized system should be unable to access any resources which have not been explicitly provided to it. Any sort of shared page cache would have to be designed in a way which would leave the host in control of which pages are visible to each guest. In practice, that would probably mean using the virtualized block drivers which make filesystems available to virtualized guests now. Rather than "read" a page into a page controlled by the guest, the driver could somehow just map the host's copy of the page into the guest's address space.

Making that work correctly would require the addition of a significant new, Linux-only API between the host and the guest. It would be hard to do it in a way which maintained any sort of illusion that the guest is running on hardware of its own. Such a scheme would complicate memory management in the guest - hardware is increasingly dynamic, but individual pages of memory still do not come and go spontaneously. A shared page cache would also frustrate attempts to use huge pages for guest memory.

In other words, the difficulties of sharing the page cache between hosts and guests look to be decidedly nontrivial. It is not surprising that we are still living in a world where scarce memory pages can be soaked up with duplicate copies of data. As long as that situation holds, there will be a place for patches which cause guests to behave in ways which are more friendly to the system as a whole.

Comments (12 posted)

Book review: Linux Kernel Development, third edition

By Jonathan Corbet
December 15, 2010
It has been well over five years since LWN reviewed the second edition of Robert Love's Linux Kernel Development. Needless to say, things have changed a little since the 2.6.10 release covered by that edition. As it happens, the third edition has been out for a few months now; your editor has finally had a chance to read through his copy and put together a review. In summary, the third edition is a much-needed update, and Linux Kernel Development remains a valuable resource, but there are some disappointments to be found as well.

One has to dig a little bit to figure out which kernel version is covered by the third edition; according to the preface, the target is 2.6.34. Robert, ever the optimist, suggests that it will be good for a long time: "As the Linux kernel matures, there is a greater chance of a snapshot of the kernel remaining representative long into the future." Time will tell.

The third edition has been extensively updated, but it retains the same structure as its predecessors. The preface talks of "all-new" chapters, but the number of chapters remains the same. The scheduler discussion has been updated to reflect the merging of the completely fair scheduler. Other relatively recent kernel changes (mutexes, for example) have been added, [book cover] and there are changes throughout to reflect what has happened over the last 24 kernel releases. There is a new chapter on kernel data structures; it contains the linked list discussion previously found in Appendix A, along with coverage of FIFOs, red-black trees, and the idr subsystem. The low-level device model chapter has been combined with the chapter on loadable modules, for some reason, but the discussion is not much changed. The appendix on the random number generator is gone.

All told, the coverage of the core kernel is well written and clear, in an approachable style. Linux Kernel Development is, at this time, probably the best reference available for developers wanting to learn how the kernel works and how the major pieces fit together. Your editor is glad to have a copy on hand.

With that understanding in place, this, too, must be said: the update to the third edition appears to have been done in a hurry. As a result, the book contains a number of errors and inconsistencies, and it fails to cover no end of interesting things which have happened in the kernel over the last five years while retaining text which was obsolete even in previous editions. Robert has not been hugely present in the kernel development community in the last few years (he got a job at a company with a reputation for removing developers from the community) and, unfortunately, it shows.

For example, this book (which covers 2.6.34) includes a section on the anticipatory I/O scheduler - which was removed in 2.6.33. There is still talk of the "Linus" scheduler - which (as is noted in the book) was a 2.4 feature. The mutual exclusion chapter discusses semaphores - which have been deprecated for mutual exclusion purposes - with the section on mutexes seemingly bolted on afterward. (The book also says elsewhere that we cannot kill a process in an uninterruptible sleep because it "may hold a semaphore") There is a lengthy discussion of the old "bottom half" mechanism which is long gone; the removed-in-2.5 task queue mechanism also merits a section of its own. The unlamented ksymoops tool, we are told, "probably came with your distribution."

Some things are simply wrong. We're told that do_exit() calls del_timer_sync(), but that last happened in 2.6.15. The workqueue discussion appears to be stuck in a 2.6.10 time warp; changes which were merged for 2.6.20 are not reflected here. Kernel stack size is said to be 16K on 64-bit architectures because they usually have 8K pages. The version of struct file shown on page 279 never existed; it includes f_ep_lock which was renamed (by your editor) to f_lock in 2.6.30. The "process address space" chapter says, several times, that all Linux systems have three-level page tables, despite the fact that the fourth level was added for 2.6.11. The device model chapter no longer mentions struct subsystem, but it still appears in the associated diagram.

And many things which should be discussed in a contemporary book aren't. Developers working on the kernel now should probably be familiar with control groups, contemporary debugging tools (kmemleak, lockdep, the fault injection framework, ...), debugfs, ftrace, git, high-resolution timers, huge pages, linux-next, multiple slab allocators, namespaces, perf, power management and quality of service, read-copy-update, readahead, reverse mapping in the VM subsystem, scheduler domains, splice(), TASK_KILLABLE, threaded interrupt handlers, virtualization, and so on. No book on the kernel can hope to be complete and still be something that a person of ordinary strength can physically lift, but these topics are all important enough that, one would assume, at least some of them would merit coverage, but none are mentioned in the third edition.

Keeping up with all that is happening in the kernel is hard - your editor hopes that readers will trust him to understand this. So it is not surprising that some mistakes would slip through when a book is fast-forwarded from 2.6.10 to 2.6.34. But the amount of old stuff that leaked through, combined with the things which should have been mentioned but weren't, seems a bit high; some of them should, at least, have been caught in technical review. As a result, the third edition of Linux Kernel Development is not as good as it could have been.

These grumbles notwithstanding, your editor will restate what was said above: this book remains the best overview of contemporary kernel development available today. Robert is a talented and engaging writer who is able to cover complex topics in a readily understandable manner. The third edition merits a place on the "L1 bookshelf" (the one which can be reached without getting out of the chair); developers who are working with the kernel will probably want a copy.

Comments (24 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds