Brief items
The current development kernel remains 2.6.37-rc5; there have been
no 2.6.37 prepatches in the last week. Linus has returned from his travels
and has resumed pulling patches, so a new release can be expected in the
near future.
Stable updates: the 2.6.27.57, 2.6.32.27, and 2.6.36.2 updates were released on
December 9.
Greg Kroah-Hartman notes that this
is the last 2.6.27 kernel he will be releasing as he is turning that kernel
over to Willy Tarreau for maintenance as a long-term kernel (in keeping with
the stable tree changes he
announced on December 3). "It was a good run, [encompassing] 57 releases, over 791 days, and included
1596 patches (not the largest of any stable release series, .32 has that
beat already).
[...]
Speaking of .32, I'd _strongly_ encourage all existing .27 users to
seriously look into migrating to the .32 kernel tree at this point in
time as well."
The 2.6.35.10 update was released - with
over 200 patches - on December 15 by new maintainer Andi Kleen. Says
Andi: "This release contains security
fixes and all 2.6.35 users are encouraged to update."
Comments (none posted)
Right now I'm traveling without access to my laptop to actually do
pulls. That was unintentional, but my laptop got co-opted for more
important stuff (child needed it for school an my other laptop has
a broken harddisk), so I've just been reading email with my
cellphone.
--
Linus Torvalds needs an Android git port
Abstracting something tends to make it permanent. When you have an
ugly, special-case temporary hack, there is merit to having it
sitting there in the middle of the code staring you in the face.
It's very explicit and we won't forget about it.
--
Andrew Morton
Comments (3 posted)
Jim Gettys continues his discussion of bufferbloat with
a
posting on how to improve the situation on home systems. "
Once
tuned, Linux's latency (and the router's latency) can be really nice even
under high load (even if I've not tried hard to get to the theoretical
minimums). But un-tuned, I can get many second latency out of both Linux
home routers and my laptop, just by heading to some part of my house where
my wireless signal strength is low (I have several chimneys that makes this
trivial). By walking around or obstructing your wireless router, you
should be easily able to reproduce bufferbloat in either your router or in
your laptop, depending on which direction you saturate."
Comments (23 posted)
By Jonathan Corbet
December 15, 2010
Contemporary CPUs have an interesting feature: they can detect when a
virtualized guest is spinning on a lock and trap back into the host
kernel. The idea is that the host can probably find better things to do
with a processor than dedicate it to spinning in a single guest. KVM will
currently respond to such a trap by sleeping briefly, allowing some other
process to run outside of the virtualized system in question. But, as Rik
van Riel
pointed out, that is not
necessarily the right thing to do.
If one thread in a virtualized system is spinning on a lock, another thread
within that system must be holding the lock. Rather than pause the entire
guest, it is better to run the lock-holding thread so that the lock can be
released. Pausing the guest just delays the release of the lock, causing
the virtual machine as a whole to be penalized; that, says Rik,
"results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up." Tempting as it may be to just blame Windows, it's
probably better to fix the problem.
Rik's fix is to change how the trap handler behaves; rather than yield the
CPU entirely, it will take a spinning thread's remaining CPU time slice and
give it to a process on another CPU. The hope is that the recipient of
this gift (which is essentially a priority boost) will be the one holding
the lock, but there is no real guarantee of that currently. This
functionality is implemented with a new yield_to() function which,
Rik says, could maybe be turned into a system call if it turns out to be
useful in that context.
The patch has been through a couple of rounds of review and may find its
way into 2.6.38.
Comments (none posted)
By Jake Edge
December 15, 2010
The kernel has two macros to assist the compiler and CPU in doing branch
prediction: likely() and unlikely(). As their names
imply, they are meant to annotate tests in the code based on the likelihood
that they will evaluate to true. But, getting it wrong such that something
marked likely is actually unlikely—or vice versa—can impact
performance as the CPU may prefetch the wrong instructions.
Steven Rostedt has been looking
at the problem using the annotated branch profiler and found ten places "that really do not need to have
an annotated branch, or the annotation is simply the opposite of
what it should be". So, he created a series of patches that either
switched the sense of the annotation or removed the
likely()/unlikely() entirely.
As an example, page_mapping() had an unlikely()
annotation that Rostedt reported as being "correct a total of 1909540379 times and
incorrect 1270533123 times, with a 39% being incorrect". Those
numbers come from his main workstation which runs a variety of standard
programs (Firefox, XChat, etc.) as well as participating in his build farm,
so it should represent a reasonably "normal" workload. Being wrong 39% of
the time was pretty obviously too much and led
to the removal of the annotation for that test.
The changes are various subsystems including the scheduler, memory
management, and VFS. So far, there have been no complaints, though there
have been several requests to completely remove annotations that had just
been changed in order to allow
the compiler's and CPU's branch prediction logic make the decision. That,
and breaking
the patches up into separate sets for each subsystem, caused Rostedt to
respin them. It would seem likely() that we'll see them make
their way into 2.6.38.
Comments (14 posted)
Kernel development news
By Jonathan Corbet
December 14, 2010
The Linux directory entry ("dentry") cache is a key part of the kernel's
filename lookup mechanism. The dentry cache speeds the process of looking
up files
considerably. On systems with large numbers of cores, though, the dentry
cache has become a scalability problem for workloads which perform a lot of
lookups. Nick Piggin's
dcache scalability
work looks like it may be set to be merged for 2.6.38, but, according
to Nick, this work "
has had nowhere near enough review." The
code is complicated and tricky, which, certainly makes review harder. Your
editor, never afraid to make a total fool of himself, will now attempt to
describe how this patch set works just in case it helps.
A dentry's core job is to represent a directory in the filesystem and to cache
the mapping between a name found within that directory and its associated
inode. To that end, dentries are organized into a hierarchy which mirrors
the on-disk hierarchy found in the filesystem. Each dentry looks vaguely
like this:
Every dentry keeps a list of children (names contained within the directory)
which can be looked up by name; each dentry also points to the inode it
represents, its parent dentry, and a number of other things. Note that the
real situation is a bit more complicated than shown here; children are kept
in hashed lists for faster lookup, an inode with more than one link may
have more than one dentry pointing to it, and so on. But, in a conceptual
sense, the above diagram shows what is going on.
When the VFS layer needs to turn a path provided by a program into a
pointer to the associated inode, it will perform this lookup through the
dentry cache. The first step is to get to the starting point - which could
be the root of the filesystem (for absolute paths), the current working
directory (for relative paths), or a program-specified directory. In the
first two cases, the associated dentries can be found by way of the
process's task structure. The first component of the path is looked up in
the starting dentry, yielding a pointer to the next dentry in the path;
that process is repeated until the end of the path is reached.
It may be that some of the dentries in the path are not present in the
cache; there is not enough room in memory to cache the entire filesystem
hierarchy. When that happens, the lookup code must ask the filesystem (by
way of the parent's inode operations) to perform the lookup; a dentry
structure can then be created with the result and inserted into the cache.
Of course, a lookup will fail if the file (or a component in the path) does
not exist; in that case, the VFS may insert a "negative dentry" into the
cache to speed up future failed lookups. That optimization is important -
just run a simple command under strace to see how common failed
lookups really are.
The dentry cache is a dynamic data structure; dentries will come and go
frequently in response to filesystem changes and the simple need to clean
out old dentries on occasion. Clearly, some sort of protocol is needed to
prevent changes from colliding with each other; if one processor removes a
dentry while another is using it to look up a name, good things will almost
certainly not result. Once upon a time, the global dcache_lock
was used to protect the cache during lookup operations. The global lock
scaled poorly, so it has not been used for lookups for some time - though
it still does protect many other things.
Current kernels use read-copy-update to
manage the removal of dentries from the cache, so a lookup operation need
not worry about a specific dentry going away as long as that operation does
not block. If a lookup has to call into the filesystem code, though,
things might indeed block; to ensure that a dentry will stay around in this
situation, a reference count is kept. So, as a lookup works its way down
the hierarchy, it will increment the reference count of every dentry it
passes through. Most of those references are eventually dropped, though
the reference on the final dentry must be kept as long as the file is held
open.
[PULL QUOTE:
Making path lookup truly scalable in a highly
parallel situation requires making no changes to the dentry structures
themselves.
END QUOTE]
The core of Nick's patch set is a new lookup algorithm called "RCU-walk."
The key to RCU-walk is the idea that, on workloads where
pathname lookup is likely to present scalability problems, the chances are
good that most dentries will already be present in the cache. One might
think that the current algorithm would perform well in such situations, but
there is a catch: the constant incrementing and decrementing of dentry
reference counts creates a great deal of cacheline bouncing - enough to
slow things considerably. Making path lookup truly scalable in a highly
parallel situation requires making no changes to the dentry structures
themselves - turning the cache into a read-only data structure during the
lookup process, essentially.
So, when the new code needs to perform a path lookup, it starts with an
rcu_read_lock() call. The dentry cache should then remain stable
enough for the lookup to get to the end of the path and increment the
reference count for the final dentry (only). So lookups can be done
without locks - and, crucially, without changing any information in the
dentries passed through on the way. That makes the lookup scalability
problems go away.
Of course, there are a few catches. The most obvious of these is the
situation where one of the dentries in the path is not resident in the
cache. At that point, the RCU-walk process must stop; the code will
attempt to obtain a reference on the final dentry it found, then return to
the older, reference-count-based method ("ref-walk") for the rest of the
lookup. Sometimes, getting that mid-point reference will not be possible;
that situation will force the lookup to restart from the beginning using
ref-walk for the entire path.
More challenging, perhaps, is what happens if one of the dentries changes
while the lookup is passing through. By normal RCU standards, that should
never happen; changing a structure requires making a copy and making the
changes there. The dentry cache bends those rules, though; RCU is mostly
used to protect against dentry deletion there. So, in particular, a rename
can cause a dentry to change attributes - and hashed lookup lists - while
another process is using it for a lookup.
Naturally, using a lock to protect against this possibility would wipe out
any scalability gains that would otherwise come from all of this work. So
the RCU-walk code uses a seqlock instead.
Seqlocks do not prevent concurrent changes, but they do allow code to
detect when such a change has occurred. Nick has added a seqlock to every
dentry which must be incremented whenever the associated name, parent, or inode
changes. If the sequence number changes while a lookup is using a dentry, the
lookup will be restarted (with the seqlock write-locked, to prevent heavy
renaming from starving other operations indefinitely).
For more information on the rename problem and
how it's handled, see path-lookup.txt,
which is included in the patch set.
The use of RCU has one other implication which forces a significant
change. Directory permissions must be checked as part of every step in a
lookup operation to ensure that users don't access parts of the filesystem
which should not be available to them. Permission checking is done by the
filesystem, by way of the permission() inode operation. If this
checking is to happen safely during the RCU-walk process, the inode
structure must not go away while the lookup is in progress. So, in other
words, inodes must now be freed with RCU rather than being released
directly. The change is relatively straightforward, but it requires
tweaking every filesystem implementation in the tree, so the patch is
large.
A number of other filesystem callbacks (d_compare() and
d_hash(), for example) have also had to be changed to support
RCU-walk. Anybody maintaining an out-of-tree filesystem will have some
work to do if and when this patch set is merged.
While RCU-walk is perhaps the most significant change in this patch series,
it's far from the only one. Many of the other patches are aimed at the
elimination of the global dcache_lock altogether. There is a new
dcache_hash_lock to protect hashing operations,
dcache_lru_lock for modifications to the dentry LRU list, and
dcache_inode_lock to protect inode dentry lists. The scope of the
dentry d_lock spinlock has been expanded to cover changes to much
of the structure; the reference count (formerly an atomic_t) is
also covered by the lock now. All told, it's a large set of patches making
fundamental changes to some of the core VFS code.
Interestingly, Nick also changed the dentry d_validate() callback,
which, he says, "has been broken for a long time." That
function depended on a truly scary routine called
kmem_ptr_validate(), described this way:
This verifies that the untrusted pointer looks sane; it is _not_ a
guarantee that the pointer is actually part of the slab cache in
question, but it at least validates that the pointer can be
dereferenced and looks half-way sane.
It is hard to imagine that such a function could be put to any sort of safe
use. The only user in current kernels is d_validate(); Nick's
patch fixes that usage and gets rid of the function. It seems unlikely to
be missed.
Given the complexity of this patch set, it is not surprising that reviews
have been relatively scarce. Review time for VFS-related patches has
always been hard to come by, and these are trickier than most. The ongoing
name-calling match between Nick and Dave Chinner, who has also been working
in this area, has not helped; neither has the fact that Al Viro appears to
have gone into one of his quiet periods. Given that Linus seems fairly
well determined to merge this work, it would be good if at least some of
the inevitable glitches could be found before the 2.6.38 merge window.
Hopefully enough developers will find the time to review and/or test these
patches to make some progress in that direction.
Comments (8 posted)
By Jonathan Corbet
December 13, 2010
Virtualization places some interesting demands on the host system, many of
which are related to memory management. When two agents within the system
both believe that they are in charge of memory, interesting conflicts are
bound to arise. A recent patch from Balbir Singh shows the kind of effort
which is being made to address these conflicts, but it also gives a hint at
how a more ambitious effort might really solve the problem.
The Linux page cache keeps copies of pages from on-disk files in main
memory in the hopes of avoiding I/O operations when those pages are
accessed. At any given time, the page cache can easily account for more
than half of the system's total memory usage. The actual size of the page
cache varies over time; as other types of memory use (kernel memory and
anonymous pages) grow, the page cache shrinks to make room. Balancing the
demands of the page cache with other memory users can be a challenge, but
Linux seems to get it close to right most of the time.
Balbir's patch is intended to give the
system administrator a bit more control over page cache usage; to that end,
it provides a new boot-time parameter (unmapped_page_control)
which sets an upper bound on the number of unmapped pages in the cache.
"Unmapped" pages are those which are not mapped into any process's address
space - they do not appear in any page tables. Unmapped pages arguably
have a lower chance of being needed in the near future; they are also a bit
easier for the system to get rid of. This patch thus gives the system
administrator a relatively easy way to minimize page cache memory usage.
The obvious question is: why? Page cache pages will be reclaimed anyway if
the system has other needs for the memory, so there would seem to be little
point in shrinking it prematurely. The problem, it seems, is
virtualization. When a process on a virtualized system reads a page from a
file, the guest operating system will dutifully store a copy of that page
in its page cache. The actual read operation, though, will be passed
through to (and executed by) the host, which will also store a copy in its
page cache. So the page gets cached twice - perhaps even more times if it
is used by multiple virtual machines. Caching a page can be a good thing,
but caching multiple copies is likely to be too much of a good thing.
So what Balbir's patch is doing can be explained this way: it is forcibly
cleaning copies of pages out of guest page caches to minimize duplicate
copies. The memory freed in this way can be captured by a balloon driver
and returned to the host, making it available for more productive use
elsewhere in the system.
This technique should clearly improve the situation. Less duplication is
good, and, if the guest ends up needing some of the freed pages, those
pages stand a good chance of being found in the host's page cache. But one
can't help but wonder if it might not be an overly indirect approach.
Rather than forcibly reclaim pages from the guests' caches, might it be
better to have all of the systems share the same page cache? A single,
unified page cache could be managed to maximize the performance of the
system as a whole; that should yield better results than managing a number
of seemingly independent page caches.
Virtualization based on containers has exactly this type of unified page
cache since all of the containers are running on the same kernel. That may
be one of the reasons why containers are seen to perform better than fully
virtualized systems. Bringing a shared page cache to the virtualized world
would be a bit of a challenge, though, which probably has a lot to do with
why it has not already been done.
To begin with, there would be some clear security issues. A virtualized
system should be unable to access any resources which have not been
explicitly provided to it. Any sort of shared page cache would have to be
designed in a way which would leave the host in control of which pages are
visible to each guest. In practice, that would probably mean using the
virtualized block drivers which make filesystems available to virtualized
guests now. Rather than "read" a page into a page controlled by the guest,
the driver could somehow just map the host's copy of the page into the
guest's address space.
Making that work correctly would require the addition of a significant new,
Linux-only API between the host and the guest. It would be hard to do it
in a way which maintained any sort of illusion that the guest is running on
hardware of its own. Such a scheme would complicate memory management in
the guest - hardware is increasingly dynamic, but individual pages of
memory still do not come and go spontaneously. A shared page cache would
also frustrate attempts to use huge pages for guest memory.
In other words, the difficulties of sharing the page cache between hosts
and guests look to be decidedly nontrivial. It is not surprising that we
are still living in a world where scarce memory pages can be soaked up with
duplicate copies of data. As long as that situation holds, there will be a
place for patches which cause guests to behave in ways which are more
friendly to the system as a whole.
Comments (12 posted)
By Jonathan Corbet
December 15, 2010
It has been well over five years since LWN
reviewed the second edition of Robert Love's
Linux Kernel Development. Needless to say, things have changed a
little since the 2.6.10 release covered by that edition. As it happens,
the third edition has been out for a few months now; your editor has
finally had a chance to read through his copy and put together a review.
In summary, the third edition is a much-needed update, and
Linux Kernel
Development remains a valuable resource, but there are some
disappointments to be found as well.
One has to dig a little bit to figure out which kernel version is covered
by the third edition; according to the preface, the target is 2.6.34.
Robert, ever the optimist, suggests that it will be good for a long time:
"As the Linux kernel matures, there is a greater chance of a snapshot
of the kernel remaining representative long into the future." Time
will tell.
The third edition has been extensively updated, but it retains the same
structure as its predecessors. The preface talks of "all-new" chapters,
but the number of chapters remains the same. The scheduler discussion has
been updated to reflect the merging of the completely fair scheduler. Other
relatively recent kernel changes (mutexes, for example) have been added,
and there are changes throughout to reflect what has happened over the last
24 kernel releases.
There is a new chapter on kernel data structures; it contains
the linked list discussion previously found in Appendix A, along with
coverage of FIFOs, red-black trees, and the idr subsystem. The low-level
device model chapter has been combined with the chapter on loadable
modules, for some reason, but the discussion is not much changed. The
appendix on the random number generator is gone.
All told, the coverage of the core kernel is well written and clear, in an
approachable style. Linux Kernel Development is, at this time,
probably the best reference available for developers wanting to learn how
the kernel works and how the major pieces fit together. Your editor is
glad to have a copy on hand.
With that understanding in place, this, too, must be said: the update to
the third edition appears to have been done in a hurry. As a result, the
book contains a number of errors and inconsistencies, and it fails to cover
no end of interesting things which have happened in the kernel over the
last five years while retaining text which was obsolete even in previous
editions. Robert has not been hugely present in the kernel development
community in the last few years (he got a job at a company with a
reputation for removing developers from the community) and, unfortunately,
it shows.
For example, this book (which covers 2.6.34) includes a section on the
anticipatory I/O scheduler - which was removed in 2.6.33. There is still
talk of the "Linus" scheduler - which (as is noted in the book) was a 2.4
feature. The mutual exclusion chapter discusses semaphores - which have
been deprecated for mutual exclusion purposes - with the section on mutexes
seemingly bolted on afterward.
(The book also says elsewhere that we cannot kill a process in an
uninterruptible sleep because it "may hold a semaphore")
There is a lengthy discussion of the old
"bottom half" mechanism which is long gone; the removed-in-2.5 task queue
mechanism also merits a section of its own. The unlamented
ksymoops tool, we are told, "probably came with your
distribution."
Some things are simply wrong. We're told that do_exit() calls
del_timer_sync(), but that last happened in 2.6.15. The workqueue
discussion appears to be stuck in a 2.6.10 time warp; changes which were
merged for 2.6.20 are not reflected here. Kernel stack size is said to be
16K on 64-bit architectures because they usually have 8K pages. The
version of struct file shown on page 279 never existed; it
includes f_ep_lock which was renamed (by your editor) to
f_lock in 2.6.30. The "process address space" chapter says,
several times, that all Linux systems have three-level page tables, despite
the fact that the fourth level was added for 2.6.11. The device model
chapter no longer mentions struct subsystem, but it still appears
in the associated diagram.
And many things which should be discussed in a contemporary book aren't.
Developers working on the kernel now should probably be familiar with
control groups, contemporary debugging tools (kmemleak, lockdep, the fault
injection framework, ...), debugfs, ftrace, git, high-resolution timers, huge
pages, linux-next, multiple slab allocators, namespaces, perf, power
management and quality of service, read-copy-update, readahead, reverse
mapping in the VM subsystem, scheduler domains, splice(),
TASK_KILLABLE, threaded interrupt handlers, virtualization, and so on.
No book on the kernel can hope to be complete and still be something that a
person of ordinary strength can physically lift, but these topics are all
important enough that, one would assume, at least some of them would merit
coverage, but none are mentioned in the third edition.
Keeping up with all that is happening in the kernel is hard - your editor
hopes that readers will trust him to understand this. So it is not
surprising that some mistakes would slip through when a book is
fast-forwarded from 2.6.10 to 2.6.34. But the amount of old stuff that
leaked through, combined with the things which should have been mentioned
but weren't, seems a bit high; some of them should, at least, have been
caught in technical review. As a result, the third edition of Linux
Kernel Development is not as good as it could have been.
These grumbles notwithstanding, your editor will restate what was said
above: this book remains
the best overview of contemporary kernel development available today.
Robert is a talented and engaging writer who is able to cover complex
topics in a readily understandable manner. The third edition merits a
place on the "L1 bookshelf" (the one which can be reached without getting
out of the chair); developers who are working with the kernel will
probably want a copy.
Comments (24 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>