By Jonathan Corbet
October 26, 2010
Nick Piggin's
VFS scalability
patch set has been under development for well over a year. Linus was
ready to pull this work during the 2.6.36 merge window, but Nick asked for
more time for things to settle out; as a result, only some of the simpler
parts were merged then. Last week, we
mentioned that some developers
grew concerned when it started to become clear that the remaining work
would not be ready for 2.6.37 either. Out of that concern came a competing
version of the patch set (by Dave Chinner) and a big fight. This
discussion was of the relatively deep and intimidating variety, but your
editor, never afraid to make a total fool of himself, will attempt to
clarify the core disagreements and a possible path forward anyway.
The global inode_lock is used within the virtual filesystem layer
(VFS) to protect several data structures and a wide variety of
inode-oriented operations. As a global lock,
it has become an increasingly annoying bottleneck as the number of CPUs and
threads in systems increases; it clearly needs to be broken up in a way
which makes it more scalable. Unfortunately, like a number of old locks in
the VFS, the boundaries of what's protected by inode_lock are not
always entirely clear, so any attempts to change locking in that area must
be done with a great deal of caution. That is why improving inode locking
scalability has been such a slow affair.
Getting rid of inode_lock requires putting some other locking in
place for everything that inode_lock protects. Nick's patch set
creates separate global locks for some of those resources:
wb_inode_list_lock for the list of inodes under writeback, and
inode_lru_lock for the list of inodes in the cache. The standalone
inodes_stat statistics structure is converted over to atomic
types. Then the existing i_lock per-inode spinlock is used to
cover everything else in the inode structure; once that is done,
inode_lock can be removed. The remainder of the patch set (more
than half of the total) is then dedicated to reducing the coverage of
i_lock, often by using read-copy-update (RCU) instead.
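As a rough sketch of that transformation (illustrative only, not code from Nick's patches; the globals are declared here just for the example, and the helper operations are stand-ins), an operation that used to run entirely under inode_lock instead takes the inode's own i_lock for per-inode state plus a dedicated lock for whichever global list it touches:

    #include <linux/fs.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    /* Illustrative stand-ins for the real globals in fs/inode.c. */
    static DEFINE_SPINLOCK(inode_lock);
    static DEFINE_SPINLOCK(inode_lru_lock);
    static LIST_HEAD(inode_lru);

    /* Before: the single global lock covers the inode's fields and the LRU. */
    static void example_touch_inode_old(struct inode *inode)
    {
            spin_lock(&inode_lock);
            /* ... update per-inode state protected by the lock ... */
            list_move(&inode->i_list, &inode_lru);  /* global LRU list */
            spin_unlock(&inode_lock);
    }

    /* After: i_lock protects the inode itself; inode_lru_lock protects the
     * list.  In Nick's scheme i_lock is taken first, then the list lock. */
    static void example_touch_inode_new(struct inode *inode)
    {
            spin_lock(&inode->i_lock);
            /* ... update per-inode state under i_lock ... */
            spin_lock(&inode_lru_lock);
            list_move(&inode->i_list, &inode_lru);
            spin_unlock(&inode_lru_lock);
            spin_unlock(&inode->i_lock);
    }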
Before any of that, though, Nick's patch set changes the way the core
memory management "shrinker" code works. Shrinkers are callbacks which can
be invoked by the core when memory is tight; their job is then to reduce
the amount of memory used by a specific data structure. The inode and
dentry caches can take up quite a bit of memory, so they both have
shrinkers which will free up (hopefully) unneeded cache entries when the
memory is needed elsewhere. Nick changes the shrinker API so that shrinkers
target specific memory zones; that allows the core to balance free memory
across memory types and across NUMA nodes.
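For a sense of the difference, here is a hedged sketch (the function names, the zone-aware signature, and the cache-specific helpers are hypothetical; neither the existing shrinker API nor Nick's proposed replacement looks exactly like this): today a shrinker is simply asked to free some number of objects, while a per-zone shrinker is also told which memory zone is under pressure, so it can prefer entries allocated from that zone and NUMA node.

    #include <linux/mmzone.h>

    /* Hypothetical cache-specific helpers, assumed to exist elsewhere. */
    extern int prune_some_entries(int nr_to_scan);
    extern int prune_entries_in_zone(struct zone *zone, int nr_to_scan);

    /* Current model: "free up to nr_to_scan entries, wherever they live." */
    static int example_shrink_cache(int nr_to_scan)
    {
            return prune_some_entries(nr_to_scan);
    }

    /* Per-zone model: the caller names the zone under pressure, so the cache
     * can evict entries backed by that zone (and NUMA node) specifically. */
    static int example_shrink_cache_zone(struct zone *zone, int nr_to_scan)
    {
            return prune_entries_in_zone(zone, nr_to_scan);
    }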
The per-zone shrinkers were one of the early flash points in this debate.
Dave Chinner and others on the VFS side of the house worried that invoking
shrinkers in such a fine-grained way would increase contention at the
filesystem level and make it
harder to shrink the caches in an efficient way. They also thought that
this change was orthogonal to the core goal of eliminating the scalability
problems caused by the global inode_lock. Nick fought hard for
per-zone shrinkers, and he clearly believes that they are necessary, but he
has also dropped them from his patch set for now in an attempt to push
things forward.
The next disagreement has to do with the coverage of i_lock; Dave
Chinner's alternative patch set avoids using i_lock to cover most
of the inode structure. Instead, Dave introduces other locks from
the outset, reaching a point where he has relatively fine-grained lock
coverage by the time inode_lock is removed at the end of his
series. Compared to this approach, Nick's patches have been criticized as
being messy and not as scalable.
Nick's response is that the "width" of i_lock is a detail which
can be resolved later. His
intent was to do the minimal amount of work required to allow the removal
of inode_lock, without going straight for the ultimate scalable
solution. The goal was to be able to ensure that the locking remains
correct by changing as little as possible before the removal of the global
lock; that way, hopefully, there are fewer chances of breaking things.
Beyond that, any bugs which do slip through before the patch removing
inode_lock will almost certainly not reveal themselves until after
that removal. That means that anybody trying to use bisection to find a
bug will end up at the inode_lock removal patch instead of the
real culprit. Thus, minimizing the number of changes before that removal
should make debugging easier.
That is why Nick removes inode_lock before the middle of his patch
series, while Dave's series does that removal near the end. Both patch
sets include a number of the same changes - putting per-bucket locks onto
the inode hash table, for example - but Nick does it after removing
inode_lock, while Dave does it before. There are also
differences, with Nick heading deep into RCU territory while Dave avoids
using RCU. Both developers claim to be aiming for similar end results;
they just take different roads to get there.
Finally, there is also a deep disagreement over the locking of the inode
cache itself. In current kernels, the cache data structure (the LRU and
writeback lists, essentially) is covered by inode_lock with the
rest. Both patch sets create separate locks for the LRU and for
writeback. The problem is with lock ordering; one of the hardest problems
in the VFS is ensuring that all locks are taken in the proper order so that
the system will not deadlock. Nick's patches require the VFS to acquire
i_lock for the inode(s) of interest prior to acquiring the
writeback or LRU locks; Dave, instead, wants i_lock to be the
innermost lock.
The problem is that it is not always possible to acquire the locks in the
specified order. Code which is working through the LRU list, for example, must
have that list locked; if it then decides to operate on an inode found in
the LRU list, it must lock the inode. But that violates Nick's locking
order. To make things work correctly, Nick uses spin_trylock() in
such situations to avoid hanging. Uses of spin_trylock() tend to
attract scrutiny, and that is the case here; Dave has described the code as "a
large mess of trylock operations" which he has gone out of his way
to avoid. Nick responds that the code is
not that bad, and that Dave's approach brings locking complexities of its
own.
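Concretely, the pattern in dispute looks roughly like this (a sketch only, reusing the illustrative inode_lru list and lock declared above; the "clean means evictable" test is a simplification): the pruning code holds the LRU lock, but since i_lock must come first under Nick's ordering, it can only spin_trylock() each inode and skip any it cannot get.

    /* Sketch of the trylock pattern; not Nick's actual code. */
    static void example_prune_lru(int nr_to_scan)
    {
            struct inode *inode, *next;

            spin_lock(&inode_lru_lock);
            list_for_each_entry_safe(inode, next, &inode_lru, i_list) {
                    if (nr_to_scan-- <= 0)
                            break;
                    /* Taking i_lock while holding the LRU lock inverts the
                     * documented order, so only try; never block here. */
                    if (!spin_trylock(&inode->i_lock)) {
                            list_move_tail(&inode->i_list, &inode_lru);
                            continue;       /* busy; revisit it later */
                    }
                    if (!(inode->i_state & I_DIRTY))        /* clean: drop it */
                            list_del_init(&inode->i_list);
                    spin_unlock(&inode->i_lock);
            }
            spin_unlock(&inode_lru_lock);
    }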
This is about where Al Viro jumped in,
calling both approaches wrong. Al would like to see the writeback locks
taken prior to i_lock (because code tends to work from the list
first, prior to attacking individual inodes), but he says the LRU lock
should be taken after i_lock because code changing the LRU status
of an inode will normally already have that inode's lock. According to Al, Nick is overly concerned with
the management of the various inode lists and is, as a result,
"overengineering" the code. After some discussion, Dave eventually agreed with something close to Al's view and
acknowledged that Nick's placement of the LRU lock below i_lock
was correct, eliminating that point of contention.
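Read that way, the nesting that emerged from the discussion can be sketched as follows (again purely illustrative; wb_inode_list_lock and the writeback list are declared here only for the example, and the "writeback just finished" scenario is invented to show all three locks in one place): the writeback list lock is outermost, the inode's i_lock comes next, and the LRU lock nests inside it.

    /* Illustrative stand-ins, as before. */
    static DEFINE_SPINLOCK(wb_inode_list_lock);
    static LIST_HEAD(inode_wb_list);

    /* A clean-up path moving a now-clean inode from the writeback list to
     * the LRU, showing the ordering Al argued for. */
    static void example_writeback_done(struct inode *inode)
    {
            spin_lock(&wb_inode_list_lock);
            spin_lock(&inode->i_lock);
            inode->i_state &= ~I_DIRTY;
            list_del_init(&inode->i_list);          /* off the writeback list */
            spin_lock(&inode_lru_lock);
            list_add(&inode->i_list, &inode_lru);   /* onto the LRU */
            spin_unlock(&inode_lru_lock);
            spin_unlock(&inode->i_lock);
            spin_unlock(&wb_inode_list_lock);
    }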
Al has also described the way he would like things
to proceed; this is a good thing. When it comes to VFS locking, few are
willing to challenge his point of view; that means that he can probably
bring about a resolution to this particular dispute. He wants a patch
series which starts with the split of the writeback and LRU lists, then
proceeds by pulling things out from under inode_lock one at a
time. He is apparently pulling together a tree based on both Nick's and
Dave's work, but with things done in the order he likes. The end result
will probably be credited to Nick, who figured out how to solve a long list
of difficult problems around inode_lock, but it will differ
significantly from what he initially proposed.
What is not at all clear, though, is how much of this will come together
for the 2.6.37 merge window. Al has a long history of last-second pull
requests full of hairy changes; Linus tends to let him get away with it.
But this would be very last minute, and the changes are deep, so, while Al
has pushed some of the initial changes, the core locking work may not be
ready in time for 2.6.37. Either way, once inode scalability has been
taken care of, discussion can begin
on the removal of dcache_lock, which is a rather more complex
problem than inode_lock; that should be interesting to watch.