
Resolving the inode scalability discussion

By Jonathan Corbet
October 26, 2010
Nick Piggin's VFS scalability patch set has been under development for well over a year. Linus was ready to pull this work during the 2.6.36 merge window, but Nick asked for more time for things to settle out; as a result, only some of the simpler parts were merged then. Last week, we mentioned that some developers grew concerned as it became clear that the remaining work would not be ready for 2.6.37 either. Out of that concern came a competing version of the patch set (by Dave Chinner) and a big fight. This discussion was of the relatively deep and intimidating variety, but your editor, never afraid to make a total fool of himself, will attempt to clarify the core disagreements and a possible path forward anyway.

The global inode_lock is used within the virtual filesystem layer (VFS) to protect several data structures and a wide variety of inode-oriented operations. As a global lock, it has become an increasingly annoying bottleneck as the number of CPUs and threads in systems increases; it clearly needs to be broken up in a way which makes it more scalable. Unfortunately, like a number of old locks in the VFS, the boundaries of what's protected by inode_lock are not always entirely clear, so any attempts to change locking in that area must be done with a great deal of caution. That is why improving inode locking scalability has been such a slow affair.
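
By way of illustration - a userspace sketch using POSIX threads rather than kernel code, with all names invented - the pattern at issue looks like this: one global lock serializing access to several independent structures, so that threads working on entirely unrelated inodes still contend with each other.

    #include <pthread.h>

    struct inode {
        unsigned long i_ino;
        struct inode *next;
    };

    static pthread_mutex_t global_inode_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct inode *writeback_list;  /* dirty inodes */
    static struct inode *lru_list;        /* unused, cached inodes */
    static long nr_unused;                /* cache statistics */

    /* Adding an inode to the LRU must take the global lock, even
     * though the writeback list is untouched; so must every lookup,
     * statistics update, and writeback operation in the system. */
    void lru_add(struct inode *inode)
    {
        pthread_mutex_lock(&global_inode_lock);
        inode->next = lru_list;
        lru_list = inode;
        nr_unused++;
        pthread_mutex_unlock(&global_inode_lock);
    }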

Getting rid of inode_lock requires putting some other locking in place for everything that inode_lock protects. Nick's patch set creates separate global locks for some of those resources: wb_inode_list_lock for the list of inodes under writeback, and inode_lru_lock for the list of inodes in the cache. The standalone inodes_stat statistics structure is converted over to atomic types. Then the existing i_lock per-inode spinlock is used to cover everything else in the inode structure; once that is done, inode_lock can be removed. The remainder of the patch set (more than half of the total) is then dedicated to reducing the coverage of i_lock, often by using read-copy-update (RCU) instead.
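
Sketched in the same userspace terms (the lock names follow Nick's patches; everything else is invented for illustration), the split might look like this:

    #include <pthread.h>
    #include <stdatomic.h>

    struct inode {
        pthread_mutex_t i_lock;   /* covers this inode's own fields */
        unsigned long i_state;
        struct inode *next;
    };

    /* One lock per list, instead of one lock for everything. */
    static pthread_mutex_t wb_inode_list_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t inode_lru_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct inode *writeback_list;
    static struct inode *lru_list;
    static atomic_long nr_unused;  /* statistics become atomic */

    void lru_add(struct inode *inode)
    {
        pthread_mutex_lock(&inode_lru_lock);  /* only LRU users contend */
        inode->next = lru_list;
        lru_list = inode;
        pthread_mutex_unlock(&inode_lru_lock);
        atomic_fetch_add(&nr_unused, 1);      /* no lock needed at all */
    }

Writeback operations now take wb_inode_list_lock instead, and never contend with cache pruning or statistics updates.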

Before any of that, though, Nick's patch set changed the way the core memory management "shrinker" code works. Shrinkers are callbacks which can be invoked by the core when memory is tight; their job is to reduce the amount of memory used by a specific data structure. The inode and dentry caches can take up quite a bit of memory, so they both have shrinkers which will free up (hopefully) unneeded cache entries when the memory is needed elsewhere. Nick changed the shrinker API so that shrinkers can be targeted at specific memory zones; that allows the core to balance free memory across memory types and across NUMA nodes.
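
The idea, reduced to a sketch (the kernel's real struct shrinker and callback signature differ from this, and the zone parameter here is a simplification of Nick's change):

    /* The memory manager calls back into a cache to ask it to free
     * entries; with per-zone shrinkers, the callback is told which
     * zone needs memory rather than being asked to shrink globally. */
    struct shrinker {
        long (*shrink)(struct shrinker *s, int zone, long nr_to_scan);
    };

    static long shrink_inode_cache(struct shrinker *s, int zone,
                                   long nr_to_scan)
    {
        /* Walk this zone's inode LRU, freeing up to nr_to_scan unused
         * inodes, then report how many freeable objects remain. */
        return 0;  /* empty sketch: nothing left to free */
    }

    static struct shrinker inode_shrinker = {
        .shrink = shrink_inode_cache,
    };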

The per-zone shrinkers were one of the early flash points in this debate. Dave Chinner and others on the VFS side of the house worried that invoking shrinkers in such a fine-grained way would increase contention at the filesystem level and make it harder to shrink the caches in an efficient way. They also thought that this change was orthogonal to the core goal of eliminating the scalability problems caused by the global inode_lock. Nick fought hard for per-zone shrinkers, and he clearly believes that they are necessary, but he has also dropped them from his patch set for now in an attempt to push things forward.

The next disagreement has to do with the coverage of i_lock; Dave Chinner's alternative patch set avoids using i_lock to cover most of the inode structure. Instead, Dave introduces other locks from the outset, reaching a point where he has relatively fine-grained lock coverage by the time inode_lock is removed at the end of his series. Compared to this approach, Nick's patches have been criticized as being messy and not as scalable.

Nick's response is that the "width" of i_lock is a detail which can be resolved later. His intent was to do the minimal amount of work required to allow the removal of inode_lock, without going straight for the ultimate scalable solution. The goal was to ensure that the locking remains correct by changing as little as possible before the removal of the global lock; that way, hopefully, there are fewer chances of breaking things. Beyond that, any bugs introduced by the patches preceding the removal of inode_lock will almost certainly not reveal themselves until after that removal; that means anybody trying to use bisection to find such a bug will end up at the inode_lock removal patch instead of the real culprit. Thus, minimizing the number of changes before that removal should make debugging easier.

That is why Nick removes inode_lock before the middle of his patch series, while Dave's series does that removal near the end. Both patch sets include a number of the same changes - putting per-bucket locks onto the inode hash table, for example - but Nick does it after removing inode_lock, while Dave does it before. There are also differences, with Nick heading deep into RCU territory while Dave avoids using RCU. Both developers claim to be aiming for similar end results; they just take different roads to get there.
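
The per-bucket hash locking that both series share can also be sketched in userspace (again, all names are illustrative): each hash chain carries its own spinlock, so lookups of unrelated inodes no longer serialize on a single lock.

    #include <pthread.h>

    #define HASH_BUCKETS 1024

    struct inode {
        unsigned long i_ino;
        struct inode *i_hash_next;
    };

    static struct hash_bucket {
        pthread_spinlock_t lock;
        struct inode *chain;
    } inode_hash[HASH_BUCKETS];

    void inode_hash_init(void)
    {
        for (int i = 0; i < HASH_BUCKETS; i++)
            pthread_spin_init(&inode_hash[i].lock,
                              PTHREAD_PROCESS_PRIVATE);
    }

    struct inode *hash_lookup(unsigned long ino)
    {
        struct hash_bucket *b = &inode_hash[ino % HASH_BUCKETS];
        struct inode *inode;

        pthread_spin_lock(&b->lock);  /* serializes one chain only */
        for (inode = b->chain; inode; inode = inode->i_hash_next)
            if (inode->i_ino == ino)
                break;
        pthread_spin_unlock(&b->lock);
        return inode;  /* real code would take a reference first */
    }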

Finally, there is also a deep disagreement over the locking of the inode cache itself. In current kernels, the cache data structures (the LRU and writeback lists, essentially) are covered by inode_lock along with everything else. Both patch sets create separate locks for the LRU and for writeback. The problem is with lock ordering; one of the hardest problems in the VFS is ensuring that all locks are taken in the proper order so that the system will not deadlock. Nick's patches require the VFS to acquire i_lock for the inode(s) of interest prior to acquiring the writeback or LRU locks; Dave, instead, wants i_lock to be the innermost lock.
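
Why the ordering matters can be seen in a small sketch (userspace again, illustrative only): if the two paths below run concurrently on the same inode, each can end up holding the lock the other needs.

    #include <pthread.h>

    struct inode { pthread_mutex_t i_lock; };
    static pthread_mutex_t inode_lru_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Path 1 follows Nick's documented order: i_lock first, then the
     * list lock. */
    void remove_from_lru(struct inode *inode)
    {
        pthread_mutex_lock(&inode->i_lock);
        pthread_mutex_lock(&inode_lru_lock);
        /* ... unlink the inode from the LRU ... */
        pthread_mutex_unlock(&inode_lru_lock);
        pthread_mutex_unlock(&inode->i_lock);
    }

    /* Path 2 starts from the list, so it naturally wants the locks in
     * the opposite order; run against remove_from_lru(), this is a
     * classic AB-BA deadlock. */
    void prune_one(struct inode *victim)
    {
        pthread_mutex_lock(&inode_lru_lock);
        pthread_mutex_lock(&victim->i_lock);   /* deadlock risk */
        /* ... */
        pthread_mutex_unlock(&victim->i_lock);
        pthread_mutex_unlock(&inode_lru_lock);
    }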

The problem is that it is not always possible to acquire the locks in the specified order. Code which is working through the LRU list, for example, must have that list locked; if it then decides to operate on an inode found in the LRU list, it must lock the inode. But that violates Nick's locking order. To make things work correctly, Nick uses spin_trylock() in such situations to avoid hanging. Uses of spin_trylock() tend to attract scrutiny, and that is the case here; Dave has described the code as "a large mess of trylock operations" which he has gone out of his way to avoid. Nick responds that the code is not that bad, and that Dave's approach brings locking complexities of its own.
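
In the same sketch form (noting that the kernel's spin_trylock() returns nonzero on success, while pthread_mutex_trylock() returns zero), the trylock approach looks like this:

    #include <pthread.h>

    struct inode {
        pthread_mutex_t i_lock;
        struct inode *next;
    };
    static pthread_mutex_t inode_lru_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct inode *lru_list;

    /* Walk the LRU with its lock held, taking each inode's i_lock
     * with a trylock; busy inodes are simply skipped rather than
     * waited for, so the "wrong" acquisition order cannot deadlock. */
    void prune_lru(void)
    {
        struct inode *inode;

        pthread_mutex_lock(&inode_lru_lock);
        for (inode = lru_list; inode; inode = inode->next) {
            if (pthread_mutex_trylock(&inode->i_lock))
                continue;  /* lock busy: revisit on a later pass */
            /* ... dispose of the inode if it is still unused ... */
            pthread_mutex_unlock(&inode->i_lock);
        }
        pthread_mutex_unlock(&inode_lru_lock);
    }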

This is about where Al Viro jumped in, calling both approaches wrong. Al would like to see the writeback locks taken prior to i_lock (because code tends to work from the list first, prior to attacking individual inodes), but he says the LRU lock should be taken after i_lock because code changing the LRU status of an inode will normally already have that inode's lock. According to Al, Nick is overly concerned with the management of the various inode lists and, as a result, "overengineering" the code. After some discussion, Dave eventually agreed with something close to Al's view and acknowledged that Nick's placement of the LRU lock below i_lock was correct, eliminating that point of contention.
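
Al's proposed nesting, sketched one last time (function names invented for illustration):

    #include <pthread.h>

    struct inode {
        pthread_mutex_t i_lock;
        struct inode *next;
    };
    static pthread_mutex_t writeback_list_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t inode_lru_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Writeback works from the list first, so its lock nests outside
     * i_lock and no trylock games are needed. */
    void writeback_walk(struct inode *list)
    {
        pthread_mutex_lock(&writeback_list_lock);
        for (struct inode *i = list; i; i = i->next) {
            pthread_mutex_lock(&i->i_lock);    /* list lock -> i_lock */
            /* ... queue the inode for writeback ... */
            pthread_mutex_unlock(&i->i_lock);
        }
        pthread_mutex_unlock(&writeback_list_lock);
    }

    /* Code changing an inode's LRU status normally holds that inode's
     * lock already, so the LRU lock nests inside i_lock. */
    void set_lru_status(struct inode *inode)
    {
        pthread_mutex_lock(&inode->i_lock);
        pthread_mutex_lock(&inode_lru_lock);   /* i_lock -> LRU lock */
        /* ... move the inode on or off the LRU ... */
        pthread_mutex_unlock(&inode_lru_lock);
        pthread_mutex_unlock(&inode->i_lock);
    }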

Al has also described the way he would like things to proceed, and that is a good thing: when it comes to VFS locking, few are willing to challenge his point of view, so he can probably bring about a resolution to this particular dispute. He wants a patch series which starts with the split of the writeback and LRU lists, then proceeds by pulling things out from under inode_lock one at a time. He is apparently pulling together a tree based on both Nick's and Dave's work, but with things done in the order he likes. The end result will probably be credited to Nick, who figured out how to solve a long list of difficult problems around inode_lock, but it will differ significantly from what he initially proposed.

What is not at all clear, though, is how much of this will come together for the 2.6.37 merge window. Al has a long history of last-second pull requests full of hairy changes; Linus tends to let him get away with it. But this would be very last minute, and the changes are deep, so, while Al has pushed some of the initial changes, the core locking work may not be ready in time for 2.6.37. Either way, once inode scalability has been taken care of, discussion can begin on the removal of dcache_lock, which is a rather more complex problem than inode_lock; that should be interesting to watch.


