JLS: Increasing VFS scalability
That said, one of the key constraints on scalability work is that it must not worsen performance on current systems. So Nick is taking care that his VFS work will improve scalability with no impact on single-threaded performance. Beyond that, he is aiming to improve scalability within a single filesystem - forcing system administrators to split their filesystems to get better performance would be cheating. To get there, he has identified five specific bottlenecks which must be addressed.
The first of those is files_lock; it is, he says, the easiest to fix. This global lock protects a per-superblock list of open files; it is needed by the file open and close paths. As the number of threads grows, this lock limits the scalability of filesystem-oriented workloads. The lock itself is only part of the problem; the real issue is that a single list_head is never going to be scalable in multiprocessor situations. In this case, it turns out that the kernel almost never needs to read the full list of open files; that only happens at unmount time. So turning the single list into a per-CPU list is a viable option; it eliminates the locking altogether and makes the management of the list scalable. The only tricky part is when files are removed; that requires cross-CPU access to the list.
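As a rough illustration of the idea (a sketch, not Nick's actual patch; the structure and function names are invented), each CPU can maintain its own list so that open() and close() running on different CPUs never touch the same cache line. One way to handle the tricky cross-CPU removal case is a small per-CPU lock, which keeps the common path CPU-local even if it is not literally lock-free:

```c
/*
 * Hypothetical per-CPU open-files list.  Per-CPU initialization
 * (spin_lock_init()/INIT_LIST_HEAD() for each CPU) is omitted.
 */
#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/spinlock.h>

struct pcpu_file_list {
	spinlock_t	 lock;	/* protects this CPU's list only */
	struct list_head head;
};

static DEFINE_PER_CPU(struct pcpu_file_list, open_files);

/* Fast path: add to the local CPU's list; no global lock is taken. */
static void pcpu_file_add(struct list_head *entry, int *home_cpu)
{
	struct pcpu_file_list *fl = get_cpu_ptr(&open_files);

	spin_lock(&fl->lock);
	list_add(entry, &fl->head);
	*home_cpu = smp_processor_id();	/* remember which CPU's list we joined */
	spin_unlock(&fl->lock);
	put_cpu_ptr(&open_files);
}

/*
 * The tricky case: removal may run on a different CPU, so it must
 * reach over and take the owning CPU's lock.
 */
static void pcpu_file_del(struct list_head *entry, int home_cpu)
{
	struct pcpu_file_list *fl = per_cpu_ptr(&open_files, home_cpu);

	spin_lock(&fl->lock);
	list_del_init(entry);
	spin_unlock(&fl->lock);
}
```

Enumerating every open file at unmount time then means walking all of the per-CPU lists, but that is exactly the rare case where some extra cost is acceptable.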
Next on the list is vfsmount_lock, which is used when finding mounts from directory entry ("dentry") structures. This lock is taken when crossing mount points in the path lookup process; it is also used at mount and unmount time. Pathname lookup is clearly a performance-critical path in the kernel, so getting rid of a global lock can only be a good thing. Nick considered using read-copy-update (RCU) for pathname lookup, but he found it to still be too slow. Part of the problem is the need to block all readers at unmount time, something that RCU cannot do on its own.
The solution is to go to per-CPU locks. Nick has introduced a variant on per-CPU locks called brlocks, or "big reader locks." These locks share the name and goal of the 2.4.x brlocks which were removed in the 2.5 development cycle, but the implementation is different. Essentially, a brlock is per-CPU for read access, but write access excludes all other users on all CPUs. Since pathname lookup is a read-only operation, brlocks will be fast where the kernel needs them to be; unmounts will be slow, but those are relatively rare operations.
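The general shape of such a lock might look like the sketch below (an assumption about the approach, not Nick's exact code): a reader takes only its own CPU's spinlock, while a writer sweeps across all of them.

```c
/*
 * Sketch of a "big reader" brlock built from per-CPU spinlocks.
 * Per-CPU spin_lock_init() at boot is omitted for brevity, as are
 * the lockdep nesting annotations a real implementation would need.
 */
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/spinlock.h>

static DEFINE_PER_CPU(spinlock_t, br_cpu_lock);

static void br_read_lock(void)
{
	preempt_disable();			/* stay on this CPU */
	spin_lock(this_cpu_ptr(&br_cpu_lock));	/* touch only local data */
}

static void br_read_unlock(void)
{
	spin_unlock(this_cpu_ptr(&br_cpu_lock));
	preempt_enable();
}

/* Writers (mount and unmount) pay the full cost: every CPU's lock. */
static void br_write_lock(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		spin_lock(per_cpu_ptr(&br_cpu_lock, cpu));
}

static void br_write_unlock(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		spin_unlock(per_cpu_ptr(&br_cpu_lock, cpu));
}
```

The read side costs a single uncontended, CPU-local lock, which is why pathname lookup stays fast; the write side scales linearly with the CPU count, which is acceptable for operations as rare as unmount.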
mnt_count is a per-filesystem reference count, incremented for each open and decremented for each close. Like the global list described above, this global counter limits the scalability of opens and closes. Once again, going per-CPU is the obvious solution here, with the minor problem that a put() operation must check whether the (global) count is zero. But, as it happens, that case only comes about when the filesystem is not actually mounted, so this check need not be performed most of the time.
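A per-CPU reference count along these lines might look like the following sketch (the names and the slow-path trigger are illustrative, not taken from the patch); the expensive cross-CPU sum is only computed when the count could plausibly have reached zero.

```c
/* Hypothetical per-CPU mount reference count. */
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(int, mnt_count_pcpu);

/*
 * Slow path: sum every CPU's contribution.  A real implementation must
 * serialize this against concurrent get/put operations, for example
 * under the write side of the brlock described above.
 */
static int mnt_total_count(void)
{
	int cpu, sum = 0;

	for_each_possible_cpu(cpu)
		sum += per_cpu(mnt_count_pcpu, cpu);
	return sum;
}

static void mnt_get(void)
{
	this_cpu_inc(mnt_count_pcpu);	/* fast: no shared cache line */
}

/*
 * Returns true if this was the last reference.  The cross-CPU sum is
 * skipped entirely unless the mount might actually be going away.
 */
static bool mnt_put(bool might_be_unmounted)
{
	this_cpu_dec(mnt_count_pcpu);
	return might_be_unmounted && mnt_total_count() == 0;
}
```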
The hardest one to fix is dcache_lock. Most VFS operations need it, with the sole exception of name lookup, which has used RCU for a while now. Some operations - LRU scanning and reclaim in the dentry cache in particular - can hold the lock for a long time. And the lock covers a whole bunch of different - and sometimes unknown - things. The exporting of dcache_lock to filesystems has not helped here; individual filesystems are using it for their own, not always clear, ends. So a developer trying to bring dcache_lock under control must start by trying to figure out what it is being used to protect.
Nick has done his best to split apart the various locking cases; these include the dentry cache hash, the dentry LRU list, the inode dentry alias list, various statistics, etc. Some of this stuff is moved under the protection of the per-dentry spinlock (d_lock); other things, like the dentry hash and LRU, get new locks. There are a lot of problems still, starting with lock-ordering challenges. Nick is working around some of these using non-blocking "trylock" operations, but that kind of code tends to be hard to merge. The various locking cases are still not truly independent from each other; among other things, that imposes more ordering requirements. And walking up the directory tree (trying to determine a path name from a dentry, usually) becomes much harder in the absence of a global lock.
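One of those trylock workarounds, in rough form: code holding a dentry's d_lock sometimes also needs the parent's lock, but the ordering rule (assumed here) is parent before child. The helper below tries the parent's lock opportunistically and backs off on failure; it is illustrative only. The d_parent and d_lock fields are real, but reference counting, RCU protection for the parent, and lockdep annotations are all omitted.

```c
#include <linux/dcache.h>
#include <linux/spinlock.h>

/* Called with dentry->d_lock held; returns with the parent's d_lock
 * also held (or NULL for the root, which is its own parent). */
static struct dentry *lock_parent_sketch(struct dentry *dentry)
{
	struct dentry *parent;

again:
	parent = dentry->d_parent;
	if (parent == dentry)			/* root: nothing to lock */
		return NULL;
	if (spin_trylock(&parent->d_lock))	/* fast path: no contention */
		return parent;

	/* Wrong order: back off, take the locks parent-first, recheck. */
	spin_unlock(&dentry->d_lock);
	spin_lock(&parent->d_lock);
	spin_lock(&dentry->d_lock);
	if (parent != dentry->d_parent) {	/* raced with a rename */
		spin_unlock(&parent->d_lock);
		goto again;
	}
	return parent;
}
```

The retry loop is exactly the sort of code that reviewers find hard to reason about, which is why Nick expects this part to be hard to merge.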
In summary, cleaning up dcache_lock looks like a long and messy project. It is, however, the lock that shows up as the worst bottleneck in some situations, so the work needs to be done.
Finally, there is the matter of inode_lock, which is needed by most inode operations (lookup, creation, destruction, writeback, sync, etc). As with dcache_lock, Nick has split the locking into a number of independent classes - the inode itself, the inode hash, the LRU list, and so on. Some of these classes are moved under the per-inode lock, while specific locks have been added for some cases. The per-superblock inode list has been made into a per-CPU variable, as have the counters used to generate statistics. Nick has also made the allocation of inode numbers into a per-CPU operation by assigning a range of numbers to each processor. This means that inode numbers are no longer allocated sequentially; it's not clear whether that will be a problem or not.
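The per-CPU inode number scheme can be pictured roughly as follows (the batch size and names are invented for the example): each CPU grabs a block of numbers from a shared counter, then hands them out locally until the block is exhausted.

```c
/* Hypothetical batched, per-CPU inode number allocator. */
#include <linux/atomic.h>
#include <linux/percpu.h>

#define INO_BATCH 1024			/* numbers claimed per refill */

static atomic_t next_batch = ATOMIC_INIT(0);

struct ino_cache {
	unsigned int next;		/* next number to hand out */
	unsigned int left;		/* numbers remaining in this batch */
};
static DEFINE_PER_CPU(struct ino_cache, ino_cache);

static unsigned int new_inode_number(void)
{
	struct ino_cache *c = get_cpu_ptr(&ino_cache);
	unsigned int ino;

	if (c->left == 0) {
		/* Rare slow path: claim a fresh range from the shared counter. */
		c->next = atomic_add_return(INO_BATCH, &next_batch) - INO_BATCH;
		c->left = INO_BATCH;
	}
	ino = c->next++;
	c->left--;
	put_cpu_ptr(&ino_cache);
	return ino;
}
```

Within a batch the numbers are still sequential, but numbers handed out by different CPUs interleave; that is where the loss of globally sequential allocation comes from.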
So what comes of all this work? Nick claims "great" open/close scalability, and "good" create/unlink scalability. He showed the results of running a microbenchmark which just did close(open(path)) repeatedly; with current mainline, he was able to get 450 operations/second on each of 64 CPUs. With the scalability patches added, that rate went up to over 300,000 operations/second - a significant improvement. Running unlink(creat(path)) shows better scalability even with two CPUs - but it does, for some reason, impose a cost on single-threaded workloads on the IA-64 architecture.
The VFS scalability work is clearly worth doing; we'll all be glad someday that these problems have been ironed out. But there are still some messy things to clean up, so this patch set (or the gnarlier parts of it, anyway) may take a while on its way into the mainline.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Virtual filesystem layer |
| Kernel | Scalability |