VFS scalability patches in 2.6.36

By Jonathan Corbet
August 24, 2010

It is rare for Linus to talk about what he plans to merge in a given development cycle before the merge window opens; it seems that he prefers to see what the pull requests look like and make his decisions afterward. He made an exception in the 2.6.35 announcement, though:

On a slightly happier note: one thing I do hope we can merge in the upcoming merge window is Nick Piggin's cool VFS scalability series. I've been using it on my own machine, and gone through all the commits (not that I shouldn't go through some of them some more), and am personally really excited about it. It's seldom we see major performance improvements in core code that are quite that noticeable, and Nick's whole RCU pathname lookup in particular just tickles me pink.

It's a rare developer who, upon having tickled the Big Penguin to that particular shade, will hold off on merging his changes. But Nick asked that the patches sit out for one more cycle, perhaps out of the entirely rational fear of bugs which might irritate users to a rather deeper shade. So Linus will have to wait a bit for his RCU pathname lookup code. That said, some parts of the VFS scalability code did make it into the mainline for 2.6.36-rc2.

Like most latter-day scalability work, the VFS work is focused on increasing locality and eliminating situations where CPUs must share resources. Given that a filesystem is an inherently global structure, increasing locality can be a challenging task; as a result, parts of Nick's patch set are on the complex and tricky side. But, in the end, it comes down to dealing with things locally whenever possible, but making global action possible when the need arises.

The first step is the introduction of two new lock types, the first of which is called a "local/global lock" (lglock). An lglock is intended to provide very fast access to per-CPU data while making it possible (at a rather higher cost) to get at another CPU's data. An lglock is created with:

    #include <linux/lglock.h>

    DEFINE_LGLOCK(name);

The DEFINE_LGLOCK() macro is a 99-line wonder which creates the necessary data structure and accessor functions. By design, lglocks can only be defined at the file global level; they are not intended to be embedded within data structures.

Another set of macros is used for working with the lock:

    lg_lock_init(name);
    lg_local_lock(name);
    lg_local_unlock(name);
    lg_local_lock_cpu(name, int cpu);
    lg_local_unlock_cpu(name, int cpu);

Underneath it all, an lglock is really just a per-CPU array of spinlocks. So a call to lg_local_lock() will acquire the current CPU's spinlock, while lg_local_lock_cpu() will acquire the lock belonging to the specified cpu. Acquiring an lglock also disables preemption, which would not otherwise happen in realtime kernels. As long as almost all locking is local, it will be very fast; the lock will not bounce between CPUs and will not be contended. Both of those assumptions go away, of course, if the cross-CPU version is used.

Sometimes it is necessary to globally lock the lglock:

    lg_global_lock(name);
    lg_global_unlock(name);
    lg_global_lock_online(name);
    lg_global_unlock_online(name);

A call to lg_global_lock() will go through the entire array, acquiring the spinlock for every CPU. Needless to say, this will be a very expensive operation; if it happens with any frequency at all, an lglock is probably the wrong primitive to use. The _online version only acquires locks for CPUs which are currently running, while lg_global_lock() acquires locks for all possible CPUs.

The VFS scalability patch set also brings back the "big reader lock" concept. The idea behind a brlock is to make locking for read access as fast as possible, while making write locking possible. The brlock API (also defined in <linux/lglock.h>) looks like this:

    DEFINE_BRLOCK(name);

    br_lock_init(name);
    br_read_lock(name);
    br_read_unlock(name);
    br_write_lock(name);
    br_write_unlock(name);

As it happens, this version of brlocks is implemented entirely with lglocks; br_read_lock() maps directly to lg_local_lock(), and br_write_lock() turns into lg_global_lock().

The first use of lglocks is to protect the list of open files which is attached to each superblock structure. This list is currently protected by the global files_lock, which becomes a bottleneck when a lot of open() and close() calls are being made. In 2.6.36, the list of open files becomes a per-CPU array, with each CPU managing its own list. When a file is opened, a (cheap) call to lg_local_lock() suffices to protect the local list while the new file is added.

When a file is closed, things are just a bit more complicated. There is no guarantee that the file will be on the local CPU's list, so the VFS must be prepared to reach across to another CPU's list to clean things up. That, of course, is what lg_local_lock_cpu() is for. Cross-CPU locking will be more expensive than local locking, but (1) it only involves one other CPU, and (2) in situations where there is a lot of opening and closing of files, chances are that the process working with any specific file will not migrate between CPUs during the (presumably short) time that the file is open.

The real reason that the per-superblock open files list exists is to let the kernel check for writable files when a filesystem is being remounted read-only. That operation requires exclusive access to the entire list, so lg_global_lock() is used. The global lock is costly, but read-only remounts are not a common occurrence, so nobody is likely to notice.

Also for 2.6.36, Nick changed the global vfsmount_lock into a brlock. This lock protects the tree of mounted filesystems; it must be acquired (in a read-only mode) whenever a pathname lookup crosses from one mount point to the next. Write access is only needed when filesystems are mounted or unmounted - again, an uncommon occurrence on most systems. Nick warns that this change is unlikely to speed up most workloads now - indeed, it may slow some down slightly - but its value will become clearer when some of the other bottlenecks are taken care of.

Aside from a few smaller changes, that is where VFS scalability work stops for the 2.6.36 development cycle. The more complicated work - dealing with dcache_lock in particular - will go through a few more months of testing before it is pushed toward the mainline. Then, perhaps, we'll see Linus in a proper shade of pink.

Index entries for this article
Kernel	Filesystems/Virtual filesystem layer
Kernel	Scalability

VFS scalability patches in 2.6.36

Posted Jan 5, 2011 14:35 UTC (Wed) by razb (guest, #43424) [Link]

the biggest macro i ever seen