Range reader/writer locks for the kernel

By Jonathan Corbet
June 5, 2017

The kernel uses a variety of lock types internally, but they all share one feature in common: they are a simple either/or proposition. When a lock is obtained for a resource, the entire resource is locked, even if exclusive access is only needed to a part of that resource. Many resources managed by the kernel are complex entities for which it may make sense to only lock a smaller part; files (consisting of a range of bytes) or a process's address space are examples of this type of resource. For years, kernel developers have talked about adding "range locks" — locks that would only apply to a portion of a given resource — as a way of increasing concurrency. Work has progressed in that area, and range locks may soon be added to the kernel's locking toolkit.

Jan Kara posted a range-locking mechanism in 2013, but that work stalled and never made it into the mainline. More recently, Davidlohr Bueso has picked up that work and extended it. The result is a new form of reader/writer lock — a lock, in other words, that distinguishes between read-only and write access to a resource. Reader/writer locks can increase concurrency in settings where the protected resource is normally accessed by readers, since all readers can run simultaneously. Whenever a writer comes along, though, it must have exclusive access to the resource. Balancing access between readers and writers can be a tricky business where the wrong decisions can lead to starvation, unfairness, or poor concurrency.

Since range locks only cover part of a resource, there can be many of them covering separate parts of the resource as a whole. The data structure that describes all of the known range locks, including those that are waiting for the needed range to become available, for a given resource is a "range lock tree", represented by struct range_lock_tree. This "tree" is the lock that protects the resource as a whole; it will be typically located in or near the relevant data structure where one would otherwise find a simpler lock. Thus, a range-locking implementation will tend to start with something like:

    #include <linux/range_lock.h>

    DEFINE_RANGE_LOCK_TREE(my_tree);

Given the range_lock_tree structure to protect the resource, a thread needing access to a portion of that resource will need to acquire a lock on the range of interest. A lock on a specific range (whether granted or not) is represented by struct range_lock. It is possible to declare and initialize a range lock statically with either of:

    DEFINE_RANGE_LOCK(my_lock, start, end);
    DEFINE_RANGE_LOCK_FULL(name);

The second variant above will describe a lock on the entire range. It is also possible to initialize a range_lock structure at run time with either of:

    void range_lock_init(struct range_lock *lock, unsigned long start,
    			 unsigned long end);
    void range_lock_init_full(struct range_lock *lock);

Actually acquiring a range lock requires calling one of a large set of primitives. In the simplest case, a call to range_read_lock() will acquire a read lock on the indicated range, blocking if necessary to wait for the range to become available:

    void range_read_lock(struct range_lock_tree *tree, struct range_lock *lock);

The lock for the entire resource is provided as tree, while lock describes the region that is to be locked. Like most sleeping lock primitives, read_range_lock() will go into a non-interruptible sleep if it must wait. That behavior can be changed by calling one of the other locking functions:

    int range_read_lock_interruptible(struct range_lock_tree *tree,
				      struct range_lock *lock);
    int range_read_lock_killable(struct range_lock_tree *tree, struct range_lock *lock);
    int range_read_trylock(struct range_lock_tree *tree, struct range_lock *lock);

In any case, a read lock that has been granted must eventually be released with:

    void range_read_unlock(struct range_lock_tree *tree, struct range_lock *lock);

If, instead, the range must be written to, a write lock should be obtained with one of:

    void range_write_lock(struct range_lock_tree *tree, struct range_lock *lock);
    int range_write_lock_interruptible(struct range_lock_tree *tree,
				       struct range_lock *lock);
    int range_write_lock_killable(struct range_lock_tree *tree, struct range_lock *lock);
    int range_write_trylock(struct range_lock_tree *tree, struct range_lock *lock);

A call to range_write_unlock() will release a write lock. It is also possible to turn a write lock into a read lock with:

    void range_downgrade_write(struct range_lock_tree *tree, struct range_lock *lock);

The implementation does not give any particular priority to either readers or writers. If a writer is waiting for a given range, a reader that arrives later requesting an intersecting range will wait behind the writer, even if other readers are active in that range at the time. The result is, possibly, less concurrency than might otherwise be possible, but this approach also ensures that writers will not be starved for access.

This patch set has been through a few revisions and does not seem to be generating much more in the way of comments, so it might be about ready to go. The first user is the Lustre filesystem, which is already using a variant of Kara's range-lock implementation internally to control access to ranges of files. But there is a potentially more interesting user waiting on the wings: using range locks as a replacement for mmap_sem.

The reader/writer semaphore known as mmap_sem is one of the most intractable contention points in the memory-management subsystem. It protects a process's memory map, including, to an extent, the page tables. Many performance-sensitive operations, such as handling page faults, must acquire mmap_sem with the result that, on many workloads, contention for mmap_sem is a significant performance bottleneck. Protecting a process's virtual address space would appear to be a good application for a range lock. Most of the time, a change to the address space does not affect the entire space; it is, instead, focused on a particular set of addresses. Using range locks would allow more operations on a given address space to proceed concurrently, reducing contention and improving performance.

The patch set (posted by Laurent Dufour) does not yet achieve that goal; instead, the entire range is locked every time. Thus, with these patches, a range lock replaces mmap_sem without really changing how things work. Restricting the change in this way allows the developers to be sure that the switch to a range lock has not introduced any bugs of its own. Once confidence in that change exists, developers will be able to start reducing the ranges to what is actually needed.

These changes will need to be made with care, especially since what is being protected by mmap_sem is not always clear. But, given enough development cycles, the mmap_sem bottleneck should slowly dissolve away, leaving us with a faster, more concurrent memory-management subsystem. Some improvements are worth waiting for.

Index entries for this article
Kernel	Memory management/mmap_sem
Kernel	Range locks

32-bit systems

Posted Jun 5, 2017 21:11 UTC (Mon) by abatters (✭ supporter ✭, #6932) [Link] (3 responses)

The old version of range_lock_init() (in linux/drivers/staging/lustre/lustre/llite/range_lock.c) uses __u64 for the start/end offsets, but this generic version uses "unsigned long", which will limit its range to 4 GB on 32-bit systems. That is fine for mmap(), but would limit its usefulness for "ranges of files", at least for byte-level granularity.

32-bit systems

Posted Jun 6, 2017 1:21 UTC (Tue) by Paf (subscriber, #91811) [Link]

The Lustre developers (I'm one, but it wasn't me) pointed this out, actually... A few weeks or maybe months ago now. I'm surprised it hasn't been fixed!

32-bit systems

Posted Jun 6, 2017 15:01 UTC (Tue) by ianmcc (subscriber, #88379) [Link] (1 responses)

looks to be screaming out for template <typename int_type> ,,,,

32-bit systems

Posted Jun 6, 2017 15:27 UTC (Tue) by davidlohr (subscriber, #87713) [Link]

I considered this, but ultimately decided it's not worth it just for lustre. A good example of templates is what interval trees do, and there it's most certainly worth the hassle. In any case lustre folks are well aware of the 32-bit situation.

How about update locks?

Posted Jun 6, 2017 22:14 UTC (Tue) by saffroy (guest, #43999) [Link] (4 responses)

For some reason I'd expect that people would want an API for update locks before range locks. For a definition of update locks, see the "PW" lock mode in https://en.wikipedia.org/wiki/Distributed_lock_manager.

Update locks are typically used in combination with "CR" (aka reader lock) and "EX" (exclusive lock) modes, with an API to atomically convert from update to exclusive. A great use case it when one wants to prepare some kind of structure update privately while letting readers work, and then convert to exclusive and atomically publish the update: this helps minimize the time during which the exclusive lock is held.

Is that kind of lock API discussed?

How about update locks?

Posted Jun 7, 2017 1:20 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> A great use case it when one wants to prepare some kind of structure update privately while letting readers work, and then convert to exclusive and atomically publish the update:

The normal approach to that use case in the kernel is to prepare the update, then grab the exclusive lock, then see if the structure has changed since you started preparing the update (e.g using a sequence counter). If it has, bale out and start again. If it hasn't, publish the update.

I imagine that approach would not be ideal in the cluster context that DLM was designed for, as latencies would be higher etc. In the kernel, it seems to work well.

As a general rule, keeping the locks simple minimizes the time it takes to claim and release them. Splitting locks (such as replacing a per-hash-table lock with lots of per-hash-chain locks) tends to be the better approach to scalability, rather than anything more complex that mutual-exclusion.

Range locks are handling a fairly unique case. Files are used in an enormous variety of ways - sometimes as a whole, sometimes as lots of individual records. In some case the whole-file mmap_sem really is simplest and best. Other times per-page locks are best. But sometimes, taking mmap_sem will cause too much contention, while taking the page lock on every single page would take even longer... and some of the pages might not be allocated yet.

So range locks are being added, not because it is a generally good idea, but because there is a specific use case (managing the internals of files) that seems to justify them. Do you know of a specific in-kernel use case that would significantly benefit from upgradeable locks? (We already have downgradeable locks - see downgrade_write()).

How about update locks?

Posted Jun 16, 2017 21:47 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

> Update locks are typically used in combination with "CR" (aka reader lock) and "EX" (exclusive lock) modes, with an API to atomically convert from update to exclusive.

When I wrote an accounts system, the OS allowed you to specify "many readers or one writer", "many readers and one writer", or "many readers and many writers". So all the accounts files were spec'd as "many readers and one writer".

So when the user updated the system, the program ran in read-only mode until they hit "confirm", at which point it opened the ledger master files followed by the individual ledger files in read-write mode, and replayed the update for real. (It does help if you have a coherent overall design when you need to do locking :-)

Range locks would be very useful for certain systems, though - relational databases spring to mind.

Cheers,
Wol

How about update locks?

Posted Jun 17, 2017 1:03 UTC (Sat) by nybble41 (subscriber, #55106) [Link]

> So when the user updated the system, the program ran in read-only mode until they hit "confirm", at which point it opened the ledger master files followed by the individual ledger files in read-write mode, and replayed the update for real.

So what happened when two instances of the program prepared conflicting updates? Obviously only one can replay its update at a time, but whichever program goes second won't be aware of the first update while preparing its changes. Does the update fail after it sees that the data changed, or does it simply overwrite the changes the first program did with its own changes based on obsolete data?

This is, I believe, the problem that update locks are designed to solve. They indicate an intention to update the record in the future (after upgrading to an exclusive lock). Only one thread can prepare an update at a time. In the meantime, other threads can still read the data so long as they aren't preparing to make an update based on it. It's a similar concept to a reader/writer lock except that with an R/W lock there is no coherent way to atomically upgrade from a reader lock to a writer lock without the possibility of failure. (What would happen if multiple threads tried to upgrade? One would have to go first, and then the others would either fail to upgrade to a writer lock or see different data than was present before upgrading.) An update lock is like a "privileged" reader lock in the sense that there can be many readers, but only one of them (the updater) is able to upgrade to a writer lock.

How about update locks?

Posted Jun 17, 2017 1:31 UTC (Sat) by andresfreund (subscriber, #69562) [Link]

> Range locks would be very useful for certain systems, though - relational databases spring to mind.

Hence most databases having row-level locking.

Range reader/writer locks for the kernel

Posted Jun 7, 2017 12:03 UTC (Wed) by allenbh (subscriber, #104242) [Link]

While the article mentions locking of different parts of the range, presumably by different threads, the article doesn't specifically mention locking different parts of the range at the same time by the same thread, and in that case any constraints for ordering of acquiring locks on the range. I wonder if the implementation will do cycle detection at run time, or leave it up to users of the lock not to introduce cyclic deadlocks, or only allow locking one contiguous part of the range at a time. I wonder how effective static analysis will be, and what kind of techniques might need to be added for static analysis to check the use of range locks.