
LSFMM: Lock scaling

By Jonathan Corbet
April 23, 2013
LSFMM Summit 2013
Andi Kleen opened the LSFMM 2013 session on locking performance by noting that, over the years, kernel developers have worked to improve scalability by adding increasingly fine-grained locking to the kernel. Locks with small coverage help to avoid contention, but, when there is contention, the resulting communication overhead wipes out the advantages that fine-grained locks are supposed to provide. We were, he said, risking the defeat of our scalability goals through excessive application of techniques designed to improve scalability.

His main assertion was that processing single objects within a lock is not a scalable approach. What we need, he said, is interfaces that handle groups of objects in batches. Processing multiple objects under a single lock will amortize the cost of that lock over much more work, increasing the overall scalability of the system. That said, there are some challenges, starting with the choice of the optimal batch size. Developers tend to like 64, he said, but that may not be the right value. The best batch size may be hardware-dependent and needs to be tunable.

As an example, he mentioned the per-zone LRU lock that protects the lists of pages maintained by the memory management subsystem. That lock is currently taken and released for every page processed, even when many pages need to be operated on. Taking the lock once to process a larger batch of pages would improve the scalability of the system.
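The amortization argument can be sketched in ordinary C. This is not kernel code; the `lru_lock` mutex, `process_page()` helper, and batch size are invented stand-ins for the per-zone LRU lock and per-page work the article describes. The point is only the structural difference: one lock cycle per item versus one per batch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative sketch only: a pthread mutex stands in for the
 * per-zone LRU lock, and process_page() for the per-page work. */

#define BATCH_SIZE 64   /* the "developers tend to like 64" default */

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

static void process_page(int *page) { *page += 1; }  /* stand-in work */

/* Current style: one lock/unlock cycle per page. */
void process_pages_per_item(int *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&lru_lock);
        process_page(&pages[i]);
        pthread_mutex_unlock(&lru_lock);
    }
}

/* Batched style: one lock/unlock cycle per BATCH_SIZE pages,
 * amortizing the lock cost over much more work. */
void process_pages_batched(int *pages, size_t n)
{
    for (size_t i = 0; i < n; ) {
        pthread_mutex_lock(&lru_lock);
        for (size_t j = 0; j < BATCH_SIZE && i < n; j++, i++)
            process_page(&pages[i]);
        pthread_mutex_unlock(&lru_lock);
    }
}
```

Both functions do the same work on the pages; the batched version simply takes the lock roughly n/BATCH_SIZE times instead of n times, which is the tradeoff (fewer, longer critical sections) that the latency discussion below turns on.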

Ted Ts'o objected that the tuning of batch sizes is a problem. Lots of users will never try, and many who do will get it wrong. Automatic tuning is an obvious solution, but, there, too, there are difficulties, starting with a high-level policy question: should the system be tuned for throughput or latency? Large batches may improve throughput, but longer lock hold times can cause increased latencies — a change that some users may well object to.

Dave Chinner pointed out that adding batching interfaces may risk creating unpredictable behavior (unknown latencies, for example) in the system; it would also increase memory use as a result of the arrays needed to track the batches. There are, he said, not that many problematic locks in the kernel. He would prefer developers to look for algorithmic changes to avoid the lock contention altogether.

A suggestion was made that a lot of contention could be avoided by moving to message passing for communication between NUMA nodes. It is fair to say that support for this idea was minimal, though. Ted asked about which locks, in particular, would be amenable to batching; Andi responded that per-page processing in filesystem code was at the top of his list, especially in code dealing with buffered writes. There are also some locks in the networking subsystem, but the increasing use of segmentation offload means that batching is being done in the hardware. Translation lookaside buffer (TLB) flushes are also done individually, even when multiple entries need to be flushed.

There was a general agreement that it would be a good idea to enhance the kernel to provide better information about which locks are contended, but nobody committed to actually doing that work.




Posted Apr 24, 2013 4:37 UTC (Wed) by dlang (✭ supporter ✭, #313)

> the tuning of batch sizes is a problem.

In rsyslog, locking around the main message queue was a limiting factor (exactly the per-item locking that Andi is complaining about).

When we introduced batching of messages, we were worried about the same throughput vs latency tradeoff, but Rainer identified the batch-processing algorithm that we ended up using, which has not required much tuning in practice.

Instead of the knee-jerk approach of waiting until you have enough items to fill your batch size, just process however many items are outstanding _right now_, up to your max batch size. If that's only one item, you process one item. If the rate of new items being generated is low, you continue to lock for each item, but there is zero contention for the lock.

As the rate of new items climbs, eventually you get to the point where there are multiple items waiting to be processed when the system goes to process one of them. You then process all outstanding items with one lock cycle.

This way, the latency for any item is always as small as it can possibly be (one lock cycle * (# outstanding items/max batch size)), but as the load goes up the efficiency climbs to let you keep going.

It doesn't matter if the load is caused by items being generated faster, or by the resources needed to process the items being used for other things.

The drawback to this approach is that it uses more CPU at low load levels, which is just fine for this one task, but may impact the system as a whole.
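The drain-what's-there policy described above can be sketched as a small queue in C. The queue layout, names, and capacities here are invented for illustration; rsyslog's actual structures differ. The key line is the cap: the consumer takes min(outstanding, MAX_BATCH) items in a single lock cycle, so it never waits for a batch to fill.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical sketch of adaptive batching: take whatever is
 * outstanding right now, capped at MAX_BATCH, under one lock cycle. */

#define MAX_BATCH 8
#define QUEUE_CAP 1024

static struct {
    int items[QUEUE_CAP];
    size_t head, tail;          /* consume at head, produce at tail */
    pthread_mutex_t lock;
} q = { .lock = PTHREAD_MUTEX_INITIALIZER };

void enqueue(int item)
{
    pthread_mutex_lock(&q.lock);
    q.items[q.tail++ % QUEUE_CAP] = item;
    pthread_mutex_unlock(&q.lock);
}

/* Drain up to MAX_BATCH outstanding items in one lock cycle and
 * return how many were taken.  At low load this degenerates to
 * one item per lock cycle, exactly as the comment describes. */
size_t dequeue_batch(int *out)
{
    pthread_mutex_lock(&q.lock);
    size_t avail = q.tail - q.head;
    size_t n = avail < MAX_BATCH ? avail : MAX_BATCH;
    for (size_t i = 0; i < n; i++)
        out[i] = q.items[q.head++ % QUEUE_CAP];
    pthread_mutex_unlock(&q.lock);
    return n;
}
```

At low load the batch size is naturally one; as the backlog grows, the batch grows with it up to the cap, so lock cycles per item fall exactly when contention would otherwise rise.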


Posted Apr 24, 2013 14:15 UTC (Wed) by martinfick (subscriber, #4455)

> This way, the latency for any item is always as small as it can possibly be (one lock cycle * (# outstanding items/max batch size)).

I suspect that the latencies they are concerned about are those caused by processing more items without releasing the lock, not those caused by waiting to fill up a batch.


Posted Apr 25, 2013 7:55 UTC (Thu) by dlang (✭ supporter ✭, #313)

> I suspect that the latencies they are concerned about are those caused by processing more items without releasing the lock, not those caused by waiting to fill up a batch.

This latency is bounded by setting the max batch size.

But unless you allow for items to be re-ordered, allowing batching will greatly decrease latency for most items, because it gets more work done.

In addition, in many cases you don't need to hold the lock for the entire time you are working on something.

For example, in rsyslog the output threads delivering logs lock the main queue, mark which messages they are working on, unlock the queue, work on the messages, then lock the queue again to mark the messages as completed (and unlock the queue when they are done).

The old way, where the output threads locked the queue, processed one message, unlocked the queue, and repeated, meant that the queue was locked for a much larger percentage of the time, and a lot of CPU time was spent on lock contention.
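The claim-then-retire pattern described here can be sketched as two short critical sections around unlocked work. The message states, array sizes, and names are illustrative, not rsyslog's real code; the point is that the slow delivery step runs with the queue lock released.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative sketch: hold the queue lock only to claim and to
 * retire messages, never while doing the actual delivery work. */

enum state { QUEUED, IN_PROGRESS, DONE };

#define NMSG 4

static enum state msgs[NMSG];                /* all start as QUEUED */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void deliver(int i) { (void)i; /* slow I/O would happen here, unlocked */ }

void worker(void)
{
    int claimed[NMSG];
    size_t n = 0;

    /* Critical section 1: mark which messages we are working on. */
    pthread_mutex_lock(&queue_lock);
    for (int i = 0; i < NMSG; i++) {
        if (msgs[i] == QUEUED) {
            msgs[i] = IN_PROGRESS;
            claimed[n++] = i;
        }
    }
    pthread_mutex_unlock(&queue_lock);

    /* The expensive work happens with the queue lock released, so
     * other threads can claim their own messages concurrently. */
    for (size_t i = 0; i < n; i++)
        deliver(claimed[i]);

    /* Critical section 2: mark the claimed messages as completed. */
    pthread_mutex_lock(&queue_lock);
    for (size_t i = 0; i < n; i++)
        msgs[claimed[i]] = DONE;
    pthread_mutex_unlock(&queue_lock);
}
```

Compared with lock-process-unlock per message, the lock is held only for the cheap state transitions, so the fraction of time the queue is locked shrinks even though the total work is unchanged.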

Appropriate batching (and adjustment of data structures) can work wonders with this sort of problem.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds