Shrinking shrinker locking overhead
Kernel subsystems that maintain caches should register a shrinker that can be called when the kernel needs to free memory for other uses. A shrinker is described by struct shrinker; among other things, it contains a pair of callbacks that the kernel can use to query how many cached objects could be freed, and to ask that they actually be freed. Shrinkers can be asked to focus on a specific NUMA node or memory control group, but not all shrinkers implement that functionality. Since shrinkers are called from the reclaim path when memory is tight, they should be quick and refrain from allocating memory themselves.
Shrinkers can be registered and deleted as the system runs, creating a concurrency problem: a shrinker should not be deleted while it is running, and the list of shrinkers must be changed carefully given that other CPUs may be traversing it at the same time. In current kernels, the shrinker list is protected by a reader/writer semaphore (rwsem); traversing the list to run shrinkers requires read access, while changing the list requires exclusive write access. This was meant to be a fast solution; frequent traversals of the list (reads) can run concurrently, while changes to the list that would require write access are relatively rare.
This rwsem, it turns out, can be a performance bottleneck on busy systems. It is a global lock, so frequent acquisitions and releases can create a lot of cache-line bouncing, slowing the system even if the lock itself is not contended. Things can get worse if a shrinker runs (or is blocked) for a long time. If a writer comes along, it will request a write lock, which will have to wait until all existing read locks are dropped; meanwhile, the write-lock request blocks any additional read locks from being granted. In this situation, a long-running shrinker can clog up the works for some time.
Performance problems of this type come up often in the kernel, and the path to their solution is reasonably well-worn at this point; it almost inevitably involves using read-copy-update (RCU) to defer changes to existing structures until all users are gone.
In this case, the patch series starts by changing the shrinker registration interface so that all shrinkers are allocated dynamically — even those that are present from boot and cannot be removed. This change allows all shrinkers to be treated uniformly, getting rid of special cases, and sets the stage for changing how shrinker registration is handled. As seen in this patch, a new shrinker instance is created with shrinker_alloc(), made active with shrinker_register(), and released with shrinker_free().
There are a couple of implications here. One, as noted in the cover letter, is that this change will break all out-of-tree modules that implement shrinkers; they will have to be converted to the new API or they will fail to load. This is a deliberate change to ensure that, in kernels implementing the new mechanism, no old-style shrinkers are in use. A more quiet change is that, while the existing register_shrinker() interface is exported to all modules, the new functions are exported as GPL-only. As a result, proprietary kernel modules that implement shrinkers will not be fixable at all.
The bulk of this 45-part patch series is focused on converting all in-kernel shrinkers to the new API, after which the old one is deleted. The real purpose of the patch set is only achieved in patch 42, where the lockless algorithm is introduced. The shrinker structure gains three new fields: a reference count, a completion to be used for removals, and an rcu_head structure.
When a shrinker is registered, its reference count is set to one, and it is added (in an RCU-safe manner) to the shrinker list; it is then available to be called when the memory-management subsystem needs to find some memory. The traversals of the shrinker list are performed with the RCU lock held, meaning that the entries in the list will not disappear at an inconvenient time. To invoke a shrinker, the kernel will first attempt to increment its reference count; that attempt will only succeed if the count is already greater than zero. The RCU lock will then be dropped, and the shrinker invoked. Once its work is done, the RCU lock will be reacquired, and the reference count decremented. Since the reclaim code held a reference, the shrinker will not have disappeared while the lock was dropped.
When the time comes to remove a shrinker, shrinker_free() will drop the reference acquired at registration time, then use the completion to wait until all other references (if any) are also dropped. At this point, the fact that the reference count is zero means that shrinker will not acquire any more users, since an attempt to increment the reference count only succeeds if that count is greater than zero. But there may still be threads traversing the shrinker list and seeing this shrinker's entry there, so its removal has to be handled with care. That, of course, is what RCU is for; the entry is taken off the list, but then handed to RCU until a grace period passes, after which it is known that the shrinker structure can be safely freed.
With these changes made, the shrinker rwsem is no longer used during the invocation of shrinkers; it is only taken for write access when changes are being made to the shrinkers themselves. The final patch in the series turns the rwsem into a lower-overhead mutex, and the work is done.
This series is in its sixth revision, and the stream of comments appears to
be slowing down. Benchmark results show no regressions from this change,
unlike previous attempts to address the locking bottleneck that created
problems elsewhere. Unless new problems turn up somewhere — always a
possibility with this kind of low-level code — it looks like lockless
shrinking may be reaching a point where it is ready for wider testing in
linux-next.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Shrinkers |
| Kernel | Releases/6.7 |
