Warming up to frozen pages for networking
The kernel uses reference counts to keep track of which pages of memory are in use. For example, a page in shared memory that is mapped into the address space of several processes will track a reference from each of those processes. As long as at least one of those processes exists and keeps the page mapped, that page will not be freed. The management of reference counts is not free, though; their manipulation requires expensive atomic operations, and the counts themselves take up memory. That has led to a desire to do without reference counting in places where it is not strictly necessary. The slab allocator, for example, tracks the objects it manages within each page and does not need a separate reference count for the page as a whole. In kernels prior to 6.14, though, slab pages are duly reference-counted anyway.
Frozen pages were introduced as a way to eliminate this overhead when possible; in a frozen page, the reference count is set to zero and stays there. Since the lifecycle of the page is tracked separately, there is no need to increment or decrement its count, so that overhead is avoided. Eventually, it will become possible to eliminate the reference count for frozen pages entirely (rather than just keeping it at zero), but there is work yet to be done to reach that point.
Hannes Reinecke encountered a kernel crash deep within the networking subsystem; after carefully bisecting the problem, he identified the commit switching the slab allocator to frozen pages as the culprit. Some extensive debugging and discussion ensued, and it eventually became clear that the networking code was trying to increase the reference count on a frozen page, leading to all kinds of internal confusion and an eventual crash.
Sending data through the network can be a complex operation involving pages scattered throughout physical memory. The networking subsystem, like others, handles this complexity by creating a vector describing the various chunks of data to be transferred. All of the pages contained within that vector need to remain present and valid while the operation is underway, so each page's reference count is incremented at the beginning, and decremented once the operation is complete. Many I/O paths within the kernel have traditionally followed that same pattern.
With the shift toward folios and the desire to avoid unneeded reference-count operations, though, that pattern has changed. The I/O paths need to avoid reference-count manipulations whenever possible, and certainly when those manipulations cannot be done at all, so those paths have been updated accordingly. At least, they have in some parts of the kernel; Matthew Wilcox expressed some surprise on learning that the job was only partially done:
I thought we'd done all the work needed to get rid of these pointless refcount bumps. Turns out that's only on the block side (eg commit e4cc64657bec). So what does networking need in order to understand that some iovecs do not need to mess with the refcount?
Reinecke answered that this kind of change was not going to be easy; the code is complex, and the place where a reference is taken may be far away from — and, indeed, in a completely different layer from — where that reference must be released.
Wilcox, meanwhile, posted a patch adding some checks within the memory-management code that prevent attempts to manipulate reference counts on slab pages, which are the only frozen pages in the 6.14 kernel. That change, described as "a quick hack", was intended as a way to avoid having to revert the use of frozen pages entirely.
Even then, it took one more change from Vlastimil Babka, touching the networking code directly, to make the problem go away. Reinecke acknowledged the fix, but complained about the need to keep track of whether specific pages needed their reference counts updated or not:
Previously we could just do a bvec iteration, get a reference for each page, and start processing. Now suddenly the caller has to check if it's a slab page and don't get a reference for that. Not only that, he also has to remember to _not_ drop the reference when he's done. And, of course, tracing get_page() and the corresponding put_page() calls through all the layers. Really?
His complaint garnered little sympathy, though. Instead, Wilcox asserted that the networking layer needs to move away from using reference counts on pages, both to allow the memory-management hack to be removed and to improve networking performance. He added: "What worries me is that nobody in networking has replied to this thread yet. Do they not care?" In an attempt to provoke such a response, he changed the subject line to: "Networking people smell funny and make poor life choices".
Even that, though, has failed to motivate any sort of significant response from the networking subsystem. The only reply was from Cong Wang, who suggested that "using AI copilot or whatever automation tool" might help — a suggestion that does not appear to have gained any traction. Wilcox has posted his workaround as a separate patch that, one would expect, will find its way into 6.14 prior to its release.
As of this writing, that is where the situation stands. The immediate problem should be fixed, but the wider question of the management of reference counts for pages across the kernel remains unanswered. Perhaps the upcoming Linux Storage, Filesystem, Memory-Management, and BPF Summit will include a discussion on this issue; stay tuned.
Index entries for this article:
  Kernel: Memory management/struct page
  Kernel: Networking
C spills internals all over
Posted Mar 14, 2025 8:28 UTC (Fri) by taladar (subscriber, #68407)

C spills internals all over
Posted Mar 14, 2025 9:23 UTC (Fri) by npws (subscriber, #168248)

C spills internals all over
Posted Mar 14, 2025 9:41 UTC (Fri) by tux3 (subscriber, #101245)

One solution is to have the hack like today where you take a mix of frozen and live pages, and then have to check which one it is.

And the other way is to finish converting everything, but there again I don't see what spilling implementation details have to do with it.

It used to be fine to do folio_get()/folio_put(), the frozen page feature breaks this invariant of that particular type, and until you do the manual work of finding all the broken code paths, the compiler can't tell you where to look.

You could track that info with a fancy MaybeFrozen type if you want, and you can make it nicely encapsulate its internals, but that seems equivalent to passing a struct in C. If you take a mix, this is a dynamic property. Can a stronger static type system help?

I can agree that strong type systems are nice and help in general, but that's handwaving all the important details away. I don't see it here.

C spills internals all over
Posted Mar 14, 2025 23:10 UTC (Fri) by NYKevin (subscriber, #129325)

It depends what you mean by "help" and how much code churn you're willing to put up with.

There are a number of different possible solutions in the Rust space, which range from trivial to rather complex, so let's start with the basics. Anything that gets refcounted in Rust is either behind one of the stdlib types Rc<T> (thread-unsafe) or Arc<T> (thread-safe), or else it is behind some custom home-grown equivalent to one of those types (e.g. if you need the ability to directly read or modify the refcount). The details of how those types work is beyond the scope of this discussion, but the short version is that they are smart pointers that manage refcounts automatically as objects go in and out of scope. They *do not* manipulate refcounts when moved (destructively passed by value), which is important because we should not have to pay extra at runtime for the convenience of automatic refcounting.

So, just to keep everything in Rust-idiomatic terms, let's say our type that we want to maybe-refcount is called Page, and usually exists in the form Arc<Lock<Page>> where Lock is some kind of lock (probably a Mutex or RwLock, or some kernel-specific equivalent). Arc may or may not be the stdlib Arc, we don't need to care, but it should at least resemble the stdlib Arc in its most important aspects. Here are some of the different types your function could hypothetically take (as a parameter) or return:

* Arc<Lock<Page>>: "This Page must be refcounted, and we're transferring ownership of a reference (without modifying its refcount)." The compiler enforces this requirement - this won't typecheck against a Page that is not actually refcounted, and the Arc disappears after you pass or return it (so you can't keep using the reference you transferred). You can create new shared references by calling Arc::clone() (which bumps the refcount and returns a new Arc).

* &Page: "I don't care how this Page's ownership is managed, I'm just borrowing it for a little while (and the compiler will yell at me if I try to hold onto it for too long)." Can be made to work with long-lived borrows, but is significantly more of a hassle than Arc. Doesn't even know that there is a refcount, so obviously does not manipulate it.

* &mut Page: Like &Page, but you're promised to have the page exclusively and are allowed to mutate it. In most cases, this entails taking a lock, but that's not required if the compiler is satisfied that no other thread could possibly have access to it.

* Box<Page>: "This Page must not be refcounted, instead it is an exclusively-owned allocation, and we're transferring ownership of it." You can also just write Page for this, but then you would be allocating a whole Page on the stack and copying it to move it around, which is probably unwise. Box entails a heap allocation in userspace, and there is a kernel equivalent that entails something more complicated. Converting a Box<Page> into an Arc<Page> is trivial, but wrapping it in a lock is more complicated, so if you plan to do that, you probably start out with a Box<Lock<Page>> (which is cheaper than it looks, since most lock types don't enforce acquisition when you have a &mut to the lock).

* impl Deref<Target=Page>: "I don't care what you give me, as long as it is some sort of pointer to a Page that I can dereference." If in argument position, the compiler monomorphizes this just like a generic type. In return position, it's not monomorphized and must correspond to a specific type (but that type is not visible to the caller and is considered an implementation detail of the callee). Either way, refcounting works correctly when we pass or return an Arc, no refcounting is emitted if we pass or return a Box, you can pass or return more esoteric things like a MutexGuard (which will clean itself up correctly, i.e. unlock when it goes out of scope, as you would expect), and there is no dynamic dispatch since this is monomorphized at compile time.

* impl DerefMut<Target=Page>: Like Deref<Target=Page> but allows mutation (and requires exclusivity).

* An enum with two variants, one containing an Arc and one containing a Box: "I want to allow mixed refcounted/uncounted Pages at runtime, and dynamically dispatch on them." For ergonomic reasons, it would be typical to impl Deref<Target=Page> for such an enum, so that it can be used without regard for which variant it is. You still don't have to think about whether or not refcounting should be emitted as the compiler will take care of that detail for you.

C spills internals all over
Posted Mar 15, 2025 3:29 UTC (Sat) by roc (subscriber, #30627)

C spills internals all over
Posted Mar 15, 2025 5:54 UTC (Sat) by NYKevin (subscriber, #129325)

* The caller might proceed to drop their Arc immediately after you return. In the case where the caller is about to drop their Arc, and you decide that you need your own Arc, you waste two refcount atomics compared to the "just take Arc<...>" case (the caller could have transferred ownership to you). In the case where the caller is about to drop their Arc, but you decide you don't need your own Arc, it's a wash (either you drop the Arc or they do, but somebody has to drop it). Technically, this problem can be solved by taking Cow<Arc<Page>> (which can be read as "the caller decides whether we get &Arc or Arc"), but in addition to some syntax issues (its Deref trait has Target=Arc<Page> instead of Target=Page), Cow also emits a conditional branch every time you dereference it, and that's probably not great for performance.

* Even if you don't have to explicitly write double dereference syntax, the compiler will ultimately emit a double dereference in machine code, because in terms of actual memory layout, &Arc<...> is a pointer to a pointer. How many times are you dereferencing it before you turn it into an Arc<...> or discard it? If it's more than a few times, then that is going to add up.

(Why is performance suddenly such a big deal? Because &Arc is a micro-optimization. You use it to save two relaxed atomics, not because it has fundamentally different semantics than Arc. If using &Arc is not improving your performance on net, or if you don't care about performance in the first place, then you're better off with the simpler and more conceptually straightforward Arc.)

C spills internals all over
Posted Mar 18, 2025 11:46 UTC (Tue) by jtepe (subscriber, #145026)

Why two? I guess because of the case of clone'ing and then dropping it in the callee?

Why relaxed? Shouldn't the refcount be visible and current to all threads all the time?

C spills internals all over
Posted Mar 18, 2025 11:50 UTC (Tue) by farnz (subscriber, #17727)

Relaxed does not affect the visibility of the atomic itself (none of the atomic orderings do); relaxed just means that changes to this atomic are not ordered with respect to non-atomic loads and stores, nor are they ordered with respect to other atomics.

Sequential consistency, acquire, release and consume orderings make the atomic access ordered with respect to (some) non-atomic loads and stores; sequential consistency also makes the atomic access ordered with respect to other sequentially consistent atomic accesses. And relaxed just makes the atomic access ordered with respect to accesses to the atomic itself.

C spills internals all over
Posted Mar 18, 2025 17:24 UTC (Tue) by NYKevin (subscriber, #129325)

You can only use relaxed atomics for refcount increments, and even then, only if you already have a strong reference. Upgrading a weak reference is an altogether more complicated operation because you have to ensure that the object is not destructed out from under you, or at least you have to detect that case and fail gracefully when it happens, and those both entail at least acquire semantics.

C spills internals all over
Posted Mar 14, 2025 14:48 UTC (Fri) by willy (subscriber, #9762)

The problem is that we're starting from a codebase which essentially allowed anybody to do anything to struct page, even when it made no semantic sense. This is a huge cleanup effort which is why we're five years into it.

A dash of humour, I hope
Posted Apr 1, 2025 10:17 UTC (Tue) by unprinted (guest, #71684)

At least I hope it was a joke.