Warming up to frozen pages for networking
The kernel uses reference counts to keep track of which pages of memory are in use. For example, a page in shared memory that is mapped into the address space of several processes will track a reference from each of those processes. As long as at least one of those processes exists and keeps the page mapped, that page will not be freed. The management of reference counts is not free, though; their manipulation requires expensive atomic operations, and the counts themselves take up memory. That has led to a desire to do without reference counting in places where it is not strictly necessary. The slab allocator, for example, tracks the objects it manages within each page and does not need a separate reference count for the page as a whole. In kernels prior to 6.14, though, slab pages are duly reference-counted anyway.
Frozen pages were introduced as a way to eliminate this overhead when possible; in a frozen page, the reference count is set to zero and stays there. Since the lifecycle of the page is tracked separately, there is no need to increment or decrement its count, so that overhead is avoided. Eventually, it will become possible to eliminate the reference count for frozen pages entirely (rather than just keeping it at zero), but there is work yet to be done to reach that point.
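As a rough illustration of what a count that "is set to zero and stays there" means for code that wants a reference, here is a minimal user-space sketch. The `toy_page` structure and its helpers are hypothetical stand-ins, not the kernel's `struct page` or its API; the only assumption borrowed from the kernel is the general "take a reference unless the count is zero" idiom.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a page descriptor; NOT the kernel's struct page. */
struct toy_page {
    atomic_int refcount;    /* stays at 0 for a frozen page */
};

/* Hypothetical helper: a normally counted page starts with one reference. */
void toy_page_init(struct toy_page *p)
{
    atomic_init(&p->refcount, 1);
}

/* Hypothetical helper: a frozen page's lifetime is tracked by its owner. */
void toy_page_init_frozen(struct toy_page *p)
{
    atomic_init(&p->refcount, 0);
}

/*
 * Take a reference only if the count is currently nonzero.
 * Returns false, leaving the count untouched, for a frozen page.
 */
bool toy_page_get_unless_zero(struct toy_page *p)
{
    int old = atomic_load(&p->refcount);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
            return true;
    }
    return false;
}

int toy_page_count(struct toy_page *p)
{
    return atomic_load(&p->refcount);
}
```

In this model, a caller that blindly increments the count instead of checking the result would push a frozen page's count away from zero, which is the sort of inconsistency that the frozen-page design assumes will never happen.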
Hannes Reinecke encountered a kernel crash deep within the networking subsystem; after carefully bisecting the problem, he identified the commit switching the slab allocator to frozen pages as the culprit. Some extensive debugging and discussion ensued, and it eventually became clear that the networking code was trying to increase the reference count on a frozen page, leading to all kinds of internal confusion and an eventual crash.
Sending data through the network can be a complex operation involving pages scattered throughout physical memory. The networking subsystem, like others, handles this complexity by creating a vector describing the various chunks of data to be transferred. All of the pages contained within that vector need to remain present and valid while the operation is underway, so each page's reference count is incremented at the beginning, and decremented once the operation is complete. Many I/O paths within the kernel have traditionally followed that same pattern.
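The traditional pattern described above can be sketched in plain C. Everything here is a simplified, hypothetical model: `vec_page`, `vec_segment`, and the helper functions are invented for illustration, and a plain `int` stands in for what would be an atomic counter in the kernel.

```c
#include <stddef.h>

/* Toy page descriptor; the real counter would be atomic. */
struct vec_page {
    int refcount;
};

/* One chunk of data to transfer: a page plus an offset and length. */
struct vec_segment {
    struct vec_page *page;
    size_t offset;
    size_t len;
};

/* Take one reference per segment before the transfer starts. */
void vec_pin_pages(struct vec_segment *segs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        segs[i].page->refcount++;
}

/* Drop those references once the transfer has completed. */
void vec_unpin_pages(struct vec_segment *segs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        segs[i].page->refcount--;
}

/* Total number of bytes described by the vector. */
size_t vec_total_len(const struct vec_segment *segs, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += segs[i].len;
    return total;
}
```

A page that appears in two segments is pinned twice and unpinned twice, so every transfer pays for two counter updates per segment even when the page was never at risk of being freed.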
With the move toward folios and the desire to avoid unneeded reference-count operations, though, that pattern has been changing. The I/O paths need to avoid reference-count manipulations whenever possible, and certainly when those manipulations cannot be done at all, so those paths have been adapted accordingly. At least, they have in some parts of the kernel; Matthew Wilcox expressed some surprise on learning that the job was only partially done:
I thought we'd done all the work needed to get rid of these pointless refcount bumps. Turns out that's only on the block side (eg commit e4cc64657bec). So what does networking need in order to understand that some iovecs do not need to mess with the refcount?
Reinecke answered that this kind of change was not going to be easy; the code is complex, and the place where a reference is taken may be far away from the place where that reference must be released, possibly in a completely different layer. Wilcox, meanwhile, posted a patch adding some checks within the memory-management code that prevent attempts to manipulate reference counts on slab pages, which are the only frozen pages in the 6.14 kernel. That change, described as "a quick hack", was intended as a way to avoid having to revert the use of frozen pages entirely.
Even then, it took one more change from Vlastimil Babka, touching the networking code directly, to make the problem go away. Reinecke acknowledged the fix, but complained about the need to keep track of whether specific pages needed their reference counts updated or not:
Previously we could just do a bvec iteration, get a reference for each page, and start processing. Now suddenly the caller has to check if it's a slab page and don't get a reference for that. Not only that, he also has to remember to _not_ drop the reference when he's done. And, of course, tracing get_page() and the corresponding put_page() calls through all the layers. Really?
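The bookkeeping burden described in that complaint can be sketched as follows. This is a hypothetical user-space model, not the actual fix: `io_page`, `io_seg`, and both helpers are invented names, and the `is_slab` flag merely stands in for whatever test the kernel applies to recognize a slab page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page descriptor; is_slab marks a frozen (slab) page. */
struct io_page {
    int refcount;
    bool is_slab;
};

/* One entry of a simplified I/O vector. */
struct io_seg {
    struct io_page *page;
};

/*
 * Take references for a transfer, skipping slab pages.
 * Returns the number of references actually taken.
 */
size_t io_get_pages(struct io_seg *segs, size_t n)
{
    size_t taken = 0;

    for (size_t i = 0; i < n; i++) {
        if (segs[i].page->is_slab)
            continue;               /* frozen: leave the count at zero */
        segs[i].page->refcount++;
        taken++;
    }
    return taken;
}

/* The release side must apply exactly the same test. */
void io_put_pages(struct io_seg *segs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (segs[i].page->is_slab)
            continue;               /* never took a reference here */
        segs[i].page->refcount--;
    }
}
```

In this toy version the two loops sit next to each other, so keeping them consistent is easy; the difficulty raised in the discussion is that, in the real code, the equivalent of `io_put_pages()` may live in a different layer from the code that took the references.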
His complaint garnered little sympathy, though. Instead, Wilcox asserted that the networking layer needs to move away from using reference counts on pages, both to allow the memory-management hack to be removed and to improve networking performance. He added: "What worries me is that nobody in networking has replied to this thread yet. Do they not care?" In an attempt to provoke such a response, he changed the subject line to: "Networking people smell funny and make poor life choices".
Even that, though, has failed to motivate any sort of significant response from the networking subsystem. The only reply was from Cong Wang, who suggested that "using AI copilot or whatever automation tool" might help; that suggestion does not appear to have gained any traction. Wilcox has posted his workaround as a separate patch that, one would expect, will find its way into 6.14 prior to its release.
As of this writing, that is where the situation stands. The immediate problem should be fixed, but the wider question of the management of reference counts for pages across the kernel remains unanswered. Perhaps the upcoming Linux Storage, Filesystem, Memory-Management, and BPF Summit will include a discussion on this issue; stay tuned.
