Fixing the contiguous memory allocator
CMA works by reserving a region of memory for large allocations. But the device needing large buffers is probably not active at all times; keeping that memory idle when the device does not need it would be wasteful. So the memory-management code allows other parts of the kernel to allocate memory from the CMA area, but only if those allocations are marked as movable. That allows the kernel to move those pages out of the way should the need for a large allocation arise.
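The ground rule can be sketched in a few lines of C; the helper below is purely illustrative and is not the kernel's actual allocator code:

    #include <linux/gfp.h>
    #include <linux/kernel.h>

    /*
     * Illustrative sketch only: pages in the CMA-reserved area may be
     * handed out for ordinary use, but only to allocations marked
     * movable, so that they can be migrated away if a large contiguous
     * buffer is later needed.
     */
    static bool may_use_cma_area(gfp_t gfp_mask)
    {
            return IS_ENABLED(CONFIG_CMA) && (gfp_mask & __GFP_MOVABLE);
    }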
Laura Abbott started off the session by noting that there are a number of
problems with CMA, relating to both the reliability of large allocations
and the performance of the system as a whole. There are a couple of
proposals out there to fix it — guaranteed
CMA by SeongJae Park and ZONE_CMA from
Joonsoo Kim — but no consensus on how to proceed. Joonsoo helped to lead
the session, as did Gioh Kim.
Peter Zijlstra asked for some details on what the specific problems are. A big one appears to be the presence of pinned pages in the CMA region. All it takes is one unmovable page to prevent the allocation of a large buffer, which is why pinned pages are not supposed to exist in the CMA area. It turns out that pages are sometimes allocated as movable, but then get pinned afterward. Many of these pins are relatively short-lived, but sometimes they can stay around for quite a while. Even relatively short-lived pins can be a problem, though; delaying the startup of a device like a camera can appear as an outright failure to the user.
One particular offender, according to Gioh, appears to be the ext4 filesystem, which, among other things, puts superblocks (pinned for as long as the associated filesystem is mounted) in movable memory. Other code is doing similar things, though. The solution in these cases is relatively straightforward: find the erroneous code and fix it. The complication, according to Hugh Dickins, is that a filesystem may not know, at the time a page is allocated, that it will later need to be pinned.
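When the eventual pinning is known up front, though, the fix usually amounts to not requesting movable memory in the first place. A hypothetical example (the function below is invented for illustration and is not a patch that was posted) might look like:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /*
     * Hypothetical example of the class of fix discussed: metadata that
     * will stay pinned (a mounted filesystem's superblock buffer, say)
     * should not be allocated as movable.
     */
    static struct page *alloc_pinned_metadata_page(void)
    {
            /* no __GFP_MOVABLE: the page cannot land in a CMA pageblock */
            return alloc_page(GFP_KERNEL);
    }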
Mel Gorman suggested that, whenever a page changes state in a way that could
block a CMA allocation, it should be migrated immediately. Even something
as transient as pinning a dirty page for writeback could result in that
page being shifted out of the CMA area. It would be relatively simple to
put hooks into the memory-management code to do the necessary migrations.
The various implementations of get_user_pages() would be one
example; the page fault handler when a page is first dirtied would be
another. A warning could be added when get_page() is called to pin a
page in the CMA area to call attention to other problematic uses.
This approach, it was thought, could help to avoid the need for more
complex solutions within CMA itself.
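A minimal sketch of what such a warning might look like follows; the helper name is made up, and a real patch would also try to migrate the page out of the CMA area before pinning it, as Mel suggested:

    #include <linux/mm.h>
    #include <linux/mmzone.h>

    /*
     * Sketch only: complain when a page sitting in a CMA pageblock is
     * about to be pinned.
     */
    static inline void cma_debug_note_pin(struct page *page)
    {
            if (is_migrate_cma(get_pageblock_migratetype(page)))
                    WARN_ONCE(1, "pinning page %#lx inside a CMA pageblock\n",
                              page_to_pfn(page));
    }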
Of course, that sort of change could lead to lots of warning noise for cases when pages are pinned for extremely short periods of time. Peter suggested adding a variant of get_page() to annotate those cases. Dave Hansen suggested, instead, that put_page() could be instrumented to look at how long the page was pinned and issue warnings for excessive cases.
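Dave's variant could look something like the sketch below. The pinned_at parameter stands in for a timestamp recorded when the page was pinned; where that value would actually be stored is left open here, and the threshold is arbitrary:

    #include <linux/jiffies.h>
    #include <linux/mm.h>
    #include <linux/mmzone.h>

    /* arbitrary threshold beyond which a pin counts as "excessive" */
    #define CMA_PIN_WARN_TIMEOUT    (10 * HZ)

    /*
     * Sketch only: at put_page() time, check how long a CMA page was
     * pinned and warn if it was held for too long.
     */
    static inline void cma_debug_note_unpin(struct page *page,
                                            unsigned long pinned_at)
    {
            if (is_migrate_cma(get_pageblock_migratetype(page)) &&
                time_after(jiffies, pinned_at + CMA_PIN_WARN_TIMEOUT))
                    WARN_ONCE(1, "CMA page %#lx was pinned for %u ms\n",
                              page_to_pfn(page),
                              jiffies_to_msecs(jiffies - pinned_at));
    }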
The second class of problems has to do with insufficient utilization of the CMA area when the large buffers are not needed. Mel initially answered that CMA was simply meant to work that way and that it would not be possible to relax the constraints on the use of the CMA area without breaking it. It eventually became clear that the situation is a bit more subtle than that, but that had to wait until the second session on the following day.
It took a while to get to the heart of the problem on the second day, but Joonsoo finally described it as something like the following. The memory-management code tries to avoid allocations from the CMA area entirely whenever possible. As soon as the non-CMA part of memory starts to fill, though, it becomes necessary to allocate movable pages from the CMA area. But, at that point, memory looks tight, so kswapd starts running and reclaiming memory. The newly reclaimed memory, probably being outside of the CMA area, will be preferentially used for new allocations. The end result is that memory in the CMA area goes mostly unused, even when the system is under memory pressure.
Gioh talked about his use case, in which Linux is embedded in televisions.
There is a limited amount of memory in a TV; some of it must be reserved
for the processing of 3D or high-resolution streams. When that is not
being done, though, it is important to be able to utilize that memory for
other purposes. But the kernel is not making much use of that memory when
it is available; this is just the problem described by Joonsoo.
Joonsoo's solution involves adding a new zone (ZONE_CMA) to the memory-management subsystem. Moving the CMA area into a separate zone makes it relatively easy to adjust the policies for allocation from that area without, crucially, adding more hooks to the allocator's fast paths. But, as Mel said, there are disadvantages to this approach. Adding a zone will change how page aging is done, making it slower and more cache-intensive since there will be more lists to search. These costs will be paid only on systems where CMA is enabled so, he said, it is ultimately a CMA issue, but people should be aware that those costs will exist. That is the reason that a separate zone was not used for CMA from the beginning.
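In rough terms, the proposal amounts to a new entry in the kernel's zone enumeration; the surrounding entries are abbreviated below, and the exact placement and layout in the posted patches may differ:

    /*
     * Rough sketch of the shape of the ZONE_CMA proposal, in the style
     * of enum zone_type from include/linux/mmzone.h.
     */
    enum zone_type {
            /* ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM ... */
            ZONE_MOVABLE,
    #ifdef CONFIG_CMA
            ZONE_CMA,               /* all CMA-reserved memory is managed here */
    #endif
            __MAX_NR_ZONES
    };

Keeping the new zone behind CONFIG_CMA means that the extra page-aging cost Mel described would only be paid on systems that enable CMA.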
Dave suggested combining ZONE_CMA with ZONE_MOVABLE, which is also meant for allocations that can be relocated on demand. The problem there, according to Joonsoo, is that memory in ZONE_MOVABLE can be taken offline, while memory for CMA should not be unpluggable in that way. Putting CMA memory into its own zone also makes it easier to control allocation policies and to create statistics on the utilization of CMA memory.
The session ended with Mel noting that there did not appear to be any
formal objections to the ZONE_CMA plan. But, he warned, the CMA
developers, by going down that path, will be trading one set of problems
for another. Since the tradeoff only affects CMA users, it will be up to
them to decide whether it is worthwhile.
Index entries for this article:
  Kernel: Memory management/Large allocations
  Conference: Storage, Filesystem, and Memory-Management Summit/2015