By Jonathan Corbet
September 19, 2007
Most of the core virtual memory subsystem developers met for a mini-summit
just before the 2007 Kernel Summit in Cambridge. They came away feeling
that they had resolved a number of VM scalability problems. Subsequent
discussions have made it clear that, perhaps, this conclusion was a bit
premature. They may well have resolved things, but it is not clear that
everybody came to the same resolution.
All of the issues at hand relate to scalability in one way or another.
While the virtual memory subsystem has been through a great many changes
aimed at making it work well on contemporary systems, one key aspect of how
it works has remained essentially unchanged since the beginning: the
4096-byte (on most architectures) page size. Over that time, the amount of
memory installed on a typical system has grown by about three orders of
magnitude - that's 1000 times more pages that the kernel must manage and
1000 times more page faults which must be handled.
Since it does not appear that this trend will stop soon, there is a clear
scalability problem which must be managed.
This problem is complicated by the way that Linux tends to fragment its
memory. Almost all memory allocations are done in units of a single page,
with the result that system RAM tends to get scattered into large numbers
of single-page chunks. The kernel's memory allocator tries to keep larger
groups of pages together, but there are limits to how successful it can be.
The file
/proc/buddyinfo can be illustrative here; on a system which has
been running and busy for a while, the number of free higher-order (larger)
blocks, as shown in the rightmost columns, will be very small.
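A small user-space program along the following lines - a quick sketch, not
part of any of the patches discussed below - shows the same information in
condensed form; on a system which has been busy for a while, the largest
available order it reports will usually be disappointingly small.

    /* Sketch: parse /proc/buddyinfo and report, for each zone, the
     * largest order at which free blocks remain.  Each line of the
     * file gives the node, the zone name, and one column of free
     * block counts per allocation order, smallest order first. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f = fopen("/proc/buddyinfo", "r");
        char line[512];
        long page_kb = sysconf(_SC_PAGESIZE) / 1024;

        if (!f) {
            perror("/proc/buddyinfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            char zone[64];
            char *p = strstr(line, "zone");
            int order = 0, top = -1;

            if (!p || sscanf(p, "zone %63s", zone) != 1)
                continue;
            p = strstr(p, zone) + strlen(zone);
            for (;;) {
                char *end;
                long count = strtol(p, &end, 10);

                if (end == p)
                    break;
                if (count > 0)
                    top = order;
                order++;
                p = end;
            }
            if (top < 0)
                printf("zone %-8s: no free pages at all\n", zone);
            else
                printf("zone %-8s: largest free block is order %d (%ld KB)\n",
                       zone, top, page_kb << top);
        }
        fclose(f);
        return 0;
    }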
The main response to memory fragmentation has been to avoid higher-order
allocations at almost any cost. There are very few places in the kernel
where allocations of multiple contiguous pages are done. This approach has
worked for some time, but avoiding larger allocations does not always make
the need for such allocations go away. In fact, there are many things
which could benefit from larger contiguous memory areas, including:
- Applications which use large amounts of memory will be working with
large numbers of pages. The translation lookaside buffer (TLB) in the
CPU, which speeds virtual address lookups, is generally relatively
small, to the point that large applications run up a lot of
time-consuming TLB misses. Larger pages require fewer TLB entries,
and will thus result in faster execution. The hugetlbfs extension was
created for just this purpose, but it is a specialized mechanism used
by few applications (a minimal example of its use appears after this
list), and it does not do anything special to make large contiguous
memory regions easier for the kernel to find.
- I/O operations work better when they have larger contiguous chunks of
data to work with. Users trying to use "jumbo frames" (extra-large
packets) on high-performance network adapters have been experiencing
problems for a while. Many devices are limited in the number of
scatter/gather entries they support for a single operation, so small
buffers limit the overall I/O operation size. Disk devices are
pushing toward larger sector sizes which would best be supported by
larger contiguous buffers within the kernel.
- Filesystems are feeling pressure to use larger block sizes for a
number of performance reasons. This
message from David Chinner provides an excellent explanation of
why filesystems benefit from larger blocks.
But it is hard (on Linux) for a filesystem to work
with block sizes larger than the page size; XFS does it, but the
resulting code is seen as non-optimal and is not as fast as it could
be. Most other filesystems do not even try; as a result, an ext3
filesystem created with 8192-byte blocks on a system with 8192-byte pages
cannot be mounted on a system with smaller pages.
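As a point of reference for the first item above, this is roughly what
direct use of hugetlbfs looks like from an application. The mount point
and the 2MB huge page size are assumptions here; both depend on how the
administrator has set up the system, and huge pages must have been
reserved in advance.

    /* Minimal hugetlbfs sketch: create and map a file on an existing
     * hugetlbfs mount.  The /mnt/huge mount point and the 2MB page
     * size are assumptions; huge pages must have been reserved
     * beforehand (e.g. via /proc/sys/vm/nr_hugepages). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_PAGE_SIZE  (2UL * 1024 * 1024)

    int main(void)
    {
        void *buf;
        int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);

        if (fd < 0) {
            perror("open on hugetlbfs");
            return 1;
        }
        /* The mapping length must be a multiple of the huge page
         * size; each huge page needs only a single TLB entry. */
        buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            close(fd);
            unlink("/mnt/huge/example");
            return 1;
        }
        memset(buf, 0, HUGE_PAGE_SIZE); /* touching it faults in the huge page */
        munmap(buf, HUGE_PAGE_SIZE);
        close(fd);
        unlink("/mnt/huge/example");
        return 0;
    }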
None of these issues are a surprise; developers have seen them coming for
some time. So there are a number of potential solutions waiting in the
wings. What is lacking is a consensus on which solution is the best way to
go.
One piece of the puzzle may be Mel Gorman's fragmentation avoidance work,
which has been discussed here more than once. Mel's patches seek to
separate allocations which can be moved in physical memory from those which
cannot. When movable allocations are grouped together, the kernel can,
when necessary, create higher-order groups of pages by relocating
allocations which are in the way. Some of Mel's work is in 2.6.23; more
may be merged for 2.6.24. The lumpy reclaim patches, also in
2.6.23, encourage the creation of large blocks by targeting adjacent pages
when memory is being reclaimed.
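In the merged code, that separation is expressed through the allocation
flags. The fragment below is only a sketch of the idea, using the flag
names found in 2.6.23 rather than code taken from Mel's patches.

    /* Movable versus unmovable allocations, as the grouping code sees
     * them.  Page cache and user pages can be migrated or reclaimed,
     * so they are marked __GFP_MOVABLE (via GFP_HIGHUSER_MOVABLE) and
     * grouped together; ordinary kernel allocations are not. */
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    static struct page *alloc_movable_page(void)
    {
        /* Can be moved out of the way if a large block is needed. */
        return alloc_page(GFP_HIGHUSER_MOVABLE);
    }

    static void *alloc_unmovable_buffer(void)
    {
        /* Cannot be relocated; kept away from the movable groups. */
        return kmalloc(1024, GFP_KERNEL);
    }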
The immediate cause for the current discussion is a new version of
Christoph Lameter's large block
size patches. Christoph has filled in the largest remaining gap in
that patch set by implementing mmap() support. This code enables
the page cache to manage chunks of file data larger than a single page
which, in turn, addresses many of the I/O and filesystem issues. Christoph
has given a long list of
reasons why this patch should be merged, but agreement is not
universal.
At the top of the list of objections would appear to be the fact that the
large block size patches require the availability of higher-order pages to
work; there is no fallback if memory becomes sufficiently fragmented that
those allocations are not available. So a system which has filesystems
using larger block sizes will fall apart in the absence of large,
contiguous blocks of memory - and, as we have seen, that is not an uncommon
situation on Linux systems. The fragmentation avoidance patches can
improve the situation quite a bit, but there is no guarantee that
fragmentation will not occur, either as a result of the wrong workload or a
deliberate attack. So, if this patch set is merged, some developers want
it to include a loud warning to discourage users (and distributors) from
actually expecting it to work.
An alternative is Nick Piggin's fsblock work. People like to
complain about the buffer head layer in current kernels, but that layer has
a purpose: it tracks the mapping between page cache blocks and the
associated physical disk sectors. The fsblock patch replaces the buffer
head code with a new implementation aimed at better performance and
cleaner abstractions.
One of the things fsblock can do is support large blocks for filesystems.
The current patch does not use higher-order allocations to implement this
support; instead, large blocks are made virtually contiguous in the
vmalloc() space through a call to vmap() - a technique
used by XFS now. The advantage of using vmap() is that the
filesystem code can see large, contiguous blocks without the need for
physical adjacency, so fragmentation is not an issue.
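The technique itself is straightforward; a sketch of how a set of
unrelated pages becomes one virtually contiguous block - illustrative
only, not code from XFS or fsblock - looks like this:

    /* Sketch of the vmap() approach: allocate order-0 pages wherever
     * they happen to be, then map them into a single, virtually
     * contiguous region.  Error handling is abbreviated. */
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    static void *map_large_block(struct page **pages, int nr)
    {
        int i;

        for (i = 0; i < nr; i++) {
            pages[i] = alloc_page(GFP_KERNEL);
            if (!pages[i])
                goto fail;
        }
        /* The filesystem sees one contiguous buffer, but the pages
         * behind it need not be physically adjacent. */
        return vmap(pages, nr, VM_MAP, PAGE_KERNEL);

    fail:
        while (--i >= 0)
            __free_page(pages[i]);
        return NULL;
    }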
On the other hand, using vmap() is quite slow, the address space
available for vmap() on 32-bit systems is small enough to cause
problems, and using vmap() does nothing to help at the I/O level.
So Nick plans to extend fsblock to implement large blocks with contiguous
allocations, but with a fallback to vmap() when large allocations
are not available. In theory, this approach should be the best of both
worlds, giving the benefits of large blocks without unseemly explosions in
the presence of fragmentation. Says Nick:
However fsblock can do everything that higher order pagecache can
do in terms of avoiding vmap and giving contiguous memory to block
devices by opportunistically allocating higher orders of pages, and
falling back to vmap if they cannot be satisfied.
From the conversation, it seems that a number of developers see fsblock as
the future. But it is not something for the near future. The patch is
big, intrusive, and scary, which will slow its progress (and memory
management patches have a tendency to merge at a glacial pace to begin
with). It lacks the opportunistic large block feature. Only the Minix
filesystem has been updated to use fsblock, and that patch was rather
large. Everybody (including Nick) anticipates that more complex
filesystems - those with features like journaling - will present surprises
and require changes of unknown size. Fsblock is not a near-term solution.
One recently-posted patch
from Christoph could help fill in some of the gaps. His "virtual compound
page" patch allows kernel code to request a large, contiguous allocation;
that request will be satisfied with physically contiguous memory if
possible. If that memory is not available, virtually contiguous memory
will be returned instead. Beyond providing opportunistic large block
allocation for fsblock, this feature could conceivably be used in a
number of places where vmalloc() is called now, resulting in
better performance when memory is not overly fragmented.
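The general shape of that opportunistic approach - whether it ends up
living in fsblock or behind Christoph's interface - might look something
like the sketch below, which uses only long-standing kernel primitives
and is not taken from either patch set.

    /* Sketch of opportunistic large-block allocation: try for a
     * physically contiguous chunk first, fall back to virtually
     * contiguous memory when fragmentation defeats that.  A real
     * implementation must remember which path was taken so that the
     * memory can be freed correctly later. */
    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    static void *alloc_large_block(unsigned int order)
    {
        struct page *page;

        /* Best case: one contiguous chunk, directly usable for I/O. */
        page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, order);
        if (page)
            return page_address(page);

        /* Fallback: slower, and useless to devices which need
         * physical contiguity, but far less sensitive to
         * fragmentation. */
        return vmalloc(PAGE_SIZE << order);
    }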
Meanwhile, Andrea Arcangeli has been relatively quiet for some time, but one should
not forget that he is the author of much of the VM code in the kernel now.
He advocates a different approach entirely:
From my part I am really convinced the only sane way to approach
the VM scalability and larger-physically contiguous pages problem
is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE from Hugh for
2.4).
The CONFIG_PAGE_SHIFT patch
is a rework of an old idea: separate the size of a page as seen by the
operating system from the hardware's notion of the page size. Hardware
pages can be clustered together to create larger software pages which, in
turn, become the basic unit of memory management. If all pages in the
system were, say, 64KB in length, a 64KB buffer would be a single-page
allocation with no fragmentation issues at all.
If the system is to go to larger pages, creating them in software is about
the only option. Most processors support more than one hardware page size,
but the smallest of the larger page sizes tends to be too large for general
use. For example, i386 processors have no page sizes between 4KB and 2MB.
Clustering pages in software enables the use of more reasonable page sizes
and creates the flexibility needed to optimize the page size for the
expected load on the system. This approach will make large block support
easy, and it will help with the I/O performance issues as well. Page
clustering is not helpful for TLB pressure problems, but there is little to
be done there in any sort of general way.
The biggest problem, perhaps, with page clustering is that it replaces
external fragmentation with internal fragmentation. A 64KB page will, when
used as the page cache for a 1KB file, waste 63KB of memory. There are
provisions in Andrea's patch for splitting large pages to handle this
situation; Andrea claims that this splitting will not lead to the same sort
of fragmentation seen on current systems, but he has not yet convinced
the others on this point.
Conclusions from this discussion are hard to come by; at one point Mel
Gorman asked: "Are we going to agree
on some sort of plan or are we just going to handwave ourselves to
death?" Linus has just called the whole
discussion "idiotic". What may happen is that the large block size
patches go in - with warnings - as a way of keeping a small subset of users
happy and providing more information about the problem space. Memory
management hacking requires a certain amount of black-magic handwaving in
the best of times; there is no reason to believe that the waving of hands
is going to slow down anytime soon.