|
|
Log in / Subscribe / Register

Toward better performance on large-memory systems

By Jonathan Corbet
May 2, 2018

LSFMM
Christoph Lameter works in a different computing environment than most of us; he supports high-volume trading applications that need every bit of performance that the fastest hardware can give them. Even then, it seems that isn't fast enough. In a memory-management-track session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Lameter described some of the problems he has encountered and approaches he is considering to address them.

He is working with a large system that has four 10Gb/s network interfaces on it. Those interfaces run at full speed during bursts of activity, pouring data into the page cache, which sits in 1TB of main memory. That data is then archived onto a 100TB storage array. This system is vital for his company's operations; it must be possible to answer questions from regulators on exactly when any given packet arrived. It is important, he said, that the system works.

Unfortunately, he is having problems writing data to the archive; the maximum data rates from the page cache are 4-5GB/s. The only way the system can keep up is to remove some of the network interfaces. A much higher data rate (around 10GB/s) could be obtained by using direct I/O and avoiding the page cache altogether, but he would rather not do that. There are also a number of analysis processes running on this system, and they benefit from having the data in the page cache. So he's not sure what to do.

The problem is going to get worse; the company is upgrading to 100Gb/s interfaces and wants to achieve rates of 40GB/s to the disk. Lameter thinks that these performance issues are specific to Intel hardware; everything [Christoph
Lameter] works as desired on POWER8 processors. The root of the problem is the 4KB system page size, which is quite small on such a system. Unfortunately, filesystems and block devices in Linux do not understand huge pages.

He is considering trying to increase the base page size above 4KB. This is not a new idea; developers have been trying to make this work for years. Hugh Dickins noted that William Lee Irwin's page clustering patches worked fifteen years ago, but they proved to be too complex to be merged. A larger page size can also break some applications that expect to be able to map pages on a 4KB granularity.

Lameter has done some work on the "order N page cache", which would be able to store pages of any order. But he never got around to implementing mmap(), he said, and the patch had too many fragmentation problems to be merged. An alternative is transparent huge pages in the page cache. The problem with that is that no filesystem (other than tmpfs) supports it currently. He has looked at (ab)using the DAX mechanism to get huge-page mappings in ordinary memory; one could also use DAX for real on nonvolatile memory.

Returning to the 4KB page-size issue, Lameter pointed out that, at that size, about 2% of the system's memory is taken up by page structures. A system with 4TB of RAM must manage one-billion page structures. There are systems supporting 20TB of nonvolatile RAM coming soon, he said; the overhead is becoming unsupportable. But Dickins said that he wasn't worried about a 2% space tax.

Almost any solution to Lameter's performance problems is going to require more reliable allocation of large, physically contiguous memory areas. He has been playing with an XArray cache for memory chunks that would allow them to be moved as needed. Simple slab allocator support has been implemented, but a lot of work is needed still; allocators need to provide callbacks to allow memory chunks to be relocated. Alternatives include reserving memory at boot for large allocations, the MAP_CONTIG option to mmap(), or Java-style garbage collection. "But we are all horrified" by that last idea, he said.

Rik van Riel suggested working with the filesystem developers to support larger block sizes. But some filesystems (XFS, for example) already have that support; the real problem is in the page cache. Dickins suggested creating a mechanism like transparent huge pages that would opportunistically allocate larger chunks for the page cache; those chunks would be smaller than typical huge pages, though.

From that point on, the conversation went around in circles about whether these performance issues are truly a hardware problem or not. There was no useful outcome from that discussion, though, and it petered out as beer time approached.

Index entries for this article
KernelMemory management/Scalability
ConferenceStorage, Filesystem, and Memory-Management Summit/2018


to post comments

Toward better performance on large-memory systems

Posted May 2, 2018 14:50 UTC (Wed) by jhhaller (guest, #56103) [Link] (2 responses)

As Intel found while developing DPDK for network acceleration, page table cache eviction in hardware was to be avoided. This was one of the major impetus for using huge pages in DPDK. The 2% tax for page tables may not be an issue for RAM utilization, but the number of hardware cache entries backing the page tables does become a problem. Whether this is the problem Christoph is running into would need more work, such as monitoring page cache eviction rates.

Toward better performance on large-memory systems

Posted May 3, 2018 18:18 UTC (Thu) by hansendc (subscriber, #7363) [Link]

The 2% thing is for 'struct page', not page tables.

Toward better performance on large-memory systems

Posted May 3, 2018 22:04 UTC (Thu) by Paf (subscriber, #91811) [Link]

As someone who has worked with i/o performance a lot, the basic problem tends to be the amount of per page overhead and contention when getting/marking dirty/freeing pages, along with the file system overhead per page, which can be significant. Also, if we really got nice big pages, we could justify turning on the FPU and using those gorgeous vectorized memcopy instructions. As is, at 4K page size with recent-ish Xeons, it’s slightly worse than a wash to do that, and that’s when the FPU isn’t really being used by userspace. If it is, turning it on and off in kernel takes even longer.

I’m a little surprised his numbers are so low - that’s more like a “single file” limit. Perhaps that’s what he’s talking about. For multiple files, I’ve seen more like 6.5 GB/s, and that system was network bandwidth limited at that point. It was not limited primarily by the page cache.

Page cache needs improving, though, for sure. More batch operations would help some.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds