
Memory management for 400Gb/s interfaces

By Jonathan Corbet
May 8, 2019

LSFMM
Christoph Lameter has spent years improving Linux for high-performance computing tasks. During the memory-management track of the 2019 Linux Storage, Filesystem, and Memory-Management Summit, he talked about the problem of keeping up with a 400Gb/s network interface. At that speed, there simply is no time for the system to get its work done. Some ways of improving the situation are in sight, but it's a hard problem overall and, despite some progress, the situation is getting worse.

The problem is that, at those data rates, the kernel's page cache is overwhelmed and simply cannot keep up. That is not entirely the kernel's fault; there is an increasing mismatch between interface speeds and memory speeds. As a result, sites have stopped upgrading their Infiniband fabrics; there is no point in making the fabric go any faster. A PCIe 3 bus can manage 1GB/s in each lane; x86 systems have 44 lanes, all of which must be used together to keep up with a 400Gb/s interface. So extra capacity on the fabric side is not useful.
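The lane arithmetic above can be checked with a quick back-of-envelope calculation (assuming, as in the talk, roughly 1GB/s of usable bandwidth per PCIe 3 lane):

```python
# Back-of-envelope check of the PCIe 3 lane math from the talk.
# Assumes ~1 GB/s usable per lane, which is an approximation.
GB_PER_LANE = 1              # approximate usable GB/s per PCIe 3 lane
nic_gbit = 400               # interface speed in Gb/s
nic_gbyte = nic_gbit / 8     # 50 GB/s of wire data
lanes_needed = nic_gbyte / GB_PER_LANE
print(lanes_needed)          # 50.0 -- more than the 44 lanes cited
```

A 400Gb/s interface thus wants more PCIe 3 lanes than the cited x86 systems even have, which is why all of them must be dedicated to the task.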

PCIe 4 offers a bit of relief in the form of a doubled transfer rate but, Lameter said, that effort is currently stalled. Meanwhile latencies are high. The whole Intel computing paradigm is in trouble, he said; it is no longer suitable for high-performance computing. The OpenCAPI architecture is somewhat faster than PCIe, but it is only available on POWER9 systems. The fastest interlink available currently is NVIDIA's NVLink, which can attain 300GB/s; that too is only available on POWER9.

In the area of memory bandwidth, processor vendors are adding memory channels; Intel has six of them now, AMD has eight. But that adds more pins and complicates routing. These systems can move 20GB/s in each channel, which puts an upper bound on what any individual thread can do; a single thread cannot keep up with even a 100Gb/s network interface. So multiple cores are needed to get the job done. There is some potential in GDDR and HBM memory; those, combined with NVLink, show that it is possible to do better than current systems do.
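One way to see why a single thread cannot keep up is that received data must cross the memory bus more than once: the device DMAs it in, and the CPU must then read it (and usually copy it at least once). The crossing count below is an illustrative assumption, not a measurement:

```python
# Why one memory channel's worth of bandwidth (~20 GB/s) is not
# enough for a 100Gb/s NIC: the data crosses the bus repeatedly.
# The number of crossings is an assumption for illustration.
channel_gbps = 20               # GB/s per memory channel
nic_gbyte = 100 / 8             # 12.5 GB/s of wire data
crossings = 3                   # DMA write + CPU read + copy (assumed)
demand = nic_gbyte * crossings  # 37.5 GB/s of memory traffic
print(demand > channel_gbps)    # True -- one channel cannot keep up
```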

Jesper Brouer has done a lot of work improving the performance of the kernel's network stack; he was able to get up to a rate of 10Gb/s. But when the data rate is raised to 100Gb/s, there are only 120ns available to process each packet; the system cannot take even a single cache miss and keep up. So that kind of network processing must be done in hardware. The development of the express data path (XDP) mechanism is another sign that you just cannot use the network stack at those rates. Moving some functions, such as checksums and timestamps, to the interfaces can help somewhat.
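The "120ns per packet" budget falls out of the wire math: at 100Gb/s, a full-size Ethernet frame occupies the link for only about 123ns (counting the 1538 bytes on the wire, including preamble and inter-frame gap):

```python
# Derivation of the ~120ns-per-packet budget at 100Gb/s.
frame_bytes = 1538           # max frame incl. preamble and inter-frame gap
link_bps = 100e9             # 100 Gb/s
ns_per_packet = frame_bytes * 8 / link_bps * 1e9
print(round(ns_per_packet))  # 123 -- a single DRAM miss alone costs ~100ns
```

Since a cache miss to DRAM costs on the order of 100ns by itself, a single miss consumes nearly the entire budget.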

Then, there are problems with direct I/O in the kernel; it works with arrays of pointers to 4KB pages, meaning there is little opportunity for batching. 1GB transfers are thus relatively slow. The 5.1 kernel has improved the situation by allowing for larger chunks of data to be managed; that results in lower cache use, fewer memory allocations, and less out-of-band data to communicate to devices — and, thus, higher performance. But this is a new feature that will not make its way into the major distributions for some time.
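The scale of that per-page overhead is easy to quantify: describing a 1GB transfer as an array of pointers to 4KB pages requires over a quarter-million entries:

```python
# Per-page bookkeeping for a 1GiB direct-I/O transfer done in
# 4KB pages, as described above.
transfer = 1 << 30             # 1 GiB
page = 4096                    # 4 KiB page size
pointers = transfer // page    # 262144 page pointers
metadata = pointers * 8        # bytes of pointer array on 64-bit
print(pointers, metadata >> 20)  # 262144 pointers, 2 MiB of metadata
```

Every one of those entries must be allocated, filled in, and communicated to the device, which is the out-of-band overhead that larger chunks eliminate.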

The kernel's page cache, Lameter said, simply does not scale. The fact that it can't work with large pages makes things worse; users have to use direct I/O or bypass the kernel entirely, which should not be necessary. That said, there has been some progress. The XArray data structure enables handling multiple page sizes in the page cache. The slab movable objects work can help to address fragmentation. Work is being done to avoid acquiring the mmap_sem lock while handling page faults, and support for huge pages is being added to filesystems. One option that has not been pursued, he said, is to create a kernel that uses 2MB as its base page size or increasing the base size to an intermediate value by grouping 4KB pages.

There is some value in persistent memory, which is attached to the memory channels and is thus fast. The DAX mechanism can be used to avoid the page cache altogether. This storage is currently limited in size, though, and cannot be used with RDMA due to the well-discussed problems with get_user_pages().

In the future, he said, kernel developers need to be thinking about terabit streams. There is 3D hologram streaming on the horizon, he said. We increasingly need to move massive amounts of data, but everybody is busy trying to avoid the kernel's limitations to get this work done. Part of the solution, eventually, will be new hardware architectures for high-performance computing.

It would be nice, he concluded, if the memory-management subsystem had a road map showing how it plans to meet these challenges. In the brief moment before the session ended, Matthew Wilcox said that not having a road map is not necessarily a bad thing. The development community is indeed working on these problems; each developer has taken on one piece of it. Coordinating all of this work is what LSFMM is all about; he now knows what others need from the subsystem.

Index entries for this article
Kernel: Networking/Performance
Conference: Storage, Filesystem, and Memory-Management Summit/2019



Memory management for 400Gb/s interfaces

Posted May 9, 2019 4:54 UTC (Thu) by dmaas (guest, #38073) [Link]

Hi Jonathan - Really appreciate your coverage of the Linux summit highlights. These talk summaries are one of the top reasons I subscribe to LWN. Keep up the good work!

x86 systems have 44 lanes

Posted May 9, 2019 5:14 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (1 responses)

The comment about x86 maxing out at 44 lanes is misleading at best. There are CPU manufacturers with such limited processors. At the same time, server chips from AMD (EPYC) provide 128 PCIe lanes. The next generation of EPYC will move to PCIe 4.0 and could provide up to 162 lanes per processor.

x86 systems have 44 lanes

Posted May 9, 2019 8:23 UTC (Thu) by gebi (guest, #59940) [Link]

AFAIK the 162 PCIe 4 lanes figure is not entirely correct. AFAIK this number is for a dual-EPYC system which does not use the full 64 lanes for inter-CPU communication (so it's not 162 lanes per CPU).

From those 162 lanes you mentioned you have to IMHO subtract 2 PCIe lanes which are reserved strictly for low-speed devices (remote management), thus leaving you with 160 lanes for a dual-CPU AMD EPYC system, which is awesome!

Memory management for 400Gb/s interfaces

Posted May 9, 2019 14:38 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

So, people can 'engineer' network interfaces whose capabilities cannot be used with run-of-the-mill x86 hardware? Sounds like a problem for the sales dept :->.

Memory management for 400Gb/s interfaces

Posted May 21, 2019 4:47 UTC (Tue) by gwolf (subscriber, #14632) [Link]

...Similar to the ARM SoCs that have Gigabit Ethernet... Connected via a USB2 hub :-Þ

Memory management for 400Gb/s interfaces

Posted May 9, 2019 15:32 UTC (Thu) by nilsmeyer (guest, #122604) [Link]

> Intel has six of them now, AMD has eight.

Technically, current AMD CPUs have 4 times 2 channels.

B and b

Posted May 10, 2019 12:35 UTC (Fri) by epa (subscriber, #39769) [Link]

Does the article use GB to mean gigabyte and Gb to mean gigabit? Would be good to clarify, or better, use the same units throughout.

Memory management for 400Gb/s interfaces

Posted May 13, 2019 2:55 UTC (Mon) by anatolik (guest, #73797) [Link] (7 responses)

I wonder if unikernel frameworks like Unicycle [1] are a better option for ultra-low latency + high-throughput applications.

[1] https://github.com/libunicycle/unicycle

Memory management for 400Gb/s interfaces

Posted May 13, 2019 15:01 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

I've had a (somewhat short) look at this. The text concludes with "230 microseconds mean response time is pretty impressive". That's 0.00023s. Response times in this order of magnitude (albeit for a simpler protocol) can be achieved with a perl process running atop a stock Linux kernel which has to contact a second process on the same computer for a database lookup, IOW, that's not impressive at all.

Memory management for 400Gb/s interfaces

Posted May 13, 2019 16:38 UTC (Mon) by anatolik (guest, #73797) [Link] (1 responses)

You are really comparing apples to oranges. A two-hop network path adds its own latency to the response time. To compare the Unicycle demo with your example fairly, you need to make your perl script parse an HTTP request, generate a static HTTP response, and then run it on another desktop-grade computer connected via an Ethernet bridge.

> Response times in this order of magnitude
It would be great to get numbers for similar workload (host-bridge-host with Apache) and see how much improvement unikernel solution provides.

Some time ago I was working at Google on low-level optimizations for their latency-critical datacenter services. I know that Google's engineers would die to improve latency by 10%. If some solution like unikernel apps can reduce latency by 30-50 percent, then it is going to be a big deal.

Memory management for 400Gb/s interfaces

Posted May 13, 2019 18:07 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]

If the response time wasn't dominated by the processing time on the server side, this "optimization" would just end up being even more useless. Application protocol processing time is also not supposed to be improved by this, hence the protocol doesn't matter that much. It's still text based and there's even another intermediate application involved. The processing path is thus roughly "Perl program receives text-based request, makes a text based request to a program written in C which makes another request using some proprietary protocol to a 3rd application which then does a database lookup, sends the result back to program #2 which creates a text-based reply to send it to program #1 which then composes a text reply to the original requester." And the combined latency of this is comfortably in the 10^-4s range.

Memory management for 400Gb/s interfaces

Posted May 15, 2019 16:38 UTC (Wed) by naptastic (guest, #60139) [Link]

Yeah, but it has to hit that latency target *every* *time* regardless of what else is happening on the system. Stock Linux suuuuuucks at reliable low-latency performance.

<rant>

My audio interface runs 32 channels at 24-bit 96kHz, which works out to a measly 10 MiB/s (rounding up). Four orders of magnitude less data than these network adapters can move. In order to get glitch-free audio on a stock Linux kernel, I have to increase my acceptable latency into the tens of ms. Four orders of magnitude slower than what these network adapters need. With those settings, fewer than 100 transfers per second happen between CPU and sound card. The puniest of bandwidth requirements, and extremely lenient latency requirements, and the kernel can just barely hang on.
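The commenter's arithmetic checks out: 32 channels of 24-bit (3-byte) samples at 96kHz comes to just under 9 MiB/s, which rounds up to the cited 10 MiB/s:

```python
# The audio-bandwidth arithmetic from the comment above:
# 32 channels x 24-bit samples x 96kHz sample rate.
channels, bytes_per_sample, rate = 32, 3, 96000
bps = channels * bytes_per_sample * rate  # 9,216,000 bytes/s
mib = bps / (1 << 20)
print(round(mib, 1))  # 8.8 -- "a measly 10 MiB/s (rounding up)"
```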

Using the MuQSS process scheduler, I can get 1.33ms latency with no dropouts at all; that amounts to ~750 transfers per second. Performance only got really reliable once I ran root-over-NFS and disabled the on-board SATA controller; I'm not sure what to make of that. I haven't tried the -rt patchset in a few years, since tuning it is a lot of work, and the best I ever got still wasn't as reliable as MuQSS, last time I compared them.

I'm always happy when the networking stack develops latency problems, because that's when they get fixed.

</rant>

Memory management for 400Gb/s interfaces

Posted May 14, 2019 12:42 UTC (Tue) by da4089 (subscriber, #1195) [Link] (2 responses)

400 Gbps ~= 50 GBps. Let's say we had a 5GHz CPU, that's 10 bytes per clock cycle.
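That per-cycle budget is straightforward to verify:

```python
# The commenter's per-cycle budget: a 400Gb/s link against a 5GHz clock.
link_bps = 400e9             # 400 Gb/s
clock_hz = 5e9               # 5 GHz CPU clock
bytes_per_cycle = link_bps / 8 / clock_hz
print(bytes_per_cycle)       # 10.0 bytes must move every single cycle
```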

Even if the CPU was able to actually read 10 bytes per cycle on a sustained basis, there's no time for any processing.

At 10Gbps, with a lot of care, you can still do useful work with a general purpose CPU. At 100Gbps and beyond? Forget it.

Unless there's a 100x performance boost for servers waiting around the corner, this ship has sailed. There's not a lot of point trying to make the software cope with I/O which simply outstrips the processor.

Memory management for 400Gb/s interfaces

Posted May 14, 2019 13:59 UTC (Tue) by excors (subscriber, #95769) [Link]

I don't see why a CPU should have much trouble sustaining 10 bytes per cycle. Recent Intel desktop CPUs can read up to 64B/cycle and write 32B/cycle (per https://en.wikichip.org/wiki/intel/microarchitectures/sky...), with server chips doubling that, on a single CPU core. They can also do 64B/cycle between L1 and L2, and DRAM going at 20GB/sec per channel with 6 channels is around 25B per CPU cycle at 5GHz, so it should be able to sustain well over 10B/cycle. And you shouldn't run out of time to process the data, since the processing can run concurrently with the memory accesses, and multiple cores with AVX etc can give enormous processing throughput.

Memory management for 400Gb/s interfaces

Posted May 17, 2019 14:57 UTC (Fri) by flussence (guest, #85566) [Link]

That's why modern high-end network cards and drivers have mechanisms for steering packets to specific cores/NUMA nodes. Assuming a more realistic 2.5GHz, one cycle per 20 bytes isn't useful, but 128 cores is. That's almost enough to build a router out of reasonably-priced x86 hardware (FSVO "reasonable" relative to Cisco).

Memory management for 400Gb/s interfaces

Posted May 19, 2019 5:33 UTC (Sun) by shentino (guest, #76459) [Link]

At least on architectures that support it, the kernel should provide a way for >4KiB pages to be used, whether it's in the page cache, swap cache, disk I/O or anything else.

There's no reason not to find a way to allow it to be exploited in some way.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds