Going big with TCP packets
Imagine, for a second, that you are trying to keep up with a 100Gb/s network adapter. As networking developer Jesper Brouer described back in 2015, if one is using the longstanding maximum packet size of 1,538 bytes, running the interface at full speed means coping with over eight million packets per second. At that rate, the CPU has all of about 120ns to do whatever is required to handle each packet, which is not a lot of time; a single cache miss can ruin the entire processing-time budget.
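The arithmetic behind those numbers is easy to reproduce; this minimal user-space sketch (assuming the full 1,538 bytes on the wire) gives roughly the same figures:

    /* Back-of-the-envelope packet budget for a 100Gb/s link. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps = 100e9;           /* 100Gb/s */
        const double frame_bits = 1538.0 * 8;    /* bits per 1,538-byte frame */
        double pps = link_bps / frame_bits;      /* roughly 8.1 million packets/second */
        double ns_budget = 1e9 / pps;            /* roughly 123ns per packet */

        printf("%.1f million packets/sec, %.0f ns per packet\n",
               pps / 1e6, ns_budget);
        return 0;
    }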
The situation gets better, though, if the number of packets is reduced, and that can be achieved by making packets larger. So it is unsurprising that high-performance networking installations, especially local-area networks where everything is managed as a single unit, use larger packet sizes. With proper configuration, packet sizes up to 64KB can be used, improving the situation considerably. But, in settings where data is being moved in units of megabytes or gigabytes (or more — cat videos are getting larger all the time), that still leaves the system with a lot of packets to handle.
Packet counts hurt in a number of ways. There is a significant fixed overhead associated with every packet transiting a system. Each packet must find its way through the network stack, from the upper protocol layers down to the device driver for the interface (or back). More packets means more interrupts from the network adapter. The sk_buff structure ("SKB") used to represent packets within the kernel is a large beast, since it must be able to support just about any networking feature that may be in use; that leads to significant per-packet memory use and memory-management costs. So there are good reasons to wish for the ability to move data in fewer, larger packets, at least for some types of applications.
The length of an IP packet is stored in the IP header; for both IPv4 and IPv6, that length lives in a 16-bit field, limiting the maximum packet size to 64KB. At the time these protocols were designed, a 64KB packet could take multiple seconds to transmit on the backbone Internet links that were available, so it must have seemed like a wildly large number; surely 64KB would be more than anybody would ever rationally want to put into a single packet. But times change, and 64KB can now seem like a cripplingly low limit.
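That constraint is visible in the header layout itself; the sketch below is a simplified, illustrative rendering of the fixed IPv6 header (not the kernel's struct ipv6hdr), showing the 16-bit payload-length field that imposes the 64KB ceiling:

    /* Simplified layout of the fixed IPv6 header (illustrative only;
     * byte order and bit packing are ignored).  The payload length is a
     * 16-bit field, so a normal IPv6 packet cannot describe more than
     * 65,535 bytes of payload. */
    #include <stdint.h>

    struct ipv6_fixed_header {
        uint32_t version_class_flow;   /* version, traffic class, flow label */
        uint16_t payload_len;          /* 16 bits: at most 64KB of payload */
        uint8_t  next_header;          /* e.g. TCP, or a hop-by-hop header */
        uint8_t  hop_limit;
        uint8_t  source_addr[16];
        uint8_t  dest_addr[16];
    };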
Awareness of this problem is not especially recent: there is a solution (for IPv6, at least) to be found in RFC 2675, which was adopted in 1999. The IPv6 specification allows the placement of "hop-by-hop" headers with additional information; as the name suggests, a hop-by-hop header is used to communicate options between two directly connected systems. RFC 2675 enables larger packets with a couple of tweaks to the protocol. To send a "jumbo" packet, a system must set the (16-bit) IP payload length field to zero and add a hop-by-hop header containing the real payload length. The length field in that header is 32 bits, meaning that jumbo packets can contain up to 4GB of data; that should surely be enough for everybody.
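The option itself is tiny; a similarly illustrative sketch of the RFC 2675 jumbo payload option (carried inside a hop-by-hop extension header) shows where the 32-bit length lives:

    /* RFC 2675 jumbo payload option (illustrative sketch; on the wire the
     * fields are packed, so structure padding is ignored here).  The main
     * header's 16-bit payload length is set to zero and the real length is
     * carried in this 32-bit field, allowing packets of up to 4GB. */
    #include <stdint.h>

    struct ipv6_jumbo_option {
        uint8_t  opt_type;       /* 0xC2: the jumbo payload option */
        uint8_t  opt_data_len;   /* length of the option data: 4 bytes */
        uint32_t jumbo_len;      /* actual payload length */
    };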
The BIG TCP patch set from Eric Dumazet adds the logic necessary to generate and accept jumbo packets when the maximum transmission unit (MTU) of a connection is set sufficiently high. Unsurprisingly, there were a number of details to manage to make this actually work. One of the more significant issues is that packets of any size are rarely stored in physically contiguous memory, which tends to be hard to come by in general. For zero-copy operations, where the buffers live in user space, packets are guaranteed to be scattered through physical memory. So packets are represented as a set of "fragments", which can be as short as one (4KB) page each; network interfaces handle the task of assembling packets from fragments on transmission (or fragmenting them on receipt).
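Conceptually, that representation looks something like the sketch below; the names are purely illustrative, since in the kernel the real information is carried in the SKB and its associated shared-info area:

    /* Illustrative sketch of a fragment-based packet representation.
     * Each fragment is a (page, offset, length) triple; the packet as a
     * whole is described by an array of such fragments. */
    struct page;                          /* opaque handle for a memory page */

    struct packet_fragment {
        struct page *page;                /* page holding this piece of data */
        unsigned int offset;              /* where the data starts in the page */
        unsigned int length;              /* number of bytes in this fragment */
    };

    struct packet_data {
        unsigned int nr_frags;            /* fragments actually in use */
        struct packet_fragment frags[17]; /* historically capped at 17 */
    };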
Current kernels limit the number of fragments stored in an SKB to 17, which is sufficient to store a 64KB packet in single-page chunks. That limit will clearly interfere with the creation of larger packets, so the patch set raises the maximum number of fragments (to 45). But, as Alexander Duyck pointed out, many interface drivers encode assumptions about the maximum number of fragments that a packet may be split into. Increasing that limit without fixing the drivers could lead to performance regressions or even locked-up hardware, he said.
After some discussion, Dumazet proposed working around the problem by adding a configuration option controlling the maximum number of allowed fragments for any given packet. That is fine for sites that build their own kernels, which prospective users of this feature are relatively likely to do. It offers little help for distributors, though, who must pick a value for this option for all of their users.
In any case, many drivers will need to be updated to handle jumbo packets. Modern network interfaces perform segmentation offloading, meaning that much of the work of creating individual packets is done within the interface itself. Making segmentation offloading work with jumbo packets tends to involve a small number of tweaks; a few drivers are updated in the patch set.
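Segmentation itself is conceptually simple; the toy user-space sketch below shows what the hardware (or the kernel's software fallback) does at its core, ignoring the header replication, checksum fixups, and other details that make real offload support fiddly:

    /* Toy illustration of segmentation: split one large payload into
     * MSS-sized pieces.  Real TSO/GSO also builds and fixes up protocol
     * headers for every resulting packet. */
    #include <stdio.h>

    static void segment(size_t payload_len, size_t mss)
    {
        for (size_t offset = 0; offset < payload_len; ) {
            size_t chunk = payload_len - offset;

            if (chunk > mss)
                chunk = mss;
            printf("packet: payload bytes %zu..%zu\n",
                   offset, offset + chunk - 1);
            offset += chunk;
        }
    }

    int main(void)
    {
        segment(185000, 1448);   /* a BIG-TCP-sized send, a typical TCP MSS */
        return 0;
    }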
One other minor problem has to do with the placement of the RFC 2675 hop-by-hop header. These headers, per the IPv6 standard, are placed immediately after the IP header; that can confuse software that "knows" that the TCP header can be found immediately after the IP header in a packet. The tcpdump utility has some problems in this regard; it also seems that there is a fair number of BPF programs in the wild that contain this assumption. For this reason, jumbo-packet handling is disabled by default, even if the underlying hardware and link could handle those packets.
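The fragile pattern is easy to demonstrate outside the kernel. In the hypothetical user-space sketch below (not code from tcpdump or any real BPF program), the first helper blindly assumes that TCP follows the 40-byte IPv6 header, while the second checks the next-header field and skips a hop-by-hop header when one is present:

    #include <stdint.h>
    #include <stddef.h>

    #define NEXTHDR_HOP  0   /* hop-by-hop options header */
    #define NEXTHDR_TCP  6

    /* Fragile: assumes TCP immediately follows the 40-byte IPv6 header. */
    const uint8_t *tcp_header_naive(const uint8_t *pkt)
    {
        return pkt + 40;
    }

    /* Safer: look at the next-header field and skip a hop-by-hop header
     * (such as one carrying the RFC 2675 jumbo option) when present. */
    const uint8_t *tcp_header_checked(const uint8_t *pkt)
    {
        uint8_t next = pkt[6];          /* next-header field of the IPv6 header */
        const uint8_t *p = pkt + 40;

        if (next == NEXTHDR_HOP) {
            next = p[0];                /* header following the options */
            p += (p[1] + 1) * 8;        /* extension length is in 8-byte units */
        }
        return next == NEXTHDR_TCP ? p : NULL;
    }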
Dumazet included some brief benchmark results with the patch posting. Enabling a packet size of 185,000 bytes increased network throughput by nearly 50% while also reducing round-trip latency significantly. So BIG TCP seems like an option worth having, at least in the sort of environments (data centers, for example) that use high-speed links and can reliably deliver large packets. If tomorrow's cat videos arrive a little more quickly, BIG TCP may be part of the reason.
See Dumazet's 2021 Netdev talk on this topic for more details.
Index entries for this article:
Kernel: Networking/IPv6
Posted Feb 14, 2022 21:39 UTC (Mon)
by dcg (subscriber, #9198)
[Link] (12 responses)
Posted Feb 15, 2022 2:19 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (11 responses)
(The above describes my experience with extremely large-scale services. For "normal-sized" services, you're probably fine with an HDD or ten, but if you've suddenly decided to build your own private YouTube, you should at least run the numbers, just in case.)
Posted Feb 15, 2022 5:21 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (7 responses)
Posted Feb 15, 2022 10:36 UTC (Tue)
by mageta (subscriber, #89696)
[Link] (6 responses)
Posted Feb 15, 2022 14:44 UTC (Tue)
by HenrikH (subscriber, #31152)
[Link] (5 responses)
Posted Feb 15, 2022 14:59 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (4 responses)
But I reorganised (as in archived last year's) loads of mailing lists, and it was noticeable that, even after Thunderbird came back and said "finished", the disk subsystem - a raid-5 array - was desperately struggling to catch up. With 32GB of ram, provided it caches everything fine I'm not worried, but it's still a little scary knowing there's loads of stuff in the disk queue flushing as fast as possible ...
Cheers,
Wol
Posted Aug 1, 2022 23:28 UTC (Mon)
by WolfWings (subscriber, #56790)
[Link] (3 responses)
Since that's limited entirely by the write speed of the 2TB drive I've been thinking about adding a single NVMe exclusively as a hot-spare just to reduce that time down to about 30 minutes TBH.
Posted Aug 1, 2022 23:39 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (2 responses)
A four or five drive raid-6 reduces the danger of a disk failure. An NVMe cache will speed up write time. And the more spindles you have, the faster your reads, regardless of array size.
Cheers,
Wol
Posted Aug 2, 2022 21:13 UTC (Tue)
by bartoc (guest, #124262)
[Link] (1 responses)
Not to mention the additional cost of the bigger drives (per TB) is offset by needing less "other crap" to drive them. You need smaller RAID enclosures, fewer controllers/expanders/etc, less space, and so on.
A four drive raid6 is pretty pointless, you get the write hole and the write amplification for a total of .... zero additional space efficiency. Just use a check summing raid10 type filesystem. IMHO 8-12 disks is the sweet spot for raid6.
fun quote from the parity declustering paper, published 1992:
> Since the time necessary to reconstruct the contents of a failed disk is certainly minutes and possibly hours, we focus this paper on the performance of a continuous-operation storage subsystem during on-line failure recovery.
My last raid rebuild was I think 5 full days long, using a small array of 18 TB disks.
Posted Aug 2, 2022 23:01 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
But a four-drive raid-10 is actually far more dangerous to your data ... A raid 6 will survive a double disk failure. A double failure on a raid 10 has - if I've got my maths right - a 30% chance of trashing your array.
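(Spelling that out for a four-drive array built as two mirrored pairs: once the first drive has died, only one of the three surviving drives is its mirror partner, so a second, independent failure takes out the array with probability 1/3, or roughly 33%.)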
md-raid-6 doesn't (if configured that way) have a write hole any more.
> Btrfs raid6 is said to be "unsafe" but in reality, it is probably safer than mdraid raid6 or a raid controller's raid6.
I hate to say it, but I happen to KNOW that btrfs raid6 *IS* unsafe. A lot LESS safe than md-raid-6. It'll find problems better, but it's far more likely that those problems will have trashed your data. Trust me on that ...
At the end of the day, different raids have different properties. And most of them have their flaws. Unfortunately, at the moment btrfs parity raid is more flaw and promise than reality.
Oh - and my setup - which I admit chews up disk - is 3-disk raid-5 with spare, over dm-integrity and under lvm. Okay, it'll only survive one disk failure, but it will also survive corruption, just like btrfs. But each level of protection has its own dedicated layer - the Unix "do one thing and do it well". Btrfs tries to do everything - which does have many advantages - but unfortunately it's a "jack of all trades, crap at some of them", one of which unfortunately is parity raid ...
And while I don't know whether my layers support trim, if they do, btrfs has no advantage over me on time taken to rebuild/recover an array. Btrfs knows what part of the disk is in use, but so does any logical/physical device that supports trim ...
Cheers,
Wol
Posted Feb 15, 2022 10:35 UTC (Tue)
by Sesse (subscriber, #53779)
[Link]
Posted Feb 15, 2022 17:45 UTC (Tue)
by rgmoore (✭ supporter ✭, #75)
[Link] (1 responses)
I assume that with something like YouTube, the final solution involves multiple levels of caching with different tradeoffs between latency and cost. With a site that is accessed mostly by browsing, you can get advance notice of what items are likely to be looked at soon based on what's on the users' screens and can pre-populate the cache accordingly. I'm sure there are engineers whose whole job is to optimize this behavior. It also makes me wonder if some of the dreaded YouTube algorithm isn't built around trying to steer viewers into stuff that's popular right now because it's sure to be available in cache.
Posted Feb 16, 2022 10:13 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
[1] See https://media.wizards.com/2022/downloads/MagicCompRules%2... for the MTG rules. But don't actually try to read them, because they're intended to be used for resolving specific issues, not as a "here's how to play the game" starting point.
[2]: https://blog.youtube/inside-youtube/on-youtubes-recommend...
[3]: The linked blog post is Google/YouTube's official position on the matter, and may not be reflective of my own personal views (which I have no interest in discussing). It's also from September 2021, so things may have changed since then.
Posted Feb 15, 2022 5:20 UTC (Tue)
by alison (subscriber, #63752)
[Link]
Presumably "throughput" involves getting the BIG TCP packet off the NIC via DMA, probably of the scatter-gather variety for so much data. It's remarkable that the speed of these transfers is sufficient for a 50% speed-up.
Posted Feb 15, 2022 10:28 UTC (Tue)
by mageta (subscriber, #89696)
[Link]
> Current kernels limit the number of fragments stored in an SKB to 17, which is
> sufficient to store a 64KB packet in single-page chunks. That limit will
> clearly interfere with the creation of larger packets, so the patch set raises
> the maximum number of fragments (to 45). But, as Alexander Duyck pointed out,
> many interface drivers encode assumptions about the maximum number of fragments
> that a packet may be split into. Increasing that limit without fixing the
> drivers could lead to performance regressions or even locked-up hardware, he
> said.
>
> After some discussion, Dumazet proposed working around the problem by adding a
> configuration option controlling the maximum number of allowed fragments for
> any given packet. That is fine for sites that build their own kernels, which
> prospective users of this feature are relatively likely to do. It offers little
> help for distributors, though, who must pick a value for this option for all of
> their users.
Hmm, sounds a lot like what we have in the block layer with "block queue limits", where each low-level driver that implements an interface to the block layer also provides a set of limits for its specific queue, including - among other things - how the scatter-gather list for that queue must look so the underlying device can actually deal with it. For example, some devices can't deal with multi-page scatter elements while others can; some have an upper limit on the number of scatter elements in a single list, and so on. This way there doesn't have to be a single config switch and/or a kernel-wide knob.
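A rough sketch of that idea, loosely modeled on the block layer's per-queue limits (the names below are illustrative, not the kernel's struct queue_limits):

    /* Illustrative per-device limits, advertised by each low-level driver
     * instead of being set by one kernel-wide knob. */
    struct device_sg_limits {
        unsigned int max_segments;      /* most scatter-gather entries per request */
        unsigned int max_segment_size;  /* largest allowed single entry, in bytes */
    };

    /* A network driver could advertise limits the same way, letting the
     * core cap the number of fragments per packet on a per-device basis. */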
Posted Feb 15, 2022 13:46 UTC (Tue)
by taladar (subscriber, #68407)
[Link]
This could also affect iptables rules using the u32 match.
Posted Feb 15, 2022 16:45 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (1 responses)
Posted Feb 16, 2022 9:34 UTC (Wed)
by geert (subscriber, #98403)
[Link]
Posted Feb 16, 2022 14:17 UTC (Wed)
by xxiao (guest, #9631)
[Link] (1 responses)
Posted Feb 16, 2022 14:33 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
Legacy technology can't do more than 16 bit packet lengths; you will have to upgrade to IPv6 if you need jumbo packets on the wire.
Posted Feb 16, 2022 14:24 UTC (Wed)
by jhhaller (guest, #56103)
[Link] (4 responses)
I hope that BIG TCP does something similar, that one can first configure systems to accept BIG TCP before they are sent, which would avoid flag-day requirements to implement it. Except for point-to-point communication, it will be years before every system can operate with BIG TCP. Ideally, it could be implemented on a connection-pair basis rather than switch-controlled Ethernet frame size limitations.
Posted Feb 17, 2022 13:37 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (2 responses)
Posted Feb 18, 2022 10:10 UTC (Fri)
by geert (subscriber, #98403)
[Link] (1 responses)
Turned out that the VPN software enabled a firewall rule to block all incoming ICMP traffic. This included the "fragmentation needed" messages sent from my OpenWRT router, which strictly obeyed the 576-byte MTU supplied by my ISP's DHCP server.
Of course I was the only one having that problem, as off-the-shelf consumer grade router software just ignored any MTU values, and 1500-byte packets worked fine regardless. Interestingly, we had to fix Path MTU Discovery in one of our embedded products a few weeks before...
Posted Jul 31, 2022 23:07 UTC (Sun)
by fest3er (guest, #60379)
[Link]
Posted Feb 18, 2022 10:54 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
In large part, we have the problems we have with MTU because most of our link layers now simulate a 1980s 10BASE5 network, just at very high speeds. A switched 100GBASE-LR4 network is designed to present the appearance of a single 10BASE5 segment (via switch behaviour); an 802.11 network tunnels 10BASE5 compatible networking over an RF link layer where the "true" packet size (via aggregation) is variable but can go as high as 1 MiB.
As a result, we have point to point links at the L1 level (in both WiFi and wired Ethernet), which are used to emulate a bus network topology at L2. If we'd done things very differently, we'd be presenting those P2P links directly to the L3 system, and the "switch" or "AP" equivalent would be able to offer different MTUs to different clients, and send PMTU discovery messages back instantly if you're going to a lower MTU path attached to the same switch.
It's worth noting that IPv6 (a late 1980s/early 1990s design) has vestigial support for this; a router can indicate that no other hosts are on-link, and thus force you to send everything via the router. If you're directly attached via P2P links to an IPv6 router, you could thus have different MTUs on all the P2P links, and the router would be able to control path MTU as appropriate.
Posted Feb 18, 2022 3:59 UTC (Fri)
by developer122 (guest, #152928)
[Link] (1 responses)
Posted Feb 18, 2022 9:57 UTC (Fri)
by geert (subscriber, #98403)
[Link]
Posted Feb 18, 2022 4:02 UTC (Fri)
by developer122 (guest, #152928)
[Link] (1 responses)
There seem to be a lot of old, very overloaded structures in the kernel with different bits used in different contexts. It makes me wonder if one could work up a way of generically breaking these optional bits out into sub-structures and leave the main structure mostly a table of references. One would need to macro-ify the copying and other handling of them though. Or maybe this is just making things even more complicated. :P
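A minimal sketch of what that might look like, purely as an illustration of the idea (the names below are made up; the kernel's existing SKB extensions already move some optional data out of the main structure, though not in this fully generic form):

    /* Hypothetical layout: a small core structure plus an optional table
     * of rarely used extensions, looked up by identifier. */
    #include <stddef.h>

    enum pkt_ext_id { PKT_EXT_SECPATH, PKT_EXT_BRIDGE, PKT_EXT_COUNT };

    struct pkt_ext_table {
        unsigned int refcount;
        void *ext[PKT_EXT_COUNT];    /* NULL when an extension is absent */
    };

    struct pkt_core {
        /* ... hot, always-present fields ... */
        struct pkt_ext_table *exts;  /* NULL in the common case */
    };

    void *pkt_ext_get(struct pkt_core *p, enum pkt_ext_id id)
    {
        return p->exts ? p->exts->ext[id] : NULL;
    }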
Posted Feb 18, 2022 9:02 UTC (Fri)
by laarmen (subscriber, #63948)
[Link]
Posted Aug 1, 2022 16:02 UTC (Mon)
by jhoblitt (subscriber, #77733)
[Link] (1 responses)
Posted Apr 24, 2023 14:31 UTC (Mon)
by edeloget (subscriber, #88392)
[Link]