Going big with TCP packets

By Jonathan Corbet
February 14, 2022

Like most components in the computing landscape, networking hardware has grown steadily faster over time. Indeed, today's high-end network interfaces can often move data more quickly than the systems they are attached to can handle. The networking developers have been working for years to increase the scalability of their subsystem; one of the current projects is the BIG TCP patch set from Eric Dumazet and Coco Li. BIG TCP isn't for everybody, but it has the potential to significantly improve networking performance in some settings.

Imagine, for a second, that you are trying to keep up with a 100Gb/s network adapter. As networking developer Jesper Brouer described back in 2015, if one is using the longstanding maximum packet size of 1,538 bytes, running the interface at full speed means coping with over eight-million packets per second. At that rate, CPU has all of about 120ns to do whatever is required to handle each packet, which is not a lot of time; a single cache miss can ruin the entire processing-time budget.

The situation gets better, though, if the number of packets is reduced, and that can be achieved by making packets larger. So it is unsurprising that high-performance networking installations, especially local-area networks where everything is managed as a single unit, use larger packet sizes. With proper configuration, packet sizes up to 64KB can be used, improving the situation considerably. But, in settings where data is being moved in units of megabytes or gigabytes (or more — cat videos are getting larger all the time), that still leaves the system with a lot of packets to handle.

Packet counts hurt in a number of ways. There is a significant fixed overhead associated with every packet transiting a system. Each packet must find its way through the network stack, from the upper protocol layers down to the device driver for the interface (or back). More packets means more interrupts from the network adapter. The sk_buff structure ("SKB") used to represent packets within the kernel is a large beast, since it must be able to support just about any networking feature that may be in use; that leads to significant per-packet memory use and memory-management costs. So there are good reasons to wish for the ability to move data in fewer, larger packets, at least for some types of applications.

The length of an IP packet is stored in the IP header; for both IPv4 and IPv6, that length lives in a 16-bit field, limiting the maximum packet size to 64KB. At the time these protocols were designed, a 64KB packet could take multiple seconds to transmit on the backbone Internet links that were available, so it must have seemed like a wildly large number; surely 64KB would be more than anybody would ever rationally want to put into a single packet. But times change, and 64KB can now seem like a cripplingly low limit.

Awareness of this problem is not especially recent: there is a solution (for IPv6, at least) to be found in RFC 2675, which was adopted in 1999. The IPv6 specification allows the placement of "hop-by-hop" headers with additional information; as the name suggests, a hop-by-hop header is used to communicate options between two directly connected systems. RFC 2675 enables larger packets with a couple of tweaks to the protocol. To send a "jumbo" packet, a system must set the (16-bit) IP payload length field to zero and add a hop-by-hop header containing the real payload length. The length field in that header is 32 bits, meaning that jumbo packets can contain up to 4GB of data; that should surely be enough for everybody.

The BIG TCP patch set adds the logic necessary to generate and accept jumbo packets when the maximum transmission unit (MTU) of a connection is set sufficiently high. Unsurprisingly, there were a number of details to manage to make this actually work. One of the more significant issues is that packets of any size are rarely stored in physically contiguous memory, which tends to be hard to come by in general. For zero-copy operations, where the buffers live in user space, packets are guaranteed to be scattered through physical memory. So packets are represented as a set of "fragments", which can be as short as one (4KB) page each; network interfaces handle the task of assembling packets from fragments on transmission (or fragmenting them on receipt).

Current kernels limit the number of fragments stored in an SKB to 17, which is sufficient to store a 64KB packet in single-page chunks. That limit will clearly interfere with the creation of larger packets, so the patch set raises the maximum number of fragments (to 45). But, as Alexander Duyck pointed out, many interface drivers encode assumptions about the maximum number of fragments that a packet may be split into. Increasing that limit without fixing the drivers could lead to performance regressions or even locked-up hardware, he said.

After some discussion, Dumazet proposed working around the problem by adding a configuration option controlling the maximum number of allowed fragments for any given packet. That is fine for sites that build their own kernels, which prospective users of this feature are relatively likely to do. It offers little help for distributors, though, who must pick a value for this option for all of their users.

In any case, many drivers will need to be updated to handle jumbo packets. Modern network interfaces perform segmentation offloading, meaning that much of the work of creating individual packets is done within the interface itself. Making segmentation offloading work with jumbo packets tends to involve a small number of tweaks; a few drivers are updated in the patch set.

One other minor problem has to do with the placement of the RFC 2675 hop-by-hop header. These headers, per the IPv6 standard, are placed immediately after the IP header; that can confuse software that "knows" that the TCP header can be found immediately after the IP header in a packet. The tcpdump utility has some problems in this regard; it also seems that there are a fair number of BPF programs in the wild that contain this assumption. For this reason, jumbo-packet handling is disabled by default, even if the underlying hardware and link could handle those packets.

Dumazet included some brief benchmark results with the patch posting. Enabling a packet size of 185,000 bytes increased network throughput by nearly 50% while also reducing round-trip latency significantly. So BIG TCP seems like an option worth having, at least in the sort of environments (data centers, for example) that use high-speed links and can reliably deliver large packets. If tomorrow's cat videos arrive a little more quickly, BIG TCP may be part of the reason.

See Dumazet's 2021 Netdev talk on this topic for more details.

Index entries for this article
Kernel	Networking/IPv6

Going big with TCP packets

Posted Feb 14, 2022 21:39 UTC (Mon) by dcg (subscriber, #9198) [Link] (12 responses)

This is not too different from the problems that modern file systems have using the full bandwidth of modern storage. Isn't the root problem the same? CPUs are not getting faster, but other hardware pieces are, so abstraction layers have to do as little as possible and algorithms have to be extremely efficient. Very interesting times for operating system development.

Going big with TCP packets

Posted Feb 15, 2022 2:19 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (11 responses)

HDD I/O speeds are not getting significantly faster, either. It's actually starting to become a bit of a problem, because the ratio of storage space to I/O bandwidth has gotten extremely biased in favor of the former as HDDs get bigger but not faster, so a large fleet of HDDs doesn't have enough overall bandwidth to be useful at scale. You could work around this by buying a lot of small HDDs, but that's so expensive (per gigabyte) that you're probably better off just buying SSDs instead. As a result, we're starting to see increased use of SSDs even in domains where HDD seek time would otherwise be acceptable, and HDDs are sometimes getting relegated to "cool" storage and batch processing.

(The above describes my experience with extremely large-scale services. For "normal-sized" services, you're probably fine with an HDD or ten, but if you've suddenly decided to build your own private YouTube, you should at least run the numbers, just in case.)

Going big with TCP packets

Posted Feb 15, 2022 5:21 UTC (Tue) by zdzichu (guest, #17118) [Link] (7 responses)

I believe "modern storage" in 2022 are NVMe drives capable of 5 million IOPS each.

Going big with TCP packets

Posted Feb 15, 2022 10:36 UTC (Tue) by mageta (subscriber, #89696) [Link] (6 responses)

For many use-cases thats still just way too expensive (for the moment); and there is plenty of development still happening in HDDs (even tapes). For example, some vendors starting to deploy multi-actuator HDDs (we recently got support for that in Linux), so you can have multiple reads/writes concurrently - obviously that's still slower than flash.

Going big with TCP packets

Posted Feb 15, 2022 14:44 UTC (Tue) by HenrikH (subscriber, #31152) [Link] (5 responses)

Well so are 100Gb/s NICs and switches so neither one is consumer level driven at the moment.

Going big with TCP packets

Posted Feb 15, 2022 14:59 UTC (Tue) by Wol (subscriber, #4433) [Link] (4 responses)

Well, admittedly my home system isn't configured like a lot of them ...

But I reorganised (as in archived last year's) loads of mailing lists, and it was noticeable that even after Thunderbird came back and said "finished", that the disk subsystem - a raid-5 array - was desperately struggling to catch up. With 32GB of ram, provided it caches everything fine I'm not worried, but it's still a little scary knowing there's loads of stuff in the disk queue flushing as fast as possible ...

Cheers,
Wol

Going big with TCP packets

Posted Aug 1, 2022 23:28 UTC (Mon) by WolfWings (subscriber, #56790) [Link] (3 responses)

That's a large reason my home NAS is a lot of smaller spindles when I built it last, using 2.5" 2TB HDD drives currently. Yeah there's single 3.5" drives in the next year or two that can approach the capacity of the array, but the throughput craters in that case especially for random I/O in comparison, and if I lose a 2TB drive it's a bit under 4 hours for a rebuild not days.

Since that's limited entirely by the write speed of the 2TB drive I've been thinking about adding a single NVMe exclusively as a hot-spare just to reduce that time down to about 30 minutes TBH.

Going big with TCP packets

Posted Aug 1, 2022 23:39 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

What raid level?

A four or five drive raid-6 reduces the danger of a disk failure. An NVMe cache will speed up write time. And the more spindles you have, the faster your reads, regardless of array size.

Cheers,
Wol

Going big with TCP packets

Posted Aug 2, 2022 21:13 UTC (Tue) by bartoc (guest, #124262) [Link] (1 responses)

Once the drives get big enough it makes sense to just use something like btrfs raid10, instead of something like raid6, rebuilds still take a long time but don't have to read all of every drive anymore. There are also fewer balancing issues if you add more drives. Actually, even with raid6 you should probably use something like btrfs or zfs (zfs can have some creeping performance problems, and is harder to expand/contract, but is better tested). Btrfs raid6 is said to be "unsafe" but in reality, it is probably safer than mdraid raid6 or a raid controller's raid6.

Not to mention the additional cost of the bigger drives (per TB) is offset by needing less "other crap" to drive them. You need smaller RAID enclosures, fewer controllers/expanders/etc, less space, and so on.

A four drive raid6 is pretty pointless, you get the write hole and the write amplification for a total of .... zero additional space efficiency. Just use a check summing raid10 type filesystem. IMHO 8-12 disks is the sweet spot for raid6.

fun quote from the parity delustering paper, published 1992:
> Since the time necessary to reconstruct the contents of a failed disk is certainly minutes and possibly hours, we focus this paper on the performance of a continuous-operation storage subsystem during on-line failure recovery.

My last raid rebuild was I think 5 full days long, using a small array of 18 TB disks.

Going big with TCP packets

Posted Aug 2, 2022 23:01 UTC (Tue) by Wol (subscriber, #4433) [Link]

> A four drive raid6 is pretty pointless, you get the write hole and the write amplification for a total of .... zero additional space efficiency. Just use a check summing raid10 type filesystem. IMHO 8-12 disks is the sweet spot for raid6.

But a four-drive raid-10 is actually far more dangerous to your data ... A raid 6 will survive a double disk failure. A double failure on a raid 10 has - if I've got my maths right - a 30% chance of trashing your array.

md-raid-6 doesn't (if configured that way) have a write hole any more.

> Btrfs raid6 is said to be "unsafe" but in reality, it is probably safer than mdraid raid6 or a raid controller's raid6.

I hate to say it, but I happen to KNOW that btrfs raid6 *IS* unsafe. A lot LESS safe than md-raid-6. It'll find problems better, but it's far more likely that those problems will have trashed your data. Trust me on that ...

At the end of the day, different raids have different properties. And most of them have their flaws. Unfortunately, at the moment btrfs parity raid is more flaw and promise than reality.

Oh - and my setup - which I admit chews up disk - is 3-disk raid-5 with spare, over dm-integrity and under lvm. Okay, it'll only survive one disk failure, but it will also survive corruption, just like btrfs. But each level of protection has its own dedicated layer - the Unix "do one thing and do it well". Btrfs tries to do everything - which does have many advantages - but unfortunately it's a "jack of all trades, crap at some of them", one of which unfortunately is parity raid ...

And while I don't know whether my layers support trim, if they do, btrfs has no advantage over me on time taken to rebuild/recover an array. Btrfs knows what part of the disk is use, but so does any logical/physical device that supports trim ...

Cheers,
Wol

Going big with TCP packets

Posted Feb 15, 2022 10:35 UTC (Tue) by Sesse (subscriber, #53779) [Link]

This was a problem even when I started in Google in 2007; disks were getting so large that anything like recovery or the likes was getting problematic. So the problem has been there all along, it's just moving slowly into more “normal” problem domains as the problem gets worse and worse.

Going big with TCP packets

Posted Feb 15, 2022 17:45 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

I assume that with something like YouTube, the final solution involves multiple levels of caching with different tradeoffs between latency and cost. With a site that is accessed mostly by browsing, you can get advance notice of what items are likely to be looked at soon based on what's on the users' screens and can pre-populate the cache accordingly. I'm sure there are engineers whose whole job is to optimize this behavior. It also makes me wonder if some of the dreaded YouTube algorithm isn't built around trying to steer viewers into stuff that's popular right now because it's sure to be available in cache.

Going big with TCP packets

Posted Feb 16, 2022 10:13 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

Speaking with my Google hat on: Saying that the final solution involves "multiple levels of caching" is like saying that a game of Magic: The Gathering involves multiple rules.[1] Beyond that I'm not really at liberty to comment, but I can point you at [2] for the official company line on how the recommendation system works.[3]

[1] See https://media.wizards.com/2022/downloads/MagicCompRules%2... for the MTG rules. But don't actually try to read them, because they're intended to be used for resolving specific issues, not as a "here's how to play the game" starting point.
[2]: https://blog.youtube/inside-youtube/on-youtubes-recommend...
[3]: The linked blog post is Google/YouTube's official position on the matter, and may not be reflective of my own personal views (which I have no interest in discussing). It's also from September 2021, so things may have changed since then.

Going big with TCP packets

Posted Feb 15, 2022 5:20 UTC (Tue) by alison (subscriber, #63752) [Link]

> Enabling a packet size of 185,000 bytes increased network throughput by nearly 50%

Presumably "throughput" involves getting the BIG TCP packet off the NIC via DMA, probably of the scatter-gather variety for so much data. It's remarkable that the speed of these transfers is sufficient for a 50% speed-up.

Going big with TCP packets

Posted Feb 15, 2022 10:28 UTC (Tue) by mageta (subscriber, #89696) [Link]

> Current kernels limit the number of fragments stored in an SKB to 17, which is
> sufficient to store a 64KB packet in single-page chunks. That limit will
> clearly interfere with the creation of larger packets, so the patch set raises
> the maximum number of fragments (to 45). But, as Alexander Duyck pointed out,
> many interface drivers encode assumptions about the maximum number of fragments
> that a packet may be split into. Increasing that limit without fixing the
> drivers could lead to performance regressions or even locked-up hardware, he
> said.
>
> After some discussion, Dumazet proposed working around the problem by adding a
> configuration option controlling the maximum number of allowed fragments for
> any given packet. That is fine for sites that build their own kernels, which
> prospective users of this feature are relatively likely to do. It offers little
> help for distributors, though, who must pick a value for this option for all of
> their users.

Hmm, sounds a lot like what we have in the block layer with "block queue limits"; where each low level driver that implements an interface to the block layer also provides a set of limits for this specific queue, also - among other things - including how the scatter-gather list of this queue must look like so the underlying device can actually deal with it. For example some devices can't deal with multi-page scatter-elements, while other can; some have a upper limit of scatter-elements in a single list, and so on. This way there doesn't have to be a single config switch and/or a kernel wide knob.

Going big with TCP packets

Posted Feb 15, 2022 13:46 UTC (Tue) by taladar (subscriber, #68407) [Link]

> software that "knows" that the TCP header can be found immediately after the IP header in a packet.

This could also affect iptables rules using the u32 match.

Going big with TCP packets

Posted Feb 15, 2022 16:45 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (1 responses)

I hereby confirm that 64KB packets seemed insanely large to me in the 1980s. ;-)

Going big with TCP packets

Posted Feb 16, 2022 9:34 UTC (Wed) by geert (subscriber, #98403) [Link]

However, Commodore managed to send more than 10 million ;-)

Going big with TCP packets

Posted Feb 16, 2022 14:17 UTC (Wed) by xxiao (guest, #9631) [Link] (1 responses)

RFC2675 adds 32bit field for 4GB packets in IPv6, what about IPv4, how can IPv4 ever do more that 16-bit length of packages? I still don't understand how that can work.

Going big with TCP packets

Posted Feb 16, 2022 14:33 UTC (Wed) by farnz (subscriber, #17727) [Link]

Legacy technology can't do more than 16 bit packet lengths; you will have to upgrade to IPv6 if you need jumbo packets on the wire.

Going big with TCP packets

Posted Feb 16, 2022 14:24 UTC (Wed) by jhhaller (guest, #56103) [Link] (4 responses)

One of the problems with implementing changes like this is the use of global settings which control too much for one setting. Take, for example, the problem of increasing MTU. It's hard to change the MTU because it's used both to control receive packet size and transmit packet size. Everyone on a LAN segment has to agree on the MTU because switches don't operate in the ICMP or (for IPv4) fragmentation domain. So, if one's corporate backbone has evolved to significantly higher bandwidth, but still drops a small fraction of packets, TCP flow control algorithms limit transmission based on latency. Larger MTUs would help, but implementing that requires a flag day in every subnet participating in a conversation. If one could change systems to accept larger MTUs, but not send larger MTUs, a flag day isn't required, as every system can be configured to accept larger MTUs, but not transmit them. Once every system has been changed, then the router port and endpoint systems can be changed to larger MTUs.

I hope that BIG TCP does something similar, that one can first configure systems to accept BIG TCP before they are sent, which would avoid flag-day requirements to implement it. Except for point-to-point communication, it will be years before every system can operate with BIG TCP. Ideally, it could be implemented on a connection-pair basis rather than switch-controlled Ethernet frame size limitations.

Going big with TCP packets

Posted Feb 17, 2022 13:37 UTC (Thu) by kleptog (subscriber, #1183) [Link] (2 responses)

It would be nice if Path MTU discovery actually worked reliably everywhere. Nowadays with layers and layers of tunnelling and encapsulation being fairly normal, reducing MTUs on interfaces to make things reliable is still required far too often.

Going big with TCP packets

Posted Feb 18, 2022 10:10 UTC (Fri) by geert (subscriber, #98403) [Link] (1 responses)

I used to be "blessed" with a work laptop with Windows and VPN software. When using the VPN from home, the network stopped working very soon.
Turned out that the VPN software enabled a firewall rule to block all incoming ICMP traffic. This included the "fragmentation needed" messages sent from my OpenWRT router, which strictly obeyed the 576-byte MTU supplied by my ISP's DHCP server.

Of course I was the only one having that problem, as off-the-shelf consumer grade router software just ignored any MTU values, and 1500-byte packets worked fine regardless. Interestingly, we had to fix Path MTU Discovery in one of our embedded products a few weeks before...

Going big with TCP packets

Posted Jul 31, 2022 23:07 UTC (Sun) by fest3er (guest, #60379) [Link]

You weren't the only one. I dutifully obeyed Comcast's MTU size. And things broke. I eventually chose to ignore that incorrect, erroneous setting. That 'handwave' solved the problem. For me.

Going big with TCP packets

Posted Feb 18, 2022 10:54 UTC (Fri) by farnz (subscriber, #17727) [Link]

In large part, we have the problems we have with MTU because most of our link layers now simulate a 1980s 10BASE5 network, just at very high speeds. A switched 100GBASE-LR4 network is designed to present the appearance of a single 10BASE5 segment (via switch behaviour); an 802.11 network tunnels 10BASE5 compatible networking over an RF link layer where the "true" packet size (via aggregation) is variable but can go as high as 1 MiB.

As a result, we have point to point links at the L1 level (in both WiFi and wired Ethernet), which are used to emulate a bus network topology at L2. If we'd done things very differently, we'd be presenting those P2P links directly to the L3 system, and the "switch" or "AP" equivalent would be able to offer different MTUs to different clients, and send PMTU discovery messages back instantly if you're going to a lower MTU path attached to the same switch.

It's worth noting that IPv6 (a late 1980s/early 1990s design) has vestigial support for this; a router can indicate that no other hosts are on-link, and thus force you to send everything via the router. If you're directly attached via P2P links to an IPv6 router, you could thus have different MTUs on all the P2P links, and the router would be able to control path MTU as appropriate.

Going big with TCP packets

Posted Feb 18, 2022 3:59 UTC (Fri) by developer122 (guest, #152928) [Link] (1 responses)

Why in the world is it called "big TCP" when it's the IP packet that's bigger, not the TCP bit riding along inside it?

Going big with TCP packets

Posted Feb 18, 2022 9:57 UTC (Fri) by geert (subscriber, #98403) [Link]

Because "Big IP" would attract too many investors? ;-)

Going big with TCP packets

Posted Feb 18, 2022 4:02 UTC (Fri) by developer122 (guest, #152928) [Link] (1 responses)

> The sk_buff structure ("SKB") used to represent packets within the kernel is a large beast, since it must be able to support just about any networking feature that may be in use; that leads to significant per-packet memory use and memory-management costs.

There seem to be a lot of old, very overloaded structures in the kernel with different bits used in different contexts. It makes me wonder if one could work up a way of generically breaking these optional bits out into sub-structures and leave the main structure mostly a table of references. One would need to macro-ify the copying and other handling of them though. Or maybe this is just making things even more complicated. :P

Going big with TCP packets

Posted Feb 18, 2022 9:02 UTC (Fri) by laarmen (subscriber, #63948) [Link]

One thing to consider is that not having these extra bits in the skb means another memory load, which might not be affordable in the per-packet CPU budget.

Going big with TCP packets

Posted Aug 1, 2022 16:02 UTC (Mon) by jhoblitt (subscriber, #77733) [Link] (1 responses)

Does anyone know what layer2 hardware is being used to test this work on? I don't believe I've ever seen an Ethernet NIC that supports an MTU > 16KiB.

Going big with TCP packets

Posted Apr 24, 2023 14:31 UTC (Mon) by edeloget (subscriber, #88392) [Link]

Looking at the patches, I would give a look to Mellanox NICs :)