LWN: Comments on "Going big with TCP packets"
https://lwn.net/Articles/884104/
Comments posted to the LWN article "Going big with TCP packets".

Posted 24 Apr 2023 14:31 UTC by edeloget (https://lwn.net/Articles/929991/)

Looking at the patches, I would take a look at Mellanox NICs :)

Posted 2 Aug 2022 23:01 UTC by Wol (https://lwn.net/Articles/903603/)

> A four drive raid6 is pretty pointless, you get the write hole and the write amplification for a total of ... zero additional space efficiency. Just use a checksumming raid10 type filesystem. IMHO 8-12 disks is the sweet spot for raid6.

But a four-drive raid-10 is actually far more dangerous to your data ... A raid-6 will survive a double disk failure. A double failure on a raid-10 has - if I've got my maths right - about a one-in-three chance of trashing your array.

md raid-6 doesn't (if configured that way) have a write hole any more.

> Btrfs raid6 is said to be "unsafe" but in reality, it is probably safer than mdraid raid6 or a raid controller's raid6.

I hate to say it, but I happen to KNOW that btrfs raid6 *IS* unsafe. A lot LESS safe than md raid-6. It'll find problems better, but it's far more likely that those problems will have trashed your data. Trust me on that ...

At the end of the day, different raids have different properties, and most of them have their flaws. Unfortunately, at the moment btrfs parity raid is more flaw and promise than reality.

Oh - and my setup - which I admit chews up disk - is 3-disk raid-5 with a spare, over dm-integrity and under LVM. Okay, it'll only survive one disk failure, but it will also survive corruption, just like btrfs. But each level of protection has its own dedicated layer - the Unix "do one thing and do it well". Btrfs tries to do everything - which does have many advantages - but unfortunately it's a "jack of all trades, crap at some of them", one of which unfortunately is parity raid ...

And while I don't know whether my layers support trim, if they do, btrfs has no advantage over me on the time taken to rebuild/recover an array. Btrfs knows what part of the disk is in use, but so does any logical/physical device that supports trim ...

Cheers,
Wol
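
A minimal sketch of the arithmetic behind that one-in-three figure, assuming the raid-10 is laid out as plain mirrored pairs (other layout details are ignored here):

```c
/* Once one drive in an N-drive raid-10 of mirrored pairs has failed, the
 * array is lost only if the *second* failure hits the dead drive's mirror
 * partner: 1 of the N-1 survivors. A raid-6, by contrast, survives any
 * two failures. */
#include <stdio.h>

int main(void)
{
    for (int n = 4; n <= 12; n += 2)
        printf("raid-10 with %2d drives: %4.1f%% chance that a second failure is fatal\n",
               n, 100.0 / (n - 1));
    return 0;
}
```

For four drives that works out to 33%, presumably the source of the rough figure above; the odds improve as the array grows, but never reach the "any two drives" guarantee of raid-6.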
Posted 2 Aug 2022 21:13 UTC by bartoc (https://lwn.net/Articles/903600/)

Once the drives get big enough it makes sense to just use something like btrfs raid10 instead of raid6: rebuilds still take a long time, but they no longer have to read all of every drive. There are also fewer balancing issues if you add more drives. Actually, even with raid6 you should probably use something like btrfs or zfs (zfs can have some creeping performance problems, and is harder to expand/contract, but is better tested). Btrfs raid6 is said to be "unsafe" but in reality, it is probably safer than mdraid raid6 or a raid controller's raid6.

Not to mention that the additional cost of the bigger drives (per TB) is offset by needing less "other crap" to drive them. You need smaller RAID enclosures, fewer controllers/expanders/etc., less space, and so on.

A four drive raid6 is pretty pointless: you get the write hole and the write amplification for a total of ... zero additional space efficiency. Just use a checksumming raid10 type filesystem. IMHO 8-12 disks is the sweet spot for raid6.

Fun quote from the parity declustering paper, published in 1992:

> Since the time necessary to reconstruct the contents of a failed disk is certainly minutes and possibly hours, we focus this paper on the performance of a continuous-operation storage subsystem during on-line failure recovery.

My last raid rebuild was, I think, 5 full days long, using a small array of 18 TB disks.

Posted 1 Aug 2022 23:39 UTC by Wol (https://lwn.net/Articles/903512/)

What raid level?

A four- or five-drive raid-6 reduces the danger of a disk failure. An NVMe cache will speed up write times. And the more spindles you have, the faster your reads, regardless of array size.

Cheers,
Wol

Posted 1 Aug 2022 23:28 UTC by WolfWings (https://lwn.net/Articles/903511/)

That's a large reason my home NAS is a lot of smaller spindles; when I built it last I used 2.5" 2TB HDDs. Yes, there are single 3.5" drives arriving in the next year or two that can approach the capacity of the whole array, but the throughput craters in that case, especially for random I/O, and if I lose a 2TB drive a rebuild takes a bit under 4 hours, not days.

Since that is limited entirely by the write speed of the 2TB drive, I've been thinking about adding a single NVMe drive exclusively as a hot spare, just to bring that time down to about 30 minutes, TBH.
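
Those rebuild figures line up with a back-of-the-envelope estimate in which the rebuild is limited only by the sustained sequential speed of the replacement drive; the drive speeds below are assumptions for illustration, not measured values:

```c
#include <stdio.h>

/* Idealized rebuild time: capacity divided by sustained write speed. */
static void rebuild(const char *label, double bytes, double bytes_per_sec)
{
    double hours = bytes / bytes_per_sec / 3600.0;
    printf("%-18s %5.1f hours (%4.1f days)\n", label, hours, hours / 24.0);
}

int main(void)
{
    rebuild("2 TB @ 150 MB/s", 2e12, 150e6);    /* ~3.7 hours: "a bit under 4" */
    rebuild("18 TB @ 250 MB/s", 18e12, 250e6);  /* ~20 hours in the ideal case */
    return 0;
}
```

The gap between the ideal ~20 hours for an 18TB drive and the 5 real days reported above is a measure of everything else a parity rebuild has to do: read every surviving drive, seek around live I/O, and throttle itself to keep the array usable.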
Posted 1 Aug 2022 16:02 UTC by jhoblitt (https://lwn.net/Articles/903478/)

Does anyone know what layer-2 hardware is being used to test this work? I don't believe I've ever seen an Ethernet NIC that supports an MTU > 16KiB.

Posted 31 Jul 2022 23:07 UTC by fest3er (https://lwn.net/Articles/903035/)

You weren't the only one. I dutifully obeyed Comcast's MTU size. And things broke. I eventually chose to ignore that incorrect setting. That 'handwave' solved the problem. For me.

Posted 18 Feb 2022 10:54 UTC by farnz (https://lwn.net/Articles/885256/)

In large part, we have the problems we have with MTU because most of our link layers now simulate a 1980s 10BASE5 network, just at very high speeds. A switched 100GBASE-LR4 network is designed to present the appearance of a single 10BASE5 segment (via switch behaviour); an 802.11 network tunnels 10BASE5-compatible networking over an RF link layer where the "true" packet size (via aggregation) is variable but can go as high as 1 MiB.

As a result, we have point-to-point links at the L1 level (in both WiFi and wired Ethernet), which are used to emulate a bus network topology at L2. If we'd done things very differently, we'd be presenting those P2P links directly to the L3 system, and the "switch" or "AP" equivalent would be able to offer different MTUs to different clients, and send PMTU discovery messages back instantly if you're going to a lower-MTU path attached to the same switch.

It's worth noting that IPv6 (a late-1980s/early-1990s design) has vestigial support for this; a router can indicate that no other hosts are on-link, and thus force you to send everything via the router. If you're directly attached via P2P links to an IPv6 router, you could thus have different MTUs on all the P2P links, and the router would be able to control the path MTU as appropriate.

Posted 18 Feb 2022 10:10 UTC by geert (https://lwn.net/Articles/885251/)

I used to be "blessed" with a work laptop with Windows and VPN software. When using the VPN from home, the network soon stopped working. It turned out that the VPN software enabled a firewall rule blocking all incoming ICMP traffic. That included the "fragmentation needed" messages sent by my OpenWRT router, which strictly obeyed the 576-byte MTU supplied by my ISP's DHCP server.

Of course I was the only one having that problem, as off-the-shelf consumer-grade router software just ignored any MTU values, and 1500-byte packets worked fine regardless. Interestingly, we had had to fix Path MTU Discovery in one of our embedded products a few weeks before...

Posted 18 Feb 2022 09:57 UTC by geert (https://lwn.net/Articles/885250/)

Because "Big IP" would attract too many investors? ;-)

Posted 18 Feb 2022 09:02 UTC by laarmen (https://lwn.net/Articles/885245/)

One thing to consider is that not having these extra bits in the skb means another memory load, which might not be affordable in the per-packet CPU budget.

Posted 18 Feb 2022 04:02 UTC by developer122 (https://lwn.net/Articles/885238/)

> The sk_buff structure ("SKB") used to represent packets within the kernel is a large beast, since it must be able to support just about any networking feature that may be in use; that leads to significant per-packet memory use and memory-management costs.

There seem to be a lot of old, very overloaded structures in the kernel with different bits used in different contexts. It makes me wonder if one could work up a way of generically breaking these optional bits out into sub-structures and leaving the main structure mostly a table of references. One would need to macro-ify the copying and other handling of them, though. Or maybe this is just making things even more complicated. :P
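
A minimal sketch of that sub-structure idea, using invented names (pkt_core, pkt_ext) rather than the kernel's real types: the hot fields stay in a small structure, and rarely-used state is allocated only when something asks for it.

```c
#include <stdlib.h>

struct pkt_ext {              /* optional state, allocated only on demand */
    void *conntrack_state;
    void *security_path;
    void *tunnel_info;
};

struct pkt_core {             /* the hot, always-present fields stay small */
    unsigned char  *data;
    unsigned int    len;
    struct pkt_ext *ext;      /* NULL in the common case */
};

/* Look up (or lazily create) the extension block for a packet. */
static struct pkt_ext *pkt_ext_get(struct pkt_core *p)
{
    if (!p->ext)
        p->ext = calloc(1, sizeof(*p->ext));
    return p->ext;            /* callers must cope with NULL on failure */
}
```

The kernel already has a mechanism somewhat along these lines for a few kinds of optional skb state (skb extensions, attached with skb_ext_add()), and laarmen's point above is exactly the cost: any path that needs the optional data pays an extra pointer chase.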
Posted 18 Feb 2022 03:59 UTC by developer122 (https://lwn.net/Articles/885237/)

Why in the world is it called "big TCP" when it's the IP packet that's bigger, not the TCP bit riding along inside it?

Posted 17 Feb 2022 13:37 UTC by kleptog (https://lwn.net/Articles/885108/)

It would be nice if Path MTU Discovery actually worked reliably everywhere. Nowadays, with layers and layers of tunnelling and encapsulation being fairly normal, reducing MTUs on interfaces to make things reliable is still required far too often.

Posted 16 Feb 2022 14:33 UTC by farnz (https://lwn.net/Articles/884936/)

Legacy technology can't do more than 16-bit packet lengths; you will have to upgrade to IPv6 if you need jumbo packets on the wire.

Posted 16 Feb 2022 14:24 UTC by jhhaller (https://lwn.net/Articles/884932/)

One of the problems with implementing changes like this is the use of global settings that control too much with a single setting. Take, for example, the problem of increasing the MTU. It's hard to change the MTU because it controls both the receive packet size and the transmit packet size. Everyone on a LAN segment has to agree on the MTU because switches don't participate in ICMP or (for IPv4) fragmentation. So, if one's corporate backbone has evolved to significantly higher bandwidth but still drops a small fraction of packets, TCP's congestion-control algorithms will limit throughput based on latency. Larger MTUs would help, but deploying them requires a flag day in every subnet participating in a conversation. If systems could be configured to accept larger MTUs but not send them, no flag day would be required: every system is first configured to accept larger MTUs without transmitting them, and once every system has been changed, the router ports and endpoint systems can be switched to the larger MTU.

I hope that BIG TCP does something similar, so that systems can first be configured to accept BIG TCP packets before any are sent, which would avoid a flag day for deployment. Except for point-to-point communication, it will be years before every system can operate with BIG TCP. Ideally, it could be enabled on a connection-pair basis rather than being governed by switch-imposed Ethernet frame-size limits.

Posted 16 Feb 2022 14:17 UTC by xxiao (https://lwn.net/Articles/884933/)

RFC 2675 adds a 32-bit field for 4GB packets in IPv6, but what about IPv4? How can IPv4 ever carry more than a 16-bit packet length? I still don't understand how that can work.
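
For the IPv6 half of that question, this is roughly what RFC 2675 puts on the wire; the struct name is invented for illustration and byte-order handling is ignored. When a packet exceeds 65,535 bytes, the normal 16-bit payload-length field in the IPv6 header is set to zero and the real length travels as a 32-bit value in a Hop-by-Hop option. IPv4 has no equivalent escape hatch, which is farnz's point above; as I understand it, BIG TCP's oversized packets mostly exist inside the host anyway, between the stack and a TSO/GRO-capable NIC, rather than as giant frames on the wire.

```c
#include <stdint.h>

/* RFC 2675 Jumbo Payload option, carried in an IPv6 Hop-by-Hop options
 * header (sketch only; a real implementation uses the kernel's own
 * definitions and network byte order). */
struct ipv6_hbh_jumbo {
    uint8_t  next_header;  /* protocol that follows, e.g. TCP            */
    uint8_t  hdr_ext_len;  /* 0: this extension header is 8 bytes total  */
    uint8_t  opt_type;     /* 0xC2: Jumbo Payload                        */
    uint8_t  opt_len;      /* 4: size of the jumbo length field          */
    uint32_t jumbo_len;    /* true payload length, up to ~4GB            */
};
```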
Posted 16 Feb 2022 10:13 UTC by NYKevin (https://lwn.net/Articles/884915/)

Speaking with my Google hat on: saying that the final solution involves "multiple levels of caching" is like saying that a game of Magic: The Gathering involves multiple rules.[1] Beyond that I'm not really at liberty to comment, but I can point you at [2] for the official company line on how the recommendation system works.[3]

[1]: See https://media.wizards.com/2022/downloads/MagicCompRules%2020220218.txt for the MTG rules. But don't actually try to read them, because they're intended to be used for resolving specific issues, not as a "here's how to play the game" starting point.
[2]: https://blog.youtube/inside-youtube/on-youtubes-recommendation-system/
[3]: The linked blog post is Google/YouTube's official position on the matter, and may not be reflective of my own personal views (which I have no interest in discussing). It's also from September 2021, so things may have changed since then.

Posted 16 Feb 2022 09:34 UTC by geert (https://lwn.net/Articles/884913/)

However, Commodore managed to send more than 10 million ;-)

Posted 15 Feb 2022 17:45 UTC by rgmoore (https://lwn.net/Articles/884856/)

I assume that with something like YouTube, the final solution involves multiple levels of caching with different tradeoffs between latency and cost. With a site that is accessed mostly by browsing, you can get advance notice of which items are likely to be looked at soon based on what's on the users' screens, and can pre-populate the cache accordingly. I'm sure there are engineers whose whole job is to optimize this behavior. It also makes me wonder if some of the dreaded YouTube algorithm isn't built around trying to steer viewers into stuff that's popular right now because it's sure to be available in cache.

Posted 15 Feb 2022 16:45 UTC by PaulMcKenney (https://lwn.net/Articles/884853/)

I hereby confirm that 64KB packets seemed insanely large to me in the 1980s. ;-)

Posted 15 Feb 2022 14:59 UTC by Wol (https://lwn.net/Articles/884831/)

Well, admittedly my home system isn't configured like a lot of them ...

But I reorganised (as in archived last year's) loads of mailing lists, and it was noticeable that even after Thunderbird came back and said "finished", the disk subsystem - a raid-5 array - was desperately struggling to catch up. With 32GB of RAM, provided it caches everything fine I'm not worried, but it's still a little scary knowing there's loads of stuff in the disk queue flushing as fast as possible ...

Cheers,
Wol
Posted 15 Feb 2022 14:44 UTC by HenrikH (https://lwn.net/Articles/884830/)

Well, so are 100Gb/s NICs and switches, so neither one is driven by the consumer level at the moment.

Posted 15 Feb 2022 13:46 UTC by taladar (https://lwn.net/Articles/884808/)

> software that "knows" that the TCP header can be found immediately after the IP header in a packet.

This could also affect iptables rules using the u32 match.

Posted 15 Feb 2022 10:36 UTC by mageta (https://lwn.net/Articles/884802/)

For many use cases that's still just way too expensive (for the moment); and there is plenty of development still happening in HDDs (even tapes). For example, some vendors are starting to deploy multi-actuator HDDs (we recently got support for them in Linux), so you can have multiple reads/writes in flight concurrently; obviously that's still slower than flash.

Posted 15 Feb 2022 10:35 UTC by Sesse (https://lwn.net/Articles/884803/)

This was a problem even when I started at Google in 2007; disks were getting so large that anything like recovery was getting problematic. So the problem has been there all along, it's just moving slowly into more "normal" problem domains as it gets worse and worse.

Posted 15 Feb 2022 10:28 UTC by mageta (https://lwn.net/Articles/884801/)

> Current kernels limit the number of fragments stored in an SKB to 17, which is sufficient to store a 64KB packet in single-page chunks. That limit will clearly interfere with the creation of larger packets, so the patch set raises the maximum number of fragments (to 45). But, as Alexander Duyck pointed out, many interface drivers encode assumptions about the maximum number of fragments that a packet may be split into. Increasing that limit without fixing the drivers could lead to performance regressions or even locked-up hardware, he said.
>
> After some discussion, Dumazet proposed working around the problem by adding a configuration option controlling the maximum number of allowed fragments for any given packet. That is fine for sites that build their own kernels, which prospective users of this feature are relatively likely to do. It offers little help for distributors, though, who must pick a value for this option for all of their users.

Hmm, that sounds a lot like what we have in the block layer with "block queue limits", where each low-level driver that implements an interface to the block layer also provides a set of limits for its specific queue, including - among other things - what the scatter-gather list for that queue must look like so that the underlying device can actually deal with it. For example, some devices can't deal with multi-page scatter elements while others can; some have an upper limit on the number of scatter elements in a single list; and so on. That way there doesn't have to be a single config switch and/or kernel-wide knob.
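
A sketch of what that could look like on the networking side, by analogy with the block layer's struct queue_limits; every name and number here is invented for illustration, not an existing kernel interface.

```c
#include <stdbool.h>

/* Limits a network driver would advertise for its DMA engine, so the core
 * stack can clamp super-packets per device instead of relying on one
 * kernel-wide, compile-time knob. */
struct netdev_pkt_limits {
    unsigned int max_frags;       /* scatter-gather entries per packet      */
    unsigned int max_tso_size;    /* largest super-packet the NIC accepts   */
    bool         multipage_frags; /* may one fragment span several pages?   */
};

/* Filled in by the driver at probe time; the values below just echo the
 * numbers discussed in the article (45 fragments, 185,000-byte packets). */
static const struct netdev_pkt_limits example_driver_limits = {
    .max_frags       = 45,
    .max_tso_size    = 185000,
    .multipage_frags = true,
};
```

The transmit side already has something in this spirit in the per-device gso_max_size/gso_max_segs limits, so a per-driver fragment limit would arguably extend an existing pattern rather than add a new kind of knob.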
Posted 15 Feb 2022 05:21 UTC by zdzichu (https://lwn.net/Articles/884795/)

I believe "modern storage" in 2022 means NVMe drives capable of 5 million IOPS each.

Posted 15 Feb 2022 05:20 UTC by alison (https://lwn.net/Articles/884794/)

> Enabling a packet size of 185,000 bytes increased network throughput by nearly 50%

Presumably "throughput" involves getting the BIG TCP packet off the NIC via DMA, probably of the scatter-gather variety for so much data. It's remarkable that the speed of these transfers is sufficient for a 50% speed-up.

Posted 15 Feb 2022 02:19 UTC by NYKevin (https://lwn.net/Articles/884791/)

HDD I/O speeds are not getting significantly faster, either. It's actually starting to become a bit of a problem, because the ratio of storage space to I/O bandwidth has become extremely biased in favor of the former as HDDs get bigger but not faster, so a large fleet of HDDs doesn't have enough overall bandwidth to be useful at scale. You could work around this by buying a lot of small HDDs, but that's so expensive (per gigabyte) that you're probably better off just buying SSDs instead. As a result, we're starting to see increased use of SSDs even in domains where HDD seek time would otherwise be acceptable, and HDDs are sometimes getting relegated to "cool" storage and batch processing.

(The above describes my experience with extremely large-scale services. For "normal-sized" services, you're probably fine with an HDD or ten, but if you've suddenly decided to build your own private YouTube, you should at least run the numbers, just in case.)

Posted 14 Feb 2022 21:39 UTC by dcg (https://lwn.net/Articles/884783/)

This is not too different from the problems that modern filesystems have in using the full bandwidth of modern storage. Isn't the root problem the same? CPUs are not getting faster, but other hardware pieces are, so abstraction layers have to do as little as possible and algorithms have to be extremely efficient. Very interesting times for operating system development.
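
To put a number on the per-packet CPU budget that laarmen and dcg allude to, here is a rough calculation for a saturated 100Gb/s link, ignoring protocol headers; the packet sizes are the ones discussed in the article.

```c
#include <stdio.h>

int main(void)
{
    const double link_bps = 100e9;                   /* 100 Gb/s line rate  */
    const double sizes[]  = { 1500, 65536, 185000 }; /* MTU, GSO, BIG TCP   */

    for (int i = 0; i < 3; i++) {
        double pps = link_bps / 8.0 / sizes[i];      /* packets per second  */
        printf("%6.0f-byte packets: %9.0f pkts/s, roughly %5.0f ns each\n",
               sizes[i], pps, 1e9 / pps);
    }
    return 0;
}
```

At 1500 bytes there are only about 120ns of CPU time per packet; batching into 64KB units stretches that to about 5µs, and 185,000-byte BIG TCP packets to about 15µs, which is what makes costs like an extra memory load per packet tolerable.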