LWN: Comments on "Checksum offloads and protocol ossification" https://lwn.net/Articles/667059/ This is a special feed containing comments posted to the individual LWN article titled "Checksum offloads and protocol ossification". en-us Mon, 03 Nov 2025 14:22:37 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/671815/ https://lwn.net/Articles/671815/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; If you want to 'really scale' then the best thing you can do is ignore the kernel and perform as much networking as possible in your application. Other people have mentioned some of the userspace network drivers that by-pass the kernel implementation in the comments in this article already. </font><br> <p> Another (closed source) example: <a href="http://ats.aeroflex.com/virtualized-ip-test-solutions/product-overview/tvm-standard-hardware">http://ats.aeroflex.com/virtualized-ip-test-solutions/pro...</a><br> </div> Thu, 14 Jan 2016 06:53:27 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/668690/ https://lwn.net/Articles/668690/ Lennie <div class="FormattedComment"> Dan Kaminsky did a talk once about using HTTP to transport packets:<br> <a href="https://www.youtube.com/watch?v=YwbpnZe74ds">https://www.youtube.com/watch?v=YwbpnZe74ds</a><br> <p> ;-)<br> </div> Mon, 21 Dec 2015 08:53:35 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/668482/ https://lwn.net/Articles/668482/ moltonel <div class="FormattedComment"> To me that looks like the right thing to do. Open up the network hardware like has been done for graphics cards. CPU, GPU... NPU? It might be a big initial R&amp;D investment for network hardware manufacturers, but the software development savings should make up for it quickly enough. Plus, whoever's first to market gets to design the generic API that just happens to fit their hardware best :p<br> </div> Fri, 18 Dec 2015 13:34:23 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667648/ https://lwn.net/Articles/667648/ drag <div class="FormattedComment"> This was posted a while ago in the LWN comments. Sorry, I forget who posted it, but it's a good one:<br> <p> <a href="http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html">http://highscalability.com/blog/2013/5/13/the-secret-to-1...</a><br> <p> <p> Basically: <br> <p> If you want to 'really scale' then the best thing you can do is ignore the kernel and perform as much networking as possible in your application. Other people have mentioned some of the userspace network drivers that by-pass the kernel implementation in the comments in this article already. <br> <p> Seems to me that if people really want the 'bestest fastest lowest latencinest performance' from their TCP stack for specialized applications, then a hardware-based offload of TCP/IP is the wrong approach. The right approach is to use an application-level network driver and let the application do the calculations. If you want to throw hardware acceleration at the problem, then have it be something that applications can use to help accelerate the calculations they need to do, rather than something that hides in a NIC. 
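A minimal sketch of the "let the application drive the acceleration hardware" idea above, assuming an x86 CPU with SSE4.2: CRC32C computed entirely in userspace through compiler intrinsics. The buffer and build line are invented for illustration; a real application would feed its own packet data through the same loop.
<pre>
/* Build (assumed): gcc -O2 -msse4.2 crc32c_user.c */
#include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* CRC32C of a buffer, computed with the CPU's CRC32 instruction,
 * entirely in the application -- no NIC offload, no kernel help. */
static uint32_t crc32c(const unsigned char *buf, size_t len)
{
    uint64_t crc = 0xffffffff;          /* standard CRC32C initial value */

    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, buf, 8);
        crc = _mm_crc32_u64(crc, chunk);
        buf += 8;
        len -= 8;
    }
    while (len--)
        crc = _mm_crc32_u8((uint32_t)crc, *buf++);

    return (uint32_t)crc ^ 0xffffffff;  /* final inversion */
}

int main(void)
{
    const unsigned char payload[] = "example payload";   /* made-up data */
    printf("crc32c = 0x%08x\n", crc32c(payload, sizeof(payload) - 1));
    return 0;
}
</pre>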
<br> <p> If the kernel is involved at all, then all it should do is provide a reasonable method for those applications to access the 'acceleration hardware' via some sort of mechanism like DRM drivers do.<br> <p> As far as tunneling goes... as much as I love things like vxlan, they really seem to be mostly used to work around IPv4 addressing limits. A much better approach seems to be to let your virtual machines/containers/etc. get their own IPv6 address automatically and then rely on layer 3 routing to deliver packets to everything. Any tunneling going on should just be IPv6-over-IPv4 UDP as a stopgap solution to deal with shitty 'cloud' networks. Otherwise you can just end up with tunnels in tunnels in tunnels, and nobody wants that. Any 'container/virt' infrastructure that doesn't integrate service discovery (and/or 'VIPS' or whatever) to help services and clients find things automatically is just a half-assed solution anyways, which makes the difficulty of dealing with 'static' IPv6 addresses and DHCP moot. If done correctly there is no reason at all that end users should be aware that they are using IPv6 or IPv4. <br> <br> Oh well. Never had good luck with 'offload engines' anyways.<br> </div> Sun, 13 Dec 2015 07:02:33 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667616/ https://lwn.net/Articles/667616/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; so that it can be routed efficiently and not be blocked by middleboxes that don't understand it [... ] We have slowly optimized ourselves into a situation where the development and deployment of new protocols (or even significant enhancements to existing protocols) is increasingly difficult; even well defined protocols like SCTP and DCCP are hard to deploy in real-world settings.</font><br> <p> I think one of the main reasons, and maybe even the main one, is the complete "black box" aspect of IP networking. It is more opaque than the most closed source software.<br> <p> The end-to-end principle was great and all, but it did not anticipate that the network would fight back and grow a lot of smarts (firewalls et al.) anyway, even when it was not supposed to. Since they never were and still are not supposed to exist, these smarts are not required to provide any feedback, so when they fail they just fail silently/stealthily and can be neither identified nor pinpointed. This tends to please network administrators in their basements, who are more than happy to dodge support calls[*] since they can't even be located.<br> <p> When even the most opaque application fails, one can typically still dig out some error message somewhere that can be Googled. Worst case, the behaviour can be described. With networking it's a dead-end road every way. Hidden so well, IP networking never changes, never gets fixed,... 
ossifies.<br> <p> [*] <a href="https://www.youtube.com/watch?v=rksCTVFtjM4">https://www.youtube.com/watch?v=rksCTVFtjM4</a><br> <p> <font class="QuotedText">&gt; An attempt to push hardware designers in a different direction may seem a bit like throwing Linux's weight around</font><br> <p> Turning things around just once.<br> <p> </div> Sat, 12 Dec 2015 05:26:11 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667545/ https://lwn.net/Articles/667545/ jezuch <div class="FormattedComment"> <font class="QuotedText">&gt; Basically, you have people using websockets over HTTPS to open tunnels between services.</font><br> <p> Though my professor at the university was not really amused when I started insisting that the Internet's protocol stack has 8 layers (instead of the "traditional" 7), with HTTP(S) as the top-most layer :) And that was a nontrivial number of years ago already.<br> </div> Fri, 11 Dec 2015 11:19:52 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667539/ https://lwn.net/Articles/667539/ kleptog <div class="FormattedComment"> The RFC is supposed to be a joke, but it's surprisingly close to the truth. Basically, you have people using websockets over HTTPS to open tunnels between services. Bypasses firewalls, proxies, load balancers, everything. Evolution in action: by punishing anything that looks out of the ordinary, all network traffic evolves to become indistinguishable from everything else.<br> </div> Fri, 11 Dec 2015 07:34:33 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667523/ https://lwn.net/Articles/667523/ BenHutchings <div class="FormattedComment"> The latency differences are in the microseconds. But aside from latency, it is also possible to achieve much higher packet rates with a more restricted user-space network stack.<br> </div> Fri, 11 Dec 2015 00:03:45 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667399/ https://lwn.net/Articles/667399/ alexl <div class="FormattedComment"> The cost is not that huge for a shared-memory architecture like the Intel GPUs. And if you're doing crc32, for instance, you could run parallel CRCs on different substrings and then combine them on the CPU (like crc32_combine from zlib).<br> <p> Still, I dunno if it is faster; it may be memory-bandwidth bound.<br> </div> Thu, 10 Dec 2015 09:21:59 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667396/ https://lwn.net/Articles/667396/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; We increasingly find ourselves on an Internet that can only manage TCP and UDP, and relatively unchanging versions of TCP and UDP at that.</font><br> <p> Of course you meant HTTP.<br> <p> <a href="https://tools.ietf.org/html/rfc3093">https://tools.ietf.org/html/rfc3093</a> Firewall Enhancement Protocol (FEP)<br> <p> </div> Thu, 10 Dec 2015 08:38:02 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667389/ https://lwn.net/Articles/667389/ eternaleye That's pretty much exactly what the <a href="http://www.barrelfish.org/ma-antoinek-dragonet.pdf">Dragonet networking architecture</a> is about; it's a fascinating design, at least in part because, from the perspective of userspace, the kernel can then be viewed as just such a NIC. 
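As a concrete illustration of the split-and-recombine CRC trick alexl mentions a couple of comments up, here is a minimal sketch using zlib's crc32() and crc32_combine(); the input buffer is invented for the example, and the two partial CRCs stand in for work that could be done in parallel or on an accelerator.
<pre>
/* Build (assumed): gcc -O2 crc_combine.c -lz */
#include <zlib.h>
#include <stdio.h>

int main(void)
{
    const unsigned char data[] = "a packet split into two pieces";
    size_t half = (sizeof(data) - 1) / 2;
    size_t rest = (sizeof(data) - 1) - half;

    /* CRC of each half -- these two calls are independent, so they
     * could run on different cores or on an accelerator. */
    uLong c1 = crc32(crc32(0L, Z_NULL, 0), data, half);
    uLong c2 = crc32(crc32(0L, Z_NULL, 0), data + half, rest);

    /* Stitch the partial CRCs together; only the length of the second
     * piece is needed, not its contents. */
    uLong combined = crc32_combine(c1, c2, rest);
    uLong whole    = crc32(crc32(0L, Z_NULL, 0), data, sizeof(data) - 1);

    printf("combined=%08lx whole=%08lx %s\n", combined, whole,
           combined == whole ? "(match)" : "(mismatch)");
    return 0;
}
</pre>
The combine step needs only the partial CRCs and the length of the second piece, which is what makes checksumming pieces independently and stitching the results together afterwards practical.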
Thu, 10 Dec 2015 04:33:02 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667276/ https://lwn.net/Articles/667276/ nysan <div class="FormattedComment"> Don't forget 6WIND.<br> And there is now an open source project at www.openfastpath.org<br> </div> Wed, 09 Dec 2015 15:01:06 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667221/ https://lwn.net/Articles/667221/ ballombe <div class="FormattedComment"> If you want low latency, bypassing the kernel entirely will always save you some milliseconds, so there is an incentive to do it.<br> </div> Wed, 09 Dec 2015 12:38:35 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667262/ https://lwn.net/Articles/667262/ mjthayer <div class="FormattedComment"> If the kernel just provided a full generic software stack which let drivers override selected parts in hardware at minimal complexity cost to the kernel stack, is there anecdotal evidence that people are likely to limit their use of protocols to ones which are accelerated in hardware? Especially if finding out which those are for any particular network set-up requires additional effort on their part, and things just work (slightly more slowly) for their preferred choice? Of course, if pieces of hardware along the network actively prevented the use of protocols, that would be a different matter, but I don't see how disallowing selected hardware acceleration would prevent that.<br> </div> Wed, 09 Dec 2015 11:00:50 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667257/ https://lwn.net/Articles/667257/ xav <div class="FormattedComment"> Nope. Passing data to/from the GPU has an enormous fixed cost, and GPUs are slow for non-parallel computations, so this is the exact case of what NOT to offload to a GPU.<br> </div> Wed, 09 Dec 2015 09:50:36 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667260/ https://lwn.net/Articles/667260/ paulj <div class="FormattedComment"> Agreed on the TCP checksum. E.g., it doesn't detect re-ordering, which has bitten me in the past with dodgy hardware.<br> </div> Wed, 09 Dec 2015 09:49:46 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667252/ https://lwn.net/Articles/667252/ paulj <div class="FormattedComment"> They can protect against bugs between the checksum being calculated and the packet passing through the L2 CRC engine.<br> <p> I've seen weird driver bugs where chunks of packets were being dropped after being sent by userspace. The L2 CRC was fine, but the kernel-applied header checksum was wrong. It turned out to be a subtle bug in the proprietary forwarding hardware driver, on raw sockets, IIRC.<br> </div> Wed, 09 Dec 2015 09:13:51 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667248/ https://lwn.net/Articles/667248/ alexl <div class="FormattedComment"> Would it not be possible to use the GPU to offload some of these calculations?<br> </div> Wed, 09 Dec 2015 07:46:34 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667242/ https://lwn.net/Articles/667242/ luto <div class="FormattedComment"> I'd love to see some focus shift from extremely weak checksums like UDP's to stronger ones like CRC. 
CRC has all the magic properties needed: it's linear, so you can subtract parts off, and it can be shifted, so you can take a CRC of some suffix or middle chunk of a packet and extend it to the CRC of the whole thing.<br> <p> And yes, I have seen bad packets over TCP that survive the checksum check. It's just too weak.<br> </div> Wed, 09 Dec 2015 03:49:00 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667206/ https://lwn.net/Articles/667206/ josh <div class="FormattedComment"> Bad checksums do happen; among other things, I've seen case studies of them happening due to memory errors. Also see <a href="http://dinaburg.org/bitsquatting.html">http://dinaburg.org/bitsquatting.html</a> , and notice the mentions that bit errors at some phases of the process will get rejected due to checksums.<br> <p> Some protocol in the stack needs to have *cryptographic* integrity; for instance, TLS provides cryptographic integrity guarantees. However, at the lower levels, a quick checksum to confirm valid packet delivery allows the network stack to say "didn't get that, send it again", transparently to the application, as part of the normal ACK/NAK process.<br> <p> Also see <a href="https://en.wikipedia.org/wiki/End-to-end_principle">https://en.wikipedia.org/wiki/End-to-end_principle</a> .<br> </div> Tue, 08 Dec 2015 21:45:36 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667204/ https://lwn.net/Articles/667204/ josh <div class="FormattedComment"> <font class="QuotedText">&gt; For a good discussion of why, look up what happened with Van Jacobson channels.</font><br> <p> I found the LWN article presenting those, but the only reference I have on their disposition suggests that the code never got published and remained slideware.<br> <p> <font class="QuotedText">&gt; In short, Linux's stack is bigger and/or slower than some alternatives because it does [much] more than those alternatives, and by the time you add $FeatureX to the alternatives it's no longer as small or fast as it used to be.</font><br> <p> That's not an argument that the Linux stack *can't* match the size or performance of those alternatives. Given that people successfully use those alternatives, clearly $FeatureX is not essential for them.<br> <p> For example, matching the size of lwIP would clearly require compiling out large parts of the stack. And matching the performance of DPDK would require large parts of the kernel to stop touching packets, and the moment you touch a packet in a way that requires additional software processing, performance properties would nosedive.<br> </div> Tue, 08 Dec 2015 21:35:45 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667199/ https://lwn.net/Articles/667199/ yootis <div class="FormattedComment"> <p> Is there even value in checksums in headers anymore? All of the transport mechanisms like ethernet already have much more powerful CRCs. I've never heard of packets getting delivered with bad checksums, so why are they even used? 
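For readers wondering what that header checksum actually is: TCP and UDP use the 16-bit ones' complement sum of RFC 1071. A minimal userspace sketch (byte order simplified, data invented) shows both how cheap and how weak it is; because the addition commutes, swapping two aligned 16-bit words leaves the result unchanged, which is the re-ordering blindness mentioned a few comments up.
<pre>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* RFC 1071 Internet checksum: 16-bit ones' complement sum of the data.
 * Addition is commutative, so exchanging two aligned 16-bit words in the
 * buffer yields the same checksum -- such re-orderings go undetected. */
static uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    while (len >= 2) {
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                   /* odd trailing byte, padded with zero */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)          /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}

int main(void)
{
    /* Made-up payloads; b is a with its first two 16-bit words swapped,
     * yet both produce the same checksum. */
    uint8_t a[] = { 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc };
    uint8_t b[] = { 0x56, 0x78, 0x12, 0x34, 0x9a, 0xbc };

    printf("a: 0x%04x  b: 0x%04x\n",
           inet_checksum(a, sizeof(a)), inet_checksum(b, sizeof(b)));
    return 0;
}
</pre>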
<br> <p> <p> </div> Tue, 08 Dec 2015 21:19:58 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667188/ https://lwn.net/Articles/667188/ flussence <div class="FormattedComment"> Keeping everything but generic number-crunching in the kernel is probably a very good idea, for much the same reasons as RAID.<br> <p> I was experimenting with `ethtool -k` settings on my LAN the other day; the hardware doesn't have much of a feature set and it's all off by default, but on one end (an RTL8168e) enabling any of the interesting offloading features it claims to support... breaks everything. Nothing more fun than silent failures caused by buggy hardware!<br> <p> Admittedly that experience is based on $0.10 desktop Realtek chips, but at the same time, paying 10-1000× more for any type of hardware doesn't correlate linearly with quality.<br> </div> Tue, 08 Dec 2015 20:12:28 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667187/ https://lwn.net/Articles/667187/ pizza <div class="FormattedComment"> <font class="QuotedText">&gt; I don't see an *obvious* reason why Linux's networking stack needs to be significantly larger than lwIP, or significantly slower than DPDK. Today it is, but that doesn't seem like an innate property.</font><br> <p> For a good discussion of why, look up what happened with Van Jacobson channels.<br> <p> In short, Linux's stack is bigger and/or slower than some alternatives because it does [much] more than those alternatives, and by the time you add $FeatureX to the alternatives it's no longer as small or fast as it used to be.<br> </div> Tue, 08 Dec 2015 19:34:27 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667186/ https://lwn.net/Articles/667186/ SEJeff <div class="FormattedComment"> And DPDK isn't even the only one. Mellanox's VMA (from before it bought Voltaire) and Solarflare's OpenOnload have both been around much longer than DPDK. There are entire industries (finance) which rely on things like this for extremely low latency.<br> </div> Tue, 08 Dec 2015 19:28:47 +0000 Checksum offloads and protocol ossification https://lwn.net/Articles/667166/ https://lwn.net/Articles/667166/ josh <div class="FormattedComment"> <font class="QuotedText">&gt; convincing the networking maintainer that the hardware designers have heard his complaint may take a little longer. </font><br> <p> That's going to cause problems if the only acceptable indication of "heard his complaint" is "decided he's right and done what he's demanding". If the answer turns out to be "no", I expect an ongoing demonstration of selective hearing difficulties towards any answer that doesn't sound like "yes". And even if the answer is "yes", hardware development cycles are long; hopefully all work on hardware offload won't stall until a new generation of hardware exists with these features.<br> <p> To quote another mail from the thread (<a href="http://thread.gmane.org/gmane.linux.network/388085/focus=388135">http://thread.gmane.org/gmane.linux.network/388085/focus=...</a>):<br> <p> "So we (as a kernel community) have users *NOW* who want this<br> feature, and hardware that is available *now* that has this feature.<br> Do you think we should wait for a unicorn to arrive that has a fully<br> programmable de-ossified checksum engine? How long?<br> <p> [...]<br> <p> I think that trying to force an agenda with no fore-warning and also<br> punishing the users in order to get hardware vendors to change is the<br> wrong way to go about this. 
All you end up with is people just asking<br> you why their hardware doesn't work in the kernel.<br> <p> You have a proposal, let's codify it and enable it for the future, and<br> especially be *really* clear what you want hardware vendors to<br> implement so that they get it right."<br> <p> The statement in the article that the networking developers "are, instead, developing a simpler, protocol-independent mechanism by which the hardware can support any protocol with checksum offloading." does not give any indication of the degree of overlap or discussion between the developers of that mechanism and the set of people who design networking hardware. Developing a mechanism for offloading functionality to networking hardware without working with hardware developers is like developing a specification for a new syscall without talking to kernel developers.<br> <p> One question that Linux networking needs to be dealing with is "why are an increasing number of users bypassing the Linux networking stack entirely, such as to get more performance or smaller size?". DPDK and its performance, and lwIP/uIP and their size, are demonstrations that the Linux networking stack fails to meet the requirements of many potential users. In an ideal world, either those shouldn't exist at all because Linux already meets their requirements, or the Linux network stack should be designed to better integrate frameworks like those and bring them into the fold.<br> <p> I don't see an *obvious* reason why Linux's networking stack needs to be significantly larger than lwIP, or significantly slower than DPDK. Today it is, but that doesn't seem like an innate property.<br> </div> Tue, 08 Dec 2015 19:12:37 +0000
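To make "codify it and be really clear what you want hardware vendors to implement" concrete: the protocol-independent scheme discussed in the article amounts to handing the device two numbers per packet, where to start summing and where to store the result. The rough userspace sketch below mirrors the kernel's csum_start/csum_offset convention but is illustrative only; it assumes the stack has already seeded the checksum field with the pseudo-header sum, and the packet layout in main() is made up.
<pre>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Ones' complement sum over a byte range (big-endian 16-bit words). */
static uint32_t ocsum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    while (len >= 2) {
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* What a protocol-independent checksum offload is asked to do: sum from
 * csum_start to the end of the packet, fold, complement, and write the
 * 16-bit result at csum_start + csum_offset.  The device never needs to
 * know which protocol (UDP, TCP, something tunnelled) it is handling. */
static void offload_checksum(uint8_t *pkt, size_t len,
                             size_t csum_start, size_t csum_offset)
{
    uint32_t sum = ocsum(pkt + csum_start, len - csum_start);

    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    sum = ~sum & 0xffff;

    pkt[csum_start + csum_offset]     = (uint8_t)(sum >> 8);
    pkt[csum_start + csum_offset + 1] = (uint8_t)(sum & 0xff);
}

int main(void)
{
    /* Dummy 16-byte "packet": an invented 8-byte header followed by an
     * 8-byte transport segment whose checksum field sits at offset 6. */
    uint8_t pkt[16] = { 0 };

    offload_checksum(pkt, sizeof(pkt), 8, 6);
    printf("stored checksum: 0x%02x%02x\n", pkt[14], pkt[15]);
    return 0;
}
</pre>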