
Missing the AF_BUS

Posted Jul 5, 2012 5:47 UTC (Thu) by daniel (guest, #3181)
In reply to: Missing the AF_BUS by alonz
Parent article: Missing the AF_BUS

Dave seems to be labouring under the misapprehension that his TCP stack is efficient. It isn't. It is a big, rambling, inefficient pile of spaghetti.



Missing the AF_BUS

Posted Jul 5, 2012 6:00 UTC (Thu) by alonz (subscriber, #815) [Link] (5 responses)

In Dave's defense, I will note that the Linux TCP stack does appear to be extremely efficient compared to other OSes… It's just not always the perfect hammer for the screws you may be using.

Missing the AF_BUS

Posted Jul 5, 2012 6:10 UTC (Thu) by daniel (guest, #3181) [Link] (4 responses)

Then the others must suck even more, but the Linux TCP stack still sucks.

Missing the AF_BUS

Posted Jul 5, 2012 17:10 UTC (Thu) by jond (subscriber, #37669) [Link] (3 responses)

It's the quality of discourse that keeps me coming back to LWN.

Missing the AF_BUS

Posted Jul 7, 2012 1:40 UTC (Sat) by daniel (guest, #3181) [Link] (2 responses)

Good point. But what exactly is the correct technical response to an argument of the form "it's OK if we suck because somebody else sucks even worse"? Never mind that that premise is stated without support, whereas the premise that the Linux TCP stack is a big pile of spaghetti is easily verified.

Missing the AF_BUS

Posted Jul 13, 2012 5:43 UTC (Fri) by Tov (subscriber, #61080) [Link]

Easy! Instead of waving your hands and presenting your unfounded opinion, you present some facts...

Missing the AF_BUS

Posted Jul 15, 2012 7:55 UTC (Sun) by philomath (guest, #84172) [Link]

How easy? Can you just give me a starting point, please?

Missing the AF_BUS

Posted Jul 5, 2012 18:26 UTC (Thu) by josh (subscriber, #17465) [Link] (9 responses)

Would you mind providing (links to) more information, for people interested in learning about the purported inefficiencies in Linux's TCP stack?

Missing the AF_BUS

Posted Jul 6, 2012 15:03 UTC (Fri) by pspinler (subscriber, #2922) [Link] (1 responses)

I'm not sure whether Linux's TCP stack is inefficient compared to other TCP stacks, but the networking stack is certainly complex and multi-layered. Consider all the basic TCP protocol code (reliability, packet fragmentation and reassembly, etc.), then layer netfilter on top of it and the routing logic and IP stack underneath it, and it's easy to construct packets that traverse significantly long code paths.

Certainly all that complexity can't be great for performance.

It's the argument I make for Fibre Channel vs. iSCSI. It's true that iSCSI hardware (being just standard networking gear) is a lot cheaper and does the job 90-95% of the time. But in the edge cases, especially w.r.t. latency, fibre still wins, largely because it's simple in comparison.

-- Pat

Missing the AF_BUS

Posted Jul 9, 2012 2:35 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> Certainly all that complexity can't be great for performance.

That's something worth testing, scientifically.

> It's the argument I make for fibre channel v. iscsi. It's true that iscsi hardware (being just standard networking stuff) is a lot cheaper and does the job 90-95% of the time. But in the edge case, especially w.r.t latency, fibre still wins, largely because it's simple in comparison.

One thing I would like to point out about this example: FC implements many of the features of Ethernet and TCP/IP ... differently, so in that sense the complexity is at least comparable, though probably not equal. As for implementation complexity, I think FC gets off easier because, as a practical matter, it is used in closed networks, often with all components from the same vendor. Ethernet and TCP/IP have to deal with far more varied equipment and networks and have to be battle-tested against _anything_ happening; all that extra implementation complexity has a real reason for being there.

Missing the AF_BUS

Posted Jul 9, 2012 6:02 UTC (Mon) by daniel (guest, #3181) [Link] (6 responses)

I'll have to tell you about it, because the actual code is buried deep in somebody's trading engine and they would likely take issue with me posting it on the web. Profiling turned up some really bad CPU bumps in places you would not immediately suspect, like UDP send, which was taking nearly a microsecond per packet more than it should. I thought there would actually be some deep reason for that, but when I dug in I found that the reason was just sloppy, rambling code, pure and simple. I straightened it all out and cut the CPU overhead in half, consequently reducing the hop latency by that amount. I went on to analyze the rest of the stack to some extent and found it was all like that. You can too, all you need to do is go look at the code.

Here's a lovely bit:

http://lxr.linux.no/#linux+v3.4.4/net/ipv4/tcp_output.c#L796

This is part of a call chain that goes about 20 levels deep. There is much worse in there. See, that stuff looks plausible and if you listen to the folklore it sounds fast. But it actually isn't, which I know beyond a shadow of a doubt.

Missing the AF_BUS

Posted Jul 9, 2012 6:53 UTC (Mon) by daniel (guest, #3181) [Link] (3 responses)

Here's a better example:

http://lxr.linux.no/#linux+v3.4.4/net/ipv4/ip_output.c#L799

This code just kills efficiency by a thousand cuts. There is no single culprit, it is just that all that twisting and turning, calling lots of little helpers and layering everything through an skb editing API that successfully confuses the optimizer adds up to an embarrassing amount of overhead. First rule to remember? Function calls are not free. Not at the speeds networks operate these days.

Missing the AF_BUS

Posted Jul 9, 2012 8:18 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

Actually, predicted function calls *are* nearly free on modern CPUs. Of course, function calls stuck deep inside conditionals are less likely to be successfully predicted as taken -- and unpredicted/mispredicted function calls (like all other mispredicted, non-speculated branches) are expensive as hell. However, these days I don't believe there is much more reason to be concerned about function calls than there is to be concerned about any other conditional. (Specialists in deep x86 lore, which I am very much not and who I am merely reiterating from dim and vague memory, are welcome to contradict me, and probably will!)

Missing the AF_BUS

Posted Jul 9, 2012 23:06 UTC (Mon) by daglwn (guest, #65432) [Link]

The call is cheap. The saving/restoring of registers and lost optimization opportunities are not.

Missing the AF_BUS

Posted Jul 9, 2012 18:40 UTC (Mon) by butlerm (subscriber, #13312) [Link]

>This code just kills efficiency by a thousand cuts. There is no single culprit, it is just that all that twisting and turning, calling lots of little helpers...

Much of the complexity of that function has to do with kernel support for fragmented skbs, which is required for packets that are larger than the page size. That is the sort of thing that would go away if the kernel adopted a kernel page size larger than the hardware page size in cases where the latter is ridiculously small.

I am not sure what the real benefits of managing everything in terms of 4K pages are on a system with modern memory sizes. Perhaps the idea of managing everything in terms of 64K pages (i.e. in groups of 16 hardware pages) could be revisited. That would dramatically simplify much of the networking code, because support for fragmented skbs could be dropped. No doubt it would have other benefits as well.

Missing the AF_BUS

Posted Jul 9, 2012 9:11 UTC (Mon) by gioele (subscriber, #61675) [Link]

> I straightened it all out and cut the CPU overhead in half, consequently reducing the hop latency by that amount. I went on to analyze the rest of the stack to some extent and found it was all like that. You can too, all you need to do is go look at the code.

> This is part of a call chain that goes about 20 levels deep. There is much worse in there. See, that stuff looks plausible and if you listen to the folklore it sounds fast. But it actually isn't, which I know beyond a shadow of a doubt.

Don't you have some notes, implementation ideas, or performance tests that you want to share with the rest of the kernel community? I'm pretty sure they would love to hear how to cut the CPU overhead of UDP messages in half without functional regressions.

An improvement of this magnitude would surely reduce the battery consumption of mobile applications; even if the mainline developers are not interested, the developers of mobile-oriented forks like Android surely will be.

Missing the AF_BUS

Posted Jul 9, 2012 20:26 UTC (Mon) by butlerm (subscriber, #13312) [Link]

I should add that fragmented skbs are used for zero copy support too, so if the idea is to simplify the networking stack by dropping them, zero copy would be out. On the other hand, zero copy seems to be usable for sendfile() and not much else, so that doesn't sound like much of a loss if it improves the much more common case.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds