Missing the AF_BUS

Posted Jul 9, 2012 6:02 UTC (Mon) by daniel (guest, #3181)
In reply to: Missing the AF_BUS by josh
Parent article: Missing the AF_BUS

I'll have to tell you about it, because the actual code is buried deep in somebody's trading engine and they would likely take issue with me posting it on the web. Profiling turned up some really bad CPU bumps in places you would not immediately suspect, like UDP send, which was taking nearly a microsecond per packet more than it should. I thought there would actually be some deep reason for that, but when I dug in I found that the reason was just sloppy, rambling code, pure and simple. I straightened it all out and cut the CPU overhead in half, consequently reducing the hop latency by that amount. I went on to analyze the rest of the stack to some extent and found it was all like that. You can too, all you need to do is go look at the code.

Here's a lovely bit:

http://lxr.linux.no/#linux+v3.4.4/net/ipv4/tcp_output.c#L796

This is part of a call chain that goes about 20 levels deep. There is much worse in there. See, that stuff looks plausible and if you listen to the folklore it sounds fast. But it actually isn't, which I know beyond a shadow of a doubt.


Missing the AF_BUS

Posted Jul 9, 2012 6:53 UTC (Mon) by daniel (guest, #3181) [Link] (3 responses)

Here's a better example:

http://lxr.linux.no/#linux+v3.4.4/net/ipv4/ip_output.c#L799

This code just kills efficiency by a thousand cuts. There is no single culprit, it is just that all that twisting and turning, calling lots of little helpers and layering everything through an skb editing API that successfully confuses the optimizer adds up to an embarrassing amount of overhead. First rule to remember? Function calls are not free. Not at the speeds networks operate these days.

Missing the AF_BUS

Posted Jul 9, 2012 8:18 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

Actually, predicted function calls *are* nearly free on modern CPUs. Of course, function calls stuck deep inside conditionals are less likely to be successfully predicted as taken -- and unpredicted/mispredicted function calls (like all other mispredicted, non-speculated branches) are expensive as hell. However, these days I don't believe there is much more reason to be concerned about function calls than there is to be concerned about any other conditional. (I am merely reiterating this from dim and vague memory. Specialists in deep x86 lore, which I am very much not, are welcome to contradict me, and probably will!)

Missing the AF_BUS

Posted Jul 9, 2012 23:06 UTC (Mon) by daglwn (guest, #65432) [Link]

The call is cheap. The saving/restoring of registers and lost optimization opportunities are not.
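
To make the distinction concrete, here is a minimal C sketch. It is not taken from the kernel; the function names are hypothetical, and `__attribute__((noinline))` is a GCC/Clang extension used here only to force an out-of-line call. Both loops compute the same sum. In the first, the compiler has to emit a real call on every iteration and keep live values in registers that survive the call. In the second, the helper can be folded into the loop, which leaves the optimizer free to unroll or vectorize it. How large the difference is depends on compiler, flags and CPU, so measure before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, deliberately kept out of line (GCC/Clang attribute). */
__attribute__((noinline))
static uint32_t add_byte_call(uint32_t sum, uint8_t b)
{
    return sum + b;
}

/* Same helper, but visible to the optimizer at the call site. */
static inline uint32_t add_byte_inline(uint32_t sum, uint8_t b)
{
    return sum + b;
}

/* One real call per byte: the loop body is opaque to the optimizer. */
uint32_t sum_with_calls(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = add_byte_call(sum, buf[i]);
    return sum;
}

/* Helper can be inlined: the compiler sees a plain accumulation loop. */
uint32_t sum_inlined(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = add_byte_inline(sum, buf[i]);
    return sum;
}
```

Comparing the assembly of the two functions (for example with `gcc -O2 -S`) is a quick way to see what the out-of-line call costs in a given build.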

Missing the AF_BUS

Posted Jul 9, 2012 18:40 UTC (Mon) by butlerm (subscriber, #13312) [Link]

>This code just kills efficiency by a thousand cuts. There is no single culprit, it is just that all that twisting and turning, calling lots of little helpers...

Much of the complexity of that function has to do with kernel support for fragmented skbs, which is required for packets that are larger than the page size. That is the sort of thing that would go away if the kernel adopted a kernel page size larger than the hardware page size in cases where the latter is ridiculously small.

I am not sure what the real benefits of managing everything in terms of 4K pages are on a system with modern memory sizes. Perhaps the idea of managing everything in terms of 64K pages (i.e. in groups of 16 hardware pages) could be revisited. That would dramatically simplify much of the networking code, because support for fragmented skbs could be dropped. No doubt it would have other benefits as well.
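
The arithmetic behind that suggestion can be sketched in a few lines of C. This is a simplification under stated assumptions: `pages_needed` is a hypothetical helper, not a kernel function, and it ignores the headers and bookkeeping data that a real skb also has to hold. It only counts how many pages of a given size a payload would span.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of pages of size page_size needed to hold
 * len bytes (rounding up). Assumes page_size is nonzero and that
 * len + page_size does not overflow size_t.
 */
size_t pages_needed(size_t len, size_t page_size)
{
    return (len + page_size - 1) / page_size;
}
```

With 4K pages, a 9000-byte jumbo-frame payload spans `pages_needed(9000, 4096) == 3` pages, and a 64K payload spans 16. With 64K pages both fit in a single page, which is the case where fragment handling would no longer be needed for size reasons.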

Missing the AF_BUS

Posted Jul 9, 2012 9:11 UTC (Mon) by gioele (subscriber, #61675) [Link]

> I straightened it all out and cut the CPU overhead in half, consequently reducing the hop latency by that amount. I went on to analyze the rest of the stack to some extent and found it was all like that. You can too, all you need to do is go look at the code.

> This is part of a call chain that goes about 20 levels deep. There is much worse in there. See, that stuff looks plausible and if you listen to the folklore it sounds fast. But it actually isn't, which I know beyond a shadow of a doubt.

Don't you have some notes, implementation ideas or performance tests that you want to share with the rest of the kernel community? I'm pretty sure that they would love to hear how to cut the CPU overhead of UDP messages in half without regressions in functionality.

This kind of impact would surely reduce the battery consumption of mobile applications, so even if the main developers are not interested, devs of mobile-oriented forks like Android surely will be.

Missing the AF_BUS

Posted Jul 9, 2012 20:26 UTC (Mon) by butlerm (subscriber, #13312) [Link]

I should add that fragmented skbs are used for zero copy support too, so if the idea is to simplify the networking stack by dropping them, zero copy would be out. On the other hand, zero copy seems to be usable for sendfile() and not much else, so that doesn't sound like much of a loss if it improves the much more common case.

