I'll have to describe it rather than show it, because the actual code is buried deep in somebody's trading engine and they would likely take issue with me posting it on the web. Profiling turned up some really bad CPU hot spots in places you would not immediately suspect, like UDP send, which was taking nearly a microsecond per packet more than it should. I expected to find some deep reason for that, but when I dug in, the reason turned out to be sloppy, rambling code, pure and simple. I straightened it all out and cut the CPU overhead in half, which reduced the hop latency by the same amount. I went on to analyze some of the rest of the stack and found it was all like that. You can do the same; all you need to do is go look at the code.
This is part of a call chain that goes about 20 levels deep. There is much worse in there. See, that stuff looks plausible, and if you listen to the folklore it sounds fast. But it actually isn't, and I know that beyond a shadow of a doubt because I profiled it.
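If you want a rough number for your own machine before you go read the code, here is a minimal sketch of timing a UDP send. It is not the profiling setup from the trading engine; the function name `mean_udp_send_ns`, the loopback destination, the packet count, and the payload size are all assumptions made up for illustration. It is in Python only to keep it short, so the figure it reports includes interpreter overhead and will be well above what a C caller sees. The technique (time a large batch of sends, divide by the count) carries over directly.

```python
import socket
import time


def mean_udp_send_ns(packets: int = 10_000, payload_size: int = 64) -> float:
    """Return the mean wall-clock cost of one sendto() call, in nanoseconds."""
    # Hypothetical setup: a receiver bound to an ephemeral loopback port so the
    # datagrams have a real destination. It never reads, so once its buffer
    # fills the kernel just drops the excess packets.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"x" * payload_size
    dest = receiver.getsockname()
    try:
        # Warm-up sends so one-time setup costs are not counted.
        for _ in range(100):
            sender.sendto(payload, dest)
        start = time.perf_counter_ns()
        for _ in range(packets):
            sender.sendto(payload, dest)
        elapsed = time.perf_counter_ns() - start
    finally:
        sender.close()
        receiver.close()
    return elapsed / packets


if __name__ == "__main__":
    print(f"mean sendto cost: {mean_udp_send_ns():.0f} ns per packet")
```

A number like this only tells you what the whole send path costs. To find out where the time goes inside it, you still need a profiler and the source.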