TCP friends
A buffer full of data sent on the network does not travel alone. Instead, the TCP layer must split that buffer into reasonably-sized packets, prepend a set of TCP headers to it, and, possibly, calculate a checksum. The packets are then passed to the IP layer, which throws its own headers onto the beginning of the buffer, finds a suitable network interface, and hands the result off to that interface for transmission. At the receiving end the process is reversed: the IP and TCP headers are stripped, checksums are compared, and the data is merged back into a seamless stream for the receiving process.
It is all a fair amount of work, but it allows the two processes to communicate without having to worry about all that happens in between. But, if the two processes are on the same physical machine, much of that work is not really necessary. The bulk of the overhead in the network stack is there to ensure that packets do not get lost on their way to the destination, that the data does not get corrupted in transit, and that nothing gets forgotten or reordered. Most of these perils do not threaten data that never leaves the originating system, so much of the work done by the networking stack is entirely wasted in this case.
That much has been understood by developers for many years, of course. That is why many programs have been written specifically to use Unix-domain sockets when communicating with local peers. Unix-domain sockets ("pipes") provide the same sort of stream abstraction, but, since they do not communicate between systems, they avoid all of the overhead added by a full network stack. So faster communications between local processes is possible now, but it must be coded explicitly in any program that wishes to use it.
What if local TCP communications could be accelerated to the point that they are competitive with Unix-domain sockets? That is the objective of this patch from Bruce Curtis. The idea is simple enough to explain: when both endpoints of a TCP connection are on the same machine, the two sockets are marked as being "friends" in the kernel. Data written to such a socket will be immediately queued for reading on the friend socket, bypassing the network stack entirely. The TCP, IP, and loopback device layers are simply shorted out. The actual patch, naturally enough, is rather more complicated than this simple description would suggest; friend sockets must still behave like TCP sockets to the point that applications cannot tell the difference, so friend-handling tweaks must be applied to many places in the TCP stack.
One would hope that this approach would yield local networking speeds that are at least close to competitive with those achieved using Unix-domain sockets. Interestingly, Bruce's patch not only achieves that, but it actually does better than Unix-domain sockets in almost every benchmark he ran. "Better" means both higher data transmission rates and lower latencies on round-trip tests. Bruce does not go into why that is; perhaps the amount of attention that has gone into scalability in the networking stack pays off in his 16-core testing environment.
There is one important test for which Bruce posted no results: does the TCP friends patch make things any slower for non-local connections where the stack bypass cannot be used? Some of the network stack hot paths can be sensitive to even small changes, so one can imagine that the networking developers will want some assurance that the non-bypass case will not be penalized if this patch goes in. There are various other little issues that need to be dealt with, but this patch looks like it is on track for merging in the relatively near future.
If it is merged, the result should be faster local communications between
processes without the need for special-case code using Unix-domain
sockets. It could also be most useful on systems hosting containerized
guests where cross-container communications are needed; one suspects that
Google's use case looks somewhat like that. In the end, it is hard to
argue against a patch that can speed local communications by as much as a
factor of five, so chances are this change will go into the mainline before
too long.
| Index entries for this article | |
|---|---|
| Kernel | Networking |
