TCP friends
A buffer full of data sent on the network does not travel alone. Instead, the TCP layer must split that buffer into reasonably-sized packets, prepend a TCP header to each of them, and, possibly, calculate a checksum. The packets are then passed to the IP layer, which throws its own headers onto the front of each packet, finds a suitable network interface, and hands the result off to that interface for transmission. At the receiving end the process is reversed: the IP and TCP headers are stripped, checksums are verified, and the data is merged back into a seamless stream for the receiving process.
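As a rough illustration of that encapsulation (a userspace sketch only; the function name and the fixed 20-byte headers are invented for the example, and the kernel's real code path is far more involved):

```c
/* Illustrative sketch of the layering described above: a chunk of
 * application data picks up a TCP header, then an IP header, before
 * it is handed to a network interface. */
#include <string.h>
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

size_t build_packet(unsigned char *frame, const void *data, size_t len)
{
    struct iphdr  *ip  = (struct iphdr *)frame;
    struct tcphdr *tcp = (struct tcphdr *)(frame + sizeof(*ip));

    memset(frame, 0, sizeof(*ip) + sizeof(*tcp));
    ip->version  = 4;
    ip->ihl      = 5;                      /* 20-byte IP header */
    ip->protocol = IPPROTO_TCP;
    ip->tot_len  = htons(sizeof(*ip) + sizeof(*tcp) + len);
    tcp->doff    = 5;                      /* 20-byte TCP header */
    /* ports, sequence numbers, and checksums omitted for brevity */

    memcpy(frame + sizeof(*ip) + sizeof(*tcp), data, len);
    return sizeof(*ip) + sizeof(*tcp) + len;
}
```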
It is all a fair amount of work, but it allows the two processes to communicate without having to worry about all that happens in between. But, if the two processes are on the same physical machine, much of that work is not really necessary. The bulk of the overhead in the network stack is there to ensure that packets do not get lost on their way to the destination, that the data does not get corrupted in transit, and that nothing gets forgotten or reordered. Most of these perils do not threaten data that never leaves the originating system, so much of the work done by the networking stack is entirely wasted in this case.
That much has been understood by developers for many years, of course. That is why many programs have been written specifically to use Unix-domain sockets when communicating with local peers. Unix-domain sockets provide the same sort of stream abstraction, but, since they never carry data between systems, they avoid all of the overhead added by a full network stack. So faster communication between local processes is possible now, but it must be coded explicitly into any program that wishes to use it.
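For reference, talking to a local peer over a Unix-domain socket looks something like this minimal client (the socket path here is made up for the example):

```c
/* Minimal Unix-domain client: the same stream semantics as TCP, but
 * with no network stack in the data path. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0) { perror("socket"); return 1; }
    strncpy(addr.sun_path, "/tmp/example.sock", sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }
    write(fd, "hello\n", 6);
    close(fd);
    return 0;
}
```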
What if local TCP communications could be accelerated to the point that they are competitive with Unix-domain sockets? That is the objective of this patch from Bruce Curtis. The idea is simple enough to explain: when both endpoints of a TCP connection are on the same machine, the two sockets are marked as being "friends" in the kernel. Data written to such a socket will be immediately queued for reading on the friend socket, bypassing the network stack entirely. The TCP, IP, and loopback device layers are simply shorted out. The actual patch, naturally enough, is rather more complicated than this simple description would suggest; friend sockets must still behave like TCP sockets to the point that applications cannot tell the difference, so friend-handling tweaks must be applied to many places in the TCP stack.
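The core idea might be sketched, in purely hypothetical pseudo-kernel C that is not taken from the patch itself, as follows; the `sk_friend` field and the `tcp_normal_transmit()` fallback are invented here for illustration:

```c
/* Purely illustrative pseudo-code -- not from the actual patch.  If
 * the peer socket is a local "friend", queue the buffer straight onto
 * its receive queue and wake the reader; otherwise fall back to the
 * normal TCP transmit path. */
static int tcp_friend_send(struct sock *sk, struct sk_buff *skb)
{
    struct sock *friend = sk->sk_friend;     /* hypothetical field */

    if (friend) {
        skb_queue_tail(&friend->sk_receive_queue, skb);
        friend->sk_data_ready(friend);       /* wake up the reader */
        return 0;
    }
    return tcp_normal_transmit(sk, skb);     /* stand-in for the real path */
}
```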
One would hope that this approach would yield local networking performance at least close to that achieved with Unix-domain sockets. Interestingly, Bruce's patch not only achieves that; it actually does better than Unix-domain sockets in almost every benchmark he ran. "Better" means both higher data-transmission rates and lower latencies on round-trip tests. Bruce does not go into why that is; perhaps the amount of attention that has gone into networking-stack scalability pays off in his 16-core testing environment.
There is one important test for which Bruce posted no results: does the TCP friends patch make things any slower for non-local connections where the stack bypass cannot be used? Some of the network stack hot paths can be sensitive to even small changes, so one can imagine that the networking developers will want some assurance that the non-bypass case will not be penalized if this patch goes in. There are various other little issues that need to be dealt with, but this patch looks like it is on track for merging in the relatively near future.
If it is merged, the result should be faster local communications between
processes without the need for special-case code using Unix-domain
sockets. It could also be most useful on systems hosting containerized
guests where cross-container communications are needed; one suspects that
Google's use case looks somewhat like that. In the end, it is hard to
argue against a patch that can speed local communications by as much as a
factor of five, so chances are this change will go into the mainline before
too long.
Index entries for this article
Kernel | Networking
TCP friends

Posted Aug 16, 2012 13:45 UTC (Thu) by Kioob (subscriber, #56482)

For example, you can prevent some local users from connecting to a local TCP socket.
TCP friends

Posted Aug 16, 2012 23:12 UTC (Thu) by BenHutchings (subscriber, #37955)

But you can turn most of these offloads/optimisations off if you want (though you can't force checksum generation for loopback). The same goes for tcp_friends.
TCP friends

Posted Aug 18, 2012 0:00 UTC (Sat) by intgr (subscriber, #39733)

The lack of checksums is a trivial issue. The lack of segmentation is also irrelevant -- loopback simply appears as an interface with an infinitely large MTU, and the packets captured on "lo" are still basically valid TCP packets.

Not seeing any packets, and being unable to debug network interactions *at all*, is a huge deal compared to these.
What about other reliable transports?

Posted Aug 18, 2012 22:41 UTC (Sat) by giraffedata (guest, #1954)

I don't think it's reliable transport per se that makes simplified communication possible, but direct transport. TCP's complexity is there because the packets travel through a complex network: they get switched around here and there, and compete for resources with streams between other, independent nodes.

So I think friend sockets would be appropriate for any two sockets connected via a dedicated link. As well as an in-kernel link or an SSH-based TCP link, that could be a PPP link or an Ethernet with only two nodes on it. In all those cases, a socket friendship protocol like this would be useful.

In fact, even without the ability to use existing TCP/IP applications, it would be nice to have stream sockets that exploit such a direct connection, but I've never seen them.
TCP friends

Posted Aug 18, 2012 19:58 UTC (Sat) by rvfh (guest, #31018)

Completely OT, but I would love it if you could explain how to pass fd's through UNIX sockets and why this is not possible with INET sockets :-)
TCP friends

Posted Aug 18, 2012 21:01 UTC (Sat) by viro (subscriber, #7872)

man 2 recvmsg
man 7 unix and see the part about SCM_RIGHTS in there
man 3 cmsg for how to set/parse the ancillary data

As to why it's not possible for AF_INET: it's obviously not going to work between different hosts (we are, in effect, passing pointers to opened struct file instances; descriptors are converted to such pointers on the sending end, and the pointers are inserted into the descriptor table by the recipient; the resulting descriptor numbers are stored in ancillary data returned to userland). And doing that when both ends are on the same host would result in a headache from hell, since we'd need all of the garbage-collecting machinery for AF_UNIX to apply to AF_INET as well.
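To make that recipe concrete, here is a minimal sketch of the sending side, following the man pages cited above; the `send_fd` helper and the single-byte payload are illustrative, and `sock` is assumed to be a connected AF_UNIX stream socket:

```c
/* Pass one file descriptor over a connected AF_UNIX socket as
 * SCM_RIGHTS ancillary data. */
#include <string.h>
#include <sys/socket.h>

int send_fd(int sock, int fd_to_pass)
{
    char dummy = '*';                       /* at least one data byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                                 /* correctly aligned buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;

    memset(u.buf, 0, sizeof(u.buf));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```

The receiving side mirrors this with recvmsg(), pulling the new descriptor number out of the ancillary data with CMSG_DATA().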