
TCP friends

By Jonathan Corbet
August 15, 2012
One of the many advantages of the TCP network protocol is that the process at one end of a connection need not have any idea of where the other side is. A process could be talking with a peer on the other side of the world, in the same town, or, indeed, on the same machine. That last case may be irrelevant to the processes involved, but it can be important for performance-sensitive users. A new patch from Google seems likely to speed that case up in the near future.

A buffer full of data sent on the network does not travel alone. Instead, the TCP layer must split that buffer into reasonably-sized packets, prepend a TCP header to each, and, possibly, calculate checksums. The packets are then passed to the IP layer, which throws its own headers onto the beginning of each packet, finds a suitable network interface, and hands the result off to that interface for transmission. At the receiving end the process is reversed: the IP and TCP headers are stripped, checksums are verified, and the data is merged back into a seamless stream for the receiving process.
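
As a rough illustration of that layering (a userspace sketch only; the kernel builds its packets incrementally in sk_buffs and fills in far more fields, including the checksums omitted here), the payload ends up behind a TCP header, which in turn ends up behind an IP header:

    #include <arpa/inet.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>
    #include <string.h>

    /* Sketch of the encapsulation: payload behind a TCP header, behind an
     * IP header.  Checksum calculation is omitted for brevity. */
    static size_t build_segment(unsigned char *frame, const void *payload,
                                size_t len)
    {
        struct iphdr  *ip   = (struct iphdr *)frame;
        struct tcphdr *tcp  = (struct tcphdr *)(frame + sizeof(*ip));
        unsigned char *data = frame + sizeof(*ip) + sizeof(*tcp);

        memset(frame, 0, sizeof(*ip) + sizeof(*tcp));
        ip->version  = 4;
        ip->ihl      = 5;                  /* 20-byte IP header, no options */
        ip->protocol = IPPROTO_TCP;
        ip->tot_len  = htons(sizeof(*ip) + sizeof(*tcp) + len);
        tcp->doff    = 5;                  /* 20-byte TCP header, no options */

        memcpy(data, payload, len);
        return sizeof(*ip) + sizeof(*tcp) + len;
    }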

It is all a fair amount of work, but it allows the two processes to communicate without having to worry about all that happens in between. But, if the two processes are on the same physical machine, much of that work is not really necessary. The bulk of the overhead in the network stack is there to ensure that packets do not get lost on their way to the destination, that the data does not get corrupted in transit, and that nothing gets forgotten or reordered. Most of these perils do not threaten data that never leaves the originating system, so much of the work done by the networking stack is entirely wasted in this case.

That much has been understood by developers for many years, of course. That is why many programs have been written specifically to use Unix-domain sockets when communicating with local peers. Unix-domain sockets provide the same sort of stream abstraction but, since they do not communicate between systems, they avoid all of the overhead added by a full network stack. So faster communication between local processes is possible now, but it must be coded explicitly in any program that wishes to use it.
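
The difference at the application level is small but must be spelled out; in this sketch (the port number and socket path are placeholders), the first helper talks TCP over the loopback address and the second talks to a Unix-domain socket, with identical stream semantics afterward:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Connect to a local server over TCP via the loopback address. */
    static int connect_tcp_loopback(unsigned short port)
    {
        struct sockaddr_in sa = { .sin_family = AF_INET,
                                  .sin_port   = htons(port) };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
        if (fd >= 0 && connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    /* The Unix-domain equivalent: same stream semantics, but the program
     * must be written to use a filesystem path instead of an address. */
    static int connect_unix(const char *path)
    {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
        if (fd >= 0 && connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

An application that wants the faster local path today must carry both variants and choose between them; the point of the work described below is to make the first nearly as fast without any such change.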

What if local TCP communications could be accelerated to the point that they are competitive with Unix-domain sockets? That is the objective of this patch from Bruce Curtis. The idea is simple enough to explain: when both endpoints of a TCP connection are on the same machine, the two sockets are marked as being "friends" in the kernel. Data written to such a socket will be immediately queued for reading on the friend socket, bypassing the network stack entirely. The TCP, IP, and loopback device layers are simply shorted out. The actual patch, naturally enough, is rather more complicated than this simple description would suggest; friend sockets must still behave like TCP sockets to the point that applications cannot tell the difference, so friend-handling tweaks must be applied to many places in the TCP stack.
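
That is emphatically not the patch itself, but the core idea can be modeled in a few lines of userspace C: each endpoint owns a receive queue, and a send on a "friend" endpoint copies data straight into its peer's queue, with no segmentation, headers, or checksums in between. The structure and function names here are invented purely for illustration:

    #include <string.h>
    #include <sys/types.h>

    /* Conceptual model only -- not the kernel patch.  Each endpoint owns a
     * receive queue; a send on a "friend" endpoint copies the data straight
     * into the peer's queue, bypassing the protocol layers entirely. */
    struct endpoint {
        struct endpoint *friend_ep;     /* set when both ends are local */
        unsigned char    rxq[65536];    /* toy receive queue */
        size_t           rx_len;
    };

    static ssize_t friend_send(struct endpoint *ep, const void *buf, size_t len)
    {
        struct endpoint *peer = ep->friend_ep;
        size_t room = sizeof(peer->rxq) - peer->rx_len;

        if (len > room)
            len = room;                 /* short write, like a full socket buffer */
        memcpy(peer->rxq + peer->rx_len, buf, len);
        peer->rx_len += len;
        return (ssize_t)len;
    }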

One would hope that this approach would yield local networking speeds that are at least close to competitive with those achieved using Unix-domain sockets. Interestingly, Bruce's patch not only achieves that, but it actually does better than Unix-domain sockets in almost every benchmark he ran. "Better" means both higher data transmission rates and lower latencies on round-trip tests. Bruce does not go into why that is; perhaps the amount of attention that has gone into scalability in the networking stack pays off in his 16-core testing environment.
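
For a feel for what a "round-trip test" measures, here is a minimal ping-pong sketch (real benchmark tools do far more: warm-up runs, statistics, CPU accounting). It assumes a connected stream socket, TCP-over-loopback or Unix-domain alike, whose peer echoes back everything it reads:

    #include <time.h>
    #include <unistd.h>

    /* Minimal ping-pong latency measurement: send one byte, wait for the
     * peer to echo it back, repeat, and return the average round trip in
     * microseconds. */
    static double round_trip_us(int fd, int iterations)
    {
        char c = 'x';
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < iterations; i++) {
            if (write(fd, &c, 1) != 1 || read(fd, &c, 1) != 1)
                return -1.0;
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double us = (end.tv_sec - start.tv_sec) * 1e6 +
                    (end.tv_nsec - start.tv_nsec) / 1e3;
        return us / iterations;
    }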

There is one important test for which Bruce posted no results: does the TCP friends patch make things any slower for non-local connections where the stack bypass cannot be used? Some of the network stack hot paths can be sensitive to even small changes, so one can imagine that the networking developers will want some assurance that the non-bypass case will not be penalized if this patch goes in. There are various other little issues that need to be dealt with, but this patch looks like it is on track for merging in the relatively near future.

If it is merged, the result should be faster local communications between processes without the need for special-case code using Unix-domain sockets. It could also be most useful on systems hosting containerized guests where cross-container communications are needed; one suspects that Google's use case looks somewhat like that. In the end, it is hard to argue against a patch that can speed local communications by as much as a factor of five, so chances are this change will go into the mainline before too long.

Index entries for this article
Kernel: Networking



TCP friends

Posted Aug 16, 2012 10:23 UTC (Thu) by Fowl (subscriber, #65667) [Link] (10 responses)

Wow, I'd always just assumed that this was the way it worked from the beginning.

TCP friends

Posted Aug 16, 2012 11:44 UTC (Thu) by gb (subscriber, #58328) [Link] (7 responses)

Hmmm. One of the use cases for local sockets is software testing. In this case everyone expects local sockets to behave exactly like remote ones, except that the latency introduced by transferring over the network is close to zero. Will this patch change anything for this use case?

TCP friends

Posted Aug 16, 2012 12:54 UTC (Thu) by hummassa (subscriber, #307) [Link] (6 responses)

In principle, the only change is that latency is now even closer to zero.

TCP friends

Posted Aug 16, 2012 13:45 UTC (Thu) by Kioob (subscriber, #56482) [Link] (5 responses)

And what about netfilter? That code also seems to be skipped, no?

For example, you can prevent some local users from connecting to a local TCP socket.

TCP friends

Posted Aug 16, 2012 15:33 UTC (Thu) by paravoid (subscriber, #32869) [Link] (3 responses)

"tcpdump -i lo" too I guess. Which I found quite useful in some cases.

TCP friends

Posted Aug 16, 2012 23:12 UTC (Thu) by BenHutchings (subscriber, #37955) [Link] (1 response)

tcpdump hasn't reliably shown 'packets on the wire', even for physical devices, since checksum offload was implemented. The loopback device already skips checksum generation, TCP segmentation and UDP fragmentation.

But you can turn most of these offloads/optimisations off if you want (though you can't force checksum generation for loopback). The same goes for tcp_friends.

TCP friends

Posted Aug 18, 2012 0:00 UTC (Sat) by intgr (subscriber, #39733) [Link]

> tcpdump hasn't reliably shown 'packets on the wire', even for physical devices, since checksum offload was implemented

The lack of checksums is a very trivial issue. The lack of segmentation is also irrelevant -- it appears as an interface with an infinitely large MTU. The packets captured on "lo" are still basically valid TCP packets.

Not seeing any packets and being unable to debug network interactions *at all* is a huge deal compared to these.

TCP friends

Posted Aug 17, 2012 14:31 UTC (Fri) by Creideiki (subscriber, #38747) [Link]

The fact that Solaris doesn't allow that is a constant source of frustration in my day job.

TCP friends

Posted Aug 16, 2012 23:04 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

Only data transfer on a connected socket bypasses the full receive and transmit path. So you can still block connections. NATting of loopback connections might go horribly wrong though.

TCP friends

Posted Aug 18, 2012 5:52 UTC (Sat) by dashesy (guest, #74652) [Link]

Interesting, I had exactly the same thought. Maybe short-circuit below IP in the stack. I guess Winsock does that; it would be interesting to test a few things and compare the performance (TCP/IP) with this new patch.

TCP friends

Posted Aug 26, 2012 17:59 UTC (Sun) by philomath (guest, #84172) [Link]

Exactly. I wonder how many more "low hanging fruit" are out there.

x.org

Posted Aug 16, 2012 12:16 UTC (Thu) by ededu (guest, #64107) [Link] (1 responses)

This could be very interesting for x.org, couldn't it?

x.org

Posted Aug 16, 2012 18:49 UTC (Thu) by zlynx (guest, #2285) [Link]

X already uses Unix domain sockets for messaging and local shared memory for large things like bitmaps.

What about other reliable transports?

Posted Aug 16, 2012 23:56 UTC (Thu) by ras (subscriber, #33059) [Link] (1 responses)

It's not just the kernel that can bypass most of TCP's overhead; I presume any reliable transport could. Like tunneling TCP over ssh, for instance. TCP tunneled over TCP usually doesn't work so well, but with this in place it should work just as well as TCP over a datagram service.

What about other reliable transports?

Posted Aug 18, 2012 22:41 UTC (Sat) by giraffedata (guest, #1954) [Link]

I don't think it's reliable transport per se that makes simplified communication possible, but direct transport. The TCP complexity is because the packets travel through a complex network. They get switched around here and there and compete with streams between other independent nodes for resources.

So I think friend sockets would be appropriate for any two sockets connected via a dedicated link. As well as an in-kernel link or an SSH-based TCP link, that could be a PPP link or an Ethernet with only two nodes on it.

In all those cases, a socket friendship protocol like this would be useful.

In fact, even without the ability to use existing TCP/IP applications, it would be nice to have stream sockets that exploit such direct connection, but I've never seen them.

TCP friends

Posted Aug 17, 2012 6:37 UTC (Fri) by arkaitzj (subscriber, #80462) [Link] (3 responses)

Shouldn't this fix the problems the D-Bus guys had with Unix sockets not having multicast and local TCP sockets being too slow? They don't seem to need to create another socket implementation in the kernel as they wanted to.

TCP friends

Posted Aug 17, 2012 8:16 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

The problem is not in raw bandwidth but in number of context switches. Also, TCP sockets can't be used to pass file descriptors.

TCP friends

Posted Aug 18, 2012 19:58 UTC (Sat) by rvfh (guest, #31018) [Link] (1 responses)

> Also, TCP sockets can't be used to pass file descriptors.

Completely OT, but I would love it if you could explain how to pass fd's through UNIX sockets and why this is not possible with INET sockets :-)

TCP friends

Posted Aug 18, 2012 21:01 UTC (Sat) by viro (subscriber, #7872) [Link]

man 2 sendmsg
man 2 recvmsg
man 7 unix and see the part about SCM_RIGHTS in there
man 3 cmsg for how to set/parse the ancillary data

as to why it's not possible for AF_INET... it's obviously not going to work between different hosts (we are, in effect, passing pointers to opened struct file instances; descriptors are converted to such pointers on the sending end and the pointers are inserted into the descriptor table by the recipient; the resulting descriptor numbers are stored in ancillary data returned to userland). And doing that when both ends are on the same host would result in a headache from hell, since we'd need all the garbage-collecting machinery for AF_UNIX to apply to AF_INET as well.
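
To make the mechanism concrete, here is a minimal sketch of the sending side of what those man pages describe; the receiver performs the mirror-image recvmsg() and finds the new descriptor, already installed in its descriptor table, in CMSG_DATA() of the returned control message:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one file descriptor over a connected AF_UNIX socket as
     * SCM_RIGHTS ancillary data.  At least one byte of ordinary data must
     * accompany the control message. */
    static int send_fd(int sock, int fd_to_pass)
    {
        char dummy = '\0';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;       /* forces proper alignment */
        } u;
        struct msghdr msg = {
            .msg_iov        = &iov,
            .msg_iovlen     = 1,
            .msg_control    = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }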

TCP friends

Posted Aug 17, 2012 21:03 UTC (Fri) by Baylink (guest, #755) [Link]

So this would optimize between OpenVZ guests but not between Xen/KVM guests, then?

TCP friends

Posted Aug 19, 2012 10:13 UTC (Sun) by istenrot (subscriber, #69564) [Link]

Well, at least there is a sysctl knob included for disabling TCP friends in case you need to debug loopback TCP sockets with tcpdump. So it's all good to me.

TCP friends

Posted Aug 23, 2012 13:50 UTC (Thu) by reddit (guest, #86331) [Link]

The fact that the traffic is invisible to tcpdump is of course unacceptable and needs to be fixed.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds