Multipath TCP: an overview
Things have gotten rather more complicated in the years since TCP was first deployed. Connections to multiple networks, once the province of large server systems, are now ubiquitous; a smartphone, for example, can have separate, simultaneous interfaces to a cellular network, a WiFi network, and, possibly, other networks via Bluetooth or USB ports. Each of those networks provides a possible way to reach a remote host, but any given TCP session will use only one of them. That leads to obvious policy considerations (which interface should be used when) and operational difficulties: most handset users are familiar with how a WiFi-based TCP session will be broken if the device moves out of range of the access point, for example.
What if a TCP session could make use of all of the available paths between the two endpoints at any given time? There would be performance improvements, since each of the paths could carry data in parallel, and congested paths could be avoided in favor of faster ones. Sessions could also be more robust. Imagine a video stream that is established over both WiFi and cellular networks; if the watcher leaves the house (one hopes somebody else is driving), the stream would shift transparently to the cellular connection without interruption. Data centers, where multiple paths between systems and variable congestion are both common, could also make use of a multipath-capable transport protocol.
The problem is that TCP does not work that way. Enter Multipath TCP (MPTCP), which is designed to do exactly that.
How it works
A TCP session is normally set up by way of a three-way handshake. The initiating host sends a packet with the SYN flag set; the receiving host, if it is amenable to the connection, responds with a packet containing both the SYN and ACK flags. The final ACK packet sent by the initiator puts the connection into the "established" state; after that, data can be transferred in either direction.
An MPTCP session starts in the same way, with one change: the initiator adds the new MP_CAPABLE option to the SYN packet. If the receiving host supports MPTCP, it will add that option to its SYN-ACK reply; the two hosts will also include cryptographic keys in these packets for later use. The final ACK (which must also carry the MP_CAPABLE option) establishes a multipath session, albeit a session using a single path just like traditional TCP.
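To make the handshake option a little more concrete, here is a minimal Python sketch of how an MP_CAPABLE option might be laid out on the wire. The constants (option kind 30, subtype 0, version 0, 64-bit keys, and the two flag bits) follow this editor's reading of RFC 6824 and should be checked against the RFC before being relied upon; the function names `build_mp_capable()` and `parse_mp_capable()` are hypothetical and do not come from any real implementation.

```python
import struct

# Values below follow a reading of RFC 6824; treat them as assumptions.
TCPOPT_MPTCP = 30        # TCP option kind shared by all MPTCP options
MP_CAPABLE = 0x0         # MPTCP subtype used during the initial handshake
MPTCP_VERSION = 0        # protocol version described by RFC 6824
FLAG_CHECKSUM = 0x80     # "A" bit: checksums required on data mappings
FLAG_HMAC_SHA1 = 0x01    # "H" bit: HMAC-SHA1 is the negotiated algorithm
KEY_LEN = 8              # each host contributes a 64-bit key


def build_mp_capable(sender_key, receiver_key=None,
                     flags=FLAG_CHECKSUM | FLAG_HMAC_SHA1):
    """Encode an MP_CAPABLE option (hypothetical helper).

    SYN and SYN-ACK carry only the sender's own key; the final ACK
    carries both keys, which is modeled here by passing receiver_key.
    """
    keys = sender_key + (receiver_key or b"")
    if len(keys) % KEY_LEN != 0:
        raise ValueError("keys must be 8 bytes each")
    length = 4 + len(keys)  # 12 for SYN/SYN-ACK, 20 for the final ACK
    subtype_version = (MP_CAPABLE << 4) | MPTCP_VERSION
    header = struct.pack("!BBBB", TCPOPT_MPTCP, length, subtype_version, flags)
    return header + keys


def parse_mp_capable(option):
    """Decode an option produced by build_mp_capable() (hypothetical helper)."""
    kind, length, subtype_version, flags = struct.unpack("!BBBB", option[:4])
    if kind != TCPOPT_MPTCP or length != len(option):
        raise ValueError("not a well-formed MPTCP option")
    if (subtype_version >> 4) != MP_CAPABLE:
        raise ValueError("not an MP_CAPABLE option")
    body = option[4:]
    keys = [body[i:i + KEY_LEN] for i in range(0, len(body), KEY_LEN)]
    return {
        "version": subtype_version & 0x0F,
        "checksum_required": bool(flags & FLAG_CHECKSUM),
        "keys": keys,
    }
```

Calling `build_mp_capable(key_a)` models the option carried in the SYN, while `build_mp_capable(key_a, key_b)` models the final ACK that echoes both keys. A real stack builds these options inside the kernel, of course; the sketch is only meant to show how little is added to an otherwise ordinary handshake.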
When MPTCP is in use, both sides recognize a distinction between the session itself and any specific "subflow" used by that session. At any point, either party to the session can initiate another TCP connection to the other side, with the proviso that the address and/or port at one end or the other of the connection must differ. So, if a smartphone has initiated an MPTCP connection to a server using its WiFi interface, it can add another subflow at any time by connecting to the same server by way of its cellular interface.
That subflow is added by sending a SYN packet with the MP_JOIN option; it also includes information on which MPTCP session is to be joined. Needless to say, the protocol designers are concerned that a hostile party might try to join somebody else's session; the previously-exchanged cryptographic keys are used to prevent such attacks from succeeding. If the receiving server is amenable to adding the subflow, it will allow the establishment of the new TCP connection and add it to the MPTCP session.
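The way the exchanged keys protect a join can be sketched in a few lines of Python. This assumes the scheme as this editor understands it from RFC 6824: the session is named by a 32-bit token taken from the SHA-1 hash of the peer's key, and each side proves knowledge of both keys with an HMAC-SHA1 computed over a pair of random nonces sent in the MP_JOIN handshake. The helper names are hypothetical, and the details should be verified against the RFC.

```python
import hashlib
import hmac


def token_from_key(key):
    """Session token: the most significant 32 bits of SHA-1(key).

    The joining host puts the *receiver's* token in its MP_JOIN SYN so
    that the receiver can look up which session is being joined.
    """
    return hashlib.sha1(key).digest()[:4]


def join_hmac(local_key, remote_key, local_nonce, remote_nonce):
    """HMAC sent by one host during MP_JOIN (hypothetical helper).

    Assumed layout: the HMAC key is the sender's key followed by the
    peer's key; the message is the sender's nonce followed by the peer's.
    """
    return hmac.new(local_key + remote_key,
                    local_nonce + remote_nonce,
                    hashlib.sha1).digest()


def verify_join(local_key, remote_key, local_nonce, remote_nonce, received):
    """Check an HMAC (possibly truncated) received from the peer."""
    expected = join_hmac(remote_key, local_key, remote_nonce, local_nonce)
    return hmac.compare_digest(expected[:len(received)], received)
```

A hostile party that has only observed the token cannot produce a valid HMAC, since the keys themselves crossed the network only during the initial handshake. In this sketch, `verify_join()` accepts a truncated value because, as far as this editor can tell, the SYN-ACK carries only the leading 64 bits of the HMAC while the final ACK carries all 160.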
Once a session has more than one subflow, it is up to the systems on each end to decide how to split traffic between them (though it is possible to mark a specific subflow for use only when any others no longer work). A single receive window applies to the session as a whole. Each subflow looks like a normal TCP connection, with its own sequence numbers, but the session as a whole has a separate sequence number; there is another TCP option (DSS, or "Data Sequence Signal") which is used to inform the other end how data on each subflow fits into the overall stream.
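The relationship between per-subflow sequence numbers and the session-wide data sequence number can be illustrated with a toy model. The Python sketch below is not how any real stack does it: the round-robin `schedule()` function and the `Reassembler` class are hypothetical, byte offsets start at zero for simplicity, and chunks are assumed never to overlap partially. It only shows that each segment carries both a subflow position and a session position, and that the receiver orders data by the latter.

```python
def schedule(data, subflows, chunk):
    """Split data across subflows round-robin (a deliberately naive policy).

    Returns (subflow, subflow_seq, data_seq, payload) tuples; the last
    three fields stand in for what a DSS mapping would describe.
    """
    next_subflow_seq = {name: 0 for name in subflows}
    segments = []
    for i, data_seq in enumerate(range(0, len(data), chunk)):
        name = subflows[i % len(subflows)]
        payload = data[data_seq:data_seq + chunk]
        segments.append((name, next_subflow_seq[name], data_seq, payload))
        next_subflow_seq[name] += len(payload)
    return segments


class Reassembler:
    """Rebuild the session byte stream from segments on any subflow."""

    def __init__(self):
        self.next_data_seq = 0      # next session-level byte expected
        self.pending = {}           # data_seq -> payload, waiting for a gap
        self.delivered = bytearray()

    def receive(self, data_seq, payload):
        if data_seq + len(payload) <= self.next_data_seq:
            return                  # already delivered, e.g. a retransmission
        self.pending[data_seq] = payload
        # Deliver everything that is now contiguous.
        while self.next_data_seq in self.pending:
            piece = self.pending.pop(self.next_data_seq)
            self.delivered += piece
            self.next_data_seq += len(piece)
```

Feeding the receiver all of one subflow's segments before any of the other's still yields the original byte stream, because ordering is decided by the session-level number rather than by arrival order or by the subflow's own sequence space.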
Subflows can come and go over the life of an MPTCP connection. They can be explicitly closed by either end, or they can simply vanish if one of the paths becomes unavailable. If the underlying machinery is working well, applications should not even notice these changes. Just as IP can hide routing changes, MPTCP can hide the details of which paths it is using at any given time. It should, from an application's point of view, just work.
Needless to say, there are vast numbers of details that have been glossed over here. Making a protocol extension like this work requires thinking about issues like congestion control, how to manage retransmissions over a different path, how one party can tell the other about additional addresses (paths) it could use, how to decide when setting up multiple subflows is worth the expense, and so on. The MPTCP designers have done much of that thinking; see RFC 6824 for the details.
The dreaded middlebox
One set of details merits a closer look, though. The designers of MPTCP are not interested in going through an idle academic exercise; they want to create a solution to real problems that will be deployed on the existing Internet. And that means designing something that will function with the net as it exists now. At one level, that means making things work transparently for TCP-based applications. But there is an entire section in the RFC that is concerned with "middleboxes" and how they can sabotage any attempt to introduce a new protocol.
Middleboxes are routers that impose some sort of constraint or transformation on network traffic passing through them. Network address translation (NAT) boxes are one example: they hide an entire network behind a translation layer that will change the address and port of a connection on its way through. NAT boxes can also insert data into a stream — adding commands to make FTP work, for example. Some boxes will acknowledge data on its way through, well before it arrives at the real destination, in an attempt to increase pipelining. Some routers will drop packets with unknown options; that behavior made the rollout of the selective acknowledgment (SACK) feature much harder than it needed to be. Firewalls will kill connections with holes in the sequence number stream; they will also, sometimes, transform sequence numbers on the way through. Splitting and coalescing of segments can cause options to be dropped or duplicated. And so on; the list of potential problems is impressive.
On top of that, anybody trying to introduce an entirely new transport-layer protocol is likely to discover that it will not make it across the Internet at all. Much of the routing infrastructure on the net assumes that TCP and UDP are all there is; anything else has a poor chance of making it through.
Working around these issues drove the design of MPTCP at all levels. TCP was never designed for multiple subflows; rather than bolting that idea onto the protocol, it might well have been better to start over. One could have incorporated the lessons learned from TCP in all ways — including doing things entirely differently where it made sense. But the resulting protocol would not work on today's Internet, so the designers had no choice but to create a protocol that, to almost every middlebox out there, looks like plain old TCP.
So every subflow is an independent TCP connection in every respect. Since holes in sequence numbers can cause problems, each subflow has its own sequence and a mapping layer must be added on top. That mapping layer uses relative sequence numbers because some middlebox may have changed those numbers as they passed through. When one side tells the other about an available interface, it assigns an "address identifier" to that interface's IP address; the two sides then use that identifier in future messages, since the address itself may be changed by a NAT box in the middle. Special checks exist for subflows that corrupt data, insert preemptive acknowledgments, or strip unknown options; such subflows will not be used. And the whole thing is designed to fall back gracefully to ordinary TCP if the interference is too strong to overcome.
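The reason for relative sequence numbers is easy to demonstrate. In the Python sketch below, a hypothetical middlebox shifts every sequence number on a subflow, including the initial one, by a fixed amount. The function names and the fixed-offset model are assumptions made for illustration; real middleboxes may behave in other ways.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits and wrap around


def to_absolute(relative_seq, initial_seq):
    """Sequence number as placed in the TCP header by the sender."""
    return (initial_seq + relative_seq) % SEQ_SPACE


def to_relative(absolute_seq, initial_seq):
    """Offset from the subflow's initial sequence number."""
    return (absolute_seq - initial_seq) % SEQ_SPACE


def rewrite(seq, delta):
    """Model of a middlebox that shifts every sequence number by delta."""
    return (seq + delta) % SEQ_SPACE


# The sender's view of one subflow.
sender_isn = 0xFFFFFFF0
segment_seq = to_absolute(40, sender_isn)   # wraps past 2**32

# What the receiver sees after the middlebox has been at work.
delta = 0x12345678
seen_isn = rewrite(sender_isn, delta)
seen_seq = rewrite(segment_seq, delta)

# The absolute numbers no longer match, but the offset survives intact.
relative_at_receiver = to_relative(seen_seq, seen_isn)
```

Had the mapping carried the sender's absolute sequence number, the receiver would have no way of matching it against the rewritten headers. An offset from the start of the subflow is unaffected by a uniform shift, so `relative_at_receiver` comes out as the same 40 that the sender started with.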
It is all a clever bit of design on the part of the MPTCP developers, but it also highlights an area of concern: the "dumb" Internet with end-to-end transparent routing of data is a thing of the distant past. What we have now is inflexible and somewhat hostile to the deployment of new technologies. The MPTCP developers have been able to work around these limitations, but the effort required was considerable. In the future, we may find that the net is broken in fundamental ways and it simply cannot be fixed; some might say that the difficulties in moving to IPv6 show that this has already happened.
Future directions
The current MPTCP code can be found at the MPTCP GitHub repository; it adds a good 10,000 lines to the mainline kernel's networking subtree. While it has apparently been the subject of discussions with various networking developers, it has not yet been posted for public review or inclusion into the mainline. It does, however, seem to work: the MPTCP developers claim to have implemented the fastest TCP connection ever by transmitting at a rate of 51.8Gb/s over six 10Gb links.
MPTCP is still relatively young, so there is almost certainly quite a bit of work yet to be done before it is ready for mainline merging or production use. There is also some thinking to be done on the application side; it may be possible for MPTCP-aware applications to make better use of the available paths. Projects like this are arguably never finished (we are still refining TCP, after all), but MPTCP does seem to have reached the point where more users may want to start experimenting with it.
Anybody wanting to play with this code can grab the project's kernel repository and build a custom kernel. For those who are not up to that level of effort, the project offers a number of other options, including a Debian repository, instructions for running MPTCP on Amazon's EC2, and kernels for a handful of Android-based handsets. Needless to say, the developers are highly interested in hearing bug reports or other testing results.
