More accurate congestion notification for TCP
TCP, from the beginning, has included a couple of window counters used by each side of a connection to specify how much data it is willing to accept from the other at any given time. The windows work well to prevent the endpoints from being overwhelmed with packets, but early TCP did not consider the problem of congestion in the routers between the endpoints. That shortcoming made itself known in the form of severe congestion problems in the mid-to-late 1980s.
Around that time, Van Jacobson and Mike Karels took on the problem of preventing congestion collapse. Their key insight was that dropped packets were almost never a result of corruption of the packets themselves. Instead, they were a signal that some system between the endpoints was experiencing congestion; indeed, dropped packets were the only way that a router could signal congestion. Jacobson implemented the first congestion-control algorithms that would slowly ramp up the transmission rate until packet loss was experienced, indicating the point where the capacity of the channel had been exceeded. Jacobson's classic paper describes this work in detail.
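The ramp-up-then-back-off behavior can be illustrated with a toy additive-increase, multiplicative-decrease loop. This is a minimal sketch rather than Jacobson's actual algorithm (which also includes slow start and round-trip-time estimation); the `capacity` parameter and the function name are hypothetical, and loss is simply assumed whenever the window exceeds that capacity.

```python
def simulate_aimd(capacity, rounds, start=1):
    """Return the window size used in each simulated round trip.

    Assumption: a packet is lost whenever the window exceeds `capacity`.
    """
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # Loss inferred: the channel is over capacity, so halve the window.
            window = max(1, window // 2)
        else:
            # No loss seen: probe for a little more bandwidth.
            window += 1
    return history


history = simulate_aimd(capacity=8, rounds=12)
```

The window climbs one step at a time, overshoots the capacity, and is then cut in half before climbing again; the overshoot is the packet loss that the rest of the article is concerned with avoiding.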
Using packet-loss events in this way made the net work again, but it was never going to be the most efficient way to regulate transmission speeds. It takes time to realize that a packet has been dropped, and each dropped packet represents a waste of resources. It would be far better if the TCP endpoints could be informed of congestion, and moderate their transmission speeds, before the congestion reaches the point of packet loss.
Explicit congestion notification
Around the end of the 1990s, work was started on what eventually became RFC 3168, describing explicit congestion notification (ECN), a means by which routers can inform the endpoints of a connection that they are experiencing congestion. It required changes at both the IP and TCP layers of the stack.
At the IP level, two bits were allocated from the IPv4 and IPv6 headers; they were named ECT and CE. The setting of either of those bits (but not both) in an IP packet is an indication that the endpoints understand the ECN protocol and are willing to implement it. When a router that is experiencing congestion receives a packet with exactly one of those bits set, it can choose to set the other bit to indicate "congestion experienced" in the hope that the endpoints will respond by slowing their transmission rates.
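The marking rule can be sketched in a few lines of Python. The codepoint names used here (Not-ECT, ECT(0), ECT(1), CE) follow RFC 3168, which treats the two bits as a single field in the low-order bits of the IPv4 TOS or IPv6 traffic-class byte; the function name `router_mark()` is hypothetical, and a real router would apply its own queue-management policy when deciding whether it is congested.

```python
# Values of the two-bit ECN field, as named in RFC 3168.
NOT_ECT = 0b00  # sender does not support ECN
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # both bits set: congestion experienced

ECN_MASK = 0b11  # the ECN field is the low two bits of the byte


def router_mark(tos_byte, congested):
    """Return the TOS/traffic-class byte as a congested router might forward it."""
    ecn = tos_byte & ECN_MASK
    if congested and ecn in (ECT_0, ECT_1):
        # Exactly one bit was set; setting the other yields CE.
        return tos_byte | CE
    # Not-ECT packets cannot be marked (a congested router may drop them
    # instead), and packets that already carry CE are left alone.
    return tos_byte


marked = router_mark(0b10111010, congested=True)
```

Note that only the low two bits change; the upper six bits of the byte belong to a different field and pass through untouched.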
In a typical TCP connection, one side will be transmitting at a rather higher rate than the other. If the heavy transmitter is causing congestion, the ECN signal will arrive at the receiving end, where it is not entirely useful. So TCP had to be enhanced to relay that signal back to the transmitting side. Two bits were allocated in the TCP header as well with the names ECE (ECN echo) and CWR (congestion window reduced). If both of those bits are set in the initial SYN packet starting a connection, they are interpreted as a signal that the initiating side implements ECN. If the peer also supports ECN, it sends its SYN-ACK response with only the ECE bit set. When both of those things happen, the connection will use ECN.
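The handshake check described above might look like the following sketch. The flag values are the standard TCP header bit positions; the function names are hypothetical, and the many corner cases that a real stack must handle (retransmitted SYNs, fallback after failure, and so on) are ignored.

```python
# Standard TCP header flag bits.
SYN = 0x02
ACK = 0x10
ECE = 0x40  # ECN echo
CWR = 0x80  # congestion window reduced

ECN_BITS = ECE | CWR


def syn_offers_ecn(flags):
    """True if an initial SYN has both ECE and CWR set."""
    is_initial_syn = bool(flags & SYN) and not (flags & ACK)
    return is_initial_syn and (flags & ECN_BITS) == ECN_BITS


def synack_accepts_ecn(flags):
    """True if a SYN-ACK has ECE set and CWR clear."""
    is_synack = (flags & (SYN | ACK)) == (SYN | ACK)
    return is_synack and (flags & ECN_BITS) == ECE


def ecn_negotiated(syn_flags, synack_flags):
    """Both halves of the handshake must agree before ECN is used."""
    return syn_offers_ecn(syn_flags) and synack_accepts_ecn(synack_flags)


ok = ecn_negotiated(SYN | ECE | CWR, SYN | ACK | ECE)
refused = ecn_negotiated(SYN | ECE | CWR, SYN | ACK)
```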
When one side of a connection receives a packet with the two IP-level congestion-mark bits set, indicating congestion in the path, it will start setting the TCP ECE bit in every ACK packet it sends back to the other side. An endpoint, on receiving a packet with ECE set, is supposed to respond in the same way it would if a packet had been dropped; it will reduce its congestion window (and thus the transmission speed). It will also set the CWR bit in the TCP header in the next packet it sends to indicate that the ECE signal has been received. Once the CWR bit is observed at the other end, the recipient will stop setting ECE.
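The echo-and-acknowledge exchange can be modeled as two small state machines. This is a simplified sketch with hypothetical class and method names; in particular, it halves the window directly and does not model the once-per-round-trip limit or any of the retransmission handling that a real TCP implementation needs.

```python
class EcnReceiver:
    """Tracks whether outgoing ACKs should carry the ECE bit."""

    def __init__(self):
        self.echoing = False

    def on_data(self, ce_marked, cwr_set):
        if cwr_set:
            # The sender has confirmed that it reduced its window.
            self.echoing = False
        if ce_marked:
            # Congestion seen on this packet: start (or keep) echoing.
            self.echoing = True

    def ack_has_ece(self):
        return self.echoing


class EcnSender:
    """Reacts to ECE as it would to a loss, then signals CWR once."""

    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.cwr_pending = False

    def on_ack(self, ece_set):
        if ece_set and not self.cwr_pending:
            self.cwnd = max(1, self.cwnd // 2)
            self.cwr_pending = True

    def next_data_has_cwr(self):
        cwr, self.cwr_pending = self.cwr_pending, False
        return cwr


receiver = EcnReceiver()
sender = EcnSender(cwnd=10)

receiver.on_data(ce_marked=True, cwr_set=False)  # a CE-marked packet arrives
first_ack_ece = receiver.ack_has_ece()
sender.on_ack(first_ack_ece)                     # window is halved
sender.on_ack(receiver.ack_has_ece())            # repeated ECE: no second cut yet
cwr = sender.next_data_has_cwr()
receiver.on_data(ce_marked=False, cwr_set=cwr)   # CWR stops the echo
```

However many packets were marked before the CWR arrived, the sender in this model learns only that "some" congestion occurred, which is the limitation discussed in the next section.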
The Linux kernel gained support for ECN in the 2.4.0-test7 release in September 2000. The immediate result was an early lesson on the problem of protocol ossification. As was noted in LWN at the time, many of the routers on the Internet not only did not support ECN, but they also actively dropped SYN packets with the TCP ECN bits set, making communication impossible. So, while Linux had ECN support from an early date, it was many years before it could be safely enabled on most systems, and it still is not fully enabled even in current kernels.
More accurate ECN
ECN was an improvement over what came before, but there is room to do even better. The design of the ECN protocol means that it can only communicate a single "congestion experienced" event during each round-trip time for the connection; that is how long it will take between the transmission of the first ACK with ECE set and the reception of a packet with CWR set. That will slow the response to heavy congestion, with the likely result that packets will still be dropped. More accurate ECN (AccECN) was designed to provide faster and more detailed feedback on congestion to the TCP endpoints.
AccECN makes minimal changes to ECN at the IP level; the two bits are used as before. At the TCP level, it grabs another header bit that had, back in 2003, been assigned by RFC 3540 for a "robust ECN" mechanism that was never deployed. That bit, renamed AE, is used in a couple of ways with the new protocol. At connection time, an AccECN-capable site should set the AE bit along with ECE and CWR; if the other side also supports AccECN, it will respond with ECE and AE set. If the receiving side does not understand AccECN and ignores the AE bit, it will see what looks like a "classic ECN" configuration and respond accordingly. (Note that the connection protocol, like everything else, is somewhat more complex than described here; see the RFC draft for the gory details.)
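The backward-compatible fallback can be sketched as follows. Only the fallback path is modeled: a listener that knows classic ECN but not AccECN. The AE bit is assumed here to occupy the header position of the old RFC 3540 bit (0x100 in the flags field); the function names are hypothetical, and the encoding of a full AccECN reply is deliberately left to the RFC draft.

```python
SYN = 0x002
ACK = 0x010
ECE = 0x040
CWR = 0x080
AE = 0x100  # assumed position: the bit formerly assigned by RFC 3540


def legacy_ecn_synack(syn_flags):
    """SYN-ACK flags from a listener that knows classic ECN but not AccECN."""
    known = syn_flags & ~AE  # the unknown AE bit is simply ignored
    reply = SYN | ACK
    if (known & (ECE | CWR)) == (ECE | CWR):
        # Looks like a classic-ECN SYN, so answer with ECE alone.
        reply |= ECE
    return reply


def client_mode(synack_flags):
    """How an AccECN-capable client might classify the reply (simplified)."""
    bits = synack_flags & (AE | CWR | ECE)
    if bits == ECE:
        return "classic ECN"
    if bits == 0:
        return "no ECN"
    return "other (see the RFC draft)"


accecn_syn = SYN | AE | CWR | ECE
reply = legacy_ecn_synack(accecn_syn)
mode = client_mode(reply)
```

Because the older listener's reply is indistinguishable from a normal classic-ECN handshake, the initiating side can fall back without a second connection attempt.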
When AccECN is in use, each side maintains a set of counters, one of which is the number of packets received with the congestion-experienced marker. After the connection is established, the AE, CWR, and ECE bits are combined into a single three-bit field, inevitably called ACE. The contents of that field will be the three least-significant bits of the packet counter, giving the other side a continually updated view of how many congestion-marked packets have been seen. When the ACE count changes, a transmitting side can get a sense for just how many packets have been stamped with the congestion mark in transit and respond accordingly.
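Packing the counter into the header flags might look like the sketch below. It assumes that AE is the most significant of the three bits and ECE the least significant, matching their positions in the TCP header; the helper names are hypothetical, and details such as the counter's initial value are left to the RFC draft.

```python
ACK = 0x010
ECE = 0x040
CWR = 0x080
AE = 0x100

ACE_SHIFT = 6                  # ECE is bit 6; CWR and AE sit directly above it
ACE_MASK = 0b111 << ACE_SHIFT  # covers AE, CWR, and ECE


def set_ace(flags, ce_packets):
    """Store the three least-significant bits of the counter in the flags."""
    return (flags & ~ACE_MASK) | ((ce_packets & 0b111) << ACE_SHIFT)


def get_ace(flags):
    """Extract the three-bit ACE field from the flags."""
    return (flags & ACE_MASK) >> ACE_SHIFT


ack_flags = set_ace(ACK, ce_packets=13)  # 13 is 0b1101; the low bits are 0b101
```

Here 13 marked packets yield an ACE value of five, which appears on the wire as AE and ECE set with CWR clear.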
Three bits do not allow for a large count, needless to say. The RFC draft provides a set of complicated rules for determining whether the count may have wrapped and guessing how many times that may have happened. ACKs are sent relatively frequently — perhaps one for every two data packets in an ongoing stream — leaving little opportunity for multiple wraps of the ACE counter most of the time. In any case, eight counter values that can change with every ACK (rather than one bit that can only change once per round-trip time) provide much higher-resolution information on the presence of congestion on the path between the two endpoints.
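The wrapping problem is easy to see with a little modular arithmetic. The sketch below computes how many new marks an ACK reports under the assumption that fewer than eight packets were marked since the previous ACK; the function names are hypothetical, and the draft's real rules for detecting multiple wraps are considerably more involved.

```python
ACE_MODULUS = 8  # three bits


def ace_field(ce_packet_count):
    """The value the receiver places in the ACE field."""
    return ce_packet_count % ACE_MODULUS


def newly_marked(prev_ace, new_ace):
    """Marks reported since the last ACK, assuming at most one partial wrap."""
    return (new_ace - prev_ace) % ACE_MODULUS


# The counter moves from 6 to 9: the field wraps from 6 around to 1.
small_step = newly_marked(ace_field(6), ace_field(9))

# The counter moves from 6 to 14: the field reads 6 both times.
hidden_wrap = newly_marked(ace_field(6), ace_field(14))
```

The first case recovers the three new marks despite the wrap; the second shows why extra heuristics are needed, since eight new marks look exactly like none at all.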
AccECN, as described so far, was clearly designed to avoid as many protocol-ossification problems as possible. Even so, it includes a number of provisions for the detection of middlebox interference with the ACE bits and the count as a whole. The nature of the modern Internet is such that protocol changes must be done with a lot of care, even when the changes are within the specification of the protocols themselves.
There is more to AccECN, though, if the connection will support it. Each side of the connection is required to maintain three other counters for incoming data. There are two counters to track the number of bytes received with either (but not both) of the IP-level ECN bits set, and a counter for the number of bytes received with both bits set (indicating congestion). There is a pair of TCP options that can be used to communicate these counters (more precisely, the bottom 24 bits of each counter) to the other side. These counters give a far more accurate indication of how much congestion is actually occurring, and they can profitably be put to use by a number of advanced congestion-control algorithms.
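A rough model of the byte counters follows. The class and attribute names are hypothetical, and the layout of the TCP options themselves is not modeled; the sketch only shows the counting and the 24-bit truncation, along with the modular subtraction that a peer could use to recover the change between two reports.

```python
MASK24 = (1 << 24) - 1

ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


class ByteCounters:
    """Per-connection counts of payload bytes, keyed by IP ECN marking."""

    def __init__(self):
        self.ect0_bytes = 0
        self.ect1_bytes = 0
        self.ce_bytes = 0

    def on_packet(self, ecn_field, payload_len):
        if ecn_field == ECT_0:
            self.ect0_bytes += payload_len
        elif ecn_field == ECT_1:
            self.ect1_bytes += payload_len
        elif ecn_field == CE:
            self.ce_bytes += payload_len
        # Packets without any ECN marking are not counted here.

    def option_fields(self):
        """Only the bottom 24 bits of each counter are sent to the peer."""
        return (self.ect0_bytes & MASK24,
                self.ect1_bytes & MASK24,
                self.ce_bytes & MASK24)


def delta24(prev_field, new_field):
    """Bytes added between two reports, assuming less than 2**24 of growth."""
    return (new_field - prev_field) & MASK24


counters = ByteCounters()
counters.ce_bytes = 0xFFFFF0             # pretend much data was already marked
before = counters.option_fields()[2]
counters.on_packet(CE, payload_len=0x20)
after = counters.option_fields()[2]
growth = delta24(before, after)
```

With 16MB of headroom per counter rather than eight packets, wraps between reports are far less of a concern than they are for the ACE field.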
The problem with TCP options, of course, is again middleboxes, which often will not pass packets that contain unrecognized options. The connection-establishment dance thus includes a couple of attempts to send packets with the AccECN options to see whether they make it unmolested to the other end; the options will not be used unless these tests pass. The chances of successfully using the new options over the Internet may be relatively small, but AccECN is also intended for use within data centers, where any middleboxes are under the owners' control and can be coerced into letting the options through.
AccECN in Linux
Support for AccECN in the Linux kernel first started arriving in the 6.15 development cycle, with additional pieces following in subsequent releases. In 7.0, a number of final cases have been fixed, and the use of AccECN is enabled by default — for some connections. Specifically, as described in Documentation/networking/ip-sysctl.rst, the use of AccECN (and ECN in general) is controlled by the net/ipv4/tcp_ecn sysctl knob. In previous kernels, the value of tcp_ecn is, by default, two, meaning to use classic ECN when requested for incoming connections, but to not attempt to use it with outgoing connections. AccECN is disabled entirely in that configuration. The new default value is five, which enables AccECN for incoming connections, but still leaves all forms of ECN disabled for outgoing connections. In other words, the fear of protocol ossification remains, so Linux systems will, by default, not attempt to use either type of ECN for connections they initiate.
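A quick way to see which behavior a given system is using is to read the knob directly. The sketch below interprets only the two values discussed above (two and five); the other values are described in Documentation/networking/ip-sysctl.rst and are deliberately not guessed at here. The function names are hypothetical, and the reader returns None on systems where the file is missing or unreadable.

```python
from pathlib import Path

TCP_ECN_PATH = Path("/proc/sys/net/ipv4/tcp_ecn")

# (incoming connections, outgoing connections), per the description above.
DESCRIBED_VALUES = {
    2: ("classic ECN", "no ECN"),
    5: ("AccECN", "no ECN"),
}


def describe_tcp_ecn(value):
    """Return (incoming, outgoing) behavior for the values covered here."""
    fallback = ("see ip-sysctl.rst", "see ip-sysctl.rst")
    return DESCRIBED_VALUES.get(value, fallback)


def read_tcp_ecn(path=TCP_ECN_PATH):
    """Read the current setting, or return None if it is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


current = read_tcp_ecn()
if current is not None:
    incoming, outgoing = describe_tcp_ecn(current)
    print(f"tcp_ecn={current}: incoming {incoming}, outgoing {outgoing}")
```

Changing the value is a matter of writing a new number to the same file (or using the sysctl utility) as root.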
Some highly scientific "screw around on the net for a while" tests conducted here suggest that, 25 years or so after its inception, classic ECN is safe to enable for outgoing connections. It may take some time to determine whether the same is true for AccECN. It will also be a while before AccECN-enabled servers are widespread on the Internet, though they may be deployed within data centers rather more quickly. Decades may be required, but there should eventually come a point where more accurate explicit congestion notification is making the net work more smoothly on a wide scale.
