
More accurate congestion notification for TCP

By Jonathan Corbet
February 18, 2026
The "More Accurate Explicit Congestion Notification" (AccECN) mechanism is defined by an RFC draft. The Linux kernel has been gaining support for AccECN with TCP over the last few releases; the 7.0 release will enable it by default for general use. AccECN is a subtle change to how TCP works, but it has the potential to improve how traffic flows over both public and private networks.

TCP, from the beginning, has included a couple of window counters used by each side of a connection to specify how much data it is willing to accept from the other at any given time. The windows work well to prevent the endpoints from being overwhelmed with packets, but early TCP did not consider the problem of congestion in the routers between the endpoints. That shortcoming made itself known in the form of severe congestion problems in the mid-to-late 1980s.

Around that time, Van Jacobson and Mike Karels took on the problem of preventing congestion collapse. Their key insight was that dropped packets were almost never a result of corruption of the packets themselves. Instead, they were a signal that some system between the endpoints was experiencing congestion; indeed, dropped packets were the only way that a router could signal congestion. Jacobson implemented the first congestion-control algorithms that would slowly ramp up the transmission rate until packet loss was experienced, indicating the point where the capacity of the channel had been exceeded. Jacobson's classic paper describes this work in detail.
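
As a rough illustration of that ramp-up-until-loss behavior, here is a toy Python sketch. It is not Jacobson's actual algorithm; the function name, the fixed path capacity, the one-packet-per-round-trip increase, and the halving on loss are all assumptions made for the example.

```python
def aimd_trace(capacity, rounds, start=1):
    """Toy congestion window: grow by one per round trip, halve on loss.

    'capacity' is a hypothetical number of packets the path can carry per
    round trip; exceeding it stands in for a dropped packet.
    """
    cwnd = start
    trace = []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity:
            # Loss is the only congestion signal: back off sharply.
            cwnd = max(1, cwnd // 2)
        else:
            # No loss seen: probe for a little more bandwidth.
            cwnd += 1
    return trace
```

With a capacity of four packets, aimd_trace(4, 8) yields [1, 2, 3, 4, 5, 2, 3, 4]: the sender only learns about the limit by overshooting it.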

Using packet-loss events in this way made the net work again, but it was never going to be the most efficient way to regulate transmission speeds. It takes time to realize that a packet has been dropped, and each dropped packet represents a waste of resources. It would be far better if the TCP endpoints could be informed of congestion, and moderate their transmission speeds, before the congestion reaches the point of packet loss.

Explicit congestion notification

Around the end of the 1990s, work was started on what eventually became RFC 3168, describing explicit congestion notification (ECN), a means by which routers can inform the endpoints of a connection that they are experiencing congestion. It required changes at both the IP and TCP layers of the stack.

At the IP level, two bits were allocated from the IPv4 and IPv6 headers; they were named ECT and CE. The setting of either of those bits (but not both) in an IP packet is an indication that the endpoints understand the ECN protocol and are willing to implement it. When a router that is experiencing congestion receives a packet with exactly one of those bits set, it can choose to set the other bit to indicate "congestion experienced" in the hope that the endpoints will respond by slowing their transmission rates.
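
The bit manipulation involved can be sketched in a few lines of Python. The two bits occupy the low end of the IPv4 TOS (or IPv6 traffic class) byte; the constant names below follow the codepoint names used in RFC 3168, while the function names are hypothetical and exist only for this example.

```python
ECN_MASK = 0b11   # low two bits of the TOS / traffic-class byte
NOT_ECT = 0b00    # sender does not do ECN
ECT_1 = 0b01      # ECN-capable transport, codepoint ECT(1)
ECT_0 = 0b10      # ECN-capable transport, codepoint ECT(0)
CE = 0b11         # both bits set: congestion experienced


def ecn_codepoint(tos):
    """Extract the two ECN bits from a TOS / traffic-class byte."""
    return tos & ECN_MASK


def mark_congestion(tos):
    """What a congested router may do to a packet it forwards.

    Only packets with exactly one ECN bit set are marked; packets from
    non-ECN senders are returned unchanged (a real router would have to
    fall back to dropping them).
    """
    if ecn_codepoint(tos) in (ECT_0, ECT_1):
        return tos | CE
    return tos
```

Note that marking leaves the other six bits of the byte alone, so the packet's differentiated-services value survives the trip.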

In a typical TCP connection, one side will be transmitting at a rather higher rate than the other. If the heavy transmitter is causing congestion, the ECN signal will arrive at the receiving end, where it is not entirely useful. So TCP had to be enhanced to relay that signal back to the transmitting side. Two bits were allocated in the TCP header as well with the names ECE (ECN echo) and CWR (congestion window reduced). If both of those bits are set in the initial SYN packet starting a connection, they are interpreted as a signal that the initiating side implements ECN. If the peer also supports ECN, it sends its SYN-ACK response with only the ECE bit set. When both of those things happen, the connection will use ECN.
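
That negotiation reduces to a pair of flag tests. The following is a minimal sketch using the standard TCP header flag values; the function names are invented for the example, and real stacks perform a number of additional sanity checks that are omitted here.

```python
TCP_SYN = 0x02
TCP_ACK = 0x10
TCP_ECE = 0x40   # ECN echo
TCP_CWR = 0x80   # congestion window reduced


def syn_requests_ecn(flags):
    """The initiator asks for ECN by setting both ECE and CWR on its SYN."""
    both = TCP_ECE | TCP_CWR
    return bool(flags & TCP_SYN) and (flags & both) == both


def synack_accepts_ecn(flags):
    """The peer agrees by answering with ECE set and CWR clear."""
    return (flags & (TCP_ECE | TCP_CWR)) == TCP_ECE


def connection_uses_ecn(syn_flags, synack_flags):
    """ECN is used only if it was both requested and accepted."""
    return syn_requests_ecn(syn_flags) and synack_accepts_ecn(synack_flags)
```

A SYN-ACK that simply reflects both bits back does not count as acceptance in this sketch, since only ECE is supposed to be set in the reply.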

When one side of a connection receives a packet with the two IP-level congestion-mark bits set, indicating congestion in the path, it will start setting the TCP ECE bit in every ACK packet it sends back to the other side. An endpoint, on receiving a packet with ECE set, is supposed to respond in the same way it would if a packet had been dropped; it will reduce its congestion window (and thus the transmission speed). It will also set the CWR bit in the TCP header in the next packet it sends to indicate that the ECE signal has been received. Once the CWR bit is observed at the other end, the recipient will stop setting ECE.
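
The receiving side of that exchange is a one-bit state machine, sketched below. The class and method names are hypothetical, and the sender's half (shrinking the congestion window and setting CWR) is left out to keep the example short.

```python
TCP_ACK = 0x10
TCP_ECE = 0x40
TCP_CWR = 0x80
IP_ECT_0 = 0b10   # ECN-capable, no congestion seen
IP_CE = 0b11      # both IP-level bits set: congestion experienced


class EcnReceiver:
    """Tracks whether outgoing ACKs should carry the ECE bit."""

    def __init__(self):
        self.echoing = False

    def on_segment(self, ip_ecn, tcp_flags):
        """Process one incoming segment; return the flags for its ACK."""
        if tcp_flags & TCP_CWR:
            # The sender has reacted, so the echo can stop...
            self.echoing = False
        if ip_ecn == IP_CE:
            # ...unless this very segment carries a fresh congestion mark.
            self.echoing = True
        return TCP_ACK | (TCP_ECE if self.echoing else 0)
```

Since ECE stays set until CWR comes back, any number of congestion marks arriving in the meantime collapse into a single signal.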

The Linux kernel gained support for ECN in the 2.4.0-test7 release in September 2000. The immediate result was an early lesson on the problem of protocol ossification. As was noted in LWN at the time, many of the routers on the Internet not only did not support ECN, but they also actively dropped SYN packets with the TCP ECN bits set, making communication impossible. So, while Linux had ECN support from an early date, it was many years before it could be safely enabled on most systems, and it still is not fully enabled even in current kernels.

More accurate ECN

ECN was an improvement over what came before, but there is room to do even better. The design of the ECN protocol means that it can only communicate a single "congestion experienced" event during each round-trip time for the connection; that is how long it will take between the transmission of the first ACK with ECE set and the reception of a packet with CWR set. That will slow the response to heavy congestion, with the likely result that packets will still be dropped. AccECN was designed to provide faster and more detailed feedback on congestion to the TCP endpoints.

AccECN makes minimal changes to ECN at the IP level; the two bits are used as before. At the TCP level, it grabs another header bit that had, back in 2003, been assigned by RFC 3540 for a "robust ECN" mechanism that was never deployed. That bit, renamed AE, is used in a couple of ways with the new protocol. At connection time, an AccECN-capable site should set the AE bit along with ECE and CWR; if the other side also supports AccECN, it will respond with ECE and AE set. If the receiving side does not understand AccECN and ignores the AE bit, it will see what looks like a "classic ECN" configuration and respond accordingly. (Note that the connection protocol, like everything else, is somewhat more complex than described here; see the RFC draft for the gory details.)
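
The backward-compatibility trick on the SYN can be shown with a short sketch. It covers only how a server might classify an incoming SYN; the function name and the understands_accecn parameter are made up for the example, and the reply encoding is deliberately left out, since the draft's rules there are more involved.

```python
TCP_SYN = 0x002
TCP_ECE = 0x040
TCP_CWR = 0x080
TCP_AE = 0x100   # the bit once assigned by RFC 3540


def classify_syn(flags, understands_accecn):
    """Return which flavor of ECN an incoming SYN appears to request."""
    if not understands_accecn:
        # A legacy stack never looks at the AE bit.
        flags &= ~TCP_AE
    wants_ecn = (flags & (TCP_ECE | TCP_CWR)) == (TCP_ECE | TCP_CWR)
    if wants_ecn and flags & TCP_AE:
        return "accecn"
    if wants_ecn:
        return "classic-ecn"
    return "none"
```

The same AccECN SYN thus reads as "accecn" to a new server and as "classic-ecn" to an old one, which is what allows the initiator to fall back without retrying.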

When AccECN is in use, each side maintains a set of counters, one of which is the number of packets received with the congestion-experienced marker. After the connection is established, the AE, CWR, and ECE bits are combined into a single three-bit field, inevitably called ACE. The contents of that field will be the three least-significant bits of the packet counter, giving the other side a continually updated view of how many congestion-marked packets have been seen. When the ACE count changes, a transmitting side can get a sense for just how many packets have been stamped with the congestion mark in transit and respond accordingly.
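
Packing the counter into the header is simple because the three flag bits are adjacent. The sketch below assumes that AE is the most-significant bit of the ACE field, which matches the bits' order in the TCP header; the helper names are hypothetical.

```python
TCP_ACK = 0x010
TCP_ECE = 0x040
TCP_CWR = 0x080
TCP_AE = 0x100
ACE_SHIFT = 6
ACE_MASK = TCP_AE | TCP_CWR | TCP_ECE   # 0x1C0, three contiguous bits


def set_ace(flags, ce_packet_count):
    """Store the low three bits of the CE-packet counter in the flags."""
    ace = ce_packet_count & 0b111
    return (flags & ~ACE_MASK) | (ace << ACE_SHIFT)


def get_ace(flags):
    """Recover the three-bit ACE value from a received header."""
    return (flags & ACE_MASK) >> ACE_SHIFT
```

A counter value of 13, for example, travels as 5 (binary 101), which shows up on the wire as AE and ECE set with CWR clear.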

Three bits do not allow for a large count, needless to say. The RFC draft provides a set of complicated rules for determining whether the count may have wrapped and guessing how many times that may have happened. ACKs are sent relatively frequently — perhaps one for every two data packets in an ongoing stream — leaving little opportunity for multiple wraps of the ACE counter most of the time. In any case, eight counter values that can change with every ACK (rather than one bit that can only change once per round-trip time) provide much higher-resolution information on the presence of congestion on the path between the two endpoints.
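
The core of the wrap problem is ordinary modular arithmetic, as the following sketch shows. It does not reproduce the draft's rules for choosing among the possibilities; the function names and the max_wraps parameter are assumptions made for the example.

```python
ACE_MODULUS = 8   # three bits of counter


def ace_delta(prev_ace, cur_ace):
    """Smallest number of new CE marks consistent with two ACE values."""
    return (cur_ace - prev_ace) % ACE_MODULUS


def candidate_counts(prev_ace, cur_ace, max_wraps=2):
    """Every count that would look identical on the wire.

    Each extra wrap of the three-bit counter adds eight marks; the
    receiver of the ACK has to guess which candidate is the real one.
    """
    base = ace_delta(prev_ace, cur_ace)
    return [base + ACE_MODULUS * k for k in range(max_wraps + 1)]
```

If the ACE field moves from 6 to 1, the sender knows that at least three packets were marked, but 11 or 19 would produce exactly the same field; how many packets a single ACK could plausibly cover is what narrows the choice.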

AccECN, as described so far, was clearly designed to avoid as many protocol-ossification problems as possible. Even so, it includes a number of provisions for the detection of middlebox interference with the ACE bits and the count as a whole. The nature of the modern Internet is such that protocol changes must be done with a lot of care, even when the changes are within the specification of the protocols themselves.

There is more to AccECN, though, if the connection will support it. Each side of the connection is required to maintain three other counters for incoming data. There are two counters to track the number of bytes received with either (but not both) of the IP-level ECN bits set, and a counter for the number of bytes received with both bits set (indicating congestion). There is a pair of TCP options that can be used to communicate these counters (more precisely, the bottom 24 bits of each counter) to the other side. These counters give a far more accurate indication of how much congestion is actually occurring, and they can profitably be put to use by a number of advanced congestion-control algorithms.
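
A receiver's bookkeeping for those byte counters might look something like the sketch below. The class and method names are invented, the counters start at zero purely for simplicity (the draft specifies its own initial values), and the encoding of the TCP options themselves is not shown.

```python
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11
FIELD_MASK = 0xFFFFFF   # only the bottom 24 bits go on the wire


class AccEcnByteCounters:
    """Per-connection byte counts, keyed by IP-level ECN codepoint."""

    def __init__(self):
        self.bytes = {ECT_0: 0, ECT_1: 0, CE: 0}

    def on_segment(self, ip_ecn, payload_len):
        """Account for one received segment; unmarked traffic is ignored."""
        if ip_ecn in self.bytes:
            self.bytes[ip_ecn] += payload_len

    def option_fields(self):
        """The truncated values that would be sent to the peer."""
        return {cp: n & FIELD_MASK for cp, n in self.bytes.items()}


def counter_delta(prev_field, cur_field):
    """Bytes newly reported between two 24-bit fields, allowing one wrap."""
    return (cur_field - prev_field) & FIELD_MASK
```

With 24 bits, a field wraps only after 16MB of data, so the wrap ambiguity that plagues the three-bit ACE field is far less of a concern here.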

The problem with TCP options, of course, is again middleboxes, which often will not pass packets that contain unrecognized options. The connection-establishment dance thus includes a couple of attempts to send packets with the AccECN options to see whether they make it unmolested to the other end; the options will not be used unless these tests pass. The chances of successfully using the new options over the Internet may be relatively small, but AccECN is also intended for use within data centers, where any middleboxes are under the owners' control and can be coerced into letting the options through.

AccECN in Linux

Support for AccECN in the Linux kernel first started arriving in the 6.15 development cycle, with additional pieces following in subsequent releases. In 7.0, a number of final cases have been fixed, and the use of AccECN is enabled by default, though only for some connections. Specifically, as described in Documentation/networking/ip-sysctl.rst, the use of AccECN (and ECN in general) is controlled by the net/ipv4/tcp_ecn sysctl knob. In previous kernels, the value of tcp_ecn was, by default, two, meaning to use classic ECN when requested for incoming connections, but to not attempt to use it with outgoing connections. AccECN is disabled entirely in that configuration. The new default value is five, which enables AccECN for incoming connections, but still leaves all forms of ECN disabled for outgoing connections. In other words, the fear of protocol ossification remains, so Linux systems will, by default, not attempt to use either type of ECN for connections they initiate.
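
For those wanting to see where a given system stands, a small script can read the knob and interpret it. This sketch only knows about the two values discussed here; the function names are hypothetical, and any other value is simply referred back to ip-sysctl.rst rather than guessed at.

```python
from pathlib import Path

TCP_ECN_PATH = Path("/proc/sys/net/ipv4/tcp_ecn")

# Only the two defaults described in the article: (incoming, outgoing).
KNOWN_MODES = {
    2: ("classic ECN when requested", "no ECN"),
    5: ("AccECN when requested", "no ECN"),
}


def describe_tcp_ecn(value):
    """Summarize a tcp_ecn setting in words."""
    if value not in KNOWN_MODES:
        return f"tcp_ecn={value}: see Documentation/networking/ip-sysctl.rst"
    incoming, outgoing = KNOWN_MODES[value]
    return f"tcp_ecn={value}: incoming: {incoming}; outgoing: {outgoing}"


def read_tcp_ecn(path=TCP_ECN_PATH):
    """Read the current setting; only works on a Linux system."""
    return int(path.read_text().strip())
```

Running print(describe_tcp_ecn(read_tcp_ecn())) on a Linux machine will report the setting in force; changing it requires root privileges and a kernel that accepts the new value.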

Some highly scientific "screw around on the net for a while" tests conducted here suggest that, 25 years or so after its inception, classic ECN is safe to enable for outgoing connections. It may take some time to determine whether the same is true for AccECN. It will also be a while before AccECN-enabled servers are widespread on the Internet, though they may be deployed within data centers rather more quickly. Decades may be required, but there should eventually come a point where more accurate explicit congestion notification is making the net work more smoothly on a wide scale.

Index entries for this article
Kernel: Networking/Congestion control



Are router/switch vendors likely to respond?

Posted Feb 18, 2026 15:48 UTC (Wed) by davecb (subscriber, #1574) (3 responses)

In a previous life, they resisted everything (:-))
Of course, if they lazily refrain from *doing* anything to these bits and options, we're in a good state.

Are router/switch vendors likely to respond?

Posted Feb 18, 2026 17:09 UTC (Wed) by Wol (subscriber, #4433) (2 responses)

What we need to do is make "RFC compliant" a valuable marketing tool - at which point if they start breaking protocols they run foul of "fit for purpose" legislation.

Actually, there might be an argument that gear is not CRA-compliant if it isn't RFC-compliant :-)

Cheers,
Wol

Are router/switch vendors likely to respond?

Posted Feb 19, 2026 5:19 UTC (Thu) by wtarreau (subscriber, #51152) (1 response)

It's most often firewall vendors that rely on the usual "we block X% of attacks" argument to sell their crap, unfortunately. For them, better be less compliant and block more legitimate traffic and more attacks at the same time than fail compared to a competitor during a benchmark that floods packets with random bits set :-(

Are router/switch vendors likely to respond?

Posted Feb 19, 2026 10:02 UTC (Thu) by Wol (subscriber, #4433)

Urk. "We have fewer false negatives - who cares about false positives" :-(

This is where you want old-fashioned trustworthy journalists who can write a proper article / benchtest.

Cheers,
Wol

TIL

Posted Feb 27, 2026 8:51 UTC (Fri) by Hi-Angel (guest, #110915) (1 response)

TIL that "ossification" is a word.

I was reading the article and I thought "ossification" means "open-sourcing", and I was like "why would open-sourcing a protocol be a problem for protocol adoption? 🤷‍♂️".

So I read the QUIC article referred to here, and it also used "ossification" with no further explanation. And then it hit me: it's probably a word of its own, with no relation to OSS! And so it is.

TIL

Posted Feb 27, 2026 9:25 UTC (Fri) by Wol (subscriber, #4433)

Iirc, the word "os" is Latin for bone. So "ossification" means converting to bone / stone. The word long pre-dates computers, let alone FLOSS.

Cheers,
Wol

per-connection? fallback?

Posted Mar 4, 2026 17:42 UTC (Wed) by pj (subscriber, #4506) (1 response)

Can AccECN be enabled per-connection? If it's enabled, and the outbound connect attempt with AccECN fails, will it fall back and try again without AccECN (maybe with plain ECN, maybe with bog-standard TCP)? Such behavior might encourage better/faster adoption rates.

per-connection? fallback?

Posted Mar 4, 2026 19:06 UTC (Wed) by corbet (editor, #1)

I don't believe there is per-connection control. It will definitely fall back, though, if an AccECN attempt fails; the procedure for doing that was designed into the protocol.

Funny docs

Posted Mar 11, 2026 9:25 UTC (Wed) by safari (guest, #96021)

On 6.18.16, 2 is the maximum value I can write to the tcp_ecn sysctl (otherwise I get EINVAL).
Documentation/networking/ip-sysctl.rst documents values 0-5.
However,
$ grep tcp_ecn_mode_max net/ipv4/sysctl_net_ipv4.c
static int tcp_ecn_mode_max = 2;
.extra2 = &tcp_ecn_mode_max,

However, this was changed in commit 8ae3e8e6ceedfb3cf74ca18169c942e073586a39 from 2 to 5.

Is 6.18.16 safe to run if I change that to 5?


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds