LWN.net Logo

Pluggable congestion avoidance modules

Many years ago, when the TCP/IP protocols were young, the early Internet went through a bad period. As the number of systems on the net grew, the high-speed (56K) long-haul links which tied the backbone sites together became clogged and the net became very difficult to use. The TCP implementations in use at that time did not understand how to deal with (or even detect) congestion, and, as a result, made the problem worse. Some people began to ask if TCP was going to work at all.

Van Jacobson saved the situation with a simple observation: there is no point in sending data faster than the slowest link between the endpoints can handle it, even if the local network connection is very fast. Overwhelming the long-haul link just causes lots of dropped packets, retransmissions, and even more congestion. The solution was to start transmitting data slowly on a new connection, then to ramp up the speed until packets started getting dropped. The optimal speed was deemed to be one at which just a very small number of packets would fail to arrive. That speed would be adjusted over the life of the connection as conditions on the network changed. With TCP tweaked in this way, communicating systems would scale back their transmissions as the network got more congested, but would ramp up when the bandwidth became available. The result was a net which actually worked for everybody involved. It became possible, for example, to download the entire GNU emacs distribution without having to split it into dozens of small pieces first.

We had to content ourselves with what we could get in those days.

Since then, the net has become much larger, more complex, and faster. The congestion avoidance problem has grown as well, to the point that there are several competing algorithms seeking to provide the best TCP performance while being fair to other network users. Several of these algorithms have found their way into Linux, with a corresponding increase in the complexity of the TCP code. As a way of helping those experimenting with congestion avoidance and eliminating the need to patch the TCP code directly, Stephen Hemminger has posted a new infrastructure which allows congestion avoidance algorithms to be written as pluggable modules. He has also reworked the existing algorithms in the kernel to use the new infrastructure. The result is, among other things, an opportunity to look at how these algorithms work.

The core of the TCP protocol is the concept of a "window," being the amount of data which one side is willing to accept from the other at any given time. The window size reflects what the receiving system can handle - how much buffer space it has available - but it says nothing about what the routers in between can deal with. Congestion avoidance algorithms try to account for the slowest link serving a connection with a "congestion window," which is the maximum amount of data which can be in transit without an acknowledgment from the remote end. An ideal congestion window setting will allow a system to maximize throughput on a connection without excessive packet loss rates, and without taking an unfair amount of the shared network bandwidth. Finding that setting is still more of an art than a science.

Stephen's patches create a new structure to identify a congestion avoidance algorithm:

    struct tcp_ca_type {
	void (*start)(struct tcp_sock *tp);
	u32 (*ssthresh)(struct tcp_sock *tp);
	u32 (*min_cwnd)(struct tcp_sock *tp);
	void (*cong_avoid)(struct tcp_sock *tp, u32 ack, 
			   u32 rtt, u32 in_flight, int good);
	void (*rtt_sample)(struct tcp_sock *tp, u32 rtt);
	void (*set_state)(struct tcp_sock *tp, u8 new_state);

	void (*cwnd_event)(struct tcp_sock *tp, enum tcp_ca_event ev);
	u32  (*undo_cwnd)(struct tcp_sock *tp);
	void (*get_info)(struct tcp_sock *tp, u32 ext, struct sk_buff *skb);

	struct list_head	list;
	struct module 		*owner;
	const char 		*name;
    };

Each of the methods in this structure is a hook into the TCP code which allows the algorithm to obtain information on network conditions and react accordingly:

  • The start() method initializes the algorithm when a new batch of data is being transmitted; this can happen for new sockets, or when one has been idle for a while.

  • The ssthresh() method calculates the "slow start threshold"; when the congestion window is below that threshold, the connection is in slow start mode rather than full congestion avoidance mode. This method is called when congestion occurs.

  • The actual initial window may be set by min_cwnd() to be less than the threshold value as a starting point for the slow start algorithm.

  • When an acknowledgment arrives from the remote end, the cong_avoid() method is invoked; it may respond to successful packet delivery by enlarging the congestion window.

  • rtt_sample() tells the algorithm about a measured round-trip time - the time taken between sending a packet and receiving the corresponding acknowledgment.

  • set_state() indicates that the TCP state of the socket has changed.

  • Various events of interest can be communicated to the algorithm via cwnd_event().

  • Sometimes, transient situations can cause the congestion window to be reduced; the undo_cwnd() method can be called when such a situation is detected to restore a larger window.

  • The get_info() method can be used to make congestion avoidance information available to user space.

The TCP "Reno" algorithm is Van Jacobson's original; it remains wired into the kernel in a non-pluggable form (though it can be overridden). The congestion window starts at the min_cwnd() value, and increases by one segment for each sequential acknowledgment received from the remote end until it hits the slow-start threshold. At that point, the congestion window increases much more slowly until it either hits the TCP window size or packet loss happens. When congestion is detected, the congestion window is cut in half (to a minimum of two segments) and the process starts over.

The Westwood algorithm is a tweak to the Reno approach. The Westwood code carefully tracks the round-trip times of the packets sent, and uses that information to estimate the effective bandwidth of the network connection. When packets get dropped, the congestion window and slow start thresholds are set relative to that bandwidth estimate. As a result, Westwood tends to back off more slowly than Reno, and should, thus, get better bandwidth overall. Its authors claim that Westwood is especially good for wireless links or other situations where the loss of an occasional packet may have nothing to do with congestion.

TCP Vegas also makes use of detailed round-trip time information. In particular, it tries to address a perceived failure in the Reno algorithm, which determines the optimal packet rate by increasing the congestion window until that rate is exceeded. Vegas, instead, monitors changes to the packet round-trip time as the congestion window is increased. If a larger window leads to longer round-trip times, the algorithm concludes that congestion is about to set in and the window is reduced slightly. The Vegas algorithm (or at least the Linux implementation thereof) does not perform well in all environments, so it is not enabled by default.

Binary Increase Congestion Control (BIC) [PDF] tries to be smarter about how the congestion window size is adjusted. Among other things, it is aimed at high-performance networks where the TCP window may be quite large. The other algorithms may, in congestion avoidance mode, make large changes to the congestion window which can result in abrupt increases in network traffic. The BIC code combines two algorithms as a way of trying to quickly converge on the proper congestion window while avoiding massive packet dumps. The core technique is a binary search; if the window is to be increased, the point midway between the current value and the maximum size is chosen. Decreases are handled by picking the midpoint between the current value and the threshold. If, however, the endpoints are too far apart, an "additive increase" is done instead - the congestion window is resized by a constant value.

The High-speed TCP algorithm is optimized for very fat pipes - 10G Ethernet and such. When things are congested, it behaves much like the Reno algorithm. When the congestion window is being increased, however, the high-speed algorithm makes use of a table to pick large increment values. This approach lets the congestion window get very large (i.e. tens of thousands of segments) quickly, and to stay large, without requiring that the network function for long periods of time without a single dropped packet.

The last of the pluggable modules is the TCP Hybla implementation. Hybla is based on the observation that the other algorithms, which use round-trip times heavily in their calculations, tend to be biased against satellite links and other high-latency connections. So Hybla includes a calculation which allows the congestion window to become larger, more quickly, when the round-trip time is very high. In this way, it tries to keep the pipe full enough to make use of the available bandwidth, even though the time to turn around any individual packet is long.

Stephen is currently suggesting that this patch set should go into 2.6.13, after a good shakedown period in the -mm tree. There does not seem to be a whole lot of opposition, so things may well happen just that way.


(Log in to post comments)

Pluggable congestion avoidance modules

Posted Mar 24, 2005 3:35 UTC (Thu) by jwb (guest, #15467) [Link]

Are all these algorithms actually implemented in the proposed patch? It could be interesting to
play with, because it makes TCP a game. Individual sysops may choose an algorithm that gives
them a slight advantage while ruining aggregate performance. That sounds pretty fun.

Pluggable congestion avoidance modules

Posted Mar 24, 2005 5:16 UTC (Thu) by corbet (editor, #1) [Link]

Yes, Stephen's patch include all of those algorithms - most of them are already in the kernel, he just repackaged them into modules.

Very nice article!

Posted Mar 24, 2005 8:09 UTC (Thu) by geertj (subscriber, #4116) [Link]

Compliments for the article. The background and history given here is very useful, and to me makes the difference between any Linux news site and the quality articles at LWN.

Very nice article!

Posted Mar 24, 2005 15:05 UTC (Thu) by Duncan (guest, #6647) [Link]

Indeed. I had in mind to post something very similar. Unlike some of the
other kernel articles, which sometimes go over my head (altho I'm sure
some find them as incredibly useful as I and apparently you found this,
and if that someone is the guy/gal writing the driver for that peripheral
I'm going to be buying...), this is exactly the type of thing that I can
make use of, therefore, exactly the type of thing that I have in mind when
it comes time to fill in that subscription reauthorization.

Duncan

Very nice article!

Posted Mar 24, 2005 15:20 UTC (Thu) by hmh (subscriber, #3838) [Link]

Agreed. The explanations are very very good, and this is what sets LWN appart from the other publications.

As a side note, after reading the article, I tried to select Westwood instead of BIC (which the kernel selects by default as I compiled all of them in), and got instant higher TCP performance on a network with an WiFi backbone link.

Thank you LWN!

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds