
The debloat-testing kernel tree

Various developers concerned about the bufferbloat problem have put together a kernel tree for the testing of bloat mitigation and removal patches. "The purpose of this tree is to provide a reasonably stable base for the development and testing of new algorithms, miscellaneous fixes, and maybe a few hacks intended to advance the cause of eliminating or at least mitigating bufferbloat in the Linux world." Current patches include the CHOKe packet scheduler, the SFB flow scheduler, some driver patches, and more.


The debloat-testing kernel tree

Posted Feb 25, 2011 23:25 UTC (Fri) by louie (guest, #3285) [Link] (14 responses)

Has LWN done a no-bloat summary of the bufferbloat problem? I admit I couldn't read Jim's treatises, aka "blog posts." :)

The debloat-testing kernel tree

Posted Feb 25, 2011 23:33 UTC (Fri) by corbet (editor, #1) [Link] (3 responses)

No, not yet. Jim does explain it fairly well if you stick with him, so I wasn't sure how much we could improve on his stuff.

The debloat-testing kernel tree

Posted Feb 25, 2011 23:53 UTC (Fri) by mtaht (subscriber, #11087) [Link] (2 responses)

Some more accessible materials:

http://nex-6.taht.net/posts/Did_Bufferbloat_Kill_My_Net_R...

http://digital-rag.com/article.php/Buffer-Bloat-Packet-Loss

http://www.enterprisenetworkingplanet.com/netsp/article.p...

but to well and truly "get" bufferbloat, and its internet-wide scope, I'd recommend spending 25 minutes with the brief audio while looking at the slides here:

http://mirrors.bufferbloat.net/Talks/BellLabs01192011/

The debloat-testing kernel tree

Posted Feb 26, 2011 0:00 UTC (Sat) by mtaht (subscriber, #11087) [Link] (1 responses)

I must also mention several EXCELLENT lwn discussions of bufferbloat:

http://lwn.net/Articles/419714/

http://lwn.net/Articles/421463/

There are quite a few more out there.

The debloat-testing kernel tree

Posted Feb 27, 2011 8:03 UTC (Sun) by faassen (guest, #1676) [Link]

Thanks for the various links, appreciated!

Bufferbloat: the summary

Posted Feb 25, 2011 23:35 UTC (Fri) by mikesalib (guest, #17162) [Link] (9 responses)

Yeah, I found Gettys' writing to be...challenging to get through. So here's my summary:

Memory is cheap and adding more packet buffers to a networked system helps improve throughput a very small bit. Since ISPs and hardware/software vendors focus on total throughput, they've got an incentive to add lots of packet buffering.

The problem though is that huge amounts of packet buffering prevent TCP from dealing with congestion appropriately by throttling connections: larger buffers mean that TCP has to wait much longer than it should to detect congestion. As with all feedback control loops, increasing the delay seriously harms the system. As a result, we end up with systems where total throughput might be high but latency under load becomes absurd. So you can't transfer a large file while running a VOIP call: the file transfer will get good throughput, but all the packets will see absurd latencies, so VOIP is no longer feasible. That's the real problem: we all need to be using latency-under-load in addition to total-throughput as our metrics for network performance.

The problem is compounded by having excess buffers damn near everywhere. They're in your router, in your ISP's terminal equipment, in your NIC hardware and probably in a few places in the kernel.
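
A back-of-the-envelope way to see the latency-under-load problem is to compute how long a full buffer takes to drain at the bottleneck rate. A minimal sketch, where the buffer size and uplink speed are illustrative assumptions rather than measurements of any particular device:

    # Seconds a packet waits behind a full buffer at the bottleneck link.
    def queue_drain_delay(buffer_bytes, link_bits_per_sec):
        return buffer_bytes * 8 / link_bits_per_sec

    # Hypothetical DSL uplink: 1 Mbit/s with 256 KB of buffering in front of it.
    delay = queue_drain_delay(256 * 1024, 1_000_000)
    print(f"added latency under load: {delay:.1f} s")  # ~2.1 s -- hopeless for VOIP

Any bulk transfer that keeps that buffer full adds the entire drain time to every VOIP packet sharing the link.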

Bufferbloat: the summary

Posted Feb 25, 2011 23:53 UTC (Fri) by jg (guest, #17537) [Link] (8 responses)

It's actually the case that the additional buffering has destroyed congestion avoidance altogether.

So we've actually *increased* packet loss much of the time by trying to avoid packet loss, and in doing so made the latency of doing anything while your connection is loaded often extremely long.

This occurs to differing degrees in all of the OSes, in home routers, in broadband systems, in 3G networks, and in some ISPs' networks.

In Linux there are usually two such buffers at present, plus several more I have examples of and one other I hypothesize. The two common disasters are the Linux transmit queue, and the insanely large transmit/receive rings in most modern network hardware.

Bufferbloat: the summary

Posted Feb 26, 2011 0:30 UTC (Sat) by zlynx (guest, #2285) [Link] (7 responses)

They're not insanely large when transmitting on the LAN they're designed for.

Bufferbloat: the summary

Posted Feb 26, 2011 2:05 UTC (Sat) by jg (guest, #17537) [Link]

Sometimes the buffer sizes are insane even on the medium they were designed for.

Memory has gotten so big/so cheap that people often use values much much larger than makes sense under any circumstances.

For example, I've heard of DSL hardware with > 6 seconds of buffering.

Take a look at the netalyzr plots in: http://gettys.wordpress.com/2010/12/06/whose-house-is-of-...

The diagonal lines are latency in *seconds*.

Bufferbloat: the summary

Posted Feb 26, 2011 2:15 UTC (Sat) by jg (guest, #17537) [Link] (5 responses)

Also note there is never a "right" answer.

Example: A gigabit ethernet.

Say you have supposedly sized your buffers "correctly", presuming a global-length path, your maximum speed, and some number of flows, by the usual rule of thumb: bandwidth x delay / sqrt(#flows).

Now you plug this gigabit NIC into your 100Mbps switch. Right off the bat, your system should be using 1/10th the # of buffers.

And you don't know how many flows.

So even if you did it "right", you have the *wrong answer*, and do so most of the time.

Example 2: 802.11n

Size your buffers for, say, 100Mbps over continental US delays, presuming some number of flows.

Now, go to a conference with 802.11g, and sit in a quiet corner. Your wireless might be running at a few megabits/second; but you are sharing the channel with 50 other people.

Your "right answer" for buffering can easily be off by 2-3 *orders magnitude*. At that low speed, it can take a *very long time* for your packets to finally get transmitted.

***There is no single right answer for the amount of buffering in most network environments.***

Right now, our systems' buffers are typically sized for the maximum amount of buffering they might ever need, even though we seldom operate them in that regime (if the engineers involved thought about the buffer sizes at all).

So the buffers aren't just oversized, they are downright bloated.
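
To see how badly a single fixed size misses, here is a small sketch of that arithmetic; the link rates, RTT, and flow count are assumed values chosen only to show the mismatch, not anyone's real configuration:

    from math import sqrt

    def rule_of_thumb_buffer(link_bps, rtt_s, n_flows):
        # Rule-of-thumb buffer size in bytes: bandwidth x delay / sqrt(#flows).
        return link_bps * rtt_s / sqrt(n_flows) / 8

    # "Design" case: gigabit link, ~100 ms continental RTT, 100 flows (all assumed).
    designed = rule_of_thumb_buffer(1e9, 0.100, 100)

    # Actual case: the same NIC plugged into a 100 Mbps switch, same RTT and flows.
    needed = rule_of_thumb_buffer(100e6, 0.100, 100)

    print(f"designed for {designed / 1e6:.2f} MB, "
          f"now needs {needed / 1e3:.0f} KB ({designed / needed:.0f}x too much)")

Drop down to a few megabits of shared conference 802.11g and the same arithmetic comes out hundreds of times too large.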

Bufferbloat: the summary

Posted Feb 26, 2011 12:20 UTC (Sat) by hmh (subscriber, #3838) [Link] (2 responses)

Actually, the answer is, and has always been, AQM. You can and should have a dynamically-sized queue, even on hosts (NOTE: socket buffers often should be rather large; that has nothing to do with the queues).

The queue should be able to grow large, but only for flows where the bandwidth-delay product requires it. And it should early-drop.

And the driver DMA ring-buffer size really should be considered part of the queue for any calculations, although you probably have to consider that part of the queue a "done deal" and not drop/reorder anything there. Otherwise, you can get even fast-ethernet to feel like a very badly behaved LFN (long fat network). However, reducing DMA ring-buffer size can have several drawbacks on high-throughput hosts.

Using latency-aware, priority-aware AQM (even if it is not flow-aware) should fix the worst issues, without downgrading throughput on bursty links or long fat networks. Teaching it about the hardware buffers would let it autotune better.
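
As a toy illustration of the early-drop idea (very loosely RED-flavoured; a real AQM tracks an averaged queue depth, and the thresholds here are arbitrary assumptions):

    import random
    from collections import deque

    class EarlyDropQueue:
        # Toy AQM: drop (or ECN-mark) incoming packets with rising probability
        # as the queue grows, instead of waiting for the hard limit (tail drop).
        def __init__(self, min_th=5, max_th=30, hard_limit=50, max_p=0.1):
            self.q = deque()
            self.min_th, self.max_th = min_th, max_th
            self.hard_limit, self.max_p = hard_limit, max_p

        def enqueue(self, pkt):
            depth = len(self.q)
            if depth >= self.hard_limit:
                return False                  # forced tail drop
            if depth > self.min_th:
                # Drop probability ramps up between the two thresholds,
                # signalling congestion to TCP long before the buffer is full.
                p = self.max_p * min(1.0, (depth - self.min_th) /
                                          (self.max_th - self.min_th))
                if random.random() < p:
                    return False              # early drop (or ECN mark)
            self.q.append(pkt)
            return True

        def dequeue(self):
            return self.q.popleft() if self.q else None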

Bufferbloat: the summary

Posted Feb 26, 2011 13:37 UTC (Sat) by jg (guest, #17537) [Link] (1 responses)

Yes, AQM is the answer, including on hosts.

What AQM algorithm is a different question.

Van Jacobson says that RED is fundamentally broken, and has no hope of working in the environments we have to operate in. And Van was one of the inventors of RED...

SFB may or may not hack it. Van has an algorithm he is finishing writing up that he thinks may work; hopefully it will be available soon. We have a fundamentally interesting problem here, and testing this is going to be much more work than implementing it, by orders of magnitude.

It isn't clear the AQM needs to be priority aware; wherever the queues are building, you are more likely to choose a packet to drop (literally drop, or ECN mark) just by running an algorithm across all the queues. I haven't seen arguments that make me believe the AQM must be per queue (that doesn't mean there aren't any! I just haven't seen them).

And there are good reasons why the choice of packet to drop should have randomness in it; time-based congestion can occur if you don't. Different packet types also have different leverage (ACKs vs. data vs. SYNs, etc.).
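
One way to picture "running an algorithm across all the queues" with randomness in it (purely a sketch of the idea, not any published algorithm): pick the drop or mark victim at random, weighted by each queue's share of the total backlog:

    import random

    def pick_victim(queues):
        # queues: dict mapping queue name -> list of packets.
        # Returns (queue_name, packet) with probability proportional to each
        # queue's share of the backlog, or None if everything is empty.
        total = sum(len(q) for q in queues.values())
        if total == 0:
            return None
        r = random.randrange(total)
        for name, q in queues.items():
            if r < len(q):
                return name, q[r]             # candidate to drop or ECN-mark
            r -= len(q)

    # The bulk-transfer queue dominates the backlog, so drops mostly land there.
    queues = {"voip": list(range(2)), "bulk": list(range(40)), "dns": []}
    print(pick_victim(queues))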


Bufferbloat: the summary

Posted Feb 26, 2011 16:14 UTC (Sat) by hmh (subscriber, #3838) [Link]

You're likely not going to get anywhere above "acceptable" using a simple AQM, even if it is SFB. It is not going to get to "good" or "excellent" marks.

The Diffserv model got it right, in the sense that even on a simple host there are flows for which you do NOT want to drop packets (DNS, NTP) if you can help it, and that there is naturally a hierarchy of priorities determining which services you'd rather have suffer more packet drops than others during congestion.

I've also found that "socializing" the available bandwidth among flows of the same class is a damn convenient thing (SFQ). SFB does this well, AFAIK.

So, I'd say that what we should aim for on hosts is an auto-tuned, flow-aware AQM that at least pays attention to the bare minimum of priority ordering (802.1p/DSCP class selectors) and does a good job of keeping latency under control without killing throughput on high bandwidth-delay product flows. Such a beast could be enabled by default on a distro [for desktops] with little fear.

This doesn't mean you need multiple queues. However, you will want multiple queues in many cases because that's how hardware-assisted QoS works, such as what you find on any 802.11n device or non-el-cheap-o gigabit ethernet NIC.

Routers are a different deal altogether.
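
To make the SFQ-style "socializing" mentioned above concrete, here is a toy sketch: hash each flow onto its own sub-queue and serve the sub-queues round-robin, so a single bulk flow cannot monopolize the link. Real SFQ adds hash perturbation and byte accounting; everything here is a simplified assumption:

    from collections import deque

    class ToySFQ:
        def __init__(self, buckets=16):
            self.queues = [deque() for _ in range(buckets)]
            self.next = 0

        def enqueue(self, flow_tuple, pkt):
            # flow_tuple: (src_ip, dst_ip, src_port, dst_port, proto)
            bucket = hash(flow_tuple) % len(self.queues)
            self.queues[bucket].append(pkt)

        def dequeue(self):
            # Round-robin over non-empty buckets: each active flow gets a turn.
            for _ in range(len(self.queues)):
                q = self.queues[self.next]
                self.next = (self.next + 1) % len(self.queues)
                if q:
                    return q.popleft()
            return None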

Bufferbloat: the summary

Posted Feb 26, 2011 17:58 UTC (Sat) by kleptog (subscriber, #1183) [Link] (1 responses)

Note: you have to be very careful to distinguish different kinds of buffers. The TCP windows and hence buffers at the *endpoints* of TCP connections do need to be scaled according to the bandwidth-delay product. However, the *routers* in between don't need anywhere near that much, they're routing IP packets and the buffers are just for getting good utilisation from the link.

Any delay in the routers adds to the overall delay and thus adds to the bandwidth-delay product. In essence your endpoint's memory usage is the sum of the memory used by all the routers in between. Packets spend more time in memory than they do in-flight.

There was a paper somewhere about buffers and streams, and the more streams you have, the fewer buffers you need. So your endpoints need big buffers, your modem smaller buffers, and internet routers practically none.
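
The endpoint/router distinction can be made concrete with the bandwidth-delay product; the path numbers below are assumptions for illustration only:

    def bdp_bytes(bandwidth_bps, rtt_s):
        # Bandwidth-delay product: bytes "in flight" needed to keep the pipe full.
        return bandwidth_bps * rtt_s / 8

    # Assumed path: 50 Mbit/s end to end, 120 ms round trip.
    endpoint_buffer = bdp_bytes(50e6, 0.120)   # what the sending/receiving host needs
    print(f"endpoint socket buffer: ~{endpoint_buffer / 1e6:.2f} MB")   # ~0.75 MB

    # A router on the path only needs to ride out transient bursts, e.g. a few
    # milliseconds at its own link rate, not the whole path's BDP.
    router_buffer = bdp_bytes(50e6, 0.005)
    print(f"router queue: ~{router_buffer / 1e3:.0f} KB")               # ~31 KB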

Bufferbloat: the summary

Posted Feb 27, 2011 16:33 UTC (Sun) by mtaht (subscriber, #11087) [Link]

This paper has been very influential on my thinking about routers, home gateways and personal computers and coping with bufferbloat.

http://www.cs.clemson.edu/~jmarty/papers/PID1154937.pdf

Wireless is not part of this study and has special problems (retries)...

The debloat-testing kernel tree

Posted Feb 26, 2011 0:36 UTC (Sat) by zlynx (guest, #2285) [Link] (12 responses)

After reading about buffer bloat again I'm not sure the problem is in the buffers. It sure sounds like the problem is in TCP/IP.

As the buffers start to fill up the ACK timestamps should get further and further behind. This should tell the sender to slow down even if it is still getting ACKs for every sent packet.

Why doesn't this happen?

The debloat-testing kernel tree

Posted Feb 26, 2011 1:43 UTC (Sat) by drag (guest, #31333) [Link]

I donno.

Probably something like: TCP/IP does not assume that latency is a good measure of available bandwidth. Over a high-latency link it's not going to be a reliable indicator of anything. Packets can take different paths, arrive out of order, and do all sorts of fun stuff like that.

The fact that latency may change randomly may mean that you're running on a different path over the internet, or a dozen other things. If you're not losing packets, it should be a safe indication that you can continue to send at the current rate*.

* That is unless your router has a gigantic bit bucket and can cache everything, of course. Maybe Linux should have a dial on it that says 'Ok you have hit 100% network bandwidth, now dial it back 20% to avoid potential issues if you see latency changes' or something like that.

Connections don't always have the same properties in both directions

Posted Feb 26, 2011 2:57 UTC (Sat) by david.a.wheeler (subscriber, #72896) [Link]

One reason, I believe, is that connections don't always have the same properties in both directions.

The debloat-testing kernel tree

Posted Feb 26, 2011 3:12 UTC (Sat) by jg (guest, #17537) [Link]

Well, it does get further and further behind.

So you end up using 1 more buffer per RTT. The RTT does continue to increase, and increase, until you've filled the buffer at the bottleneck (in each direction). You can see this in the TCP traces I took.

But having added soooo much latency (many times the natural RTT), the servo system that is TCP can no longer be stable. Bad Things Happen (TM). No servo system can be stable with those delays; things oscillate.

Remember, you should have seen a packet drop much sooner; you don't because of these bloated buffers. At some point, TCP thinks the path may have changed, and goes searching for a potential higher operating point, and goes back to multiplicative increase.

So the buffers then rapidly fill and drop lots of packets, and you get these horrible bursts (that you can see in my traces), with periods of order 10 seconds.

I still don't understand all of what I see in those traces; I need to sit down with a real TCP expert. Why the pattern is a double peak, I don't understand.

The debloat-testing kernel tree

Posted Feb 26, 2011 5:08 UTC (Sat) by walken (subscriber, #7089) [Link] (8 responses)

> After reading about buffer bloat again I'm not sure the problem is in the buffers. It sure sounds like the problem is in TCP/IP.

In a sense you are right. The TCP/IP congestion control algorithms used on the internet today ("reno", "newreno", "bic", "cubic") are all based on detecting dropped packets. Alternatives have been proposed, such as "vegas", that are based on detecting increasing latency when buffers fill up.

The problem is that nobody has figured out yet how to mix drop based algorithms with delay based ones: if vegas and cubic streams share the same bottleneck link, vegas backs off as soon as the buffers fill up but cubic keeps going until packets start dropping. This leaves the vegas streams starved - no good. So delay based algorithms have never been able to take off, given the lack of a clean transition mechanism.

Technically it seems easier to fix the drop based algorithms by making sure you don't let buffers grow too much before dropping the first packets; but this is a fix that needs to be deployed in all routers rather than just in the endpoints.
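
A rough sketch of the two reactions (simplified window arithmetic and made-up thresholds, not the real Vegas or Reno/CUBIC code):

    def vegas_adjust(cwnd, base_rtt, current_rtt, alpha=2, beta=4):
        # Delay-based reaction (Vegas-flavoured): "diff" is roughly how many of
        # our packets are sitting in queues along the path.
        expected = cwnd / base_rtt          # packets/s if there were no queueing
        actual = cwnd / current_rtt         # packets/s actually achieved
        diff = (expected - actual) * base_rtt
        if diff < alpha:
            return cwnd + 1                 # path looks empty: probe upward
        if diff > beta:
            return cwnd - 1                 # queues are building: back off early
        return cwnd

    def reno_adjust(cwnd, packet_lost):
        # Loss-based reaction: nothing happens until a drop is actually seen.
        return cwnd / 2 if packet_lost else cwnd + 1

    # With a bloated buffer the RTT can triple before any packet is lost:
    print(vegas_adjust(cwnd=40, base_rtt=0.05, current_rtt=0.15))   # backs off (39)
    print(reno_adjust(cwnd=40, packet_lost=False))                  # keeps growing (41)

So the delay-based sender backs off while the loss-based one keeps pushing the queue deeper, which is exactly the starvation problem described above.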

The debloat-testing kernel tree

Posted Feb 26, 2011 22:53 UTC (Sat) by k3ninho (subscriber, #50375) [Link] (7 responses)

>The problem is that nobody has figured out yet how to mix drop based algorithms with delay based ones

It may be the wrong question. Sure, I get that noting dropped packets gives you an idea what's going on, but for the network links connected to your hardware -- the ones you can affect -- the rate at which your buffer changes tells you how congested you are.

I know it's Saturday night (in my timezone) and I'm speculating on a web forum, but why not inform the feedback control loop with the rate (or lowest recent rate) of outflow from your buffers, and throttle to match that outflow for a given delta-t time period after sampling?

You have the buffers, you don't need to lose packets already enqueued, and you use the most meaningful local measure of congestion without having to do a lot of calculations.

K3n.

The debloat-testing kernel tree

Posted Feb 26, 2011 23:15 UTC (Sat) by dlang (guest, #313) [Link] (6 responses)

the problem is that the buffers that are the biggest part of the problem are not on the equipment that you control.

The debloat-testing kernel tree

Posted Feb 27, 2011 3:46 UTC (Sun) by jg (guest, #17537) [Link]

Actually, much of the problem *is* on hardware you control.

Your host.
Your home router.
Your broadband gear (which may be your home router).

The problem is, you don't (with the possible exception of your host, if you are a Linux geek) typically have the ability to get this gear fixed.

Note that since many home routers run Linux, fixing it in Linux also means that (if you run an open source home router, such as OpenWRT), we might get much of the problem fixed quite quickly.

So please come help out at bufferbloat.net.

The debloat-testing kernel tree

Posted Feb 27, 2011 3:49 UTC (Sun) by jg (guest, #17537) [Link]

Much of it is on equipment you control.

On your host
On your home router
On your broadband gear

So if you come help fix it in Linux, and run an open source router such as OpenWRT, you can help get much of this fixed quickly...

The debloat-testing kernel tree

Posted Feb 28, 2011 9:24 UTC (Mon) by k3ninho (subscriber, #50375) [Link] (1 responses)

Then, my suggestion could still be implemented on the internet's routers on links I don't personally control, couldn't it?

(I note no technical objection. I also think you may have missed the context of my phrase "the network links connected to your hardware -- the ones you can affect": it is actually talking as if you are the piece of hardware, sending TCP/IP at any point within the internetwork, not just your laptop sharing wifi at a conference or up and down an asymmetric subscriber line.)

K3n.

The debloat-testing kernel tree

Posted Mar 1, 2011 3:54 UTC (Tue) by kevinm (guest, #69913) [Link]

The technical objection is that those routers out on the Internet aren't in control of the feedback control loop. The control is entirely in the sending endpoint - the routers are just a dumb source of feedback (by dropping or delaying packets).

The debloat-testing kernel tree

Posted Feb 28, 2011 10:36 UTC (Mon) by etienne (guest, #25256) [Link] (1 responses)

Ask for the available bandwidth via standard SNMP calls?

The debloat-testing kernel tree

Posted Feb 28, 2011 16:28 UTC (Mon) by jzbiciak (guest, #5246) [Link]

That seems rather unlikely to work in an end-to-end scenario. It also wouldn't respond to transient conditions unless you kept asking over and over. And what about trust issues? Do you trust the information you're getting?

The only thing you can truly rely on is the observed behavior of your own TCP streams, i.e. how quickly you get your ACKs back, whether or not you experience traffic loss, and so on.

If you're willing to give up on end-to-end architecture, then yeah, you could coordinate bandwidth-controlled "virtual circuits", but I thought the whole point of IP was to keep as much of the network as stateless as possible with respect to individual connections, letting the endpoints do most of the throttling and letting the network just focus on routing and providing feedback on the current conditions.

The debloat-testing kernel tree

Posted Feb 28, 2011 4:43 UTC (Mon) by butlerm (subscriber, #13312) [Link] (2 responses)

The right way to fix this problem is with ECN (explicit congestion notification), or something like ECN. Far better than dropping packets.

So if there is a problem with broken devices filtering out ECN packets, some sort of fallback strategy should be adopted so that everything can be ECN by default, with a moderate incentive for the defective devices to be replaced. Of course the network device queues still need to start ECN marking at relatively low queue lengths. It is just a lot easier to do that without problems if you are not dropping packets until absolutely necessary.

It is unfortunate that we have a bandwidth cult in our industry. Very low or no packet loss at low latency makes a much bigger difference for interactive applications than squeezing out the highest possible bandwidth.
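
The marking idea can be sketched in a few lines: once the queue passes a threshold, set the CE (Congestion Experienced) mark on ECN-capable packets instead of discarding them, and only fall back to dropping for non-ECN traffic. The threshold and packet representation are assumptions for illustration:

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Packet:
        ect: bool = False   # sender negotiated ECN (ECT codepoint set)
        ce: bool = False    # Congestion Experienced mark

    class EcnMarkingQueue:
        # Past mark_threshold, signal congestion by setting CE on ECN-capable
        # packets rather than dropping them.
        def __init__(self, mark_threshold=20, hard_limit=100):
            self.q = deque()
            self.mark_threshold = mark_threshold
            self.hard_limit = hard_limit

        def enqueue(self, pkt):
            if len(self.q) >= self.hard_limit:
                return False                  # out of room: must drop
            if len(self.q) >= self.mark_threshold:
                if pkt.ect:
                    pkt.ce = True             # tell the sender to slow down
                else:
                    return False              # legacy traffic: fall back to drop
            self.q.append(pkt)
            return True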

The debloat-testing kernel tree

Posted Feb 28, 2011 10:27 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

I've had ECN turned on for years, and a small transmit queue (10 packets) on my (wshaper-shaped) ADSL-router-bound network link. I still see oscillation with period ~1.5s, and VOIP in conjunction with large git pulls is slow and crackly (though barely tolerable). I am certain that my TCP/IP-clued ISP is not flipping ECN off.

So ECN is not a panacea. I suspect I am bufferbloated by the buffers in my (ISP-supplied) Zyxel ADSL router, but it appears impossible to inspect, let alone change, the size of those buffers. (There is no useful manual for this router's command line interface, so it is very hard to tell. Typical, really: hundreds of pages documenting the hopelessly uninformative web interface, five pages describing almost nothing about the command line. I could experiment, but if I get it wrong enough I could lock myself out of the router and lose my network connection forever. No way.)

(btw, when people figure out the fix for this, if it involves changes to traffic shaping, can someone package the whole thing up in something like the long-maintenance-dead wshaper? The whole traffic shaping stuff in Linux is woefully underdocumented: I tried reading lartc.org but it just left me more confused than when I started, and there appear to be no useful manpages yet, thanks Alexey. So I just leave wshaper to do its job and hope, even though bits of it appear to throw errors on recent kernels. Consider me a normal geek who isn't a TCP/IP specialist. If I can't do it, no non-geek is going to have a hope.)

The debloat-testing kernel tree

Posted Mar 1, 2011 5:26 UTC (Tue) by butlerm (subscriber, #13312) [Link]

I suspect the problem is with your DSL router. One of the challenges with ECN is that every bottleneck router must implement it, with reasonable queue sizes / mark thresholds, for it to work well. If the queue size or mark threshold is ridiculously large or high (especially for a WAN uplink), ECN isn't going to help very much even when the bottleneck router supports it.

Ideally, of course, individual packets would be tagged with the round trip latency, and routers would bin them into different queues and ECN mark accordingly. That would help work around the defective router problem. End hosts could do the marking, or even provide direct feedback to the transport layer, for example. No need to wait a whole round trip. And where the end hosts do it, no in-PDU data would be required. Just pass the estimated RTT to the network layer.

