The debloat-testing kernel tree
"The purpose of this tree is to provide a reasonably stable base for the development and testing of new algorithms, miscellaneous fixes, and maybe a few hacks intended to advance the cause of eliminating or at least mitigating bufferbloat in the Linux world." Current patches include the CHOKe packet scheduler, the SFB flow scheduler, some driver patches, and more.
Posted Feb 25, 2011 23:25 UTC (Fri)
by louie (guest, #3285)
[Link] (14 responses)
Posted Feb 25, 2011 23:33 UTC (Fri)
by corbet (editor, #1)
[Link] (3 responses)
No, not yet. Jim does explain it fairly well if you stick with him, so I wasn't sure how much we could improve on his stuff.
Posted Feb 25, 2011 23:53 UTC (Fri)
by mtaht (subscriber, #11087)
[Link] (2 responses)
Some more accessible materials:
http://nex-6.taht.net/posts/Did_Bufferbloat_Kill_My_Net_R...
http://digital-rag.com/article.php/Buffer-Bloat-Packet-Loss
http://www.enterprisenetworkingplanet.com/netsp/article.p...
but to well and truly "get bufferbloat" and its internet-wide scope, I'd recommend spending 25 minutes with the brief audio while looking at the slides here:
http://mirrors.bufferbloat.net/Talks/BellLabs01192011/
Posted Feb 26, 2011 0:00 UTC (Sat)
by mtaht (subscriber, #11087)
[Link] (1 responses)
http://lwn.net/Articles/419714/
http://lwn.net/Articles/421463/
quite a few more out there.
Posted Feb 27, 2011 8:03 UTC (Sun)
by faassen (guest, #1676)
[Link]
Posted Feb 25, 2011 23:35 UTC (Fri)
by mikesalib (guest, #17162)
[Link] (9 responses)
Memory is cheap, and adding more packet buffers to a networked system improves throughput a little. Since ISPs and hardware/software vendors focus on total throughput, they've got an incentive to add lots of packet buffering.
The problem, though, is that huge amounts of packet buffering prevent TCP from dealing with congestion appropriately by throttling connections: larger buffers mean that TCP has to wait much longer than it should to detect congestion. As with any feedback control loop, increasing the delay seriously harms the system. As a result, we end up with systems where total throughput might be high but latency under load becomes absurd. So you can't transfer a large file while running a VoIP call: the file transfer will get good throughput, but all the packets will see absurd latencies, so VoIP is no longer feasible. That's the real problem: we all need to be using latency-under-load in addition to total throughput as our metrics for network performance.
The problem is compounded by having excess buffers damn near everywhere. They're in your router, in your ISP's terminal equipment, in your NIC hardware and probably in a few places in the kernel.
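To put rough numbers on "latency under load" (the figures below are purely illustrative, not measurements of any particular device): the delay a full buffer adds is simply its size divided by the link rate.

    /* rough illustration: queueing delay added by a full buffer
     * (illustrative numbers only, not from any specific device) */
    #include <stdio.h>

    int main(void)
    {
        double buffer_bytes = 256 * 1024;   /* 256 KB of buffering, e.g. in a modem */
        double uplink_bps   = 1e6;          /* 1 Mbit/s uplink */

        double delay_s = (buffer_bytes * 8) / uplink_bps;
        printf("a full %.0f KB buffer at %.0f Mbit/s adds %.1f s of delay\n",
               buffer_bytes / 1024, uplink_bps / 1e6, delay_s);
        /* prints ~2.1 s: enough to make VoIP or interactive SSH unusable */
        return 0;
    }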
Posted Feb 25, 2011 23:53 UTC (Fri)
by jg (guest, #17537)
[Link] (8 responses)
So we've actually *increased* packet loss much of the time by trying to avoid packet loss, and in doing so we've made the latency of doing anything, when your connection is loaded, often extremely long.
This occurs to differing degrees in all of the OSes, in home routers, in broadband systems, in 3G networks, and in some ISPs' networks.
In Linux there are usually two such buffers currently; I have examples of several more, and I hypothesize one other. The two common disasters are the Linux transmit queue and the insanely large transmit/receive rings in most modern network hardware.
Posted Feb 26, 2011 0:30 UTC (Sat)
by zlynx (guest, #2285)
[Link] (7 responses)
Posted Feb 26, 2011 2:05 UTC (Sat)
by jg (guest, #17537)
[Link]
Memory has gotten so big/so cheap that people often use values much much larger than makes sense under any circumstances.
For example, I've heard of DSL hardware with > 6 seconds of buffering.
Take a look at the netalyzr plots in: http://gettys.wordpress.com/2010/12/06/whose-house-is-of-...
The diagonal lines are latency in *seconds*.
Posted Feb 26, 2011 2:15 UTC (Sat)
by jg (guest, #17537)
[Link] (5 responses)
Example 1: a gigabit Ethernet NIC.
Say you have supposedly sized your buffers "correctly" by the usual rule of thumb (bandwidth x delay x sqrt(#flows)), presuming a global-length path, your maximum link speed, and some number of flows.
Now you plug this gigabit NIC into your 100Mbps switch. Right off the bat, your system should be using 1/10th the number of buffers.
And you don't know how many flows.
So even if you did it "right", you have the *wrong answer* most of the time.
Example 2: 802.11n
Size your buffers for, say, 100Mbps over continental US delays, presuming some number of flows.
Now go to a conference with 802.11g and sit in a quiet corner. Your wireless might be running at a few megabits/second, but you are sharing the channel with 50 other people.
Your "right answer" for buffering can easily be off by 2-3 *orders of magnitude*. At that low speed, it can take a *very long time* for your packets to finally get transmitted.
***There is no single right answer for the amount of buffering in most network environments.***
Right now, our systems' buffers are typically sized for the maximum amount of buffering they might ever need, even though we seldom operate them in that regime (if the buffer sizes were thought about by the engineers involved).
So the buffers aren't just oversized, they are downright bloated.
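A quick back-of-the-envelope sketch of that point (made-up numbers, and with the flow-count term of the rule of thumb left out): size a buffer for a gigabit link over a ~100 ms path, then look at what the link actually needs at the rates it really runs at.

    /* illustrative only: how far off a "correctly" pre-sized buffer can be
     * when the link actually operates at a different rate */
    #include <stdio.h>

    static double bdp_packets(double rate_bps, double rtt_s)
    {
        return rate_bps * rtt_s / (1500.0 * 8);   /* assume 1500-byte packets */
    }

    int main(void)
    {
        double rtt = 0.1;                          /* ~100 ms "continental" RTT   */
        double sized_for = bdp_packets(1e9, rtt);  /* buffers sized for 1 Gbit/s  */
        double rates[] = { 1e9, 100e6, 3e6 };      /* GigE, 100 Mbit/s switch,
                                                      busy 802.11g share          */

        for (int i = 0; i < 3; i++) {
            double need = bdp_packets(rates[i], rtt);
            printf("%6.0f Mbit/s: need ~%6.0f packets, sized for %.0f (off by %.0fx)\n",
                   rates[i] / 1e6, need, sized_for, sized_for / need);
        }
        return 0;
    }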
Posted Feb 26, 2011 12:20 UTC (Sat)
by hmh (subscriber, #3838)
[Link] (2 responses)
The queue should be able to grow large, but only for flows where the bandwidth-delay product requires it. And it should early-drop.
And the driver DMA ring-buffer size really should be considered part of the queue for any calculations, although you probably have to consider that part of the queue a "done deal" and not drop/reorder anything there. Otherwise, you can get even fast-ethernet to feel like a very badly behaved LFN (long fat network). However, reducing DMA ring-buffer size can have several drawbacks on high-throughput hosts.
Using latency-aware, priority-aware AQM (even if it is not flow-aware) should fix the worst issues, without downgrading throughput on bursty links or long fat networks. Teaching it about the hardware buffers would let it autotune better.
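As a purely illustrative sketch of what "latency-aware" early drop could look like at enqueue time (this is not CoDel, RED, SFB, or any real kernel code; the structure, names, and the 20 ms target are invented):

    /* toy sketch of latency-aware early drop; struct, names and the
     * 20 ms target are invented purely for illustration */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct toy_queue {
        uint64_t backlog_bytes;    /* bytes already queued (incl. DMA ring) */
        uint64_t link_rate_bps;    /* current, possibly varying, link rate  */
    };

    /* time a newly queued packet would wait, in microseconds */
    static uint64_t queue_delay_us(const struct toy_queue *q)
    {
        if (q->link_rate_bps == 0)
            return UINT64_MAX;
        return q->backlog_bytes * 8 * 1000000ULL / q->link_rate_bps;
    }

    /* drop (or ECN-mark) early once the standing queue exceeds a latency
     * target, rather than waiting for the buffer to be completely full */
    static bool should_early_drop(const struct toy_queue *q)
    {
        const uint64_t target_us = 20000;   /* 20 ms, arbitrary */
        return queue_delay_us(q) > target_us;
    }

    int main(void)
    {
        struct toy_queue q = { .backlog_bytes = 64 * 1024,
                               .link_rate_bps = 3 * 1000 * 1000 };
        printf("delay %llu us, early drop: %d\n",
               (unsigned long long)queue_delay_us(&q), should_early_drop(&q));
        return 0;
    }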
Posted Feb 26, 2011 13:37 UTC (Sat)
by jg (guest, #17537)
[Link] (1 responses)
Which AQM algorithm is a different question.
Van Jacobson says that RED is fundamentally broken, and has no hope of working in the environments we have to operate in. And Van was one of the inventors of RED...
SFB may or may not hack it. Van has an algorithm he is finishing writing up that he thinks may work; hopefully it will be available soon. We have a fundamentally interesting problem here. And testing this is going to be much more work than implementing it, by orders of magnitude.
It isn't clear the AQM needs to be priority aware; wherever the queues are building, you are more likely to choose a packet to drop (literally drop, or ECN mark) there just by running an algorithm across all the queues. I haven't seen arguments that make me believe the AQM must be per-queue (that doesn't mean there aren't any! just that I haven't seen them).
And there are good reasons why the choice of packet to drop should have randomness in it; time-based congestion can occur if you don't. Different packet types also have different leverage to them (ACKs vs. data vs. SYNs, etc.).
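For illustration only, the general shape of such a randomized decision might look like the toy function below (this is not RED, and not Van Jacobson's unpublished algorithm; the thresholds are arbitrary):

    /* toy randomized drop/mark decision: probability ramps with the
     * standing queue delay; thresholds are arbitrary, illustration only */
    #include <stdbool.h>
    #include <stdlib.h>

    bool random_early_decision(double queue_delay_ms)
    {
        const double min_ms = 5.0, max_ms = 100.0;
        double p;

        if (queue_delay_ms <= min_ms)
            return false;              /* short queue: leave it alone     */
        if (queue_delay_ms >= max_ms)
            return true;               /* huge queue: always drop or mark */

        /* linear ramp in between; the random draw desynchronizes flows so
         * they don't all back off (and later ramp up) at the same moment */
        p = (queue_delay_ms - min_ms) / (max_ms - min_ms);
        return (double)rand() / RAND_MAX < p;
    }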
Posted Feb 26, 2011 16:14 UTC (Sat)
by hmh (subscriber, #3838)
[Link]
The Diffserv model got it right, in the sense that even on a simple host there are flows for which you do NOT want to drop packets if you can help it (DNS, NTP), and that there is naturally a hierarchy of priorities: some services you'd rather have suffer more packet drops than others during congestion.
I've also found that "socializing" the available bandwidth among flows of the same class is a damn convenient thing (SFQ). SFB does this well, AFAIK.
So, I'd say that what we should aim for on hosts is an auto-tuned, flow-aware AQM that at least pays attention to the bare minimum of priority ordering (802.1p/DSCP class selectors) and does a good job of keeping latency under control without killing throughput on high bandwidth-delay-product flows. Such a beast could be enabled by default on a distro [for desktops] with little fear.
This doesn't mean you need multiple queues. However, you will want multiple queues in many cases because that's how hardware-assisted QoS works, such as what you find on any 802.11n device or non-el-cheap-o gigabit ethernet NIC.
Routers are a different deal altogether.
Posted Feb 26, 2011 17:58 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (1 responses)
Any delay in the routers adds to the overall delay and thus adds to the bandwidth-delay product. In essence, your endpoint's memory usage is the sum of the memory used by all the routers in between. Packets spend more time sitting in memory than they do in flight.
There was a paper somewhere about buffers and streams: the more streams you have, the fewer buffers you need. So your endpoints need big buffers, your modem smaller buffers, and internet routers practically none.
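The result being alluded to is probably the "sizing router buffers" work, which argues that a link carrying n desynchronized flows needs only about bandwidth x RTT / sqrt(n) of buffering. A quick calculation (made-up link parameters) shows how fast that shrinks:

    /* quick illustration of the C*RTT/sqrt(n) buffer-sizing result:
     * the more flows share a link, the less buffering it needs.
     * (compile with -lm) */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double rate_bps = 10e9;          /* 10 Gbit/s core link */
        double rtt_s    = 0.1;           /* 100 ms RTT          */
        int    flows[]  = { 1, 100, 10000 };

        for (int i = 0; i < 3; i++) {
            double bytes = rate_bps * rtt_s / 8.0 / sqrt((double)flows[i]);
            printf("%6d flows: ~%.2f MB of buffer\n", flows[i], bytes / 1e6);
        }
        return 0;   /* 125 MB, 12.5 MB, 1.25 MB respectively */
    }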
Posted Feb 27, 2011 16:33 UTC (Sun)
by mtaht (subscriber, #11087)
[Link]
http://www.cs.clemson.edu/~jmarty/papers/PID1154937.pdf
Wireless is not part of this study and has special problems (retries)...
Posted Feb 26, 2011 0:36 UTC (Sat)
by zlynx (guest, #2285)
[Link] (12 responses)
As the buffers start to fill up, the ACK timestamps should get further and further behind. This should tell the sender to slow down even if it is still getting ACKs for every sent packet.
Why doesn't this happen?
Posted Feb 26, 2011 1:43 UTC (Sat)
by drag (guest, #31333)
[Link]
Probably because TCP/IP does not assume that latency is a good measurement of available bandwidth. Over a high-latency link it's not going to be a reliable indicator of anything. Packets can take different paths, arrive out of order, and all sorts of fun stuff like that.
The fact that latency may change randomly may just mean that you're running on a different path over the internet, or a dozen other things. If you're not losing packets, that should be a safe indication that you can continue to send at the current rate*.
* That is unless your router has a gigantic bit bucket and can cache everything, of course. Maybe Linux should have a dial on it that says 'Ok you have hit 100% network bandwidth, now dial it back 20% to avoid potential issues if you see latency changes' or something like that.
Posted Feb 26, 2011 2:57 UTC (Sat)
by david.a.wheeler (subscriber, #72896)
[Link]
Posted Feb 26, 2011 3:12 UTC (Sat)
by jg (guest, #17537)
[Link]
So you end up using 1 more buffer per RTT. The RTT does continue to increase, and increase, until you've filled the buffer at the bottleneck (in each direction). You can see this in the TCP traces I took.
But having added soooo much latency (many times the natural RTT), the servo system that is TCP can no longer be stable. Bad Things Happen (TM). No servo system can be stable with those delays; things oscillate.
Remember, you should have seen a packet drop much sooner; you don't because of these bloated buffers. At some point, TCP thinks the path may have changed, and goes searching for a potential higher operating point, and goes back to multiplicative increase.
So the buffers then rapidly fill and drop lots of packets, and you get these horrible bursts (that you can see in my traces), with periods of order 10 seconds.
I still don't understand all of what I see in those traces; I need to sit down with a real TCP expert. Why the pattern is a double peak, I don't understand.
Posted Feb 26, 2011 5:08 UTC (Sat)
by walken (subscriber, #7089)
[Link] (8 responses)
In a sense you are right. The TCP/IP congestion control algorithms used on the internet today ("reno", "newreno", "bic", "cubic") are all based on detecting dropped packets. Alternatives have been proposed, such as "vegas", that are based on detecting increasing latency when buffers fill up.
The problem is that nobody has figured out yet how to mix drop based algorithms with delay based ones: if vegas and cubic streams share the same connection, vegas backs off as soon as the buffers fill up but cubic keeps going until packets start dropping. This leaves the vegas streams starved - no good. So delay based algorithms have never been able to take off, given the lack of a clean transition mechanism.
Technically it seems easier to fix the drop-based algorithms by making sure you don't let the buffer grow too much before dropping the first packets; but this is a fix that needs to be deployed in all routers rather than just in the endpoints.
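For what it's worth, Linux already lets an application choose its congestion control algorithm per socket via the TCP_CONGESTION socket option, assuming the corresponding module (e.g. tcp_vegas) is available on the machine; roughly:

    /* select a congestion control algorithm for one socket (Linux);
     * the named module (e.g. tcp_vegas) must be loaded/available */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        const char *cc = "vegas";         /* or "cubic", "reno", ... */

        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("TCP_CONGESTION");     /* fails if the module isn't available */

        char buf[16];
        socklen_t len = sizeof(buf);
        if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
            printf("congestion control: %.*s\n", (int)len, buf);
        return 0;
    }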
Posted Feb 26, 2011 22:53 UTC (Sat)
by k3ninho (subscriber, #50375)
[Link] (7 responses)
It may be the wrong question. Sure, I get that noting dropped packets gives you an idea of what's going on, but for the network links connected to your hardware -- the ones you can affect -- the rate at which your buffer changes tells you how congested you are.
I know it's Saturday night (in my timezone) and I'm speculating on a web forum, but why not inform the feedback control loop with the rate (or lowest recent rate) of outflow from your buffers, and throttle to match that outflow for a given delta-t time period after sampling?
You have the buffers, you don't need to lose packets that are already enqueued, and you use the most meaningful local measure of congestion without having to do a lot of calculations.
K3n.
Posted Feb 26, 2011 23:15 UTC (Sat)
by dlang (guest, #313)
[Link] (6 responses)
Posted Feb 27, 2011 3:46 UTC (Sun)
by jg (guest, #17537)
[Link]
Your host.
Your home router.
Your broadband gear (which may be your home router).
The problem is, you don't (with the possible exception of your host, if you are a Linux geek) typically have the ability to get this gear fixed.
Note that since many home routers run Linux, fixing it in Linux also means that, if you run an open source home router such as OpenWRT, we might get much of the problem fixed quite quickly.
So please come help out at bufferbloat.net.
Posted Feb 27, 2011 3:49 UTC (Sun)
by jg (guest, #17537)
[Link]
On your host
On your home router
On your broadband gear
So if you come help fix it in Linux, and run an open source router such as OpenWRT, you can help get much of this fixed quickly...
Posted Feb 28, 2011 9:24 UTC (Mon)
by k3ninho (subscriber, #50375)
[Link] (1 responses)
(I note no technical objection. I also think you may have missed the context of my phrase "the network links connected to your hardware -- the ones you can affect": it is actually talking as if you are the piece of hardware, sending TCP/IP at any point within the internetwork, not just your laptop sharing wifi at a conference or up and down an asymmetric subscriber line.)
K3n.
Posted Mar 1, 2011 3:54 UTC (Tue)
by kevinm (guest, #69913)
[Link]
Posted Feb 28, 2011 10:36 UTC (Mon)
by etienne (guest, #25256)
[Link] (1 responses)
Posted Feb 28, 2011 16:28 UTC (Mon)
by jzbiciak (guest, #5246)
[Link]
The only thing you can truly rely on is the observed behavior of your own TCP streams -- i.e., how quickly you get your ACKs back, whether or not you experience traffic loss, and so on.
If you're willing to give up on the end-to-end architecture, then yeah, you could coordinate bandwidth-controlled "virtual circuits", but I thought the whole point of IP was to keep as much of the network as stateless as possible with respect to individual connections, letting the endpoints do most of the throttling and letting the network just focus on routing and providing feedback on the current conditions.
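Linux does expose that observed behavior to applications: the kernel's smoothed RTT estimate, congestion window, and retransmit counters for a connected socket can be read with TCP_INFO, along these lines (error handling trimmed):

    /* read the kernel's view of one TCP connection: smoothed RTT,
     * retransmits, congestion window (Linux, TCP_INFO) */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>

    void print_tcp_stats(int connected_fd)
    {
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(connected_fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
            printf("srtt %u us, rttvar %u us, cwnd %u, retrans %u\n",
                   ti.tcpi_rtt, ti.tcpi_rttvar,
                   ti.tcpi_snd_cwnd, ti.tcpi_total_retrans);
    }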
Posted Feb 28, 2011 4:43 UTC (Mon)
by butlerm (subscriber, #13312)
[Link] (2 responses)
The right way to fix this problem is with ECN (explicit congestion notification), or something like ECN. Far better than dropping packets.
So if there is a problem with broken devices filtering out ECN packets, some sort of fallback strategy should be adopted so that everything can be ECN by default, giving a moderate incentive for the defective devices to be replaced. Of course the network device queues still need to start ECN marking at relatively low queue lengths. It is just a lot easier to do that without problems if you are not dropping packets until absolutely necessary.
It is unfortunate that we have a bandwidth cult in our industry. Very low or no packet loss at low latency makes a much bigger difference for interactive applications than squeezing out the highest possible bandwidth.
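On the Linux TCP side, ECN negotiation is controlled by the net.ipv4.tcp_ecn sysctl (roughly: 0 = off, 1 = request and accept, 2 = accept only when the peer requests it; the default has varied between kernel versions). A trivial check from C, for anyone who would rather not shell out to sysctl(8); writing the value requires root:

    /* check the TCP ECN setting via procfs (Linux);
     * roughly: 0 = off, 1 = request + accept, 2 = accept when requested */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_ecn", "r");
        int val;

        if (f == NULL || fscanf(f, "%d", &val) != 1) {
            perror("net.ipv4.tcp_ecn");
            return 1;
        }
        printf("net.ipv4.tcp_ecn = %d\n", val);
        fclose(f);
        /* to turn it on (as root): sysctl -w net.ipv4.tcp_ecn=1 */
        return 0;
    }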
Posted Feb 28, 2011 10:27 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
So ECN is not a panacea. I suspect I am bufferbloated by the buffers in my (ISP-supplied) zyxel ADSL router, but it appears impossible to inspect, let alone change, the size of those buffers. (There is no useful manual for this router's command line interface, so it is very hard to tell. Typical, really: hundreds of pages documenting the hopelessly uninformative web interface, five pages describing almost nothing about the command line. I could experiment, only if I get it wrong enough I could lock myself out of the router and lose my network connection forever. No way.)
(btw, when people figure out the fix for this, if it involves changes to traffic shaping, can someone package the whole thing up in something like the long-maintenance-dead wshaper? The whole traffic shaping stuff in Linux is woefully underdocumented: I tried reading lartc.org but it just left me more confused than when I started, and there appear to be no useful manpages yet, thanks Alexey. So I just leave wshaper to do its job and hope, even though bits of it appear to throw errors on recent kernels. Consider me a normal geek who isn't a TCP/IP specialist. If I can't do it, no non-geek is going to have a hope.)
Posted Mar 1, 2011 5:26 UTC (Tue)
by butlerm (subscriber, #13312)
[Link]
I suspect the problem is with your DSL router. One of the challenges with ECN is that every bottleneck router must implement it, with reasonable queue sizes / mark thresholds, for it to work well. If the queue size or mark threshold is ridiculously large or high (especially for a WAN uplink), ECN isn't going to help very much even when the bottleneck router supports it.
Ideally, of course, individual packets would be tagged with the round trip latency, and routers would bin them into different queues and ECN-mark accordingly. That would help work around the defective-router problem. End hosts could do the marking, or even provide direct feedback to the transport layer, for example; no need to wait a whole round trip. And where the end hosts do it, no in-PDU data would be required: just pass the estimated RTT to the network layer.