Gettys: Whose house is of glasse, must not throw stones at another
"The buffers are confusing TCP's RTT estimator; the delay caused by the buffers is many times the actual RTT on the path. Remember, TCP is a servo system, which is constantly trying to "fill" the pipe. So by not signalling congestion in a timely fashion, there is *no possible way* that TCP's algorithms can determine the correct bandwidth at which to send data (it needs to compute the delay/bandwidth product, and the delay becomes hideously large). TCP sends data a bit faster (the usual slow start rules apply), reestimates the RTT from that, and sends data faster still. Of course, this means that even in slow start, TCP ends up trying to run too fast. Therefore the buffers fill (and the latency rises). Note the actual RTT on the path of this trace is 10 milliseconds; TCP's RTT estimator is misled by more than a factor of 100. It takes 10-20 seconds for TCP to get completely confused by the buffering in my modem; but there is no way back."
Posted Dec 7, 2010 4:03 UTC (Tue)
by jg (guest, #17537)
[Link]
See: http://gettys.wordpress.com/2010/11/29/home-router-puzzle... and http://gettys.wordpress.com/2010/12/02/home-router-puzzle... for details; Linux is more guilty than most here, though Mac OSX and Windows also suffer.
Jim
Posted Dec 7, 2010 5:19 UTC (Tue)
by drag (guest, #31333)
[Link] (18 responses)
It's really really cool beyond belief that you (Gettys) figured it out.
I mean seriously amazing stuff. No question about it.
In terms of technical insight and investigative ability this was a HUGE hit out of the ballpark. Way out. You not only got a home run, you hit it over the stands, past the parking lot, and it's bouncing over the highway as we speak.
Internet history in the making. No question about it at all.
You deserve the gratitude of, well, anybody with high-speed access to the internet.
Words escape me. All I can do is shake my head in awe. Completely awesome.
Kudos.
Now we just need to get the word out.
Posted Dec 7, 2010 7:55 UTC (Tue)
by paulj (subscriber, #341)
[Link]
Posted Dec 7, 2010 8:29 UTC (Tue)
by PO8 (guest, #41661)
[Link] (11 responses)
Reading the fine article is probably helpful here. I'm guessing that even if he thought he was discovering something new, one of these people was able to fill him in.
Posted Dec 7, 2010 10:45 UTC (Tue)
by jg (guest, #17537)
[Link] (10 responses)
It was the severity and widespread nature of the bufferbloat problem that was a monumental surprise. I misdiagnosed the root cause initially. And then pulling on the thread led far afield from broadband, connecting with Dave Reed's discoveries about 3G, the Linux-specific txqueuelen disaster, and home routers.
We have both many of the causes and some of the solutions to the bufferbloat problem already present in Linux, and can move faster than most in the industry. Will we rise to the occasion? Only time will tell.
Posted Dec 7, 2010 11:33 UTC (Tue)
by paulj (subscriber, #341)
[Link]
Have you looked at the various academic papers on buffer sizing and, in particular, those looking at effects/benefits of "tiny buffers"? I presume you have...
Posted Dec 8, 2010 16:15 UTC (Wed)
by fuhchee (guest, #40059)
[Link] (8 responses)
For what it's worth, I experimented briefly with forcing the txqueuelen way down from 1000 to a range like 5-100 on some NFS client/server boxes running 2.6.34ish. Any combination of nfs3/nfs4 tcp/udp resulted in NFS filesystem hangs.
Posted Dec 9, 2010 12:18 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Posted Dec 10, 2010 4:10 UTC (Fri)
by jg (guest, #17537)
[Link]
And the total amount of buffering that is appropriate also depends on the workload.
And also the link "goodput".
For some hardware (and workloads) you'll be able to set txqueuelen and the rings way down; for others, not (they may already be very small).
On the transmit side, you really want pretty minimal buffering; but as usual, zero is the wrong answer ;-).
So on this laptop, I've had no problems setting my ethernet interface ring to 64 (the lowest it will go) and txqueuelen to zero. Note that on this laptop, this means I actually have 64 packets of buffering present, at minimum.
But on some other older hardware I've played with, setting txqueuelen to zero is a recipe for disaster, as the hardware has essentially no buffering.
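For reference, these are the two knobs in question (a minimal sketch; the device name and values are illustrative, and as just noted, the safe minimum varies by hardware):

# Shrink the NIC driver's TX ring to its minimum (64 on this particular NIC)
ethtool -G eth0 tx 64
# Remove the qdisc transmit queue stacked on top of the ring
# (fine here; a disaster on hardware with no buffering of its own)
ip link set dev eth0 txqueuelen 0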
The fundamental issue is *there is no single number* you can pick which will be right for everyone for buffering. And in the quest for single stream TCP performance over very high speed networks, we've happened to set our current number to values that are really insanely high for many situations. The challenge going forward, I believe, is making the system smart enough to actually figure out how much buffering is appropriate.
More about this in tomorrow's blog installment.
Posted Dec 10, 2010 8:34 UTC (Fri)
by Yenya (subscriber, #52846)
[Link] (5 responses)
When the buffers on the switch are too small, one or more fragments of every fragmented UDP packet get lost. I would guess your hangs with a txqueuelen of 5 have the same cause. For values of about 100, however, I don't think it can be the issue.
Posted Dec 10, 2010 19:56 UTC (Fri)
by giraffedata (guest, #1954)
[Link] (3 responses)
Why?
Posted Dec 11, 2010 8:43 UTC (Sat)
by dlang (guest, #313)
[Link] (2 responses)
I ran into a similar problem years ago with the heartbeat packets of a high-availability package. With a few machines there were no problems, but with lots of machines the messages got large enough to require several UDP packets per message, and on a busy network the odds that some packet needed for a message would be lost became statistically significant; the entire cluster became unreliable.
Posted Dec 11, 2010 19:26 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (1 responses)
Oh, I see. The original description made it sound like the packet gets dropped because of the fragmentation, and unconditionally. The problem is due to the 8K retransmission unit size, not the fragmentation per se.
I had a problem once (it developed one day, stayed for a few days, and then went away -- something out in the Internet changed) where a VPN link would not conduct a packet larger than something like 1220 bytes at all. So small ssh shell conversations worked fine, but try to ls a page full of files: 100% failure. I lowered the MTU size on the "tun" device and all was well. I guess that was something different from what we're talking about now.
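For anyone hitting the same thing, the fix really is one command; the device name and value here are illustrative:

# Clamp the tunnel MTU below whatever the path will actually conduct
ip link set dev tun0 mtu 1200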
Posted Dec 14, 2010 9:38 UTC (Tue)
by Yenya (subscriber, #52846)
[Link]
Posted Dec 11, 2010 22:31 UTC (Sat)
by paulj (subscriber, #341)
[Link]
Posted Dec 7, 2010 10:58 UTC (Tue)
by jg (guest, #17537)
[Link] (1 responses)
Many of the puzzle pieces were handed to me (unassembled) by Comcast.
As always, we are on the shoulders of other giants: the area of congestion avoidance was explored with a depth of understanding I admire deeply by the likes of Van Jacobson, Sally Floyd, and many, many others. If I've done anything important here, it has been recognizing that the problem is occurring in other parts of the end-to-end system than conventional internet core routers, where it was pretty fully explored in the 1980s and 1990s.
And chance is very important: aiding me was knowing some of the players here, so that when I smelled smoke, they could diagnose the fire, giving me the confidence to dig deeper and look more widely. So in part, it's being in a particular place at a particular time.
Posted Dec 7, 2010 12:53 UTC (Tue)
by drag (guest, #31333)
[Link]
No. You're cool.
> Many of the puzzle pieces were handed to me (unassembled) by Comcast.
Yay Comcast.
> As always, we are on the shoulders of other giants: the area of congestion avoidance was explored with a depth of understanding I admire deeply by the likes of Van Jacobson, Sally Floyd, and many, many others. If I've done anything important here, it has been recognizing that the problem is occurring in other parts of the end-to-end system than conventional internet core routers, where it was pretty fully explored in the 1980s and 1990s.
Plus you described it in a useful way, in an entertaining format packed full of information on how individuals can diagnose and take some steps to mitigate the problem on their own.
Trust me, stuff like this matters. :)
Posted Dec 7, 2010 18:03 UTC (Tue)
by chad.netzer (subscriber, #4257)
[Link] (2 responses)
Posted Dec 8, 2010 1:10 UTC (Wed)
by drag (guest, #31333)
[Link]
Posted Dec 8, 2010 1:18 UTC (Wed)
by rahvin (guest, #16953)
[Link]
Posted Dec 7, 2010 7:39 UTC (Tue)
by gmaxwell (guest, #30048)
[Link] (3 responses)
So not only is there not much of an incentive for things like active queue management, some providers are strongly discouraged by their customers from behaving reasonably.
Not likely a factor in the sort of residential broadband networks that were being looked at here, but it does partially explain some common industry practices.
Posted Dec 7, 2010 15:26 UTC (Tue)
by foom (subscriber, #14868)
[Link] (1 responses)
Just to go through some numbers: you have a local network, with an RTT of < 1ms. Your remote app generally responds in < 10ms.
If you drop a SYN packet, the retransmit delay if you don't get a SYNACK is 3 *seconds*. You've now blown your expected response time by 300x.
If you drop the last packet in a request or response, the only signal of the loss is that the other side never sends an ack. The time to wait for an ack is supposed to be informed by the RTT estimator, except that the *minimum* (RTO_MIN) is pegged to 200ms in Linux, so you have a 200ms delay in the request/response, even on a 1ms RTT network! You've now blown your expected response time by 20x.
Note that both those delays are impossibly large compared to the expected RTT. With those kinds of numbers, it's a Really Really Bad Thing to drop packets...it's no wonder people decide to buffer them for a few ms instead!
The only time there's no issue with dropping packets is when it's in the middle of a large data transfer stream: SACK will take care of that, and the packet will be retransmitted with no delay to the overall transfer.
There is some research about reducing (or eliminating) RTO_MIN -- let the retransmit time be informed purely by the RTT estimator. But one issue is that delayed-ack is a 40ms timeout (itself huge!), which means that reducing the RTO_MIN below that value will cause many packets to be sent twice (only twice, because an immediate ack is sent after the second transmission of the same packet is received). And, RTO_MIN is controlled by the sender, while delayed-ack is controlled by the receiver.
Sigh.
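For what it's worth, Linux does expose RTO_MIN as a per-route option in iproute2, so on a controlled low-RTT network you can at least experiment (the route and value below are illustrative):

# Lower the minimum retransmission timeout for one route only
ip route change 10.0.0.0/24 dev eth0 rto_min 20ms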
Posted Dec 7, 2010 19:23 UTC (Tue)
by dlang (guest, #313)
[Link]
Posted Dec 7, 2010 18:44 UTC (Tue)
by jd (guest, #26381)
[Link]
Dropping packets can be good - hence schemes developed specifically to drop packets, e.g. RED, BLUE, BLACK, GREEN, PURPLE, WHITE. (Does anyone notice a pattern here?) If two packets would otherwise collide, dropping one rather than losing both will improve overall throughput.
However, the problem is to drop the right packets. Not all packets are equal. As far as I can tell, packet-dropping schemes don't attempt to introduce anything but the simplest of bias, which may indeed be all you can do. If the analysis takes longer than the resend, you don't gain from the cleverer bias.
There are, of course, other mechanisms. ECN handles throttling, for example.
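Linux's RED qdisc happens to implement both of these ideas: probabilistic early drop, plus ECN marking instead of dropping for ECN-capable flows. A rough sketch, with every number an illustrative guess that would have to be sized to the real link:

# Early-drop/ECN-mark queue for a nominal 1Mbit/s uplink
tc qdisc add dev eth0 root red limit 60000 min 5000 max 15000 \
    avpkt 1000 burst 10 bandwidth 1mbit probability 0.02 ecn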
As far as over-large buffers are concerned, it is important to distinguish between two different methodologies - those buffers that try and buffer everything and those which are sub-divided somehow (one per conversation, one per classification, etc).
It's also important to distinguish between various buffering schemes (flush on fill, continuous drain, real-time, etc). If a buffer is polled regularly (so it acts as more of a queue than a classical buffer), or where there are certain guarantees as to the minimum and maximum time a packet can be held without either being sent or being dropped, then you've a very different behaviour than when the buffer has to be entirely filled and then dumped in one go.
As far as Linux is concerned, it has ample code for QoS and windowing controls. There's also the Web100 code for adding various tweaks. There is absolutely ZERO excuse for any distribution shipping a sub-standard kernel. If any laws are to be passed regarding Internet traffic, I want it to be illegal to ship incompetently-configured networking software, with a mandatory minimum of 100 hours meta-moderating Slashdot.
Posted Dec 7, 2010 7:55 UTC (Tue)
by jmm82 (guest, #59425)
[Link]
One way to trick TCP into filling the pipe is to use multiple TCP connections for one transfer. I have used custom FTP applications that do so, and they help fill your pipe. The same can be experienced using bittorrent. This is OK as long as you are not sharing the connection with other applications.
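The effect is easy to reproduce with nothing fancier than a shell (the URL is a placeholder):

# Four parallel streams together grab a much larger share of the
# bottleneck than a single TCP connection would
for i in 1 2 3 4; do
    wget -q -O /dev/null "http://example.com/bigfile" &
done
wait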
Posted Dec 7, 2010 11:28 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (8 responses)
ethtool -s eth0 advertise 0x002
(this forces the link down to 10baseT-Full, making your own NIC the bottleneck)
Now try to upload a big file and watch the latency go through the roof. The explanation is just here:
ethtool -g eth0
Just (re-)tested with tg3 2.6.35.9-64.fc14.
Already reported 6 years ago:
http://oss.sgi.com/archives/netdev/2004-05/msg00180.html
Posted Dec 7, 2010 11:45 UTC (Tue)
by marcH (subscriber, #57642)
[Link]
Conclusion: "We believe that the buffers in backbone routers are much
Another good starting point into this old and really vast literature:
Posted Dec 7, 2010 19:16 UTC (Tue)
by jmm82 (guest, #59425)
[Link]
This is often hidden on many computers because the Ethernet on the machine is not the bottleneck, and therefore the traffic is queued elsewhere.
Posted Dec 8, 2010 15:08 UTC (Wed)
by jg (guest, #17537)
[Link] (2 responses)
You don't see bufferbloat on Windows at 100Mbps, as they have deliberately tuned their TCP to never exceed about 85Mbps (XP, Vista, and Win 7). So in usual use on the most common switch still in use, Windows won't exhibit bufferbloat: since the ethernet bandwidth is higher than what the stack will send, the bottleneck moves somewhere else.
I suspect that Microsoft ran into bufferbloat, but didn't fully understand it. I'd love to talk to the right engineers and find out. Bufferbloat (from comments in one of their tech notes) appears to have played havoc with the responsiveness of media playing, and so they rate-limited their network stack (by default). If you put much latency into the loop, the media player/media server control loop won't function well. Since usually in other situations the bottleneck is in some other device (home router or broadband device), they exonerated Windows and moved on.
Or maybe they were brighter than we are; but then I think they would have gone for a more general fix. The engineers/scientists there are really good, and they now have a first-rate TCP stack (not true in XP days). So I doubt they would have stopped there if they had analyzed the situation correctly.
Posted Dec 9, 2010 16:44 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link] (1 responses)
Posted Dec 12, 2010 20:26 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
Wow, now this is really serious bufferbloat. Infinity is going to be hard to beat.
Posted Dec 11, 2010 6:34 UTC (Sat)
by njs (subscriber, #40338)
[Link] (1 responses)
But, now that I actually understand what's going on, it turns out that for my situation a one *character* kernel change suffices! -- http://thread.gmane.org/gmane.linux.kernel.wireless.gener...
Man, this has been frustrating me for *years*. I hope it gets fixed for real soon!
Posted Dec 12, 2010 9:09 UTC (Sun)
by paulj (subscriber, #341)
[Link]
Posted Dec 18, 2010 1:35 UTC (Sat)
by clemenstimpler (guest, #71914)
[Link]
The basic problem was known in 1999:
http://lkml.org/lkml/1999/4/27/50
Posted Dec 7, 2010 12:48 UTC (Tue)
by werth1 (guest, #48435)
[Link] (18 responses)
For a low RTT you have to avoid all other unthrottled operations on the network that can fill your buffers. That includes all those little task-bar apps calling home for updates, P2P VoIP like Skype, your web browser refreshing flash advertisements, your email client checking for news, other unthrottled computers sharing the network, and viruses sending spam. The latter is a good candidate when it should work but doesn't.
Most P2P software allows client-side throttling, which works around the failure of the TCP/IP throttling. What also works is to deploy a client-side traffic shaper that throttles the bandwidth. Essentially you are shifting the bottleneck out of the modem, where it fills the large buffers, into the clients, where you have better control over the buffer.
What is new and exciting is the extent of the analysis, and the discovery that this actually increases the number of dropped packets for large file transfers by a few percent.
Posted Dec 7, 2010 14:28 UTC (Tue)
by pizza (subscriber, #46)
[Link] (9 responses)
Deep buffers means that the medium is always utilized as much as possible, which gives better throughput numbers at the expense of latency. But that latency doesn't really crop up until you try to use multiple, differing data streams.
My company's clients didn't like it when I pointed this out to them, but to quote a certain engineer, "I kinna change the laws of physics!"
Don't get me wrong, I'm glad to see this bit of sleuthing presented with such visibility, but... it seems to me like one of those dirty open secrets that everyone knew about but nobody completely thought through.
I'm not sure what we can really do about this though.. ("Oh no, this new kernel is slower than the old one, it sucks") ...at least until the PowersThatBe are made to care about multi-stream performance.
Posted Dec 7, 2010 15:37 UTC (Tue)
by paulj (subscriber, #341)
[Link] (7 responses)
Posted Dec 7, 2010 18:14 UTC (Tue)
by jg (guest, #17537)
[Link]
Posted Dec 7, 2010 19:36 UTC (Tue)
by pizza (subscriber, #46)
[Link] (5 responses)
(My wireless access point's current stats show that in the 9 days since it last lost power, nearly 17% of all transmitted frames and 5% of received frames are retries.)
Since you're talking about 802.11's default to retry 7 (or 4) times, you should also consider that 802.11's worst-case packet transmit time (a max-size frame at the slowest rate, including time for the ACK) is under 19ms. Even on a highly-contended medium (i.e. poor S/N) the default max packet lifetime is 500ms, so your worst-case ping time attributable to 802.11 will be just under 1 second. If you're seeing multi-second delays, that's far more likely due to the network stack queueing described in JG's excellent paper.
The 802.11e QoS stuff actually helps a lot to decrease latency for things that care about it, but that requires the incoming packets to be properly tagged.
Going wireless makes things a *lot* more complicated (you're at the mercy of a fundamentally and highly unreliable medium that you can't control) and the IEEE 802.11 Working Groups had to balance a lot of conflicting requirements -- The resulting design is sane, and the defaults are just that -- defaults that can be changed to better suit the facts on the ground (or over the air, as it were)
Posted Dec 7, 2010 22:34 UTC (Tue)
by paulj (subscriber, #341)
[Link] (4 responses)
No one is saying communication over 802.11 doesn't need retransmit mechanisms somewhere. I'm saying the link layer is the wrong place for such strong schemes. You can accommodate the whole spectrum of reliability/latency/throughput trade-offs with protocols on top of "raw" unreliable links, but you can never use upper-level protocols to undo the damage done by building ACKs and retransmit schemes into your lower-layer link.
Posted Dec 8, 2010 3:11 UTC (Wed)
by pizza (subscriber, #46)
[Link] (3 responses)
1s RTT is not "daft for a link layer". It's not ideal, but there are plenty of real-world links out there that are worse than that -- and work just fine, even with TCP.
You're blaming the link-level for latency problems that are due to excessive/dumb buffering at the OS level.
If you have stuff that depends on low latency, then you need to use traffic shaping, QoS, and/or over-provisioning to ensure you can meet your needs. Linux (and 802.11) have all the mechanisms to do this already, but those mechanisms are rather useless without someone configuring sane policies and turning it all on.
Oh, and trusting application/os writers/users to do classification right instead of just saying "everything's high priority/low latency, gimme gimme gimme" which is basically what we have now.
Posted Dec 8, 2010 9:01 UTC (Wed)
by paulj (subscriber, #341)
[Link] (2 responses)
Also, you're clearly failing to understand the extremely healthy reasons for layering protocols. QoS, changes to OS buffers, shaping and over-provisioning simply can *not* undo the damage of the link underneath trying to implement reliability.
Posted Dec 10, 2010 7:55 UTC (Fri)
by jzbiciak (guest, #5246)
[Link] (1 responses)
Posted Dec 12, 2010 20:30 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
Note: it is easy to emulate a satellite link using "netem".
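Something along these lines, with illustrative delay and loss figures:

# ~250ms one-way delay (with 10ms jitter) and a little loss on egress
tc qdisc add dev eth0 root netem delay 250ms 10ms loss 0.1%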
Posted Dec 7, 2010 16:08 UTC (Tue)
by drag (guest, #31333)
[Link]
Posted Dec 7, 2010 16:26 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (5 responses)
The only way to have good latency and throughput at the same time is QoS. Considering this is not yet a solved problem on your local workstation <http://lwn.net/Articles/404993/>, good luck with that on the Internet.
This being said, I tend to agree that the pendulum has swung way too far in many devices/drivers. One simple way to put this excess into crude light and explain it in simple terms, even to non-experts, is to stop describing queue sizes in packets or bytes, and describe them in milliseconds instead.
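For example (illustrative numbers): a 256-packet queue of 1500-byte packets in front of a 1Mbit/s uplink is not "256 packets", it is three seconds:

# 256 packets * 1500 bytes * 8 bits, drained at 1,000,000 bits/s
echo 'scale=2; 256 * 1500 * 8 / 1000000' | bc    # -> 3.07 (seconds)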
Posted Dec 7, 2010 22:33 UTC (Tue)
by intgr (subscriber, #39733)
[Link] (4 responses)
That's the point of this article, this wisdom does not apply to TCP. In fact, most of the article focuses on this one bulk transfer case where the buffering causes the connection itself to perform too many retransmits, actually reducing its throughput as well as destroying latency.
Posted Dec 8, 2010 9:41 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (3 responses)
However, my bufferbloat experience (nice name BTW) has never been as bad as Jim's. I mean latency is destroyed of course but aggregate, directed throughput is NOT. I find TCP surprisingly robust.
I have a hard time finding any clear *throughput* conclusion in Jim's (too long) blog entry. Granted, TCP is "confused", but most people look at only the end result: throughput, and in only one direction at a time.
If Jim's aggregate *rsync throughput* (not just latency) was really reduced significantly, then I would say his configuration is more the exception than the rule. Every device or driver that goes out on the market is tested with TCP (at least). Unlike for latency, I doubt there is any widespread *directed throughput* problem in practice. Yet?
Posted Dec 8, 2010 15:38 UTC (Wed)
by jg (guest, #17537)
[Link] (2 responses)
The issue is that for everyone else, the connection becomes unusable, and fair bandwidth sharing is also destroyed.
It's so bad on many networks you don't want to even do web surfing while such a transfer is occurring, much less anything like VOIP or Skype.
Posted Dec 8, 2010 16:18 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (1 responses)
No offence Jim, but ALL gamers have known this rule for AGES.
A few years back I was working on this topic. Then I was amazed to see some researchers and other "lab experts" ignoring or sometimes even denying this basic rule of thumb. They were probably working too hard and not playing enough!
When your model is not good enough, throw away reality. This will increase your chances to get your paper published.
Posted Dec 8, 2010 20:28 UTC (Wed)
by jg (guest, #17537)
[Link]
If I've done anything, it's showing that a single TCP connection can fill the buffers, and that bufferbloat is causing havoc in subtle ways.
The problem is that as a systematic problem, bufferbloat has not been widely recognized. It pervades just about everything from end to end. Applications have it, our TCP stacks may have it, the network code underneath has it, the device drivers have it, and just about everything else.
Hell, if I don't mess with txqueuelen and my ethernet driver with ifconfig and ethtool, I get hundreds of milliseconds on an ethernet switch on this Linux laptop.
But ISPs have often not understood, and certainly most engineers building network hardware (and operating systems) have not.
Parts I haven't blogged about yet include elsewhere in the internet core, 3G networks, corporate networks, etc.
The problem here is to educate everyone enough to be sensitive to and to solve the problem.
The internet is busted folks: we are both part of the problem and part of the solution.
Posted Dec 8, 2010 15:27 UTC (Wed)
by jg (guest, #17537)
[Link] (1 responses)
But my tests are not extensive, and I've heard of sporadic reports of much higher loss rates, so far non-reproducible.
Over a year ago, when I was last chasing home network disease in my house (documented in an older blog entry), my tests were showing much higher loss rates than the 1-3% I got out of the traces I've taken. Now, that may have been lightning damage in either my router's NIC or the cable modem; but then again, it might have been bufferbloat provoking TCP, if the timing was pessimal. I can't easily reproduce that test configuration, as all the hardware involved is now dead due to lightning.
Nick Weaver also reported to me seeing higher loss rates last summer when he reproduced my tests on his home service, but since then has not seen the tens of percent losses again (neither has he been testing).
So it is entirely possible that there is some state, induced by the buffers, in which TCP is yet less sane.
But to prove it and understand the cause, we need TCP traces. Without them, it is anecdotal evidence at best, and I carefully claim the documented 1-3% I've observed in my traces.
Please do investigate: if you can catch higher loss rates, *please* do a packet capture and let me know!!!!
P.S. it has often been very difficult to monitor your overall home network, since you haven't been able to easily get a good tap between a home router and the broadband device. I stumbled across the following device which is a small switch which provides port mirroring and runs at a gigabit, and bought one (they are $150 or so). So I can finally watch all my traffic! Sometime soon, I'll try to see what all the spyware in web browsers are trying to do....
See: http://www.dual-comm.com/
Posted Dec 8, 2010 21:33 UTC (Wed)
by jmm82 (guest, #59425)
[Link]
Posted Dec 7, 2010 17:32 UTC (Tue)
by adamgundy (subscriber, #5418)
[Link] (12 responses)
there are many, many tools out there to do this from the consumer end - builtin traffic shaping in the linux kernel (using for example the 'wondershaper' script or others), most of the 'replacement' firmwares for routers that are based on linux (Tomato etc) have built in support for this, 'gaming' or 'VOIP' routers that have QOS built in..
there are several major issues:
1) you have to shape to a known speed (outbound), to avoid filling the buffers - that works great with business class broadband, where you get exactly what you pay for, but not so well with consumer broadband where the bandwidth is contended - you end up setting a low threshold, otherwise latency goes to hell when you're contended and start filling the modem's buffer again (see the sketch after this list).
2) removing the buffers (or using small ones) slows down TCP. queue prioritization solves that (and linux does it nicely, if configured). QOS enabled switches in a corporate LAN also have it.
3) inbound shaping is impossible to do correctly - it needs to be shaped outbound, at the other end of the bottleneck. finding an ISP who will do that can be tricky (at least, consumer side - business ones are a little more willing, but typically only when it benefits *their* VOIP service, for example)
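A minimal sketch of the shaping in point 1, assuming a nominal 1Mbit/s uplink and shaping to ~90% of it so the modem's buffer never fills (device and rate are illustrative):

# Keep the queue on the Linux box, where we can manage it,
# rather than in the modem
tc qdisc add dev eth0 root tbf rate 900kbit burst 10k latency 50ms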
I see three real solutions:
(a) uncontended bandwidth, so you can shape appropriately. anyone thinking *that's* going to happen for consumer internet is clearly deluded..
(b) some way of having the modem tell the router about backpressure (and yes, that's in the ethernet standards, but they never *do* it) and disable or reduce its buffering.
(c) have the modem itself support QOS levels and fair queueing.. then its buffers can be put to good use without screwing up latency. seems like an easy one until you realize that it's not that easy - you have to configure the right ports, protocols, etc, and most users don't even realize their modem is anything more than a dumb brick..
the most worrying thing in the entire article was the amount of head-scratching apparently going on at Comcast - their network techs *surely* know about QOS, right?
Posted Dec 7, 2010 18:16 UTC (Tue)
by jg (guest, #17537)
[Link] (7 responses)
Unfortunately, many devices have no provision for QOS, including much/all of broadband (this is certainly true for the cable modem/CMTS link).
So QOS is *not* a panacea; it may be a useful tool, but it is *not* sufficient by itself. Some sort of queue management is fundamentally necessary.
Posted Dec 7, 2010 20:39 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (1 responses)
QoS prioritizes packets. Your VoIP application does not care at all if there is packet loss and huge latency *for others*.
> down in congestion, we'll have congested network with bad packet loss
> and potential (or real) congestion collapse.
I have experienced bufferbloat a number of times, yet I have never seen any "collapse". No more than on a satellite link (which is basically "Airbufferbloat"). Things just become slow.
> So QOS is *not* a panacea; it may be a useful tool, but it is *not* sufficient by itself. Some sort of queue management is fundamentally necessary.
I fail to see the difference between QoS and queue management.
Posted Dec 10, 2010 8:35 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
Well, in the case of satellite, that isn't actually buffer bloat. Your bandwidth*delay product is actually that large. Since your RTT is also huge (1 second if both directions go via satellite), your throughput becomes a function of your TCP window size more than anything. If anything, you'll send too little traffic into the link on any one connection, and so each connection will be much slower than the raw underlying bandwidth if it's a high bandwidth uplink. But, with TCP window scaling, that shouldn't be a problem.
With bufferbloat, you end up oversubscribing the connection with "eager" traffic that gets accepted, but just queued. Because it gets accepted more quickly, TCP mis-estimates the actual bandwidth of the stream. When it sees the ACKs coming back slowly, then takes the original bandwidth estimate times the delay and comes up with a huge bandwidth*delay product which isn't actually measuring data in flight, but rather data in queues. With that in hand it goes right on saturating the queues. Any other traffic that comes along now has to sit in those same queues and sweat it out. In the satellite example, if other traffic showed up and there wasn't much actual buffering ahead of the satellite uplink (and likewise in the satellite itself), then the other traffic's packets would get on the air in an orderly fashion, and you wouldn't see a huge amount of jitter even if you had a high bandwidth transfer in parallel. The point is that it's always a fixed transit time to bounce off the satellite once the packet gets on the air. The speed of light doesn't change with network congestion. Queues in a router are a different story. The latency between when you insert and when you remove something from a network buffer varies with the traffic, leading to what Jim Gettys describes: TCP overestimates the bandwidth and therefore the bandwidth*delay product, and so overcommits the network to the high-speed stream.
Posted Dec 7, 2010 21:32 UTC (Tue)
by yaap (subscriber, #71398)
[Link]
Cable modems (CM) and CMTSes use the DOCSIS standard nowadays, and it has supported QoS since DOCSIS 1.1 (we're now at 3.0). Moreover, to sell a CM you must get it certified, and the certification verifies that QoS is properly implemented. A CM must have a valid DOCSIS-signed certificate, which covers specific software and hardware versions. You must pass certification to get this X.509 certificate, and without it a CM is not let into the cable network. There's also certification for CMTSes; it's a bit less paranoid about versions, but all CMTSes do support DOCSIS QoS too.
Now, all this QoS framework can only be controlled by the cable operator (as per the standard). And they typically stick to the basics: one best-effort flow for data, dedicated flows for VoIP when needed.
So as a user, you don't see any of this. But why couldn't the cable operator let the user configure their own QoS? Well, the model is not suitable for this. If there were a notion of a per-user global budget, there wouldn't be an issue with letting the user do QoS within it (QoS bounded at the user level). But that's not how it works. In the model, each flow has its dedicated budget (min and peak rate, priority, latency...), and all flows are top level (there is no "user" level of QoS). So it's not practical to let the user configure this.
And because there's already QoS on the cable side, I guess CM makers don't bother adding an additional user-level QoS sub-system.
This was the situation a few years ago, when I was working on cable products. I don't think it has changed much.
Posted Dec 7, 2010 22:16 UTC (Tue)
by adamgundy (subscriber, #5418)
[Link] (3 responses)
if you get the QOS correct, then TCP takes care of the throttling for you. the missing component after QOS is traffic shaping so that each TCP connection gets a fair slice of the available bandwidth, rather than one hogging it all. the two things are pretty much bundled together in the Linux implementation.
the whole reason the buffers are there in the first place is to prevent packet drops - it allows routers to even out the flow instead of just dropping the packets. provided you don't ever *fill* the buffer, life is good.
the major issue is *trust*. you can do QOS and shaping outbound relatively easily (modulo not knowing your available bandwidth in a contended situation), but inbound involves the ISP marking (or trusting the markings on) some packets. that makes for easy DOS situations. IMHO they should just mark the 'flow' inbound according to what's marked outbound, then the user gets to choose..
> Unfortunately, many devices have no provision for QOS, including
> much/all of broadband (this is certainly true for the cable modem/CMTS
> link).
cable modems have queue priority support in them - have ever since DOCSIS 1. the issue is that the ISPs disable it... mainly because it doesn't work well with contended bandwidth - there's no support for packet classification in the standard.
you occasionally see it in use when they provide their own VOIP solution (they basically reserve enough bandwidth for a couple of lines), but Comcast I believe opted for a completely separate modem instead.
> So QOS is *not* a panacea; it may be a useful tool, but it is *not*
> sufficient by itself. Some sort of queue management is fundamentally
> necessary.
yes. queue priorities are part of QOS (it doesn't work without them). fair shaping is an extra addon, which not all solutions offer (commercial switches often have QOS but not shaping, at least the lower end models).
you should try this: get a Tomato compatible router, flash the firmware, then turn on QOS. you'll have to set your bandwidth cap correctly to avoid contention. you'll be shocked at the difference. alternatively, for an even simpler solution, buy a 'gaming' router and just plug it in - I use a DLINK DI-634M as a home solution and it works wonders (even attempts to guess the available bandwidth instead of setting it manually).
buffering outside of the last mile just isn't an issue, unless something very bad is going on. the backbone providers simply provision enough bandwidth to ensure they never overflow the buffers.. the issue is all about trying to squeeze too much data through a slow pipe.
Posted Dec 7, 2010 23:49 UTC (Tue)
by mgedmin (subscriber, #34497)
[Link] (2 responses)
Posted Dec 7, 2010 23:55 UTC (Tue)
by adamgundy (subscriber, #5418)
[Link] (1 responses)
try tomato.. it's much slicker and runs on a WRT54GL no problem.
Posted Dec 8, 2010 6:21 UTC (Wed)
by Los__D (guest, #15263)
[Link]
Posted Dec 7, 2010 20:31 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (3 responses)
It is hard but not totally impossible; you can cheat TCP into some form of "inbound shaping". The trick is to artificially create an inbound bottleneck on your premises. Even though this bottleneck is downstream, as long as it is somewhat "narrower" than any other bottleneck on the path it will be the first to drop packets, which will halve the TCP congestion window of senders. By keeping the queue size of the artificial bottleneck small you can preserve latency.
Of course this does not work at all if you are hammered down by oblivious inbound UDP traffic.
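A sketch of the trick, along the lines of the wondershaper's ingress half (device and rate are illustrative; pick something like ~85% of the real downlink):

# Drop ingress traffic above ~85% of the downlink so the remote TCP
# senders back off before the ISP's oversized buffer ever fills
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    police rate 6mbit burst 50k drop flowid :1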
> I see three real solutions:
> (a) uncontended bandwidth, so you can shape appropriately. anyone thinking *that's* going to happen for consumer internet is clearly deluded..
I am not sure what you mean by "uncontended". No matter where the bottleneck between A and B is, TCP is designed to reach it. See the Cisco reference I posted above.
> (b) some way of having the modem tell the router about backpressure (and yes, that's in the ethernet standards, but they never *do* it) and disable or reduce its buffering.
Backpressure is just pushing the problem one step further. And it typically adds a Head Of Line blocking problem when multiplexing.
> (c) have the modem itself support QOS levels and fair queueing.. then its buffers can be put to good use without screwing up latency. seems like an easy one until you realize that it's not that easy - you have to configure the right ports, protocols, etc,
I think the configuration complexity of this is what killed alternatives to IP. In other words, IP won because it did not care about QoS, in accordance with the End to End principle.
Posted Dec 7, 2010 21:23 UTC (Tue)
by adamgundy (subscriber, #5418)
[Link] (2 responses)
> It is hard but not totally impossible; you can cheat TCP into some form of
> "inbound shaping". The trick is to artificially create an inbound
> bottleneck on your premises. Even though this bottleneck is downstream, as
> long as it is somewhat "narrower" than any other bottleneck on the path it
> will be the first to drop packets, which will halve the TCP congestion
> window of senders. By keeping the queue size of the artificial bottleneck
> small you can preserve latency.
tried this.. didn't work (well). you end up with 'sawtooth' latency as the existing connections try to ramp up, then fall back. you have to throttle the bandwidth down enough to deal with any upswing, which is composed of all the TCP connections that decide to ramp up at any point in time.. you have to throw away a big chunk of your bandwidth to make it work (and it's still prone to someone starting a torrent, or using an FTP client that opens 20 connnections at once)
end result is you fight a losing battle - you throw away a decent chunk of your expensive bandwidth (eg 30%), and *still* have crappy sounding phone calls occasionally because an FTP client went nuts with dozens of connections... or a bunch of spammers hit your email server at the same time. sigh.
> Of course this does not work at all if you are hammered down by oblivious inbound UDP traffic.
yup. but less likely, unless you're being DOSed
Posted Dec 7, 2010 21:47 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (1 responses)
Agreed, this trick is insanely complicated to setup and fine-tune and works only in simple cases. I should have highlighted this, thanks for doing it.
Posted Dec 8, 2010 10:31 UTC (Wed)
by i3839 (guest, #31386)
[Link]
Modems are usually the bottleneck, and they know the link speed, so they could do the above automatically. Even fine-tuning could happen automatically, by monitoring the delays and adjusting the limits if latency gets too high.
It may not be perfect, but it's a hell of a lot better than the default.
Posted Dec 7, 2010 17:40 UTC (Tue)
by luto (guest, #39314)
[Link] (4 responses)
In principle, if all the networks used some sensible active queue management, and all routers and apps supported ECN, then under good conditions and light to moderate load there would be no need to drop packets.
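The endpoint half has been in Linux for years; whether you dare turn it on is another matter (see below):

# 0 = off, 1 = request ECN on outgoing connections too,
# 2 = only use ECN when the peer requests it
sysctl -w net.ipv4.tcp_ecn=1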
Posted Dec 7, 2010 18:20 UTC (Tue)
by jg (guest, #17537)
[Link] (3 responses)
The story I've heard is that a particular Taiwanese ODM shipped a pile of code used in home routers a while ago that keels over and dies if it sees an ECN bit set. Various ISPs and big institutional networks and the like have been avoiding service calls by clearing the ECN bit.
Research is underway to see if we can go ahead and deploy it.
If we can, it's almost certainly the case that ECN should get marked anytime the buffers begin to grow (before packet loss occurs).
Posted Dec 11, 2010 5:08 UTC (Sat)
by kijiki (subscriber, #34691)
[Link] (1 responses)
> If we can, it's almost certainly the case that ECN should
> get marked anytime the buffers begin to grow (before packet
> loss occurs).
You can do even better by setting ECN bits on a percentage of packets in a flow that is proportional to the depth of the buffer. Sadly it requires modifying endpoint OSes to not halve the window every time they see an ECN bit, so it is probably not deployable outside of datacenter clusters.
http://research.microsoft.com:8081/pubs/121386/dctcp-publ...
Posted Dec 11, 2010 6:57 UTC (Sat)
by njs (subscriber, #40338)
[Link]
Posted Dec 13, 2010 17:23 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link]
Posted Dec 7, 2010 18:24 UTC (Tue)
by martinfick (subscriber, #4455)
[Link]
http://en.wikipedia.org/wiki/Micro_Transport_Protocol
This was mentioned on LWN back in May:
http://lwn.net/Articles/389105/
Does Jim not have an LWN subscription? :)
Posted Dec 7, 2010 19:27 UTC (Tue)
by piggy (guest, #18693)
[Link]
I can attest that the understanding of the harm done by large buffers is far from universal. A few years ago I found myself trying to convince an office full of networking professionals that our device needed to drop packets, and drop them quickly. Fortunately, my job was to exercise the equipment under load. It was very easy to construct plausible examples which performed terribly. This led to a significant refactoring of our product.
Posted Dec 7, 2010 22:42 UTC (Tue)
by sustrik (guest, #62161)
[Link] (1 responses)
Posted Dec 8, 2010 15:50 UTC (Wed)
by jg (guest, #17537)
[Link]
I've seen bufferbloat in many applications (including the X server).
Apps are just as prone to bufferbloat as the network.
Think end to end: you have to manage all buffers to control latency. The trick is figuring out how to manage them efficiently and correctly. This turns out to be subtle and sometimes hard, and there is some real research to do. It is particularly difficult when you are in situations where predicting the available bandwidth is difficult (e.g. wireless, graphics). One size never fits all.
- Jim
Posted Dec 8, 2010 5:06 UTC (Wed)
by gdt (subscriber, #6284)
[Link] (1 responses)
Nice work :-) My own experience tuning tails and customer equipment is that it is not so much the amount of buffering which is the issue as the buffer playout algorithm. That is, too much of the buffer is exposed to any one TCP connection because the router is running FIFO rather than a real queueing algorithm. Also, what is exposed allows resource starvation within the buffer -- some flows get too much buffer and some get far too little. The poor buffering makes unfair TCP algorithms even unfairer; you can see Linux's CUBIC suffer from that when running parallel flows.
I notice some commenters are drawing conclusions about all routers from tail routers. A tail sees more congestion, but also sees a smaller range of bandwidth-delay products in the TCP connections. The core sees less congestion but a massive range of BDPs. The tuning is very different, the core having a focus on avoiding synchronisation (since 40Gbps of in-sync traffic can really mess up your backbone).
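The simplest off-the-shelf counter to FIFO capture on a tail link is a per-flow fair queue, e.g.:

# Stochastic fair queueing: each flow gets its own sub-queue, so one
# bulk TCP connection can no longer starve everything else of buffer
tc qdisc add dev eth0 root sfq perturb 10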
Posted Dec 8, 2010 16:02 UTC (Wed)
by jg (guest, #17537)
[Link]
You can see periodic behavior in my traces; if there are tendencies (as there often are in large systems) for such behavior to synchronize, you can get really bad aggregate time-based congestion behavior.
Best, of course, is that the underlying flows don't start exhibiting periodic behavior in the first place, so they won't participate in such synchronization.
A future posting will be my fears for the future.
We've effectively destroyed congestion avoidance by this buffering, for TCP and other protocols.
So I have to fear possible congestion collapse. The more subtle form of this is this self synchronizing aggregate behavior.
And the traffic pattern of the Internet is now finally shifting as Windows XP gets retired. Just because we don't see bad congestion collapse today (with the exception of 3G networks) doesn't mean we won't see it in the future, as the overall traffic shifts toward TCPs that run with window scaling and are more able to congest all links. And time-based behavior is more subtle than the simple congestion-collapse case.
Maybe I'm paranoid (having experienced the NSFnet collapse), but I do lose sleep over it.
Posted Dec 12, 2010 21:33 UTC (Sun)
by astrophoenix (guest, #13528)
[Link]
http://lartc.org/howto/lartc.cookbook.ultimate-tc.html
I fired up four bittorrents and started tuning my upload and download limits. 13/16 for download and 3/4 for upload is working well today.
Before this, I was in the habit of only starting bittorrent before I went to bed; I couldn't surf the web from my laptop while it was going! Now I can't even tell if bittorrent is going while I surf.
Posted Dec 21, 2010 17:10 UTC (Tue)
by JesperBrouer (guest, #62728)
[Link] (5 responses)
Skimming through Gettys' blog post, I think Gettys has actually missed what is happening. He should read my master's thesis[1], specifically page 11.
The real problem is that TCP/IP is clocked by the ACK packets, and on asymmetric links (like ADSL and DOCSIS) the ACK packets are simply coming downstream too fast (on the larger downstream link), resulting in bursts and high latency on the upstream link.
With the ADSL-optimizer I actually solved Gettys' problem, by having an ACK queue which is bandwidth-"sized" to the opposite link's speed.
But I guess the real solution would be to implement a TCP algorithm which handles this asymmetry and, e.g., isn't based on the ACK feedback...
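(The ADSL-optimizer's own ACK queue isn't reproduced here, but treating ACKs as a class of their own is the same idea the well-known wondershaper script uses on the upstream side -- a u32 filter steering bare TCP ACKs into a high-priority class. A sketch, assuming an existing prio qdisc with handle 1: and a high-priority class 1:1:)

# Match minimum-length TCP segments with only the ACK flag set and
# queue them ahead of bulk data, so upstream data cannot starve the
# ACK clock
tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
    match ip protocol 6 0xff \
    match u8 0x05 0x0f at 0 \
    match u16 0x0000 0xffc0 at 2 \
    match u8 0x10 0xff at 33 \
    flowid 1:1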
Posted Dec 23, 2010 0:49 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (4 responses)
Posted Dec 30, 2010 8:10 UTC (Thu)
by JesperBrouer (guest, #62728)
[Link] (3 responses)
The high latencies are of course caused by packets being queued, often in too-big (default?) queues; I'm not saying Gettys is wrong.
I'm just pointing out that part of the TCP/IP clocking mechanism breaks down on asymmetric links, and that a mechanism which paces the ACKs would mitigate this scenario by giving a smoother data clocking on the upstream link.
I think the root cause of the buffer-bloat problem is that ISPs are configuring a default queue size on all their products, regardless of the product's bandwidth.
ISPs should really choose the queue size based upon the product's bandwidth.
Posted Dec 30, 2010 14:17 UTC (Thu)
by JesperBrouer (guest, #62728)
[Link]
I have created a blog post on the buffer-bloat subject, where I have explained where the delay comes from and how to do the math.
Posted Dec 31, 2010 17:36 UTC (Fri)
by paulj (subscriber, #341)
[Link] (1 responses)
E.g. Jim pointed out the default maximum Tx queue length on Linux interfaces is 1000 packets for many kinds of interfaces (ethernet, even some wireless). This is excessively large for many links, given the wide-scale use of slow wifi. After setting it much lower (e.g. 100 packets, still roughly 90ms of serialisation delay for an 11Mbit/s effective-capacity link and 1200-byte packets), I see much tighter variance in latency between hosts which are connected by wifi.
Posted Jan 3, 2011 21:22 UTC (Mon)
by JesperBrouer (guest, #62728)
[Link]
The buffer I measured was in the ADSL modem, because it is the bottleneck in the setup, quite simply. My test PC is connected via a 100 Mbit/s port to the ADSL modem, and the exit point from the ADSL modem is a 512 Kbit/s link...
The very interesting point made by Gettys (which I didn't think about) is that Wifi introduces another bottleneck into the system. When introducing Wifi into the equation we have a new bottleneck, and then the buffering occurs on your Linux box, which has significant txqueuelen bufferbloat (I just checked that my wlan0 has txqueuelen=1000, which should be changed/lowered for wifi!).