
Gettys: Whose house is of glasse, must not throw stones at another

Jim Gettys has been on the path of a number of network pathologies for some time; he has now summarized his findings. The problem: too much buffering in Internet routers. "The buffers are confusing TCP's RTT estimator; the delay caused by the buffers is many times the actual RTT on the path. Remember, TCP is a servo system, which is constantly trying to "fill" the pipe. So by not signalling congestion in a timely fashion, there is *no possible way* that TCP's algorithms can possibly determine the correct bandwidth it can send data at (it needs to compute the delay/bandwidth product, and the delay becomes hideously large). TCP increasingly sends data a bit faster (the usual slow start rules apply), reestimates the RTT from that, and sends data faster. Of course, this means that even in slow start, TCP ends up trying to run too fast. Therefore the buffers fill (and the latency rises). Note the actual RTT on the path of this trace is 10 milliseconds; TCP's RTT estimator is misled by more than a factor of 100. It takes 10-20 seconds for TCP to get completely confused by the buffering in my modem; but there is no way back."
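
To put rough numbers on that, here is a back-of-envelope sketch in shell (the 10Mbps bottleneck rate is an assumption for illustration; the RTT figures come from the quote above):

rate=10000000                # assumed bottleneck rate: 10 Mbit/s
echo $(( rate / 8 / 100 ))   # bandwidth * delay at the real 10 ms RTT: 12500 bytes
echo $(( rate / 8 ))         # the same product at a bloated 1 s RTT: 1250000 bytes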


Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 4:03 UTC (Tue) by jg (guest, #17537) [Link]

This isn't just a router problem, Jon.

See: http://gettys.wordpress.com/2010/11/29/home-router-puzzle... and http://gettys.wordpress.com/2010/12/02/home-router-puzzle... for details; Linux is more guilty than most here, though Mac OSX and Windows also suffer.

Jim

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 5:19 UTC (Tue) by drag (guest, #31333) [Link] (18 responses)

This is freaking fantastic.

It's really really cool beyond belief that you (Gettys) figured it out.

I mean seriously amazing stuff. No question about it.

In terms of technical insight and investigative ability this was a HUGE hit out of the ballpark. Way out. You not only got a home run, you hit it over the stands, past the parking lot, and it's bouncing over the highway as we speak.

Internet history in the making. No question about it at all.

Beyond belief. You deserve the gratitude of, well, anybody with high-speed access to the internet.

Words escape me. All I can do is shake my head in awe. Completely awesome.

Kudos.

Now we just need to get the word out.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 7:55 UTC (Tue) by paulj (subscriber, #341) [Link]

Note that Jim has rediscovered well-known facts about networks and TCP. NB: I remain a fully paid-up member of his fan club. ;)

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 8:29 UTC (Tue) by PO8 (guest, #41661) [Link] (11 responses)

"At this point, I worried that we (all of us) are in trouble, and asked a number of others to help me understand my results, ensure their correctness, and get some guidance on how to proceed. These included Dave Clark, Vint Cerf, Vern Paxson, Van Jacobson, Dave Reed, Dick Sites and others. They helped with the diagnosis from the traces I had taken, and confirmed the cause."

Reading the fine article is probably helpful here. I'm guessing that even if he thought he was discovering something new, one of these people was able to fill him in.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 10:45 UTC (Tue) by jg (guest, #17537) [Link] (10 responses)

I knew very early on that others had been long ahead of me; but they generally stopped at the particular instance of bufferbloat they uncovered, and/or failed to communicate it widely. Dave Clark ran into bufferbloat on his DSLAM six years ago; he told Comcast to look for it, and Comcast in turn suggested to me that my problems were bufferbloat. An amusing story there: Dave Clark used it to DDoS his son's excessive WoW playing.

It was the severity and widespread nature of the bufferbloat problem that was the monumental surprise. I misdiagnosed the root cause initially. Pulling on the thread then led far afield from broadband: to Dave Reed's discoveries about 3G, to the Linux-specific txqueuelen disaster, and into home routers.

We have both many of the causes and some of the solutions to the bufferbloat problem already present in Linux, and can move faster than most in the industry. Will we rise to the occasion? Only time will tell.
- Jim

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 11:33 UTC (Tue) by paulj (subscriber, #341) [Link]

Jim,

Have you looked at the various academic papers on buffer sizing and, in particular, those looking at effects/benefits of "tiny buffers"? I presume you have...

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 16:15 UTC (Wed) by fuhchee (guest, #40059) [Link] (8 responses)

"the Linux specific txqueuelen disaster"

For what it's worth, I experimented briefly with forcing the txqueuelen way down from 1000 to a range like 5-100 on some NFS client/server boxes running 2.6.34ish. Any combination of nfs3/nfs4 tcp/udp resulted in NFS filesystem hangs.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 9, 2010 12:18 UTC (Thu) by marcH (subscriber, #57642) [Link]

No matter how and why this happens, this is a bug.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 10, 2010 4:10 UTC (Fri) by jg (guest, #17537) [Link]

The total buffering available is a function of both the transmit queue and the ring buffers in the NIC.

And the total amount of buffering that is appropriate also depends on the workload.

And also the link "goodput".

For some hardware (and workloads) you'll be able to set txqueuelen and the rings way down; for others, not (they may already be very small).

On the transmit side, you really want pretty minimal buffering; but as usual, zero is the wrong answer ;-).

So on this laptop, I've had no problems setting my ethernet interface ring to 64 (the lowest it will go) and txqueuelen to zero. Note that on this laptop, this means I actually have 64 packets of buffering present, at minimum.
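
In concrete terms, that is something like the following (eth0 is an assumption; and per the caveat below, a zero txqueuelen is unsafe on some hardware):

ethtool -G eth0 tx 64        # shrink the NIC's transmit ring to 64 descriptors
ifconfig eth0 txqueuelen 0   # remove the software transmit queue entirely
ethtool -g eth0              # verify what the driver actually accepted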

But on some other older hardware I've played with, setting txqueuelen to zero is a recipe for disaster, as the hardware has essentially no buffering.

The fundamental issue is *there is no single number* you can pick which will be right for everyone for buffering. And in the quest for single stream TCP performance over very high speed networks, we've happened to set our current number to values that are really insanely high for many situations. The challenge going forward, I believe, is making the system smart enough to actually figure out how much buffering is appropriate.

More about this in tomorrow's blog installment.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 10, 2010 8:34 UTC (Fri) by Yenya (subscriber, #52846) [Link] (5 responses)

About NFS: some 7-8 years ago I experienced NFS filesystem hangs when the connection between client and server went through a no-name switch with too-small buffers. After days of experiments I discovered that the NFS client does not send MTU-sized packets, but wsize/rsize-sized ones (8 KB by default, I think), which are then fragmented by the IP layer into MTU-sized chunks. When the buffers on the switch are too small, one or more fragments of every fragmented UDP packet are lost, which means no NFS request can pass through the switch intact. Instant filesystem hang for anything bigger than "ls".

I would guess your hangs with a txqueuelen of 5 may have the same cause. For values around 100, however, I don't think it can be the issue.

Problems due to Internet buffering

Posted Dec 10, 2010 19:56 UTC (Fri) by giraffedata (guest, #1954) [Link] (3 responses)

When the buffers on the switch are too small, for every fragmented UDP packet one or more fragments are lost

Why?

Problems due to Internet buffering

Posted Dec 11, 2010 8:43 UTC (Sat) by dlang (guest, #313) [Link] (2 responses)

because they are sent to the switch rapidly, faster than the switch can send them on to the destination; if the buffers are too small, the switch will throw some fragments away (after all, it's UDP and therefore not guaranteed delivery, right? ;-)

I ran into a similar problem years ago with the heartbeat packets of a high-availability package. With a few machines there were no problems, but with lots of machines the messages got large enough to require several UDP packets per message; on a busy network the odds that some of the packets needed for a message would get lost became statistically significant, and the entire cluster became unreliable.

Problems due to Internet buffering

Posted Dec 11, 2010 19:26 UTC (Sat) by giraffedata (guest, #1954) [Link] (1 responses)

Oh, I see. The original description made it sound like the packet gets dropped because of the fragmentation, and unconditionally. The problem is due to the 8K retransmission unit size, not the fragmentation per se.

I had a problem once (it developed one day, hung around for a few days, and then went away -- something out in the Internet changed) where a VPN link would not conduct a packet larger than something like 1220 bytes at all. So small ssh shell conversations worked fine, but try to ls a page full of files: 100% failure. I lowered the MTU on the "tun" device and all was well. I guess that was something different from what we're talking about now.

Problems due to Internet buffering

Posted Dec 14, 2010 9:38 UTC (Tue) by Yenya (subscriber, #52846) [Link]

The behaviour you describe is usually caused by the ICMP "fragmentation needed" packets being dropped by somebody, most probably an overly zealous firewall.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 11, 2010 22:31 UTC (Sat) by paulj (subscriber, #341) [Link]

Linux NFS now uses TCP by default, which presumably won't have this problem. Very interesting observation though. ;)

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 10:58 UTC (Tue) by jg (guest, #17537) [Link] (1 responses)

Thanks; but I think you are too kind.

Many of the puzzle pieces were handed to me (unassembled) by Comcast.

As always, we are on the shoulders of other giants: the area of congestion avoidance was explored with a depth of understanding I admire deeply by the likes of Van Jacobson, Sally Floyd, and many, many others. If I've done anything important here, it has been recognizing that the problem is occurring in other parts of the end-to-end system than "conventional" internet core routers, where it was pretty fully explored in the 1980s and 1990s.

And chance is very important: aiding me was knowing some of the players here, so that when I smelled smoke, they could diagnose the fire, giving me the confidence to dig deeper and look more widely. So in part, it’s being in a particular place at a particular time.
- Jim

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 12:53 UTC (Tue) by drag (guest, #31333) [Link]

> Thanks; but I think you are too kind.

No. You're cool.

> Many of the puzzle pieces were handed to me (unassembled) by Comcast.

Yay Comcast.

> As always, we are on the shoulders of other giants: the area of congestion avoidance was explored with a depth of understanding I admire deeply by the likes of Van Jacobson, Sally Floyd, and many, many others. If I've done anything important here, it has been recognizing that the problem is occurring in other parts of the end-to-end system than "conventional" internet core routers, where it was pretty fully explored in the 1980s and 1990s.

Plus you described it in a useful way, in an entertaining format packed full of information on how individuals can diagnose and take steps to mitigate the problem on their own.

Trust me, stuff like this matters. :)

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 18:03 UTC (Tue) by chad.netzer (subscriber, #4257) [Link] (2 responses)

Can I bill you for completely obliterating my sarcasm meter?

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 1:10 UTC (Wed) by drag (guest, #31333) [Link]

No. Return to maker, it's defective. I was being honest. :) I was half asleep when I wrote it, but I still think it's fantastic what he did.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 1:18 UTC (Wed) by rahvin (guest, #16953) [Link]

Mine's apparently broken too, because I read it the same way you did, although I wasn't quite sure and waited till others posted to see if I was reading it right. I'm not sure you can write something that "glowy" without people assuming it's sarcasm.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 7:39 UTC (Tue) by gmaxwell (guest, #30048) [Link] (3 responses)

Unfortunately a lot of the world has this crazy idea that even single packet drops are indicative of a dysfunctional network. Some of these people spend a lot of money on IP connectivity.

So not only is there not much of an incentive for things like active queue management -- some providers are strongly discouraged by their customers from behaving reasonably.

Not likely a factor in the sort of residential broadband networks that were being looked at here -- but it does partially explain some common industry practices.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 15:26 UTC (Tue) by foom (subscriber, #14868) [Link] (1 responses)

Unfortunately, a single dropped packet *is* a disaster for many applications.

Just to go through some numbers: you have a local network, with an RTT of < 1ms. Your remote app generally responds in < 10ms.

If you drop a SYN packet, the retransmit delay if you don't get a SYNACK is 3 *seconds*. You've now blown your expected response time by 300x.

If you drop the last packet in a request or response, you have to wait for a retransmit timeout, because the other side never received the data it would ack. That timeout is supposed to be informed by the RTT estimator, except that the *minimum* (RTO_MIN) is pegged at 200ms in Linux, so you get a 200ms delay in the request/response even on a 1ms RTT network! You've now blown your expected response time by 20x.

Note that both those delays are impossibly large compared to the expected RTT. With those kinds of numbers, it's a Really Really Bad Thing to drop packets...it's no wonder people decide to buffer them for a few ms instead!

The only time there's no issue with dropping packets is when it's in the middle of a large data transfer stream: SACK will take care of that, and the packet will be retransmitted with no delay to the overall transfer.

There is some research about reducing (or eliminating) RTO_MIN -- letting the retransmit time be informed purely by the RTT estimator. But one issue is that delayed ack is a 40ms timeout (itself huge!), which means that reducing RTO_MIN below that value will cause many packets to be sent twice (only twice, because an immediate ack is sent after the second transmission of the same packet is received). And RTO_MIN is controlled by the sender, while delayed ack is controlled by the receiver.

Sigh.
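
For reference, Linux will show you the per-connection retransmission timer it is actually using, so the 200ms floor is easy to observe (a sketch, assuming the iproute2 "ss" tool):

ss -ti state established | grep -o 'rto:[0-9.]*'   # stays at ~200+ even on a ~1ms RTT LAN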

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 19:23 UTC (Tue) by dlang (guest, #313) [Link]

buffering for a few ms is not the problem; the problem is that the buffers are so large that you can have a few tens of seconds of data in the buffer, which will hurt your application even worse than a 3-second retransmit delay.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 18:44 UTC (Tue) by jd (guest, #26381) [Link]

The problem is not so much dropping packets, but dropping the right packets.

Dropping packets can be good -- hence the schemes developed specifically to drop packets, e.g. RED, BLUE, BLACK, GREEN, PURPLE, WHITE. (Does anyone notice a pattern here?) If two packets would otherwise collide, dropping one rather than losing both improves overall throughput.

However, the problem is to drop the right packets. Not all packets are equal. As far as I can tell, packet-dropping schemes don't attempt to introduce anything but the simplest of biases. Which may indeed be all you can do: if the analysis takes longer than the retransmit, you gain nothing from the cleverer bias.

There are, of course, other mechanisms. ECN handles throttling, for example.

As far as over-large buffers are concerned, it is important to distinguish between two different methodologies - those buffers that try and buffer everything and those which are sub-divided somehow (one per conversation, one per classification, etc).

It's also important to distinguish between various buffering schemes (flush on fill, continuous drain, real-time, etc.). If a buffer is polled regularly (so it acts as more of a queue than a classical buffer), or if there are guarantees as to the minimum and maximum time a packet can be held without either being sent or being dropped, then you get very different behaviour from a buffer that has to fill entirely and then be dumped in one go.

As far as Linux is concerned, it has ample code for QoS and windowing controls. There's also the Web100 code for adding various tweaks. There is absolutely ZERO excuse for any distribution shipping a kernel that is sub-standard here. If any laws are to be passed regarding Internet traffic, I want it to be illegal to ship incompetently-configured networking software, with a mandatory minimum of 100 hours meta-moderating Slashdot.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 7:55 UTC (Tue) by jmm82 (guest, #59425) [Link]

I have been working on optimizing routers for the past four years and am very interested in this topic. The buffers that matter most are at the bottleneck, which is the hop with the lowest throughput. I have been working mostly in cellular, so maybe I have never had the "clean connection" required to experience this mess. I have spent most of my time trying to determine how to keep traffic in order and minimize latency.

One way to trick TCP and fill the pipe is to use multiple TCP connections for one transfer. I have used custom FTP applications that do so, and they help fill your pipe. The same effect can be seen with BitTorrent. This is OK as long as you are not sharing the connection with other applications.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 11:28 UTC (Tue) by marcH (subscriber, #57642) [Link] (8 responses)

Most Linux drivers are also to blame for bufferbloat. Just force your interface to good old 10Mb/s:

ethtool -s eth0 advertise 0x002

Now try to upload a big file and watch the latency go through the roof. The explanation is right here:

ethtool -g eth0

Just (re-)tested with tg3 on 2.6.35.9-64.fc14.

Already reported 6 years ago:
http://oss.sgi.com/archives/netdev/2004-05/msg00180.html
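
A sketch of the whole experiment, for anyone who wants to reproduce it (the interface and host names are assumptions):

ethtool -s eth0 advertise 0x002             # force the link down to 10 Mb/s
ping some.remote.host &                     # watch the latency in parallel...
scp bigfile user@some.remote.host:/tmp/     # ...while saturating the uplink
ethtool -g eth0                             # the TX ring sizes explain the delay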

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 11:45 UTC (Tue) by marcH (subscriber, #57642) [Link]

http://tiny-tera.stanford.edu/~nickm/papers/sigcomm2004.pdf

Conclusion: "We believe that the buffers in backbone routers are much
larger than they need to be — possibly by two orders of magnitude." (2004!)

Another good starting point into this old and really vast literature:
http://www.cisco.com/web/about/ac123/ac147/archived_issue... (2006)

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 19:16 UTC (Tue) by jmm82 (guest, #59425) [Link]

IMO, the buffer should be sized relative to the throughput of the link, to account for jitter and traffic bursts. A 10 Gb/s port and a 10 Mb/s port therefore do not require the same buffer size. I am using the two extremes here, but I bet you could make a case for gigabit Ethernet having a buffer of more than 100 packets, while the same size may not be advantageous on a 10 Mbit NIC.

This is often hidden on many computers because the Ethernet on the machine is not the bottleneck, and the traffic is therefore queued elsewhere.
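
A sketch of the arithmetic behind sizing to the link rather than to a fixed packet count (the 10ms latency budget is an assumption):

budget_ms=10
echo $(( 1000 * budget_ms / 8 ))   # ~1250 KB of queue fits the budget at 1 Gb/s
echo $(( 10 * budget_ms / 8 ))     # ~12 KB (about 8 full-size packets) at 10 Mb/s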

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 15:08 UTC (Wed) by jg (guest, #17537) [Link] (2 responses)

Yes; this test is exactly the easiest way to show that Microsoft Windows also has driver bufferbloat.

You don't see bufferbloat on Windows at 100Mbps, as they have deliberately tuned their TCP to never exceed about 85Mbps (XP, Vista, and Win 7). So in usual use on the most common (100Mbps) switches still deployed, Windows won't exhibit bufferbloat: since the link can carry more than the stack will send, the bottleneck moves somewhere else.

I suspect that Microsoft ran into bufferbloat but didn't fully understand it. I'd love to talk to the right engineers and find out. Bufferbloat (judging from comments in one of their tech notes) appears to have played havoc with the responsiveness of media playing, and so they rate-limited their network stack (by default). If you put much latency into the loop, the media player/media server control loop won't function well. Since the bottleneck is usually in some other device (home router or broadband device), they exonerated Windows and moved on.

Or maybe they were brighter than we are; but then I think they would have gone for a more general fix. The engineers/scientists there are really good, and they now have a first-rate TCP stack (not true in XP days). So I doubt they would have stopped there if they had analyzed the situation correctly.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 9, 2010 16:44 UTC (Thu) by BenHutchings (subscriber, #37955) [Link] (1 responses)

I've been discussing the issue of TX queue length with developers working on net drivers for other kernels (Windows, Solaris, FreeBSD). None of them even has a limit! Drivers are expected to carry on buffering packets until they run out of memory. Linux at least expects the driver to return TX_BUSY when the hardware ring is full, and then limits the length of the software backlog (and both queue lengths are configurable).

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 12, 2010 20:26 UTC (Sun) by marcH (subscriber, #57642) [Link]

> None of them even have a limit!

Wow, now this is really serious bufferbloat. Infinity is going to be hard to beat.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 11, 2010 6:34 UTC (Sat) by njs (subscriber, #40338) [Link] (1 responses)

And 802.11 connections are much worse -- you can't use traffic shaping because the bandwidth is variable, you can't use packet prioritization because the offending buffer is *between* your QoS code and your network card, and I don't know that any of them even support ethtool. Mine doesn't. So this is literally unfixable without kernel changes.

But, now that I actually understand what's going on, it turns out that for my situation a one *character* kernel change suffices! -- http://thread.gmane.org/gmane.linux.kernel.wireless.gener...

Man, this has been frustrating me for *years*. I hope it gets fixed for real soon!

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 12, 2010 9:09 UTC (Sun) by paulj (subscriber, #341) [Link]

Very interesting results. Presumably your patch won't be accepted - they'll want something that makes that high_mark configurable? Do report back here if/when the driver is fixed/made configurable, if you could!

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 18, 2010 1:35 UTC (Sat) by clemenstimpler (guest, #71914) [Link]

The basic problem was known in 1999: http://lkml.org/lkml/1999/4/27/50

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 12:48 UTC (Tue) by werth1 (guest, #48435) [Link] (18 responses)

This is a somewhat known problem in the online gaming community where latency is critical. In particular for FPS's.
For a low RTT you have to avoid all other unthrottled operations on the network that can fill your buffers. That includes all those little task-bar apps calling home for updates, P2P VoIP like Skype, your web browser refreshing flash advertisements, your email client checking for news, other unthrottled computers sharing the network, and viruses sending spam. The latter is a good candidate when it should work but doesn't.
Most P2P software allows client-side throttling, which works around the failure of the TCP/IP throttling. What also works is to deploy a client-side traffic shaper that throttles the bandwidth. Essentially you are shifting the bottleneck out of the modem, where it fills the large buffers, into the clients, where you have better control over the buffer.

What is new and exciting is the extent of the analysis, and the discovery that this actually increases the number of dropped packets for large file transfers by a few percent.

This is a consequence of coding to a benchmark.

Posted Dec 7, 2010 14:28 UTC (Tue) by pizza (subscriber, #46) [Link] (9 responses)

In the (commercial) 802.11 world, the emphasis has been on maximizing single-stream/application throughput at any cost. And by throughput, I mean throughput as measured by the NetChariot benchmark tool.

Deep buffers means that the medium is always utilized as much as possible, which gives better throughput numbers at the expense of latency. But that latency doesn't really crop up until you try to use multiple, differing data streams.

My company's clients didn't like it when I pointed this out to them, but to quote a certain engineer, "I kinna change the laws of physics!"

Don't get me wrong, I'm glad to see this bit of sleuthing presented with such visibility, but... it seems to me like one of those dirty open secrets that everyone knew about but nobody completely thought through.

I'm not sure what we can really do about this though.. ("Oh no, this new kernel is slower than the old one, it sucks") ...at least until the PowersThatBe are made to care about multi-stream performance.

This is a consequence of coding to a benchmark.

Posted Dec 7, 2010 15:37 UTC (Tue) by paulj (subscriber, #341) [Link] (7 responses)

Another incredibly annoying aspect of 802.11 is that it goes *way* too far in trying to guarantee delivery. The underlying MAC protocols can require a series of communications just to communicate one data frame, and it may retry quite a few times (the default is 7 times for packets smaller than the RTSThreshold). I've seen small, close-area 802.11 networks with ping times of several *seconds* under low S/N conditions, presumably at least partly attributable to 802.11's daft levels of retrying.

This is a consequence of coding to a benchmark.

Posted Dec 7, 2010 18:14 UTC (Tue) by jg (guest, #17537) [Link]

Yes, 802.11 has this problem. I'll cover it in a future blog post, which will cover this and other problems.

This is a consequence of coding to a benchmark.

Posted Dec 7, 2010 19:36 UTC (Tue) by pizza (subscriber, #46) [Link] (5 responses)

802.11's retries are one of the artifacts of it trying to recreate wired ethernet's reliability, and are a VeryGoodThing(tm). Collisions and dropped 802.11 frames are a simple fact of life and are *very* common even under ideal circumstances.

(My wireless access point's current stats show that in the 9 days since it last lost power, nearly 17% of all transmitted frames and 5% of received frames were retries)

Since you're talking about 802.11's default of retrying 7 (or 4) times, you should also consider that 802.11's worst-case packet transmit time (max size * slowest rate, including time for the ACK) is under 19ms. Even on a highly-contended medium (i.e. poor S/N) the default max packet lifetime is 500ms, so your worst-case ping time attributable to 802.11 will be just under 1 second. If you're seeing multi-second delays, that's far more likely due to the network-stack queueing described in JG's excellent paper.

The 802.11e QoS stuff actually helps a lot to decrease latency for things that care about it, but that requires the incoming packets to be properly tagged.

Going wireless makes things a *lot* more complicated (you're at the mercy of a fundamentally and highly unreliable medium that you can't control) and the IEEE 802.11 Working Groups had to balance a lot of conflicting requirements -- The resulting design is sane, and the defaults are just that -- defaults that can be changed to better suit the facts on the ground (or over the air, as it were)

This is a consequence of coding to a benchmark.

Posted Dec 7, 2010 22:34 UTC (Tue) by paulj (subscriber, #341) [Link] (4 responses)

I said at least some of that incredibly high ping time would have been due to 802.11. And I have to disagree: 1s RTT is just daft for a low-level link layer. It makes it impossible for higher-layer protocols to work properly, e.g. anything real-time in which low latency is preferable to reliability. It makes life difficult to impossible for RTT estimators in higher protocols when the lower-level link artificially induces lots of variance in latency by trying to do TCP at the link layer (and badly).

No one is saying communication over 802.11 doesn't need retransmission mechanisms somewhere. I'm saying the link layer is the wrong place for such strong schemes. You can accommodate the whole spectrum of reliability/latency/throughput with protocols on top of "raw" unreliable links, but you can never use upper-level protocols to undo the damage done by building ACKs and retransmit schemes into your lower-layer link.

This is a consequence of coding to a benchmark.

Posted Dec 8, 2010 3:11 UTC (Wed) by pizza (subscriber, #46) [Link] (3 responses)

Sheesh, kids these days are so spoiled with their switched ethernet...

1s RTT is not "daft for a link layer". It's not ideal, but there are plenty of real-world links out there that are worse than that -- and work just fine, even with TCP.

You're blaming the link-level for latency problems that are due to excessive/dumb buffering at the OS level.

If you have stuff that depends on low latency, then you need to use traffic shaping, QoS, and/or over-provisioning to ensure you can meet your needs. Linux (and 802.11) have all the mechanisms to do this already, but those mechanisms are rather useless without someone configuring sane policies and turning it all on.

Oh, and trusting application/os writers/users to do classification right instead of just saying "everything's high priority/low latency, gimme gimme gimme" which is basically what we have now.

This is a consequence of coding to a benchmark.

Posted Dec 8, 2010 9:01 UTC (Wed) by paulj (subscriber, #341) [Link] (2 responses)

Which other link technologies *add* 500ms of worst-case latency to transmission? I'd like to avoid them.

Also, you're clearly failing to understand the extremely healthy reasons for layering protocols. QoS, changes to OS buffers, shaping and over-provisioning simply can *not* undo the damage of the link underneath trying to implement reliability.

This is a consequence of coding to a benchmark.

Posted Dec 10, 2010 7:55 UTC (Fri) by jzbiciak (guest, #5246) [Link] (1 responses)

How about satellite?

This is a consequence of coding to a benchmark.

Posted Dec 12, 2010 20:30 UTC (Sun) by marcH (subscriber, #57642) [Link]

Satellite is related, but different: the delay is constant at all times under any traffic conditions. I guess that makes TCP more stable.

Note: it is easy to emulate a satellite link using "netem".
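
For example, something like this (the interface, rate, and delay are assumptions; 250ms is roughly one geostationary hop):

tc qdisc add dev eth0 root handle 1: netem delay 250ms                                # constant propagation delay
tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 2mbit burst 3200 latency 400ms   # plus a narrow, fixed-rate uplink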

This is a consequence of coding to a benchmark.

Posted Dec 7, 2010 16:08 UTC (Tue) by drag (guest, #31333) [Link]

All you have to do, if you're doing a commercial or consumer router, is offer a configuration option of 'bulk file' or 'voip', and explain that one optimizes maximum file-transfer performance for a single machine while the other is optimized for busy networks and latency.

same old trade-off

Posted Dec 7, 2010 16:26 UTC (Tue) by marcH (subscriber, #57642) [Link] (5 responses)

This is just the same good old latency/throughput trade-off seen not just in networking but in so many other places. To get good throughput you must favour asynchrony. To favour asynchrony you need big queues; otherwise you risk leaving queues empty/idle for a fraction of the time. Once you have big queues, say goodbye to your latency.

The only way to have good latency and throughput at the same time is QoS. Considering this is not yet a solved problem on your local workstation <http://lwn.net/Articles/404993/>, good luck with that on the Internet.

This being said, I tend to agree that the pendulum has swung way too far in many devices/drivers. One simple way to put this excess in a crude light, and to explain it in simple terms even to non-experts, is to stop describing queue sizes in packets or bytes and start describing them in milliseconds instead.
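
For instance, converting a stock transmit queue into time (the packet size and link rate are assumptions):

pkts=1000; bytes=1500; rate=10000000              # txqueuelen 1000, full-size packets, 10 Mb/s
echo "$(( pkts * bytes * 8 * 1000 / rate )) ms"   # prints "1200 ms": over a second of queue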

same old trade-off

Posted Dec 7, 2010 22:33 UTC (Tue) by intgr (subscriber, #39733) [Link] (4 responses)

> To get good throughput you must to favour asynchrony. To favour asynchrony you need big queues

That's the point of this article: this wisdom does not apply to TCP. In fact, most of the article focuses on the one bulk-transfer case, where the buffering causes the connection itself to perform too many retransmits, actually reducing its throughput as well as destroying latency.

same old trade-off

Posted Dec 8, 2010 9:41 UTC (Wed) by marcH (subscriber, #57642) [Link] (3 responses)

Yeah, the problem is that TCP is designed as a "synchronous" protocol (it is ACK-clocked, needs accurate RTT estimations, etc.) but is implemented in asynchronous operating systems which love big queues. This is an interesting design mismatch with funny workarounds.

However, my bufferbloat experience (nice name BTW) has never been as bad as Jim's. I mean latency is destroyed of course but aggregate, directed throughput is NOT. I find TCP surprisingly robust.

I have a hard time finding any clear *throughput* conclusion in Jim's (too long) blog entry. Granted, TCP is "confused", but most people look only at the end result -- throughput -- and in only one direction at a time.

If Jim's aggregate *rsync throughput* (not just latency) was really reduced significantly, then I would say his configuration is more the exception than the rule. Every device or driver that goes out on the market is tested with TCP (at least). Unlike for latency, I doubt there is any widespread *directed throughput* problem in practice. Yet?

same old trade-off

Posted Dec 8, 2010 15:38 UTC (Wed) by jg (guest, #17537) [Link] (2 responses)

I see 1-3% packet loss in my traces, but no big increase in elapsed time, as fast retransmit and SACK are covering most sins. Unless I or others can demonstrate higher loss rates, it's hard to argue that that loss rate is horrific (though it is higher than a TCP connection would see over a correct, unbloated path).

The issue is that for everyone else, the connection becomes unusable, and fair bandwidth sharing is also destroyed.

It's so bad on many networks you don't want to even do web surfing while such a transfer is occurring, much less anything like VOIP or Skype.

same old trade-off

Posted Dec 8, 2010 16:18 UTC (Wed) by marcH (subscriber, #57642) [Link] (1 responses)

> It's so bad on many networks you don't want to even do web surfing while such a transfer is occurring, much less anything like VOIP or Skype.

No offence Jim, but ALL gamers have known this rule for AGES.

A few years back I was working on this topic, and I was amazed to see some researchers and other "lab experts" ignoring or sometimes even denying this basic rule of thumb. They were probably working too hard and not playing enough!

When your model is not good enough, throw away reality. This will increase your chances to get your paper published.

same old trade-off

Posted Dec 8, 2010 20:28 UTC (Wed) by jg (guest, #17537) [Link]

Yes, I know the gamers have known of bufferbloat for years. In fact, there are home routers that are marketed as "gamer" routers. But the extent and magnitude of the broadband problem has not attracted attention.

If I've done anything, it's showing that a single TCP connection can fill the buffers, and that bufferbloat is causing havoc in subtle ways.

The problem is that as a systematic problem, bufferbloat has not been widely recognized. It pervades just about everything from end to end. Applications have it, our TCP stacks may have it, the network code underneath has it, the device drivers have it, and just about everything else.

Hell, if I don't mess with txqueuelen and my ethernet driver with ifconfig and ethtool, I get hundreds of milliseconds of latency through an ethernet switch on this Linux laptop.

But ISP's have often not understood, and certainly most engineers building network hardware (and operating systems) have not.

Parts I haven't blogged about yet include elsewhere in the internet core, 3G networks, corporate networks, etc.

The problem here is to educate everyone enough to be sensitive to the problem, and to solve it.

The internet is busted folks: we are both part of the problem and part of the solution.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 15:27 UTC (Wed) by jg (guest, #17537) [Link] (1 responses)

A few percent in my tests (1-3%).

But my tests are not extensive, and I've heard of sporadic reports of much higher loss rates, so far non-reproducible.

Over a year ago, when I was last chasing home network disease in my house (documented in an older blog entry), my tests were showing much higher loss rates than the 1-3% I got out of the traces I've taken. Now, that may have been lightning damage in either my router's NIC or the cable modem; but then again, it might have been bufferbloat provoking TCP, if the timing was pessimal. I can't easily reproduce that test configuration, as all the hardware involved is now dead due to lightning.

Nick Weaver also reported to me seeing higher loss rates last summer when he reproduced my tests on his home service, but since then has not seen the tens of percent losses again (neither has he been testing).

So it is entirely possible that the buffers sometimes induce TCP into some state where it is yet less sane.

But to prove it and understand the cause, we need TCP traces. Without them, it is anecdotal evidence at best, and I carefully claim the documented 1-3% I've observed in my traces.

Please do investigate: if you can catch higher loss rates, *please* do a packet capture and let me know!!!!
- Jim

P.S. It has often been very difficult to monitor your overall home network, since you haven't been able to easily get a good tap between a home router and the broadband device. I stumbled across the following device, a small switch which provides port mirroring and runs at a gigabit, and bought one (they are $150 or so). So I can finally watch all my traffic! Sometime soon, I'll try to see what all the spyware in web browsers is trying to do....

See: http://www.dual-comm.com/

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 21:33 UTC (Wed) by jmm82 (guest, #59425) [Link]

Looks like a glorified "hub". When hubs became obsolete and everyone realized that switches were much better because they reduced collisions, we could no longer sniff all traffic. I have a hub just for this purpose -- well, it's actually called a concentrator, but it's a hub, and half duplex. It sort of destroys your network when you place it in the middle, but the device you posted looks nice because you can sniff and transmit at gigabit speeds.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 17:32 UTC (Tue) by adamgundy (subscriber, #5418) [Link] (12 responses)

hmm. I'm reading this article and thinking 'but everyone knows this, right?'. this is basically commandment no. 1 for anyone doing VOIP or gaming: thou shalt use QOS.

there are many, many tools out there to do this from the consumer end: built-in traffic shaping in the linux kernel (using for example the 'wondershaper' script or others), most of the 'replacement' firmwares for routers that are based on linux (Tomato etc) have built-in support for this, and 'gaming' or 'VOIP' routers have QOS built in..

there are several major issues:

1) you have to shape to a known speed (outbound), to avoid filling the buffers (see the sketch after this list) - that works great with business class broadband, where you get exactly what you pay for, but not so well with consumer broadband where the bandwidth is contended - you end up setting a low threshold, otherwise latency goes to hell when you're contended and start filling the modem's buffer again.

2) removing the buffers (or using small ones) slows down TCP. queue prioritization solves that (and linux does it nicely, if configured). QOS enabled switches in a corporate LAN also have it.

3) inbound shaping is impossible to do correctly - it needs to be shaped outbound, at the other end of the bottleneck. finding an ISP who will do that can be tricky (at least, consumer side - business ones are a little more willing, but typically only when it benefits *their* VOIP service, for example)
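
as a sketch of the outbound shaping in (1), using the kernel's own tools (eth0 and the rates are made up - set the ceiling somewhat below your real uplink speed):

tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:1 htb rate 900kbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 300kbit ceil 900kbit prio 1   # interactive
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 600kbit ceil 900kbit prio 2   # bulk
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip tos 0x10 0xff flowid 1:10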

I see three real solutions:

(a) uncontended bandwidth, so you can shape appropriately. anyone thinking *that's* going to happen for consumer internet is clearly deluded..

(b) some way of having the modem tell the router about backpressure (and yes, that's in the ethernet standards, but they never *do* it) and disable or reduce its buffering.

(c) have the modem itself support QOS levels and fair queueing.. then its buffers can be put to good use without screwing up latency. seems like an easy one until you realize that it's not that easy - you have to configure the right ports, protocols, etc, and most users don't even realize their modem is anything more than a dumb brick..

the most worrying thing in the entire article was the amount of head-scratching apparently going on at Comcast - their network techs *surely* know about QOS, right?

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 18:16 UTC (Tue) by jg (guest, #17537) [Link] (7 responses)

QOS by itself is insufficient; if we don't signal the end points to slow down during congestion, we'll have a congested network with bad packet loss and potential (or real) congestion collapse.

Unfortunately, many devices have no provision for QOS, including much/all of broadband (this is certainly true for the cable modem/CMTS link).

So QOS is *not* a panacea; it may be a useful tool, but it is *not* sufficient by itself. Some sort of queue management is fundamentally necessary.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 20:39 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

> QOS by itself is insufficient; if we don't signal the end points to slow down in congestion, we'll have congested network with bad packet loss

QoS prioritizes packets. Your VoIP application does not care at all if there is packet loss and huge latency *for others*.

> and potential (or real) congestion collapse.

I have experienced bufferbloat a number of times, yet I have never seen any "collapse". No more than on a satellite link (which is basically "Airbufferbloat"). Things just become slow.

> So QOS is *not* a panacea; it may be a useful tool, but it is *not* sufficient by itself. Some sort of queue management is fundamentally necessary.

I fail to see the difference between QoS and queue management.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 10, 2010 8:35 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Well, in the case of satellite, that isn't actually bufferbloat: your bandwidth*delay product really is that large. Since your RTT is also huge (1 second if both directions go via satellite), your throughput becomes a function of your TCP window size more than anything. If anything, you'll send too little traffic into the link on any one connection, and so each connection will be much slower than the raw underlying bandwidth if it's a high-bandwidth uplink. But with TCP window scaling, that shouldn't be a problem.

With bufferbloat, you end up oversubscribing the connection with "eager" traffic that gets accepted, but just queued. Because it gets accepted more quickly, TCP mis-estimates the actual bandwidth of the stream. When it sees the ACKs coming back slowly, it then takes the original bandwidth estimate times the delay and comes up with a huge bandwidth*delay product which isn't actually measuring data in flight, but rather data in queues. With that in hand it goes right on saturating the queues. Any other traffic that comes along now has to sit in those same queues and sweat it out.

In the satellite example, if other traffic showed up and there wasn't much actual buffering ahead of the satellite uplink (and likewise in the satellite itself), then the other traffic's packets would get on the air in an orderly fashion, and you wouldn't see a huge amount of jitter even if you had a high bandwidth transfer in parallel.

The point is that it's always a fixed transit time to bounce off the satellite once the packet gets on the air. The speed of light doesn't change with network congestion. Queues in a router are a different story. The latency between when you insert and when you remove something from a network buffer varies with the traffic, leading to what Jim Gettys describes: TCP overestimates the bandwidth and therefore the bandwidth*delay product, and so overcommits the network to the high-speed stream.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 21:32 UTC (Tue) by yaap (subscriber, #71398) [Link]

Just some info on the cable side, since you mention cable's lack of QoS support.

Cable modems (CMs) and CMTSes use the DOCSIS standard nowadays, and it has supported QoS since DOCSIS 1.1 (we're now at 3.0). Moreover, to sell a CM you must get it certified, and the certification verifies that QoS is properly implemented. A CM must have a valid DOCSIS-signed certificate, which covers specific software and hardware versions. You must pass certification to get this X.509 certificate, and without it a CM is not let into the cable network. There is also certification for CMTSes; it's a bit less paranoid about versions, but all CMTSes do support DOCSIS QoS too.

Now all this QoS framework can only be controlled by the cable operator (as per the standard). And they typically stick to the basics: one best-effort flow for data, dedicated flows for VoIP when needed.

So as a user, you don't see any of this. But why couldn't the cable operator let the user configure his own QoS? Well, the model is not suitable for that. If there were a notion of a per-user global budget, there would be no issue with letting the user do QoS within it (QoS bounded at the user level). But that's not how it works: in the model, each flow has its own dedicated budget (min and peak rate, priority, latency...), and all flows are top-level (there is no "user" level in the QoS). So it's not practical to let the user configure this.

And because there's already QoS on the cable side, I guess CM makers don't bother adding an additional user-level QoS sub-system.

This was the situation a few years ago, when I was working on cable products. I don't think it has changed much.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 22:16 UTC (Tue) by adamgundy (subscriber, #5418) [Link] (3 responses)

> QOS by itself is insufficient; if we don't signal the end points to slow
> down in congestion, we'll have congested network with bad packet loss
> and potential (or real) congestion collapse.

if you get the QOS correct, then TCP takes care of the throttling for you. the missing component after QOS is traffic shaping, so that each TCP connection gets a fair slice of the available bandwidth rather than one hogging it all (a sketch below). the two things are pretty much bundled together in the linux implementation.

the whole reason the buffers are there in the first place is to prevent packet drops - it allows routers to even out the flow instead of just dropping the packets. provided you don't ever *fill* the buffer, life is good.

the major issue is *trust*. you can do QOS and shaping outbound relatively easily (modulo not knowing your available bandwidth in a contended situation), but inbound involves the ISP marking (or trusting the markings on) some packets. that makes for easy DOS situations. IMHO they should just mark the inbound 'flow' according to what's marked outbound; then the user gets to choose..
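
to give each connection its fair slice, hang a fair queue under the shaped bulk class (a sketch; the class names assume the htb example earlier in the thread):

tc qdisc add dev eth0 parent 1:20 handle 20: sfq perturb 10   # stochastic fair queueing per flow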

> Unfortunately, many devices have no provision for QOS, including
> much/all of broadband (this is certainly true for the cable modem/CMTS
> link).

cable modems have had queue-priority support in them ever since DOCSIS 1. the issue is that the ISPs disable it... mainly because it doesn't work well with contended bandwidth - there's no support for packet classification in the standard.

you occasionally see it in use when they provide their own VOIP solution (they basically reserve enough bandwidth for a couple of lines), but Comcast I believe opted for a completely separate modem instead.

> So QOS is *not* a panacea; it may be a useful tool, but it is *not*
> sufficient by itself. Some sort of queue management is fundamentally
> necessary.

yes. queue priorities are part of QOS (it doesn't work without them). fair shaping is an extra addon, which not all solutions offer (commercial switches often have QOS but not shaping, at least the lower end models).

you should try this: get a Tomato-compatible router, flash the firmware, then turn on QOS. you'll have to set your bandwidth cap correctly to avoid contention. you'll be shocked at the difference. alternatively, for an even simpler solution, buy a 'gaming' router and just plug it in - I use a DLINK DI-634M as a home solution and it works wonders (it even attempts to guess the available bandwidth instead of requiring you to set it manually).

buffering outside of the last mile just isn't an issue, unless something very bad is going on. the backbone providers simply provision enough bandwidth to ensure they never overflow the buffers.. the issue is all about trying to squeeze too much data through a slow pipe.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 23:49 UTC (Tue) by mgedmin (subscriber, #34497) [Link] (2 responses)

Sadly, when I last tried to enable QOS on a WRT54GL running some version of OpenWRT, it would just go catatonic every few hours and then reboot. Not enough RAM? CPU too weak? No idea.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 23:55 UTC (Tue) by adamgundy (subscriber, #5418) [Link] (1 responses)

there were some issues with the QOS stuff in kernels < 2.6.16 where it would overload the CPU under network pressure..

try tomato.. it's much slicker and runs on a WRT54GL no problem.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 6:21 UTC (Wed) by Los__D (guest, #15263) [Link]

Tomato might be slicker (it is what I run myself), but it is also VERY limited compared to OpenWRT.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 20:31 UTC (Tue) by marcH (subscriber, #57642) [Link] (3 responses)

> 3) inbound shaping is impossible to do correctly - it needs to be shaped outbound, at the other end of the bottleneck.

It is hard but not totally impossible: you can cheat TCP into some form of "inbound shaping". The trick is to artificially create an inbound bottleneck on your premises. Even though this bottleneck is downstream, as long as it is somewhat "narrower" than any other bottleneck on the path, it will be the first to drop packets, which will halve the TCP congestion window of the senders. By keeping the queue size of the artificial bottleneck small, you can preserve latency.

Of course this does not work at all if you are hammered down by oblivious inbound UDP traffic.
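
On Linux, the trick can be sketched with an ingress policer (eth0 and the 5mbit rate are assumptions; set the rate somewhat below your real downlink):

tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 police rate 5mbit burst 20k drop flowid :1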

> I see three real solutions:
> (a) uncontended bandwidth, so you can shape appropriately. anyone thinking *that's* going to happen for consumer internet is clearly deluded..

I am not sure what you mean by "uncontended". No matter where the bottleneck between A and B is, TCP is designed to reach it. See the Cisco reference I posted above.

> (b) some way of having the modem tell the router about backpressure (and yes, that's in the ethernet standards, but they never *do* it) and disable or reduce its buffering.

Backpressure is just pushing the problem one step further. And it typically adds a Head Of Line blocking problem when multiplexing.

> (c) have the modem itself support QOS levels and fair queueing.. then its buffers can be put to good use without screwing up latency. seems like an easy one until you realize that it's not that easy - you have to configure the right ports, protocols, etc,

I think the configuration complexity of this is what killed the alternatives to IP. In other words, IP won because it did not care about QoS, in accordance with the end-to-end principle.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 21:23 UTC (Tue) by adamgundy (subscriber, #5418) [Link] (2 responses)

>> 3) inbound shaping is impossible to do correctly - it needs to be shaped outbound, at the other end of the bottleneck.

> It is hard but not totally impossible, you can cheat TCP into some form of
> "inbound shaping". The trick is to artificially create an inbound
> bottleneck on your premises. Even when this bottleneck is downtream, as
> long as it is somewhat "narrower" than any other bottleneck on the path it
> will be the first to drop packets which will half the TCP congestion
> window of senders. By keeping the queue size of the artificial bottleneck
> small you can preserve latency.

tried this.. it didn't work (well). you end up with 'sawtooth' latency as the existing connections try to ramp up, then fall back. you have to throttle the bandwidth down enough to deal with any upswing, which is composed of all the TCP connections that decide to ramp up at any point in time.. you have to throw away a big chunk of your bandwidth to make it work (and it's still prone to someone starting a torrent, or using an FTP client that opens 20 connections at once)

end result is you fight a losing battle - you throw away a decent chunk of your expensive bandwidth (eg 30%), and *still* occasionally have crappy-sounding phone calls because an FTP client went nuts with dozens of connections... or a bunch of spammers hit your email server at the same time. sigh.

> Of course this does not work at all if you are hammered down by oblivious inbound UDP traffic.

yup. but less likely, unless you're being DOSed

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 21:47 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

> you have to throw away a big chunk of your bandwidth to make it work (and it's still prone to someone starting a torrent, or using an FTP client that opens 20 connnections at once)

Agreed, this trick is insanely complicated to set up and fine-tune, and works only in simple cases. I should have highlighted this; thanks for doing it.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 10:31 UTC (Wed) by i3839 (guest, #31386) [Link]

All you need to know is your upload and download bandwidth; then throw away at least 20% of your download speed, more if you want less jitter in the cases mentioned.

Modems are usually the bottleneck and know the link speed, so they could do the above automatically. Even fine-tuning could happen automatically, by monitoring the delays and adjusting the limits if latency gets too high.

It may not be perfect, but it's a hell of a lot better than the default.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 17:40 UTC (Tue) by luto (guest, #39314) [Link] (4 responses)

I wonder how much of an improvement we'd see if people started using ECN? It's supported, but off by default, on Linux, Windows Vista, and Windows 7.

In principle, if all the networks used some sensible active queue management, and all routers and apps supported ECN, then under good conditions and light to moderate load there would be no need to drop packets.
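
Enabling it on Linux is a one-liner, for anyone who wants to experiment (this only turns on negotiation; the rest of the path still has to cooperate):

sysctl -w net.ipv4.tcp_ecn=1   # request ECN on outgoing connections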

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 18:20 UTC (Tue) by jg (guest, #17537) [Link] (3 responses)

It's not clear whether we can use ECN or not.

The story I've heard is that a particular Taiwanese ODM shipped a pile of code used in home routers a while ago that keels over and dies if it sees an ECN bit set. Various ISPs, big institutional networks, and the like have been avoiding service calls by clearing the ECN bit.

Research is underway to see if we can go ahead and deploy it.

If we can, it's almost certainly the case that ECN should get marked anytime the buffers begin to grow (before packet loss occurs).
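
Linux's RED qdisc can already behave that way: with the ecn flag it marks packets probabilistically as the average queue grows, and only drops once the hard limit is reached. A sketch (every rate and threshold here is an assumption to be tuned per link):

    # RED on an assumed 10 Mbit/s link: start marking around 30 KB of
    # average queue, mark hard by 90 KB, drop only past the 400 KB limit
    tc qdisc add dev eth0 root red \
        limit 400000 min 30000 max 90000 avpkt 1000 \
        burst 50 bandwidth 10mbit probability 0.02 ecn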

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 11, 2010 5:08 UTC (Sat) by kijiki (subscriber, #34691) [Link] (1 responses)

> If we can, it's almost certainly the case that ECN should
> get marked anytime the buffers begin to grow (before packet
> loss occurs).

You can do even better by setting ECN bits on a percentage of packets in a flow proportional to the depth of the buffer. Sadly, it requires modifying endpoint OSes not to halve the congestion window every time they see an ECN mark, so it is probably not deployable outside of datacenter clusters.

http://research.microsoft.com:8081/pubs/121386/dctcp-publ...

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 11, 2010 6:57 UTC (Sat) by njs (subscriber, #40338) [Link]

That was a very nice read -- thanks for the link!

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 13, 2010 17:23 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

I'd be interested in seeing the results of this research. I've been running with ECN enabled for ages now and I've never had a problem. In 2003? Sure. Now? It works perfectly. Let's turn it on by default.

Already known by most LWN readers?

Posted Dec 7, 2010 18:24 UTC (Tue) by martinfick (subscriber, #4455) [Link]

BitTorrent developers dealt with this by creating:

http://en.wikipedia.org/wiki/Micro_Transport_Protocol

This was mentioned on LWN back in May:

http://lwn.net/Articles/389105/

Does Jim not have an LWN subscription? :)

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 7, 2010 19:27 UTC (Tue) by piggy (guest, #18693) [Link]

Excellent article.

I can attest that the understanding of the harm done by large buffers is far from universal. A few years ago I found myself trying to convince an office full of networking professionals that our device needed to drop packets, and drop them quickly. Fortunately, my job was to exercise the equipment under load. It was very easy to construct plausible examples which performed terribly. This led to a significant refactoring of our product.

What about application level?

Posted Dec 7, 2010 22:42 UTC (Tue) by sustrik (guest, #62161) [Link] (1 responses)

I wonder how buffering on the application level fits into the picture. Specifically, messaging brokers (all kinds of MQ-style products) are basically just extremely large buffers. They can store gigabytes of data, effectively delaying the propagation of backpressure by hours or even days.

What about application level?

Posted Dec 8, 2010 15:50 UTC (Wed) by jg (guest, #17537) [Link]

Yes, this happens.

I've seen bufferbloat in many applications (including the X server).

Apps are just as prone to bufferbloat as the network.

Think end to end: you have to manage all buffers to control latency. The trick is figuring out how to manage them efficiently and correctly. This turns out to be subtle and sometimes hard, and there is some real research to do. It is particularly difficult when you are in situations where predicting the available bandwidth is difficult (e.g. wireless, graphics). One size never fits all.

- Jim

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 5:06 UTC (Wed) by gdt (subscriber, #6284) [Link] (1 responses)

Nice work :-)

My own experience tuning tails and customer equipment is that the issue is not so much the amount of buffering as the buffer playout algorithm. That is, too much of the buffer is exposed to any one TCP connection because the router is running FIFO rather than a real queueing algorithm. Also, what is exposed allows resource starvation of the buffer -- that is, some flows get too much buffer and some get far too little. The poor buffering makes unfair TCP algorithms even unfairer; you can see Linux's CUBIC suffer from that when running parallel flows.

I notice some commenters are drawing conclusions about all routers from tail routers. A tail sees more congestion, but also a smaller range of bandwidth-delay products in its TCP connections. The core sees less congestion but a massive range of BDPs. The tuning is very different, with the core focused on avoiding synchronisation (since 40Gbps of in-sync traffic can really mess up your backbone).
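
The FIFO point is the easiest one to act on at a tail: replacing the default pfifo_fast with stochastic fairness queueing keeps any single TCP connection from monopolising the buffer. A one-line sketch (reasonable for a home tail, not a universal answer, and the interface name is an assumption):

    # one sub-queue per flow, serviced round-robin; perturb rehashes
    # the flow classifier every 10s to break up persistent collisions
    tc qdisc add dev eth0 root sfq perturb 10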

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 8, 2010 16:02 UTC (Wed) by jg (guest, #17537) [Link]

Yes, I've not mentioned time-based congestion (yet), and self-synchronization is a phenomenon known to occur in the Internet.

You can see periodic behavior in my traces; if there are tendencies (as there often are in large systems) for such behavior to synchronize, you can get really bad aggregate time-based congestion behavior.

Best, of course, is that the underlying flows don't start exhibiting periodic behavior in the first place, so they won't participate in such synchronization.

A future posting will cover my fears for the future.

We've effectively destroyed congestion avoidance by this buffering, for TCP and other protocols.

So I have to fear possible congestion collapse. The more subtle form of this is the self-synchronizing aggregate behavior.

And the traffic pattern of the Internet is now finally shifting as Windows XP gets retired. Just because we don't see bad congestion collapse today (with the exception of 3G networks) doesn't mean we won't see it in the future, as the overall traffic shifts toward TCPs that run with window scaling and are more able to congest all links. And time-based behavior is more subtle than the simple congestion-collapse case.

Maybe I'm paranoid (having experienced the NSFnet collapse), but I do lose sleep over it.
- Jim

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 12, 2010 21:33 UTC (Sun) by astrophoenix (guest, #13528) [Link]

The old Linux QoS howto, from 2004(!), addresses buffer bloat and has a recipe for configuring QoS on a Linux router to mitigate it:

http://lartc.org/howto/lartc.cookbook.ultimate-tc.html

I fired up four bittorrents and started tuning my upload and download limits. 13/16 for download and 3/4 for upload is working well today.

Before this, I was in the habit of only starting bittorrent before I went to bed; I couldn't surf the web from my laptop while it was going! Now I can't even tell if bittorrent is going while I surf.
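
For anyone who wants the gist without reading the whole cookbook, the upload half of that recipe boils down to something like this sketch (the 512 kbit/s uplink, the ~75% shaping rate, and the interface name are all assumptions to adapt):

    # shape all egress to ~3/4 of an assumed 512 kbit/s uplink, so the
    # modem's own oversized, unmanaged buffer never gets a chance to fill
    tc qdisc add dev ppp0 root handle 1: htb default 10
    tc class add dev ppp0 parent 1: classid 1:10 htb rate 384kbit ceil 384kbit
    # fair-queue within the shaped class so no single flow hogs it
    tc qdisc add dev ppp0 parent 1:10 handle 10: sfq perturb 10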

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 21, 2010 17:10 UTC (Tue) by JesperBrouer (guest, #62728) [Link] (5 responses)

Skimming through Gettys' blog post, I think he has actually missed what is happening. He should read my master's thesis[1], specifically page 11.

The real problem is that TCP/IP is clocked by the ACK packets, and on asymmetric links (like ADSL and DOCSIS) the ACK packets are simply coming downstream too fast (on the larger downstream link), resulting in bursts and high latency on the upstream link.

With the ADSL-optimizer I actually solved Gettys' problem, by having an ACK queue which is bandwidth-"sized" to the opposite link's rate. But I guess the real solution would be to implement a TCP algorithm which handles this asymmetry and, e.g., isn't based on the ACK feedback...

[1] http://www.adsl-optimizer.dk/thesis/

--Jesper Dangaard Brouer
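
Brouer's ADSL-optimizer paces the ACK stream itself; a cruder cousin of the idea, taken from the LARTC cookbook rather than from his thesis, simply classifies bare ACKs into a priority band that bulk upstream data cannot starve (a sketch, not his implementation; the interface name is an assumption):

    tc qdisc add dev ppp0 root handle 1: prio
    # match TCP (protocol 6), a 20-byte IP header, total length < 64
    # bytes, and the ACK flag at byte 33 -- i.e. bare ACKs -- and lift
    # them into the highest-priority band
    tc filter add dev ppp0 parent 1: protocol ip prio 10 u32 \
        match ip protocol 6 0xff \
        match u8 0x05 0x0f at 0 \
        match u16 0x0000 0xffc0 at 2 \
        match u8 0x10 0xff at 33 \
        flowid 1:1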

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 23, 2010 0:49 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (4 responses)

That makes no sense. If what you say were true, we shouldn't see delays to latency-sensitive upstream traffic while a large *upstream* transfer is running and the downstream link is idle -- but we do. And if what you say were true, one couldn't address the issue by creating a bandwidth constriction using traffic shaping and a short queue -- but one can.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 30, 2010 8:10 UTC (Thu) by JesperBrouer (guest, #62728) [Link] (3 responses)

The high latencies are of course caused by packets being queued, often in too-big (default?) queues; I'm not saying Gettys is wrong.

I'm just pointing out that part of the TCP/IP clocking mechanism breaks down on asymmetric links. A mechanism that handled the asymmetry would mitigate this scenario by giving smoother data clocking on the upstream link.

I think the root cause of the buffer-bloat problem is that ISPs are configuring a default queue size on all their products, regardless of the product's bandwidth.

ISPs should really choose the queue size based on the product's bandwidth.

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 30, 2010 14:17 UTC (Thu) by JesperBrouer (guest, #62728) [Link]

I have created a blog post on the buffer-bloat subject:
Buffer-Bloat: The calculations

There I explain where the delay comes from and how to do the math.
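
The core arithmetic is worth restating here, since it is what makes queue sizing obvious: worst-case queueing delay is simply buffer size divided by drain rate. With illustrative (assumed) numbers:

    worst-case delay = buffer size / link rate

    a 256 KB buffer draining at 512 kbit/s:
        256 * 1024 * 8 bits / 512,000 bit/s  ~= 4.1 seconds

    the same 256 KB buffer at 100 Mbit/s:
        2,097,152 bits / 100,000,000 bit/s   ~= 21 milliseconds

The identical buffer that is harmless at LAN speed becomes four seconds of latency at ADSL speed, which is exactly why one default size cannot fit every product.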

Gettys: Whose house is of glasse, must not throw stones at another

Posted Dec 31, 2010 17:36 UTC (Fri) by paulj (subscriber, #341) [Link] (1 responses)

Why do you think the outsize buffers are at the ISP? Were you able to rule out buffers in the source computers or the ADSL routers?

E.g. Jim pointed out that the default maximum Tx queue length on Linux interfaces is 1000 packets for many kinds of interfaces (ethernet, even some wireless). This is excessively large for many ethernets, and more so given the wide-scale use of slow wifi. After setting it much lower (e.g. 100 packets, which still allows roughly 90 ms of serialisation delay on an 11 Mbit/s effective-capacity link with 1200-byte packets), I see much tighter variance in latency between hosts connected by wifi.
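
Trying the same experiment is a one-liner (100 is the value used above; substitute your own wireless interface name):

    # shrink the transmit queue from the 1000-packet default
    ifconfig wlan0 txqueuelen 100
    # or, with iproute2:
    ip link set dev wlan0 txqueuelen 100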

Gettys: Whose house is of glasse, must not throw stones at another

Posted Jan 3, 2011 21:22 UTC (Mon) by JesperBrouer (guest, #62728) [Link]

The buffer I measured was in the ADSL modem, because it is the bottleneck in the setup, quite simply: my test PC is connected to the ADSL modem via a 100 Mbit/s port, and the exit point from the ADSL modem is a 512 Kbit/s link...

The very interesting point made by Gettys (which I hadn't thought about) is that wifi introduces yet another bottleneck into the system. With wifi in the equation, the buffering occurs on your Linux box, which has significant txqueuelen buffer bloat (I just checked: my wlan0 has txqueuelen=1000, which should be lowered for wifi!).

