LPC: An update on bufferbloat
"Bufferbloat" is the problem of excessive buffering used at all layers of
the network, from applications down to the hardware itself. Large buffers
can create obvious latency problems (try uploading a large file from a home
network while somebody else is playing a fast-paced network game and you'll
be able to measure the latency from the screams of frustration in the other
room), but the real issue is deeper than that. Excessive buffering wrecks
the control loop that enables implementations to maximize throughput
without causing excessive congestion on the net. The experience of the
late 1980s showed how bad a congestion-based collapse of the net can be;
the idea that bufferbloat might bring those days back is frightening to
many.
The initial source of the problem, Jim said, was the myth that dropping packets is a bad thing to do combined with the fact that it is no longer possible to buy memory in small amounts. The truth of the matter is that the timely dropping of packets is essential; that is how the network signals to transmitters that they are sending too much data. The problem is complicated by the use of the bandwidth-delay product to size buffers. Nobody really knows what either the bandwidth or the delay are for a typical network connection. Networks vary widely; wireless networks can be made to vary considerably just by moving across the room. In this environment, he said, no static buffer size can ever be correct, but that is exactly what is being used at many levels.
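A little back-of-the-envelope arithmetic shows why a single bandwidth-delay product cannot yield one buffer size; the link speeds and round-trip times below are only illustrative:

```
# Bandwidth-delay product = bandwidth (bytes/s) * round-trip time (s)
echo "100 * 1000000 / 8 * 0.001" | bc -l   # 100Mbit/s LAN, 1ms RTT   -> ~12500 bytes in flight
echo "6 * 1000000 / 8 * 0.100"   | bc -l   # 6Mbit/s DSL, 100ms RTT   -> ~75000 bytes in flight
echo "1 * 1000000 / 8 * 0.300"   | bc -l   # 1Mbit/s 3G, 300ms RTT    -> ~37500 bytes in flight
```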
As a result, things are beginning to break. Protocols that cannot handle much in the way of delay or loss - DNS, ARP, DHCP, VoIP, or games, for example - are beginning to suffer. A large proportion of broadband links, Jim said, are "just busted." The edge of the net is broken, but the problem is more widespread than that; Jim fears that bloat can be found everywhere.
If static buffer sizes cannot work, buffers must be sized dynamically. The RED algorithm is meant to do that sizing, but it suffers from one little problem: it doesn't actually work. The problem, Jim said, is that the algorithm knows about the size of a given buffer, but it knows nothing about how quickly that buffer is draining. Even so, it can improve matters in some situations. But it requires quite a bit of tuning to work right, so a lot of service providers simply do not bother. Efforts to create an improved version of RED are underway, but the results are not yet available.
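For a sense of the tuning burden involved, this is roughly what configuring classic RED on a Linux interface looks like; every value below has to be guessed per link, and the numbers here are illustrative only:

```
# Classic RED on a hypothetical 10Mbit/s uplink; min/max/limit are in bytes
# and all of them must be hand-tuned for the link's bandwidth and RTT.
tc qdisc add dev eth0 root red \
    limit 400000 min 30000 max 90000 avpkt 1000 \
    burst 55 probability 0.02 bandwidth 10mbit ecn
```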
A real solution to bufferbloat will have to be deployed across the entire net. There are some things that can be done now; Jim has spent a lot of time tweaking his home router to squeeze out excessive buffering. The result, he said, involved throwing away a bit of bandwidth, but the resulting network is a lot nicer to use. Some of the fixes are fairly straightforward; Ethernet buffering, for example, should be proportional to the link speed. Ring buffers used by network adapters should be reviewed and reduced; he found himself wondering why a typical adapter uses the same size for the transmit and receive buffers. There is also an extension to the DOCSIS standard in the works to allow ISPs to remotely tweak the amount of buffering employed in cable modems.
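On an ordinary Linux box, the buffers in question can at least be inspected and shrunk by hand; a sketch (the interface name and sizes are examples, and not every driver allows its rings to be changed):

```
ethtool -g eth0                       # show the adapter's current RX/TX ring sizes
ethtool -G eth0 tx 64                 # shrink the transmit ring, if the driver allows it
ip link set dev eth0 txqueuelen 100   # scale the software transmit queue to the link speed
```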
A complete solution requires more than that, though. There are a lot of hidden buffers out there in unexpected places; many of them will be hard to find. Developers need to start thinking about buffers in terms of time, not in terms of bytes or packets. And we'll need active queue management in all devices and hosts; the only problem is that nobody really knows which queue management algorithm will actually solve the problem. Steve Hemminger noted that there are no good multi-threaded queue-management algorithms out there.
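Thinking in time rather than bytes is easy to motivate with a quick calculation: the same fixed-size buffer that is harmless at Ethernet speeds becomes a latency disaster on a slow uplink.

```
# Worst-case drain time of a 256-packet buffer of 1500-byte packets:
echo "256 * 1500 * 8 / 100000000" | bc -l   # ~0.03s at 100Mbit/s
echo "256 * 1500 * 8 / 1000000"   | bc -l   # ~3.1s  at 1Mbit/s
```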
CeroWRT
Jim yielded to Dave Täht, who talked about the CeroWRT router distribution. Dave pointed out that, even when we figure out how to tackle bufferbloat, we have a small problem: actually getting those fixes to manufacturers and, eventually, users. A number of popular routers are currently shipping with 2.6.16 kernels; it is, he said, the classic embedded Linux problem.
One router distribution that is doing a better job of keeping up with the
mainline is OpenWRT. Appropriately,
CeroWRT is based on OpenWRT; its purpose is to complement
the debloat-testing kernel tree and provide
a platform for real-world testing of bufferbloat fixes. The goals behind
CeroWRT are to always be within a release or two of the mainline kernel,
provide reproducible results for network testing, and to be reliable enough
for everyday use while being sufficiently experimental to accept new stuff.
There is a lot of new stuff in CeroWRT. It has fixes to the packet aggregation code used in wireless drivers that can, in its own right, be a source of latency. The length of the transmit queues used in network interfaces has been reduced to eight packets - significantly smaller than the default values, which can be as high as 1000. That change alone is enough, Dave said, to get quality-of-service processing working properly and, he thinks, to push the real buffering bottleneck to the receive side of the equation. CeroWRT runs a tickless kernel, and enables protocol extensions like explicit congestion notification (ECN), selective acknowledgments (SACK), and duplicate SACK (DSACK) by default. A number of speedups have also been applied to the core netfilter code.
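Most of these knobs can be approximated on a stock Linux system for experimentation; a sketch with a placeholder interface name (CeroWRT applies its own configuration, so this only illustrates the settings involved):

```
ip link set dev eth0 txqueuelen 8      # tiny transmit queue, as in CeroWRT
sysctl -w net.ipv4.tcp_ecn=1           # request/accept explicit congestion notification
sysctl -w net.ipv4.tcp_sack=1          # selective acknowledgments
sysctl -w net.ipv4.tcp_dsack=1         # duplicate SACK
```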
CeroWRT also includes a lot of interesting software, including just about every network testing tool the developers could get their hands on. Six TCP congestion algorithms are available, with Westwood used by default. Netem (a network emulator package) has been put in to allow the simulation of packet loss and delay. There is a bind9 DNS server with an extra-easy DNSSEC setup. Various mesh networking protocols are supported. A lot of data collection and tracing infrastructure has been added from the web10g project, but Dave has not yet found a real use for the data.
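Both pieces are available in mainline kernels as well; a sketch of selecting Westwood and using netem to inject delay and loss for testing (interface and numbers are arbitrary):

```
modprobe tcp_westwood
sysctl -w net.ipv4.tcp_congestion_control=westwood
# Simulate a slow, lossy path on an outgoing interface:
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 0.5%
```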
All told, CeroWRT looks like a useful tool for validating work done in the fight against bufferbloat. It has not yet reached its 1.0 release, though; there are still some loose ends to tie and some problems to be fixed. For now, it only works on the Netgear WNDR3700v2 router - chosen for its open hardware and relatively large amount of flash storage. CeroWRT should be ready for general use before too long; fixing the bufferbloat problem is likely to take rather longer.
[Your editor would like to thank LWN's subscribers for supporting his
travel to LPC 2011.]
| Index entries for this article | |
|---|---|
| Kernel | Networking/Bufferbloat |
| Conference | Linux Plumbers Conference/2011 |
Posted Sep 13, 2011 17:22 UTC (Tue)
by arekm (guest, #4846)
[Link]
Posted Sep 13, 2011 18:13 UTC (Tue)
by arjan (subscriber, #36785)
[Link] (31 responses)
If you end up with a micro-small buffer, you may get better behavior on the wire (that is the part that is clearly interesting, and essential for Jim's work), but it also means the OS will be waking up *all the time*, which has the risk of destroying power management...
Power management matters both for mobile devices and for servers... and guess which devices are at the edges of the network ;-(
Posted Sep 13, 2011 19:08 UTC (Tue)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Sep 13, 2011 19:36 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 13, 2011 19:27 UTC (Tue)
by Richard_J_Neill (subscriber, #23093)
[Link] (7 responses)
BTW, have you ever noticed that web-access on a 3G phone is sometimes really, really, really slow (20 seconds + for a google query, yet simultaneously quite quick for a large webpage load)? This is buffer bloat in action.
Lastly, it's worth pointing out that the Application can still buffer all it wants; we are considering the TCP buffering here. If my application wants to send a 1 MB file, there is no reason why the kernel shouldn't give userspace a whole 1MB of zero-copy buffer. That's outside the TCP end-points. The problem is that the TCP-end-points themselves need to see timely packet loss, in order to back-off slightly on transmit rates.
Posted Sep 13, 2011 19:35 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 14, 2011 6:07 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (3 responses)
My phone will often poke away for a while trying to use a data connection, conclude it's not working, re-establish it, conclude it's still not working, switch to 2G (GPRS) and set that up, roam to a new base station as signal strength varies, try to upgrade to 3G again, fail, and then eventually actually do what I asked.
Realistically, unless you're trying to do a google search while numerous other nearby people are watching TV / streaming video / downloading files / etc on their phones on the same network, it's probably more likely to be regular cellular quirks than bufferbloat.
As for power: Yep, the screen gets the blame for the majority of the power use on android phones. I can't help suspecting that means "display and GPU" though, simply because of the overwhelming power use. That's a total guess, but if GPU power is attributed under "System" then (a) it's insanely efficient and (b) Apple have invented new kinds of scary magic for their displays to allow them to run brighter, better displays for longer on similar batteries to Android phones.
Posted Sep 14, 2011 7:05 UTC (Wed)
by Los__D (guest, #15263)
[Link] (2 responses)
Posted Sep 14, 2011 14:48 UTC (Wed)
by Aissen (subscriber, #59976)
[Link] (1 responses)
(I'm one of those who went the AMOLED route and will probably never come back.)
Posted Sep 14, 2011 17:04 UTC (Wed)
by Los__D (guest, #15263)
[Link]
OTOH, I've seen many complaints about low brightness and yellowish tint on the S2, maybe there are different versions of the screen?
Posted Sep 14, 2011 8:19 UTC (Wed)
by lab (guest, #51153)
[Link]
Yes, exactly. This is a well-known fact. If you have an AMOLED (variant) screen, everything light/white consumes a lot of power; black, nothing.
>BTW, have you ever noticed that web-access on a 3G phone is sometimes really, really, really slow (20 seconds + for a google query, yet simultaneously quite quick for a large webpage load)? This is buffer bloat in action.
Interesting. I have the same observation, and always thought of it as "the mobile web has huge latency but quite good bandwidth".
Posted Sep 15, 2011 6:44 UTC (Thu)
by Cato (guest, #7643)
[Link]
Posted Sep 13, 2011 21:24 UTC (Tue)
by njs (subscriber, #40338)
[Link] (16 responses)
So the way network buffering in Linux works is: first a packet enters the "qdisc" buffer in the kernel, which can do all kinds of smart prioritization and what-not. Then it drains from the qdisc into a second buffer in the device driver, which is pure FIFO.
I've experienced 10-15 seconds of latency in this second, "dumb" buffer, with the iwlwifi drivers in real usage. (Not that iwlwifi is special in this regard, and I have terrible radio conditions, but to give you an idea of the magnitude of the problem.) That's a flat 10-15 seconds added to every DNS request, etc. So firstly, that's just way too large, well beyond the point of diminishing returns for power usage benefits. My laptop is sadly unlikely to average 0.1 wakeups/second, no matter what heroic efforts the networking subsystem makes.
*But* even this ridiculous buffer isn't *necessarily* a bad thing. What makes it bad is that my low-priority background rsync job, which is what's filling up that buffer, is blocking high-priority latency-sensitive things like DNS requests. That big buffer would still be fine if we could just stick the DNS packets and ssh packets and stuff at the head of the line whenever they came in, and dropped the occasional carefully-chosen packet.
But, in the kernel we have, prioritization and AQM in general can only be applied to packets that are still in the qdisc. Once they've hit the driver, they're untouchable. So what we want is prioritization, but the only way we can get it is to reduce the device buffers as small as possible, to force packets back up into the qdisc. This is an ugly hack. The real long-term solution is to enhance the drivers so that the AQM can reach in and rearrange packets that have already been queued.
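The split between those two buffers can be watched from user space; a sketch (the interface name is an example):

```
tc -s qdisc show dev wlan0   # qdisc-level backlog (bytes/packets) and drops
ethtool -g wlan0             # the driver's ring sizes -- the "dumb" FIFO below the qdisc
ip -s link show wlan0        # txqueuelen plus interface-level drop counters
```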
This is exactly analogous to the situation with sound drivers, actually. We used to have to pick between huge latency (large audio buffers) and frequent wakeups (small audio buffers). Now good systems fill up a large buffer of audio data ahead of time and then go to sleep, but if something unpredictable happens ("oops, incoming phone call, better play the ring tone!") then we wake up and rewrite that buffer in flight, achieving both good power usage and good latency. We need an equivalent for network packets.
Posted Sep 13, 2011 21:37 UTC (Tue)
by njs (subscriber, #40338)
[Link] (3 responses)
And in the long run, packets will be dropped intelligently according to some control law that tries to maintain fairness and provide useful feedback to senders (i.e., AQM). For now we mostly use the "drop packets when the buffer overflows" rule, which is a terrible control law, but may be less terrible with small buffers than large ones.
So in the long run buffer size doesn't matter, but in the short run it's a single knob that's coupled to a ton of theoretically unrelated issues, and decoupling it is going to be a pain.
Posted Sep 13, 2011 21:55 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
to a small extent, the IP stack on your endpoint device can be considered such a router as it aggregates the traffic from all your applications, but the biggest problem of large buffers is where you transition from a fast network to a slow one.
in many cases this is going from the high speed local network in someone's house to the slow DSL link to their ISP
I also think that you are mistaken if you think that any computers send a large number of packets to the network device and then shift to a low power state while the packets are being transmitted. usually shifting to such a low power state ends up powering off the network device as well.
servers almost never do something like this because as soon as they are done working on one request, they start working on another one.
shifting power states is a slow and power-hungry process; it's not done frequently, and on reasonably busy servers it seldom takes place
Posted Sep 13, 2011 21:59 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 13, 2011 22:59 UTC (Tue)
by jg (guest, #17537)
[Link]
It isn't clear that the current network stack buffering architecture in Linux is what it needs to be (in fact, pretty clear it needs some serious surgery; and I'm not the person to say what it really needs to be).
Ultimately, we need AQM *everywhere* that "just works". Right now we don't have an algorithm that is known to fit this description.
And yes, I'm very aware that power management folds into this; I've worked on iPAQ's as you may or may not remember, from which other hand held Linux devices were at least inspired.
Lots of interesting work to do...
Posted Sep 13, 2011 21:40 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 14, 2011 1:35 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
However, it would be nice to do it automatically.
Posted Sep 14, 2011 6:26 UTC (Wed)
by njs (subscriber, #40338)
[Link] (3 responses)
Actually, there is one other way to make traffic shaping work -- if you throttle your outgoing bandwidth, then that throttling is applied in between the qdisc and the device buffers, so the device buffers stop filling up. Lots of docs on traffic shaping say that this is a good idea to work around problems with your ISP's queue management, but in fact it's needed just to work around problems within your own kernel's queue management. Also, you can't really do it for wifi since you don't know what the outgoing bandwidth is at any given time, and in the case of 802.11n, this will actually *reduce* that bandwidth because hiding packets from the device driver will screw up its ability to do aggregation.
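For the record, that kind of throttling is usually a token-bucket qdisc set a bit below the real uplink rate; a sketch (the rate is a placeholder for a measured uplink):

```
# Shape egress to roughly 90% of a measured 768kbit/s uplink so the device/modem
# buffers never fill and the qdisc stays in charge of what gets sent next.
tc qdisc add dev eth0 root tbf rate 700kbit burst 10kb latency 50ms
```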
Posted Sep 14, 2011 14:34 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
In my own tests, I was able to segregate heavy BitTorrent traffic from VoIP traffic and get good results on a puny 2MBit line.
I did lose some bandwidth, but not that much.
Posted Sep 14, 2011 16:37 UTC (Wed)
by njs (subscriber, #40338)
[Link]
There isn't much the qdisc can do here, prioritization-wise -- even if packet 2 turns out to be high priority, it can't "take back" packet 1 and send packet 2 first. Packet 1 is already gone.
In your tests you used bandwidth throttling, which moved the chokepoint so that it fell in between the qdisc buffer and your device buffer. You told the qdisc to hold onto packets and only let them dribble out at a fixed rate, and chose that rate so that it would be slower than the rate the device buffer drained. So the device buffer never filled up, and the qdisc actually had multiple packets visible at once and was able to reorder them.
Posted Sep 29, 2011 0:02 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Posted Sep 14, 2011 5:17 UTC (Wed)
by russell (guest, #10458)
[Link] (4 responses)
This would place an absolute upper bound on latency between the app and the wire.
Posted Sep 14, 2011 16:55 UTC (Wed)
by njs (subscriber, #40338)
[Link]
And there are a lot of advantages to picking the *right* packets to drop -- if the packet you drop happens to be DNS, or interactive SSH, or part of a small web page, then you'll cause an immediate user-visible hiccup, and won't get any benefits in terms of reduced contention (like you would if you had dropped a packet from a long-running bulk TCP flow that then backs off). Then again, maybe that's okay, and re-ordering packets that have already been handed off to the driver does sound pretty tricky! (And might require hardware support.)
But it's useful to try and find the "right" solution first, because that way even if you give up on achieving it, at least in plan B you know what you're trying to approximate.
Posted Sep 14, 2011 17:19 UTC (Wed)
by dmarti (subscriber, #11625)
[Link] (2 responses)
Yes, it would be a win to have hardware that can timestamp packets going into its buffers and drop "stale" ones on the way out instead of transmitting them. (relevant thread on timestamps from the bufferbloat list). Right now, hardware assumes that late is better than never, and TCP would prefer never over too late.
Posted Sep 15, 2011 7:58 UTC (Thu)
by johill (subscriber, #25196)
[Link] (1 responses)
Posted Sep 15, 2011 16:06 UTC (Thu)
by dmarti (subscriber, #11625)
[Link]
That would be useful to see. It looks like the problem of bufferbloat is that packets stay in the buffer until they get stale -- so checking staleness directly, ideally without having to involve the CPU, could be a way to save having to tune the buffer size. (If you run the café in Grand Central Station, you need to bake a bigger "buffer" of muffins than the low-traffic neighborhood place does. But customers at the two places should get the same muffin freshness as long as both have the same policy of dropping day-old muffins.)
Posted Sep 15, 2011 21:43 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link]
Not necessarily. There may be hardware support for queues with differing priority. The mqprio qdisc takes advantage of this.
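For hardware with multiple transmit queues, that looks something like the following (a sketch only; the priority-to-class map and queue layout depend entirely on the NIC):

```
# Map skb priorities onto 3 traffic classes spread over 4 hardware TX queues.
tc qdisc add dev eth0 root mqprio num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
    queues 1@0 1@1 2@2 hw 0
```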
Posted Sep 14, 2011 5:01 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (1 responses)
If you want to save power there is really only one good way to go about it - hardware support for packet pacing, where the driver instructs the hardware: here is a series of ten packets, schedule for transmission at X microsecond intervals. Adapting that to TCP is kind of a trick, but the point is that the kernel can then go to sleep for a considerably longer time before it has to wake up again to do proper congestion control.
Unfortunately, as far as I know, there are no Ethernet chipsets out there with support for hardware packet pacing. It is a big problem, because the kernel can only respond so fast (and actually get any other work done) to incoming traffic on a high bandwidth interface, so what we get instead is ACK compression where a much longer series of packets gets queued for transmission in a short period of time, swamping network queues and increasing packet loss and jitter.
Posted Sep 15, 2011 21:53 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link]
In the absence of pacing, it might be more useful to vary TX interrupt moderation (if there are separate interrupts for TX and RX).
Posted Sep 14, 2011 17:39 UTC (Wed)
by rgmoore (✭ supporter ✭, #75)
[Link]
Maybe the correct solution to this problem is to have a way of setting the buffer size dynamically. Before you go to sleep, you grow the buffer size because going to sleep is a big flashing sign that latency is not a big worry right now. Shrinking the buffer again when you wake up is likely to be a bigger problem, or at least shrinking it without just throwing away the extra packets you collected in the bigger buffer while you were asleep, but there ought to be some way of doing it. Coming up with good heuristics about how big it ought to be depending on your circumstances (beyond just sleep vs. awake) is probably the hardest part.
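As a toy illustration of the idea (nothing in the stock kernel does this automatically today), the queue length could be toggled from a power-management hook:

```
# Hypothetical suspend/resume hook: trade latency for batching while idle.
ip link set dev wlan0 txqueuelen 1000   # about to idle: big buffer, fewer wakeups
ip link set dev wlan0 txqueuelen 16     # interactive again: small buffer, low latency
```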
Posted Sep 28, 2011 23:21 UTC (Wed)
by marcH (subscriber, #57642)
[Link]
I do not think power management changes the picture. Buffers are required to reduce the frequency of context switches in order to increase performance. Power management is not really a special case; it is just the case where the "idle task" is taken into consideration.
The ACK-clocked and time-sensitive nature of TCP is totally at odds with the goal above. While TCP attempts to smooth throughput as much as possible, the above is about increasing burstiness in order to increase performance. Go figure.
This is the same good old "throughput versus latency" trade-off here again. Think "asynchronous, non-blocking, caches, buffers, bursts" versus... not.
Posted Sep 13, 2011 20:11 UTC (Tue)
by ncm (guest, #165)
[Link] (18 responses)
One approach that has been wildly successful, commercially, is to ignore packet loss as an indicator of congestion, and instead measure changes in transit time. As queues get longer, packets spend more time in them, and arrive later, so changes in packet delay provide a direct measure of queue length, and therefore of congestion. Given reasonable buffering and effective filtering, this backs off well before forcing packet drops. Since it doesn't require router or kernel coöperation, it can be (and, indeed, has been) implemented entirely in user space, but a kernel implementation could provide less-noisy timing measurements.
Maximum performance depends on bringing processing in the nominal endpoints (e.g. disk channel delays) into the delay totals, but below a Gbps that can often be neglected.
Posted Sep 13, 2011 21:21 UTC (Tue)
by dlang (guest, #313)
[Link]
In that case, some applications won't care at all about latency because they are doing a large file transfer, and so as long as the delays are not long enough to cause timeouts (30 seconds+), they don't care. These applications want to dump as much data into the pipe as possible so that the total throughput is as high as possible.
the problem comes when you have another application that does care about latency, or only has a small amount of data to transmit. This application's packets go into the queue behind the packets for the application that doesn't care about latency, and since the queue is FIFO, the application can time out (for single-digit-second timeouts) before its packets get sent.
since different applications care about different things, this is never going to be fixed in the applications.
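The usual workaround is therefore to classify below the applications, in the qdisc; a minimal sketch that lifts DNS and interactive SSH ahead of bulk traffic (ports and interface are examples, and the u32 matches assume plain IPv4 headers):

```
tc qdisc add dev eth0 root handle 1: prio bands 3
# DNS and SSH go to the highest-priority band; everything else follows the default priomap.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 53 0xffff flowid 1:1
# Note: this also catches scp, which is exactly the ssh-vs-scp problem discussed below.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 22 0xffff flowid 1:1
```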
Posted Sep 14, 2011 5:59 UTC (Wed)
by cmccabe (guest, #60281)
[Link]
There's a buffer there that all of your packets are going to have to wait in before getting serviced. It doesn't matter how carefully you measure changes in transit time. If your neighbors are downloading big files, they are probably going to fill that buffer to the brim and you're going to have to wait a length of time proportional to total buffer size. Your latency will be bad.
Posted Sep 14, 2011 10:03 UTC (Wed)
by epa (subscriber, #39769)
[Link] (15 responses)
Posted Sep 14, 2011 11:36 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (14 responses)
But if you just have a "congested bit" what happens is that you can "optimise" by half-implementing it, you just send as much as you can, and all your data gets "politely" delivered with an ignorable flag, while any poor schmuck sharing the link who listens to the flag throttles back more and more trying to "share" with a bully who wants everything for themselves.
This is a race to the bottom, so any congestion handling needs to be willing to get heavy handed and just drop packets on the floor so that such bullies get lousy performance and give up. "Essential" means it's a necessary component, not the first line of defence.
QoS for unfriendly networks has to work the same way. If you just have a bit which says "I'm important" all the bullies will set it, so you need QoS which offers real tradeoffs like "I'd rather be dropped than arrive late" or "I am willing to suffer high latency if I can get more throughput".
Posted Sep 14, 2011 17:58 UTC (Wed)
by cmccabe (guest, #60281)
[Link] (12 responses)
There are a lot of applications where you really just don't care about latency at all, like downloading a software update or retrieving a large file. And then there's applications like instant messenger, Skype, and web browsing, where latency is very important.
If bulk really achieved higher throughput, and interactive really got reasonable latency, I think applications would fall in line pretty quickly and nobody would "optimize" by setting the wrong class.
The problem is that there's very little real competition in the broadband market, at least in the US. The telcos tend to regard any new feature as just another cost for them. Even figuring out "how much bandwidth will I get?" or "how many gigs can I download per month?" is often difficult. So I don't expect to see real end-to-end QoS any time soon.
Posted Sep 15, 2011 13:18 UTC (Thu)
by joern (guest, #22392)
[Link] (10 responses)
TCP actually has a good heuristic. If a packet got lost, this connection is too fast for the available bandwidth and has to back off a bit. If no packets get lost, it will use a bit more bandwidth. With this simple mechanism, it can adjust to any network speed, fairly rapidly adjust to changing network speeds, etc.
Until you can come up with a similarly elegant heuristic that doesn't involve decisions like "ssh, but not scp, unless scp is really small", consider me unconvinced. :)
Posted Sep 16, 2011 16:14 UTC (Fri)
by sethml (guest, #8471)
[Link] (9 responses)
Unfortunately my scheme requires routers to track TCP connection state, which might be prohibitively expensive in practice on core routers.
Posted Sep 16, 2011 21:03 UTC (Fri)
by piggy (guest, #18693)
[Link]
Posted Sep 22, 2011 5:28 UTC (Thu)
by cmccabe (guest, #60281)
[Link] (7 responses)
Due to the 3-way handshake, TCP connections which only transfer a small amount of data have to pay a heavy latency penalty before sending any data at all. It seems pretty silly to ask applications that want low latency to spawn a blizzard of tiny TCP connections, all of which will have to do the 3-way handshake before sending even a single byte. Also, spawning a blizzard of connections tends to short-circuit even the limited amount of fairness that you currently get from TCP.
This problem is one of the reasons why Google designed SPDY. The SPDY web page explains that it was designed "to minimize latency" by "allow[ing] many concurrent HTTP requests to run across a single TCP session."
Routers could do deep packet inspection and try to put packets into a class of service that way. This is a dirty hack, on par with flash drives scanning the disk for the FAT header. Still, we've been stuck with even dirtier hacks in the past, so who knows.
I still feel like the right solution is to have the application set a flag in the header somewhere. The application is the one who knows. Just to take your example, the ssh does know whether the input it's getting is coming from a tty (interactive) or a file that's been catted to it (non-interactive). And scp should probably always be non-interactive. You can't deduce this kind of information at a lower layer, because only the application knows.
I guess there is this thing in TCP called "urgent data" (aka OOB data), but it seems to be kind of a vermiform appendix of the TCP standard. Nobody has ever been able to explain to me just what an application might want to do with it that is useful...
Posted Sep 22, 2011 8:23 UTC (Thu)
by kevinm (guest, #69913)
[Link]
Posted Sep 22, 2011 17:20 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Sep 23, 2011 6:44 UTC (Fri)
by salimma (subscriber, #34460)
[Link] (1 responses)
Posted Sep 23, 2011 10:57 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Sep 23, 2011 0:52 UTC (Fri)
by njs (subscriber, #40338)
[Link] (2 responses)
I don't think anyone does? (I've had plenty of multi-day ssh connections; they were very low throughput...)
I think the idea is that if some connection is using *less* than its fair share of the available bandwidth, then it's reasonable to give it priority latency-wise. If it could have sent a packet 100 ms ago without being throttled, but chose not to -- then it's pretty reasonable to let the packet it sends now jump ahead of all the other packets that have arrived in the last 100 ms; it'll end up at the same place as it would have if the flow were more aggressive. So it should work okay, and naturally gives latency priority to at least some of the connections that need it more.
Posted Sep 23, 2011 9:39 UTC (Fri)
by cmccabe (guest, #60281)
[Link] (1 responses)
However, I think you're assuming that all the clients are the same. This is definitely not the case in real life. Also, not all applications that need low latency are low bandwidth. For example, video chat can suck up quite a bit of bandwidth.
Just to take one example. If I'm the cable company, I might have some customers with a 1.5 MBit/s download and others with 6.0 MBit/s. Assuming that they all go into one big router at some point, the 6.0 MBit/s guys will obviously be using more than their "fair share" of the uplink from this box. Maybe I can be super clever and account for this, but what about the next router in the chain? It may not even be owned by my cable company, so it's not going to know the exact reason why some connections are using more bandwidth than others.
Maybe there's something I'm not seeing, but this still seems problematic...
Posted Sep 24, 2011 1:30 UTC (Sat)
by njs (subscriber, #40338)
[Link]
Obviously the first goal should be to minimize latency in general, though.
Posted Sep 29, 2011 21:40 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Interesting, but never going to happen. The main reason why TCP/IP is successful is because QoS is optional in theory and non-existent in practice.
The end to end principle states that the network should be as dumb as possible. This is at the core of the design of TCP/IP. It notably allows interconnecting any network technologies together, including the least demanding ones. The problem with this approach is: as soon as you have the cheapest and dumbest technology somewhere in your path (think: basic Ethernet) there is a HUGE incentive to align your other network section(s) on this lowest common denominator (think... Ethernet). Because the advanced features and efforts you paid in the more expensive sections are wasted.
Suppose you have the perfect QoS settings implemented in only a few sections of your network path (like many posts in this thread do). As soon as the traffic changes and causes your current bottleneck (= non-empty queue) to move to another, QoS-ignorant section then all your QoS dollars and configuration efforts become instantly wasted. Policing empty queues has no effect.
An even more spectacular way to waste time and money with QoS in TCP/IP is to have different network sections implementing QoS in ways not really compatible with each other.
The only cases where TCP/IP QoS can be made to work is when a *single* entity has a tight control on the entire network; think for instance VoIP at the corporate or ISP level. And even there I suspect it does not come cheap. In other cases bye bye QoS.
Posted Sep 16, 2011 9:22 UTC (Fri)
by ededu (guest, #64107)
[Link]
(An optimisation can be done to use the 5% part if it is partly used and the 95% part is full.)
Posted Sep 13, 2011 21:04 UTC (Tue)
by man_ls (guest, #15091)
[Link] (10 responses)
But it was a lot of work, which by now should already be automated by the OS. Sadly it isn't.
Posted Sep 13, 2011 21:24 UTC (Tue)
by dlang (guest, #313)
[Link] (9 responses)
right now there is no queuing or QoS setting or algorithm that is right for all situations
Posted Sep 13, 2011 21:43 UTC (Tue)
by man_ls (guest, #15091)
[Link] (8 responses)
Posted Sep 13, 2011 21:58 UTC (Tue)
by dlang (guest, #313)
[Link] (7 responses)
if you have a machine plugged in to a gig-E network in your house, that is then connected to the Internet via a 1.5Mbit/s down/768kbit/s up DSL line, your machine has no way of knowing what bandwidth it should optimize for.
the prioritization and queuing fixes need to be done on your router to your ISP, and on the ISPs router to you.
you can't fix the problem on your end because by the time you see the downloaded packets they have already gotten past the chokepoint where they needed to be prioritized.
Posted Sep 13, 2011 22:08 UTC (Tue)
by man_ls (guest, #15091)
[Link] (6 responses)
Posted Sep 13, 2011 22:19 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
if you think about this, the ISP router will have a very high speed connection to the rest of the ISP, and then a lot of slow speed connections to individual houses.
having a large buffer is appropriate for the high speed pipe, and this will work very well if the traffic is evenly spread across all the different houses.
but if one house generates a huge amount of traffic (they download a large file from a very fast server), the buffer can be filled up by the traffic to this one house. that will cause all traffic to the other houses to be delayed (or dropped if the buffer is actually full), and having all of this traffic queued at the ISPs local router does nobody very much good.
TCP is designed such that in this situation, the ISPs router is supposed to drop packets for this one house early on and the connection will never ramp up to have so much data in flight.
but by having large buffers, the packets are delayed a significant amount, but not dropped, so the sender keeps ramping up to higher speeds.
the fact that vendors were not testing latency and bandwidth at the same time hid this problem. the devices would do very well in latency tests that never filled the buffers, and they would do very well in throughput tests that used large parts of the buffers. without QoS providing some form of prioritization, or dropping packets, the combination of the two types of traffic is horrible.
Posted Sep 14, 2011 21:09 UTC (Wed)
by jg (guest, #17537)
[Link]
Posted Sep 14, 2011 2:35 UTC (Wed)
by jg (guest, #17537)
[Link] (3 responses)
I'm going to try to write a blog posting soon about what is probably the best strategy for most to do.
And the tee shirt design is in Cafe Press, though I do want to add a footnote to the "fair queuing" line, that runs something like "for some meaning of fair", in that I don't really mean the traditional TCP fair queuing necessarily (whether that is fair or not is in the eye of the beholder).
Posted Sep 14, 2011 7:52 UTC (Wed)
by mlankhorst (subscriber, #52260)
[Link]
Enabled ECN where possible too, which helps with the traffic shaping method I was using. Windows XP doesn't support it, and later versions disable it by default.
Posted Sep 14, 2011 14:53 UTC (Wed)
by davecb (subscriber, #1574)
[Link] (1 responses)
Some practical suggestions will be much appreciated.
Changing the subject slightly, there's a subtle, underlying problem in that we tend to work with what's easy, not what's important.
We work with the bandwidth/delay product because it's what we needed in the short run, and we probably couldn't predict we'd need something more at the time. We work with buffer sizes because that's dead easy.
What we need is the delay, latency and/or service time of the various components. It's easy to deal with performance problems that are stated in time units and are fixed by varying the times things take. It's insanely hard to deal with performance problems when all we know is a volume in bytes. It's a bit like measuring the performance of large versus small cargo containers when you don't know if they're on a truck, a train or a ship!
If you expose any time-based metrics or tuneables in your investigation, please highlight them. Anything that looks like delay or latency would be seriously cool.
One needs very little to analyze this class of problems. Knowing the service time of a packet, the number of packets, and the time between packets is sufficient to build a tiny little mathematical model of the thing you measured. From the model you can then predict what happens when you improve or disimprove the system. More information allows for more predictive models, of course, and eventually to my mathie friends becoming completely unintelligible (;-))
--dave (davecb@spamcop.net) c-b
Posted Sep 14, 2011 21:01 UTC (Wed)
by jg (guest, #17537)
[Link]
You are exactly correct that any real solution for AQM must be time based; the rate of draining a buffer and the rate of growth of a buffer are related to the incoming and outgoing data per unit time.
As you note, not all bytes are created equal; the best example is in 802.11 where a byte in a multicast/broadcast packet can be 100 times more expensive than a unicast payload.
Thankfully, in mac80211 there is a package called Minstrel, which is, on an on-going dynamic basis, keeping track of the costs of each packet (802.11n aggregation in particular makes this "interesting").
So the next step is to hook up appropriate AQM algorithms to it such as eBDP or the "RED Light" algorithm that Kathie Nichols and Van Jacobson are again trying to make work. John Linville's quick reimplementation of eBDP (the current patch is in the debloat-testing tree) does not do this as yet and can't go upstream in its current form for this and other reasons. eBDP seems to help as Van predicted it should (he pointed me at it in January), but we've not tested it much as yet.
The challenge after that is going to be to get that all working while dealing with all the buffering issues along the way in the face of aggregation and QoS classification. There are some fun challenges for those who want to make this all work well; it's at least a three-dimensional problem, so there will be no easy, trivial solution. It's way beyond my understanding of Linux internals.
Please come help!
Posted Sep 22, 2011 5:05 UTC (Thu)
by fest3er (guest, #60379)
[Link] (3 responses)
I'm two years into exploring Linux Traffic Control (LTC) to create a nice web-based UI to configure it; I'm still learning what it can and cannot do. A lot of the documentation isn't very good, and some of it is flat-out wrong. The UI is finally working fairly well, even though it is very wrong in a couple places.
So far, I have found:
With that in hand, I identified the most common sets of data streams (FTP, HTTP, SSH, DNS, et al) and designed an HTB scheme that fairly shares bandwidth among the many data streams while allowing all but DNS, VoIP and IM to use 100% of the available bandwidth in the absence of contention--DNS, VoIP and IM are allowed limited bandwidth, based on educated guesses as to the most they can need. I also set DNS and VoIP to priority 0 (zero), IM to priority 2, and all others to priority 1. DNS and VoIP have priority but won't keep other streams from getting bandwidth because their average bit rates are fairly low, nor will other streams see much changes in latency. I have not given ACK/NAK/etc. type packets any special treatment. (For what it's worth, I also limited unidentified traffic to 56kbit/s. It's bitten me a few times, but has also forced me to learn more.)
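A heavily simplified sketch of that kind of HTB layout (interface, rates, and classids are placeholders; the real configuration has many more classes and filters):

```
WAN=ppp0; UP=2800                        # kbit/s, just under the measured uplink rate
tc qdisc add dev $WAN root handle 1: htb default 30
tc class add dev $WAN parent 1:  classid 1:1  htb rate ${UP}kbit  ceil ${UP}kbit
tc class add dev $WAN parent 1:1 classid 1:10 htb rate 200kbit  ceil 200kbit  prio 0  # DNS/VoIP: limited rate, highest priority
tc class add dev $WAN parent 1:1 classid 1:20 htb rate 2000kbit ceil ${UP}kbit prio 1  # identified bulk streams may borrow up to the full uplink
tc class add dev $WAN parent 1:1 classid 1:30 htb rate 56kbit   ceil 56kbit   prio 2  # unidentified traffic
tc filter add dev $WAN parent 1: protocol ip prio 1 u32 match ip dport 53 0xffff flowid 1:10
```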
Before deploying this scheme, I would almost always see uploads start at a nice high rate, then plummet to 1Mbit/s and stay there, and I would see different data streams fight with each other causing choppy throughput. After deploying the scheme, throughput is smooth, uplink is stable at 2.8Mbit/s, and the various data streams share the available bandwidth very nicely. Interactive SSH proceeds as though the 500MiB tarball uploads and downloads aren't happening. Even wireless behaves nicely with its bandwidth set to match the actual wireless rate. The "uncontrollable downlink" problem? It flows fairly smoothly even though my only control is at the high-speed outbound IFs. By forcing data streams to share proportionately, none of them can take over the limited downlink bandwidth; don't forget that their uplink ACK/NAK/etc. are also shared proportionately. The "shared cable" problem? I don't remember the last time I noticed a bothersome slowdown during the normal 'overload' periods.
But that is only part of the solution: the outbound part. Inbound control is a whole other problem, one that LTC doesn't handle well. And when you add NAT to the equation, LTC all but falls apart; after outbound NAT, the source IPs and possibly ports are 'obscured', which requires iptables classifiers in mangle because 'tc filter' cannot work. When I have time, I am going to rewrite my UI to use IMQ devices to handle all traffic control. Then I'll be able to fairly share a limited link's bandwidth (read 'internet link') among the other links in the router and properly control traffic whether or not NAT is involved. It will even be much easier to control VPNs that terminate on the firewall.
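The post-NAT classification being described looks roughly like this (a sketch; interface, port, and classid are examples):

```
# After SNAT/masquerading, classify in mangle/POSTROUTING, where CLASSIFY is valid,
# instead of relying on 'tc filter' matches against the original source addresses.
iptables -t mangle -A POSTROUTING -o ppp0 -p udp --dport 53 -j CLASSIFY --set-class 1:10
```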
In summary, I've found that it is possible to avoid flooding neighbors without horribly disfiguring latencies and without unduly throttling interfaces. Steady state transfers are clean. Bursty transfers are clean (even though they'll use some buffer space) as long as their average bit rate remains reasonable. But as has been pointed out several times and is the point of this series, controlling my end of the link is only half of the solution.
TCP/IP needs a way to query interfaces to determine their maximum sustainable throughput, and it needs a way to transmit that info to the other end (the peer), possibly in the manner of path MTU discovery. ISPs need to tell their customers their maximum sustainable throughput; the exact number, not their usual misleading 'up to' marketing BS. Packets should be dropped as close to the source as possible when necessary; dropping a packet at the first router on a 100Mb or Gb link is far better than dropping it after it's traversed the slow link on the other side of that router.
The effects of bufferbloat can be minimized, even when only one side can be controlled. But good Lord! Understanding LTC is not easy. The faint of heart and those with hypertension, anger management issues, difficulty understanding semi-mangled English, ADD, ADHD and other ailments should consult their physicians before undertaking such a trek. It is an arduous journey. Every time you think you've reached the end, you discover you've only looped back to an earlier position on the trail. Here, patience is a virtue; the patience of Job is almost required.
Posted Sep 22, 2011 13:03 UTC (Thu)
by nix (subscriber, #2304)
[Link] (1 responses)
(btw, have you considered fixing the documentation at the same time? I'm really looking for a replacement for wshaper that doesn't rely on a pile of deprecated stuff: a web UI is probably overkill for my application. Though when you have a working web UI I'd be happy to test it!)
Posted Sep 22, 2011 17:59 UTC (Thu)
by fest3er (guest, #60379)
[Link]
Try this one: http://agcl.us/cltc-configurator/; it's not quite as up-to-date as the version in GitHub for Roadster (my updated version of Smoothwall), but it should let you create a workable config. It starts with a default scheme that should serve as a guiding example. Remember to save both the scheme and the generated script. The script should work on most modern Linux platforms.
I have another version, http://agcl.us/traffic_control, that is better integrated with Smoothwall/Roadster, but the scripts it generates will require a little tweaking to work with generic Linux distros. And you have to start from scratch.
Posted Oct 12, 2011 12:02 UTC (Wed)
by jch (guest, #51929)
[Link]
And that's one of the problems. Many modern interface types don't have a "native" speed -- Wifi varies dynamically, with a factor of 100 or so between the slowest and the fastest, cable and ADSL have variable rates, even with good old Ethernet all bets are off when switches are involved.
We really need to find a way to limit latency without knowing the bottleneck speed beforehand. The delay-based end-to-end approaches are promising (Vegas, LEDBAT), but they have at least two serious fairness problems. I don't know of any router-based techniques that solve the issue.
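Trying a delay-based sender on a stock kernel is at least easy; a sketch (module availability depends on the kernel build):

```
modprobe tcp_vegas
sysctl net.ipv4.tcp_available_congestion_control   # check what the kernel offers
sysctl -w net.ipv4.tcp_congestion_control=vegas    # delay-based instead of loss-based
```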
--jch
Posted Sep 27, 2011 13:15 UTC (Tue)
by shawnjgoff (guest, #70104)
[Link]
You can easily cause multi-minute latencies by blocking the radio, then unblocking it after a few minutes - the replies all come flooding back in. You can also see this if you start a ping and then go do a few speed tests or some large downloads - you'll see the RTTs climb and climb as the transfer goes, then they come flooding back once it's done.
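That kind of measurement needs nothing more than two commands; a sketch (the host and file are placeholders):

```
ping -i 0.2 example.net &                 # watch the RTTs climb in one shell...
scp big-file.bin user@example.net:/tmp/   # ...while a bulk upload fills the buffers
```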
Posted Oct 12, 2011 16:34 UTC (Wed)
by ViralMehta (guest, #80756)
[Link] (1 responses)
Posted Oct 12, 2011 23:44 UTC (Wed)
by dlang (guest, #313)
[Link]
Power impact of debufferbloating
Bigger (send) buffers are essential to achieve good power management (basically the OS puts a whole load of work in the hardware queue and then goes to sleep for a long time....).
What would worry me is if it's not even looked at in the various designs/ideas....
Memory is equally impacted (going between fully active and self refresh), as are other parts of the chipset. And cpu + memory + chipset do make up a significant part of system power. Sure the screen takes a lot too.. but to get reasonable power you need to look across everything that's significant.
http://www.anandtech.com/show/4686/samsung-galaxy-s-2-int...
http://www.youtube.com/watch?v=_BXWqkOexiU&feature=fv...
http://www.youtube.com/watch?v=UyalqkdVk-8 (this is a bit flawed, as it is from quite an angle).
Servers run C states ALL THE TIME.... and often are "only" 50% to 80% utilized.... and not 10% is no exception either...
- packet 1 enters qdisc
- packet 1 leaves qdisc
- packet 2 enters qdisc
- packet 2 leaves qdisc
- ...
I wonder if chasing bufferbloat is barking up the wrong tree. If TCP doesn't tolerate generous buffers, surely that means there's something wrong with TCP, and we need to fix that.
(See http://www.chromium.org/spdy/spdy-whitepaper)
I still feel like the right solution is to have the application set a flag in the header somewhere. The application is the one who knows. Just to take your example, the ssh does know whether the input it's getting is coming from a tty (interactive) or a file that's been catted to it (non-interactive). And scp should probably always be non-interactive. You can't deduce this kind of information at a lower layer, because only the application knows.
And SSH can do just this: if DISPLAY is unset and SSH is running without a terminal, it sets the QoS bits for a bulk transfer: otherwise, it sets them for an interactive transfer. Unfortunately scp doesn't unset DISPLAY, so if you run scp from inside an X session I suspect it always gets incorrectly marked as interactive... but that's a small thing.
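Recent OpenSSH versions also expose this as a configuration knob, so the marking can be forced either way; a sketch (the IPQoS option appeared around OpenSSH 5.7, so older clients will not have it):

```
# ~/.ssh/config
#   Host *
#       IPQoS lowdelay throughput    # first value for interactive sessions, second for non-interactive
ssh -o IPQoS=lowdelay somehost       # or force the low-delay marking for one session
```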
This issue was already very well explained in TLDP in 2003; I did a similar tweaking on my network connection and it worked beautifully. A bit of QoS configuration and some bandwidth shaping was enough to make the biggest downloads go smoothly while my SSH connections worked great.
There is no algorithm for the general situation, but there is a very good approximation for the endpoints, right? Prioritize your outgoing packets, limit your download rate to a little less than the available bandwidth for your incoming packets. That is what Gettys calls "dropping packets", which should be a good thing and not too hard to do in the OS.
Hmmm, true. What my router could do is drop packets for any traffic beyond 90% of its capacity; my computer in turn could drop packets for anything beyond 90% of its gig-E nominal capacity. Anything beyond my control, I cannot do anything but order a bufferbloat T-shirt.
