LPC: An update on bufferbloat
"Bufferbloat" is the problem of excessive buffering used at all layers of
the network, from applications down to the hardware itself. Large buffers
can create obvious latency problems (try uploading a large file from a home
network while somebody else is playing a fast-paced network game and you'll
be able to measure the latency from the screams of frustration in the other
room), but the real issue is deeper than that. Excessive buffering wrecks
the control loop that enables implementations to maximize throughput
without causing excessive congestion on the net. The experience of the
late 1980s showed how bad a congestion-based collapse of the net can be;
the idea that bufferbloat might bring those days back is frightening to
many.
The initial source of the problem, Jim said, was the myth that dropping packets is a bad thing to do combined with the fact that it is no longer possible to buy memory in small amounts. The truth of the matter is that the timely dropping of packets is essential; that is how the network signals to transmitters that they are sending too much data. The problem is complicated by the use of the bandwidth-delay product to size buffers. Nobody really knows what either the bandwidth or the delay are for a typical network connection. Networks vary widely; wireless networks can be made to vary considerably just by moving across the room. In this environment, he said, no static buffer size can ever be correct, but that is exactly what is being used at many levels.
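A little back-of-the-envelope arithmetic shows why a single bandwidth-delay product cannot yield one buffer size; the link speeds and round-trip times below are only illustrative:

```
# Bandwidth-delay product = bandwidth (bytes/s) * round-trip time (s)
echo "100 * 1000000 / 8 * 0.001" | bc -l   # 100Mbit/s LAN, 1ms RTT   -> ~12500 bytes in flight
echo "6 * 1000000 / 8 * 0.100"   | bc -l   # 6Mbit/s DSL, 100ms RTT   -> ~75000 bytes in flight
echo "1 * 1000000 / 8 * 0.300"   | bc -l   # 1Mbit/s 3G, 300ms RTT    -> ~37500 bytes in flight
```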
As a result, things are beginning to break. Protocols that cannot handle much in the way of delay or loss - DNS, ARP, DHCP, VoIP, or games, for example - are beginning to suffer. A large proportion of broadband links, Jim said, are "just busted." The edge of the net is broken, but the problem is more widespread than that; Jim fears that bloat can be found everywhere.
If static buffer sizes cannot work, buffers must be sized dynamically. The RED algorithm is meant to do that sizing, but it suffers from one little problem: it doesn't actually work. The problem, Jim said, is that the algorithm knows about the size of a given buffer, but it knows nothing about how quickly that buffer is draining. Even so, it can improve matters in some situations. But it requires quite a bit of tuning to work right, so a lot of service providers simply do not bother. Efforts to create an improved version of RED are underway, but the results are not yet available.
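For a sense of the tuning burden involved, this is roughly what configuring classic RED on a Linux interface looks like; every value below has to be guessed per link, and the numbers here are illustrative only:

```
# Classic RED on a hypothetical 10Mbit/s uplink; min/max/limit are in bytes
# and all of them must be hand-tuned for the link's bandwidth and RTT.
tc qdisc add dev eth0 root red \
    limit 400000 min 30000 max 90000 avpkt 1000 \
    burst 55 probability 0.02 bandwidth 10mbit ecn
```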
A real solution to bufferbloat will have to be deployed across the entire net. There are some things that can be done now; Jim has spent a lot of time tweaking his home router to squeeze out excessive buffering. The result, he said, involved throwing away a bit of bandwidth, but the resulting network is a lot nicer to use. Some of the fixes are fairly straightforward; Ethernet buffering, for example, should be proportional to the link speed. Ring buffers used by network adapters should be reviewed and reduced; he found himself wondering why a typical adapter uses the same size for the transmit and receive buffers. There is also an extension to the DOCSIS standard in the works to allow ISPs to remotely tweak the amount of buffering employed in cable modems.
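On an ordinary Linux box, the buffers in question can at least be inspected and shrunk by hand; a sketch (the interface name and sizes are examples, and not every driver allows its rings to be changed):

```
ethtool -g eth0                       # show the adapter's current RX/TX ring sizes
ethtool -G eth0 tx 64                 # shrink the transmit ring, if the driver allows it
ip link set dev eth0 txqueuelen 100   # scale the software transmit queue to the link speed
```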
A complete solution requires more than that, though. There are a lot of hidden buffers out there in unexpected places; many of them will be hard to find. Developers need to start thinking about buffers in terms of time, not in terms of bytes or packets. And we'll need active queue management in all devices and hosts; the only problem is that nobody really knows which queue management algorithm will actually solve the problem. Steve Hemminger noted that there are no good multi-threaded queue-management algorithms out there.
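Thinking in time rather than bytes is easy to motivate with a quick calculation: the same fixed-size buffer that is harmless at Ethernet speeds becomes a latency disaster on a slow uplink.

```
# Worst-case drain time of a 256-packet buffer of 1500-byte packets:
echo "256 * 1500 * 8 / 100000000" | bc -l   # ~0.03s at 100Mbit/s
echo "256 * 1500 * 8 / 1000000"   | bc -l   # ~3.1s  at 1Mbit/s
```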
CeroWRT
Jim yielded to Dave Täht, who talked about the CeroWRT router distribution. Dave pointed out that, even when we figure out how to tackle bufferbloat, we have a small problem: actually getting those fixes to manufacturers and, eventually, users. A number of popular routers are currently shipping with 2.6.16 kernels; it is, he said, the classic embedded Linux problem.
One router distribution that is doing a better job of keeping up with the
mainline is OpenWRT. Appropriately,
CeroWRT is based on OpenWRT; its purpose is to complement
the debloat-testing kernel tree and provide
a platform for real-world testing of bufferbloat fixes. The goals behind
CeroWRT are to always be within a release or two of the mainline kernel,
provide reproducible results for network testing, and to be reliable enough
for everyday use while being sufficiently experimental to accept new stuff.
There is a lot of new stuff in CeroWRT. It has fixes to the packet aggregation code used in wireless drivers that can, in its own right, be a source of latency. The length of the transmit queues used in network interfaces has been reduced to eight packets - significantly smaller than the default values, which can be as high as 1000. That change alone is enough, Dave said, to get quality-of-service processing working properly and, he thinks, to push the real buffering bottleneck to the receive side of the equation. CeroWRT runs a tickless kernel, and enables protocol extensions like explicit congestion notification (ECN), selective acknowledgments (SACK), and duplicate SACK (DSACK) by default. A number of speedups have also been applied to the core netfilter code.
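Most of these knobs can be approximated on a stock Linux system for experimentation; a sketch with a placeholder interface name (CeroWRT applies its own configuration, so this only illustrates the settings involved):

```
ip link set dev eth0 txqueuelen 8      # tiny transmit queue, as in CeroWRT
sysctl -w net.ipv4.tcp_ecn=1           # request/accept explicit congestion notification
sysctl -w net.ipv4.tcp_sack=1          # selective acknowledgments
sysctl -w net.ipv4.tcp_dsack=1         # duplicate SACK
```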
CeroWRT also includes a lot of interesting software, including just about every network testing tool the developers could get their hands on. Six TCP congestion algorithms are available, with Westwood used by default. Netem (a network emulator package) has been put in to allow the simulation of packet loss and delay. There is a bind9 DNS server with an extra-easy DNSSEC setup. Various mesh networking protocols are supported. A lot of data collection and tracing infrastructure has been added from the web10g project, but Dave has not yet found a real use for the data.
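Both pieces are available in mainline kernels as well; a sketch of selecting Westwood and using netem to inject delay and loss for testing (interface and numbers are arbitrary):

```
modprobe tcp_westwood
sysctl -w net.ipv4.tcp_congestion_control=westwood
# Simulate a slow, lossy path on an outgoing interface:
tc qdisc add dev eth0 root netem delay 100ms 20ms loss 0.5%
```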
All told, CeroWRT looks like a useful tool for validating work done in the fight against bufferbloat. It has not yet reached its 1.0 release, though; there are still some loose ends to tie and some problems to be fixed. For now, it only works on the Netgear WNDR3700v2 router - chosen for its open hardware and relatively large amount of flash storage. CeroWRT should be ready for general use before too long; fixing the bufferbloat problem is likely to take rather longer.
[Your editor would like to thank LWN's subscribers for supporting his
travel to LPC 2011.]
| Index entries for this article | |
|---|---|
| Kernel | Networking/Bufferbloat |
| Conference | Linux Plumbers Conference/2011 |
Posted Sep 13, 2011 17:22 UTC (Tue)
by arekm (guest, #4846)
[Link]
Posted Sep 13, 2011 18:13 UTC (Tue)
by arjan (subscriber, #36785)
[Link] (31 responses)
If you end up with a micro-small buffer, you may get better behavior on the wire (that is the part that is clearly interesting, and essential for Jim's work), but it also means the OS will be waking up *all the time*, which has the risk of destroying power management...
Power management matters both for mobile devices and for servers... and guess which devices are at the edges of the network ;-(
Posted Sep 13, 2011 19:08 UTC (Tue)
by josh (subscriber, #17465)
[Link] (1 responses)
Posted Sep 13, 2011 19:36 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 13, 2011 19:27 UTC (Tue)
by Richard_J_Neill (subscriber, #23093)
[Link] (7 responses)
BTW, have you ever noticed that web-access on a 3G phone is sometimes really, really, really slow (20 seconds + for a google query, yet simultaneously quite quick for a large webpage load)? This is buffer bloat in action.
Lastly, it's worth pointing out that the Application can still buffer all it wants; we are considering the TCP buffering here. If my application wants to send a 1 MB file, there is no reason why the kernel shouldn't give userspace a whole 1MB of zero-copy buffer. That's outside the TCP end-points. The problem is that the TCP-end-points themselves need to see timely packet loss, in order to back-off slightly on transmit rates.
Posted Sep 13, 2011 19:35 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 14, 2011 6:07 UTC (Wed)
by ringerc (subscriber, #3071)
[Link] (3 responses)
My phone will often poke away for a while trying to use a data connection, conclude it's not working, re-establish it, conclude it's still not working, switch to 2G (GPRS) and set that up, roam to a new base station as signal strength varies, try to upgrade to 3G again, fail, and then eventually actually do what I asked.
Realistically, unless you're trying to do a google search while numerous other nearby people are watching TV / streaming video / downloading files / etc on their phones on the same network, it's probably more likely to be regular cellular quirks than bufferbloat.
As for power: Yep, the screen gets the blame for the majority of the power use on android phones. I can't help suspecting that means "display and GPU" though, simply because of the overwhelming power use. That's a total guess, but if GPU power is attributed under "System" then (a) it's insanely efficient and (b) Apple have invented new kinds of scary magic for their displays to allow them to run brighter, better displays for longer on similar batteries to Android phones.
Posted Sep 14, 2011 7:05 UTC (Wed)
by Los__D (guest, #15263)
[Link] (2 responses)
Posted Sep 14, 2011 14:48 UTC (Wed)
by Aissen (subscriber, #59976)
[Link] (1 responses)
(I'm one of those who went the AMOLED route and will probably never come back.)
Posted Sep 14, 2011 17:04 UTC (Wed)
by Los__D (guest, #15263)
[Link]
OTOH, I've seen many complaints about low brightness and yellowish tint on the S2, maybe there are different versions of the screen?
Posted Sep 14, 2011 8:19 UTC (Wed)
by lab (guest, #51153)
[Link]
Yes, exactly. This is a well-known fact. If you have an AMOLED (variant) screen, everything light/white consumes a lot of power; black, nothing.
>BTW, have you ever noticed that web-access on a 3G phone is sometimes really, really, really slow (20 seconds + for a google query, yet simultaneously quite quick for a large webpage load)? This is buffer bloat in action.
Interesting. I have the same observation, and always thought of it as "the mobile web has huge latency but quite good bandwidth".
Posted Sep 15, 2011 6:44 UTC (Thu)
by Cato (guest, #7643)
[Link]
Posted Sep 13, 2011 21:24 UTC (Tue)
by njs (subscriber, #40338)
[Link] (16 responses)
So the way network buffering in Linux works is: first a packet enters the "qdisc" buffer in the kernel, which can do all kinds of smart prioritization and what-not. Then it drains from the qdisc into a second buffer in the device driver, which is pure FIFO.
I've experienced 10-15 seconds of latency in this second, "dumb" buffer, with the iwlwifi drivers in real usage. (Not that iwlwifi is special in this regard, and I have terrible radio conditions, but to give you an idea of the magnitude of the problem.) That's a flat 10-15 seconds added to every DNS request, etc. So firstly, that's just way too large, well beyond the point of diminishing returns for power usage benefits. My laptop is sadly unlikely to average 0.1 wakeups/second, no matter what heroic efforts the networking subsystem makes.
*But* even this ridiculous buffer isn't *necessarily* a bad thing. What makes it bad is that my low-priority background rsync job, which is what's filling up that buffer, is blocking high-priority latency-sensitive things like DNS requests. That big buffer would still be fine if we could just stick the DNS packets and ssh packets and stuff at the head of the line whenever they came in, and dropped the occasional carefully-chosen packet.
But, in the kernel we have, prioritization and AQM in general can only be applied to packets that are still in the qdisc. Once they've hit the driver, they're untouchable. So what we want is prioritization, but the only way we can get it is to reduce the device buffers as small as possible, to force packets back up into the qdisc. This is an ugly hack. The real long-term solution is to enhance the drivers so that the AQM can reach in and rearrange packets that have already been queued.
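The split between those two buffers can be watched from user space; a sketch (the interface name is an example):

```
tc -s qdisc show dev wlan0   # qdisc-level backlog (bytes/packets) and drops
ethtool -g wlan0             # the driver's ring sizes -- the "dumb" FIFO below the qdisc
ip -s link show wlan0        # txqueuelen plus interface-level drop counters
```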
This is exactly analogous to the situation with sound drivers, actually. We used to have to pick between huge latency (large audio buffers) and frequent wakeups (small audio buffers). Now good systems fill up a large buffer of audio data ahead of time and then go to sleep, but if something unpredictable happens ("oops, incoming phone call, better play the ring tone!") then we wake up and rewrite that buffer in flight, achieving both good power usage and good latency. We need an equivalent for network packets.
Posted Sep 13, 2011 21:37 UTC (Tue)
by njs (subscriber, #40338)
[Link] (3 responses)
And in the long run, packets will be dropped intelligently according to some control law that tries to maintain fairness and provide useful feedback to senders (i.e., AQM). For now we mostly use the "drop packets when the buffer overflows" rule, which is a terrible control law, but may be less terrible with small buffers than large ones.
So in the long run buffer size doesn't matter, but in the short run it's a single knob that's coupled to a ton of theoretically unrelated issues, and decoupling it is going to be a pain.
Posted Sep 13, 2011 21:55 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
to a small extent, the IP stack on your endpoint device can be considered such a router as it aggregates the traffic from all your applications, but the biggest problem of large buffers is where you transition from a fast network to a slow one.
in many cases this is going from the high speed local network in someone's house to the slow DSL link to their ISP
I also think that you are mistaken if you think that any computers send a large number of packets to the network device and then shift to a low power state while the packets are being transmitted. usually shifting to such a low power state ends up powering off the network device as well.
servers almost never do something like this because as soon as they are done working on one request, they start working on another one.
shifting power states is a slow and power-hungry process; it's not done frequently, and on reasonably busy servers it seldom takes place
Posted Sep 13, 2011 21:59 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 13, 2011 22:59 UTC (Tue)
by jg (guest, #17537)
[Link]
It isn't clear that the current network stack buffering architecture in Linux is what it needs to be (in fact, pretty clear it needs some serious surgery; and I'm not the person to say what it really needs to be).
Ultimately, we need AQM *everywhere* that "just works". Right now we don't have an algorithm that is known to fit this description.
And yes, I'm very aware that power management folds into this; I've worked on iPAQ's as you may or may not remember, from which other hand held Linux devices were at least inspired.
Lots of interesting work to do...
Posted Sep 13, 2011 21:40 UTC (Tue)
by arjan (subscriber, #36785)
[Link]
Posted Sep 14, 2011 1:35 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
However, it would be nice to do it automatically.
Posted Sep 14, 2011 6:26 UTC (Wed)
by njs (subscriber, #40338)
[Link] (3 responses)
Actually, there is one other way to make traffic shaping work -- if you throttle your outgoing bandwidth, then that throttling is applied in between the qdisc and the device buffers, so the device buffers stop filling up. Lots of docs on traffic shaping say that this is a good idea to work around problems with your ISP's queue management, but in fact it's needed just to work around problems within your own kernel's queue management. Also, you can't really do it for wifi since you don't know what the outgoing bandwidth is at any given time, and in the case of 802.11n, this will actually *reduce* that bandwidth because hiding packets from the device driver will screw up its ability to do aggregation.
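For the record, that kind of throttling is usually a token-bucket qdisc set a bit below the real uplink rate; a sketch (the rate is a placeholder for a measured uplink):

```
# Shape egress to roughly 90% of a measured 768kbit/s uplink so the device/modem
# buffers never fill and the qdisc stays in charge of what gets sent next.
tc qdisc add dev eth0 root tbf rate 700kbit burst 10kb latency 50ms
```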
Posted Sep 14, 2011 14:34 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
In my own tests, I was able to segregate heavy BitTorrent traffic from VoIP traffic and get good results on a puny 2MBit line.
I did lose some bandwidth, but not that much.
Posted Sep 14, 2011 16:37 UTC (Wed)
by njs (subscriber, #40338)
[Link]
There isn't much the qdisc can do here, prioritization-wise -- even if packet 2 turns out to be high priority, it can't "take back" packet 1 and send packet 2 first. Packet 1 is already gone.
In your tests you used bandwidth throttling, which moved the chokepoint so that it fell in between the qdisc buffer and your device buffer. You told the qdisc to hold onto packets and only let them dribble out at a fixed rate, and chose that rate so that it would be slower than the rate the device buffer drained. So the device buffer never filled up, and the qdisc actually had multiple packets visible at once and was able to reorder them.
Posted Sep 29, 2011 0:02 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Posted Sep 14, 2011 5:17 UTC (Wed)
by russell (guest, #10458)
[Link] (4 responses)
This would place an absolute upper bound on latency between the app and the wire.
Posted Sep 14, 2011 16:55 UTC (Wed)
by njs (subscriber, #40338)
[Link]
And there are a lot of advantages to picking the *right* packets to drop -- if the packet you drop happens to be DNS, or interactive SSH, or part of a small web page, then you'll cause an immediate user-visible hiccup, and won't get any benefits in terms of reduced contention (like you would if you had dropped a packet from a long-running bulk TCP flow that then backs off). Then again, maybe that's okay, and re-ordering packets that have already been handed off to the driver does sound pretty tricky! (And might require hardware support.)
But it's useful to try and find the "right" solution first, because that way even if you give up on achieving it, at least in plan B you know what you're trying to approximate.
Posted Sep 14, 2011 17:19 UTC (Wed)
by dmarti (subscriber, #11625)
[Link] (2 responses)
Yes, it would be a win to have hardware that can timestamp packets going into its buffers and drop "stale" ones on the way out instead of transmitting them. (relevant thread on timestamps from the bufferbloat list). Right now, hardware assumes that late is better than never, and TCP would prefer never over too late.
Posted Sep 15, 2011 7:58 UTC (Thu)
by johill (subscriber, #25196)
[Link] (1 responses)
Posted Sep 15, 2011 16:06 UTC (Thu)
by dmarti (subscriber, #11625)
[Link]
That would be useful to see. It looks like the problem of bufferbloat is that packets stay in the buffer until they get stale -- so checking staleness directly, ideally without having to involve the CPU, could be a way to save having to tune the buffer size. (If you run the café in Grand Central Station, you need to bake a bigger "buffer" of muffins than the low-traffic neighborhood place does. But customers at the two places should get the same muffin freshness as long as both have the same policy of dropping day-old muffins.)
Posted Sep 15, 2011 21:43 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link]
Not necessarily. There may be hardware support for queues with differing priority. The mqprio qdisc takes advantage of this.
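For hardware with multiple transmit queues, that looks something like the following (a sketch only; the priority-to-class map and queue layout depend entirely on the NIC):

```
# Map skb priorities onto 3 traffic classes spread over 4 hardware TX queues.
tc qdisc add dev eth0 root mqprio num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
    queues 1@0 1@1 2@2 hw 0
```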
Posted Sep 14, 2011 5:01 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (1 responses)
If you want to save power there is really only one good way to go about it - hardware support for packet pacing, where the driver instructs the hardware: here is a series of ten packets, schedule for transmission at X microsecond intervals. Adapting that to TCP is kind of a trick, but the point is that the kernel can then go to sleep for a considerably longer time before it has to wake up again to do proper congestion control.
Unfortunately, as far as I know, there are no Ethernet chipsets out there with support for hardware packet pacing. It is a big problem, because the kernel can only respond so fast (and actually get any other work done) to incoming traffic on a high bandwidth interface, so what we get instead is ACK compression where a much longer series of packets gets queued for transmission in a short period of time, swamping network queues and increasing packet loss and jitter.
Posted Sep 15, 2011 21:53 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link]
In the absence of pacing, it might be more useful to vary TX interrupt moderation (if there are separate interrupts for TX and RX).
Posted Sep 14, 2011 17:39 UTC (Wed)
by rgmoore (✭ supporter ✭, #75)
[Link]
Maybe the correct solution to this problem is to have a way of setting the buffer size dynamically. Before you go to sleep, you grow the buffer size because going to sleep is a big flashing sign that latency is not a big worry right now. Shrinking the buffer again when you wake up is likely to be a bigger problem, or at least shrinking it without just throwing away the extra packets you collected in the bigger buffer while you were asleep, but there ought to be some way of doing it. Coming up with good heuristics about how big it ought to be depending on your circumstances (beyond just sleep vs. awake) is probably the hardest part.
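As a toy illustration of the idea (nothing in the stock kernel does this automatically today), the queue length could be toggled from a power-management hook:

```
# Hypothetical suspend/resume hook: trade latency for batching while idle.
ip link set dev wlan0 txqueuelen 1000   # about to idle: big buffer, fewer wakeups
ip link set dev wlan0 txqueuelen 16     # interactive again: small buffer, low latency
```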
Posted Sep 28, 2011 23:21 UTC (Wed)
by marcH (subscriber, #57642)
[Link]
I do not think power management changes the picture. Buffers are required to reduce the frequency of context switches in order to increase performance. Power management is not really a special case; it is just the case where the "idle task" is taken into consideration.
The ACK-clocked and time-sensitive nature of TCP is totally at odds with the goal above. While TCP attempts to smooth throughput as much as possible, the above is about increasing burstiness in order to increase performance. Go figure.
This is the same good old "throughput versus latency" trade-off here again. Think "asynchronous, non-blocking, caches, buffers, bursts" versus... not.
Posted Sep 13, 2011 20:11 UTC (Tue)
by ncm (guest, #165)
[Link] (18 responses)
One approach that has been wildly successful, commercially, is to ignore packet loss as an indicator of congestion, and instead measure changes in transit time. As queues get longer, packets spend more time in them, and arrive later, so changes in packet delay provide a direct measure of queue length, and therefore of congestion. Given reasonable buffering and effective filtering, this backs off well before forcing packet drops. Since it doesn't require router or kernel coöperation, it can be (and, indeed, has been) implemented entirely in user space, but a kernel implementation could provide less-noisy timing measurements.
Maximum performance depends on bringing processing in the nominal endpoints (e.g. disk channel delays) into the delay totals, but below a Gbps that can often be neglected.
Posted Sep 13, 2011 21:21 UTC (Tue)
by dlang (guest, #313)
[Link]
In that case, some applications won't care at all about latency because they are doing a large file transfer, and so as long as the delays are not long enough to cause timeouts (30 seconds+), they don't care. These applications want to dump as much data into the pipe as possible so that the total throughput is as high as possible.
the problem comes when you have another application that does care about latency, or only has a small amount of data to transmit. This application's packets go into the queue behind the packets for the application that doesn't care about latency, and since the queue is FIFO, the application can time out (for single-digit-second timeouts) before its packets get sent.
since different applications care about different things, this is never going to be fixed in the applications.
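The usual workaround is therefore to classify below the applications, in the qdisc; a minimal sketch that lifts DNS and interactive SSH ahead of bulk traffic (ports and interface are examples, and the u32 matches assume plain IPv4 headers):

```
tc qdisc add dev eth0 root handle 1: prio bands 3
# DNS and SSH go to the highest-priority band; everything else follows the default priomap.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 53 0xffff flowid 1:1
# Note: this also catches scp, which is exactly the ssh-vs-scp problem discussed below.
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 22 0xffff flowid 1:1
```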
Posted Sep 14, 2011 5:59 UTC (Wed)
by cmccabe (guest, #60281)
[Link]
There's a buffer there that all of your packets are going to have to wait in before getting serviced. It doesn't matter how carefully you measure changes in transit time. If your neighbors are downloading big files, they are probably going to fill that buffer to the brim and you're going to have to wait a length of time proportional to total buffer size. Your latency will be bad.
Posted Sep 14, 2011 10:03 UTC (Wed)
by epa (subscriber, #39769)
[Link] (15 responses)
Posted Sep 14, 2011 11:36 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link] (14 responses)
But if you just have a "congested bit" what happens is that you can "optimise" by half-implementing it, you just send as much as you can, and all your data gets "politely" delivered with an ignorable flag, while any poor schmuck sharing the link who listens to the flag throttles back more and more trying to "share" with a bully who wants everything for themselves.
This is a race to the bottom, so any congestion handling needs to be willing to get heavy handed and just drop packets on the floor so that such bullies get lousy performance and give up. "Essential" means it's a necessary component, not the first line of defence.
QoS for unfriendly networks has to work the same way. If you just have a bit which says "I'm important" all the bullies will set it, so you need QoS which offers real tradeoffs like "I'd rather be dropped than arrive late" or "I am willing to suffer high latency if I can get more throughput".
Posted Sep 14, 2011 17:58 UTC (Wed)
by cmccabe (guest, #60281)
[Link] (12 responses)
There are a lot of applications where you really just don't care about latency at all, like downloading a software update or retrieving a large file. And then there's applications like instant messenger, Skype, and web browsing, where latency is very important.
If bulk really achieved higher throughput, and interactive really got reasonable latency, I think applications would fall in line pretty quickly and nobody would "optimize" by setting the wrong class.
The problem is that there's very little real competition in the broadband market, at least in the US. The telcos tend to regard any new feature as just another cost for them. Even figuring out "how much bandwidth will I get?" or "how many gigs can I download per month?" is often difficult. So I don't expect to see real end-to-end QoS any time soon.
Posted Sep 15, 2011 13:18 UTC (Thu)
by joern (guest, #22392)
[Link] (10 responses)
TCP actually has a good heuristic. If a packet got lost, this connection is too fast for the available bandwidth and has to back off a bit. If no packets get lost, it will use a bit more bandwidth. With this simple mechanism, it can adjust to any network speed, fairly rapidly adjust to changing network speeds, etc.
Until you can come up with a similarly elegant heuristic that doesn't involve decisions like "ssh, but not scp, unless scp is really small", consider me unconvinced. :)
Posted Sep 16, 2011 16:14 UTC (Fri)
by sethml (guest, #8471)
[Link] (9 responses)
Unfortunately my scheme requires routers to track TCP connection state, which might be prohibitively expensive in practice on core routers.
Posted Sep 16, 2011 21:03 UTC (Fri)
by piggy (guest, #18693)
[Link]
Posted Sep 22, 2011 5:28 UTC (Thu)
by cmccabe (guest, #60281)
[Link] (7 responses)
Due to the 3-way handshake, TCP connections which only transfer a small amount of data have to pay a heavy latency penalty before sending any data at all. It seems pretty silly to ask applications that want low latency to spawn a blizzard of tiny TCP connections, all of which will have to do the 3-way handshake before sending even a single byte. Also, spawning a blizzard of connections tends to short-circuit even the limited amount of fairness that you currently get from TCP.
This problem is one of the reasons why Google designed SPDY. The SPDY web page explains that it was designed "to minimize latency" by "allow[ing] many concurrent HTTP requests to run across a single TCP session."
Routers could do deep packet inspection and try to put packets into a class of service that way. This is a dirty hack, on par with flash drives scanning the disk for the FAT header. Still, we've been stuck with even dirtier hacks in the past, so who knows.
I still feel like the right solution is to have the application set a flag in the header somewhere. The application is the one who knows. Just to take your example, the ssh does know whether the input it's getting is coming from a tty (interactive) or a file that's been catted to it (non-interactive). And scp should probably always be non-interactive. You can't deduce this kind of information at a lower layer, because only the application knows.
I guess there is this thing in TCP called "urgent data" (aka OOB data), but it seems to be kind of a vermiform appendix of the TCP standard. Nobody has ever been able to explain to me just what an application might want to do with it that is useful...
Posted Sep 22, 2011 8:23 UTC (Thu)
by kevinm (guest, #69913)
[Link]
Posted Sep 22, 2011 17:20 UTC (Thu)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Sep 23, 2011 6:44 UTC (Fri)
by salimma (subscriber, #34460)
[Link] (1 responses)
Posted Sep 23, 2011 10:57 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Sep 23, 2011 0:52 UTC (Fri)
by njs (subscriber, #40338)
[Link] (2 responses)
I don't think anyone does? (I've had plenty of multi-day ssh connections; they were very low throughput...)
I think the idea is that if some connection is using *less* than its fair share of the available bandwidth, then it's reasonable to give it priority latency-wise. If it could have sent a packet 100 ms ago without being throttled, but chose not to -- then it's pretty reasonable to let the packet it sends now jump ahead of all the other packets that have arrived in the last 100 ms; it'll end up at the same place as it would have if the flow were more aggressive. So it should work okay, and naturally gives latency priority to at least some of the connections that need it more.
Posted Sep 23, 2011 9:39 UTC (Fri)
by cmccabe (guest, #60281)
[Link] (1 responses)
However, I think you're assuming that all the clients are the same. This is definitely not the case in real life. Also, not all applications that need low latency are low bandwidth. For example, video chat can suck up quite a bit of bandwidth.
Just to take one example. If I'm the cable company, I might have some customers with a 1.5 MBit/s download and others with 6.0 MBit/s. Assuming that they all go into one big router at some point, the 6.0 MBit/s guys will obviously be using more than their "fair share" of the uplink from this box. Maybe I can be super clever and account for this, but what about the next router in the chain? It may not even be owned by my cable company, so it's not going to know the exact reason why some connections are using more bandwidth than others.
Maybe there's something I'm not seeing, but this still seems problematic...
Posted Sep 24, 2011 1:30 UTC (Sat)
by njs (subscriber, #40338)
[Link]
Obviously the first goal should be to minimize latency in general, though.
Posted Sep 29, 2011 21:40 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
Interesting, but never going to happen. The main reason why TCP/IP is successful is because QoS is optional in theory and non-existent in practice.
The end to end principle states that the network should be as dumb as possible. This is at the core of the design of TCP/IP. It notably allows interconnecting any network technologies together, including the least demanding ones. The problem with this approach is: as soon as you have the cheapest and dumbest technology somewhere in your path (think: basic Ethernet) there is a HUGE incentive to align your other network section(s) on this lowest common denominator (think... Ethernet). Because the advanced features and efforts you paid in the more expensive sections are wasted.
Suppose you have the perfect QoS settings implemented in only a few sections of your network path (like many posts in this thread do). As soon as the traffic changes and causes your current bottleneck (= non-empty queue) to move to another, QoS-ignorant section then all your QoS dollars and configuration efforts become instantly wasted. Policing empty queues has no effect.
An even more spectacular way to waste time and money with QoS in TCP/IP is to have different network sections implementing QoS in ways not really compatible with each other.
The only cases where TCP/IP QoS can be made to work is when a *single* entity has a tight control on the entire network; think for instance VoIP at the corporate or ISP level. And even there I suspect it does not come cheap. In other cases bye bye QoS.
Posted Sep 16, 2011 9:22 UTC (Fri)
by ededu (guest, #64107)
[Link]
(An optimisation can be done to use the 5% part if it is partly used and the 95% part is full.)
Posted Sep 13, 2011 21:04 UTC (Tue)
by man_ls (guest, #15091)
[Link] (10 responses)
But it was a lot of work, which by now should already be automated by the OS. Sadly it isn't.
Posted Sep 13, 2011 21:24 UTC (Tue)
by dlang (guest, #313)
[Link] (9 responses)
right now there is no queuing or QoS setting or algorithm that is right for all situations
Posted Sep 13, 2011 21:43 UTC (Tue)
by man_ls (guest, #15091)
[Link] (8 responses)
Posted Sep 13, 2011 21:58 UTC (Tue)
by dlang (guest, #313)
[Link] (7 responses)
if you have a machine plugged in to a gig-E network in your house, that is then connected to the Internet via a 1.5Mbit/s down/768kbit/s up DSL line, your machine has no way of knowing what bandwidth it should optimize for.
the prioritization and queuing fixes need to be done on your router to your ISP, and on the ISPs router to you.
you can't fix the problem on your end because by the time you see the downloaded packets they have already gotten past the chokepoint where they needed to be prioritized.
Posted Sep 13, 2011 22:08 UTC (Tue)
by man_ls (guest, #15091)
[Link] (6 responses)
Posted Sep 13, 2011 22:19 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
if you think about this, the ISP router will have a very high speed connection to the rest of the ISP, and then a lot of slow speed connections to individual houses.
having a large buffer is appropriate for the high speed pipe, and this will work very well if the traffic is evenly spread across all the different houses.
but if one house generates a huge amount of traffic (they download a large file from a very fast server), the buffer can be filled up by the traffic to this one house. that will cause all traffic to the other houses to be delayed (or dropped if the buffer is actually full), and having all of this traffic queued at the ISPs local router does nobody very much good.
TCP is designed such that in this situation, the ISPs router is supposed to drop packets for this one house early on and the connection will never ramp up to have so much data in flight.
but by having large buffers, the packets are delayed a significant amount, but not dropped, so the sender keeps ramping up to higher speeds.
the fact that vendors were not testing latency and bandwidth at the same time hid this problem. the devices would do very well in latency tests that never filled the buffers, and they would do very well in throughput tests that used large parts of the buffers. without QoS providing some form of prioritization, or dropping packets, the combination of the two types of traffic is horrible.
Posted Sep 14, 2011 21:09 UTC (Wed)
by jg (guest, #17537)
[Link]
Posted Sep 14, 2011 2:35 UTC (Wed)
by jg (guest, #17537)
[Link] (3 responses)
I'm going to try to write a blog posting soon about what is probably the best strategy for most to do.
And the tee shirt design is in Cafe Press, though I do want to add a footnote to the "fair queuing" line, that runs something like "for some meaning of fair", in that I don't really mean the traditional TCP fair queuing necessarily (whether that is fair or not is in the eye of the beholder).
Posted Sep 14, 2011 7:52 UTC (Wed)
by mlankhorst (subscriber, #52260)
[Link]
Enabled ECN where possible too, which helps with the traffic shaping method I was using. Windows XP doesn't support it, and later versions disable it by default.
Posted Sep 14, 2011 14:53 UTC (Wed)
by davecb (subscriber, #1574)
[Link] (1 responses)
Some practical suggestions will be much appreciated.
Changing the subject slightly, there's a subtle, underlying problem in that we tend to work with what's easy, not what's important.
We work with the bandwidth/delay product because it's what we needed in the short run, and we probably couldn't predict we'd need something more at the time. We work with buffer sizes because that's dead easy.
What we need is the delay, latency and/or service time of the various components. It's easy to deal with performance problems that are stated in time units and are fixed by varying the times things take. It's insanely hard to deal with performance problems when all we know is a volume in bytes. It's a bit like measuring the performance of large versus small cargo containers when you don't know if they're on a truck, a train or a ship!
If you expose any time-based metrics or tuneables in your investigation, please highlight them. Anything that looks like delay or latency would be seriously cool.
One needs very little to analyze this class of problems. Knowing the service time of a packet, the number of packets, and the time between packets is sufficient to build a tiny little mathematical model of the thing you measured. From the model you can then predict what happens when you improve or disimprove the system. More information allows for more predictive models, of course, and eventually to my mathie friends becoming completely unintelligible (;-))
--dave (davecb@spamcop.net) c-b
Posted Sep 14, 2011 21:01 UTC (Wed)
by jg (guest, #17537)
[Link]
You are exactly correct that any real solution for AQM must be time based; the rate of draining a buffer and the rate of growth of a buffer are related to the incoming and outgoing data per unit time.
As you note, not all bytes are created equal; the best example is in 802.11 where a byte in a multicast/broadcast packet can be 100 times more expensive than a unicast payload.
Thankfully, in mac80211 there is a package called Minstrel, which is, on an on-going dynamic basis, keeping track of the costs of each packet (802.11n aggregation in particular makes this "interesting").
So the next step is to hook up appropriate AQM algorithms to it such as eBDP or the "RED Light" algorithm that Kathie Nichols and Van Jacobson are again trying to make work. John Linville's quick reimplementation of eBDP (the current patch is in the debloat-testing tree) does not do this as yet and can't go upstream in its current form for this and other reasons. eBDP seems to help as Van predicted it should (he pointed me at it in January), but we've not tested it much as yet.
The challenge after that is going to be to get that all working while dealing with all the buffering issues along the way in the face of aggregation and QoS classification. There are some fun challenges for those who want to make this all work well; it's at least a three-dimensional problem, so there will be no easy, trivial solution. It's way beyond my understanding of Linux internals.
Please come help!
Posted Sep 22, 2011 5:05 UTC (Thu)
by fest3er (guest, #60379)
[Link] (3 responses)
I'm two years into exploring Linux Traffic Control (LTC) to create a nice web-based UI to configure it; I'm still learning what it can and cannot do. A lot of the documentation isn't very good, and some of it is flat-out wrong. The UI is finally working fairly well, even though it is very wrong in a couple places.
So far, I have found:
With that in hand, I identified the most common sets of data streams (FTP, HTTP, SSH, DNS, et al) and designed an HTB scheme that fairly shares bandwidth among the many data streams while allowing all but DNS, VoIP and IM to use 100% of the available bandwidth in the absence of contention--DNS, VoIP and IM are allowed limited bandwidth, based on educated guesses as to the most they can need. I also set DNS and VoIP to priority 0 (zero), IM to priority 2, and all others to priority 1. DNS and VoIP have priority but won't keep other streams from getting bandwidth because their average bit rates are fairly low, nor will other streams see much changes in latency. I have not given ACK/NAK/etc. type packets any special treatment. (For what it's worth, I also limited unidentified traffic to 56kbit/s. It's bitten me a few times, but has also forced me to learn more.)
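A heavily simplified sketch of that kind of HTB layout (interface, rates, and classids are placeholders; the real configuration has many more classes and filters):

```
WAN=ppp0; UP=2800                        # kbit/s, just under the measured uplink rate
tc qdisc add dev $WAN root handle 1: htb default 30
tc class add dev $WAN parent 1:  classid 1:1  htb rate ${UP}kbit  ceil ${UP}kbit
tc class add dev $WAN parent 1:1 classid 1:10 htb rate 200kbit  ceil 200kbit  prio 0  # DNS/VoIP: limited rate, highest priority
tc class add dev $WAN parent 1:1 classid 1:20 htb rate 2000kbit ceil ${UP}kbit prio 1  # identified bulk streams may borrow up to the full uplink
tc class add dev $WAN parent 1:1 classid 1:30 htb rate 56kbit   ceil 56kbit   prio 2  # unidentified traffic
tc filter add dev $WAN parent 1: protocol ip prio 1 u32 match ip dport 53 0xffff flowid 1:10
```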
Before deploying this scheme, I would almost always see uploads start at a nice high rate, then plummet to 1Mbit/s and stay there, and I would see different data streams fight with each other causing choppy throughput. After deploying the scheme, throughput is smooth, uplink is stable at 2.8Mbit/s, and the various data streams share the available bandwidth very nicely. Interactive SSH proceeds as though the 500MiB tarball uploads and downloads aren't happening. Even wireless behaves nicely with its bandwidth set to match the actual wireless rate. The "uncontrollable downlink" problem? It flows fairly smoothly even though my only control is at the high-speed outbound IFs. By forcing data streams to share proportionately, none of them can take over the limited downlink bandwidth; don't forget that their uplink ACK/NAK/etc. are also shared proportionately. The "shared cable" problem? I don't remember the last time I noticed a bothersome slowdown during the normal 'overload' periods.
But that is only part of the solution: the outbound part. Inbound control is a whole other problem, one that LTC doesn't handle well. And when you add NAT to the equation, LTC all but falls apart; after outbound NAT, the source IPs and possibly ports are 'obscured', which requires iptables classifiers in mangle because 'tc filter' cannot work. When I have time, I am going to rewrite my UI to use IMQ devices to handle all traffic control. Then I'll be able to fairly share a limited link's bandwidth (read 'internet link') among the other links in the router and properly control traffic whether or not NAT is involved. It will even be much easier to control VPNs that terminate on the firewall.
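The post-NAT classification being described looks roughly like this (a sketch; interface, port, and classid are examples):

```
# After SNAT/masquerading, classify in mangle/POSTROUTING, where CLASSIFY is valid,
# instead of relying on 'tc filter' matches against the original source addresses.
iptables -t mangle -A POSTROUTING -o ppp0 -p udp --dport 53 -j CLASSIFY --set-class 1:10
```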
In summary, I've found that it is possible to avoid flooding neighbors without horribly disfiguring latencies and without unduly throttling interfaces. Steady state transfers are clean. Bursty transfers are clean (even though they'll use some buffer space) as long as their average bit rate remains reasonable. But as has been pointed out several times and is the point of this series, controlling my end of the link is only half of the solution.
TCP/IP needs a way to query interfaces to determine their maximum sustainable throughput, and it needs a way to transmit that info to the other end (the peer), possibly in the manner of path MTU discovery. ISPs need to tell their customers their maximum sustainable throughput; the exact number, not their usual misleading 'up to' marketing BS. Packets should be dropped as close to the source as possible when necessary; dropping a packet at the first router on a 100Mb or Gb link is far better than dropping it after it's traversed the slow link on the other side of that router.
The effects of bufferbloat can be minimized, even when only one side can be controlled. But good Lord! Understanding LTC is not easy. The faint of heart and those with hypertension, anger management issues, difficulty understanding semi-mangled English, ADD, ADHD and other ailments should consult their physicians before undertaking such a trek. It is an arduous journey. Every time you think you've reached the end, you discover you've only looped back to an earlier position on the trail. Here, patience is a virtue; the patience of Job is almost required.
Posted Sep 22, 2011 13:03 UTC (Thu)
by nix (subscriber, #2304)
[Link] (1 responses)
(btw, have you considered fixing the documentation at the same time? I'm really looking for a replacement for wshaper that doesn't rely on a pile of deprecated stuff: a web UI is probably overkill for my application. Though when you have a working web UI I'd be happy to test it!)
Posted Sep 22, 2011 17:59 UTC (Thu)
by fest3er (guest, #60379)
[Link]
Try this one: http://agcl.us/cltc-configurator/; it's not quite as up-to-date as the version in GitHub for Roadster (my updated version of Smoothwall), but it should let you create a workable config. It starts with a default scheme that should serve as a guiding example. Remember to save both the scheme and the generated script. The script should work on most modern Linux platforms.
I have another version, http://agcl.us/traffic_control, that is better integrated with Smoothwall/Roadster, but the scripts it generates will require a little tweaking to work with generic Linux distros. And you have to start from scratch.
Posted Oct 12, 2011 12:02 UTC (Wed)
by jch (guest, #51929)
[Link]
And that's one of the problems. Many modern interface types don't have a "native" speed -- Wifi varies dynamically, with a factor of 100 or so between the slowest and the fastest, cable and ADSL have variable rates, even with good old Ethernet all bets are off when switches are involved.
We really need to find a way to limit latency without knowing the bottleneck speed beforehand. The delay-based end-to-end approaches are promising (Vegas, LEDBAT), but they have at least two serious fairness problems. I don't know of any router-based techniques that solve the issue.
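Trying a delay-based sender on a stock kernel is at least easy; a sketch (module availability depends on the kernel build):

```
modprobe tcp_vegas
sysctl net.ipv4.tcp_available_congestion_control   # check what the kernel offers
sysctl -w net.ipv4.tcp_congestion_control=vegas    # delay-based instead of loss-based
```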
--jch
Posted Sep 27, 2011 13:15 UTC (Tue)
by shawnjgoff (guest, #70104)
[Link]
You can easily cause multi-minute latencies by blocking the radio, then unblocking it after a few minutes - the replies all come flooding back in. You can also see this if you start a ping and then go do a few speed tests or some large downloads - you'll see the RTTs climb and climb as the transfer goes, then they come flooding back once it's done.
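That kind of measurement needs nothing more than two commands; a sketch (the host and file are placeholders):

```
ping -i 0.2 example.net &                 # watch the RTTs climb in one shell...
scp big-file.bin user@example.net:/tmp/   # ...while a bulk upload fills the buffers
```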
Posted Oct 12, 2011 16:34 UTC (Wed)
by ViralMehta (guest, #80756)
[Link] (1 responses)
Posted Oct 12, 2011 23:44 UTC (Wed)
by dlang (guest, #313)
[Link]
Power impact of debufferbloating
Bigger (send) buffers are essential to achieve good power management (basically the OS puts a whole load of work in the hardware queue and then goes to sleep for a long time....).
What would worry me is if it's not even looked at in the various designs/ideas....
Memory is equally impacted (going between fully active and self refresh), as are other parts of the chipset. And cpu + memory + chipset do make up a significant part of system power. Sure the screen takes a lot too.. but to get reasonable power you need to look across everything that's significant.
http://www.anandtech.com/show/4686/samsung-galaxy-s-2-int...
http://www.youtube.com/watch?v=_BXWqkOexiU&feature=fv...
http://www.youtube.com/watch?v=UyalqkdVk-8 (this is a bit flawed, as it is from quite an angle).
Servers run C states ALL THE TIME.... and often are "only" 50% to 80% utilized.... and not 10% is no exception either...
- packet 1 enters qdisc
- packet 1 leaves qdisc
- packet 2 enters qdisc
- packet 2 leaves qdisc
- ...
I wonder if chasing bufferbloat is barking up the wrong tree. If TCP doesn't tolerate generous buffers, surely that means there's something wrong with TCP, and we need to fix that.
(See http://www.chromium.org/spdy/spdy-whitepaper)
I still feel like the right solution is to have the application set a flag in the header somewhere. The application is the one who knows. Just to take your example, the ssh does know whether the input it's getting is coming from a tty (interactive) or a file that's been catted to it (non-interactive). And scp should probably always be non-interactive. You can't deduce this kind of information at a lower layer, because only the application knows.
And SSH can do just this: if DISPLAY is unset and SSH is running without a terminal, it sets the QoS bits for a bulk transfer: otherwise, it sets them for an interactive transfer. Unfortunately scp doesn't unset DISPLAY, so if you run scp from inside an X session I suspect it always gets incorrectly marked as interactive... but that's a small thing.
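Recent OpenSSH versions also expose this as a configuration knob, so the marking can be forced either way; a sketch (the IPQoS option appeared around OpenSSH 5.7, so older clients will not have it):

```
# ~/.ssh/config
#   Host *
#       IPQoS lowdelay throughput    # first value for interactive sessions, second for non-interactive
ssh -o IPQoS=lowdelay somehost       # or force the low-delay marking for one session
```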
This issue was already very well explained in TLDP in 2003; I did a similar tweaking on my network connection and it worked beautifully. A bit of QoS configuration and some bandwidth shaping was enough to make the biggest downloads go smoothly while my SSH connections worked great.
There is no algorithm for the general situation, but there is a very good approximation for the endpoints, right? Prioritize your outgoing packets, limit your download rate to a little less than the available bandwidth for your incoming packets. That is what Gettys calls "dropping packets", which should be a good thing and not too hard to do in the OS.
Hmmm, true. What my router could do is drop packets for any traffic beyond 90% of its capacity; my computer in turn could drop packets for anything beyond 90% of its gig-E nominal capacity. Anything beyond my control, I cannot do anything but order a bufferbloat T-shirt.
