
A damp discussion of network queuing

By Jonathan Corbet
October 15, 2014

Linux Plumbers Conference
Very few presenters at technical conferences come equipped with gallons of water and a small inflatable swimming pool to contain it. But that is just how Stephen Hemminger showed up at the 2014 Linux Plumbers Conference. Stephen was there to talk about the current state of the fight against bufferbloat; while there was some good news to share, the sad fact is that, in a number of areas, we are still all wet.

When Jim Gettys discovered and named bufferbloat some years ago, he had stumbled onto a set of problems that networking developers had been aware of for years. But nobody quite understood the extent of the problem or how it could affect everyday networking. In short, bufferbloat comes about when one or more players in the networking pipeline buffer far more data than they should. The user-visible results can include degraded download speeds, uploading not working at all, and high latencies — all with no packet loss.

The latency issue is problematic at a number of levels. If you are trying to provide remote display service, 15ms latencies will ruin the usability of the system. At 100ms delay, voice over IP protocols cease to work well. Users will generally hit the "reload" button (or give up) on a web page load after about one second. Bufferbloat can create latencies far longer than that. But it's the lack of packet loss that is the real problem; without dropped packets, the TCP congestion control algorithms cannot do their job. So the networking stack keeps trying to send more data when the proper response is to slow down and let the queues drain.

So who is to blame for the bufferbloat problem? One possible response, Stephen said, was to blame Linux. After all, Windows XP limited TCP connections to 64KB of outstanding data; there is not much buffering happening there. Windows 7, instead, added a rate limiter that would throttle all connections. Android has done something similar, adding a limit on the size of the receive window for any connection. Linux developers tend not to be enamored with artificial limits, so Linux users get to experience the full pain of the bufferbloat problem.

An alternative, he said, is to blame the customer. That is why Internet service providers like Comcast went through a period of blaming their biggest customers for their networking problems. Comcast went as far as capping bandwidth use and charging extra in some cases. But the real problem was not those customers; it was bufferbloat in the ISP's internal network.

Getting wet

At this point the aquatic games began. Stephen put together a set of demonstrations where a network queue was represented by an inverted plastic bottle. The bottle could hold a fair amount of water (packets), but there are limits on how quickly the water can drain out. So if water arrives more quickly than the bottle can drain, the bottle begins to fill. If the bottle is quite full, a drop of water added at the top will take a long time to reach the opening and exit the bottle — especially if the bottle is large. Bufferbloat, thus, was represented as bottlebloat.

In the real world, network queuing systems are more complicated than a single bottle, though. The default queuing discipline in Linux employs three parallel bottles of varying sizes; one for bulk traffic, one for high-priority traffic, and one for everything else ("normal" traffic). Almost all traffic goes through the normal bottle; SSH can use the high-priority queue, while Dropbox and BitTorrent use the bulk queue. It was an OK idea for its time, Stephen said, but it does not work on today's net. Those three bottles do nothing to prevent excess buffering.
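For readers curious about what their own system is doing, a quick check with tc will show the queuing discipline in use; the interface name below is just an example:

    # Show the root queuing discipline on an interface (eth0 assumed here).
    # With the long-standing default this typically reports pfifo_fast with
    # "bands 3" and the priomap that steers packets into the three bands
    # described above.
    tc qdisc show dev eth0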

The first attempt to come up with a smarter solution was the RED queue management algorithm. RED was represented by poking a bunch of holes into a (red, naturally) bottle. Once the water level in the bottle goes above the holes, water escapes through those holes and is lost; that corresponds to dropping packets in the real world. Rather than dropping packets, RED can set the explicit congestion notification (ECN) bit in the TCP header, notifying the receiver that it needs to reduce the size of the receive window to slow down the connection. It's a nice idea, but the net broke it. Routers will drop packets with ECN set, or, worse, simply reset the bit. As a result, Linux "will play the game," Stephen said, but only if the other side initiates it. The networking developers just do not want to deal with complaints about dropped connections.
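For a rough idea of what configuring RED looks like (the parameter values here are purely illustrative, not recommendations from the talk), a RED queue with ECN marking can be attached with something like:

    # Attach a RED qdisc with illustrative thresholds; "ecn" asks RED to mark
    # packets (when the peer negotiated ECN) instead of dropping them outright.
    tc qdisc add dev eth0 root red limit 400000 min 30000 max 90000 \
        avpkt 1000 burst 55 bandwidth 10mbit probability 0.02 ecn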

A different approach is called "hierarchical token bucket"; it looks like a bunch of small bottles all connected in parallel. Each type of traffic gets its own bottle (queue), and packets are dispatched equally from all queues. The problem with this mechanism is that it requires a great deal of configuration to work well. That might be manageable on a server with a static workload, but it is not useful on desktop systems.
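A taste of that configuration burden (with an assumed eth0 and made-up rates) looks something like the following; every class of traffic needs its own bandwidth numbers, plus filter rules to steer packets into it:

    # Root HTB qdisc with a default class, plus per-traffic-class children.
    tc qdisc add dev eth0 root handle 1: htb default 30
    tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
    tc class add dev eth0 parent 1:1 classid 1:10 htb rate 20mbit ceil 100mbit  # high priority
    tc class add dev eth0 parent 1:1 classid 1:20 htb rate 60mbit ceil 100mbit  # normal
    tc class add dev eth0 parent 1:1 classid 1:30 htb rate 20mbit ceil 100mbit  # bulk
    # ... plus tc filter rules to classify each kind of traffic into its class.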

An alternative is stochastic fair queuing (SFQ). The same set of small bottles is used, but each network connection is assigned to its own bottle by way of a hash function. No configuration is required. SFQ can make things work better, but it is not a full solution to the bufferbloat problem; it was the state of the art in the Linux kernel about five years ago.
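SFQ, by contrast, needs essentially no tuning; attaching it (eth0 assumed again) is a one-liner:

    # Hash each flow into its own queue; re-seed the hash every 10 seconds.
    tc qdisc add dev eth0 root sfq perturb 10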

In an attempt to come up with a smarter solution, Kathie Nichols and Van Jacobson created the "Controlled Delay" or CoDel algorithm. CoDel looks somewhat like RED, in that it starts to drop packets when buffers get too full. But CoDel works by looking at the packets at the head of the queue — those which have been through the entire queue and are about to be transmitted. If they have been in the queue for too long, they are simply dropped. The general idea is to try to maintain a maximum delay in the queue of 5ms (by default). CoDel has a number of good properties, Stephen said; it drops packets quickly (allowing TCP congestion control algorithms to do their thing) and maintains reasonable latencies. The "fq_codel" algorithm adds an SFQ dispatching mechanism in front of the CoDel queue, maintaining fairness between network connections.
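Trying either algorithm on a single interface is just as simple; the 5ms target mentioned above is the default, but it can be given explicitly (interface name and values are illustrative):

    # Plain CoDel with an explicit (default) 5ms target delay.
    tc qdisc add dev eth0 root codel target 5ms
    # Or fq_codel, which puts per-flow queuing in front of CoDel.
    tc qdisc replace dev eth0 root fq_codel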

Stephen noted that replacing a queue with something like fq_codel is a good thing, but one should remember that there are a lot of queues in a typical system. Only the one with the smallest hole (the slowest outgoing link) matters in the end, since that's where the packets will accumulate.

After a discussion of how most network benchmarking utilities look at the wrong thing (one should examine upload speed, download speed, and latency simultaneously), he put up a set of plots showing how the network responds to load with the various queuing mechanisms. The results clearly showed that CoDel solves the bufferbloat problem well, and that fq_codel does even better.

So, are we there yet? As noted, there are a lot of queues in a typical network path, and not all of them have been addressed. Different techniques are needed at different levels. For excessive buffering at the socket layer, for example, TCP small queues can be used. A bigger problem is Internet service providers, which tend to have large amounts of legacy equipment in their racks. There is not a lot the networking developers can do about that. Still, it helps to be running the best software locally. So Stephen encouraged everybody to run a command like:

    sysctl -w net.core.default_qdisc=fq_codel

That will cause fq_codel to be used for all future connections (up to the next reboot). Unfortunately, the default queuing discipline cannot be changed, since it will certainly disturb some user's workload somewhere.
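Note that the sysctl only affects qdiscs created after it is set; to pick the change up on an interface that is already up, or to keep the setting across reboots, something along these lines can be used (the file name and interface are arbitrary):

    # Persist the default across reboots.
    echo 'net.core.default_qdisc = fq_codel' > /etc/sysctl.d/bufferbloat.conf
    # Replace the qdisc on an already-configured interface right now.
    tc qdisc replace dev eth0 root fq_codel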

The good, the bad, and the ugly

Stephen concluded by saying that there are good, bad, and ugly parts to bufferbloat and the efforts to solve it. On the good side, the industry is aware of the problem. Bufferbloat is routinely talked about at IETF meetings, and researchers are working on it. Perhaps best of all, the solutions are all open source. In some cases (CoDel for example), open-source publication was deliberately chosen to forestall the adoption of patent-encumbered techniques.

On the bad side, there is a lot of legacy equipment and software out there. Original equipment manufacturers, Stephen said, are focused on cost, not on queue management details. So a lot of equipment out there — especially consumer-level equipment — is bad and will stay that way for some years yet. There are also issues with backbone congestion, but they tend to be more political than technical in nature.

The ugly part is wireless networking, which has a bunch of unique buffering problems of its own. Packet aggregation, for example, can help with bandwidth, but creates latency problems. Wireless systems are mostly using proprietary software and are never updated. Standards bodies are starting to pay attention, Stephen said, but a solution in this area is distant.

Even with the bad and ugly parts, though, the message was mostly positive, if a bit damp. Quite a bit of good work has been done to address the bufferbloat problem, and that work has shown up first as free software. Bufferbloat will be with us for a while yet, but solutions are far closer than they were a few years ago.

Index entries for this article
Kernel: Networking/Bufferbloat
Conference: Linux Plumbers Conference/2014



A damp discussion of network queuing

Posted Oct 15, 2014 21:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Wow, that's a brilliant demonstration of network concepts!

A damp discussion of network queuing

Posted Oct 15, 2014 22:22 UTC (Wed) by marcH (subscriber, #57642) [Link] (3 responses)

> On the bad side, there is a lot of legacy equipment and software out there. Original equipment manufacturers, Stephen said, are focused on cost, not on queue management details. So a lot of equipment out there — especially consumer-level equipment — is bad and will stay that way for some years yet.

Like I already wrote a million times, a single thing would be enough for things to change and improve much quicker: if http://www.speedtest.net/ (or similar) were to finally (re-)implement MTR. Simply because zillions of speedtest screenshots are posted and made available daily. While cost is one focus, losing customers because of a bad reputation is another very important one.

netalyzr comes the closest but is still way too technical for the masses.

However I'm afraid speedtest.net on the one hand and bufferbloat researchers and coders on the other hand live in two different parallel universes.

thinkbroadband.com

Posted Oct 16, 2014 9:23 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

The UK's thinkbroadband.com might be an interesting target. Its staff are somewhat responsive to user requests and their forum has an entire section for talking about the actual site rather than discussing the various UK broadband solutions, ISPs etc.

They do already have IPv6 speeds for example (even though 99% of the UK still doesn't have IPv6 in 2014), and they show both a single link download rate and a multi-HTTP download which can indicate if there is something throttling specific TCP connections rather than your bandwidth as a whole.

A damp discussion of network queuing

Posted Oct 17, 2014 3:44 UTC (Fri) by mtaht (subscriber, #11087) [Link] (1 responses)

We have been discussing publishing an open letter to the web benchmark makers, asking them to change their tests to continually also measure round trip time during the download and upload portions of their test, and publish that 98th percentile "ping" result rather than their current baseline ping.

That change to these tests alone would send a message to millions of people that there is a tradeoff between bandwidth and latency that currently is biased far, far, far too much towards the bandwidth side of the equation.

Relevant thread is here, letter in progress, multiple folk have agreed to sign. (sorry for the busted cert)

https://lists.bufferbloat.net/pipermail/bloat/2014-Septem...

A damp discussion of network queuing

Posted Oct 26, 2014 16:02 UTC (Sun) by marcH (subscriber, #57642) [Link]

The idea of an open letter is nice but alone it's too soft, won't do much.

As people already answered there, to make it work you would also need a quick and dirty "netalizer lite" site that does only one thing, does it well, and explains it: measuring latency while downloading and uploading. Only after such a site is implemented and published under the name http://www.speedtest-is-telling-lies.net would the open letter (published on the same site) stand a chance to make any difference.

A damp discussion of network queuing

Posted Oct 16, 2014 0:29 UTC (Thu) by rfunk (subscriber, #4054) [Link] (8 responses)

Last I paid attention (quite a while ago now), fq_codel hadn't made it into a kernel release yet. I assume it has by now, but what version first got it?

A damp discussion of network queuing

Posted Oct 16, 2014 5:11 UTC (Thu) by cglass (guest, #52152) [Link] (7 responses)

Looking at the bufferbloat wiki (http://www.bufferbloat.net/projects/codel/wiki) it looks like you need linux 3.5.

A damp discussion of network queuing

Posted Oct 16, 2014 7:07 UTC (Thu) by mtaht (subscriber, #11087) [Link] (3 responses)

A couple notes:

1) fq_codel looks more like "DRR++" + codel. A SFQ + codel implementation exists in ns2 and ns3, but not linux as yet.

2) It is now on by default in openwrt Barrier Breaker, and part of CeroWrt's SQM system, openwrt's qos-scripts, and many other third party router firmwares. It's also part of qualcomm's streamboost and netgear's dynamic QoS in their new X4 product.

The effects of applying a qos script enabled with fq_codel on a router against nearly every ISP technology are outstanding. Cable result: http://snapon.lab.bufferbloat.net/~cero2/jimreisert/resul...

3) It "does no harm" on servers and clients, and can often do good. I'd like it very much if a desktop oriented distro tried a switch by default. There are some exceptions, notably a good hi precision clock source is needed.

4) While the effects on wireless are not as good as we'd like, it does take the edge off the worst of the problems there.

Try it! On any modern linux you can toss that sysctl into an /etc/sysctl.d/bufferbloat.conf file....

A damp discussion of network queuing

Posted Oct 17, 2014 17:17 UTC (Fri) by nix (subscriber, #2304) [Link] (2 responses)

> There are some exceptions, notably a good hi precision clock source is needed.

IIRC, fq_codel also needs BQL support in the NIC driver. Some embedded firewall boxes (in my case, the Soekris net5501) have NICs such as the VIA Rhine for which BQL is not implemented yet. (There are old patches for the Rhine, but nothing for recent kernels that I know of.)
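One quick way to see whether a driver is actually feeding BQL (interface and queue names below are just examples) is to watch the inflight counter BQL exposes in sysfs:

    # BQL's knobs live under sysfs for each transmit queue. On a driver
    # without BQL support, the inflight counter stays at 0 even under load,
    # which is a quick way to spot a missing implementation.
    cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight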

Hm actually I think you mentioned this a few months ago on the cerowrt list. Great minds think alike etc etc.

A damp discussion of network queuing

Posted Oct 17, 2014 19:09 UTC (Fri) by mtaht (subscriber, #11087) [Link] (1 responses)

BQL support in the NIC driver is HIGHLY desirable, but not necessary, per se:

http://snapon.lab.bufferbloat.net/~d/beagle_bql/bql_makes...

I do hope that the more companies realize that BQL support is essential to high performance (I'm looking at *you*, Arm, Cisco, AMD, and Xilinx and a dozen others), the more BQL drivers (with xmit_more support) will land on everything. Certainly nearly all the 10GigE makers "get it", but that knowledge has not fully propagated down into the older and slower devices...

http://www.bufferbloat.net/projects/bloat/wiki/BQL_enable...

There is a paper in progress on how much BQL helps - answer, quite a lot - while we (in the bufferbloat world) know this, that sort of stuff needs to land on CTO and academic and driver developer desks.

I wrote up some of the issues in adding BQL to a device driver here; I had planned to write a tutorial but haven't got around to it.

https://lists.bufferbloat.net/pipermail/bloat/2014-June/0...

So far as I recall the via rhine was updated to BQL recently, but will check.

A damp discussion of network queuing

Posted Oct 21, 2014 16:25 UTC (Tue) by nix (subscriber, #2304) [Link]

Ooo that's interesting! I'll try turning on fq_codel on the firewall, then. (Since the true bottleneck is probably the closed-source ADSL routers upstream from it, I don't expect too much, but I live in hope.)

A damp discussion of network queuing

Posted Oct 19, 2014 14:33 UTC (Sun) by BenHutchings (subscriber, #37955) [Link] (2 responses)

It's also available in Debian's 3.2-based kernel, along with bql for many drivers.

A damp discussion of network queuing

Posted Oct 21, 2014 18:12 UTC (Tue) by jheiss (subscriber, #62556) [Link] (1 responses)

It is? On a Debian 7 box with the stock 3.2 kernel:

> sysctl net.core.default_qdisc
sysctl: cannot stat /proc/sys/net/core/default_qdisc: No such file or directory

On one with 3.14 from backports:

> sysctl net.core.default_qdisc
net.core.default_qdisc = pfifo_fast

A damp discussion of network queuing

Posted Oct 21, 2014 19:29 UTC (Tue) by BenHutchings (subscriber, #37955) [Link]

I didn't notice that default_qdisc was new. Then you have to select it for each interface individually, e.g.:

tc qdisc replace dev eth0 root fq_codel

A damp discussion of network queuing

Posted Oct 16, 2014 5:46 UTC (Thu) by krivenok (guest, #42766) [Link] (3 responses)

It was one of the most exciting sessions at the conference. A great example of how to give a great talk. Thanks again Stephen!

A damp discussion of network queuing

Posted Oct 19, 2014 6:47 UTC (Sun) by sitaram (guest, #5959) [Link] (2 responses)

does anyone know if there's a video of this talk available for download? I couldn't find one offhand...

Video

Posted Oct 19, 2014 13:32 UTC (Sun) by corbet (editor, #1) [Link] (1 responses)

As far as I can tell, only the plenary sessions in the big room were videotaped. The Plumbers folks had wanted to do video for the LPC sessions, but the cost was prohibitive.

Video

Posted Oct 19, 2014 23:28 UTC (Sun) by sitaram (guest, #5959) [Link]

Hmm I guess that's understandable... thanks!

A damp discussion of network queuing

Posted Oct 16, 2014 8:03 UTC (Thu) by iq-0 (subscriber, #36655) [Link] (14 responses)

So CoDel is a good idea (put a time deadline on internal buffering).

But TCP still reacts badly to some hop along the way performing bad buffering, which can be seen as significantly increased latency. Wouldn't it be just as helpful if (for TCP) you'd focus more on changes in latency in addition to watching for dropped packets?

This would not be so different from how certain bittorrent clients automatically limit their upload/download speeds to prevent either one from clogging up the other (and even other network traffic).

The "optimum" latency is connection specific (or at least specific between endpoints) and could change over time, but the change in latency is probably more relevant than the actual latency itself for such an algorithm to work.

A damp discussion of network queuing

Posted Oct 16, 2014 8:42 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (13 responses)

It looks like latency-based queue management loses in the presence of other nodes which use packet-loss based queue management.

A damp discussion of network queuing

Posted Oct 16, 2014 9:38 UTC (Thu) by iq-0 (subscriber, #36655) [Link] (2 responses)

This is not meant as a generic queue management (for this packet-loss based is better). More as a TCP congestion control tweak.

You need to do both of course (since you don't necessarily have bufferbloat problems). A significant increase could be considered equal to a dropped packet, thus as an indication of congestion (only not involving a retransmit obviously).

The big problem would be to identify when the latency has increased *not* due to bufferbloat along the path.
While current round-trip times are higher than the reference round-trip time, you slowly increase the base round-trip time and consider the connection congested.
But when the current round-trip time is lower than the reference round-trip time, you reset the base round-trip time and consider the link no longer congested.

This is of course not perfect, but by dynamically reacting to increased latency along the path (as a possible indication of buffer bloat) that you probably can't fix, you can prevent yourself from contributing to the problem.

And while you will probably lose to people not playing nicely, your network responsiveness will probably increase, which is often just as relevant for the perceived network quality.

A damp discussion of network queuing

Posted Oct 16, 2014 9:47 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> The big problem would be to identify when the latency has increased *not* due to bufferbloat along the path.
The problem is that if there's at least _one_ other packet-drop-based actor on the network then they would use up all the bandwidth.

But even if only latency-based actors are active on the network then there are still pathological behaviors when one actor can accidentally hog others' bandwidth. Or wild oscillations of the bandwidth that require explicit dampening that defeats the purpose of latency-based control.

Quite a few algorithms have been tried since the introduction of the TCP timestamp option, but as far as I know none of them helped much. I think people even tried to use a neural network predictor for the window size (it helped but required too much computational capacity) - http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber... .

A damp discussion of network queuing

Posted Oct 17, 2014 4:35 UTC (Fri) by mtaht (subscriber, #11087) [Link]

The above thread is a little muddled.

AQM (codel in this case) regulates TCP and encapsulated traffic better. Nearly nobody advocates codel by itself. (Many DO advocate AQM by itself.)

FQ solves multiple flows competing against each other. It's way more effective than AQM in most circumstances. There are a lot of FQ advocates that think FQ solves everything - but I'm not one of them. No matter how many buckets you have for flows, it still pays to keep queue lengths as short as possible.

(for an eloquent defense of the properties of FQ, see: https://www.lincs.fr/events/fair-flows-fair-network-attra...)

ECN solves the problem of packet loss being the sole indicator of congestion. ECN can now be negotiated with 60% or more of the alexa top 1m.

A delay based TCP would work fairly well if deployed universally (but it is impossible to have a flag day!), however the bottleneck router has a more intimate view of current congestion and can do smarter things than a TCP. Most delay based TCPs (and uTP) start backing off at 100ms of induced delay which is rather late. There is some very interesting new work out there on bufferbloat aware TCP's, notably "fq + pacing".

Furthermore, FQ+AQM technology can be (and is being) deployed incrementally.

With a FQ'd network, delay based TCPs *could* begin to back off long before 100ms of buffering is reached.

A delay based AQM works on all traffic and turns delay based TCPs back into packet loss or ECN marked congestion control.

http://perso.telecom-paristech.fr/~drossi/paper/rossi13tm...

I'm hoping that clears matters up a bit.

A damp discussion of network queuing

Posted Oct 17, 2014 3:34 UTC (Fri) by mtaht (subscriber, #11087) [Link] (9 responses)

To me, the "Fair queueing" portion of fq_codel takes care of 97% of the bufferbloat problem, the codel AQM another 2.5% and ECN takes it in for the score.

A lot of people are wedded to the idea that pfifo_fast's prioritization features accomplish useful stuff; that is rare in today's environment where multiple, unclassified TCP flows dominate. Thus, I encourage people to turn on fq_codel by default everywhere and run a few tests... I certainly have been running fq_codel everywhere for 2+ years now, and my network is measurably faster, smoother and less annoying under load.

Things like USB networking and wifi improve quite a bit. Ethernet gets better (way better if you have BQL in your drivers).

Here's a result on ethernet at 100Mbit: baseline pfifo, fq_codel/fq, and fq_codel with and without BQL:

http://snapon.lab.bufferbloat.net/~d/beagle_bql/bql_makes...
http://snapon.lab.bufferbloat.net/~d/nuc-client/results.html

GigE with pfifo vs fq_codel with and without offloads.

http://snapon.lab.bufferbloat.net/~cero2/nuc-to-puck/resu...
(It is my hope the new xmit_more bulking patches lessen the need for TSO/GSO/GRO offloads on gigE devices)

Still, once you have all the above working, prioritization (notably deprioritization) can help a bit more, and there is work on a new qdisc (tentatively called cake) that adds a few tiers of prioritization on top of DRR + fq_codel as well as an optional rate limiter.

There is also a specific-to-servers sch_fq in mainline linux now.

None of the above means that a given distro should continue to wait to switch away from pfifo_fast to fq_codel as the default. There are nearly no circumstances where pfifo_fast has better network behavior than fq_codel.

https://kau.toke.dk/modern-aqms/

It was certainly my, and Stephen Hemminger's, and much of the bufferbloat community's hope that distros would start to make the switch once the sysctl landed. We certainly continue to evolve things - sch_fq is now a very good choice for a bare metal web server (but not a vm), for example.

But I hope we've now made a set of compelling arguments that pfifo_fast must die!

A damp discussion of network queuing

Posted Oct 18, 2014 13:09 UTC (Sat) by nix (subscriber, #2304) [Link] (8 responses)

One thing I've been wondering about -- does anyone actually have a bunch of scripts that set things up properly? So far the only options I know of to get things set up properly are 'learn to do it yourself (which involves learning about tc and lartc.org and driving yourself slowly insane)' and 'install cerowrt and pick it apart to dig out the scripting', which seems a little annoying if (as I do) you don't have any appropriate hardware for it (I'm not sure if any appropriate hardware is on sale any more, particularly for relatively unusual cases like mine with channel-bonded ADSL).

So... any scripting? (Not that I can really use it yet, since as mentioned previously my firewall's NIC doesn't have BQL support yet -- but in future, it would be nice if I could arrange for my networking gear to not be bufferbloated to death. I think this means fixing my firewall and ditching the ADSL routers it's connected to and replacing them with cerowrt-capable routers or something like that -- which would be beneficial anyway, since I'd be able to have them reliably communicate the state of the ADSL link to the firewall, which could pull up/down the components of the multipath route appropriately. Right now I'm relying on horrible hacks like looking at passing inbound packets and *hoping* they come from the Internet rather than, say, the ADSL router's administrative interface, and falling back on periodic pings if none are seen... all quite horrible.)

A damp discussion of network queuing

Posted Oct 21, 2014 4:57 UTC (Tue) by dlang (guest, #313) [Link] (7 responses)

> One thing I've been wondering about -- does anyone actually have a bunch of scripts that set things up properly?

What are you trying to set up? fq_codel and BQL can be set up at compile time and need no scripts to function.

Cerowrt has additional configuration scripts for other benefits (artificially limiting outbound traffic to make this box the bottleneck rather than allowing an upstream router to be the bottleneck, and working to limit inbound bandwidth usage); these are the things that are hard to set up.

But just enabling fq_codel helps, as does BQL, and they work well when combined.

A damp discussion of network queuing

Posted Oct 21, 2014 16:27 UTC (Tue) by nix (subscriber, #2304) [Link] (6 responses)

It's basically the latter, the 'make us the bottleneck', with the extra fun that I'm using line bonding, my lines are somewhat different speeds, and the outbound interface used is basically random (determined by a hash of the src/dst/port triplet IIRC) while the incoming is evenly distributed between the lines...

I used to use wondershaper for this but it's completely bitrotted and doesn't work at all any more.

A damp discussion of network queuing

Posted Oct 22, 2014 4:16 UTC (Wed) by dlang (guest, #313) [Link] (5 responses)

I think the right thing to do nowadays is to use the kernel HTB support to define the bandwidth of each connection. If you limit the throughput for each connection, I believe that fq_codel will end up doing the "right thing" for you.

The difficulty for outbound traffic is in automating the discovery of what your available bandwidth is.
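A minimal sketch of that approach (the interface name and the rate are placeholders; the rate has to be set a bit below the real uplink speed so that the queue forms locally) might look like:

    # Shape egress to just under the measured uplink rate so this box,
    # rather than the modem, is where packets queue up...
    tc qdisc add dev eth0 root handle 1: htb default 1
    tc class add dev eth0 parent 1: classid 1:1 htb rate 9500kbit
    # ...and let fq_codel manage that queue.
    tc qdisc add dev eth0 parent 1:1 fq_codel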

A damp discussion of network queuing

Posted May 15, 2015 0:23 UTC (Fri) by nix (subscriber, #2304) [Link] (4 responses)

FYI, the net-next tree now has this commit:

commit 92bf200881d978bc3c6a290991ae1f9ddc7b5411
Author: Tino Reichardt <milky-kernel@mcmilk.de>
Date: Tue Feb 24 10:28:01 2015 -0800

net: via-rhine: add BQL support

Add Byte Queue Limits (BQL) support to via-rhine driver.

[edumazet] tweaked patch and changed TX_RING_SIZE from 16 to 64

Signed-off-by: Tino Reichardt <milky-kernel@mcmilk.de>
Tested-by: Jamie Gloudon <jamie.gloudon@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

So those of us with Rhine-based NICs finally have access to the codel and fq_codel goodness. And what goodness! It's as zero-configuration as advertised: with no configuration at all, even without any outbound traffic shaping and with unaltered probably bloated-as-hell queues inside the ADSL modems, all my bufferbloat symptoms have silently vanished and my line is smooth and usable under load, with ping times only a few ms higher on a saturated line than on an idle one. Now that's a *nice* qdisc!

mtaht et al have done a really really *really* good job here. I can see why every distro is jumping on this as fast as they can.

A damp discussion of network queuing

Posted May 16, 2015 17:55 UTC (Sat) by mtaht (subscriber, #11087) [Link] (3 responses)

I am glad to see the rhine patches finally landed. There were a few popular devices used as firewalls that used that chipset.

However I must note that your excellent result was probably due to fq_codel taking advantage of hardware flow control exerted by the DSL modem, which then is seen by fq_codel as delay and managed appropriately, where pfifo_fast would just keep buffering until it hits the packet limit.

Most edge devices today do not exert hardware flow control. Certainly I feel that {dsl,cable}modems should use hardware flow control! It is a sane signal that can also be aware of congestion on media. But nearly everybody put switches, rather than ethernet devices in the path here, over the past 5 years, and lost that capability. So we have generally had to use software rate limiting (sqm-scripts) to succeed here... or to push for more hardware flow control (or smarter modems).

A damp discussion of network queuing

Posted May 19, 2015 19:23 UTC (Tue) by nix (subscriber, #2304) [Link] (2 responses)

> I am glad to see the rhine patches finally landed. There were a few popular devices used as firewalls that used that chipset.

Indeed there were (though I don't know if the Soekris net5501 I use could ever be defined as 'popular' except among the geekiest crowd and those who want to network up oil rigs.)

> However I must note that your excellent result was probably due to fq_codel taking advantage of hardware flow control exerted by the DSL modem

That's what I presumed. Nothing else could explain how it managed to figure out my ADSL bandwidth given the total absence of any other way to detect it.

> But nearly everybody put switches, rather than ethernet devices in the path here, over the past 5 years, and lost that capability.

Yeah. I guess if your modem has only one port, and you don't have a dedicated multi-port firewall box, that's all you can really do... I wonder: are the very common 'four-port ADSL modems' actually an ADSL modem and a switch in the same box? If so, I guess they're eschewing flow control too, right? :(

Thank you for a most excellent qdisc, anyway!

A damp discussion of network queuing

Posted May 19, 2015 19:40 UTC (Tue) by dlang (guest, #313) [Link] (1 responses)

> I wonder: are the very common 'four-port ADSL modems' actually an ADSL modem and a switch in the same box?

they are commonly a system running linux with an ADSL modem and an ethernet connected to a 4-port switch.

Unfortunately they usually are using a binary driver for the DSL side, so getting them supported by OpenWRT is hard :-(

If they were based on the current OpenWRT instead of a several-year-old one, they would be using fq_codel by default.

A damp discussion of network queuing

Posted May 20, 2015 10:15 UTC (Wed) by paulj (subscriber, #341) [Link]

There is one family of them that has a free software driver, at least on the Linux kernel side. The Lantiq SoCs:

http://wiki.openwrt.org/doc/hardware/soc/soc.lantiq

A damp discussion of network queuing

Posted Oct 16, 2014 10:02 UTC (Thu) by rvolgers (guest, #63218) [Link] (3 responses)

> Unfortunately, the default queuing discipline cannot be changed, since it will certainly disturb some user's workload somewhere.

This is why we can't have nice things, Linux. Sheesh. Guess it's up to distros to change the default.

A damp discussion of network queuing

Posted Oct 18, 2014 13:09 UTC (Sat) by thestinger (guest, #91827) [Link] (2 responses)

At the very least, it will be changed on all systemd distributions if they don't explicitly revert it:

http://lists.freedesktop.org/archives/systemd-devel/2014-...

A damp discussion of network queuing

Posted Oct 21, 2014 21:21 UTC (Tue) by bronson (subscriber, #4806) [Link] (1 responses)

Uh oh. Queue the "systemd is taking over your networking stack!!" comments in 3... 2... 1...

A damp discussion of network queuing

Posted Oct 21, 2014 21:26 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

I've also heard that Lennart is going to replace outdated TCP with kdtcp!

What about forward-error-correction?

Posted Oct 16, 2014 11:32 UTC (Thu) by Richard_J_Neill (subscriber, #23093) [Link] (2 responses)

There are 2 other problems that make the experience bad:

1. Wifi which is limited by interference, rather than by congestion, i.e. where a significant fraction of packets will get dropped, but not because of other traffic, but rather because of interference/noise or distance from the AP. TCP causes the client to back off, when in fact, I think it should be more aggressive, and perform forward-error-correction: i.e. the client should assume that many packets will not make it, and it should re-transmit everything twice or more within 100ms. (This is especially true for UDP, eg for DHCP IP allocation over a network with 50% packet loss, it's nearly impossible to get the link established).

2. Ajax stuff over https. The problem here is that each Ajax connection (after Apache has finished the keepalive) requires a complete cycle of re-establishing all the encryption layers, even if the actual data is only tiny. It would be useful to have some way to keep an https session alive for many minutes.

(The combination of Ajax, https, and slightly non-ideal wifi results in a horrible experience!)

What about forward-error-correction?

Posted Oct 16, 2014 11:58 UTC (Thu) by JGR (subscriber, #93631) [Link]

Wifi already does link-layer retransmission.
When a packet is lost at the IP layer, one or more of its fragments have already failed to be received after a number of retransmissions.

Much of the interference/noise is traffic for other APs or traffic for other 2.4GHz protocols. Sending even more data makes the noise problem worse.

If you're getting 50% packet loss, then you'd be better off fixing that rather than trying to work round with client fudges (i.e. change radio channel, move/add APs, change/move antennae, etc.).

As for 2, as I understand it HTTP 2 solves this.

What about forward-error-correction?

Posted Oct 17, 2014 15:13 UTC (Fri) by grs (guest, #99211) [Link]

Enabling SSL session reuse and caching on web servers should improve #2.

A damp discussion of network queuing

Posted Oct 16, 2014 12:25 UTC (Thu) by michich (guest, #17902) [Link] (5 responses)

Alright, since kernel developers don't want to change the default in the kernel itself, I am proposing to select fq_codel in systemd:
http://lists.freedesktop.org/archives/systemd-devel/2014-...

A damp discussion of network queuing

Posted Oct 16, 2014 12:35 UTC (Thu) by TomH (subscriber, #56149) [Link] (3 responses)

Did you actually try that before posting the patch?

Only I tried more or less exactly that last night on a Fedora 20 machine and have so far failed to get it to work.

First off the "all connections" thing, which I know came from the original article, makes no sense, as that setting applies to network interfaces, not to individual connections.

So in order to take effect it needs to be set at the point when an interface is created - well actually I think from my current testing that it needs to be set when the interface is first brought up.

What I found when I set that in a dropin in /etc/sysctl.d and rebooted was that not only did it not get applied to my interfaces, but it didn't even manage to actually change the sysctl value! Oddly restarting systemd-sysctl.service did at least cause the sysctl value to be changed, but of course by then it is too late to affect the interfaces which are already up.

My current hypothesis (based on complete guesswork - need to look at the kernel source next) is that systemd-sysctl.service is setting it too early when it runs during the boot and something is resetting it afterwards.

A damp discussion of network queuing

Posted Oct 16, 2014 13:38 UTC (Thu) by michich (guest, #17902) [Link] (2 responses)

It works for me. I can see fq_codel in /proc/sys/net/core/default_qdisc after boot. "ip l" also shows fq_codel on my "enp0s25" interface. But there could still be a race and I'm just lucky.

A damp discussion of network queuing

Posted Oct 16, 2014 20:30 UTC (Thu) by TomH (subscriber, #56149) [Link] (1 responses)

The problem seems to be related to an interaction with module loading - because fq_codel is a module (sch_fq_codel), writing fq_codel to that sysctl file triggers a module load, and if that fails then the write effectively never happens.

That is, as best I can see, exactly what is happening to me. If I add a file in /etc/modules-load.d to make sure sch_fq_codel is preloaded then everything works.

Why the module is failing to load when triggered by the kernel as a result of the sysctl write is not clear however.
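For anyone wanting the concrete form of that workaround, it amounts to two small drop-in files (the file names here are arbitrary):

    # /etc/modules-load.d/fq_codel.conf -- preload the qdisc module at boot
    sch_fq_codel
    # /etc/sysctl.d/90-fq_codel.conf -- then the sysctl write can take effect
    net.core.default_qdisc = fq_codel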

A damp discussion of network queuing

Posted Oct 16, 2014 20:41 UTC (Thu) by TomH (subscriber, #56149) [Link]

So the problem is that selinux is blocking it - the policy does not allow module loads from the systemd-sysctl context.

So your proposed change to systemd is fine, it will just need a corresponding change in the Fedora selinux policy.

A damp discussion of network queuing

Posted Oct 17, 2014 3:37 UTC (Fri) by mtaht (subscriber, #11087) [Link]

+10! Go Fedora! Go!

There are (a very few) caveats to switching away from pfifo_fast. Please feel free to contact us over at cerowrt-devel or the codel list to discuss.

More importantly, run your own benchmarks against the results (measuring latency simultaneously with load), or try ours (netperf-wrapper's rrul tests in particular).

A damp discussion of network queuing

Posted Oct 17, 2014 9:53 UTC (Fri) by paulj (subscriber, #341) [Link] (5 responses)

Note, jg didn't quite discover buffer-bloat. It had been researched and reasonably well described in academia. There were even a string of papers on the benefits of "tiny buffers". However, that work somehow got stuck in academia and was never communicated to the wider engineering networking world. What jg has done is, I guess, to have discovered it for that engineering world, pass the previous results on, and then develop that knowledge further.

I believe jg has explained that on LWN before, e.g. see comments in http://lwn.net/Articles/418918/ .

A damp discussion of network queuing

Posted Oct 20, 2014 7:55 UTC (Mon) by marcH (subscriber, #57642) [Link] (4 responses)

Among others, I reported TxDescriptors bufferbloat to "engineering" in 2004:

http://thread.gmane.org/gmane.linux.network/6366/focus=11785
http://marc.info/?l=linux-netdev&m=108462579501312
http://oss.sgi.com/archives/netdev/2004-06/msg00917.html

However I think this got eventually lost in the "agitation of life"; unlike jg I had neither fame nor a very catchy name for it :-)

---

(Good) words are incredibly important, it's funny how so many engineers don't realize they make all the difference.

http://martinfowler.com/bliki/TwoHardThings.html

A damp discussion of network queuing

Posted Oct 20, 2014 8:18 UTC (Mon) by paulj (subscriber, #341) [Link] (3 responses)

Ah kudos for pointing it out.

It's amazing that the default txqueuelen got bumped up to 1000 for all interfaces. It's even more amazing that this is *still* the default, even on wifi devices 10 years later. :(

1000 packet queues are just insane, even on high-speed links.

I've had "for H in <list of devices> ; do ip link set dev $H qlen 5; done" in my rc.local for quite a while. Unfortunately though it doesn't apply to devices brought up post-boot by, e.g., NetworkManager. I havn't yet looked into how to make NM set the qlen.

A damp discussion of network queuing

Posted Oct 23, 2014 2:54 UTC (Thu) by dcbw (guest, #50562) [Link] (2 responses)

A dispatcher script would do it; for an 'up' event, take the interface name and run /sbin/ip to set the qlen. Examples here; drop it into /etc/NetworkManager/dispatcher.d/ and chmod 700.

http://cgit.freedesktop.org/NetworkManager/NetworkManager...

more information in 'man NetworkManager'.
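A minimal sketch of such a dispatcher script (the path and the qlen value are just examples) might be:

    #!/bin/sh
    # /etc/NetworkManager/dispatcher.d/90-txqueuelen (chmod 700)
    # NetworkManager passes the interface name and the action ("up",
    # "down", ...) as the first two arguments.
    IFACE="$1"
    ACTION="$2"
    if [ "$ACTION" = "up" ]; then
        /sbin/ip link set dev "$IFACE" qlen 5
    fi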

A damp discussion of network queuing

Posted Dec 1, 2014 20:57 UTC (Mon) by paulj (subscriber, #341) [Link]

That looks like it should do what I want. Thanks! :)

A damp discussion of network queuing

Posted Dec 1, 2014 22:51 UTC (Mon) by paulj (subscriber, #341) [Link]

Submitted a NetworkManager script to set txqueuelen according to nmcli's SPEED.CAPABILITIES value on interface up, as RedHat bug #1169529:

https://bugzilla.redhat.com/show_bug.cgi?id=1169529

A damp discussion of network queuing

Posted Oct 20, 2014 4:28 UTC (Mon) by fmarier (subscriber, #19894) [Link] (1 responses)

> So Stephen encouraged everybody to run a command like: sysctl -w net.core.default_qdisc=fq_codel

Interestingly enough, the bufferbloat.net wiki recommends fq instead for everything except routers:

> For host (rather than router) based queue management, we recomend sch_fq instead of fq_codel as of linux 3.12, for tcp-heavy workloads.

A damp discussion of network queuing

Posted Oct 20, 2014 15:41 UTC (Mon) by mtaht (subscriber, #11087) [Link]

Um, we said that badly. I just tried to improve it on the webpage.

fq_codel is a good general purpose default, no matter the workload.

sch_fq is better on servers "for tcp heavy workloads". It has been tuned for >10GigE, in particular, and does some really nice stuff with pacing.

On hosts, at lower speeds, on reverse traffic, it's not clearcut, and it seems to be a lose on wifi presently (but wifi has many other problems), and it's the wrong thing on routers entirely.

Please note I'd be just as happy if either one became the linux default and pfifo_fast went the way of the dodo.

I'd be happiest if the "right" qdisc was chosen always, and more work went into choosing sane defaults for things like tcp small queues, txqueuelen (if you must stick with pfifo_fast), TSO/GSO sizes, etc for when you are running at rates below 10GigE.

When to make ECN on-by-default for Linux (just on IPv6)?

Posted Nov 13, 2014 13:02 UTC (Thu) by TimSmall (guest, #96681) [Link]

It looks like ECN is set to be on-by-default in Microsoft Windows Server 2012 http://serverfault.com/questions/526377/is-ecn-explicit-c...

This made me wonder if the default Linux setting of "Enable ECN when requested by incoming connections but do not request ECN on outgoing connections." should be changed?

It will be interesting to see if MS stick with this on-by-default behaviour in the next release of Windows Server and/or push it into their desktop releases - a quick web search shows the new default in Windows Server 2012 has caused problems for at least one user:

http://hardforum.com/showthread.php?t=1805750

My assumption is that most things which currently break with ECN are NAT and firewall boxes, which led me to wondering whether Linux should request ECN by default for outgoing IPv6 connections, since I hope fewer IPv6 connections play badly with ECN.

At the moment, the ECN behaviour of Linux IPv6 TCP is controlled by the value of the sysctl variable net.ipv4.tcp_ecn (which in itself is a bit surprising) - and there's no way to control IPv6 behaviour independently of IPv4 behaviour.
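For reference, the three values that sysctl accepts (as documented in the kernel's ip-sysctl documentation) are roughly as follows, with 2 being the default behavior described above:

    # 0 - never request or accept ECN
    # 1 - request ECN on outgoing connections and accept it on incoming ones
    # 2 - accept ECN when requested by the peer, but do not request it (default)
    sysctl net.ipv4.tcp_ecn
    # To also request ECN on outgoing connections:
    sysctl -w net.ipv4.tcp_ecn=1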


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds