
LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 5:53 UTC (Tue) by gdt (subscriber, #6284)
Parent article: LCA: Vint Cerf on re-engineering the Internet

I stand by the recommendations in the leaflet I wrote.

A 16MB buffer size is appropriate for GbE users and the 0.2s of round-trip delay from the undersea network which attaches Australia to the west coast of the USA. As the leaflet explains, Australia is one of the few countries in the world where users face such a high RTT to their popular Internet resources and can afford such high bandwidth too. Given that odd situation, it isn't surprising that operating systems need some tuning.
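
For concreteness, a back-of-the-envelope sketch of where a figure of that order comes from; the bandwidth and RTT are the ones above, and the sysctl names mentioned at the end are the standard Linux knobs, given only as illustration rather than as the leaflet's exact recommendations:

    bandwidth_bps = 1_000_000_000   # GbE access link
    rtt_s = 0.2                     # ~200 ms round trip, Australia <-> US west coast
    bdp_bytes = bandwidth_bps * rtt_s / 8
    print("BDP ~= %.1f MiB" % (bdp_bytes / 2**20))   # ~23.8 MiB of data "in flight"
    # To keep that much data in flight, the TCP socket buffers must be allowed
    # to grow to roughly the BDP, which is why sysctls such as
    # net.core.rmem_max / net.core.wmem_max and net.ipv4.tcp_rmem / tcp_wmem
    # need maxima in the tens-of-megabytes range for this path.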

When discussing buffer bloat you need to distinguish hosts and routers: Jim Gettys' complaint was about excessive buffers in routers. The host needs a TCP buffer with a maximum size of the bandwidth-delay product in order to be able to fill the pipe. The routers along that pipe need nowhere near that; rather, they need buffering appropriate for the bandwidth and delay of the next hop, and their buffer scheduling appears to be just as important to TCP throughput as the depth of the buffer. It's fair to say that the academic understanding of router buffering is much less clear than that of host buffering, which makes definite recommendations of router buffer sizes difficult; that is one reason why router buffering was not mentioned in the leaflet at all.

On the plus side, I get to add "dissed by Vint Cerf" to my CV :-)



LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 8:55 UTC (Tue) by ebiederm (subscriber, #35028) [Link] (11 responses)

The distinction needs to be drawn between TCP socket buffers and buffers in the transit path. In particular, NIC queues on the hosts can cause exactly the same issues as large buffers in routers.

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 10:40 UTC (Tue) by gdt (subscriber, #6284) [Link] (8 responses)

Yep, exactly. Re-reading my posting, I wish I'd spent more time making the distinction clearer between (1) TCP buffers in end systems and (2) IP buffers on the egress interfaces of routers. The leaflet was about (1); Vint obviously thought, from the context of the question, that the leaflet was about (2).

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 26, 2011 1:25 UTC (Wed) by mtaht (subscriber, #11087) [Link] (6 responses)

Not having seen the leaflet I don't know what, specifically, to address.

One core problem of bufferbloat is that devices and device drivers are currently doing too much *uncontrolled* buffering. The TCP/IP stack itself is fine; however: 1) once a packet lands on the txqueue it can be shaped, but rarely is.

2) Far worse, in many cases, especially in wireless routers, once a packet gets off the txqueue it lands in the driver's queue, which can be quite large, and can incur huge delays in further IP traffic.

Once the device driver's buffers are filled, no amount of traffic shaping at a higher level will help. Death will not release you.

Here's a specific example:

Linux's default txqueuelen is 1000. Even after you cut that to something reasonable, packets then hit the device driver's buffers. The driver I'm specifically hacking on (the ath9k; patches available here: https://github.com/dtaht/Cruft/raw/master/bloat/558-ath9k...
) defaults to 507 buffers (with some limiting factors applied to the sizes of its 10 queues), and it will retry sending each of them up to 13 times.

Assume your uplink is 3Mbit/sec: what's your maximum latency?
Assume your uplink is 128Kbit/sec: what's your maximum latency?

The math is too depressing to work through in full here (a rough sketch is below). If you have an ath9k, try the above patch; there's one for the IWL going around. Many Ethernet devices support ethtool...
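
For anyone who does want the numbers, a rough worst-case sketch, assuming 1500-byte frames and that every one of the 507 buffered frames gets retried the full 13 times (the pathological case, not the typical one):

    def worst_case_drain_s(link_bps, frames=507, frame_bytes=1500, retries=13):
        # Time to drain a full driver queue if each frame may be sent `retries` times.
        return frames * frame_bytes * 8 * retries / link_bps

    print("3 Mbit/s uplink:   %.0f seconds" % worst_case_drain_s(3_000_000))   # ~26 s
    print("128 kbit/s uplink: %.0f seconds" % worst_case_drain_s(128_000))     # ~618 s, about ten minutes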

The difference in overall network responsiveness with the crude ath9k patch above is amazing, and I've still got a long way to go with it.

This paper: http://www.cs.clemson.edu/~jmarty/papers/PID1154937.pdf
strongly suggests that a maximum of 32 uncontrolled IP buffers be used in the general case (with 1 being an ideal). It also raises some other strong points.

There's plenty of experimental data out there now too, and experiments you can easily perform on your own devices.

http://gettys.wordpress.com/2010/12/02/home-router-puzzle...

Every device driver I have looked at defaults to uncontrolled buffers far in excess of that figure, usually at around 256 buffers, even before you count txqueuelen.

The key to coping with bufferbloat is to reduce "uncontrolled" buffering, starting with the device driver(s) and working up to the various means of traffic shaping, and providing adequate feedback to TCP/IP streams (packet drop/ECN etc.) to keep the servo mechanism(s) working.
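
A minimal sketch of what "controlled" buffering means in practice, assuming nothing fancier than a hard occupancy limit: once the queue holds more than a configured number of packets, further arrivals are dropped (or could be ECN-marked instead), which is exactly the feedback TCP needs in order to slow down.

    from collections import deque

    class ControlledQueue:
        # Toy egress queue: bounded, with tail drop as the congestion signal.
        def __init__(self, limit=32):          # cf. the ~32-buffer figure above
            self.limit = limit
            self.q = deque()
            self.dropped = 0

        def enqueue(self, pkt):
            if len(self.q) >= self.limit:
                self.dropped += 1              # drop (or ECN-mark): the sender backs off
                return False
            self.q.append(pkt)
            return True

        def dequeue(self):
            return self.q.popleft() if self.q else None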

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 26, 2011 3:02 UTC (Wed) by jthill (subscriber, #56558) [Link] (1 responses)

But isn't the advice for a gigabit uplink, not 3Mb? If that's right, I make the uplink latency for a full ath9k queue well under 10ms; they should be able to run flat out. That, or I borked the math.
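
The arithmetic does come out that way (same 507 frames of 1500 bytes as above, and no retries):

    print(507 * 1500 * 8 / 1e9 * 1000, "ms")   # ~6.1 ms to drain the whole queue at 1 Gbit/s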

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 26, 2011 4:11 UTC (Wed) by mtaht (subscriber, #11087) [Link]

In the case of a wireless card in your laptop or a wireless link you can be running at speeds ranging from 300Mbit down to 1Mbit/sec, or less.

In the case of a home gateway, my Comcast business-class service is running at about 3Mbit/sec on the uplink.

Huge dark (unmanaged) buffers in the device hurt latency really badly, not just for TCP/IP but for traffic that would ordinarily jump to the head of the queue (UDP, DNS, VoIP, gaming, NTP...), and in some cases they are so big as to break TCP/IP almost entirely.

We've been sizing device buffers as if they all sat on GigE backbone networks, and we haven't been using reasonable AQM either. I urge you to try the experiments mentioned earlier.

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 28, 2011 6:45 UTC (Fri) by The_Barbarian (guest, #48152) [Link] (1 responses)

Why, I happen to have an ath9k. I'll give this a whirl at some point. Thanks!

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 29, 2011 5:12 UTC (Sat) by mtaht (subscriber, #11087) [Link]

If you are doing OpenWrt development on the ath9k, I have builds with that patch for the wndr5700 and Ubiquiti hardware. I have never been happier to see packet loss in my life.

LCA: Vint Cerf on re-engineering the Internet

Posted Feb 11, 2011 16:43 UTC (Fri) by mcgrof (subscriber, #25917) [Link] (1 responses)

Looking forward to the final upstream patch and respective commit log entry :)

LCA: Vint Cerf on re-engineering the Internet

Posted Feb 11, 2011 17:08 UTC (Fri) by mcgrof (subscriber, #25917) [Link]

On second thought, the available bandwidth for 802.11 changes dynamically with the topology of the 802.11 environment; for an AP it depends on the STAs connected and on their own 802.11 capabilities. So I wonder whether the internal buffers are best adjusted under the influence of rate control, which has a better idea of the average bandwidth to the peers reachable through a given 802.11 interface.

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 26, 2011 21:15 UTC (Wed) by mcoleman (guest, #70990) [Link]

If you're Vint Cerf, you might be thinking that most conference attendees bring their own routers with them (perhaps several). ;-)

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 11:25 UTC (Tue) by marcH (subscriber, #57642) [Link] (1 responses)

Yes, the science of TCP buffers is *unrelated* to other buffers below it, simply because TCP is in charge of everything (reliability, end-to-end flow control, and network congestion avoidance) while the layers below are in charge of none of it.

Confusing these two very different buffering roles (TCP versus below TCP) is a huge mistake.

On the other hand, I wonder why a leaflet was needed at all. Isn't TCP auto-tuning supposed to have fixed this problem already?
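
For reference, Linux receive-side auto-tuning is governed by a handful of sysctls; a quick sketch for checking whether it is enabled and how far it is allowed to grow the buffers (standard /proc/sys paths, shown purely as illustration):

    def sysctl(name):
        with open("/proc/sys/" + name.replace(".", "/")) as f:
            return f.read().strip()

    print("receive auto-tuning (1 = on):", sysctl("net.ipv4.tcp_moderate_rcvbuf"))
    print("tcp_rmem (min default max):  ", sysctl("net.ipv4.tcp_rmem"))
    print("tcp_wmem (min default max):  ", sysctl("net.ipv4.tcp_wmem"))
    # Auto-tuning only grows the buffers up to the "max" column, so on long fat
    # paths that maximum still has to be raised by hand (or by a leaflet).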

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 11:32 UTC (Tue) by marcH (subscriber, #57642) [Link]

> Yes, the science of TCP buffers is *unrelated* to other buffers below it.

I take some of that back. In theory, they are not related. In practice, you can have nasty interactions between the two. In any case, they are totally different beasts.

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 13:35 UTC (Tue) by gmaxwell (guest, #30048) [Link]

> The routers along that pipe need nowhere near that, rather buffering appropriate for the bandwidth and delay of the next hop,

Woah woah there. We're not doing hop by hop congestion control. There is no explicit back-pressure. It's TCP end to end. The routers need enough buffers to "fill the pipe" too, and it's ultimately the TCP sender on the ends that the buffers in the network need to satisfy.

In the degenerate case of a single flow across the whole network, each of the routers absolutely does need the full delay-bandwidth product of buffering that you point out for the host in order to keep the pipe full. This is old knowledge, established by rigorous mathematical analysis, simulation, and real-world experiments; the classic paper on the subject is Villamizar and Song (1994).

More recent research has established that under certain assumptions the amount of router buffering can be greatly reduced: if there are a great many flows, no super-large flows that completely dominate the link on their own, and RTTs that are well distributed across the flows, then the buffer requirement drops to roughly the bandwidth-delay product divided by the square root of the number of flows. More information can be found in the "Sizing Router Buffers" work.
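
A worked example of the two sizing rules; the link rate, RTT, and flow count here are made-up illustrative numbers, not figures from either paper:

    import math

    link_bps = 10_000_000_000   # 10 Gbit/s link
    rtt_s = 0.25                # 250 ms design RTT
    flows = 10_000              # many desynchronized flows sharing the link

    classic = link_bps * rtt_s / 8         # full bandwidth-delay product
    small = classic / math.sqrt(flows)     # BDP / sqrt(n)

    print("classic rule:     %.0f MiB" % (classic / 2**20))   # ~298 MiB
    print("BDP/sqrt(n) rule: %.1f MiB" % (small / 2**20))     # ~3.0 MiB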

In terms of router buffer bloat, I think it's more a combination of buffers far in excess of the delay/bandwidth product from manufacturers building for the worst case (e.g. Aussies with ten gigabit flows), and service providers (and their most demanding customers) being far more concerned about packet loss than about jitter/delay for best-effort traffic. For high-value jitter-sensitive traffic the equipment can always be configured to handle it differently (anything with buffers big enough to cause problems can do differentiated queuing with a strict high-priority queue), but that doesn't help Joe Sixpack on DSL at home.

LCA: Vint Cerf on re-engineering the Internet

Posted Jan 25, 2011 16:38 UTC (Tue) by daniel (guest, #3181) [Link]

"I stand by the recommendations in the leaflet I wrote."

Good for you. I found Cerf's comment about "those trying to help" horribly rude, but not out of character. I also found the talk largely empty of technical content.

Congestion-management subtleties

Posted Jan 25, 2011 19:49 UTC (Tue) by jthill (subscriber, #56558) [Link] (2 responses)

I think I have what might be a helpful contribution here. I spent many years, long ago, fixing and building high-performance networking code in address spaces handling thousands of connections, so I hope it turns out to be worth attention. It might also be something everybody already knows about, but I get the suspicion from the discussion I'm seeing that maybe it isn't a known-and-discarded idea; your reasoning wouldn't have been so easily missed if it were.

So let me start with a slightly artificial example to get all the elements in play: on a router with high-volume TCP endpoints of its own, the TCP buffers need to be kept separate from the routed-packet buffers, because the TCP buffers are needed only for TCP retransmit and shouldn't be allowed to clog the queue for telnet or whatnot.

No need to burden QoS for this: separate and shrink the routed-packet pool, and arrange to have the routed pool ask the TCP pool for another packet shortly before it needs one. That will do the trick automatically, if I have it right. It also occurs to me that, since endpoint TCP has much more information available than any router, it should be able to do a correspondingly better job of prioritizing anyway.

To keep fairness with non-local sources, local TCP gets some simple proportion of the packets in the pool relative to the packets from other sources. Packets going to local TCP never enter the routed pool at all: it's a matter of swapping a full routed-packet buffer for an empty receive-window buffer. TCP can offer ACK packets in return right there.

So, to the payload: even though doing this for local TCP achieves the purpose in that scenario, I don't see any intrinsic reason to do this only for TCP, or only locally.

When the opportunity and need coincide, why not do this kind of coordinated buffer management across links?

This isn't source-quench. The basic idea needs extension to handle more general cases, but start small.

Pick a leaf router, where one link reaches the vast majority of the net and virtually everything reaching it is going to use that link. Use the idle local bandwidth to make the backpressure explicit.

To put it in a way that might horrify some, why not have a congested leaf router convert its downstream links to half-duplex for the duration of the congestion? It's easy: "Ok. Go 'way now, I'm busy". "Gimme a packet". "Ok, send what you like".

Those need acks to avoid throttling in error, but again those are sent on links that should be idle anyway. This is one-hop link management.

Plainly, when life starts to get interesting (i.e. when more than one of the router's links is likely to get congested), the poll should explicitly list congested routing entries. An overbroad (or ignored) choke list would slow some things down unnecessarily, but if the choke is honored at all (and the router sanely reserves one or two packets for each link no matter what) the congestion gets pushed directly to its source.
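
A toy sketch of the poll/grant exchange being described, purely to make the mechanism concrete; the message names and data structures are mine, not part of any real protocol:

    class Neighbour:
        # Downstream router that honours one-hop backpressure from its uplink.
        def __init__(self):
            self.backlog = []              # packets held here while choked
            self.choked_prefixes = set()

        def choke(self, prefixes):         # "Go 'way now, I'm busy" (for these routes)
            self.choked_prefixes = set(prefixes)

        def unchoke(self):                 # "Ok, send what you like"
            self.choked_prefixes.clear()

        def poll(self):                    # "Gimme a packet": release exactly one
            return self.backlog.pop(0) if self.backlog else None

        def send(self, dst_prefix, pkt, uplink):
            if dst_prefix in self.choked_prefixes:
                self.backlog.append((dst_prefix, pkt))   # congestion pushed back toward the source
            else:
                uplink.append((dst_prefix, pkt))         # uncongested destinations still flow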

When you get to nodes where combinations of inbound links are saturating combinations of outbound links I think this starts running out of steam, but as I understand it those aren't the nodes where we're seeing this problem in the first place.

So, thanks for reading, if it's a good idea I don't feel all possessive about it, and either way I'd appreciate feedback.

Congestion-management subtleties

Posted Jan 25, 2011 23:10 UTC (Tue) by ajb (subscriber, #9694) [Link] (1 responses)

Sounds vaguely like backward congestion notification, which is now in data-center grade ethernet: www.ieee802.org/3/ar/public/0505/bergamasco_1_0505.pdf

Congestion-management subtleties

Posted Jan 26, 2011 2:05 UTC (Wed) by jthill (subscriber, #56558) [Link]

Yeah, that's the idea, only IP-aware and not so unselective. As I said, I think this scheme starts running out of steam as you get towards the core; Cisco is saying theirs starts there, where the routers are already too busy to think. If more than a few simple address ranges were included in this scheme's backpressure notifications, I think it'd start getting ugly. For, e.g., intranet border routers, it occurs to me that greenlight ranges (send me what you want for these guys; hang on to traffic for anyplace else) would be simpler.

Fwiw, it seems to me from reading his links that gmaxwell has it right about the seeming contradiction between the results Gettys and Villamizar/Song get: if I recall prices then, the idea of grossly overprovisioning buffers would have seemed insane in 1994. Plus the market was more technical, so there'd have been little reason for the earlier study to examine it.

Some things I like about this notion (I am, of course, completely objective on the subject) are that

  • Unlike Cisco's BCN, the sender can still forward e.g. network control packets (in addition to packets destined for outbound uncongested links, because it knows what those are).
  • Like Cisco's scheme it's incremental. If the congestion is local only, i.e. if the aggregate buffering in the route back to the source is sufficient, the sending TCPs never see it at all—and when they do hear of it, they hear via backpressure from their local router:
    • The pipe is never unnecessarily drained
    • they know why they're not getting ACKs if the jam lasts, so they don't have to retransmit
    • and they can prioritize what to send when polled using every bit of local state
There's more, but they're all even more obvious than these.

Video available

Posted Jan 27, 2011 18:31 UTC (Thu) by dowdle (subscriber, #659) [Link]

Here's a direct link to an ogv video of the presentation:

http://a9.video2.blip.tv/9350007685272/Linuxconfau-Keynot...

