LWN: Comments on "The CHOKe packet scheduler" https://lwn.net/Articles/422477/ This is a special feed containing comments posted to the individual LWN article titled "The CHOKe packet scheduler". en-us Sat, 25 Oct 2025 01:17:55 +0000 Sat, 25 Oct 2025 01:17:55 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net The CHOKe packet scheduler https://lwn.net/Articles/430045/ https://lwn.net/Articles/430045/ jthill <p> <blockquote><i>At best it would halve the worst case throughput of a memory bandwidth bottlenecked system (e.g. 10g+ packet forwarding engines</i></blockquote> That implies that internal bandwidth to packet metadata can be more of a bottleneck than transmission bandwidth to the actual packet data, and that one read and one write to a random packet envelope costs roughly as much as allocating and queuing it did in the first place. <p> So far as the envelope is concerned, that would mean each envelope is written as (some equivalent of) a single cache line. Hauling the envelope back and rewriting a bit would then actually cost the same bandwidth. That seems sensible enough. <p> I don't think full interlock on the update is necessary -- that's what LL/SC is for, yes? Just ignore the result from the conditional store, if it works, great, if not, so what? Just as for a duplicate tag, the inbound packet is still dropped, as it should be, and the goal is to approximate, to "differentially penalize", which this still achieves. <p> It's still really hard for me to imagine metadata bandwidth being more of a bottleneck than actual data bandwidth, though. Mon, 28 Feb 2011 06:19:29 +0000 The CHOKe packet scheduler https://lwn.net/Articles/430044/ https://lwn.net/Articles/430044/ gmaxwell <div class="FormattedComment"> I wonder what bit of misinformation makes people think that simply because a link is fast that it won't be congested. This isn't the case— and fast links tend to be major traffic aggregation points where fairness is a bigger issue, and where insufficient buffering can result in very costly under-utilization.<br> <p> As I mentioned, in the sort of place where you're using a purely software forwarding path and where the design of CHOKe wouldn't be a performance impediment you could also do per-flow queuing which would be considerably more fair than CHOKe (and potentially much better for loss sensitive flows), perhaps falling back to CHOKe/SFB if the flow lookups become too expensive.<br> <p> AFAIK, Linux doesn't even have a true per-flow qdisc though there have been patches and SFQ approximates it with acceptable performance. Can you suggest a case where CHOKe would be needed but the SFQ qdisc in Linux would be inappropriate?<br> <p> <p> </div> Mon, 28 Feb 2011 03:41:56 +0000 The CHOKe packet scheduler https://lwn.net/Articles/430043/ https://lwn.net/Articles/430043/ dlang <div class="FormattedComment"> if you are pumping data out a 10Gb pipe, you are unlikely to need this sort of thing. 
this is needed when the link you are sending out of is the bottleneck of the communications path.<br> <p> allow this to be configured on a per-interface basis and you can disable it on your links where the hardware can barely keep up, but still have it in place where you are going to be buffering the data that you send.<br> </div> Mon, 28 Feb 2011 03:24:29 +0000 The CHOKe packet scheduler https://lwn.net/Articles/430042/ https://lwn.net/Articles/430042/ gmaxwell <div class="FormattedComment"> An additional read-modify-write cycle out to whatever memory is storing the enqueued descriptors, potentially for every packet ingress? And what happens if you hit a don't-transmit packet on your next sample? Stall the pipeline and try again?<br> <p> At best it would halve the worst case throughput of a memory bandwidth bottlenecked system (e.g. 10g+ packet forwarding engines).<br> <p> Compared to RED (just needs a RNG and a fullness counter) and SFB (whose state can be made 'arbitrarily' small without compromising the queue capacity, and so can be kept on chip) this sounds pretty unfriendly.<br> <p> Of course, not an issue on some small software-driven router handling only a few tens of megabits, but most of those devices could also do real flow-aware queuing which would be far more fine-grained than CHOKe.<br> <p> </div> Mon, 28 Feb 2011 03:19:37 +0000 The CHOKe packet scheduler https://lwn.net/Articles/430041/ https://lwn.net/Articles/430041/ jthill <p><blockquote><I>not hardware friendly</i></blockquote> Router hardware is so taut that lighting a don't-transmit-me flag in the packet header is excessive? Mon, 28 Feb 2011 03:05:56 +0000 The CHOKe packet scheduler https://lwn.net/Articles/430036/ https://lwn.net/Articles/430036/ gmaxwell <div class="FormattedComment"> Well, SFB isn't this— SFB uses bloom filters to discriminate flows while this uses sampled queue occupancy. <br> <p> It would be interesting to see the two compared in simulation. I'd expect SFB to be more fair until its hash tables start saturating and then CHOKe to be more fair after that point. <br> <p> It might be interesting to combine them, but I'm not sure what a proper update rule would look like because the CHOKe matches are stochastic. I suppose you'd have an update rule for the blue constants that clocks two different values through a filter depending on the CHOKe match status. In effect, you'd have SFBish behavior until the buckets saturate and then each saturated bucket would behave CHOKe-ish.<br> <p> Unfortunately, as described, CHOKe is not hardware friendly at all. The random queue access can be pipelined (or, better, replaced with a single memory slot remembering a random flow at the time it was enqueued) but I don't see an obvious way to drop both in the case of a match without taking more memory bandwidth than RED or even SFB-like schemes.<br> <p> </div> Mon, 28 Feb 2011 00:39:05 +0000 The CHOKe packet scheduler https://lwn.net/Articles/430010/ https://lwn.net/Articles/430010/ gmaxwell <div class="FormattedComment"> It's absolutely _trivial_ to demonstrate that 1ms is not unconditionally enough.<br> <p> Take a long pipe with several ms of delay. Run a single TCP flow across it. Observe that your flow gets nowhere near line rate, but instead it sawtooths against line rate and leaves the link idle for a significant amount of time.<br> <p> Yes, a single flow is a corner case— but not an outrageous one.
The behavior also holds true for a small number of flows, especially if they experience identical end-to-end delays.<br> <p> So there is the excuse you were missing.<br> <p> </div> Sun, 27 Feb 2011 06:10:41 +0000 The CHOKe packet scheduler https://lwn.net/Articles/424309/ https://lwn.net/Articles/424309/ jnareb I wonder if there is an <b>adaptive</b> variant of CHOKe, equivalent to ARED (Adaptive RED). <p> I wonder if there is a similar algorithm which is not a variant of RED, but of Blue... well, there is SFB. <p> BTW, CHOKe is mentioned in <a rel="nofollow" href="http://www.ee.ust.hk/~heixj/publication/thesis/node37.html">Variants of RED</a> in 'The Earliest Deadline First Scheduling for Real-Time Traffic in the Internet' thesis. Sun, 23 Jan 2011 23:36:53 +0000 QoS not always desirable https://lwn.net/Articles/423314/ https://lwn.net/Articles/423314/ job <div class="FormattedComment"> For many networks it will make more sense to invest in your interfaces than in your processing power. Just a thought.<br> </div> Sun, 16 Jan 2011 22:54:31 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423125/ https://lwn.net/Articles/423125/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; By the way, is the TX ring size in Linux finally adjusted according to the link speed? This should be a very basic thing to implement...</font><br> <p> Nope. (And it's "ring sizes", not "ring size"; every driver has its own TX queue handling code, plus there's the TX queue in the network layer itself.) I do have some patches to auto-tune the iwlwifi driver's queues, but haven't had a chance to benchmark them properly...<br> </div> Fri, 14 Jan 2011 15:31:59 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423084/ https://lwn.net/Articles/423084/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; Your GigE NIC will *NOT* respond to congestion at the wifi link, it'll keep blasting out packets </font><br> <p> For sure the NIC *alone* would, but TCP (or DCCP) does NOT.<br> <p> TCP is ACK-clocked on the bottleneck. Just try it! The name of this feature is "congestion avoidance", google it.<br> <p> <p> <font class="QuotedText">&gt; There's no way for Linux to know just from your local link-speed what the correct size is.</font><br> <p> ethtool eth0<br> <p> <p> <font class="QuotedText">&gt; Ethernets in the past tended to be relatively homogeneous wrt segments, but these days they can be extraordinarily mismatched</font><br> <p> It does not matter: every link needs to adjust according to its *own* speed.<br> <p> </div> Fri, 14 Jan 2011 10:50:21 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423071/ https://lwn.net/Articles/423071/ paulj <div class="FormattedComment"> Over-sized buffers matter greatly, *especially* when they back on to a much slower link. Your GigE NIC will *NOT* respond to congestion at the wifi link, it'll keep blasting out packets (at least, it will seem so from the POV of a much slower wifi link), which will just make the congestion on your wifi link even worse, particularly so given 802.11's malfeature of trying to implement reliability in a low-level link-layer - which just amplifies congestion problems. <br> <p> There's no way for Linux to know just from your local link-speed what the correct size is.
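<p> (As an aside, the "size it from the local link speed" approach marcH suggests above amounts to turning a latency budget into a byte limit for the TX queue. The sketch below only illustrates that arithmetic; the 2 ms budget, the function name and the byte-based limit are assumptions for the example, not what any Linux driver actually implements.)
<pre>
/*
 * Illustrative only: derive a TX queue byte limit from the negotiated
 * link speed and a latency budget.  Not taken from any real driver.
 */
#include <stdint.h>

/* link_speed_mbps as reported by ethtool, e.g. 1000 for GigE. */
static uint32_t tx_queue_limit_bytes(uint32_t link_speed_mbps,
                                     uint32_t target_latency_us)
{
    /* bytes = (Mbit/s * 1e6 / 8) * (us / 1e6) = Mbit/s * us / 8 */
    uint64_t bytes = (uint64_t)link_speed_mbps * target_latency_us / 8;

    /* Never drop below a couple of full-size Ethernet frames. */
    return bytes < 2 * 1514 ? 2 * 1514 : (uint32_t)bytes;
}

/* With a 2000 us budget: 1000 Mbit/s gives 250000 bytes of queue,
 * while a 54 Mbit/s wifi link gives 13500 bytes. */
</pre>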
Ethernets in the past tended to be relatively homogeneous wrt segments, but these days they can be extraordinarily mismatched - with GigE segments often bridged together with 802.11 links whose reliability, latency and bandwidth make you yearn fondly for the thinnet of 20+ years ago.<br> </div> Fri, 14 Jan 2011 09:54:25 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423067/ https://lwn.net/Articles/423067/ marcH <div class="FormattedComment"> I think "in the middle of it" is misleading.<br> <p> A queue size matters only when it becomes a bottleneck. When a queue is empty its maximum size obviously does not matter.<br> <p> If your traffic goes first to a gigabit wire, and then through wifi, the queue size of your gigabit wire will never matter, because the wifi queue will bottleneck first, fill up and drop the packets first.<br> <p> It does not hurt to fine-tune the queue size of every link just in case. But if you only want a quick fix you just need to look at your usual bottlenecks.<br> <p> <p> By the way, is the TX ring size in Linux finally adjusted according to the link speed? This should be a very basic thing to implement...<br> <p> </div> Fri, 14 Jan 2011 09:28:55 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423064/ https://lwn.net/Articles/423064/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; If it is not TCP, yes you are right this might be too drastic and harm throughput. But it's only because your application does not behave.</font><br> <p> By the way, if you need to smooth UDP-like traffic then DCCP has been designed expressly for you (Datagram Congestion Control Protocol).<br> <p> </div> Fri, 14 Jan 2011 09:03:34 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423046/ https://lwn.net/Articles/423046/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; The crux of the problem is that it is not entirely clear what the correct smallest size is. Indeed that optimal size may vary for different deployments. If you make the buffers too small, your router will under-perform - especially in benchmarks in high-bandwidth settings. Making them too large OTOH is unlikely to cost you sales: few people benchmark performance in real-world scenarios, with congestion - except network congestion researchers.</font><br> <p> Agreed that the exact *optimal* size is not clear. However this is not an excuse for unreasonable sizes that harm latency with NO throughput benefit. <br> <p> This research topic is *not* new! This 2004 paper demonstrates that just 1ms (!) is enough: <a href="http://portal.acm.org/citation.cfm?id=1015499">http://portal.acm.org/citation.cfm?id=1015499</a><br> <p> <p> </div> Fri, 14 Jan 2011 06:59:58 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423045/ https://lwn.net/Articles/423045/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; Dropping packets tells the sender to slow down. But in this case, the sender is already sending at the proper speed! You don't want them to reduce throughput, you just want them to smooth out their sending. But dropping packets doesn't tell them to do that, it just tells them to slow down.</font><br> <p> TCP sending is smooth by design, check the literature.<br> <p> If it is not TCP, yes you are right this might be too drastic and harm throughput. But it's only because your application does not behave. And it will preserve latency.
And the bandwidth lost by killing the burst might be reused by better behaved applications.<br> <p> </div> Fri, 14 Jan 2011 06:53:11 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423044/ https://lwn.net/Articles/423044/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; How do you define a burst? If it is more than one packet </font><br> sent without waiting between them, then isn't the window <br> size of a TCP connection its burst size?<br> <p> TCP does not send a receive window size at a time because it is also constantly limited by the congestion window. Sending is then regulated by the reception of ACK (one every two packets).<br> <p> So a (single!) TCP connection is practically never bursty in normal conditions.<br> <p> </div> Fri, 14 Jan 2011 06:48:43 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423020/ https://lwn.net/Articles/423020/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; If you receive 100 ms worth of traffic in a lump and buffer them all then the last packets to go out will suffer a 100 ms extra delay (just on this link!). Unless you do not care a bit about latency, you do not want that. What you want is to drop a large number of these packets so a similar burst does not happen again.</font><br> <p> Dropping packets tells the sender to slow down. But in this case, the sender is already sending at the proper speed! You don't want them to reduce throughput, you just want them to smooth out their sending. But dropping packets doesn't tell them to do that, it just tells them to slow down.<br> <p> Note that if they smoothed out their sends, so you got 1 ms of traffic every 1 ms, then that last packet would just get sent 100 ms later. There's no unnecessary latency being added here.<br> <p> <font class="QuotedText">&gt; I do not see why.</font><br> <p> To tell the truth, I'm not sure either, but I'm quoting Van Jacobson et al (<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.9406">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2...</a>) so I believe it. Perhaps someone will come along shortly to explain better.<br> <p> </div> Fri, 14 Jan 2011 02:20:44 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423004/ https://lwn.net/Articles/423004/ Shewmaker <div class="FormattedComment"> <font class="QuotedText">&gt; Note: since TCP is ACK-clocked, it is not bursty at all.</font><br> <p> How do you define a burst? If it is more than one packet <br> sent without waiting between them, then isn't the window <br> size of a TCP connection its burst size?<br> <p> Of course, a qdisc or something at a lower level may <br> break up a window's worth of packets. Still, saying <br> TCP is not bursty at all doesn't seem accurate.<br> </div> Thu, 13 Jan 2011 22:55:07 +0000 The CHOKe packet scheduler https://lwn.net/Articles/423001/ https://lwn.net/Articles/423001/ paulj <div class="FormattedComment"> Yes, buffer-bloat is just a plain bug. RED and CHOKe are means to tickle sender TCP's congestion control to activate in a more progressive fashion, which is more network friendly and less likely to cause those TCPs to synchronise (i.e. all back off at same time, and all ramp up again at same time) if congestion is applied uniformly to most flows. 
As the queue size increases above the min-threshold, the probability of dropping a newly arrived packet linearly increases, until it reaches 1 at the max-threshold.<br> <p> The problem with fixing buffer-bloat is finding an economic justification for reducing buffers. Other than in quite high-rate routers, memory for buffering is generally cheap and there's little economic incentive to not over-spec buffers. The crux of the problem is that it is not entirely clear what the correct smallest size is. Indeed that optimal size may vary for different deployments. If you make the buffers too small, your router will under-perform - especially in benchmarks in high-bandwidth settings. Making them too large OTOH is unlikely to cost you sales: few people benchmark performance in real-world scenarios, with congestion - except network congestion researchers.<br> </div> Thu, 13 Jan 2011 22:50:41 +0000 The CHOKe packet scheduler https://lwn.net/Articles/422998/ https://lwn.net/Articles/422998/ paulj <div class="FormattedComment"> There are 2 things you might wish to tweak:<br> <p> - the size of the ring buffer used to transfer packets from the kernel to the NIC. See 'ethtool -g ethX' to query it and 'ethtool -G &lt;dev&gt; tx X' to set it<br> <p> - the number of packets the kernel's network layer will queue; it's 1000 by default on ethernet interfaces, which is way too large if your ethernet has a wifi segment in the middle of it. Set this to something lower with 'ip link set &lt;dev&gt; txqueuelen Y'<br> <p> </div> Thu, 13 Jan 2011 22:26:43 +0000 The CHOKe packet scheduler https://lwn.net/Articles/422995/ https://lwn.net/Articles/422995/ marcH <div class="FormattedComment"> If you receive 100 ms worth of traffic in a lump and buffer it all then the last packets to go out will suffer a 100 ms extra delay (just on this link!). Unless you do not care a bit about latency, you do not want that. What you want is to drop a large number of these packets so a similar burst does not happen again.<br> <p> Note: since TCP is ACK-clocked, it is not bursty at all.<br> <p> <font class="QuotedText">&gt; In general, the proper queue size grows with the RTT of the flows,</font><br> <p> I do not see why. The RTT matters only for end-to-end buffers.<br> <p> </div> Thu, 13 Jan 2011 22:07:50 +0000 The CHOKe packet scheduler https://lwn.net/Articles/422974/ https://lwn.net/Articles/422974/ jhhaller <div class="FormattedComment"> The only thing you can control within your network is transmitted packets. If you have much upstream traffic, or significant interference in your wireless network, those are places which could benefit from this scheduler. But, if you have asymmetrical traffic, with most traffic received from your ISP, and have no long transmit queues, your ISP has to do this scheduling on your flows.<br> <p> One could do something similar by emulating a slightly slower link after receipt from your ISP. For example, if you have an 8 Mbps link from your ISP, you could put an artificial link of 7 Mbps out of your router (method for this left as an exercise for the reader). Then, if the queues start growing on that 7 Mbps link, packets could be discarded there. This would tend to prevent your 8 Mbps link from being congested, at the expense of not getting your full 8 Mbps.
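<p> (The "artificial link" left as an exercise above is, mechanically, a token bucket run slightly below the ISP rate, so that queueing and dropping happen in your own router instead of at the ISP. On a Linux router you would more likely build it from the tbf or htb qdiscs with tc; the sketch below only illustrates the mechanism, and the 7 Mbps rate, burst size and function name are made up for the example.)
<pre>
/*
 * Illustrative token bucket: admit traffic at roughly 7 Mbit/s so the
 * backlog forms here, where it can be managed, not at the ISP link.
 */
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define RATE_BPS  (7 * 1000 * 1000 / 8)   /* 7 Mbit/s in bytes per second */
#define BURST     15000                   /* roughly ten full-size frames */

static double tokens = BURST;
static struct timespec last;

/* Returns true if the packet may go out now; otherwise the caller
 * queues or drops it. */
static bool bucket_admit(size_t pkt_len)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    double elapsed = (now.tv_sec - last.tv_sec)
                   + (now.tv_nsec - last.tv_nsec) / 1e9;
    last = now;

    tokens += elapsed * RATE_BPS;         /* refill at the shaped rate */
    if (tokens > BURST)
        tokens = BURST;                   /* cap the allowed burst */

    if (tokens >= pkt_len) {
        tokens -= pkt_len;
        return true;
    }
    return false;
}
</pre>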
But, it would be much better for the ISP to install the CHOKe scheduler, so it would only start dropping packets when the average was over 8 Mbps.<br> </div> Thu, 13 Jan 2011 21:07:41 +0000 The CHOKe packet scheduler https://lwn.net/Articles/422940/ https://lwn.net/Articles/422940/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; Bufferbloat is quite obviously about unreasonable queue sizes (more than 10 milliseconds), while CHOKe or RED should also apply to manage queues of reasonable size</font><br> <p> Bufferbloat is about unreasonable *average* queue sizes; if you receive 100 ms worth of traffic in a lump every 100 ms, then you aren't oversubscribed at all, and the right thing to do is to buffer all of it. (In general, the proper queue size grows with the RTT of the flows, which may be *way* above 10 ms.) So it makes sense to have a reasonably large queue; the point of AQM algorithms is to make sure that this queue is only used to smooth out burstiness, instead of becoming a simple delay line.<br> <p> IIUC.<br> <p> </div> Thu, 13 Jan 2011 19:53:59 +0000 The CHOKe packet scheduler https://lwn.net/Articles/422830/ https://lwn.net/Articles/422830/ Velmont <div class="FormattedComment"> Sooo... Is this something for my local server-and-router for our home LAN? I guess it is way too small, this is more for ISPs, and more central routers? Or is it really part of our problem?<br> <p> Or do I only have to think about bufferbloat? Are there any guides telling me how to fix my Ubuntu Linux-based server/router to not be a part of the bufferbloat problem.<br> </div> Thu, 13 Jan 2011 12:06:00 +0000 The CHOKe packet scheduler https://lwn.net/Articles/422804/ https://lwn.net/Articles/422804/ marcH <div class="FormattedComment"> Yet another brilliant summary, thanks. Two comments.<br> <p> The only congestion signal available to TCP is packet loss for a practical reason: *inter*-networking. Among other design goals, the *Inter*net was meant to connect different type of networks together. Now guess which congestion signal "technology" is supported across every single type of network, even the most primitive ones?<br> Dropping packets is also consistent with the End To End principle, which states the network should be as dumb and stateless as possible for a number of reasons, not the least scalability.<br> <p> I do not think CHOKe is related to bufferbloat. Bufferbloat is quite obviously about unreasonable queue sizes (more than 10 milliseconds), while CHOKe or RED should also apply to manage queues of reasonable size. In other words, bufferbloat is a plain bug while queue management is an optimization. Queues of unreasonable sizes should not just be "managed", they should first of all be made smaller.<br> <p> I suspect that Jim Gettys' opinion might differ on this latter point. Unfortunately his summary writing skills do not seem as good as LWN's and my life is too short.<br> </div> Thu, 13 Jan 2011 10:26:43 +0000
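<p> For reference, the enqueue rule the whole thread is arguing about is small enough to sketch: compare the arriving packet against one randomly sampled queued packet and, on a flow match, penalize the flow by dropping both, here using the "mark it rather than unlink it" form jthill suggests. The code below is only an illustration of that idea, not the Linux sch_choke implementation; the structures, the threshold and the omitted RED-style probabilistic drop are assumptions of the sketch.
<pre>
/* Illustrative CHOKe enqueue decision over a simple ring buffer. */
#include <stdbool.h>
#include <stdlib.h>

struct pkt {
    unsigned int flow_key;   /* e.g. a hash of the 5-tuple */
    bool         dropped;    /* "don't transmit me" flag    */
};

#define QLEN 256
static struct pkt queue[QLEN];
static unsigned int head, count;         /* ring buffer state */
static const unsigned int min_thresh = 32;

/* Returns true if the arriving packet was accepted into the queue. */
static bool choke_enqueue(struct pkt p)
{
    if (count >= QLEN)
        return false;                    /* hard tail drop */

    if (count > min_thresh) {
        /* Compare against one randomly chosen queued packet. */
        struct pkt *victim = &queue[(head + rand() % count) % QLEN];

        if (victim->flow_key == p.flow_key) {
            /* Match: drop both.  The queued packet is only flagged;
             * the dequeue side skips flagged packets, so no unlinking
             * or queue restructuring is needed. */
            victim->dropped = true;
            return false;
        }
        /* No match: a real implementation would now apply a RED-style
         * probabilistic drop; omitted here for brevity. */
    }

    queue[(head + count) % QLEN] = p;
    count++;
    return true;
}
</pre>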