Power impact of debufferbloating

Posted Sep 13, 2011 21:24 UTC (Tue) by njs (guest, #40338)
In reply to: Power impact of debufferbloating by arjan
Parent article: LPC: An update on bufferbloat

The real problem isn't the buffer sizes; there's no intrinsic reason why power management and network latency have to be at odds. This apparent conflict is due to a limitation of the current kernel design, where buffers that are inside device drivers are treated as black holes.

The way network buffering in Linux works is: first a packet enters the "qdisc" buffer in the kernel, which can do all kinds of smart prioritization and what-not. Then it drains from the qdisc into a second buffer in the device driver, which is pure FIFO.

I've experienced 10-15 seconds of latency in this second, "dumb" buffer, with the iwlwifi drivers in real usage. (Not that iwlwifi is special in this regard, and I have terrible radio conditions, but to give you an idea of the magnitude of the problem.) That's a flat 10-15 seconds added to every DNS request, etc. So firstly, that's just way too large, well beyond the point of diminishing returns for power usage benefits. My laptop is sadly unlikely to average 0.1 wakeups/second, no matter what heroic efforts the networking subsystem makes.

*But* even this ridiculous buffer isn't *necessarily* a bad thing. What makes it bad is that my low-priority background rsync job, which is what's filling up that buffer, is blocking high-priority latency-sensitive things like DNS requests. That big buffer would still be fine if we could just stick the DNS packets and ssh packets and stuff at the head of the line whenever they came in, and dropped the occasional carefully-chosen packet.

But, in the kernel we have, prioritization and AQM in general can only be applied to packets that are still in the qdisc. Once they've hit the driver, they're untouchable. So what we want is prioritization, but the only way we can get it is to reduce the device buffers as small as possible, to force packets back up into the qdisc. This is an ugly hack. The real long-term solution is to enhance the drivers so that the AQM can reach in and rearrange packets that have already been queued.
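
A toy model may make the two-stage path concrete. This is only a sketch, not kernel code: the function name, the `driver_slots` parameter, and the packet names are all made up for illustration, and it assumes every packet arrives before the link sends anything.

```python
import heapq
from collections import deque

def transmit_order(packets, driver_slots):
    """Return the order in which packets reach the wire.

    packets: list of (name, priority) in arrival order; lower number = more urgent.
    driver_slots: capacity of the driver FIFO (hypothetical parameter).
    """
    qdisc = []         # priority heap: the "smart" buffer
    driver = deque()   # plain FIFO: the "dumb" buffer
    for seq, (name, prio) in enumerate(packets):
        if len(driver) < driver_slots:
            # The driver has room, so the packet passes straight through
            # the qdisc and can no longer be reordered.
            driver.append(name)
        else:
            # The driver is full, so the packet waits where it can be prioritized.
            heapq.heappush(qdisc, (prio, seq, name))
    wire = []
    while driver:
        wire.append(driver.popleft())
        if qdisc:
            # A slot opened up: hand over the most urgent waiting packet.
            driver.append(heapq.heappop(qdisc)[2])
    return wire

arrivals = [("rsync1", 1), ("rsync2", 1), ("rsync3", 1), ("dns", 0)]
big_buffer = transmit_order(arrivals, driver_slots=4)
small_buffer = transmit_order(arrivals, driver_slots=1)
```

With four driver slots the DNS packet goes out last, behind all the rsync packets. With one slot it jumps ahead of everything still held in the qdisc, which is the effect the "shrink the device buffer" hack relies on.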

This is exactly analogous to the situation with sound drivers, actually. We used to have to pick between huge latency (large audio buffers) and frequent wakeups (small audio buffers). Now good systems fill up a large buffer of audio data ahead of time and then go to sleep, but if something unpredictable happens ("oops, incoming phone call, better play the ring tone!") then we wake up and rewrite that buffer in flight, achieving both good power usage and good latency. We need an equivalent for network packets.


Power impact of debufferbloating

Posted Sep 13, 2011 21:37 UTC (Tue) by njs (guest, #40338) [Link]

...Also I should add that the above is very much talking about the long-term ideal architecture. Right now IIUC everyone's still focusing on figuring out how to make buffers smaller without killing throughput. I think this is basically the wrong question -- we're going to want big buffers anyway -- but no-one's even thinking about power consumption yet AFAICT.

And in the long run, packets will be dropped intelligently according to some control law that tries to maintain fairness and provide useful feedback to senders (i.e., AQM). For now we mostly use the "drop packets when the buffer overflows" rule, which is a terrible control law, but may be less terrible with small buffers than large ones.

So in the long run buffer size doesn't matter, but in the short run it's a single knob that's coupled to a ton of theoretically unrelated issues, and decoupling it is going to be a pain.

Power impact of debufferbloating

Posted Sep 13, 2011 21:55 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

one thing here, the main focus of the bufferbloat project is not the endpoint application, it's the routers in the middle of the path.

to a small extent, the IP stack on your endpoint device can be considered such a router as it aggregates the traffic from all your applications, but the biggest problem of large buffers is where you transition from a fast network to a slow one.

in many cases this is going from the high speed local network in someone's house to the slow DSL link to their ISP

I also think that you are mistaken if you think that any computers send a large number of packets to the network device and then shift to a low power state while the packets are being transmitted. usually shifting to such a low power state ends up powering off the network device as well.

servers almost never do something like this because as soon as they are done working on one request, they start working on another one.

shifting power states is a slow and power hungry process, it's not done frequently and on reasonably busy servers seldom takes place

Power impact of debufferbloating

Posted Sep 13, 2011 21:59 UTC (Tue) by arjan (subscriber, #36785) [Link]

I completely disagree with your last statement.
Servers run C states ALL THE TIME.... and often are "only" 50% to 80% utilized.... and 10% is no exception either...

Power impact of debufferbloating

Posted Sep 13, 2011 22:59 UTC (Tue) by jg (subscriber, #17537) [Link]

It's currently multiple knobs to twist... Sometimes the drivers even lack the knob one might want to twist. Not good.

It isn't clear that the current network stack buffering architecture in Linux is what it needs to be (in fact, pretty clear it needs some serious surgery; and I'm not the person to say what it really needs to be).

Ultimately, we need AQM *everywhere* that "just works". Right now we don't have an algorithm that is known to fit this description.

And yes, I'm very aware that power management folds into this; I've worked on iPAQs as you may or may not remember, from which other handheld Linux devices were at least inspired.

Lots of interesting work to do...

Power impact of debufferbloating

Posted Sep 13, 2011 21:40 UTC (Tue) by arjan (subscriber, #36785) [Link]

very good points, and I totally agree with your statement.

Power impact of debufferbloating

Posted Sep 14, 2011 1:35 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, it's not hard to do better in your situation - just use proactive traffic shaping based on traffic class.

However, it would be nice to do it automatically.

Power impact of debufferbloating

Posted Sep 14, 2011 6:26 UTC (Wed) by njs (guest, #40338) [Link]

You miss the point. Yes, in principle that is the solution. But, right now, it just doesn't work. I turned on traffic shaping, and it had no effect whatsoever. That's because traffic shaping only applies to packets that are in the qdisc buffer. Packets only end up in the qdisc buffer if the device buffer is full, but device buffers are so large that they never fill up unless you either have a very high bandwidth connection, or have made the buffers smaller by hand. After I hacked my kernel to reduce the device buffer sizes, traffic shaping started working, so that makes reducing buffer sizes a reasonable short-term solution, but in the long-term it produces other bad effects like Arjan points out.

Actually, there is one other way to make traffic shaping work -- if you throttle your outgoing bandwidth, then that throttling is applied in between the qdisc and the device buffers, so the device buffers stop filling up. Lots of docs on traffic shaping say that this is a good idea to work around problems with your ISP's queue management, but in fact it's needed just to work around problems within your own kernel's queue management. Also, you can't really do it for wifi since you don't know what the outgoing bandwidth is at any given time, and in the case of 802.11n, this will actually *reduce* that bandwidth because hiding packets from the device driver will screw up its ability to do aggregation.

Power impact of debufferbloating

Posted Sep 14, 2011 14:34 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Uhm. As far as I understand, the qdisc sees all the packets (check http://www.docum.org/docum.org/kptd/). Now, policies for ingress shaping are somewhat lacking in Linux, but egress shaping works perfectly fine.

In my own tests, I was able to segregate heavy BitTorrent traffic from VoIP traffic and get good results on a puny 2MBit line.

I did lose some bandwidth, but not that much.

Power impact of debufferbloating

Posted Sep 14, 2011 16:37 UTC (Wed) by njs (guest, #40338) [Link]

Yes, every packet goes through the qdisc. But here's a typical sequence of events in the situation I'm talking about:
- packet 1 enters qdisc
- packet 1 leaves qdisc
- packet 2 enters qdisc
- packet 2 leaves qdisc
- ...

There isn't much the qdisc can do here, prioritization-wise -- even if packet 2 turns out to be high priority, it can't "take back" packet 1 and send packet 2 first. Packet 1 is already gone.

In your tests you used bandwidth throttling, which moved the chokepoint so that it fell in between the qdisc buffer and your device buffer. You told the qdisc to hold onto packets and only let them dribble out at a fixed rate, and chose that rate so that it would be slower than the rate the device buffer drained. So the device buffer never filled up, and the qdisc actually had multiple packets visible at once and was able to reorder them.
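
To illustrate how throttling moves the chokepoint, here is a small tick-based simulation. It is a sketch under loud assumptions: the link sends exactly one packet per tick, a backlog of bulk packets is already waiting at tick 0, a single DNS packet arrives later, and the shaper is a simple credit counter. None of the names correspond to real kernel interfaces.

```python
import heapq
from collections import deque

def dns_delay_ticks(shaper_rate, bulk_packets=50, dns_arrival=5):
    """Ticks between a DNS packet's arrival and its transmission.

    The link sends one packet per tick from the driver FIFO. The qdisc
    releases at most `shaper_rate` packets per tick into the driver,
    most urgent first (priority 0 beats priority 1).
    """
    qdisc = [(1, i, "bulk") for i in range(bulk_packets)]
    heapq.heapify(qdisc)
    driver = deque()
    credit = 0.0
    tick = 0
    while True:
        if tick == dns_arrival:
            heapq.heappush(qdisc, (0, bulk_packets, "dns"))
        # Shaper: earn credit each tick, capped so idle time cannot build a burst.
        credit = min(credit + shaper_rate, max(1.0, shaper_rate))
        while qdisc and credit >= 1.0:
            driver.append(heapq.heappop(qdisc)[2])
            credit -= 1.0
        # Link: transmit one packet per tick if the driver holds any.
        if driver and driver.popleft() == "dns":
            return tick - dns_arrival
        tick += 1

unshaped = dns_delay_ticks(float("inf"))  # qdisc passes everything straight through
shaped = dns_delay_ticks(0.5)             # qdisc releases slower than the link drains
```

In the unshaped run the whole backlog lands in the driver FIFO at once, so the DNS packet queues behind the 45 bulk packets still waiting there. In the shaped run the backlog stays in the qdisc, where the DNS packet can be picked first, at the price of only using half of the link in this toy setup.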

Power impact of debufferbloating

Posted Sep 29, 2011 0:02 UTC (Thu) by marcH (subscriber, #57642) [Link]

Indeed: you cannot shape traffic unless/until you are the bottleneck. In other words, policing an empty queue does not do much. This makes fixing bufferbloat (or QoS more generally speaking) even more difficult since bottlenecks come and go.

Power impact of debufferbloating

Posted Sep 14, 2011 5:17 UTC (Wed) by russell (subscriber, #10458) [Link]

Sounds complicated. Perhaps an easier way to deal with it is to treat buffers downstream of qdisc as "external" to the system and just drop packets from those buffers if they sit there too long, just the same as if they were transmitted and someone else dropped them.

This would place an absolute upper bound on latency between the app and the wire.
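
A minimal sketch of that idea, assuming the buffer can timestamp packets on the way in and check their age on the way out. The class name, the `max_age_ms` parameter, and the millisecond clock passed in by the caller are all hypothetical.

```python
from collections import deque

class AgeBoundedFifo:
    """FIFO that discards packets which waited longer than max_age_ms."""

    def __init__(self, max_age_ms):
        self.max_age_ms = max_age_ms
        self._queue = deque()
        self.dropped = 0

    def enqueue(self, packet, now_ms):
        # Remember when the packet entered the buffer.
        self._queue.append((now_ms, packet))

    def dequeue(self, now_ms):
        """Return the next packet that is still fresh, or None if there is none."""
        while self._queue:
            enqueued_ms, packet = self._queue.popleft()
            if now_ms - enqueued_ms <= self.max_age_ms:
                return packet
            # Too old: drop it, as if the network had lost it.
            self.dropped += 1
        return None

fifo = AgeBoundedFifo(max_age_ms=10)
fifo.enqueue("a", now_ms=0)
fifo.enqueue("b", now_ms=8)
sent = fifo.dequeue(now_ms=15)  # "a" is 15 ms old and is dropped; "b" is 7 ms old
```

Nothing older than `max_age_ms` is ever transmitted, so the time a packet spends in this buffer is bounded regardless of how deep the buffer is.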

Power impact of debufferbloating

Posted Sep 14, 2011 16:55 UTC (Wed) by njs (guest, #40338) [Link]

That (plus intelligently setting up the buffers so that packets *don't* sit there too long) would be an improvement. Is it good enough? I dunno. Say we set a latency target of 10 ms. That means that sendfile()'s going to incur 100 wakeups/second, which is probably more than we'd like, but maybe acceptable (and maybe we'd need to wake up that often to deal with ACKs anyway). It's also not clear that that's an aggressive enough latency target. For a web server, that's already 10% of Amazon's "100 ms latency = 1% lost sales" guideline. For servers chatting with each other inside a datacenter, I just measured 1/4 of a ms as the average ping between two machines in our cluster, so call it an 80x increase in one-way latency. That seems like a lot, maybe?
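
The back-of-the-envelope numbers above can be reproduced as follows. The assumptions are that the sender wakes up once per latency target to refill the buffer, and that one-way latency is half of the measured ping round trip; the variable names are only for illustration.

```python
latency_target_ms = 10.0
amazon_guideline_ms = 100.0
ping_round_trip_ms = 0.25

# One refill of the buffer per latency target.
wakeups_per_second = 1000.0 / latency_target_ms

# Share of the "100 ms latency = 1% lost sales" budget consumed by the target.
share_of_guideline = latency_target_ms / amazon_guideline_ms

# One-way latency assumed to be half of the ping round trip.
one_way_ms = ping_round_trip_ms / 2
latency_increase = latency_target_ms / one_way_ms
```

This gives 100 wakeups per second, 10% of the guideline, and an 80x increase over the assumed 0.125 ms one-way latency.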

And there are a lot of advantages to picking the *right* packets to drop -- if the packet you drop happens to be DNS, or interactive SSH, or part of a small web page, then you'll cause an immediate user-visible hiccup, and won't get any benefits in terms of reduced contention (like you would if you had dropped a packet from a long-running bulk TCP flow that then backs off). Then again, maybe that's okay, and re-ordering packets that have already been handed off to the driver does sound pretty tricky! (And might require hardware support.)

But it's useful to try and find the "right" solution first, because that way even if you give up on achieving it, at least in plan B you know what you're trying to approximate.

Hardware support

Posted Sep 14, 2011 17:19 UTC (Wed) by dmarti (subscriber, #11625) [Link]

Yes, it would be a win to have hardware that can timestamp packets going into its buffers and drop "stale" ones on the way out instead of transmitting them. (relevant thread on timestamps from the bufferbloat list). Right now, hardware assumes that late is better than never, and TCP would prefer never over too late.

Hardware support

Posted Sep 15, 2011 7:58 UTC (Thu) by johill (subscriber, #25196) [Link]

I'm pretty sure that's possible with a bunch of wireless devices, but I don't know how the timestamps are checked etc. off the top of my head.

Hardware support

Posted Sep 15, 2011 16:06 UTC (Thu) by dmarti (subscriber, #11625) [Link]

That would be useful to see. It looks like the problem of bufferbloat is that packets stay in the buffer until they get stale -- so checking staleness directly, ideally without having to involve the CPU, could be a way to save having to tune the buffer size.

(If you run the café in Grand Central Station, you need to bake a bigger "buffer" of muffins than the low-traffic neighborhood place does. But customers at the two places should get the same muffin freshness as long as both have the same policy of dropping day-old muffins.)

Power impact of debufferbloating

Posted Sep 15, 2011 21:43 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

"So what we want is prioritization, but the only way we can get it is to reduce the device buffers as small as possible, to force packets back up into the qdisc."

Not necessarily. There may be hardware support for queues with differing priority. The mqprio qdisc takes advantage of this.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds