Reading about how smaller (send) buffers are helping... this makes me a little worried.
Bigger (send) buffers are essential to achieve good power management (basically the OS puts a whole load of work in the hardware queue and then goes to sleep for a long time....).
If you end up with a micro-small buffer, you may get better behavior on the wire (that is the part that is clearly interesting, and essential for Jim's work), but it also means the OS will be waking up *all the time*, which has the risk of destroying power management...
Power management matters for mobile devices as well as for servers... and guess which devices are at the edges of the network ;-(
Posted Sep 13, 2011 19:08 UTC (Tue) by josh (subscriber, #17465)
[Link]
That's a tradeoff systems need to make carefully. An endpoint system can save power by incurring more latency, and in theory that doesn't affect any traffic to other systems since we (sadly) don't have widespread use of mesh networking. On the other hand, if you care about latency, you have to use a bit more power. The same tradeoff has appeared in audio and video processing as well.
Power impact of debufferbloating
Posted Sep 13, 2011 19:36 UTC (Tue) by arjan (subscriber, #36785)
[Link]
I'm ok with it being a tradeoff.
What would worry me is if it's not even looked at in the various designs/ideas....
Power impact of debufferbloating
Posted Sep 13, 2011 19:27 UTC (Tue) by Richard_J_Neill (subscriber, #23093)
[Link]
Really? On my Android phone, the CPU is about the least power-hungry part. By a huge margin, the power-hog is the screen's backlight.
BTW, have you ever noticed that web-access on a 3G phone is sometimes really, really, really slow (20 seconds + for a google query, yet simultaneously quite quick for a large webpage load)? This is buffer bloat in action.
Lastly, it's worth pointing out that the application can still buffer all it wants; we are considering the TCP buffering here. If my application wants to send a 1 MB file, there is no reason why the kernel shouldn't give userspace a whole 1 MB of zero-copy buffer. That's outside the TCP endpoints. The problem is that the TCP endpoints themselves need to see timely packet loss, in order to back off slightly on transmit rates.
Power impact of debufferbloating
Posted Sep 13, 2011 19:35 UTC (Tue) by arjan (subscriber, #36785)
[Link]
I think you're mistaken in thinking such activity bursts only impact the CPU.
Memory is equally impacted (going between fully active and self refresh), as are other parts of the chipset. And cpu + memory + chipset do make up a significant part of system power. Sure the screen takes a lot too.. but to get reasonable power you need to look across everything that's significant.
Power impact of debufferbloating
Posted Sep 14, 2011 6:07 UTC (Wed) by ringerc (subscriber, #3071)
[Link]
That *can* be buffer bloat in action. It's just as often to do with the mostly-transparent base station roaming your phone does, its radio power management, the base station's load, the insane complexity of the HSPA+/HSPA/HSDPA/UMTS/GSM stack(s), the endless layers of weird and varied legacy crap that live under them, etc.
My phone will often poke away for a while trying to use a data connection, conclude it's not working, re-establish it, conclude it's still not working, switch to 2G (GPRS) and set that up, roam to a new base station as signal strength varies, try to upgrade to 3G again, fail, and then eventually actually do what I asked.
Realistically, unless you're trying to do a google search while numerous other nearby people are watching TV / streaming video / downloading files / etc on their phones on the same network, it's probably more likely to be regular cellular quirks than bufferbloat.
As for power: Yep, the screen gets the blame for the majority of the power use on Android phones. I can't help suspecting that means "display and GPU" though, simply because of the overwhelming power use. That's a total guess, but if GPU power is attributed under "System" then (a) it's insanely efficient and (b) Apple have invented new kinds of scary magic for their displays to allow them to run brighter, better displays for longer on similar batteries to Android phones.
Power impact of debufferbloating
Posted Sep 14, 2011 7:05 UTC (Wed) by Los__D (guest, #15263)
[Link]
Erm, the only thing that Apple has on Android phones is resolution. E.g. the Galaxy S2's Super AMOLED Plus (damn I hate that name) has a brighter screen with deeper colors.
Power impact of debufferbloating
Posted Sep 14, 2011 14:48 UTC (Wed) by Aissen (subscriber, #59976)
[Link]
OTOH, I've seen many complaints about low brightness and yellowish tint on the S2, maybe there are different versions of the screen?
Power impact of debufferbloating
Posted Sep 14, 2011 8:19 UTC (Wed) by lab (subscriber, #51153)
[Link]
>Really? On my Android phone, the CPU is about the least power-hungry part. By a huge margin, the power-hog is the screen's backlight.
Yes exactly. This is a well-known fact. If you have an AMOLED (variant) screen, everything light/white consumes a lot of power. Black nothing.
>BTW, have you ever noticed that web-access on a 3G phone is sometimes really, really, really slow (20 seconds + for a google query, yet simultaneously quite quick for a large webpage load)? This is buffer bloat in action.
Interesting. I have the same observation, and always thought of it as "the mobile web has huge latency but quite good bandwidth".
Power impact of debufferbloating
Posted Sep 15, 2011 6:44 UTC (Thu) by Cato (subscriber, #7643)
[Link]
Your anecdote about web page load times could equally be due to changed cell usage by other phones, or by congestion on the backhaul link from the cell site to the mobile network's IP aggregation/core network. Without repeated measurements it's not really possible to say what's causing a delay.
Power impact of debufferbloating
Posted Sep 13, 2011 21:24 UTC (Tue) by njs (guest, #40338)
[Link]
The real problem isn't the buffer sizes; there's no intrinsic reason why power management and network latency have to be at odds. This apparent conflict is due to a limitation of the current kernel design, where buffers that are inside device drivers are treated as black holes.
The way network buffering in Linux works is this: first a packet enters the "qdisc" buffer in the kernel, which can do all kinds of smart prioritization and what-not. Then it drains from the qdisc into a second buffer in the device driver, which is pure FIFO.
I've experienced 10-15 seconds of latency in this second, "dumb" buffer, with the iwlwifi drivers in real usage. (Not that iwlwifi is special in this regard, and I have terrible radio conditions, but to give you an idea of the magnitude of the problem.) That's a flat 10-15 seconds added to every DNS request, etc. So firstly, that's just way too large, well beyond the point of diminishing returns for power usage benefits. My laptop is sadly unlikely to average 0.1 wakeups/second, no matter what heroic efforts the networking subsystem makes.
*But* even this ridiculous buffer isn't *necessarily* a bad thing. What makes it bad is that my low-priority background rsync job, which is what's filling up that buffer, is blocking high-priority latency-sensitive things like DNS requests. That big buffer would still be fine if we could just stick the DNS packets and ssh packets and stuff at the head of the line whenever they came in, and dropped the occasional carefully-chosen packet.
But, in the kernel we have, prioritization and AQM in general can only be applied to packets that are still in the qdisc. Once they've hit the driver, they're untouchable. So what we want is prioritization, but the only way we can get it is to reduce the device buffers as small as possible, to force packets back up into the qdisc. This is an ugly hack. The real long-term solution is to enhance the drivers so that the AQM can reach in and rearrange packets that have already been queued.
This is exactly analogous to the situation with sound drivers, actually. We used to have to pick between huge latency (large audio buffers) and frequent wakeups (small audio buffers). Now good systems fill up a large buffer of audio data ahead of time and then go to sleep, but if something unpredictable happens ("oops, incoming phone call, better play the ring tone!") then we wake up and rewrite that buffer in flight, achieving both good power usage and good latency. We need an equivalent for network packets.
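To make the qdisc-versus-driver problem concrete, here is a toy model of the two-stage path. It is only a sketch under simplifying assumptions: the sender is fully backlogged before the link sends anything, the qdisc is a strict priority queue, and `transmit_order` and `device_slots` are made-up names, not kernel interfaces.

```python
import heapq
from collections import deque

def transmit_order(packets, device_slots):
    """Return the order in which packets reach the wire.

    packets: list of (name, priority) in arrival order; lower number = more urgent.
    device_slots: capacity of the driver FIFO in packets (hypothetical unit).
    """
    qdisc = []          # priority heap: the "smart" buffer
    device = deque()    # plain FIFO: the "dumb" driver buffer
    for seq, (name, prio) in enumerate(packets):
        if len(device) < device_slots:
            # Driver has room, so the packet passes straight through the qdisc.
            device.append(name)
        else:
            # Driver is full, so the packet waits where it can still be reordered.
            heapq.heappush(qdisc, (prio, seq, name))
    wire = []
    while device:
        wire.append(device.popleft())              # link sends the head of the FIFO
        if qdisc:                                  # the freed slot is refilled
            device.append(heapq.heappop(qdisc)[2])
    return wire

# Six bulk packets followed by one urgent DNS packet.
packets = [(f"rsync{i}", 1) for i in range(6)] + [("dns", 0)]
print(transmit_order(packets, device_slots=8).index("dns"))
print(transmit_order(packets, device_slots=1).index("dns"))
```

With a driver FIFO big enough to swallow everything, the DNS packet goes out last even though the qdisc would have put it first; with a one-slot FIFO it goes out right after the packet already committed to the driver.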
Power impact of debufferbloating
Posted Sep 13, 2011 21:37 UTC (Tue) by njs (guest, #40338)
[Link]
...Also I should add that the above is very much talking about the long-term ideal architecture. Right now IIUC everyone's still focusing on figuring out how to make buffers smaller without killing throughput. I think this is basically the wrong question -- we're going to want big buffers anyway -- but no-one's even thinking about power consumption yet AFAICT.
And in the long run, packets will be dropped intelligently according to some control law that tries to maintain fairness and provide useful feedback to senders (i.e., AQM). For now we mostly use the "drop packets when the buffer overflows" rule, which is a terrible control law, but may be less terrible with small buffers than large ones.
So in the long run buffer size doesn't matter, but in the short run it's a single knob that's coupled to a ton of theoretically unrelated issues, and decoupling it is going to be a pain.
Power impact of debufferbloating
Posted Sep 13, 2011 21:55 UTC (Tue) by dlang (✭ supporter ✭, #313)
[Link]
one thing here, the main focus of the bufferbloat project is not the endpoint application, it's the routers in the middle of the path.
to a small extent, the IP stack on your endpoint device can be considered such a router as it aggregates the traffic from all your applications, but the biggest problem of large buffers is where you transition from a fast network to a slow one.
in many cases this is going from the high speed local network in someone's house to the slow DSL link to their ISP
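A rough way to see why the fast-to-slow transition hurts: the delay added by a full FIFO buffer is its size divided by the rate of the link it drains into. The buffer size and link rates below are made-up illustrative figures, not measurements from this thread.

```python
def drain_time_seconds(buffer_bytes, link_bits_per_second):
    """Time for a full FIFO buffer to drain onto a link of the given rate."""
    return buffer_bytes * 8 / link_bits_per_second

# Hypothetical figures: the same 256 KiB buffer in front of a
# 1 Mbit/s uplink and in front of a 100 Mbit/s LAN.
buffer_bytes = 256 * 1024
slow = drain_time_seconds(buffer_bytes, 1_000_000)
fast = drain_time_seconds(buffer_bytes, 100_000_000)
print(f"1 Mbit/s uplink: {slow:.2f} s, 100 Mbit/s LAN: {fast * 1000:.1f} ms")
```

The same buffer that adds about 21 ms on the LAN adds about two seconds on the slow link, which is why the bottleneck hop is where oversized buffers show up as latency.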
I also think that you are mistaken if you think that any computers send a large number of packets to the network device and then shift to a low power state while the packets are being transmitted. usually shifting to such a low power state ends up powering off the network device as well.
servers almost never do something like this because as soon as they are done working on one request, they start working on another one.
shifting power states is a slow and power hungry process, it's not done frequently and on reasonably busy servers seldom takes place
Power impact of debufferbloating
Posted Sep 13, 2011 21:59 UTC (Tue) by arjan (subscriber, #36785)
[Link]
I completely disagree with your last statement.
Servers run C states ALL THE TIME.... and often are "only" 50% to 80% utilized.... and 10% is no exception either...
Power impact of debufferbloating
Posted Sep 13, 2011 22:59 UTC (Tue) by jg (subscriber, #17537)
[Link]
It's currently multiple knobs to twist... Sometimes the drivers even lack the knob one might want to twist. Not good.
It isn't clear that the current network stack buffering architecture in Linux is what it needs to be (in fact, pretty clear it needs some serious surgery; and I'm not the person to say what it really needs to be).
Ultimately, we need AQM *everywhere* that "just works". Right now we don't have an algorithm that is known to fit this description.
And yes, I'm very aware that power management folds into this; I've worked on iPAQ's as you may or may not remember, from which other hand held Linux devices were at least inspired.
Lots of interesting work to do...
Power impact of debufferbloating
Posted Sep 13, 2011 21:40 UTC (Tue) by arjan (subscriber, #36785)
[Link]
very good points, and I totally agree with your statement.
Power impact of debufferbloating
Posted Sep 14, 2011 1:35 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
[Link]
Well, it's not hard to do better in your situation - just use proactive traffic shaping based on traffic class.
However, it would be nice to do it automatically.
Power impact of debufferbloating
Posted Sep 14, 2011 6:26 UTC (Wed) by njs (guest, #40338)
[Link]
You miss the point. Yes, in principle that is the solution. But, right now, it just doesn't work. I turned on traffic shaping, and it had no effect whatsoever. That's because traffic shaping only applies to packets that are in the qdisc buffer. Packets only end up in the qdisc buffer if the device buffer is full, but device buffers are so large that they never fill up unless you either have a very high bandwidth connection, or have made the buffers smaller by hand. After I hacked my kernel to reduce the device buffer sizes, traffic shaping started working, so that makes reducing buffer sizes a reasonable short-term solution, but in the long-term it produces other bad effects like Arjan points out.
Actually, there is one other way to make traffic shaping work -- if you throttle your outgoing bandwidth, then that throttling is applied in between the qdisc and the device buffers, so the device buffers stop filling up. Lots of docs on traffic shaping say that this is a good idea to work around problems with your ISP's queue management, but in fact it's needed just to work around problems within your own kernel's queue management. Also, you can't really do it for wifi since you don't know what the outgoing bandwidth is at any given time, and in the case of 802.11n, this will actually *reduce* that bandwidth because hiding packets from the device driver will screw up its ability to do aggregation.
Power impact of debufferbloating
Posted Sep 14, 2011 14:34 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
[Link]
Uhm. As far as I understand, qdisc sees all the packets (check http://www.docum.org/docum.org/kptd/). Now, policies for ingress shaping are somewhat lacking in Linux, but egress shaping works perfectly fine.
In my own tests, I was able to segregate heavy BitTorrent traffic from VoIP traffic and get good results on a puny 2MBit line.
I did lose some bandwidth, but not that much.
Power impact of debufferbloating
Posted Sep 14, 2011 16:37 UTC (Wed) by njs (guest, #40338)
[Link]
Yes, every packet goes through the qdisc. But here's a typical sequence of events in the situation I'm talking about:
- packet 1 enters qdisc
- packet 1 leaves qdisc
- packet 2 enters qdisc
- packet 2 leaves qdisc
- ...
There isn't much the qdisc can do here, prioritization-wise -- even if packet 2 turns out to be high priority, it can't "take back" packet 1 and send packet 2 first. Packet 1 is already gone.
In your tests you used bandwidth throttling, which moved the chokepoint so that it fell in between the qdisc buffer and your device buffer. You told the qdisc to hold onto packets and only let them dribble out at a fixed rate, and chose that rate so that it would be slower than the rate the device buffer drained. So the device buffer never filled up, and the qdisc actually had multiple packets visible at once and was able to reorder them.
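The effect of throttling on where the backlog sits can be sketched with a tick-based toy model. Everything here is an assumption for illustration: the rates are arbitrary packets-per-tick numbers, the device buffer is treated as unbounded to stand in for "very large", and `peak_backlogs` is a hypothetical helper, not a real tool.

```python
def peak_backlogs(ticks, arrivals, drain, release=None):
    """Simulate a bulk sender and return (peak qdisc backlog, peak device backlog).

    arrivals: packets the application queues per tick
    drain:    packets the device puts on the wire per tick
    release:  packets the qdisc may hand to the device per tick (None = no throttle)
    """
    qdisc = device = 0
    peak_qdisc = peak_device = 0
    for _ in range(ticks):
        qdisc += arrivals
        # Without a throttle the qdisc hands over everything immediately.
        moved = qdisc if release is None else min(qdisc, release)
        qdisc -= moved
        device += moved
        peak_qdisc = max(peak_qdisc, qdisc)
        peak_device = max(peak_device, device)
        device -= min(device, drain)     # the link drains the device buffer
    return peak_qdisc, peak_device

print(peak_backlogs(100, arrivals=10, drain=5))             # unthrottled
print(peak_backlogs(100, arrivals=10, drain=5, release=4))  # throttled below drain rate
```

Unthrottled, the qdisc never holds more than it has just been given and the whole backlog piles up in the device buffer. With the release rate set just below the drain rate, the device buffer stays nearly empty and the backlog sits in the qdisc, where it can be reordered.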
Power impact of debufferbloating
Posted Sep 29, 2011 0:02 UTC (Thu) by marcH (subscriber, #57642)
[Link]
Indeed: you cannot shape traffic unless/until you are the bottleneck. In other words, policing an empty queue does not do much. This makes fixing bufferbloat (or QoS more generally speaking) even more difficult since bottlenecks come and go.
Power impact of debufferbloating
Posted Sep 14, 2011 5:17 UTC (Wed) by russell (subscriber, #10458)
[Link]
Sounds complicated. Perhaps an easier way to deal with it is to treat buffers downstream of qdisc as "external" to the system and just drop packets from those buffers if they sit there too long, just the same as if they were transmitted and someone else dropped them.
This would place an absolute upper bound on latency between the app and the wire.
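A minimal sketch of that idea, assuming each packet is timestamped on entry and checked on the way out. `StaleDropQueue` and `max_age` are hypothetical names; no existing driver interface is being described, and the caller supplies the clock (for example `time.monotonic()`).

```python
from collections import deque

class StaleDropQueue:
    """FIFO that discards packets older than max_age seconds at dequeue time."""

    def __init__(self, max_age):
        self.max_age = max_age
        self._items = deque()    # entries are (enqueue_time, packet)
        self.dropped = 0

    def enqueue(self, packet, now):
        self._items.append((now, packet))

    def dequeue(self, now):
        """Return the oldest packet that is still fresh, or None if none is left."""
        while self._items:
            queued_at, packet = self._items.popleft()
            if now - queued_at <= self.max_age:
                return packet
            self.dropped += 1    # treated as if it had been lost on the wire
        return None

q = StaleDropQueue(max_age=0.010)    # 10 ms bound, an arbitrary example value
q.enqueue("old", now=0.000)
q.enqueue("fresh", now=0.008)
print(q.dequeue(now=0.015), q.dropped)
```

Nothing older than `max_age` is ever transmitted, so the time a packet can spend in this buffer is bounded regardless of how many slots the buffer has.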
Power impact of debufferbloating
Posted Sep 14, 2011 16:55 UTC (Wed) by njs (guest, #40338)
[Link]
That (plus intelligently setting up the buffers so that packets *don't* sit there too long) would be an improvement. Is it good enough? I dunno. Say we set a latency target of 10 ms. That means that sendfile()'s going to incur 100 wakeups/second, which is probably more than we'd like, but maybe acceptable (and maybe we'd need to wake up that often to deal with ACKs anyway). It's also not clear that that's an aggressive enough latency target. For a web server, that's already 10% of Amazon's "100 ms latency = 1% lost sales" guideline. For servers chatting with each other inside a datacenter, I just measured 1/4 of a ms as the average ping between two machines in our cluster, so call it an 80x increase in one-way latency. That seems like a lot, maybe?
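The arithmetic behind those figures, spelled out. The one assumption is that the 80x number compares the 10 ms target with half of the measured 1/4 ms ping, i.e. the one-way time.

```python
# All times in microseconds to keep the arithmetic exact.
latency_target_us = 10_000        # the 10 ms target
wakeups_per_second = 1_000_000 / latency_target_us

amazon_guideline_us = 100_000     # "100 ms latency = 1% lost sales"
share_of_guideline = latency_target_us / amazon_guideline_us

ping_round_trip_us = 250          # measured 1/4 ms ping
one_way_us = ping_round_trip_us / 2
ratio_to_one_way = latency_target_us / one_way_us

print(f"{wakeups_per_second:.0f} wakeups/s, "
      f"{share_of_guideline:.0%} of the guideline, "
      f"{ratio_to_one_way:.0f}x the one-way time")
```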
And there are a lot of advantages to picking the *right* packets to drop -- if the packet you drop happens to be DNS, or interactive SSH, or part of a small web page, then you'll cause an immediate user-visible hiccup, and won't get any benefits in terms of reduced contention (like you would if you had dropped a packet from a long-running bulk TCP flow that then backs off). Then again, maybe that's okay, and re-ordering packets that have already been handed off to the driver does sound pretty tricky! (And might require hardware support.)
But it's useful to try and find the "right" solution first, because that way even if you give up on achieving it, at least in plan B you know what you're trying to approximate.
Hardware support
Posted Sep 14, 2011 17:19 UTC (Wed) by dmarti (subscriber, #11625)
[Link]
Yes, it would be a win to have hardware that can timestamp packets going into its buffers and drop "stale" ones on the way out instead of transmitting them. (relevant thread on timestamps from the bufferbloat list). Right now, hardware assumes that late is better than never, and TCP would prefer never over too late.
Hardware support
Posted Sep 15, 2011 7:58 UTC (Thu) by johill (subscriber, #25196)
[Link]
I'm pretty sure that's possible with a bunch of wireless devices, but I don't know how the timestamps are checked etc. off the top of my head.
Hardware support
Posted Sep 15, 2011 16:06 UTC (Thu) by dmarti (subscriber, #11625)
[Link]
That would be useful to see. It looks like the problem of bufferbloat is that packets stay in the buffer until they get stale -- so checking staleness directly, ideally without having to involve the CPU, could be a way to save having to tune the buffer size.
(If you run the café in Grand Central Station, you need to bake a bigger "buffer" of muffins than the low-traffic neighborhood place does. But customers at the two places should get the same muffin freshness as long as both have the same policy of dropping day-old muffins.)
Power impact of debufferbloating
Posted Sep 15, 2011 21:43 UTC (Thu) by BenHutchings (subscriber, #37955)
[Link]
> So what we want is prioritization, but the only way we can get it is to reduce the device buffers as small as possible, to force packets back up into the qdisc.
Not necessarily. There may be hardware support for queues with differing priority. The mqprio qdisc takes advantage of this.
Power impact of debufferbloating
Posted Sep 14, 2011 5:01 UTC (Wed) by butlerm (subscriber, #13312)
[Link]
You have to understand that TCP was not designed to save power, quite the opposite. It was designed to provide a relatively smooth transmission rate without hardware support. To that end it uses ACK clocking where ideally no more than two packets are transmitted for every incoming ACK. That requires TCP level processing for every ACK, which amounts to considerable CPU overhead on a high bandwidth transfer.
If you want to save power there is really only one good way to go about it - hardware support for packet pacing, where the driver instructs the hardware: here is a series of ten packets, schedule for transmission at X microsecond intervals. Adapting that to TCP is kind of a trick, but the point is that the kernel can then go to sleep for a considerably longer time before it has to wake up again to do proper congestion control.
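A sketch of what such a pacing request amounts to, and what it buys in wakeups. This models no real NIC interface; `pacing_schedule` and `wakeups_per_second` are hypothetical helpers, and the 120 microsecond interval is an assumed example (roughly one 1500-byte packet at 100 Mbit/s).

```python
def pacing_schedule(start_us, packet_count, interval_us):
    """Transmit times (in microseconds) for a batch handed to pacing hardware."""
    return [start_us + i * interval_us for i in range(packet_count)]

def wakeups_per_second(interval_us, batch_size):
    """Kernel wakeups needed to keep a paced link busy, one wakeup per batch."""
    return 1_000_000 / (interval_us * batch_size)

schedule = pacing_schedule(start_us=0, packet_count=10, interval_us=120)
print(schedule)
print(f"{wakeups_per_second(120, 1):.0f} wakeups/s when software paces every packet")
print(f"{wakeups_per_second(120, 10):.0f} wakeups/s with ten-packet hardware batches")
```

The spacing on the wire is the same in both cases; only the number of times the kernel has to wake up changes, by a factor equal to the batch size.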
Unfortunately, as far as I know, there are no Ethernet chipsets out there with support for hardware packet pacing. It is a big problem, because the kernel can only respond so fast (and actually get any other work done) to incoming traffic on a high bandwidth interface, so what we get instead is ACK compression where a much longer series of packets gets queued for transmission in a short period of time, swamping network queues and increasing packet loss and jitter.
Power impact of debufferbloating
Posted Sep 15, 2011 21:53 UTC (Thu) by BenHutchings (subscriber, #37955)
[Link]
Solarflare Ethernet controllers support per-queue TX pacing. This was meant primarily as a means to throttle specific processes that are given their own queues, though so far I don't think we've made any use of it. You probably don't put 10G network controllers in a real low-power system though!
In the absence of pacing, it might be more useful to vary TX interrupt moderation (if there are separate interrupts for TX and RX).
Power impact of debufferbloating
Posted Sep 14, 2011 17:39 UTC (Wed) by rgmoore (✭ supporter ✭, #75)
[Link]
Maybe the correct solution to this problem is to have a way of setting the buffer size dynamically. Before you go to sleep, you grow the buffer size because going to sleep is a big flashing sign that latency is not a big worry right now. Shrinking the buffer again when you wake up is likely to be a bigger problem, or at least shrinking it without just throwing away the extra packets you collected in the bigger buffer while you were asleep, but there ought to be some way of doing it. Coming up with good heuristics about how big it ought to be depending on your circumstances (beyond just sleep vs. awake) is probably the hardest part.
Power impact of debufferbloating
Posted Sep 28, 2011 23:21 UTC (Wed) by marcH (subscriber, #57642)
[Link]
> Bigger (send) buffers are essential to achieve good power management
I do not think power management changes the picture. Buffers are required to reduce the frequency of context switches in order to increase performance. Power management is just a not really special case where the "idle task" is taken into consideration.
The ACK-clocked and time-sensitive nature of TCP is totally at odds with the goal above. While TCP attempts to smooth throughput as much as possible, the above is about increasing burstiness in order to increase performance. Go figure.
This is the same good old "throughput versus latency" trade-off here again. Think "asynchronous, non-blocking, caches, buffers, bursts" versus... not.