
Making WiFi fast

Posted Nov 9, 2016 14:13 UTC (Wed) by sourcejedi (guest, #45153)
Parent article: Making WiFi fast

> there is no need for intelligence in the network hardware

And the wheel turns again. (It's making me think of one or two great posts about this point in general, CPU vs. offloads, which I can't find right now.)

> It is meant to hold no more data than can be sent in two "transmission opportunities" (slots in which an aggregate of packets can be transmitted). The fq_codel queue management algorithm was generalized to work well in this setting.

> The goal is to have one aggregated frame in the hardware for transmission, and another one queued, ready to go as soon as the hardware gets to it. Only having two packets queued at this layer may not scale to the very highest data rates, he said, but, in the real world, nobody ever sees those rates anyway.

> There should be a single aggregate under preparation in the mac80211 layer; all other packets should be managed in the (short) per-station queues.

This doesn't seem quite clear.

What controls the length of the per-station queues? You could read this as saying the per-station queues are limited to an aggregate's worth overall, but I'm not sure that's right.

I assume fq_codel is being applied to the per-station queues; that's the only way I can make sense of it, but this doesn't read like that to me.

Ah, the merged code says codel applies a (currently hardcoded) 20ms target. So I assume that's what sizes the per-station queues.
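For concreteness, codel's control law with that 20ms target can be sketched roughly like this (a simplification of my own, not the merged kernel code; all names are mine):

```python
# Rough sketch of CoDel's drop decision (my simplification, not the
# actual kernel code). Packets are timestamped on enqueue; at dequeue,
# if the sojourn time has stayed above TARGET for a full INTERVAL,
# CoDel signals that a drop is warranted.

TARGET = 0.020    # the (currently hardcoded) 20 ms target
INTERVAL = 0.100  # 100 ms sliding window

class CodelState:
    def __init__(self):
        self.first_above_time = None  # when sojourn first exceeded TARGET

    def should_drop(self, enqueue_time, now):
        sojourn = now - enqueue_time
        if sojourn < TARGET:
            self.first_above_time = None   # back below target: reset
            return False
        if self.first_above_time is None:
            # First time above target: arm a timer one INTERVAL out.
            self.first_above_time = now + INTERVAL
            return False
        # Above target continuously for a whole interval: drop.
        return now >= self.first_above_time
```

The point for queue sizing is that sojourn times persistently above the target get beaten back down, so the target effectively bounds the standing queue per station.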

Maybe "The fq_codel queue management algorithm was generalized to work well in this setting." would be better put after "all other packets should be managed in the (short) per-station queues".

(Also, the next step in this effort is to go from round-robin of the station queues to airtime fairness. Yet more awesome.)



Making WiFi fast

Posted Nov 9, 2016 18:04 UTC (Wed) by zlynx (guest, #2285) [Link] (2 responses)

I think you'd want to test quite a bit before trying for airtime fairness. I think the interaction of codel with the two-frame limit will result in fairness even with round-robin: because the slow devices take longer to transmit, they will end up with a much more limited packet queue in software, while the faster devices will have more packets in queue.

I am *guessing* that devices will get airtime fairness "for free."
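As a toy illustration of what's at stake (entirely my own model, not the patches): plain byte-fair round-robin of equal-sized aggregates, with nothing limiting the slow station's queue, lets the slow station dominate the medium, which is exactly what codel pressure or explicit airtime fairness has to correct.

```python
# Toy model (my construction): round-robin of equal-sized aggregates
# between stations at different PHY rates. Airtime per turn is
# bytes / rate, so byte-fairness skews airtime toward slow stations.

def airtime_shares(rates_bps, agg_bytes=64_000, turns=1000):
    """Fraction of total airtime each station gets under round-robin."""
    airtime = [0.0] * len(rates_bps)
    for _ in range(turns):                 # one aggregate per station per round
        for i, rate in enumerate(rates_bps):
            airtime[i] += agg_bytes * 8 / rate
    total = sum(airtime)
    return [t / total for t in airtime]

# A fast station at 300 Mbit/s vs. a slow one at 6 Mbit/s:
shares = airtime_shares([300e6, 6e6])
# The slow station ends up with roughly 98% of the airtime.
```

Whether codel's queue limiting alone closes that gap, versus explicit airtime scheduling, is precisely the thing to test.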

Making WiFi fast

Posted Nov 10, 2016 2:46 UTC (Thu) by mtaht (guest, #11087) [Link] (1 responses)

While we are working on various modifications to fq_codel to make it more robust across a wide range of rates and numbers of stations, the airtime fairness patches we have currently do indeed behave better than what we call the fq-mac version, without an explicit modification to codel.

I certainly welcome more testers (see the links to patches I posted earlier), ideas, and data. In addition to the data on the slides there, we have a large paper on the airtime fairness stuff pending academic review, which I can provide privately if you would like to see it.

Making WiFi fast

Posted Sep 6, 2019 21:58 UTC (Fri) by mtaht (guest, #11087) [Link]

That paper was ultimately published as "Ending the anomaly": the capstone to solving a 16+ year old problem in wifi that nobody, until us, had figured out how to solve.

https://www.usenix.org/system/files/conference/atc17/atc1...

Making WiFi fast

Posted Nov 10, 2016 0:18 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link] (5 responses)

> there is no need for intelligence in the network hardware

And the wheel turns again. (It's making me think of one or two great posts about this point in general, CPU vs. offloads, which I can't find right now.)

It sounds, though, as if this is almost the opposite of the traditional wheel. The traditional wheel is driven by increased complexity requiring an offload processor to take load off the CPU, followed by bringing that same level of complexity back into the CPU to save money when processing power gets cheaper. In this case, though, the process reduces complexity to the point the offload processor is redundant.

Making WiFi fast

Posted Nov 10, 2016 3:15 UTC (Thu) by drag (guest, #31333) [Link] (4 responses)

The trend has always been toward pulling as much functionality as possible out of the rest of the computer and onto the processor die, and out of hardware logic and into software.

It's a cost/performance thing: generally speaking, cost, performance, and reliability improve as the hardware gets dumber and the CPU gets faster. The deal here is that software is much more flexible, much cheaper, and much easier to patch to correct bugs.

That's one of the really big take-home points about Moore's law.

It's true for everything in computing, not just networking. Phone modems had their guts ripped out and became winmodems. I hated winmodems until I learned how to change the software drivers in Linux to get different algorithms and bump up my connection speeds. Then it moved to sound cards, network cards, and hard drive controllers, and now things like software RAID are superior to hardware RAID for most purposes. Even now there isn't really any such thing as '3D acceleration' anymore; instead you just have different types of processor cores that are optimized for graphics workloads, with most of the logic in the 'drivers'.

The problem with networking is that we deal with such small MTU sizes that _sometimes_ you can get better performance by offloading some of it. But for most server purposes turning off all the 'offload features' on network cards isn't a bad idea.

Making WiFi fast

Posted Nov 10, 2016 8:38 UTC (Thu) by Sesse (subscriber, #53779) [Link] (3 responses)

While I agree with most of your point, there really is such a thing as 3D acceleration. Even the most modern of GPUs will have a triangle rasterizer, a texture mapper and a framebuffer blend unit, all of them large fixed-function blocks (well, instantiated lots of times). This is _not_ done by the more CPU-like units (the shader cores), even though those certainly are flexible these days.

You can imagine moving all of these functions up into software, but it doesn't seem to work all that well in practice (witness e.g. Larrabee).

Fixed-function hardware

Posted Nov 10, 2016 10:04 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

It's worth noting that the original Larrabee design was intended to be a pure software system on a massively parallel chip; by the time they got as far as cancelling Larrabee the GPU in favour of Knights Ferry, they'd had to add traditional fixed-function samplers to ensure that the GPU design would be competitive. Something similar applies to CPUs - in some senses, the Cell Broadband Engine SPUs are what you get if you replace a fixed-function L2 cache controller with a software-controlled L2 cache, while weak memory models are what you get if you make software responsible for cache coherency only.

In general, it looks like there's a (movable) happy medium between hardware and software; where the hardware's function is well-understood, and unlikely to change in the next decade (texture samplers, cache controllers, Ethernet checksum handling etc), then it's best as fixed-function hardware. Where there's still debate about what the function should be (not just how fast you can make it), then it's best as programmable hardware (TCP offloads, graphics shaders etc) under the control of software.

Fixed-function hardware

Posted Nov 16, 2016 19:47 UTC (Wed) by mtaht (guest, #11087) [Link] (1 responses)

The core reason for offloading into the hardware firmware is that wifi has to do certain things under very hard realtime constraints that the Linux kernel cannot meet. In other words, it's latency, once again, driving the need for intelligence "down there". From a signal processing perspective, we care about nanoseconds - and there are something like 400+ DSPs on a modern 802.11ac chip. Up from there, the core wifi standards require sub-10µs response times for many operations.

What we showed was that at the higher levels of the wifi stack - at the txop level - Linux is more than responsive enough to fare well in the 500+µs latency range, and we can put a lot more intelligence there, which can make a huge difference in actual network behavior.

I have outlined on my blog multiple ways in which even smarter firmware could do better than we do today, shifting more work back into the core processor instead of the onboard firmware.

There are also multiple ways to do smarter things in the core Linux networking layer, building on top of this work. If made more universal, we can also make a dent in several other nagging problems in wifi, like better routing metrics.

Many of the trials, travails, missteps, and other bugs we've had to fix along the way are in my blog and/or discussed on the make-wifi-fast list.

http://blog.cerowrt.org/post/

There is so much more that can be done to improve wifi! The best document we have on all that is here:

https://docs.google.com/document/d/1Se36svYE1Uzpppe1HWnEy...

Fixed-function hardware

Posted Nov 16, 2016 20:09 UTC (Wed) by mtaht (guest, #11087) [Link]

As a counter-example of how better onboard firmware could cut observed latencies down below what we can achieve by moving more ops into the kernel, see:

http://blog.cerowrt.org/post/a_look_back_at_cerowrt_wifi/

Some chipsets already expose a per-station concept, in particular.

Making WiFi fast

Posted Nov 10, 2016 2:53 UTC (Thu) by mtaht (guest, #11087) [Link] (1 responses)

I'd sent a few nits regarding this section of the article to Jon earlier, as the description is unclear. Let me get to that in another post.

If I said "there is no need for intelligence in the network hardware", I did not mean that. There is plenty of room for more intelligence there - just no need (for non-mu-mimo) for more than two txops of queuing in the onboard memory, firmware, or driver.

It is kind of my hope, actually, that by reducing the max queuing in the wifi chip we can fit more code into the firmware: keeping better statistics, putting in better rate control information, doing saner things with interrupts, presenting a better API, and so on.

Making WiFi fast

Posted Nov 10, 2016 8:58 UTC (Thu) by sourcejedi (guest, #45153) [Link]

I could have quoted more:

> The goal is to have one aggregated frame in the hardware for transmission

- that's all I was really thinking about. If you told this to the original hardware designers, one would expect them to be somewhat surprised. (Or maybe I'm wrong: they'd tell you how ill-suited the network stacks they targeted were for wireless, and that it's about time.)

Making WiFi fast

Posted Nov 10, 2016 3:11 UTC (Thu) by mtaht (guest, #11087) [Link]

The current 20ms target in the mainline merged wifi-fq_codel code is an artifact of a number of other performance problems we were having at the time - notably powersave was broken by the patch set for a long while.

So... we backed off from the more aggressive 5ms default. Most of our recent testing has been against the original 5ms target, with pretty good results. We also reverted to the default quantum of 1514, rather than 300, as the smaller quantum hurt us on CPU on small platforms.
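For readers unfamiliar with the quantum: it is the per-pass byte credit in the deficit round robin scheduler underlying fq_codel. A rough sketch of the mechanism (my own simplification, not the kernel implementation) shows the CPU tradeoff: a smaller quantum gives finer byte-fairness, but a queue holding packets larger than the quantum needs several scheduling passes to accumulate enough deficit, so quantum 300 does far more scheduler work per 1514-byte packet than quantum 1514.

```python
# Sketch of deficit round robin, the mechanism behind fq_codel's
# quantum (my simplified version, not the kernel code). Each pass
# adds `quantum` bytes of credit to a queue's deficit; a packet is
# sent only when the deficit covers it.

from collections import deque

def drr_dequeue_order(queues, quantum):
    """queues: list of deques of packet sizes; returns (queue, size) pairs
    in the order packets would be transmitted."""
    deficits = [0] * len(queues)
    order = []
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                continue
            deficits[i] += quantum          # one pass grants one quantum
            while q and q[0] <= deficits[i]:
                pkt = q.popleft()
                deficits[i] -= pkt
                order.append((i, pkt))
    return order
```

With quantum 1514, a queue of full-size packets sends one per pass; with quantum 300, each full-size packet waits roughly six passes, hence the extra CPU cost on small platforms.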

So, after more stuff lands, it is my hope that we will revert these two changes back to the theoretically more correct target of 5ms (the 5% of the 100ms interval that the target represents). Also, codel currently operates 2-10ms behind the actual packet delivery.

In the talk I suggested shrinking txops more explicitly under contention. There is also the idea of turning codel's drop scheduler off at either a "good sized" aggregate, or an aggregate close to the ideal size for a given station, to move the knee of the curve closer to full utilization; currently it turns off only with a single large packet outstanding. We've also discussed dynamically modifying the target via an EWMA based on the workload and other common delays in the system (contention and interference).

Testing this stuff is HARD! Nobody's ever applied AQM technology to wifi or aggregating MACs before, so far as I know. We will never get a perfect result; the goal is merely to get one that is reasonably good across most rates and most common numbers of stations.

I struggle to rewrite the description. One thing that is not obvious is that the fq_codel implementation for wifi is one very large set of queues for the entire device, with a per-station pointer within it disambiguating things. This was Michal Kazior's innovation; prior to that, I'd been stuck on the idea of a full fq_codel instance created per station, with perhaps 64 queues each (possibly derived from cake's set-associative version of fq_codel). Now there are tons of queues, and if there is a station collision on a given queue, it gets sorted out. Much better than what I'd had in mind!
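A minimal sketch of that idea, as I understand it from the description above (my own construction, not Kazior's actual code): one large array of flow queues shared by the whole device, with each slot tracking which station currently owns it so that cross-station collisions can be detected and sorted out.

```python
# Sketch of a device-wide flow-queue array with per-station ownership
# (my illustration of the design described above, not the real code).

NUM_QUEUES = 4096   # "tons of queues" shared by the entire device

class DeviceFQ:
    def __init__(self):
        self.queues = [[] for _ in range(NUM_QUEUES)]
        self.owner = [None] * NUM_QUEUES   # station currently using each slot

    def enqueue(self, station, flow_key, packet):
        slot = hash(flow_key) % NUM_QUEUES
        if self.owner[slot] is not None and self.owner[slot] != station:
            # Station collision on this queue: in the real design this
            # "gets sorted out"; here we simply let the slot be shared.
            pass
        self.owner[slot] = station
        self.queues[slot].append(packet)
        return slot
```

Compared with a full fq_codel instance per station, the shared array keeps state proportional to the device rather than to the number of stations, at the cost of occasional cross-station queue sharing.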

It was my first talk on the work, mea culpa!

Still working on rephrasing the troublesome bit in the article, give me a few hours.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds