NAPI polling in kernel threads
Once a packet arrives on a network interface, the kernel must usually perform a fair amount of protocol-processing work before the data in that packet can be delivered to the user-space application that is waiting for it. Once upon a time, the network interface would interrupt the CPU when a packet arrived; the kernel would acknowledge the interrupt, then trigger a software interrupt to perform this processing work. The problem with this approach is that, on busy systems, thousands of packets can arrive every second; handling the corresponding thousands of hardware interrupts can run the system into the ground.
The solution to this problem was merged in 2003 in the form of a mechanism that was called, at the time, "new API" or "NAPI". Drivers that support NAPI can disable the packet-reception interrupt most of the time and rely on the network stack to poll for new packets at frequent intervals. Polling may seem inefficient, but on busy systems there will always be new packets by the time the kernel polls for them; the driver can then process all of the waiting packets at once. In this way, one poll can replace dozens of hardware interrupts.
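For readers who want to see where polling fits, here is a minimal sketch of the driver side of NAPI; the mydrv_* names and the private structure are hypothetical stand-ins for device-specific code, while napi_schedule(), napi_gro_receive(), napi_complete_done(), and netif_napi_add() are the kernel's actual interfaces:

```c
/*
 * Minimal sketch of a NAPI-aware driver's receive path. The mydrv_*
 * helpers and struct mydrv_priv are hypothetical stand-ins for
 * device-specific code; the napi_* and netif_napi_add() calls are the
 * kernel's real NAPI interfaces (as of the 5.9/5.10 era).
 */
#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct mydrv_priv {
    struct net_device *netdev;
    struct napi_struct napi;
    /* device registers, RX ring, etc. */
};

/* Hypothetical device-specific helpers, not implemented here. */
static void mydrv_disable_rx_irq(struct mydrv_priv *priv);
static void mydrv_enable_rx_irq(struct mydrv_priv *priv);
static struct sk_buff *mydrv_rx_one(struct mydrv_priv *priv);

/* Hardware interrupt handler: mask further RX interrupts and let NAPI
   do the real work later. */
static irqreturn_t mydrv_interrupt(int irq, void *data)
{
    struct mydrv_priv *priv = data;

    mydrv_disable_rx_irq(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}

/* NAPI poll function: process up to "budget" packets per call. With the
   threaded-NAPI patches this runs in a per-interface kernel thread rather
   than in a software interrupt, but the driver code does not change. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
    struct mydrv_priv *priv = container_of(napi, struct mydrv_priv, napi);
    int work_done = 0;

    while (work_done < budget) {
        struct sk_buff *skb = mydrv_rx_one(priv);

        if (!skb)
            break;
        napi_gro_receive(napi, skb);
        work_done++;
    }

    /* Ring drained: stop polling and re-arm the device interrupt. */
    if (work_done < budget && napi_complete_done(napi, work_done))
        mydrv_enable_rx_irq(priv);

    return work_done;
}

/* At probe time:
 *   netif_napi_add(netdev, &priv->napi, mydrv_poll, NAPI_POLL_WEIGHT);
 *   napi_enable(&priv->napi);
 */
```

The interesting point for the patches discussed below is that none of this driver code has to change; only the context in which mydrv_poll() is invoked does.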
NAPI has evolved considerably since 2003, but one aspect remains the same: it still runs in software-interrupt mode. These interrupts, once queued by the kernel, will be processed at either the next return from a hardware interrupt or the next return from kernel to user mode. They thus run in an essentially random context, stealing time from whatever unrelated process happens to be running at the time. Software interrupts are hard for system administrators to manage and can create surprising latencies if they run for a long time. For this reason, kernel developers have wanted to reduce or eliminate their use for years; they are an old mechanism that is deeply wired into core parts of the kernel, though, and are hard to get rid of.
Wei Wang's patch set (which contains work from Paolo Abeni, Felix Fietkau, and Jakub Kicinski) doesn't eliminate software interrupts, but it is a possible step in that direction. With these patches applied, the kernel can optionally (under administrator control) create a separate kernel thread for each NAPI-enabled network interface. After that, NAPI polling will be done in the context of that thread, rather than in a software interrupt.
The amount of work that needs to be done is essentially unchanged with this patch set, but the change in the way that work is done is significant. Once NAPI polling moves to its own kernel thread, it becomes much more visible and subject to administrator control. A kernel thread can have its priority changed, and it can be bound to a specific set of CPUs; that allows the administrator to adjust how that work is done in relation to the system's user-space workload. Meanwhile, the CPU scheduler will have a better understanding of how much CPU time NAPI polling requires and can avoid overloading the CPUs where it is running. Time spent handling software interrupts, by contrast, is nearly invisible to the scheduler.
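In practice, that control comes down to ordinary scheduler operations on the new threads; taskset(1) and chrt(1) would normally do the job, but the hypothetical helper below shows the same thing with the underlying system calls: it pins a NAPI thread, identified by PID, to a single CPU and gives it a real-time priority (root privileges required).

```c
/*
 * Rough sketch: pin a NAPI kernel thread to one CPU and give it a
 * real-time priority. The thread is identified by PID on the command
 * line; finding it (for example by name under /proc) is left out.
 * Requires root or CAP_SYS_NICE; taskset(1) and chrt(1) do the same job.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <napi-thread-pid> <cpu>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atoi(argv[1]);
    int cpu = atoi(argv[2]);

    /* Bind the thread to the chosen CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(pid, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }

    /* A modest SCHED_FIFO priority: high enough that packet processing
       is not starved, but below any latency-critical application threads. */
    struct sched_param sp = { .sched_priority = 10 };
    if (sched_setscheduler(pid, SCHED_FIFO, &sp)) {
        perror("sched_setscheduler");
        return 1;
    }

    return 0;
}
```

With threaded NAPI, this sort of policy can be applied per interface; nothing comparable is possible when the same work runs in software-interrupt context.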
There aren't a lot of benchmark results posted with the patch set; those that are available indicate a possible slight increase in overhead when the threaded mode is used. Users who process packets at high rates tend to fret over every nanosecond, but even they might find little to quibble about if these results hold. Meanwhile, those users should also see more deterministic scheduling for their user-space code, which is also important.
The networking developers seem to be generally in favor of this work; Eric Dumazet indicated a desire to merge it quickly. This feeling is not unanimous, though; Kicinski, in particular, dislikes the kernel-thread implementation. He believes that better performance can be had by using the kernel's workqueue mechanism for the polling rather than dedicated threads. Dumazet answered that workqueues would not perform well on "server class platforms" and indicated a lack of desire to wait for a new, workqueue-based implementation that might appear at some point in the future.
So it appears that this work will be merged soon; it's late for 5.10, so landing in the 5.11 kernel seems likely. It's worth noting that the threaded mode will remain off by default. Making the best use of it will almost certainly require system tuning to ensure that the NAPI threads are able to run without interfering with the workload; for now, administrators who are unwilling or unable to do that tuning are probably well advised to stick with the default, software-interrupt mode. Software interrupts themselves are still not going away anytime soon, but this work may help in the long-term project of moving away from them.
Posted Oct 9, 2020 20:54 UTC (Fri) by darwi (subscriber, #131202) [Link] (6 responses)
IMHO, this would also be very helpful in making !RT and RT kernels closer...
PREEMPT_RT already runs softirqs in their own kernel threads, so that they can be prioritized and not affect random victims and real-time threads. Maybe, soon, all mainline kernels will be like RT in that regard ;-)
Posted Oct 10, 2020 4:30 UTC (Sat) by alison (subscriber, #63752) [Link] (3 responses)
Indeed, it makes one wonder if a new implementation is needed.
I suppose we'll call this solution the NNAPI.
Posted Oct 11, 2020 17:46 UTC (Sun) by darwi (subscriber, #131202) [Link] (2 responses)
Softirqs are still used in a big number of places beyond networking. See the full list, enum *_SOFTIRQ, at include/linux/interrupt.h
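For reference, that list looks roughly like this in kernels of the 5.9 era; networking owns only the NET_TX and NET_RX entries:

```c
/* From include/linux/interrupt.h, roughly as it looks in 5.9-era kernels;
   only NET_TX_SOFTIRQ and NET_RX_SOFTIRQ belong to the networking stack. */
enum
{
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    IRQ_POLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
    NR_SOFTIRQS
};
```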
Posted Oct 11, 2020 22:16 UTC (Sun) by alison (subscriber, #63752) [Link] (1 responses)
I am aware, but NAPI is used only in networking AFAIK. Thanks for saying "softirqs" rather than "software interrupts": ugh!
Posted Oct 12, 2020 3:40 UTC (Mon) by darwi (subscriber, #131202) [Link]
Yes of course. My point was that RT runs almost all softirqs at kthread/task context, not just NAPI. Thus, RT handles the generic case (almost all softirqs), while the patch set mentioned in the article only handles one of its special cases (NAPI).
Posted Oct 12, 2020 20:01 UTC (Mon) by nevets (subscriber, #11875) [Link] (1 responses)
https://blog.linuxplumbersconf.org/ocw/proposals/53
I may even be able to find my slides somewhere. There was a lot of skepticism about this approach (even from Eric Dumazet), but like threaded interrupts in general, I was confident that this would sooner or later be something that non RT folks would want.
Posted Oct 12, 2020 20:08 UTC (Mon) by nevets (subscriber, #11875) [Link]
Posted Oct 10, 2020 8:40 UTC (Sat) by ncm (guest, #165) [Link] (4 responses)
The user program just needs to poll for updates to this index, and then finish all its work on the packet before it gets overwritten, as little as a few ms later. That work might be just to copy the packet to a bigger ring buffer for other processes to look at under more-relaxed time constraints.
The kernel driver watches its own mapping of such a ring buffer, and copies out packets that processes have expressed interest in to regular buffers to be delivered, or to be processed according to TCP protocol, e.g. to acknowledge, or to run them past the firewall first.
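A toy sketch of that pattern, with the device node, ring layout, and index scheme all invented purely for illustration, might look something like this:

```c
/*
 * Toy illustration of the busy-polling pattern described above; the
 * /dev/fastnic0 node and the ring layout are entirely made up. Real
 * kernel-bypass NICs each define their own mapping, index scheme, and
 * memory-ordering rules.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SLOT_SIZE 2048
#define NUM_SLOTS 1024

struct rx_ring {
    volatile uint64_t write_index;          /* advanced by the NIC */
    uint8_t slots[NUM_SLOTS][SLOT_SIZE];    /* packet data */
};

int main(void)
{
    int fd = open("/dev/fastnic0", O_RDWR); /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    struct rx_ring *ring = mmap(NULL, sizeof(*ring), PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    uint64_t read_index = ring->write_index;
    uint8_t copy[SLOT_SIZE];

    for (;;) {
        /* Spin until the NIC has written at least one new slot. A real
           implementation would use proper memory barriers and perhaps a
           pause instruction here. */
        while (ring->write_index == read_index)
            ;

        /* Copy the packet out before the ring wraps and the slot is
           overwritten, then hand it off to slower consumers. */
        memcpy(copy, ring->slots[read_index % NUM_SLOTS], SLOT_SIZE);
        read_index++;
        /* ... parse or enqueue "copy" here ... */
    }
}
```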
Posted Oct 13, 2020 3:15 UTC (Tue) by marcH (subscriber, #57642) [Link] (3 responses)
Any reason not to mention any specific example(s)?
Posted Oct 13, 2020 20:59 UTC (Tue) by wkudla (subscriber, #116550) [Link] (2 responses)
It's extremely popular in fintech and other latency sensitive fields. I can't wait to get rid of softirqs from my critical CPUs. Those and tasklets are a nightmare when you're trying to reduce platform jitter to the minimum.
Posted Oct 14, 2020 22:38 UTC (Wed) by ncm (guest, #165) [Link] (1 responses)
Each has its own idiosyncratic filtering configuration and ring buffer layout. ExaNIC is unusual in delivering packets 120 bytes at a time, enabling partial processing while the rest of the packet is still coming in.
There are various accommodations to use in VMs, which I have not experimented with.
Keeping the kernel's greedy fingers off of my cores is one of the harder parts of the job. It means lots of custom boot parameter incantations, making deployment to somebody else's equipment a chore. It would be much, much better if the process could simply tell the kernel, "I will not be doing any more system calls, please leave my core completely alone from this point", and have that stick. Such a process does all its subsequent work entirely via mapped memory.
Posted Nov 2, 2020 17:00 UTC (Mon) by immibis (subscriber, #105511) [Link]
Posted Oct 10, 2020 15:41 UTC (Sat) by tpo (subscriber, #25713) [Link]
Posted Oct 11, 2020 12:29 UTC (Sun) by itsmycpu (guest, #139639) [Link]
Posted Oct 24, 2020 1:39 UTC (Sat) by amworsley (subscriber, #82049) [Link]
