NAPI polling in kernel threads
Once a packet arrives on a network interface, the kernel must usually perform a fair amount of protocol-processing work before the data in that packet can be delivered to the user-space application that is waiting for it. Once upon a time, the network interface would interrupt the CPU when a packet arrived; the kernel would acknowledge the interrupt, then trigger a software interrupt to perform this processing work. The problem with this approach is that, on busy systems, thousands of packets can arrive every second; handling the corresponding thousands of hardware interrupts can run the system into the ground.
The solution to this problem was merged in 2003 in the form of a mechanism that was called, at the time, "new API" or "NAPI". Drivers that support NAPI can disable the packet-reception interrupt most of the time and rely on the network stack to poll for new packets at frequent intervals. Polling may seem inefficient, but on busy systems there will always be new packets by the time the kernel polls for them; the driver can then process all of the waiting packets at once. In this way, one poll can replace dozens of hardware interrupts.
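For readers who want to see where polling fits, here is a minimal sketch of the driver side of NAPI; the mydrv_* names and the private structure are hypothetical stand-ins for device-specific code, while napi_schedule(), napi_gro_receive(), napi_complete_done(), and netif_napi_add() are the kernel's actual interfaces:

```c
/*
 * Minimal sketch of a NAPI-aware driver's receive path. The mydrv_*
 * helpers and struct mydrv_priv are hypothetical stand-ins for
 * device-specific code; the napi_* and netif_napi_add() calls are the
 * kernel's real NAPI interfaces (as of the 5.9/5.10 era).
 */
#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct mydrv_priv {
    struct net_device *netdev;
    struct napi_struct napi;
    /* device registers, RX ring, etc. */
};

/* Hypothetical device-specific helpers, not implemented here. */
static void mydrv_disable_rx_irq(struct mydrv_priv *priv);
static void mydrv_enable_rx_irq(struct mydrv_priv *priv);
static struct sk_buff *mydrv_rx_one(struct mydrv_priv *priv);

/* Hardware interrupt handler: mask further RX interrupts and let NAPI
   do the real work later. */
static irqreturn_t mydrv_interrupt(int irq, void *data)
{
    struct mydrv_priv *priv = data;

    mydrv_disable_rx_irq(priv);
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}

/* NAPI poll function: process up to "budget" packets per call. With the
   threaded-NAPI patches this runs in a per-interface kernel thread rather
   than in a software interrupt, but the driver code does not change. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
    struct mydrv_priv *priv = container_of(napi, struct mydrv_priv, napi);
    int work_done = 0;

    while (work_done < budget) {
        struct sk_buff *skb = mydrv_rx_one(priv);

        if (!skb)
            break;
        napi_gro_receive(napi, skb);
        work_done++;
    }

    /* Ring drained: stop polling and re-arm the device interrupt. */
    if (work_done < budget && napi_complete_done(napi, work_done))
        mydrv_enable_rx_irq(priv);

    return work_done;
}

/* At probe time:
 *   netif_napi_add(netdev, &priv->napi, mydrv_poll, NAPI_POLL_WEIGHT);
 *   napi_enable(&priv->napi);
 */
```

The interesting point for the patches discussed below is that none of this driver code has to change; only the context in which mydrv_poll() is invoked does.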
NAPI has evolved considerably since 2003, but one aspect remains the same: it still runs in software-interrupt mode. These interrupts, once queued by the kernel, will be processed at either the next return from a hardware interrupt or the next return from kernel to user mode. They thus run in an essentially random context, stealing time from whatever unrelated process happens to be running at the time. Software interrupts are hard for system administrators to manage and can create surprising latencies if they run for a long time. For this reason, kernel developers have wanted to reduce or eliminate their use for years; they are an old mechanism that is deeply wired into core parts of the kernel, though, and are hard to get rid of.
Wei Wang's patch set (which contains work from Paolo Abeni, Felix Fietkau, and Jakub Kicinski) doesn't eliminate software interrupts, but it is a possible step in that direction. With these patches applied, the kernel can optionally (under administrator control) create a separate kernel thread for each NAPI-enabled network interface. After that, NAPI polling will be done in the context of that thread, rather than in a software interrupt.
The amount of work that needs to be done is essentially unchanged with this patch set, but the change in the way that work is done is significant. Once NAPI polling moves to its own kernel thread, it becomes much more visible and subject to administrator control. A kernel thread can have its priority changed, and it can be bound to a specific set of CPUs; that allows the administrator to adjust how that work is done in relation to the system's user-space workload. Meanwhile, the CPU scheduler will have a better understanding of how much CPU time NAPI polling requires and can avoid overloading the CPUs where it is running. Time spent handling software interrupts, by contrast, is nearly invisible to the scheduler.
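In practice, that control comes down to ordinary scheduler operations on the new threads; taskset(1) and chrt(1) would normally do the job, but the hypothetical helper below shows the same thing with the underlying system calls: it pins a NAPI thread, identified by PID, to a single CPU and gives it a real-time priority (root privileges required).

```c
/*
 * Rough sketch: pin a NAPI kernel thread to one CPU and give it a
 * real-time priority. The thread is identified by PID on the command
 * line; finding it (for example by name under /proc) is left out.
 * Requires root or CAP_SYS_NICE; taskset(1) and chrt(1) do the same job.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <napi-thread-pid> <cpu>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atoi(argv[1]);
    int cpu = atoi(argv[2]);

    /* Bind the thread to the chosen CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(pid, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return 1;
    }

    /* A modest SCHED_FIFO priority: high enough that packet processing
       is not starved, but below any latency-critical application threads. */
    struct sched_param sp = { .sched_priority = 10 };
    if (sched_setscheduler(pid, SCHED_FIFO, &sp)) {
        perror("sched_setscheduler");
        return 1;
    }

    return 0;
}
```

With threaded NAPI, this sort of policy can be applied per interface; nothing comparable is possible when the same work runs in software-interrupt context.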
There aren't a lot of benchmark results posted with the patch set; those that are available indicate a possible slight increase in overhead when the threaded mode is used. Users who process packets at high rates tend to fret over every nanosecond, but even they might find little to quibble about if these results hold. Meanwhile, those users should also see more deterministic scheduling for their user-space code, which is also important.
The networking developers seem to be generally in favor of this work; Eric Dumazet indicated a desire to merge it quickly. This feeling is not unanimous, though; Kicinski, in particular, dislikes the kernel-thread implementation. He believes that better performance can be had by using the kernel's workqueue mechanism for the polling rather than dedicated threads. Dumazet answered that workqueues would not perform well on "server class platforms" and indicated a lack of desire to wait for a new, workqueue-based implementation that might appear at some point in the future.
So it appears that this work will be merged soon; it's late for 5.10, so landing in the 5.11 kernel seems likely. It's worth noting that the threaded mode will remain off by default. Making the best use of it will almost certainly require system tuning to ensure that the NAPI threads are able to run without interfering with the workload; for now, administrators who are unwilling or unable to do that tuning are probably well advised to stick with the default, software-interrupt mode. Software interrupts themselves are still not going away anytime soon, but this work may help in the long-term project of moving away from them.
Posted Oct 9, 2020 20:54 UTC (Fri) by darwi (subscriber, #131202) [Link] (6 responses)
IMHO, this would also be very helpful in making !RT and RT kernels closer...
PREEMPT_RT already runs softirqs in their own kernel threads, so that they can be prioritized and not affect random victims and real-time threads. Maybe, soon, all mainline kernels will be like RT in that regard ;-)
Posted Oct 10, 2020 4:30 UTC (Sat) by alison (subscriber, #63752) [Link] (3 responses)
Indeed, it makes one wonder if a new implementation is needed.
I suppose we'll call this solution the NNAPI.
Posted Oct 11, 2020 17:46 UTC (Sun) by darwi (subscriber, #131202) [Link] (2 responses)
Softirqs are still used in a big number of places beyond networking. See the full list, enum *_SOFTIRQ, at include/linux/interrupt.h
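For reference, that list looks roughly like this in kernels of the 5.9 era; networking owns only the NET_TX and NET_RX entries:

```c
/* From include/linux/interrupt.h, roughly as it looks in 5.9-era kernels;
   only NET_TX_SOFTIRQ and NET_RX_SOFTIRQ belong to the networking stack. */
enum
{
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    IRQ_POLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
    NR_SOFTIRQS
};
```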
Posted Oct 11, 2020 22:16 UTC (Sun) by alison (subscriber, #63752) [Link] (1 responses)
I am aware, but NAPI is used only in networking AFAIK. Thanks for saying "softirqs" rather than "software interrupts": ugh!
Posted Oct 12, 2020 3:40 UTC (Mon) by darwi (subscriber, #131202) [Link]
Yes of course. My point was that RT runs almost all softirqs at kthread/task context, not just NAPI. Thus, RT handles the generic case (almost all softirqs), while the patch set mentioned in the article only handles one of its special cases (NAPI).
Posted Oct 12, 2020 20:01 UTC (Mon) by nevets (subscriber, #11875) [Link] (1 responses)
https://blog.linuxplumbersconf.org/ocw/proposals/53
I may even be able to find my slides somewhere. There was a lot of skepticism about this approach (even from Eric Dumazet), but like threaded interrupts in general, I was confident that this would sooner or later be something that non RT folks would want.
Posted Oct 12, 2020 20:08 UTC (Mon) by nevets (subscriber, #11875) [Link]
Posted Oct 10, 2020 8:40 UTC (Sat) by ncm (guest, #165) [Link] (4 responses)
The user program just needs to poll for updates to this index, and then finish all its work on the packet before it gets overwritten, as little as a few ms later. That work might be just to copy the packet to a bigger ring buffer for other processes to look at under more-relaxed time constraints.
The kernel driver watches its own mapping of such a ring buffer, and copies out packets that processes have expressed interest in to regular buffers to be delivered, or to be processed according to TCP protocol, e.g. to acknowledge, or to run them past the firewall first.
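A toy sketch of that pattern, with the device node, ring layout, and index scheme all invented purely for illustration, might look something like this:

```c
/*
 * Toy illustration of the busy-polling pattern described above; the
 * /dev/fastnic0 node and the ring layout are entirely made up. Real
 * kernel-bypass NICs each define their own mapping, index scheme, and
 * memory-ordering rules.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SLOT_SIZE 2048
#define NUM_SLOTS 1024

struct rx_ring {
    volatile uint64_t write_index;          /* advanced by the NIC */
    uint8_t slots[NUM_SLOTS][SLOT_SIZE];    /* packet data */
};

int main(void)
{
    int fd = open("/dev/fastnic0", O_RDWR); /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    struct rx_ring *ring = mmap(NULL, sizeof(*ring), PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) { perror("mmap"); return 1; }

    uint64_t read_index = ring->write_index;
    uint8_t copy[SLOT_SIZE];

    for (;;) {
        /* Spin until the NIC has written at least one new slot. A real
           implementation would use proper memory barriers and perhaps a
           pause instruction here. */
        while (ring->write_index == read_index)
            ;

        /* Copy the packet out before the ring wraps and the slot is
           overwritten, then hand it off to slower consumers. */
        memcpy(copy, ring->slots[read_index % NUM_SLOTS], SLOT_SIZE);
        read_index++;
        /* ... parse or enqueue "copy" here ... */
    }
}
```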
Posted Oct 13, 2020 3:15 UTC (Tue) by marcH (subscriber, #57642) [Link] (3 responses)
Any reason not to mention any specific example(s)?
Posted Oct 13, 2020 20:59 UTC (Tue) by wkudla (subscriber, #116550) [Link] (2 responses)
It's extremely popular in fintech and other latency sensitive fields. I can't wait to get rid of softirqs from my critical CPUs. Those and tasklets are a nightmare when you're trying to reduce platform jitter to the minimum.
Posted Oct 14, 2020 22:38 UTC (Wed) by ncm (guest, #165) [Link] (1 responses)
Each has its own idiosyncratic filtering configuration and ring buffer layout. ExaNIC is unusual in delivering packets 120 bytes at a time, enabling partial processing while the rest of the packet is still coming in.
There are various accommodations to use in VMs, which I have not experimented with.
Keeping the kernel's greedy fingers off of my cores is one of the harder parts of the job. It means lots of custom boot parameter incantations, making deployment to somebody else's equipment a chore. It would be much, much better if the process could simply tell the kernel, "I will not be doing any more system calls, please leave my core completely alone from this point", and have that stick. Such a process does all its subsequent work entirely via mapped memory.
Posted Nov 2, 2020 17:00 UTC (Mon) by immibis (subscriber, #105511) [Link]
Posted Oct 10, 2020 15:41 UTC (Sat) by tpo (subscriber, #25713) [Link]
Posted Oct 11, 2020 12:29 UTC (Sun) by itsmycpu (guest, #139639) [Link]
Posted Oct 24, 2020 1:39 UTC (Sat) by amworsley (subscriber, #82049) [Link]
