NAPI performance - a weighty matter

Modern network interfaces are easily capable of handling thousands of packets per second. They are also capable of burying the host processor under thousands of interrupts per second. As a way of dealing with the interrupt problem (and fixing some other things as well), the networking hackers added the NAPI driver interface. NAPI-capable drivers can, when traffic gets high, turn off receive interrupts and collect incoming packets in a polling mode. Polling is normally considered to be bad news, but, when there is always data waiting on the interface, it turns out to be the more efficient way to go. Some details on NAPI can be found in this LWN Driver Porting Series article; rather more details are available from the networking chapter in LDD3.
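
As a rough illustration of how that mode switch happens in the 2.6 driver interface, a NAPI-style receive interrupt handler does little more than mask its own receive interrupts and hand the device to the kernel's polling machinery. The sketch below assumes a hypothetical device, with my_disable_rx_irq() standing in for the device-specific register work.

	#include <linux/netdevice.h>
	#include <linux/interrupt.h>

	static void my_disable_rx_irq(struct net_device *dev);	/* hypothetical helper */

	/* Hypothetical 2.6-era receive interrupt handler: mask further receive
	 * interrupts and queue the device for polling by the networking core. */
	static irqreturn_t my_rx_interrupt(int irq, void *dev_id, struct pt_regs *regs)
	{
		struct net_device *dev = dev_id;

		if (netif_rx_schedule_prep(dev)) {
			my_disable_rx_irq(dev);		/* device-specific register write */
			__netif_rx_schedule(dev);	/* poll() will be called from softirq */
		}
		return IRQ_HANDLED;
	}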

One of the things NAPI-compliant drivers must do is to specify the "weight" of each interface. The weight parameter helps to determine how important traffic from that interface is - it limits the number of packets each interface can feed to the networking core in each polling cycle. This parameter also controls whether the interface runs in the polling mode or not; by the NAPI conventions, an interface which does not have enough built-up traffic to fill its quota of packets (where the quota is determined by the interface's weight) should go back to the interrupt-driven mode. The weight is thus a fundamental parameter controlling how packet reception is handled, but there has never been any real guidance from the networking crew on how the weight should be set. Most driver writers pick a value between 16 and 64, with interfaces capable of higher speeds usually setting larger values.
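
To make the quota mechanics concrete, here is a minimal sketch of a poll() method following the conventions described above (the 2.6-era interface; the my_*() helpers stand in for hypothetical device-specific code):

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/* Hypothetical device-specific helpers (not a real driver's API): */
	static int my_rx_pending(struct net_device *dev);
	static struct sk_buff *my_receive_one(struct net_device *dev);
	static void my_enable_rx_irq(struct net_device *dev);

	/* Minimal sketch of a 2.6-era poll() method.  dev->weight, set by the
	 * driver, is what refills dev->quota; *budget is the global limit for
	 * the whole polling cycle. */
	static int my_poll(struct net_device *dev, int *budget)
	{
		int limit = min(dev->quota, *budget);
		int work_done = 0;

		while (work_done < limit && my_rx_pending(dev)) {
			struct sk_buff *skb = my_receive_one(dev);	/* pull one packet */

			netif_receive_skb(skb);				/* feed the stack */
			work_done++;
		}

		*budget -= work_done;
		dev->quota -= work_done;

		if (!my_rx_pending(dev)) {
			/* Quota not filled: by convention, leave polling mode. */
			netif_rx_complete(dev);
			my_enable_rx_irq(dev);
			return 0;
		}
		return 1;	/* ring still has packets; stay on the poll list */
	}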

Some recent discussions on the netdev list have raised the issue of how the weight of an interface should be set. In particular, the e1000 driver hackers have discovered that their interface tends to perform better when its weight is set lower - with the optimal value being around 10. Investigations into this behavior continue, but a few observations have come out; they give a view into what is really required to get top performance out of modern hardware.

One problem, which appears to be specific to the e1000, is that the interface runs out of receive buffers. The e1000 driver, in its poll() function, will deliver its quota of packets to the networking core; only when that process is complete does the driver concern itself with providing more receive buffers to the interface. So one short-term tactic would be to replenish the receive buffers more often. Other interface drivers tend not to wait until an entire quota has been processed to perform this replenishment. Lowering the weight of an interface is one way to force this replenishment to happen more often without actually changing the driver's logic.
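
As a sketch of that alternative (again with hypothetical my_*() helpers), the receive loop can hand descriptors back to the hardware in small batches as it goes, rather than once at the end of the quota:

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	static void my_alloc_rx_buffers(struct net_device *dev, int count);	/* hypothetical */

	/* Variant of the receive loop above that replenishes the RX ring every
	 * 16 packets instead of once per quota, so the hardware never sits on
	 * a pile of empty descriptors while the stack processes packets. */
	static int my_rx_clean(struct net_device *dev, int limit)
	{
		int work_done = 0;

		while (work_done < limit && my_rx_pending(dev)) {
			netif_receive_skb(my_receive_one(dev));
			if ((++work_done & 15) == 0)
				my_alloc_rx_buffers(dev, 16);	/* refill a small batch */
		}
		if (work_done & 15)
			my_alloc_rx_buffers(dev, work_done & 15);	/* refill the rest */
		return work_done;
	}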

But questions remain: why is the system taking so long to process 64 packets that a 256-packet ring is being exhausted? And why does performance increase for smaller weights even when packets are not being dropped? One possible explanation is that the actual amount of work being done for each packet in the networking core can vary greatly depending on the type of traffic being handled. Big TCP streams, in particular, take longer to process than bursts of small UDP packets. So, depending on the workload, processing one quota's worth of packets might take quite some time.

This processing time affects performance in a number of ways. If the system spends large bursts of time in software interrupt mode to deal with incoming packets, it will be starving the actual application for processor time. The overall latency of the system goes up, and performance goes down. Smaller weights can lead to better interleaving of system and application time.

A related issue is this check in the networking core's polling logic:

	if (budget <= 0 || jiffies - start_time > 1)
		goto softnet_break;

Essentially, if the networking core spends more than about one jiffy (a millisecond or two on most systems, where HZ=1000) polling interfaces, it decides that things have gone on for long enough and it's time to take a break. If one high-weight interface is taking a lot of time to get its packets through the system, the packet reception process can be cut short, perhaps before other interfaces have had an opportunity to deal with their traffic. Once again, smaller weights can help to mitigate this problem.
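
For context, the loop containing that check has roughly the following shape (a simplified sketch, not the literal net_rx_action() source):

	#include <linux/netdevice.h>
	#include <linux/list.h>
	#include <linux/jiffies.h>

	/* Simplified sketch of the softirq polling loop: one global packet
	 * budget and one time limit are shared by every device on the list. */
	static void rx_poll_loop(struct list_head *poll_list)
	{
		unsigned long start_time = jiffies;
		int budget = netdev_max_backlog;	/* per-cycle packet budget */

		while (!list_empty(poll_list)) {
			struct net_device *dev;

			/* A single slow, high-weight device can burn the whole
			 * budget or time slice, starving the devices behind it. */
			if (budget <= 0 || jiffies - start_time > 1)
				break;	/* softnet_break: defer the rest to a later run */

			dev = list_entry(poll_list->next, struct net_device, poll_list);
			if (dev->quota <= 0 || dev->poll(dev, &budget)) {
				/* Quota spent or work left over: rotate the device
				 * to the back of the list and refill its quota. */
				list_move_tail(&dev->poll_list, poll_list);
				dev->quota = dev->weight;
			} else {
				/* poll() finished and removed itself from the list
				 * via netif_rx_complete(); just carry on. */
			}
		}
	}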

Finally, an overly large weight can work against the performance of an interface when traffic is at moderate levels. If the driver does not fill its entire quota in one polling cycle, it will turn off polling and go back into interrupt-driven mode. So a steady stream of traffic which does not quite fill the quota will cause the driver to bounce between the polling and interrupt modes, and the processor will have to handle far more interrupts than would otherwise be expected. Slower interfaces (100 Mb/sec and below) are particularly vulnerable to this problem; on a fast system, such interfaces simply cannot receive enough data to fill the quota every time.

From all this information, some conclusions have emerged:

  • There needs to be a smarter way of setting each interface's weight; the current "grab the setting from some other driver" approach does not always yield the right results.

  • The direct tie between an interface's weight and its packet quota is too simple. Each interface's quota should actually be determined, at run time, by the amount of work that interface's packet stream is creating.

  • The quota value should not also be the threshold at which drivers return to interrupt-driven mode. The cost of processor interrupts is high enough that polling mode should be used as long as traffic exists, even when an interface almost never fills its quota.

Changing the code to implement these conclusions is likely to be a long process. Fundamental tweaks in the core of the networking code can lead to strange performance regressions in surprising places. In the meantime, Stephen Hemminger has posted a patch which creates a sysfs knob for the interface weight. That patch has been merged for 2.6.12, so people working on networking performance problems will soon be able to see if adjustable interface weights can be part of the solution.



NAPI performance - a weighty matter

Posted Jun 16, 2005 9:13 UTC (Thu) by Duncan (guest, #6647) [Link]

> The quota value should not also be the threshold
> at which drivers return to interrupt-driven mode.

As soon as I saw that initially described earlier in the article, I /knew/
it was going to cause issues. I'm not a C programmer, yet even /I/ can
see that. Virtually /everything/ from common HVAC thermostats, to
filesystem cache mechanisms, to, one would /expect/, network interface
quotas, works with a high value that triggers the switch into "high
activity" mode, after which it STAYS in the new mode until the sampled
measure falls below the low value (which is somewhere below the high
value), at which point the system switches back to "low activity" mode.
Failure to have such a lo/hi range by using a single threshold value
ALWAYS triggers inefficient bouncing, for /some/ range of possible values,
or the threshold value wouldn't be needed at all as a single mode would be
all that was required. Simple logic, demonstrated in practice over and
over again, in both the real and virtual world.

<shrug> I guess this is demonstration of the open source adage that all
bugs (and their solutions) are shallow to /somebody/, and an open source
solution maximizes the possibility that said "somebody" will be exposed to
the issue.

OTOH, it's likely only clear to me due to the excellent explanation here
in LWN, which of course the original implementors didn't have the
privilege of seeing, since they were writing the code upon which the
explanation was based. <g>

Duncan

NAPI performance - a weighty matter

Posted Jun 17, 2005 12:55 UTC (Fri) by broonie (subscriber, #7078) [Link]

Of course, if a placeholder algorithm works well enough not to notice problems it's likely to stay for a while...

NAPI performance - a weighty matter

Posted Jun 23, 2005 13:12 UTC (Thu) by jonsmirl (guest, #7874) [Link]

The dual threshold triggering you describe is called hysteresis.

NAPI performance - a weighty matter

Posted Jun 17, 2005 0:38 UTC (Fri) by hadi (guest, #13196) [Link]

I just read the conclusion and although I've never posted here before, I think it is misleading, so let me set the record straight - because the entire explanation of the behavior has nothing to do with NAPI ;->

Maybe an analogy would help:
Let's say you had a restaurant that could seat 64 people (the weight). And let's say you also had a queue outside the restaurant that could accommodate up to 256 people (the rx ring size), and any person arriving when the queue was full was stamped by the bouncer to never come back (strange restaurant - it's a bad remake of the Seinfeld soup nazi episode).
Let's also say that this is a strange queue in which, every time the bouncer allows someone into the restaurant, the person behind doesn't move forward to take the empty slot unless the bouncer tells them to.

On Tuesdays, the 2-for-1 day, people arrive a lot faster than they depart the restaurant.

Now let's see what the (e1000 driver) bouncer was doing on that Tuesday:
The bouncer allows 64 people in but doesn't move the queue to fill the empty slots until all 64 are done eating. Because people are coming in faster than they are departing the restaurant, the 256-person queue fills up and people arriving after that are sent away. All the while those 64 empty slots sit there at the front ;-> (i.e. no replenishing is happening).
Let's say we hired a new bouncer who decides to let the queue move forward every time someone goes into the restaurant (meaning one more person can move into the last slot of the queue). Of course, the implicit assumption is that every time someone goes in, it is because someone is done eating (i.e. a packet has been processed). If this mode were followed, then over a given period of time more people would eat at that restaurant, because relatively fewer people would be turned away by the new bouncer.

So as you can see, this really has nothing to do with the seating capacity of the restaurant (the weight); it has to do with how fast people can eat and how fast new ones can come in (assuming the bouncer was doing the right thing to begin with - which the e1000 wasn't).
On Tuesdays people take longer to eat - to improve capacity, we need to figure out why they take that long.

So, on to the conclusions, and to refute them ;->
Bullet 1 of the conclusions:
- Weight (the size of the restaurant) has no effect on this specific issue. Get yourself a smarter bouncer ;-> You are wrong if you think that the smarter bouncer is the one that allows only 10 people into the restaurant on Tuesdays, and 20 on Wednesdays. To reiterate, the smarter bouncer is the one that allows a new person into the restaurant every time someone leaves.

Bullet 2 of the conclusions:
- As a result of the above, quota has nothing to do with how much work the system can handle. It's how fast the customers arrive and how fast they are fed.

Bullet 3 of the conclusions:
- That's exactly what NAPI does already. Interrupts are never re-enabled unless there are absolutely no packets detected as coming in.

Now, on to what the weight and quota are really for:
The drivers which have packets are scheduled using what's known as a Deficit Round Robin (DRR) algorithm to provide packets to the system. This scheme is used to enforce fairness among NICs with incoming packets. If a 10Mbps NIC has packets, it should not be overrun just because a 10Gbps card has more packets to send. The weight is the maximum opportunity that a specific NIC gets, per round, to send packets up the stack. If you want to make one NIC more important than another, you give it a higher weight (which is what Stephen's patch will allow).
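
In code terms, the refill step behind that fairness looks roughly like this (a paraphrase of the scheduling logic, not the literal kernel source):

	/* When a device is rotated to the back of the poll list, its quota is
	 * topped up from its weight; a device that overdrew its quota carries
	 * the deficit forward, so each NIC's long-run share of the stack is
	 * proportional to its weight. */
	if (dev->quota < 0)
		dev->quota += dev->weight;	/* pay back the deficit first */
	else
		dev->quota = dev->weight;	/* fresh allotment for the next round */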

Overall on that thread:
I think the question that needs asking is why people are taking so long in the restaurant. Is it the fact that they don't get their food on time, or is it because they don't get their bills on time? Now that would be a very useful exercise. Unfortunately, the majority of the thread was spent on how to improve NAPI instead.
I think one thing that should have been turned off is connection tracking (conntrack).

I actually don't think that replenishing the descriptors on every packet is the best scheme - but that's an entirely different topic and I have said enough already.
