
Accelerating netfilter with hardware offload, part 1

January 14, 2020

This article was contributed by Marta Rybczyńska

Supporting network protocols at high speeds in pure software is getting increasingly difficult, with 25-100Gb/s interfaces available now and 200-400Gb/s starting to show up. Packet processing at 100Gb/s must happen in 200 cycles or less, which does not leave much room for processing at the operating-system level. Fortunately, some operations can be offloaded to hardware, including checksum verification and parts of the packet send and receive paths.
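The exact budget depends on the assumptions made; as a rough back-of-the-envelope check, assuming a single 3GHz core and full-sized Ethernet frames (1538 bytes on the wire, including preamble and inter-frame gap):

    100Gb/s ÷ (1538 bytes × 8 bits) ≈ 8.1 million packets/second
    3GHz ÷ 8.1M packets/second      ≈ 370 cycles per packet

With minimum-size 64-byte frames (84 bytes on the wire), the rate approaches 149 million packets per second and the budget drops to roughly 20 cycles.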

As modern hardware adds more functionality, new options are becoming available. The 5.3 kernel includes a patch set from Pablo Neira Ayuso that added support for offloading some packet filtering with netfilter. This patch set not only adds the offload support, but also performs a refactoring of the existing offload paths in the generic code and the network card drivers. More work came in the following kernel releases. This seems like a good moment to review the recent advancements in offloading in the network stack.

Offloads in network cards

Let us start with a refresher on the functionality provided by network cards. A network packet passes through a number of hardware blocks before it is handled by the kernel's network stack. It is first received by the physical-layer (PHY) processor, which deals with the low-level aspects: the medium (copper or fiber for Ethernet), frequencies, modulation, and so on. Then it is passed to the medium access control (MAC) block, which copies the packet to system memory, writes a packet descriptor into the receive queue, and possibly raises an interrupt. This allows the device driver to start processing the packet in the network stack.
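Conceptually, the receive-side handshake between the MAC and the driver looks roughly like the sketch below. This is not any particular driver's code; the descriptor layout and names are made up for illustration, and real drivers use the NAPI polling API and device-specific formats.

    #include <stdint.h>

    /* Illustrative receive descriptor, filled in by the MAC via DMA. */
    struct rx_desc {
        uint64_t dma_addr;   /* buffer where the MAC wrote the packet */
        uint16_t length;     /* number of bytes received */
        uint16_t status;     /* RX_DESC_DONE set by hardware when filled */
    };

    #define RX_DESC_DONE 0x1

    /* Called (typically from a NAPI poll routine) after the MAC raises an
     * interrupt: walk the ring and hand completed packets to the stack. */
    static int rx_poll(struct rx_desc *ring, unsigned int ring_size,
                       unsigned int *next, int budget)
    {
        int done = 0;

        while (done < budget && (ring[*next].status & RX_DESC_DONE)) {
            /* ... build an sk_buff from the DMA buffer and pass it up,
             * e.g. with napi_gro_receive(), in a real driver ... */
            ring[*next].status = 0;               /* return the slot */
            *next = (*next + 1) % ring_size;
            done++;
        }
        return done;
    }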

MAC controllers, however, often contain additional logic, such as dedicated processors or FPGAs, that can perform tasks far beyond launching DMA transfers. First, the MAC may support multiple receive queues, which allow packet processing to be spread across different CPUs in the system. It can also sort packets by their source and destination addresses and ports; a group of packets sharing those values is called a "flow" in this context, and different flows can be directed to specific receive queues. This has performance benefits, including better cache usage. Beyond that, MAC blocks can perform actions on flows, such as redirecting them to another network interface (when the same MAC serves multiple interfaces), dropping packets in response to a denial-of-service attack, and so on.
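On NICs whose drivers support it, this kind of flow steering can be requested from user space with ethtool's ntuple filters; the interface name, addresses, and queue numbers below are placeholders, and whether a given rule is accepted depends entirely on the hardware and driver:

    ethtool -K eth0 ntuple on
    # steer one TCP flow to receive queue 2
    ethtool -N eth0 flow-type tcp4 src-ip 192.0.2.1 dst-port 80 action 2
    # drop a flow in hardware
    ethtool -N eth0 flow-type udp4 dst-port 53 action -1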

The hardware behind that functionality includes two blocks that are important for netfilter offload: a parser and a classifier. The parser extracts fields from packets at line speed; it understands a number of network protocols, so that it can handle the packet at multiple layers. It usually extracts both well-known fields (like addresses and port numbers) and software-specified ones. In the second step the classifier uses the information from the parser to perform actions on the packet.

The hardware implementation of those blocks uses a structure called ternary content-addressable memory (TCAM), a special type of memory that uses three values (0, 1 and X) instead of the typical two (0 and 1). The additional X value means "don't care" and, in a comparison operation, it matches both 0 and 1. A typical parser provides a number of TCAM entries, with each entry associated with another region of memory containing actions to perform. That implementation allows the creation of something like regular expressions for packets; each packet is compared in hardware with the available TCAM entries, yielding the index for any matching entries with the actions to perform.
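To illustrate the matching semantics only (not any specific device's interface), here is a minimal software sketch of a value/mask lookup. A zero bit in the entry's mask plays the role of the X ("don't care") value; a real TCAM evaluates all entries in parallel rather than looping over them.

    #include <stdint.h>
    #include <stddef.h>

    /* One TCAM-like entry: 'value' holds the bits to match and 'mask'
     * selects which bits matter (0 = "don't care", the X state). */
    struct tcam_entry {
        uint32_t value;
        uint32_t mask;
        int action;     /* index into a table of actions to perform */
    };

    /* Return the action of the first matching entry, or -1 for no match.
     * Hardware compares every entry at once and reports the match index;
     * this loop only models the matching rule itself. */
    static int tcam_lookup(const struct tcam_entry *table, size_t n,
                           uint32_t key)
    {
        for (size_t i = 0; i < n; i++)
            if ((key & table[i].mask) == (table[i].value & table[i].mask))
                return table[i].action;
        return -1;
    }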

The number of TCAM entries is limited. For example, the controllers in Marvell SoCs like the Armada 70xx and 80xx have a TCAM with 256 entries (covered in a slide set [PDF] from Maxime Chevallier's talk about adding classification-offload support to a network driver at the 2019 Embedded Linux Conference Europe). In comparison, netfilter configurations often include thousands of rules. Clearly, one of the challenges of configuring a controller like this is limiting the number of rules that must be stored in the TCAM. It is also up to the driver to configure the device-specific actions and the different types of classifiers that might be available. The hardware is usually complex, and drivers typically support only a subset of what it can do.

Offload capabilities in MAC controllers can be more sophisticated than that; some implement offloading of the complete TCP stack, in what are called TCP offload engines. Those are not supported by Linux: the code needed to handle them raised many objections from the network-stack maintainers years ago. Instead of supporting TCP offloading, the Linux kernel provides support for specific, mostly stateless offloads.

Interested readers can find the history of the offload development in a paper [PDF] from Jesse Brandeburg and Anjali Singhai Jain, presented at the 2018 Linux Plumbers Conference.

Kernel subsystems with filtering offloads

The core networking subsystem supports a long list of offloads to network devices, including checksumming, scatter/gather processing, segmentation, and more. Readers can view the lists of available and active offload functionality on their machine with:

    ethtool --show-offload <interface>

The lists will differ from one interface to another, depending on the features of the hardware and the associated driver. ethtool also allows those offloads to be configured; the manual page describes some of the available features.
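For example, individual offloads can be turned on or off with the same tool (feature names vary somewhat between drivers, and eth0 is a placeholder):

    # disable TCP segmentation offload, enable generic receive offload
    ethtool --offload eth0 tso off gro on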

The other subsystem making use of hardware offloads is traffic control (tc), configured with the tool of the same name; the tc manual page offers an overview of the available features, in particular the flower classifier, which allows administrators to match packets on flow attributes and act on them. Practical examples of tc use include limiting bandwidth per service or giving priority to some traffic. Interested readers can find more about tc flower offloads in an article [PDF] by Simon Horman presented at NetDev 2.2 in November 2017.
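As a simple illustration, a flower rule can be marked for hardware-only offload with the skip_sw flag; eth0 and the address below are placeholders, and the command fails if the NIC or driver cannot offload the rule:

    tc qdisc add dev eth0 ingress
    tc filter add dev eth0 ingress protocol ip flower skip_sw \
        ip_proto tcp dst_ip 192.0.2.1 dst_port 80 action drop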

Up to this point, filtering offloads were available via both tc and ethtool, and those two features were implemented separately in the kernel. That duplication also meant duplicated work for the authors of network-card drivers, as each offload implementation used different driver callbacks. With the advent of a third subsystem adding offload functionality, the developers started working on common paths; this required refactoring some of the shared code and changing the callbacks to be implemented by the drivers.

Summary

Network packet processing with high-speed interfaces is not an easy task; the number of CPU cycles available to do the work is small. Fortunately, hardware offers offload capabilities that the kernel can use to ease the task. In this article we have provided an overview of how a network card works and some offload basics. This lays the foundation for the second part, which will look into the details of the changes brought by the netfilter offload functionality, both in the common code and in what driver authors must implement, along with how to use the netfilter offloads, of course.





Ternary Computing

Posted Jan 14, 2020 21:26 UTC (Tue) by jccleaver (guest, #127418) [Link] (7 responses)

I had no idea ternary memory and logic was being used at that low level. Brings to mind the aborted attempts to actually run full ternary computers (https://en.wikipedia.org/wiki/Ternary_computer) up into the 1970s.

Surprised to see that logic used there (although the 1/0/NULL of SQL is another example of modern usage) -- I wonder if ternary silicon is an area of research for this hardware.

Ternary Computing

Posted Jan 14, 2020 21:48 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

In TCAM the third level is more like a wildcard for addressing.

But there are many other silicon devices that use multiple levels, like MLC flash cells. After all, the world is analog.

Ternary Computing

Posted Jan 15, 2020 22:27 UTC (Wed) by leromarinvit (subscriber, #56850) [Link] (5 responses)

At least Wikipedia (the world's One True Single Source Of Truth, obviously) says it's typically implemented with a second bit rather than relatively exotic multi-level logic. That would have been my gut feeling as well, that designers would rather avoid complicating their design and process for saving what's essentially peanuts in transistor count.

Also, this is SRAM. MLC flash works by storing different charge levels in the cell. The closest equivalent I can think of for SRAM would be different voltages - more or less impossible to achieve using a single supply, without first generating a second voltage from that. Which wastes chip area and power for no real gain, making the two-bit solution look even better in comparison.

Ternary Computing

Posted Jan 15, 2020 22:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

For MLCs it's implemented as a true multi-level device. It basically uses different charge levels to encode different bit combinations.

Ternary Computing

Posted Jan 15, 2020 23:51 UTC (Wed) by Sesse (subscriber, #53779) [Link] (2 responses)

TCAM may be peanuts in a NIC with 256 entries, but not in a large switch/router. It's how switches manage to do route lookups in wirespeed; you have one entry (IIRC, typically 192 TCAM bits for matching, well, various stuff) per route, and then something like 512k routes. More in modern devices, now that the IPv4 routing table is larger than that… so think 1M routes, 192 bits for each, so now you have 192M SRAM cells and comparators to run in parallel! And each line card has the same amount! So if you could somehow design those with exotic logic instead of two bits, it would be a win.

Someone once described TCAM to me as “the stuff you upgrade in your router, and then the power bill goes up”.

Ternary Computing

Posted Jan 16, 2020 0:02 UTC (Thu) by leromarinvit (subscriber, #56850) [Link] (1 responses)

Interesting perspective, thanks! That is indeed a lot of SRAM.

Ternary Computing

Posted Jan 16, 2020 8:30 UTC (Thu) by leromarinvit (subscriber, #56850) [Link]

On second thought, maybe if you can implement TCAM with DRAM, you could get the X state by charging the capacitor a little less (shorter / via a higher resistance path). Then design the comparator such that it accepts both 0 and 1 if the other input is in this "middle band". If the refresh cycle is fast enough that a 1 won't decay into an X (or an X into a 0), then maybe this could work.

But I'm sure people much smarter than me have tried to optimize TCAM for many years, and are already using ideas much better than I can think of, so I'll stop now.

Ternary Computing

Posted Jan 31, 2020 21:07 UTC (Fri) by brouhaha (subscriber, #1698) [Link]

> At least Wikipedia (the world's One True Single Source Of Truth, obviously) says it's typically implemented with a second bit rather than relatively exotic multi-level logic.
The TCAM used in network switches, routers etc. definitely works that way, storing the ternary values as two bits each. It is ternary in the same sense that BCD is decimal; both are encoded using only binary digits. A TCAM cell is effectively much more than twice the size of a normal SRAM cell because it also contains the comparator logic. This is one reason why TCAM chips are orders of magnitude more expensive than an equivalent amount of SRAM.

It would be possible to build SRAM using multilevel cells, but most likely that would result in larger and slower memory than using binary.

On the other hand, two-bit-per-cell masked ROM technology exists. Each cell has transistors chosen from four transistor sizes resulting in four possible on-state resistances. Reading from it works the same way as MLC flash; the sense amplifier feeds analog comparators to distinguish the levels. The microcode of the original Intel 8087 numeric coprocessor was stored in two-bit-per-cell masked ROM.

200 cycles or less

Posted Jan 15, 2020 16:39 UTC (Wed) by ale2018 (guest, #128727) [Link] (5 responses)

Those 200 cycles seem to be a very short timeframe to do something useful. Can one implement a firewall, querying a database on some packets? Using FPGAs??

Except for routers, to be able to communicate faster than one can think sounds nonsensical. Something like arriving before leaving...?

200 cycles or less

Posted Jan 15, 2020 19:00 UTC (Wed) by hkario (subscriber, #94864) [Link] (2 responses)

remember that those 200 cycles are for processing the header only; the length of the payload doesn't matter (here it's averaged over typical frame sizes)

it's just like navigation: handling a 20t truck is in principle no different from handling a 3.5t truck

200 cycles or less

Posted Jan 15, 2020 22:45 UTC (Wed) by leromarinvit (subscriber, #56850) [Link]

I think the 200 cycles number is just meant as a reminder that it's "not much" time per packet. The linked article seems to be talking about a single 3 GHz CPU. Obviously the available cycles vary with average packet length and CPU clock, and processing can be split over multiple cores. That is, of course, no reason not to try making the best use of the available cycles, since latency will suffer if you just rely on parallelism to stem the load.

200 cycles or less

Posted Jan 17, 2020 6:55 UTC (Fri) by ghane (guest, #1805) [Link]

Thanks, I will apply for my licence today :-)

200 cycles or less

Posted Jan 27, 2020 14:23 UTC (Mon) by robbe (guest, #16131) [Link] (1 responses)

The 200-cycle number is really a ballpark figure and should not be taken for granted. A simple optimisation is to use mainly jumbo frames, which raises your per-packet budget to more than 1000 cycles.

I also think that no OS achieves CPU-involved forwarding speeds of even 10Gbps without a lot of NIC offloading (coalescing, TSO, etc.)

200 cycles or less

Posted Jan 27, 2020 17:47 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

You can easily get more than 1 million packets per second on general-purpose CPUs without any offloading; that works out to more than 10Gbps.

Accelerating netfilter with hardware offload, part 1

Posted Jan 16, 2020 18:50 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

TCAM is necessary for the most general filter matching, but somewhat more restricted packet filtering can be done using a hash table with open addressing.

