
The return of nftables

By Jonathan Corbet
August 20, 2013
Some ideas take longer than others to find their way into the mainline kernel. The network firewalling mechanism known as "nftables" would be a case in point. Much of this work was done in 2009; despite showing a lot of promise at the time, the work languished for years afterward. But, now, there would appear to be a critical mass of developers working on nftables, and we may well see it merged in the relatively near future.

A firewall works by testing a packet against a chain of one or more rules. Any of those rules may decide that the packet is to be accepted or rejected, or it may defer judgment for subsequent rules. Rules may include tests that take forms like "which TCP port is this packet destined for?", "is the source IP address on a trusted network?", or "is this packet associated with a known, open connection?", for example. Since the tests applied to packets are expressed in networking terms (ports, IP addresses, etc.), the code that implements the firewall subsystem ("netfilter") has traditionally contained a great deal of protocol awareness. In fact, this awareness is built so deeply into the code that it has had to be replicated four times — for IPv4, IPv6, ARP, and Ethernet bridging — because the firewall engines are too protocol-specific to be used in a generic manner.

That duplication of code is one of a number of shortcomings in netfilter that have long driven a desire for a replacement. In 2009, it appeared that such a replacement was in the works when Patrick McHardy announced his nftables project. Nftables replaces the multiple netfilter implementations with a single packet filtering engine built on an in-kernel virtual machine, unifying firewalling at the expense of putting (another) bytecode interpreter into the kernel. At the time, the reaction to the idea was mostly positive, but work stalled on nftables just the same. Patrick committed some changes in July 2010; after that, he made no more commits for more than two years.

Frustrations with the current firewalling code did not just go away, though. Over time, it also became clear that a general-purpose in-kernel packet classification engine could find uses beyond firewalls; packet scheduling is another fairly obvious possibility. So, in October 2012, current netfilter maintainer Pablo Neira Ayuso announced that he was resurrecting Patrick's nftables patches with an eye toward relatively quick merging into the mainline. Since then, development of the code has accelerated, with nftables discussion now generating much of the traffic on the netfilter mailing list.

Nftables as it exists today is still built on the core principles designed by Patrick. It adds a simple virtual machine to the kernel that is able to execute bytecode to inspect a network packet and make decisions on how that packet should be handled. The operations implemented by this machine are intentionally basic: it can get data from the packet itself, look at the associated metadata (which interface the packet arrived at, for example), and manage connection tracking data. Arithmetic, bitwise, and comparison operators can be used to make decisions based on that data. The virtual machine is capable of manipulating sets of data (typically IP addresses), allowing multiple comparison operations to be replaced with a single set lookup. There is also a "map" type that can be used to store packet decisions directly under a key of interest — again, usually an IP address. So, for example, a whitelist map could hold a set of known IP addresses, associating an "accept" verdict with each.
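
To make that concrete, here is a rough sketch of what set- and map-based rules might look like with the user-space nft tool (hypothetical invocations; the syntax was still in flux at the time and may well differ):

```shell
# Create a table and a chain (hypothetical nft syntax;
# details may differ from the real tool).
nft add table filter
nft add chain filter input

# One set lookup replaces many separate address comparisons:
nft add rule filter input ip saddr { 10.0.0.1, 10.0.0.2, 10.0.0.3 } accept

# A map can store the verdict itself, keyed by address:
nft add rule filter input ip daddr map { 192.168.1.1 : accept, 192.168.1.2 : drop }
```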

Replacing the current, well-tuned firewalling code with a dumb virtual machine may seem like a step backward. As it happens, there are signs that the virtual machine may be faster than the code it replaces, but there are a number of other advantages independent of performance. At the top of the list is removing all of the protocol awareness from the decision engine, allowing a single implementation to serve everywhere a packet inspection engine is required. The protocol awareness and associated intelligence can, instead, be pushed out to user space.

Nftables also offers an improved user-space API that allows the atomic replacement of one or more rules with a single netlink transaction. That will speed up firewall changes for sites with large rulesets; it can also help to avoid race conditions while the rule change is being executed.
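
In practice, that means a whole ruleset can be swapped in one transaction, along these lines (a sketch; the exact nft usage may differ):

```shell
# Load an entire ruleset from a file in a single netlink
# transaction: either every rule is applied, or none are,
# so no packet ever sees a half-updated firewall.
nft -f /etc/nftables.rules
```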

The code worked reasonably well in 2009, though there were a lot of loose ends to tie down. At the top of Pablo's list of needed improvements to nftables when he picked up the project was a bulletproof compatibility layer for existing netfilter-based firewalls. A new rule compiler will take existing firewall rules and compile them for the nftables virtual machine, allowing current firewall setups to migrate with no changes needed. This compatibility code should allow nftables to replace the current netfilter tables relatively quickly. Even so, chances are that both mechanisms will have to coexist in the kernel for years. One of the other design goals behind nftables — use of the existing netfilter hook points, connection-tracking infrastructure, and more — will make that coexistence relatively easy.

Since the work on nftables restarted, the repository has seen over 70 commits from a half-dozen developers; there has also been a lot of work going into the user-space nft tool and libnftables library. The kernel changes have added missing features (the ability to restore saved counter values, for example), compatibility hooks allowing existing netfilter extensions to be used until their nftables replacements are ready, many improvements to the rule update mechanism, IPv6 NAT support, packet tracing support, ARP filtering support, and more. The project appears to have picked up some momentum; it seems unlikely to fall into another multi-year period without activity before being merged.

As to when that merge will happen...it is still too early to say. The developers are closing in on their set of desired features, but the code has not yet been exposed to wide review beyond the netfilter list. All that can be said with certainty is that it appears to be getting closer and to have the development resources needed to finish the job.

See the nftables web page for more information. A terse but useful HOWTO document has been posted by Eric Leblond; it is probably required reading for anybody wanting to play with this code, but a quick, casual read will also answer a number of questions about what firewalling will look like in the nftables era.




The return of nftables

Posted Aug 21, 2013 5:48 UTC (Wed) by kugel (subscriber, #70540) [Link] (5 responses)

Interesting; however, I thought the kernel already has a virtual machine for packet inspection: the Berkeley packet filter (BPF)?

The return of nftables

Posted Aug 21, 2013 15:55 UTC (Wed) by johill (subscriber, #25196) [Link] (4 responses)

A cursory look through the article and the website suggests that there are, for example, extra data structures, like a hashtable (?) for IPv4 lookups etc. Such things don't exist in BPF.

The return of nftables

Posted Aug 21, 2013 22:01 UTC (Wed) by ncm (guest, #165) [Link] (3 responses)

I guess the question, then, is why not add this stuff to bpf? In the best case, it would be that bpf proved not to be a good enough foundation. The way these things go, it could as well be that improving bpf was less exciting than replacing it.

The return of nftables

Posted Aug 21, 2013 23:50 UTC (Wed) by wahern (subscriber, #37304) [Link] (2 responses)

The original author was well aware of BPF, and used it as the model. But clearly he thought it preferable to start writing code from scratch, and the project has already surpassed BPF in functionality. Plus, he who writes the code gets the say-so. (Also, the BPF virtual machine is actually quite tiny, and the line between re-writing it and copy+pasting it is rather thin.)

So, it's kind of a moot point. It would be one thing if the project had stalled before surpassing BPF in functionality. Then we could all jeer "I told you so". But this doesn't seem to be one of those occasions. nftables seemed to stall simply because too many people are comfortable with iptables, and are heavily invested in its arcane command-line syntax. And those who aren't can shift to using PF on OpenBSD or FreeBSD. Plus NetBSD has NPF now, which is pretty cool.

The return of nftables

Posted Aug 22, 2013 16:57 UTC (Thu) by intgr (subscriber, #39733) [Link] (1 responses)

> Also, the BPF virtual machine is actually quit tiny, and the line between re-writing it and copy+pasting it is rather thin

One of the advantages of BPF is that Linux already has a working BPF JIT compiler for many architectures (x86, ARM, SPARC, POWER and S/390). This is a non-trivial amount of code.

The return of nftables

Posted Aug 22, 2013 18:25 UTC (Thu) by raven667 (subscriber, #5198) [Link]

Could this work the other way around, consolidating on nftables as the backend for BPF processing in the kernel rather than maintaining two similar systems.

The return of nftables

Posted Aug 21, 2013 9:08 UTC (Wed) by rvfh (guest, #31018) [Link] (1 responses)

A conversion tool from iptables would be handy too, and as the nftables compiler can take iptables rules, maybe it could take that result and un-compile it into nftables rules?

The return of nftables

Posted Aug 22, 2013 8:07 UTC (Thu) by tobur (guest, #89244) [Link]

There is already a project for this.
Check this out: https://git.netfilter.org/iptables-nftables/

The idea, of course, is not to mess too much with users' habits; this makes using nftables totally transparent for those who want to keep using iptables. And I bet most will, for a while.

The return of nftables

Posted Aug 21, 2013 9:48 UTC (Wed) by patrick_g (subscriber, #44470) [Link] (1 responses)

What about Xtables2 ?

The return of nftables

Posted Aug 21, 2013 20:26 UTC (Wed) by jengelh (guest, #33263) [Link]

The project is practically abandoned.

At first nftables chugged along (it was presented at the workshop as early as 2008, corbet!) until I pushed for early merging of a next-gen packet filter (which happened to be xt2 code, but that is not the point), in the style of btrfs. But the powers that be wanted rather complete implementations instead. Doable, though...

With NFWS 2013, everybody (especially new contributors) sided with nftables by way of submitting many patches there. xt2 got no support and has been rendered uncompetitive too, with nft taking up and reimplementing ideas I had for xt2. That situation is very discouraging.

I temporarily work on more rewarding projects.

another bytecode interpreter ?

Posted Aug 21, 2013 14:03 UTC (Wed) by dambacher (subscriber, #1710) [Link] (12 responses)

I am only aware of the ACPI bytecode interpreter;
then there was (is?) a Java bytecode module somewhere.

Which ones am I missing?

another bytecode interpreter ?

Posted Aug 21, 2013 15:51 UTC (Wed) by johill (subscriber, #25196) [Link] (11 responses)

There's BPF, as somebody else has commented, and even though it's also related to packets I doubt it's similar - BPF is likely much simpler.

There's also an ASN.1 decoder, though I'm not really sure what that is.

another bytecode interpreter ?

Posted Aug 21, 2013 18:35 UTC (Wed) by aliguori (subscriber, #30636) [Link] (10 responses)

ASN.1 is an RPC description language. There are multiple sets of encoding rules for serializing an ASN.1 grammar, such as BER, CER, and DER.

Many protocols use ASN.1, such as CIFS, X.509, etc.
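
As a concrete illustration (not from the parent comment; this just assumes the stock openssl command-line tool is available), DER encodes everything as tag-length-value triples, which openssl asn1parse can decode:

```shell
# 0x02 = INTEGER tag, 0x01 = one byte of content,
# 0x2a = the value 42, written here as octal escapes:
printf '\002\001\052' > /tmp/int.der
# Decode the DER blob; the output should identify a
# primitive INTEGER with hex value 2A:
openssl asn1parse -inform DER -in /tmp/int.der
```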

another bytecode interpreter ?

Posted Aug 22, 2013 0:21 UTC (Thu) by wahern (subscriber, #37304) [Link] (9 responses)

It's the original Protocol Buffers, which Google decided to reinvent for some ridiculous reason. ASN.1 is also used for signaling on 3G and LTE cellular networks, for LDAP, and for a host of other things:

http://www.itu.int/ITU-T/asn1/uses/

It's a darned shame that ASN.1 support isn't more widespread in the FOSS world. Even the Perl modules are crappy. OpenSSL has a fairly complete library, but it's a gigantic PITA to use, like most FOSS ASN.1 tools. Really, ASN.1 was meant to be compiled automatically from the message description to code, with your application code manipulating a real data structure. This is the best open source ASN.1 project I'm aware of:

http://lionet.info/asn1c/compiler.html

It supports streaming parsing and composition without being tied to any I/O model. The only downside is that strings and arrays are always dynamically allocated, which makes constructing and destroying messages fairly verbose, especially if you care about malloc failure. Some proprietary ASN.1 compilers support fixed length arrays which make life a little easier when you're dealing with several simple string fields or lists with a small, finite limit on their length. That makes it easier to use message caches with simpler initialization.

another bytecode interpreter ?

Posted Aug 22, 2013 3:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

ASN.1 is too complicated, nobody knows how to use it.

BTW, if something was initially designed by telecom guys, then that is a great reason to avoid it like the plague.

another bytecode interpreter ?

Posted Aug 22, 2013 8:52 UTC (Thu) by gdt (subscriber, #6284) [Link] (5 responses)

ASN.1 is a complex encoding designed for an era of 48Kbps links. Because of the small bandwidths of that era it codes to very few bits. But in these days of 10Gbps links it asks too much of the CPU. ASN.1's complexity also led to a history of implementation errors, and thus vulnerabilities.

Your "telecom guys" comment serves no purpose. It references an issue of twenty years ago. These days at IETF meetings you're just as likely to see a "telecom guy" arguing against some hacked-together draft and asking that time be taken to do it better, with the IP equipment vendors opposing that for reasons of "time to market".

another bytecode interpreter ?

Posted Aug 22, 2013 9:17 UTC (Thu) by dlang (guest, #313) [Link] (3 responses)

Actually, as I see it, CPU power is increasing faster than line rate, so while it may not make sense to trade CPU for tight encoding in all cases, I can sure see it being a worthwhile trade in many cases.

This is like the claims that transparent disk compression was made obsolete by faster disks (including, but not limited to, SSDs). In some cases this is true and the system does have better things to spend its processor time on, but in other cases the system has more processing power than it needs while waiting for the I/O, so spending even a substantial amount of processing power to cut down on the amount of I/O needed can be a substantial win.

another bytecode interpreter ?

Posted Aug 22, 2013 10:25 UTC (Thu) by khim (subscriber, #9252) [Link] (2 responses)

Single-core CPU power has basically flatlined: nine years ago the top-of-the-line CPU was a Pentium 4 570J @ 3.8GHz, while today it's a Core i7 4770K @ 3.9GHz. Micro-architectural improvements mean that today's Core i7 is faster than an identically-clocked Pentium 4 from last decade, but the difference is not striking.

Meanwhile ethernet went from 10GbE to 100GbE, USB went from 480Mbit/s to 10GBit/s, and even PCI Express went from 250MB/s to 985MB/s per lane!

Sure, if you include the number of cores in your analysis you'll find that the CPU copes more or less fine - but then latency is often the limiting factor in communication protocols, and SMP is not a big help there.

another bytecode interpreter ?

Posted Aug 22, 2013 12:57 UTC (Thu) by intgr (subscriber, #39733) [Link]

> nine years ago top-of the line CPU was Pentium 4 570J @ 3.8GHz while today it's Core i7 4770K @ 3.9GHz. [...] but difference is not striking.

Depends on what you mean by "striking". I find that current server CPUs are 3-5 times faster at single-threaded workloads than 8-year-old single-core ones. Every generation of Intel processors still has performance gains of 10% or so while consuming less power.

I see your point about the overhead of complex protocol encodings. But if we go back to the original topic of firewalling: increasing network speeds would not be such a big problem if we weren't still stuck with packet sizes that were designed for 10 Mbps networks.

another bytecode interpreter ?

Posted Aug 22, 2013 22:05 UTC (Thu) by dlang (guest, #313) [Link]

clock speed isn't the only factor in cpu speed.

It's also _extremely_ unlikely that you are going to need to limit your computation to a single core. Using multiple cores is trivial if you have multiple communication streams to process (put each stream on its own core, or the processing of data from each interface on its own core, etc.). But even if you have only one stream to process, you can almost always find a way to split the workload across multiple cores (one core works on what you are sending now, the second works on what you will be sending in a few hundred ms, etc.)

so the move to multiple cores does end up helping the processing of things like this.

another bytecode interpreter ?

Posted Aug 22, 2013 12:40 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

IETF produces standards with a wonderful feature - they are written by humans, not by some humanoid aliens.

As for compactness, GZIP+XML is about as compact as ASN.1 BER for most use-cases. Google's protobufs are also pretty compact.

another bytecode interpreter ?

Posted Aug 29, 2013 14:24 UTC (Thu) by moltonel (subscriber, #45207) [Link]

While ASN.1 *is* riddled with some design-by-committee issues that make the grammar hard to master and the implementations rare and buggy, it remains superior in flexibility, features, and packet size. As with any tool, it's not one-size-fits-all, but there are a lot of areas where it is a winner.

Concerning CPU usage, while it is hard to compare (the codec gives you many data verifications for free, which you have to write yourself with the likes of protobuf and msgpack), it is nothing to be ashamed of. And while some people have 10Gbits to play with, others are more concerned with the per-MB data roaming fee that is drilling a hole through their wallet.

another bytecode interpreter ?

Posted Aug 22, 2013 14:43 UTC (Thu) by dw (guest, #12017) [Link]

Having built protocols in both, saying ASN.1 is the original Protocol Buffers is like saying Python is the original COBOL. It's a markedly simpler design, with vastly less optional fluff, and a definition language you can master in an hour instead of a few weeks.

Also, to my knowledge, nobody has yet shipped a Protocol Buffers implementation riddled with security holes due to specification complexity, although that might just be because nobody has thought to look there yet...

The return of nftables

Posted Aug 21, 2013 17:26 UTC (Wed) by luto (guest, #39314) [Link] (1 responses)

I wonder how this compares to the demultiplexing packet filter.

The return of nftables

Posted Aug 22, 2013 2:41 UTC (Thu) by ras (subscriber, #33059) [Link]

Yikes.

To give Patrick his due, he has implemented good academic proposals in the past. (The HFSC qdisc springs to mind).

I hope he has seen this one.

The return of nftables

Posted Aug 31, 2013 20:15 UTC (Sat) by compte (guest, #60316) [Link] (3 responses)

I still use Peerguardian, based on libnetfilter_queue & libnfnetlink, although it's not maintained.
I tried to make rules with iptables, but it could not load 150,000 of them (though it managed 100,000), while Peerguardian handled more than that.

The return of nftables

Posted Aug 31, 2013 21:05 UTC (Sat) by nybble41 (subscriber, #55106) [Link] (2 responses)

That sounds like a prime use case for the new(-ish) IP set rules. You wouldn't want to do 150,000 separate tests on every packet anyway. IP set matches are much more efficient.

ipset create peerguard hash:ip family inet maxelem 262144
xargs -i ipset add peerguard {} < ip-list
iptables -A INPUT -m set --match-set peerguard src -j DROP
ipset create peerguard6 hash:ip family inet6 maxelem 262144
xargs -i ipset add peerguard6 {} < ip6-list
ip6tables -A INPUT -m set --match-set peerguard6 src -j DROP

Note that once you've created the set, you can use the "save" and "restore" ipset commands to avoid running 150,000 "add" commands every time you set up your firewall rules. You can also add a range of IPs with a single command, e.g. "ipset add peerguard 1.2.3.0/24".
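
For instance (these commands require root and a kernel with ipset support):

```shell
# Dump the populated set to a file...
ipset save peerguard > /var/lib/peerguard.ipset
# ...and later recreate it in one shot, far faster than
# running 150,000 individual "add" commands:
ipset restore < /var/lib/peerguard.ipset
```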

Requires Linux 2.6.39 or later with CONFIG_IP_SET_HASH_IP enabled.

The return of nftables

Posted Aug 31, 2013 23:41 UTC (Sat) by compte (guest, #60316) [Link] (1 responses)

Thanks, I was using a lot of:
-A INPUT -s 1.2.3.4/24 -j REJECT
lines. So the trick is in "peerguard {} < ip-list"?
Is "peerguard {}" an existing Peerguardian function pointing to a p2p file?

The return of nftables

Posted Sep 1, 2013 1:51 UTC (Sun) by nybble41 (subscriber, #55106) [Link]

> So the trick is in "peerguard {} < ip-list"
> Is peerguard{} an existing Peerguardian function pointing to a p2p file?

Not quite. This doesn't depend on any code other than ipset and iptables. The list of IP addresses/ranges to block is in the file "ip-list". In the command

> xargs -i ipset add peerguard {} < ip-list

the "{}" is an argument to "xargs -i" which serves as a placeholder. The xargs tool (with the "-i" option) runs the given command once for each line in the standard input (here redirected from the file ip-list), replacing any occurrences of "{}" with the data from the input. This is equivalent to a series of commands like:

> ipset add peerguard 12.23.34.45
> ipset add peerguard 21.32.43.54
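
A quick way to see that expansion without touching any real sets is to substitute echo for ipset:

```shell
# Build a tiny sample input file:
printf '12.23.34.45\n21.32.43.54\n' > /tmp/ip-list-demo
# Let xargs print the commands it would run, instead of
# executing ipset for real:
xargs -i echo ipset add peerguard {} < /tmp/ip-list-demo
# → ipset add peerguard 12.23.34.45
# → ipset add peerguard 21.32.43.54
```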

The ipset tool adds each IP address to the IP set named "peerguard", which was created in the previous "ipset create" command as a hash-based set of IP addresses, and which is referenced with the "-m set --match-set peerguard src" option to iptables to search the set for the source IP address of the packet.

You'll probably want to change the "-j DROP" in my example to "-j REJECT", to match your previous rules. I wasn't sure which approach Peerguardian took. Also, if you have a large number of address ranges, like the /24 in your example, you may want to use "hash:net" rather than "hash:ip" when creating the IP set so that the ranges are stored more efficiently in the kernel. You can pass ranges to "ipset add" either way, but in the "hash:ip" case they're expanded to individual addresses in the table, whereas "hash:net" keeps separate tables for each prefix length and stores only the network address.

See also:
* http://ipset.netfilter.org/
* http://ipset.netfilter.org/ipset.man.html
* http://ipset.netfilter.org/iptables-extensions.man.html


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds