BPF comes to firewalls

By Jonathan Corbet
February 19, 2018

The Linux kernel currently supports two separate network packet-filtering mechanisms: iptables and nftables. For the last few years, it has been generally assumed that nftables would eventually replace the older iptables implementation; few people expected that the kernel developers would, instead, add a third packet filter. But that would appear to be what is happening with the newly announced bpfilter mechanism. Bpfilter may eventually replace both iptables and nftables, but there are a lot of questions that will need to be answered first.

It may be tempting to think that iptables has been the kernel's packet-filtering implementation forever, but it is a relative newcomer, having been introduced in the 2.4.0 kernel in 2001. Its predecessors (ipchains, introduced in 2.2.10, and ipfwadm, which dates back to 1.2.1 in 1995) are mostly forgotten at this point. Iptables has served the Linux community well and remains the firewalling mechanism that is most widely used, but it does have some shortcomings; it has lasted longer than the implementations that came before, but it is clearly not the best possible solution to the problem.

The newer nftables subsystem, merged for the 3.13 kernel release in early 2014, introduced an in-kernel virtual machine to implement firewall rules; users have been migrating over, but the process has been slow. For some strange reason, system administrators have proved reluctant to throw away their existing firewall configurations, which were painful to develop and which still function as well as they ever did, and start over with a new and different system.

Still, it was logical to assume that nftables would eventually take over, especially as the iptables compatibility layers improved. Some people started to doubt this story, though, when serious development started on the BPF virtual machine. There seemed to be a lot of overlap between the two virtual machines, and BPF was being quickly extended in ways that improved its performance, functionality, and security. Even so, nftables development has continued, and there has been little talk — until now — of pushing BPF into the core of the firewalling code.

Bringing in BPF

The announcement of bpfilter changes that situation, though. In short, bpfilter enables the creation of BPF programs that can be attached to points in the network packet path and make filtering decisions. In the proof-of-concept patches, those programs are attached at the express data path (XDP) layer, where they are run from the network-interface drivers. But, as Daniel Borkmann noted in the introduction to the patches, BPF programs could be just as easily attached at any other point in the path, allowing them to make decisions at the same points that iptables rules do.

There are a number of advantages claimed for the bpfilter approach. BPF programs can be just-in-time compiled on most popular architectures, so they should be quite fast. The work that has been done to enable the offloading of XDP-level programs to the network interface itself can come into play here, moving firewall processing off the host CPU entirely. The use of BPF enables the writing of firewall rules in C, which may appeal to some developers who are starting from the beginning. And firewall code would be subject to the BPF verifier, adding a layer of security to the whole system.
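
As a rough illustration of what a firewall rule written in C might look like, the sketch below mimics the shape of an XDP-style filter in plain user-space C so that it can be compiled and run anywhere. The names `enum verdict`, `filter_packet()`, and `BLOCKED_PORT` are invented for this example; a real XDP program would instead receive a `struct xdp_md` context and return `XDP_PASS` or `XDP_DROP` from `<linux/bpf.h>`. The explicit length checks before each access are the sort of thing the BPF verifier insists on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts standing in for XDP_PASS and XDP_DROP. */
enum verdict { VERDICT_PASS, VERDICT_DROP };

#define ETH_HEADER_LEN   14      /* Ethernet header without VLAN tags */
#define MIN_IPV4_HEADER  20
#define ETHERTYPE_IPV4   0x0800
#define IPPROTO_TCP_NUM  6
#define BLOCKED_PORT     23      /* assumption: this sketch drops telnet */

/* Read a big-endian 16-bit value from the packet. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Decide the fate of one Ethernet frame of len bytes.  Anything that
 * cannot be classified as IPv4 TCP is passed through untouched.
 */
enum verdict filter_packet(const uint8_t *data, size_t len)
{
    size_t ihl, tcp_off;

    /* Bounds check before touching the Ethernet or IPv4 headers. */
    if (len < ETH_HEADER_LEN + MIN_IPV4_HEADER)
        return VERDICT_PASS;
    if (read_be16(data + 12) != ETHERTYPE_IPV4)
        return VERDICT_PASS;

    /* The low nibble of the first IPv4 byte is the header length in words. */
    ihl = (size_t)(data[ETH_HEADER_LEN] & 0x0f) * 4;
    if (ihl < MIN_IPV4_HEADER || data[ETH_HEADER_LEN + 9] != IPPROTO_TCP_NUM)
        return VERDICT_PASS;

    /* Bounds check before reading the TCP ports. */
    tcp_off = ETH_HEADER_LEN + ihl;
    if (len < tcp_off + 4)
        return VERDICT_PASS;

    /* Bytes 2-3 of the TCP header hold the destination port. */
    return read_be16(data + tcp_off + 2) == BLOCKED_PORT
        ? VERDICT_DROP : VERDICT_PASS;
}
```

Because the function touches nothing but the packet bytes it is handed, the same logic could in principle run at any hook in the packet path, which is the flexibility claimed for bpfilter.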

One of the core design features for bpfilter is the ability to translate existing iptables rules into BPF programs. This feature is intended to make it easy for existing firewall configurations to be moved over to the new scheme, perhaps without system administrators even knowing that it is happening. This translation is done in an interesting manner. Iptables rules are passed to the kernel, so the kernel must take responsibility for doing that work, but the task can be a complex one that would benefit from a user-space implementation.

To enable such an implementation, the bpfilter developers have created a new mechanism that supports the creation of a special type of kernel module to handle this kind of task. These modules would be part of the kernel and would be shipped by distributors as just another .ko file, but they would contain an ordinary ELF executable. After the module has been loaded, its code can be run in a separate user-space process; all that is required is a call to a special version of call_usermodehelper().

This mechanism allows the translation code to be managed as if it were just another part of the kernel. That code can be developed in user space, though. When it runs, the translation code will be separated from the kernel, making it harder to attack the kernel via that path. If this mechanism catches on, one can imagine that a number of other tasks could eventually be pushed out of the kernel proper into one of these special user-space modules. Developers should be careful, though; this could prove to be a slippery slope leading toward something that starts to look like a microkernel architecture.
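
The rule-translation step described above can be made concrete with a toy sketch; this is not the bpfilter code. A simplified rule (protocol and destination port) is compiled into a short program for an invented three-opcode machine, and a small interpreter then runs that program against a packet summary. The real translator would emit genuine BPF instructions; everything named here (`struct rule`, `struct insn`, `compile_rule()`, `run_program()`) is an assumption made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { ACT_ACCEPT = 0, ACT_DROP = 1 };

/* Simplified stand-in for an iptables rule; 0 means "match anything". */
struct rule { uint8_t proto; uint16_t dport; int action; };

/* The packet fields that the toy machine can inspect. */
struct pkt { uint8_t proto; uint16_t dport; };

/* Invented opcodes: a match falls through to the next instruction on
 * success and jumps past the rest of its rule on failure. */
enum opcode { OP_MATCH_PROTO, OP_MATCH_DPORT, OP_RET };

struct insn { enum opcode op; uint16_t arg; uint8_t skip; };

/* Emit instructions for one rule; returns the number written (at most 3). */
size_t compile_rule(const struct rule *r, struct insn *out)
{
    size_t n = 0;
    size_t remaining = (size_t)(r->proto != 0) + (size_t)(r->dport != 0);

    if (r->proto != 0) {
        /* On mismatch, skip the remaining checks plus the OP_RET. */
        out[n++] = (struct insn){ OP_MATCH_PROTO, r->proto, (uint8_t)remaining };
        remaining--;
    }
    if (r->dport != 0) {
        out[n++] = (struct insn){ OP_MATCH_DPORT, r->dport, (uint8_t)remaining };
        remaining--;
    }
    out[n++] = (struct insn){ OP_RET, (uint16_t)r->action, 0 };
    return n;
}

/* Run a program; default_action applies when no rule matches. */
int run_program(const struct insn *prog, size_t len,
                const struct pkt *p, int default_action)
{
    size_t pc = 0;

    while (pc < len) {
        const struct insn *i = &prog[pc];
        int matched;

        switch (i->op) {
        case OP_RET:
            return i->arg;
        case OP_MATCH_PROTO:
            matched = (p->proto == i->arg);
            break;
        case OP_MATCH_DPORT:
            matched = (p->dport == i->arg);
            break;
        default:
            return default_action;
        }
        pc += matched ? 1 : 1 + (size_t)i->skip;
    }
    return default_action;
}
```

Compiling a list of rules is then a matter of calling `compile_rule()` for each one in order and concatenating the output. Even in this toy form, the compiled program must reproduce the first-match semantics of the rule list exactly; any divergence would change which packets get through.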

Early responses

There have not been a whole lot of comments thus far on the code itself. That may be partly because, in their haste to get a proof of concept out to illustrate the idea, the developers never quite got around to writing comments in the code — or even changelogs for the patches. The idea itself, though, has raised concerns for some developers.

Harald Welte, who is not often seen in this community these days, showed up with a number of questions. At the top of his list was the decision to emulate iptables rules with the new BPF mechanism. If the new subsystem is to ever replace the iptables implementation, it will need to implement exactly the same behavior; small and subtle differences could introduce security problems into deployed firewall configurations. Given the complexity of iptables, the chances of such differences happening are significant.

More fundamentally, the networking developers have wanted to phase out iptables and its user-space interfaces for some time. Iptables has not aged entirely well. For example, there is no way to add or replace a single rule (or small set of rules); iptables can only wipe out the entire configuration and start from scratch. That makes firewall changes expensive; it also gets difficult to coordinate changes when they are being made by multiple actors at once. The increasing use of containers has created just this kind of situation; addressing this problem requires moving away from the iptables API. The fact that iptables requires separate rule sets for IPv4 and IPv6 creates a pain point for administrators as well.
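
The coordination problem can be sketched in a few lines of C. This is a toy model, not the iptables interface: `struct ruleset`, `fetch_ruleset()`, and `replace_ruleset()` are hypothetical names standing in for an API that can only read or replace the whole configuration. Two actors each fetch a copy, append one rule, and write their copy back; the second write silently discards the first actor's rule.

```c
#include <stddef.h>
#include <string.h>

#define MAX_RULES 16
#define RULE_LEN  32

struct ruleset {
    size_t count;
    char rules[MAX_RULES][RULE_LEN];
};

/* The "kernel" copy of the configuration (initially empty). */
static struct ruleset kernel_rules;

/* Whole-ruleset read: the caller gets a private snapshot. */
struct ruleset fetch_ruleset(void)
{
    return kernel_rules;
}

/* Whole-ruleset write: whatever was there before is overwritten. */
void replace_ruleset(const struct ruleset *rs)
{
    kernel_rules = *rs;
}

/* Append a rule to a private snapshot; returns 0 on success. */
int append_rule(struct ruleset *rs, const char *rule)
{
    if (rs->count >= MAX_RULES || strlen(rule) >= RULE_LEN)
        return -1;
    strcpy(rs->rules[rs->count++], rule);
    return 0;
}

/* Returns 1 if the kernel copy currently contains the given rule. */
int kernel_has_rule(const char *rule)
{
    for (size_t i = 0; i < kernel_rules.count; i++)
        if (strcmp(kernel_rules.rules[i], rule) == 0)
            return 1;
    return 0;
}

/* Two actors interleave their fetch/modify/replace cycles. */
void racing_updates(void)
{
    struct ruleset a = fetch_ruleset();   /* actor A takes a snapshot */
    struct ruleset b = fetch_ruleset();   /* actor B takes a snapshot */

    append_rule(&a, "allow tcp/80");
    append_rule(&b, "allow tcp/443");
    replace_ruleset(&a);                  /* A's rule is installed... */
    replace_ruleset(&b);                  /* ...then lost to B's write */
}
```

After `racing_updates()` runs, only actor B's rule remains in `kernel_rules`. An interface that accepts individual rule additions and deletions avoids this particular race, since neither actor has to write back rules it did not change.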

Implementing the iptables API with bpfilter, Welte said, will "risk perpetuating the design mistakes we made in iptables some 18 years ago for another decade or more". It will push back the (already distant) date when that API could be deprecated and removed. Rather than focusing on iptables, Welte said, the developers should create an emulation of the newer nftables API, which was designed with the lessons from iptables in mind. That would support sites that have already migrated and encourage that migration to continue.

Networking maintainer David Miller (who authored some of the new code) replied that iptables is still far more widely used, so implementing that interface provides for better testing coverage in the near term. Welte answered, though, that most of the biggest use cases (Docker and Kubernetes, for example) use the command-line tools rather than the iptables API, so there is no need to implement emulation of the API itself to test with those systems. Miller, however, disagreed with the idea that the iptables binaries could be easily replaced on deployed systems: "Like it or not iptables ABI based filtering is going to be in the data path for many years if not a decade or more to come".

Interestingly, while there was talk of implementing the nftables API, nobody has yet questioned the idea of applying the BPF virtual machine to firewalls, even though it would be likely to supplant nftables relatively quickly. Instead, Miller said in the discussion that nftables failed to address the performance problems in Linux's packet-filtering implementation, driving users toward user-space networking technologies instead. There is a real possibility that nftables could end up being one of those experiments that is able to shed some light on the problem space but never takes over in the real world.

Overall, bpfilter is an extremely young project and there are a lot of questions yet to be answered about it. While much of the packet-filtering logic can likely be expressed in BPF code, there are more advanced features (like connection tracking, pointed out by Florian Westphal) that are still likely to need a fair amount of kernel support. There are no performance numbers with the patch set, so any performance gains are still theoretical at this point. And the code itself is quite young, lacking both features and documentation.

The end result is that we'll probably not see bpfilter in the mainline kernel in the immediate future. Given the developers who have worked on it, though, bpfilter is clearly a serious initiative that is firmly aimed at getting into the mainline eventually. If it truly proves to be a better solution to the network packet-filtering problem, those developers seem likely to prevail eventually.


BPF comes to firewalls

Posted Feb 20, 2018 4:35 UTC (Tue) by eahay (guest, #110720) [Link] (6 responses)

Iptables can delete or insert a single rule at a time...

BPF comes to firewalls

Posted Feb 20, 2018 6:46 UTC (Tue) by kay (guest, #1362) [Link]

iptables command can ... but not the API

BPF comes to firewalls

Posted Feb 20, 2018 12:19 UTC (Tue) by bernat (subscriber, #51658) [Link] (1 responses)

It will download the whole ruleset from the kernel, modify it to add/remove the single rule and upload it again. When your ruleset becomes huge, adding/removing a single rule takes a significant time.

BPF comes to firewalls

Posted Feb 24, 2018 20:07 UTC (Sat) by kleptog (subscriber, #1183) [Link]

Well that explains things... I heard someone mumbling about how iptables updates can get lost and I couldn't see how, until now.

In any case, if we do firewall rules as BPF we end up with the same problem, surely? The performance improvement would be that you can pass your firewall through a compiler/optimiser to make it more efficient, but as a side effect you end up with the same problem, namely, to update a single rule you need to replace the whole program. Only now you've added an optimise step in between.

Unless you change your API to a transactional one where you can send updates and get a confirmation asynchronously and the backend is smart enough to avoid actually updating the kernel for every change.

BPF comes to firewalls

Posted Apr 19, 2018 2:26 UTC (Thu) by manhnt (guest, #123784) [Link] (2 responses)

Well, as kleptog mentioned, there are cases where an iptables update can lose some rules. I've met such cases. Does anyone know how to solve that properly? What I did was simply retry until it succeeded, which may not be an optimal solution.

BPF comes to firewalls

Posted Aug 13, 2018 4:07 UTC (Mon) by fest3er (guest, #60379) [Link]

How many rules are you talking about? In some testing 4-6 years ago, I found that iptables could not handle more than about 20 000 rules at a time. Any more and some rules would be 'lost'. Iptables was happy to add 1 000 000 rules as long as I added them around 15 000 at a time (meaning a COMMIT every 15 000 or so). Adding so many rules wasn't real speedy, but it also wasn't outrageously slow.

BPF comes to firewalls

Posted Aug 13, 2018 16:37 UTC (Mon) by antiphase (subscriber, #111993) [Link]

Use ipset to create address lists instead of using individual per-address rules. It doesn't change the reload behaviour, but it will potentially hugely reduce the number of rules if you're matching in similar ways just with different addresses, and is also faster shifting packets as a bonus.

BPF comes to firewalls

Posted Feb 20, 2018 6:42 UTC (Tue) by valberg (guest, #83862) [Link]

Thanks, Jonathan, for another great article.

BPF comes to firewalls

Posted Feb 20, 2018 7:01 UTC (Tue) by epa (subscriber, #39769) [Link] (2 responses)

So what is the origin of BPF? I thought the PF stood for packet filter because it had originated as a way to compile firewall rules — but according to the article this is the first time it has been used there.

BPF comes to firewalls

Posted Feb 20, 2018 7:17 UTC (Tue) by dgm (subscriber, #49227) [Link]

And the "B" stands for Berkeley, where it originated. The Wikipedia article (https://en.wikipedia.org/wiki/Berkeley_Packet_Filter) is a bit light on details, but you will find them in the original paper (http://www.tcpdump.org/papers/bpf-usenix93.pdf).

BPF comes to firewalls

Posted Feb 20, 2018 16:17 UTC (Tue) by josh (subscriber, #17465) [Link]

It was originally used for filtering of packets for tools like tcpdump, so that when you say "show me traffic on tcp port 80 to this IP" the kernel can very quickly select the data you want and feed it to you at wire speed.

BPF comes to firewalls

Posted Feb 20, 2018 7:51 UTC (Tue) by vadim (subscriber, #35271) [Link] (4 responses)

nftables is quite a nice idea. I think the problem with it was that they were slow at implementing the last few features that were actually quite important.

For instance, nftables can do MSS clamping only since kernel 4.14, which was released in November 2017. nftables has been around since 2014, like this article says. MSS clamping is a feature in wide use for DSL and fiber setups, and this is important precisely to the kinds of people that want to run their own firewall.

IMO, the other problem with it is that the documentation is still not great, and the syntax leaves a lot to be desired.

For instance, nftables involves stringing together commands in a way that highly resembles a run-on sentence:

    nft add rule ip filter forward oifname ppp0 tcp flags syn tcp option maxseg size set 1452

It's not immediately obvious how the syntax works and what words fit in where in the hierarchy. The way "ppp0" is not quoted or delimited in any way also makes it hard to tell apart commands from data, though this can be done as seen below. There's a C-ish form that looks a bit nicer, but then when you run into a command that starts with "nft add" it's not obvious how to put that into your config file, which looks like:

table ip filter {
	# allow packets from LAN to WAN, and WAN to LAN if LAN initiated the connection
	chain forward {
		iifname "lan0" oifname "wan0" accept
	}
}

Note how it is subtly different: we go from "ip filter" to "table ip filter", and from "forward" to "chain forward", and for someone not familiar with the syntax it's not really apparent that "oifname" in the first example is the point where you'd want to start copy/pasting.

I hope that besides the technical details, the makers of BPF also take care of producing a better syntax and good documentation.

BPF comes to firewalls

Posted Feb 20, 2018 15:25 UTC (Tue) by ringerc (subscriber, #3071) [Link] (2 responses)

Yeah, it's a lot like someone looked at the "tc" and "ip" commands and thought "what a great UI, let's do that".

BPF comes to firewalls

Posted Feb 20, 2018 16:17 UTC (Tue) by flussence (guest, #85566) [Link] (1 responses)

I've got a working (AFAIK) nftables setup. The end result looks pretty after months of tweaking, but I completely agree on how unnecessarily painful it was to get there. Spitting nothing but strerror(-ENOENT) at the user whenever any module is missing from the kernel is a nasty thing to do…

BPF comes to firewalls

Posted Feb 21, 2018 0:15 UTC (Wed) by florianfainelli (subscriber, #61952) [Link]

Fortunately we now have extended netlink acks to give you a more meaningful error code...

BPF comes to firewalls

Posted Mar 1, 2018 11:31 UTC (Thu) by jengelh (guest, #33263) [Link]

>For instance, nftables involves stringing together commands in a way that highly resembles a run-on sentence:
>
> nft add rule ip filter forward oifname ppp0 tcp flags syn tcp option maxseg size set 1452
>
>It's not immediately obvious how the syntax works and what words fit in where in the hierarchy.

This is where the iptables UI excels: the tokens for "options" and the tokens for "values" never overlap; I am tempted to say that it is *context-free*. The nft "tcp", by contrast, could mean either "-p tcp" or "--tcp-flags ..." depending on where it is located, which is what makes the bpf/ip/tc/nft syntax so terrible.

BPF comes to firewalls

Posted Feb 20, 2018 12:56 UTC (Tue) by iq-0 (subscriber, #36655) [Link]

I'm in favor of a jit-able packet filter that might partially be offloaded to hardware.

But the real challenges are often not the ruleset overhead; they are related to connection tracking, matching against advanced set data structures, and the interaction with the rest of the network stack. I feel like there is a basic conflict between calling kernel functions to get better access to advanced algorithms and data structures and the basic JIT and offloading story of bpfilter.

And didn't BPF programs have a size constraint? Or is that something that can be worked around using BPF_MAP_TYPE_PROG_ARRAY?

BPF comes to firewalls

Posted Feb 20, 2018 13:40 UTC (Tue) by kooky (subscriber, #92468) [Link] (2 responses)

I thought nftables ruleset already used BPF?

I've been using nftables and find it just works now that I've got the hang of it.

Tim

BPF comes to firewalls

Posted Feb 21, 2018 0:20 UTC (Wed) by florianfainelli (subscriber, #61952) [Link] (1 responses)

You would think it would, but it actually uses a custom VM. Pablo just posted patches to do exactly that, though:

https://www.mail-archive.com/netdev@vger.kernel.org/msg21...

BPF comes to firewalls

Posted Feb 21, 2018 13:26 UTC (Wed) by pomac (subscriber, #94901) [Link]

I find this really interesting, I've wondered and tried to push a change to bpf for a while but =)

Anyway, for those that want to follow the threads in an easier manner:

https://marc.info/?l=linux-netdev&m=151905824829539&... - [PATCH RFC PoC 0/3] nftables meets bpf
https://marc.info/?l=netfilter-devel&m=15187884440366... - [PATCH RFC 0/4] net: add bpfilter

BPF comes to firewalls

Posted Feb 23, 2018 23:01 UTC (Fri) by ofranja (guest, #11084) [Link]

> Developers should be careful, though; this could prove to be a slippery slope leading toward something that starts to look like a microkernel architecture.

Or, even further, an exokernel architecture.

In the original MIT exokernel research (~1995), a packet filter language w/JIT compiler is mentioned as a way to filter and delegate network traffic to userspace with minimal [1] kernel support (although not necessarily using these terms, but the general idea is the same).

[1] https://pdos.csail.mit.edu/archive/exo/exo-slides/sld011.htm

BPF comes to firewalls

Posted Jul 26, 2019 20:13 UTC (Fri) by valentine (guest, #133435) [Link]

Hi all,
I've some small questions about the post.
- What is the relationship between BPF and eBPF?
- I still haven't understood how BPF programs work here: are they attached at the express data path (XDP) layer, so that they are run from the network-interface drivers, or not? If so, must the NIC drivers be rewritten?