LWN: Comments on "A memory allocator for BPF code"
https://lwn.net/Articles/883454/
This is a special feed containing comments posted to the individual LWN article titled "A memory allocator for BPF code".
Mon, 20 Oct 2025 11:17:40 +0000

A memory allocator for BPF code
https://lwn.net/Articles/893245/
MaZe:
Cluster stuff is now (finally!) basically done and ipv6-only.
Though of course there are always weird special cases and exceptions, like critical ipv4-only hardware (temperature sensors, ntp/gps and the like).
The current true battlefront is in cloud... and to a lesser extent trying to get to ipv6-only corp (due to rfc1918 [incl. cgnat] exhaustion).
Sun, 01 May 2022 12:01:13 +0000

A memory allocator for BPF code
https://lwn.net/Articles/885041/
nybble41:
TUBA (RFC1347) keeps TCP and UDP but replaces the IP layer with the Connectionless Network Protocol (CLNP). No compatibility was offered via translation or tunneling between IP and CLNP networks, and CLNP diverges more from IPv4 than IPv6 does. Ergo, TUBA would not have been any less disruptive than dual-stack IPv4+IPv6, and it represents a step backward from something like 464XLAT, which permits IPv6-only networks to communicate with IPv4-only hosts via stateful NAT and protocol translation. Plus, of course, all the hardware engineers' perfectly legitimate objections to variable-length addresses.
Wed, 16 Feb 2022 20:14:54 +0000

A memory allocator for BPF code
https://lwn.net/Articles/885037/
flussence:
> IPv6 evolved out of a couple of experimental IPng protocols, one of which was TUBA.

Wait, the real answer to the ignoramuses going “LOL IPv6, why don't we just make IPv4 addresses longer?” was sitting right there in an RFC (1347) all this time?!

I almost want them to find it and try to implement it, just for the schadenfreude.
Wed, 16 Feb 2022 19:49:52 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884603/
Sesse:
> It is easy to wave your hands and be dismissive of this but all of the so-called hyperscalers had massive difficulty in rolling out ipv6.

Citation needed, please? Or at least a clarification of what “rolling out IPv6” would entail, because e.g. www.google.com has had AAAA records since June 6th, 2012 (World IPv6 Launch).
Sat, 12 Feb 2022 01:10:09 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884602/
Sesse:
FWIW, I did large chunks of Google's IPv6 porting back in the day, and nearly all code was indeed written protocol-independently, so slotting in IPv6 support was fairly easy. You had to fix a few central abstractions, and some stuff around logging and ACLs and such, but it turns out most code doesn't care much about what an IP address _is_, just that you can store it and send it around to other parts of the code. (We were, de facto, three people who did most of it over a period of 1–2 years, not all of us full-time.)
IIRC, you could have pulled up an IPv6-only Borg cluster around early 2011 or so if you wanted, and have it run real production-like workloads. But changing how cluster operations work is a completely different game; one that I never pursued, and I have no idea how it works internally now (I've since quit and then rejoined Google, but in a completely different part of the company). And public cloud happened after all of us had moved on to other systems, so no, I'm not responsible for any deficiencies it might have :-)
Sat, 12 Feb 2022 01:07:13 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884583/
jhoblitt:
There is also the small issue that until extremely recently, ipv6 support in network equipment often could not be trusted. It is easy to wave your hands and be dismissive of this, but all of the so-called hyperscalers had massive difficulty in rolling out ipv6. It is easy to print "supports ipv6" on the box of a switch, but does it work with the same reliability and hardware acceleration as ipv4? Say, with NDP working reliably over an EVPN-based spine+leaf deployment? The answer was obviously no until practically last week.
Fri, 11 Feb 2022 21:29:38 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884569/
jd:
Well, given the way IPv6 has been used in practice, you're absolutely right. And if you are content to only consider what is mainstream practice, nothing after this point will be of the slightest interest.

According to how it was originally designed (a prefix which defines the route, and a suffix that defines any given physical or virtual machine that is connected via that route), there was a bit more freedom, as any physical or virtual interface was absolutely guaranteed a unique IPv6 address. This was when autoconfiguring networks were to be the way and DHCP was seen as a maintenance-heavy dead end.

(This is how the original mobility system worked. If you moved a machine from one network to another, you simply notified everyone connected that your prefix had changed and all packets - including those in transit - would be diverted to your new prefix. Your old address would be marked transient and would remain usable for existing packets but not new ones. This necessitates a unique suffix.)

This meant that to have a new subnet, you simply added a byte to the prefix to identify the new network. As long as there were bytes left you could use, defining new subnets was trivial. Which meant one interface had one IP address. There was no concept of multihoming. In order to have more than one IP address go through an interface, you needed a virtual network, where each virtual interface had one IP address. Yes, it's still an overlay network, but it became part of the design rather than an add-on, and there was no distinction between what was software and what was hardware. It was just one network.

Telebit devised an extension for this, although I think it vanished with them. As far as I could understand, their system allowed the creation of networks that were utterly transparent to the outside world through some sort of NAT. Your traffic could even go over these transparent segments and back onto the visible network, but it would look like a single hop to the outside world, which wasn't bad for nascent IPv6 in 1996.
That would have let you get past the prefix length limit.

Of course, nothing actually stops you from using the autoconfigure protocol for containers and virtual switches in a virtual network of containers - well, other than the fact that this really doesn't play nice with software that assumes a static IP. You're only guaranteed a static suffix.

If you're happy with mainstream protocols, whether or not it's standard use, then stop here. Because it's about to get scary.

IPv6 evolved out of a couple of experimental IPng protocols, one of which was TUBA. Now, with TUBA, the idea was that addresses were variable length. (I did say it was about to get scary.) IIRC, routing would be done by moving a cursor up or down from the current position to read just the next few bytes and see where the packet should go. Something hardware engineers were getting ready to mutiny over, from the sounds of things. How this would have ever worked is left as an exercise for Stephen King fans. But in principle, it would have meant that you couldn't run out of space on the prefix.
Fri, 11 Feb 2022 21:18:15 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884571/
zdzichu:
It's worth noting there's another company with great engineering accomplishments, namely Facebook (or maybe Meta now?). They went with IPv6-first internal networks in their DCs. They also have their own container-orchestration solution – Tupperware.
I'd love to see a comparison (with emphasis on networking) between Tupperware and Kubernetes/Borg.
Fri, 11 Feb 2022 20:39:39 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884567/
jd:
Google was founded in 1998, two years after I'd set up Manchester University's IPv6 node. So, yes, it was before IPv6 became mainstream, but stacks were in Linux as of 2.0.20 (with a patch) and were in the main kernel by early 2.1, IIRC. (They'd already been in Windows, thanks to FTP Software, and in Solaris.) KAME was already out for the BSDs, I think.

At that time, IIRC, NRL were distributing a library for stack-independent network code. (The connection could be IPv4, IPv6 or indeed any other supported protocol, and that detail would be hidden from the application.) Since Google's use case was not that complicated, Google could even have written a small bit of code to do the same thing, although it would have meant getting the web server to support it, which would have added to what they had to maintain.

So Google could have either supported IPv6 directly or developed their code to simply not care about that layer at all. Now, in hindsight, the extra work then (when things were simple) might have made sense, although it's just as possible that they simply didn't have the time or money to throw in features that they couldn't sell at the time.
And, let's be honest, there was quite a lot of cynicism about IPv6 back then as well.

Today, a protocol-independent communications layer would be a LOT of work for everyone; I'm not sure you could introduce such a system at this point, and switching a massive heap of interdependent legacy IPv4 code to IPv6 would be almost as painful.
Fri, 11 Feb 2022 20:12:45 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884178/
Cyberax:
The K8s network is not open to the public Internet at all. If you receive a packet from it, there's a guarantee that it has been sent by somebody within the trusted net.
Wed, 09 Feb 2022 07:51:18 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884176/
bartoc:
That's sort of true, I suppose, but it's true in the same way that NAT provides security guarantees (you can't have an overlay network without doing that stuff, but you can do that stuff without having an overlay network).
Wed, 09 Feb 2022 07:34:08 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884171/
bartoc:
No, "overlay networks" are not the same as "giving each container its own IP".

Overlay networks imply some kind of packet encapsulation; with ipv6 that's not necessary.
Wed, 09 Feb 2022 07:31:38 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884168/
bartoc:
Yeah, having to administer BGP servers of all things always seemed more painful to me than just getting ipv6 working correctly.

I think part of it is that it comes from Google's Borg, and Google is the main contributor to the ecosystem. Google built out their datacenter architecture and management tools before ipv6 was a "thing", and deploying ipv6 at the lowest levels of the "hyperscalers'" data centers is... really, really scary, so they didn't. And then k8s needs to actually work on Google infrastructure (and indeed, there's no incentive to make it work anywhere else).
Wed, 09 Feb 2022 06:58:37 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884155/
jhoblitt:
Giving each container its own network address, distinct from the container host's, is called an overlay network. Whether it is ipv4, ipv6, ipx, or appletalk is just an implementation detail.

k8s supports ipv4 and ipv6, and last year dual stack came out of beta.
Wed, 09 Feb 2022 01:19:37 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884141/
Cyberax:
You certainly can, and that's what the newest K8s does. But the K8s overlay network also provides some security guarantees, because it's trusted and is not open to the public Internet.
Tue, 08 Feb 2022 23:02:26 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884140/
Sesse:
OK, that's pretty insane (it's not like IPv6 has been particularly exotic for the last decade or two).
But I guess people are so incredibly wed to their IPv4 thinking that they'd rather add thirty layers of complexity than switch to the version with more address space :-)
Tue, 08 Feb 2022 22:59:26 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884136/
atnot:
It does theoretically, except that:

* it is only somewhat usable in recent versions
* docker still does not enable it out of the box, so nobody builds OCI images with ipv6 in mind
* very little of the tooling has ever encountered a v6 address, never mind a v6-only environment
* neither have most developers working in the space
* the cloud and corporate environments these systems run in are usually rigidly v4-only

Sometimes I wonder where we'd be today if the people at docker had decided to use v6 internally right away. It's not like anyone would have batted an eye, with all of the other quirks of docker. But instead every organization using 172.16.0.0/16 internally now has to deal with an endless stream of users running docker complaining about not being able to access the network. Oh well.
Tue, 08 Feb 2022 22:50:52 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884132/
Sesse:
> It doesn't particularly matter how large the addressable space is. The option is to either try to coordinate the ports used by potentially 100s of containers dynamically scheduled onto the same host or use an abstraction layer such that every container can have a service listening on port 80.

Why is it not an option to give each container an IPv6 subnet?

> I can think of no examples of a container orchestrator that went with port coordination.

Borg did.
Tue, 08 Feb 2022 21:43:46 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884130/
jhoblitt:
It doesn't particularly matter how large the addressable space is. The option is to either try to coordinate the ports used by potentially 100s of containers dynamically scheduled onto the same host or use an abstraction layer such that every container can have a service listening on port 80. I can think of no examples of a container orchestrator that went with port coordination. docker swarm, mesos, ecs, k8s, cloud foundry, etc. support an overlay.
Tue, 08 Feb 2022 21:40:59 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884128/
Sesse:
Really, “has no”? I assume these things support IPv6?
Tue, 08 Feb 2022 21:27:02 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884126/
atnot:
I think they are unfortunately pretty inevitable when everyone has to share the same 24/16 bits of address space, especially over existing networks usually not built to the security and scalability requirements of running thousands of containers.
Tue, 08 Feb 2022 21:22:10 +0000

RAMFS?
https://lwn.net/Articles/884068/
k3ninho:
I was so hoping we could apply everything-is-a-file to this filter-all-the-things pattern...

K3n.
Tue, 08 Feb 2022 11:45:07 +0000
RAMFS?
https://lwn.net/Articles/884067/
matthias:
I guess the answer is a simple no.

Does RAMFS even support block sizes smaller than the page size? The new allocator uses 64 bytes as its block size. Also, a filesystem has huge overhead, as each file needs an inode and a directory entry. None of this is needed for BPF programs: knowing their address (in memory) and maybe their size is enough to use them. All the overhead a filesystem has to represent a single file will be much bigger than the program itself.

And the filesystem cannot manage the permissions. Filesystem permissions are a completely different thing. Here, we are talking about the access bits that are used by the memory management unit, and a filesystem will not help with that. You cannot mix data with executable code, as the two need different bits set in the page tables. However, these bits are only available per page. So you need a pool in memory that does not hold any data, just executable code.
Tue, 08 Feb 2022 08:56:36 +0000
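[Not part of the original thread: a minimal userspace sketch of the per-page granularity matthias describes. mprotect(2) only accepts whole, page-aligned ranges, so writable data and executable code cannot share a page, and flipping the single useful byte below to executable costs an entire page (the same 4KB-per-program overhead the article's allocator tries to avoid). It assumes x86-64, where 0xc3 is a "ret" instruction.]

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);

        /* One writable, anonymous page. */
        unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        buf[0] = 0xc3;            /* x86-64 "ret": a one-byte program */

        /* Permissions can only be flipped for the whole page, never for
         * the single byte we actually care about. */
        if (mprotect(buf, page, PROT_READ | PROT_EXEC) != 0)
            return 1;

        ((void (*)(void))buf)();  /* call the one-byte function */
        printf("made %ld bytes executable for a 1-byte program\n", page);

        munmap(buf, page);
        return 0;
    }

[The kernel side works at the same granularity, changing protections with helpers such as set_memory_ro() and set_memory_x(), which is why a pool holding only executable code, as matthias describes, is needed.]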
A memory allocator for BPF code
https://lwn.net/Articles/884063/
bartoc:
I remain baffled that overlay networks are so ingrained in Kubernetes.
Tue, 08 Feb 2022 07:53:45 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883981/
k3ninho:
This 'packing and fragmentation' pattern looks like a filesystem -- would extending RAMFS with de-allocation (so the kernel can manage the pool) be a better way to manage permissions and pack smaller-than-4K items across pages?

K3n.
Mon, 07 Feb 2022 10:41:46 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883934/
Sesse:
More precisely: you'd need hundreds just to break even, and in the thousands to see any gains.

My server seems to have 20–30 BPF programs, but I guess this will eventually increase. Maybe this allocator will be optional?
Sun, 06 Feb 2022 17:46:15 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883932/
ttuttle:
BPF programs come in as bytecode, and this allocator holds machine code. By the time it's running, the code is already stored in a temporary buffer (from kvmalloc), so it knows how much space it needs.

How does the BPF code itself know how big a buffer to ask kvmalloc for? Or is it one of those "ask for a reasonable size and double it every time it fills up" kind of deals?
Sun, 06 Feb 2022 17:29:04 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883930/
plugwash:
A huge page is 512 regular pages. So it sounds like you would need hundreds of BPF programs to see a memory usage benefit from this patch. From your numbers it sounds like that would not be the case on your laptop.

Of course you could argue that a couple of wasted megabytes is lost in the noise on a modern desktop/laptop, but...
Sun, 06 Feb 2022 16:25:52 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883927/
jhoblitt:
There are already Kubernetes CNIs that use BPF to create the overlay network. Maybe someone is pushing packet filtering into BPF as one rule per prog?
Sun, 06 Feb 2022 14:30:00 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883919/
zdzichu:
There could be more than you expect. You can use the "bpftool prog" command to see them.
On my typical Fedora laptop, there are 21 programs loaded. On my home server there are 470. All of them loaded by systemd and libvirtd, nothing custom.
Sun, 06 Feb 2022 08:10:02 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883918/
epa:
I think you must have a lot if you care at all that each one uses a whole four kilobytes normally.
Sun, 06 Feb 2022 08:00:40 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883906/
Sesse:
You sure have a lot of BPF programs if you feel that a huge page (2MB) is a good minimum size to spend on them!
Sun, 06 Feb 2022 00:06:58 +0000

Pointed question, I guess...
https://lwn.net/Articles/883862/
warrax:
I guess it's not that much about the allocator specifically, but I'm wondering... as this BPF thing gains more and more capabilities... is there *any* attempt at formal verification that extensions don't break the critical guarantees of the verification step?
Fri, 04 Feb 2022 20:56:28 +0000
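[Not part of the original thread: a hedged footnote to zdzichu's "bpftool prog" count a few comments up. The same tally can be produced programmatically by walking the kernel's BPF program-ID space with libbpf's bpf_prog_get_next_id() and bpf_prog_get_fd_by_id(), which is roughly what bpftool does. The sketch assumes libbpf headers are installed (build with -lbpf) and that the process has CAP_SYS_ADMIN, or CAP_BPF on newer kernels.]

    #include <stdio.h>
    #include <unistd.h>
    #include <bpf/bpf.h>     /* libbpf's low-level syscall wrappers */

    int main(void)
    {
        __u32 id = 0;
        unsigned int count = 0;

        /* Walk the kernel's BPF program-ID space; the loop ends when
         * the kernel reports there is no next ID.  Without sufficient
         * privileges the count simply stays at zero. */
        while (bpf_prog_get_next_id(id, &id) == 0) {
            int fd = bpf_prog_get_fd_by_id(id);

            if (fd < 0)          /* program unloaded mid-iteration */
                continue;
            count++;
            close(fd);
        }

        printf("%u BPF programs currently loaded\n", count);
        return 0;
    }

[Compiled with something like "cc count_bpf.c -lbpf" and run as root, the total should line up with what "bpftool prog" shows.]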