LWN: Comments on "A memory allocator for BPF code"
https://lwn.net/Articles/883454/
This is a special feed containing comments posted to the individual LWN article titled "A memory allocator for BPF code".
Mon, 20 Oct 2025 11:17:40 +0000

A memory allocator for BPF code
https://lwn.net/Articles/893245/
MaZe:
Cluster stuff is now (finally!) basically done and ipv6-only.
Though of course there are always weird special cases and exceptions, like critical ipv4-only hardware (temperature sensors, ntp/gps and the like).
The current true battlefront is in cloud... and to a lesser extent trying to get to ipv6-only corp (due to rfc1918 [incl. cgnat] exhaustion).
Sun, 01 May 2022 12:01:13 +0000

A memory allocator for BPF code
https://lwn.net/Articles/885041/
nybble41:
TUBA (RFC1347) keeps TCP and UDP but replaces the IP layer with the Connectionless Network Protocol (CLNP). No compatibility was offered via translation or tunneling between IP and CLNP networks, and CLNP diverges more from IPv4 than IPv6 does. Ergo, TUBA would not have been any less disruptive than dual-stack IPv4+IPv6, and it represents a step backward from something like 464XLAT, which permits IPv6-only networks to communicate with IPv4-only hosts via stateful NAT and protocol translation. Plus, of course, all the hardware engineers' perfectly legitimate objections to variable-length addresses.
Wed, 16 Feb 2022 20:14:54 +0000

A memory allocator for BPF code
https://lwn.net/Articles/885037/
flussence:
> IPv6 evolved out of a couple of experimental IPng protocols, one of which was TUBA.

Wait, the real answer to the ignoramuses going “LOL IPv6, why don't we just make IPv4 addresses longer?” was sitting right there in an RFC (1347) all this time?!

I almost want them to find it and try to implement it, just for the schadenfreude.
Wed, 16 Feb 2022 19:49:52 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884603/
Sesse:
> It is easy to wave your hands and be dismissive of this but all of the so-called hyperscalers had massive difficulty in rolling out ipv6.

Citation needed, please? Or at least a clarification of what “rolling out IPv6” would entail, because e.g. www.google.com has had AAAA records since June 6th, 2012 (World IPv6 Launch).
Sat, 12 Feb 2022 01:10:09 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884602/
Sesse:
FWIW, I did large chunks of Google's IPv6 porting back in the day, and nearly all code was indeed written protocol-independently, so slotting in IPv6 support was fairly easy. You had to fix a few central abstractions, and some stuff around logging and ACLs and such, but it turns out most code doesn't care much about what an IP address _is_, just that you can store it and send it around to other parts of the code. (We were, de facto, three people who did most of it over a period of 1–2 years, not all of us full-time.)
IIRC, you could have pulled up an IPv6-only Borg cluster around early 2011 or so if you wanted, and have it run real production-like workloads. But changing how cluster operations work is a completely different game; one that I never pursued, and I have no idea how it works internally now (I've since quit and then rejoined Google, but in a completely different part of the company). And public cloud happened after all of us had moved on to other systems, so no, I'm not responsible for any deficiencies it might have :-)
Sat, 12 Feb 2022 01:07:13 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884583/
jhoblitt:
There is also the small issue that until extremely recently, ipv6 support in network equipment often could not be trusted. It is easy to wave your hands and be dismissive of this, but all of the so-called hyperscalers had massive difficulty in rolling out ipv6. It is easy to print "supports ipv6" on the box of a switch, but does it work with the same reliability and hardware acceleration as ipv4? Say, with NDP working reliably over an EVPN-based spine+leaf deployment? The answer was obviously no until practically last week.
Fri, 11 Feb 2022 21:29:38 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884569/
jd:
Well, given the way IPv6 has been used in practice, you're absolutely right. And if you are content to only consider what is mainstream practice, nothing after this point will be of the slightest interest.

According to how it was originally designed (a prefix which defines the route, and a suffix that defines any given physical or virtual machine that is connected via that route), there was a bit more freedom, as any physical or virtual interface was absolutely guaranteed a unique IPv6 address. This was when autoconfiguring networks were to be the way and DHCP was seen as a maintenance-heavy dead end.

(This is how the original mobility system worked. If you moved a machine from one network to another, you simply notified everyone connected that your prefix had changed and all packets - including those in transit - would be diverted to your new prefix. Your old address would be marked transient and would remain usable for existing packets but not new ones. This necessitates a unique suffix.)

This meant that to have a new subnet, you simply added a byte to the prefix to identify the new network. As long as there were bytes left you could use, defining new subnets was trivial. Which meant one interface had one IP address. There was no concept of multihoming. In order to have more than one IP address go through an interface, you needed a virtual network, where each virtual interface had one IP address. Yes, it's still an overlay network, but it became part of the design rather than an add-on, and there was no distinction between what was software and what was hardware. It was just one network.

Telebit devised an extension for this, although I think it vanished with them. As far as I could understand, their system allowed the creation of networks that were utterly transparent to the outside world through some sort of NAT. Your traffic could even go over these transparent segments and back onto the visible network, but it would look like a single hop to the outside world, which wasn't bad for nascent IPv6 in 1996.
That would have let you get past the prefix length limit.

Of course, nothing actually stops you from using the autoconfigure protocol for containers and virtual switches in a virtual network of containers - well, other than the fact that this really doesn't play nice with software that assumes a static IP. You're only guaranteed a static suffix.

If you're happy with mainstream protocols, whether or not it's standard use, then stop here. Because it's about to get scary.

IPv6 evolved out of a couple of experimental IPng protocols, one of which was TUBA. Now, with TUBA, the idea was that addresses were variable length. (I did say it was about to get scary.) IIRC, routing would be done by moving a cursor up or down from the current position to read just the next few bytes and see where the packet should go. Something hardware engineers were getting ready to mutiny over, from the sounds of things. How this would have ever worked is left as an exercise for Stephen King fans. But in principle, it would have meant that you couldn't run out of space on the prefix.
Fri, 11 Feb 2022 21:18:15 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884571/
zdzichu:
It's worth noting there's another company with great engineering accomplishments, namely Facebook (or maybe Meta now?). They went with IPv6-first internal networks in their DCs. They also have their own container-orchestration solution – Tupperware.
I'd love to see a comparison (with emphasis on networking) between Tupperware and Kubernetes/Borg.
Fri, 11 Feb 2022 20:39:39 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884567/
jd:
Google was founded in 1998, two years after I'd set up Manchester University's IPv6 node. So, yes, it was before IPv6 became mainstream, but stacks were in Linux as of 2.0.20 (with a patch) and were in the main kernel by early 2.1, IIRC. (They'd already been in Windows, thanks to FTP Software, and in Solaris.) KAME was already out for the BSDs, I think.

At that time, IIRC, NRL were distributing a library for stack-independent network code. (The connection could be IPv4, IPv6 or indeed any other supported protocol, and that detail would be hidden from the application.) Since Google's use case was not that complicated, Google could even have written a small bit of code to do the same thing, although it would have meant getting the web server to support it, which would have added to what they had to maintain.

So Google could have either supported IPv6 directly or developed their code to simply not care about that layer at all. Now, in hindsight, the extra work then (when things were simple) might have made sense, although it's just as possible that they simply didn't have the time or money to throw in features that they couldn't sell at the time.
And, let's be honest, there was quite a lot of cynicism about IPv6 back then as well.

Today, a protocol-independent communications layer would be a LOT of work for everyone; I'm not sure you could introduce such a system at this point, and switching a massive heap of interdependent legacy IPv4 code to IPv6 would be almost as painful.
Fri, 11 Feb 2022 20:12:45 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884178/
Cyberax:
The K8s network is not open to the public Internet at all. If you receive a packet from it, there's a guarantee that it has been sent by somebody within the trusted net.
Wed, 09 Feb 2022 07:51:18 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884176/
bartoc:
That's sort of true, I suppose, but it's true in the same way that NAT provides security guarantees (you can't have an overlay network without doing that stuff, but you can do that stuff without having an overlay network).
Wed, 09 Feb 2022 07:34:08 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884171/
bartoc:
No, "overlay networks" are not the same as "giving each container its own IP".

Overlay networks imply some kind of packet encapsulation; with ipv6 that's not necessary.
Wed, 09 Feb 2022 07:31:38 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884168/
bartoc:
Yeah, having to administer BGP servers of all things always seemed more painful to me than just getting ipv6 working correctly.

I think part of it is that it comes from Google's Borg, and Google is the main contributor to the ecosystem. Google built out their datacenter architecture and management tools before ipv6 was a "thing", and deploying ipv6 at the lowest levels of the "hyperscalers'" data centers is... really, really scary, so they didn't. And then k8s needs to actually work on Google infrastructure (and indeed, there's no incentive to make it work anywhere else).
Wed, 09 Feb 2022 06:58:37 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884155/
jhoblitt:
Giving each container its own network address, distinct from the container host's, is called an overlay network. Whether it is ipv4, ipv6, ipx, or appletalk is just an implementation detail.

k8s supports ipv4 and ipv6, and last year dual stack came out of beta.
Wed, 09 Feb 2022 01:19:37 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884141/
Cyberax:
You certainly can, and that's what the newest K8s does. But the K8s overlay network also provides some security guarantees, because it's trusted and is not open to the public Internet.
Tue, 08 Feb 2022 23:02:26 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884140/
Sesse:
OK, that's pretty insane (it's not like IPv6 has been particularly exotic for the last decade or two).
But I guess people are so incredibly wed to their IPv4 thinking that they'd rather add thirty layers of complexity than switch to the version with more address space :-)
Tue, 08 Feb 2022 22:59:26 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884136/
atnot:
It does theoretically, except that:

* it is only somewhat usable in recent versions
* docker still does not enable it out of the box, so nobody builds OCI images with ipv6 in mind
* very little of the tooling has ever encountered a v6 address, never mind a v6-only environment
* neither have most developers working in the space
* the cloud and corporate environments these systems run in are usually rigidly v4-only

Sometimes I wonder where we'd be today if the people at docker had decided to use v6 internally right away. It's not like anyone would have batted an eye, with all of the other quirks of docker. But instead every organization using 172.16.0.0/16 internally now has to deal with an endless stream of users running docker complaining about not being able to access the network. Oh well.
Tue, 08 Feb 2022 22:50:52 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884132/
Sesse:
> It doesn't particularly matter how large the addressable space is. The option is to either try to coordinate the ports used by potentially 100s of containers dynamically scheduled onto the same host or use an abstraction layer such that every container can have a service listening on port 80.

Why is it not an option to give each container an IPv6 subnet?

> I can think of no examples of a container orchestrator that went with port coordination.

Borg did.
Tue, 08 Feb 2022 21:43:46 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884130/
jhoblitt:
It doesn't particularly matter how large the addressable space is. The option is to either try to coordinate the ports used by potentially 100s of containers dynamically scheduled onto the same host or use an abstraction layer such that every container can have a service listening on port 80. I can think of no examples of a container orchestrator that went with port coordination. docker swarm, mesos, ecs, k8s, cloud foundry, etc. support an overlay.
Tue, 08 Feb 2022 21:40:59 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884128/
Sesse:
Really, “has no”? I assume these things support IPv6?
Tue, 08 Feb 2022 21:27:02 +0000

A memory allocator for BPF code
https://lwn.net/Articles/884126/
atnot:
I think they are unfortunately pretty inevitable when everyone has to share the same 24/16 bits of address space, especially over existing networks usually not built to the security and scalability requirements of running thousands of containers.
Tue, 08 Feb 2022 21:22:10 +0000

RAMFS?
https://lwn.net/Articles/884068/
k3ninho:
I was so hoping we could apply everything-is-a-file to this filter-all-the-things pattern...

K3n.
Tue, 08 Feb 2022 11:45:07 +0000
RAMFS?
https://lwn.net/Articles/884067/
matthias:
I guess the answer is a simple no.

Does RAMFS even support block sizes smaller than the page size? The new allocator uses 64 bytes as its block size. Also, a filesystem has huge overhead, as each file needs an inode and a directory entry. None of this is needed for BPF programs: knowing their address (in memory) and maybe their size is enough to use them. All the overhead a filesystem has to represent a single file will be much bigger than the program itself.

And the filesystem cannot manage the permissions. Filesystem permissions are a completely different thing. Here, we are talking about the access bits that are used by the memory management unit, and a filesystem will not help with that. You cannot mix data with executable code, as the two need different bits set in the page tables. However, these bits are only available per page. So you need a pool in memory that does not hold any data, just executable code.
Tue, 08 Feb 2022 08:56:36 +0000
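[Not part of the original thread: a minimal userspace sketch of the per-page granularity matthias describes. mprotect(2) only accepts whole, page-aligned ranges, so writable data and executable code cannot share a page, and flipping the single useful byte below to executable costs an entire page (the same 4KB-per-program overhead the article's allocator tries to avoid). It assumes x86-64, where 0xc3 is a "ret" instruction.]

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);

        /* One writable, anonymous page. */
        unsigned char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        buf[0] = 0xc3;            /* x86-64 "ret": a one-byte program */

        /* Permissions can only be flipped for the whole page, never for
         * the single byte we actually care about. */
        if (mprotect(buf, page, PROT_READ | PROT_EXEC) != 0)
            return 1;

        ((void (*)(void))buf)();  /* call the one-byte function */
        printf("made %ld bytes executable for a 1-byte program\n", page);

        munmap(buf, page);
        return 0;
    }

[The kernel side works at the same granularity, changing protections with helpers such as set_memory_ro() and set_memory_x(), which is why a pool holding only executable code, as matthias describes, is needed.]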
A memory allocator for BPF code
https://lwn.net/Articles/884063/
bartoc:
I remain baffled that overlay networks are so ingrained in Kubernetes.
Tue, 08 Feb 2022 07:53:45 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883981/
k3ninho:
This 'packing and fragmentation' pattern looks like a filesystem -- would extending RAMFS with de-allocation (so the kernel can manage the pool) be a better way to manage permissions and pack smaller-than-4K items across pages?

K3n.
Mon, 07 Feb 2022 10:41:46 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883934/
Sesse:
More precisely: you'd need hundreds just to break even, and in the thousands to see any gains.

My server seems to have 20–30 BPF programs, but I guess this will eventually increase. Maybe this allocator will be optional?
Sun, 06 Feb 2022 17:46:15 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883932/
ttuttle:
BPF programs come in as bytecode, and this allocator holds machine code. By the time it's running, the code is already stored in a temporary buffer (from kvmalloc), so it knows how much space it needs.

How does the BPF code itself know how big a buffer to ask kvmalloc for? Or is it one of those "ask for a reasonable size and double it every time it fills up" kind of deals?
Sun, 06 Feb 2022 17:29:04 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883930/
plugwash:
A huge page is 512 regular pages. So it sounds like you would need hundreds of BPF programs to see a memory usage benefit from this patch. From your numbers it sounds like that would not be the case on your laptop.

Of course you could argue that a couple of wasted megabytes is lost in the noise on a modern desktop/laptop, but...
Sun, 06 Feb 2022 16:25:52 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883927/
jhoblitt:
There are already Kubernetes CNIs that use BPF to create the overlay network. Maybe someone is pushing packet filtering into BPF as one rule per prog?
Sun, 06 Feb 2022 14:30:00 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883919/
zdzichu:
There could be more than you expect. You can use the "bpftool prog" command to see them.
On my typical Fedora laptop, there are 21 programs loaded. On my home server there are 470. All of them loaded by systemd and libvirtd, nothing custom.
Sun, 06 Feb 2022 08:10:02 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883918/
epa:
I think you must have a lot if you care at all that each one uses a whole four kilobytes normally.
Sun, 06 Feb 2022 08:00:40 +0000

A memory allocator for BPF code
https://lwn.net/Articles/883906/
Sesse:
You sure have a lot of BPF programs if you feel that a huge page (2MB) is a good minimum size to spend on them!
Sun, 06 Feb 2022 00:06:58 +0000

Pointed question, I guess...
https://lwn.net/Articles/883862/
warrax:
I guess it's not that much about the allocator specifically, but I'm wondering... as this BPF thing gains more and more capabilities... is there *any* attempt at formal verification that extensions don't break the critical guarantees of the verification step?
Fri, 04 Feb 2022 20:56:28 +0000
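[Not part of the original thread: a hedged footnote to zdzichu's "bpftool prog" count a few comments up. The same tally can be produced programmatically by walking the kernel's BPF program-ID space with libbpf's bpf_prog_get_next_id() and bpf_prog_get_fd_by_id(), which is roughly what bpftool does. The sketch assumes libbpf headers are installed (build with -lbpf) and that the process has CAP_SYS_ADMIN, or CAP_BPF on newer kernels.]

    #include <stdio.h>
    #include <unistd.h>
    #include <bpf/bpf.h>     /* libbpf's low-level syscall wrappers */

    int main(void)
    {
        __u32 id = 0;
        unsigned int count = 0;

        /* Walk the kernel's BPF program-ID space; the loop ends when
         * the kernel reports there is no next ID.  Without sufficient
         * privileges the count simply stays at zero. */
        while (bpf_prog_get_next_id(id, &id) == 0) {
            int fd = bpf_prog_get_fd_by_id(id);

            if (fd < 0)          /* program unloaded mid-iteration */
                continue;
            count++;
            close(fd);
        }

        printf("%u BPF programs currently loaded\n", count);
        return 0;
    }

[Compiled with something like "cc count_bpf.c -lbpf" and run as root, the total should line up with what "bpftool prog" shows.]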