LWN: Comments on "Custom out-of-memory killers in BPF" https://lwn.net/Articles/1019230/ This is a special feed containing comments posted to the individual LWN article titled "Custom out-of-memory killers in BPF". en-us Mon, 29 Sep 2025 07:22:38 +0000 Mon, 29 Sep 2025 07:22:38 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net I'm from the back-times... https://lwn.net/Articles/1020158/ https://lwn.net/Articles/1020158/ NYKevin <div class="FormattedComment"> Borg and k8s are designed for use with many identical pod-equivalents (pods in k8s, allocs in Borg) as a means of horizontal scaling, preferably under the control of some kind of automation. In order for this to work well, each pod-equivalent should have predictable performance behavior, and should not abruptly get much slower when it begins to swap. Otherwise, you could get into a vicious cycle:<br> <p> 1. There is momentarily too much load. Let's assume for the sake of argument that physical RAM is the limiting resource.<br> 2. Some of your pod-equivalents begin to swap due to (by assumption) a shortage of physical RAM. If the swapping is bad enough, their performance deteriorates considerably.<br> 3. Automation notices that your performance is failing to keep up with demand, and spawns more pod-equivalents. It believes it has the resources to afford this operation, because more swap is still available and the binpacking algorithm believes the jobs will fit into that swap.<br> 4. This is exactly the wrong thing to do. The problem is not that you lack pod-equivalents, the problem is that you lack physical RAM. Now you have more pod-equivalents competing for exactly the same amount of physical RAM. Each pod-equivalent probably has nonzero memory overhead, so this means less RAM is available for doing useful work, and overall swapping increases.<br> 5. GOTO 3 until you run out of swap or a human takes over and kills the excess pod-equivalents.<br> <p> If you don't support swap, then this never happens, and instead the automation reports that it is unable to scale up due to a lack of physical resources. You can then have other automation (such as load-balancing or work planning) take appropriate steps at a global level (move work to another physical location, drop lower priority work, etc.), or wake up a human and have them somehow intervene.<br> <p> You might instead "support" swap, but somehow break it out as a separate resource dimension so that horizontal autoscaling does not engage in the above behavior. But swap is not a cleanly independent resource dimension - it is derived from the difference between the virtual memory allocation and the physical memory allocation. That means a human or *vertical* autoscaler cannot directly measure a pod-equivalent's swap and physical RAM demands separately, and instead needs to measure something like the page fault rate to get a complete picture. This is not impossible, but it's significantly more complex than just measuring memory usage, and is likely prone to misbehavior under the wrong conditions (some pod-equivalents will have a ton of page faults on startup, when reading data files, etc., and you have to filter out all of that noise).<br> <p> It should also be emphasized that most autoscaling implementations will assume that resource-related metrics are linear in demand. 
The highly nonlinear behavior of swapping presents a difficult engineering problem in its own right (even if you could solve all of the other engineering problems I have described above).<br> <p> <span class="QuotedText">&gt; and will just shout-chant "cattle not pets" and "stateless stateless stateless" at you.</span><br> <p> You are, of course, under no obligation to architect your systems in this manner (cattle, not pets). But it seems strange to use a tool specifically intended for that use case, and then complain that nobody is interested in contorting it into serving some other purpose instead.<br> <p> Disclaimer: As a Google SRE, I work with Borg on a daily basis, but not so much with k8s. I've done my best to use the correct k8s terminology and concepts to get myself across. I am not specifically responsible for Borg itself, so it's likely that Borg SRE would explain this differently than I do.<br> </div> Mon, 05 May 2025 20:22:02 +0000 Walking towards BPF overdependency https://lwn.net/Articles/1019946/ https://lwn.net/Articles/1019946/ Jandar <div class="FormattedComment"> <span class="QuotedText">&gt; All things being equal, wouldn't we want more eBPF?</span><br> <p> Only if an out-of-the-box Linux kernel is functional. Maybe a couple of necessary eBPF programs could be shipped within the kernel sources and optionally exposed like the vsyscall memory segment, or preloaded.<br> </div> Sat, 03 May 2025 23:06:50 +0000 Walking towards BPF overdependency https://lwn.net/Articles/1019941/ https://lwn.net/Articles/1019941/ quotemstr <div class="FormattedComment"> So what? Functionality in eBPF is sandboxed; functionality outside it isn't. All things being equal, wouldn't we want more eBPF?<br> </div> Sat, 03 May 2025 20:22:59 +0000 I'm from the back-times... https://lwn.net/Articles/1019902/ https://lwn.net/Articles/1019902/ davecb <div class="FormattedComment"> Indeed: if I wanted the limited Solaris functionality, I'd set MemoryHigh=bytes, as per <br> <a href="https://docs.kernel.org/admin-guide/cgroup-v2.html#memory-interface-files">https://docs.kernel.org/admin-guide/cgroup-v2.html#memory...</a><br> <p> Now, how can we improve on that with eBPF? (Hint: I mentioned TeamQuest earlier)<br> <p> --dave<br> <p> (When I last looked, LimitRSS wasn't implemented on Linux. Also, the ulimit/LimitXXX mechanism is the one that causes the system to return SIGSEGV/EMFILE/EFBIG/SIGXCPU/EAGAIN errors, rather than making the process page.<br> <p> The MemoryXXX directives are:<br> - MemoryMin=bytes, the amount of memory you’re guaranteed. If your usage is below this, you won’t be reclaimed/paged.<br> - MemoryHigh=bytes, the point at which the system starts penalising the process. If you exceed this, your process will be throttled and memory reclaimed, to avoid it being oom-killed.<br> - MemoryMax=bytes, the hard limit. If your process exceeds this, the oom-killer will be called.)<br> </div> Fri, 02 May 2025 20:58:59 +0000 Walking towards BPF overdependency https://lwn.net/Articles/1019878/ https://lwn.net/Articles/1019878/ karim <div class="FormattedComment"> It sort of feels like we're slowly getting into a situation where having a "fully functional system" won't be possible without having a whole library of BPF programs running at all times.<br> </div> Fri, 02 May 2025 14:06:25 +0000
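For anyone who wants to try the MemoryXXX directives summarised in davecb's comment above, here is a minimal sketch. It assumes a reasonably recent systemd running on cgroup v2; "mydb.service", the job name, and the byte values are placeholders rather than anything taken from the discussion.

    # Adjust the memory protections/limits of an existing (hypothetical) service at runtime:
    systemctl set-property mydb.service MemoryMin=1G MemoryHigh=6G MemoryMax=8G

    # Or run a one-off command in its own scope with a soft and a hard limit:
    systemd-run --scope -p MemoryHigh=2G -p MemoryMax=3G -- ./memory-hungry-job

MemoryHigh= gives the "throttle and reclaim" behaviour davecb describes; only crossing MemoryMax= brings the OOM killer into play.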
I'm from the back-times... https://lwn.net/Articles/1019837/ https://lwn.net/Articles/1019837/ Sesse <div class="FormattedComment"> You can set LimitRSS= on a systemd service (which uses cgroups to this effect).<br> </div> Fri, 02 May 2025 12:49:52 +0000 I'm from the back-times... https://lwn.net/Articles/1019827/ https://lwn.net/Articles/1019827/ davecb <div class="FormattedComment"> Excellent!<br> That's probably why Solaris makes you say what "normal" is (:-))<br> <p> Of course, I probably want a computer program to measure past usage and tell me that. TeamQuest did, but that was for capacity planning.<br> </div> Fri, 02 May 2025 10:19:15 +0000 I'm from the back-times... https://lwn.net/Articles/1019815/ https://lwn.net/Articles/1019815/ taladar <div class="FormattedComment"> The problem with that idea is that it is often not the largest user of memory that is at fault for physical memory running out. For example, you might have a database server that legitimately uses 70% of your RAM and a webserver that legitimately uses 20%, but some temporary process is the one that pushes you over the 100% threshold, and then the database server is the one most likely to get slowed.<br> </div> Fri, 02 May 2025 08:50:25 +0000 would you have memory left to be able to initialize the BPF script? https://lwn.net/Articles/1019743/ https://lwn.net/Articles/1019743/ corbet The BPF OOM program would be loaded early on, so it would be present and ready when the crisis strikes. Fri, 02 May 2025 05:05:42 +0000 I'm from the back-times... https://lwn.net/Articles/1019731/ https://lwn.net/Articles/1019731/ Cyberax <div class="FormattedComment"> That doesn't actually turn off ALL the overcommit.<br> </div> Fri, 02 May 2025 02:36:37 +0000 would you have memory left to be able to initialize the BPF script? https://lwn.net/Articles/1019715/ https://lwn.net/Articles/1019715/ atai <div class="FormattedComment"> If you cannot create the BPF runtime or load the BPF program because of an out-of-memory condition, then what?<br> </div> Thu, 01 May 2025 22:28:09 +0000 I'm from the back-times... https://lwn.net/Articles/1019713/ https://lwn.net/Articles/1019713/ ringerc <div class="FormattedComment"> Unless you're stuck on Kubernetes, where they're incredibly hostile to the concept of swap, and will just shout-chant "cattle not pets" and "stateless stateless stateless" at you.<br> <p> Ahem.<br> <p> In reality k8s has more recently gained some limited and somewhat crippled swap support. But it's rather limited, and because it's behind a feature flag it's impossible to rely on if you have to support cloud-provider-hosted k8s environments.<br> </div> Thu, 01 May 2025 22:18:25 +0000 test comment https://lwn.net/Articles/1019700/ https://lwn.net/Articles/1019700/ nicothieb <div class="FormattedComment"> test comment (sorry)<br> </div> Thu, 01 May 2025 19:40:41 +0000 I'm from the back-times... https://lwn.net/Articles/1019673/ https://lwn.net/Articles/1019673/ davecb <div class="FormattedComment"> Yup! I did that on certain production machines I managed in the past.<br> <p> Mind you, I think an oom-killer is a strange but sometimes good idea. I was looking for possible better algorithms.<br> Solaris, for example, had a mechanism to cap rss for a process. If it exceeded that, it had to page.<br> <p> That was one part of a general solution. That particular scheme assumed you knew how much rss to allocate to a program.
Sometimes true, sometimes not (:-)) <br> </div> Thu, 01 May 2025 16:08:15 +0000 Userspace https://lwn.net/Articles/1019670/ https://lwn.net/Articles/1019670/ danobi <div class="FormattedComment"> <span class="QuotedText">&gt; Since they run in user space, they will necessarily be slower than the kernel to respond to a low-memory situation, which is a problem that urgently requires a solution.</span><br> <p> Not necessarily - oomd and friends are designed to kick in before the kernel OOM killer by looking at PSI and other metrics. It's a bit of a layered design: we prefer that the kernel OOM killer not be invoked, but it still exists as a backstop.<br> <p> <span class="QuotedText">&gt; Running the user-space OOM killer may, itself, require allocating memory at the worst possible time.</span><br> <p> I think memory.min can be used to reserve memory for these situations.<br> </div> Thu, 01 May 2025 15:47:52 +0000 I'm from the back-times... https://lwn.net/Articles/1019668/ https://lwn.net/Articles/1019668/ ballombe <div class="FormattedComment"> You can go back to olden times by issuing<br> echo 2 &gt; /proc/sys/vm/overcommit_memory<br> </div> Thu, 01 May 2025 15:27:10 +0000 I'm from the back-times... https://lwn.net/Articles/1019659/ https://lwn.net/Articles/1019659/ davecb <div class="FormattedComment"> .. where you could overcommit memory, but only up to the total of the swap space and real memory.<br> <p> The reasoning was that swap was slow, but large. The equivalent of cgroups allowed the offending programs to be slowed, limiting the amount of real memory they used and thereby forcing them to swap. That allowed the memory hogs to be penalised for their excessive use of memory, and freed memory, and indirectly CPU, for the other programs on the system.<br> <p> These days, flash disks aren't all that slow. Is it time to think about different mitigations, knowing that "disk" is still large, but no longer slow?<br> <p> As an example, in theory, we could use the cgroup primitives with the eBPF oom-detector to penalise just the memory hogs. Is that possible with the current eBPF and cgroups-v2?<br> <p> --dave (unix greybeard) c-b<br> </div> Thu, 01 May 2025 14:27:11 +0000
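davecb's closing question, whether cgroup primitives plus a BPF OOM hook could be used to penalise memory hogs rather than kill them, can at least be half-sketched from user space today in the layered style danobi describes above: watch PSI and tighten memory.high on the offending cgroup so that it is throttled and reclaimed instead of OOM-killed. Below is a minimal sketch, assuming cgroup v2 mounted at /sys/fs/cgroup, a kernel with PSI enabled, and a hypothetical "hogs.slice" cgroup holding the workloads to be penalised; the threshold and interval are arbitrary placeholders, and the BPF half of the question is exactly what the patches discussed in the article explore.

    #!/bin/sh
    # Hypothetical user-space "penalise the hog" loop (not eBPF): when memory
    # pressure rises, clamp memory.high on a designated cgroup so its growth is
    # throttled and reclaimed rather than handled by the OOM killer.
    CG=/sys/fs/cgroup/hogs.slice   # hypothetical cgroup for the noisy workloads
    THRESHOLD=20                   # act when the "some" avg10 memory pressure exceeds 20%

    while sleep 5; do
        # /proc/pressure/memory: "some avg10=N.NN avg60=... avg300=... total=..."
        avg10=$(awk '/^some/ { sub("avg10=", "", $2); print int($2) }' /proc/pressure/memory)
        if [ "$avg10" -ge "$THRESHOLD" ]; then
            # Penalise: cap the cgroup at roughly its current usage, so further
            # allocations cause reclaim and throttling inside that cgroup.
            usage=$(cat "$CG/memory.current")
            echo "$usage" > "$CG/memory.high"
        else
            echo max > "$CG/memory.high"   # relax the penalty once pressure subsides
        fi
    done

A production tool such as oomd or systemd-oomd does considerably more (per-cgroup PSI, swap accounting, kill policies), and the BPF OOM work described in the article would move this kind of policy into the kernel's OOM path rather than a polling loop; the memory.high write, though, is the same "slow it down instead of killing it" lever davecb is asking about.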