Out-of-memory victim selection with BPF
There are numerous ways to go hunting for a process to sacrifice when memory runs out. The process using the most memory is an obvious choice, but that process is often something important: a window-system server or a database manager, for example. So developers have naturally tried, over the years, to enable the kernel to make a better choice; see the LWN kernel index for a picture of how things have evolved over time. In current kernels, this decision comes down to a function called oom_badness() which, after exempting processes that cannot be killed for one reason or another, makes a simple calculation. A process's "OOM score" is essentially the amount of memory it uses, adjusted by that process's oom_score_adj value. By tweaking those knobs, user space can shelter some processes from the OOM killer's depredations while directing its attention toward others.
That, evidently, is not enough control for some users. The BPF patch series from Chuyi Zhou is the latest in a line of attempts to improve that control.
In current kernels, the OOM killer will iterate through all of the possible target processes, call oom_badness() on each, then target the process that is given the highest score. Zhou's patch set allows the oom_badness() check to be replaced with a call to a BPF function, which should be defined as an "fmod_ret" tracing function (meaning it is invoked on return from an internal kernel function and can change that function's return value) with this name and prototype:
int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc);
This function will be called at the beginning of the evaluation of each potential victim and, if it makes a decision on the given task, will cause the normal evaluation to be skipped. The oom_control structure describes the context in which the OOM kill is taking place; the BPF function has access to it but probably (the rules are not actually documented anywhere) should not make changes to it. That function can also look at the task under consideration and make a decision regarding its fate, as reflected in its return value:
- NO_BPF_POLICY: no policy is in effect, so the normal oom_badness() method should be used.
- BPF_EVAL_ABORT: abort the selection process entirely with no process chosen to kill.
- BPF_EVAL_NEXT: move on to the next process, passing over this one.
- BPF_EVAL_SELECT: select this process as the one to kill.
Returning BPF_EVAL_SELECT does not bring an end to the iteration through the list of processes; there will be further calls to bpf_oom_evaluate_task() if there are more processes to examine. As a result, the function can change its mind and return BPF_EVAL_SELECT again if a more appealing victim comes along later in the sequence.
It is possible to use BPF_EVAL_NEXT for some processes while using NO_BPF_POLICY for others. The end result will be to shield some processes from the OOM killer while letting the kernel make a decision in the usual way by looking at the rest. Mixing BPF_EVAL_SELECT and NO_BPF_POLICY looks like it could create surprising results, though; this combination does not appear to be intended and should, unless something changes in a future version, be avoided.
Specifically, the oom_control structure contains a pointer called chosen identifying the currently selected victim, and an integer chosen_points holding its badness score. In the absence of a BPF program, the kernel compares each process's score against chosen_points, and updates both if the new process has a higher score. Returning BPF_EVAL_SELECT sets chosen without setting chosen_points to anything. If NO_BPF_POLICY is returned for a later process, its score will be compared against a chosen_points that has no connection to the process selected earlier.
There are two related hooks provided by the patch set as well. One of them allows the name of the current victim-selection policy to be stored in the kernel; that name will be propagated through to the log when an actual kill is done. To do so, the program should define an fmod_ret function:
void bpf_set_policy_name(struct oom_control *oc);
That function, which will be called at the beginning of the OOM-kill procedure, can then turn around and call:
void set_oom_policy_name(struct oom_control *oc, const char *name, size_t sz);
where oc is the oom_control structure passed to bpf_set_policy_name(), name is the policy name to use, and sz is the length of that name. Names are limited to 16 bytes, including the terminating NUL byte.
There is also a new tracepoint, select_bad_process_end, that fires if the OOM-kill procedure fails to find a process to kill. It is intended to help developers who are working on a new OOM-kill policy.
This series is currently in its second revision. In response to the first posting, memory-management developer Michal Hocko suggested simplifying the interface somewhat. Roman Gushchin, instead, argued for taking a more general approach, where the BPF program is called once at OOM time and is expected to figure out a way to free some memory somewhere. Hocko responded that it would be better to start with "something that is good enough" and add complexity later if it seems warranted. In response to the second revision, though, Alexei Starovoitov also supported a more general callback, and Zhou has started considering the implications of such a change.
Both Hocko and Gushchin expressed worries that introducing BPF into this code, which runs when the system is in a distressed state, could further reduce the stability of an out-of-memory situation. An attempt by a BPF program to allocate a lot of memory in this situation seems likely to end in tears, for example. That is true of any code that hooks into the OOM-killer, though, and is not a problem specific to BPF.
The conversation has shown that there is some interest in the use of BPF to select victims for the out-of-memory killer. Thus far, though, there is not a clear consensus on the approach that this work should take. It would not be surprising, at this point, to see this feature go through some significant changes before it gets closer to the mainline; until then, the kernel will just have to continue choosing processes to sacrifice the old-fashioned way.
| Index entries for this article | |
|---|---|
| Kernel | BPF/Memory management |
| Kernel | Memory management/Out-of-memory handling |
Posted Aug 17, 2023 16:57 UTC (Thu)
by rrolls (subscriber, #151126)
[Link] (44 responses)
The handling of out-of-memory situations is one of the few things I still believe Windows does better than Linux, from my personal user experience; this is across Debian (my current go-to), Arch, CentOS 5-7 and a handful of other distros.
Windows sets up a default amount of swap space, and when RAM is close to exhaustion, programs tend to slow down a bit, but the system remains usable and stable, and all I need to do is realise what has happened, close enough programs to free up memory, and wait a minute or two, and it'll be as good as it was before.
Linux - or at least the distros I've used - also by default sets up an amount of swap space. However, the _moment_ it starts using any of that swap, the system becomes unusably slow - even basic interactions become virtually impossible - even switching to a text console and attempting to log in there takes so long that if it ever completes, whatever runaway process has filled up my RAM has probably eaten up all of swap by then as well. I have used some systems where it was possible to run swapoff to get the system back in working order, but otherwise, the only answer is the physical reset button. As a result, I now have a personal policy of always disabling swap on any Linux installation, resulting in the OOM killer getting invoked immediately instead, which usually still results in me needing to reset the system to recover, but at least this way I get to instantly find out what the problem is, and there's a chance there might be something left to salvage.
Personally, I've never bought the argument that overcommit provides more benefits than drawbacks. To me, the intuitive and obvious approach is that programs should be written in the first place to gracefully handle situations where malloc returns NULL, and failing that, a program should only ever get OOM-killed if that program is the one trying to allocate memory in the first place. One program calling malloc should not cause another program to get killed. (Yes, I understand that programs can cause memory to be allocated by writing to pages and so forth: I'm advocating that malloc, and its various higher-level analogues, should be the one and only point of possible OOM - and that if the system cannot guarantee memory will be available, it should not allocate the requested memory at all.)
I do wonder, overall, if all the effort spent on overcommit and OOM-selection could have been better spent designing programs to work without overcommit and handle failure gracefully. Of course, given that just about everything on Linux is currently written to assume overcommit is a thing, it'd be a mammoth task to change the status quo now, but am I really the only one to question whether, with hindsight, this was the best path to take?
Or, at least, why can't Linux at least do as well as Windows - in my experience - in this one respect?
Posted Aug 17, 2023 18:32 UTC (Thu)
by wsy (subscriber, #121706)
[Link]
You can easily do the same thing in Linux; on Debian, use the package swapspace.
Posted Aug 17, 2023 18:33 UTC (Thu)
by roc (subscriber, #30627)
[Link] (11 responses)
And even when there is something other than just exit that you'd like to do for OOM, it's really hard to implement. You have to do whatever it is without allocating more memory, which is often impossible. When it is possible, you have to thoroughly test these code paths, which is more hard work. It's hardly ever worth it.
Posted Aug 17, 2023 18:36 UTC (Thu)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Aug 17, 2023 21:20 UTC (Thu)
by flussence (guest, #85566)
[Link] (2 responses)
Posted Aug 18, 2023 10:36 UTC (Fri)
by khim (subscriber, #9252)
[Link]
And who would write such desktop for you? Currently the worst offenders which allocate tons of memory they would never use are precisely programs that show desktop, launch other programs and so on.
Posted Aug 18, 2023 23:40 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Aug 18, 2023 10:35 UTC (Fri)
by khim (subscriber, #9252)
[Link] (6 responses)
Try any program which was written before the Windows craze began, or even early Windows apps. If you don't have memory, they would politely inform you of that fact, and then you could close some documents or take some other action. Around the beginning of the XXI century, memory became cheap enough that rationing it became less important and, as usual, something that the user couldn't see started degrading in a hurry. But it's absolutely feasible and possible to do; there is just no way to make that happen in a market economy where people don't know or don't care that a program doesn't handle memory exhaustion gracefully.
Posted Aug 18, 2023 23:50 UTC (Fri)
by roc (subscriber, #30627)
[Link] (5 responses)
Personally I don't hark back to those "good old days". They were also the days when a bug in one application or a power flicker would crash the entire system and cause you to lose all work not manually saved. Modern applications continually save your data so you lose practically nothing if the application dies for any reason --- a much better approach, which also makes specialized OOM handling unnecessary.
Posted Aug 19, 2023 14:49 UTC (Sat)
by khim (subscriber, #9252)
[Link] (4 responses)
Tell that fairy tale to someone else, please. People wouldn't have clamored for phones with gigabytes of RAM if that approach actually worked. The Apple Lisa provided that capability forty years ago, and in Apple's walled garden this even works, to some degree. But still with a lot of glitches. Everywhere else? Nope. It's not hard to set aside some emergency memory which can be used for showing such message boxes. TurboVision and OWL did that and it worked. That's not entirely trivial, but certainly much easier than robustly saving the app state and restoring it after a crash. Saying that a solution to a task which nobody has been able to demonstrate is easier than something that was done routinely in the past is just useless posturing. But as I have said: at some point computers got enough memory for the problem not to become obvious in the two weeks in which you may demand your money back, and thus the incentive to write robust programs which one may actually trust disappeared. In places where it actually matters they are still written and used. History doesn't repeat itself, but it rhymes. I don't think we will ever return to the “good old days”, but after a collapse of our modern infrastructure people would, hopefully, rebuild it in a more robust fashion. Not a 100% bullet-proof one, but, hopefully, one that doesn't require throwing 1000x more resources at a problem than it intrinsically needs.
Posted Aug 20, 2023 5:12 UTC (Sun)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Aug 20, 2023 10:02 UTC (Sun)
by khim (subscriber, #9252)
[Link] (2 responses)
> Personally I'm pretty satisfied with the Android apps and Web apps I use not losing data.

Let me try. “Reply to this comment”, type something, kill the tab (like the OOM killer would do), try to reload. “Confirm form resubmission” and no matter what you do the comment is lost. Let's try Reddit. “Comment”, kill. Same. Or maybe GitHub… nope. Or do you want to say that they are saving “before doing important work” like Turbo Pascal did 40 years ago, which could save the file before trying to compile and run it? That's not a new trick and is not a replacement for proper handling of memory. And the same with Android apps. Android assumes that it may kill them at any moment and they will just restore their state, somehow, but that's hit-and-miss, and that's why phones with many gigabytes of memory are valuable. That's mostly thanks to modern OSes which properly isolate apps. The apps themselves are not any better, and in data-retention aspects they are even worse than they were 40 years ago. You just can afford to turn on various “auto-save” options because modern SSDs are so much faster than old floppies. But those options have existed for decades and were available way before sloppy memory programming became the norm.
Posted Aug 21, 2023 15:29 UTC (Mon)
by jkingweb (subscriber, #113039)
[Link] (1 responses)
I can't speak to Web apps and closing tabs (and LWN is not a Web app, anyway), but if I type into the comment form and kill my Android browser (Vivaldi) explicitly, the text is retained when I start it back up. If I close the tab and use Vivaldi's "undo" function immediately the text is retained. If I wait for the "undo" prompt to go away and re-open the tab from the list of closed tabs the text is gone (but this is not very surprising).
Posted Aug 21, 2023 15:55 UTC (Mon)
by khim (subscriber, #9252)
[Link]
Yes. Some web apps are preserving content. Some Android apps are preserving it, too. Vivaldi is pretty good there. But, as I have said, auto-save is not a new invention and yet it was always spotty. It's still spotty thus I'm not sure what kind of “progress” is discussed here.
Posted Aug 17, 2023 19:24 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link] (2 responses)
Also, the kernel knows nothing about malloc(3). The kernel only knows about mmap(2) and brk(2).
Posted Aug 17, 2023 19:47 UTC (Thu)
by gfernandes (subscriber, #119910)
[Link] (1 responses)
Posted Aug 20, 2023 17:57 UTC (Sun)
by jmspeex (subscriber, #51639)
[Link]
Posted Aug 17, 2023 21:25 UTC (Thu)
by donald.buczek (subscriber, #112892)
[Link]
Sure. However, a memory allocation might also happen because of automatic stack expansion. Without an explicit allocation call, which could return an error, it is much more difficult for an application to gracefully handle a failure.
Posted Aug 17, 2023 22:16 UTC (Thu)
by WolfWings (subscriber, #56790)
[Link] (2 responses)
While it won't overcommit it will still allow unfaulted pages to be used for page cache AFAIK so there's really no downsides.
vm.overcommit_memory=2
vm.overcommit_ratio=100
Some may bicker about the _ratio value or using _kbytes instead, but =100 works equally regardless of swap space or lack thereof, while anything lower makes a no-swap environment (might as well for VMs) just have inaccessible RAM left on the table.
And if there's some program that refuses to load because it relies on the page-fault handler as a lazy sparse-array implementation, honestly, I wrap that in its own VM at that point.
Posted Aug 18, 2023 12:47 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Aug 22, 2023 10:17 UTC (Tue)
by WolfWings (subscriber, #56790)
[Link]
Even MS starts at "RAM / 8, max 32GB" for their page file size recommended minimum. So beyond 256GB of RAM the baseline swap space doesn't go up any further. And even if you leave it on auto it will upper-limit it to 1/8th of the system volume size, and quite often large servers such as you're describing will have a separate boot volume of say a pair of RAID-1 480GB drives in my experience and the main data on separate volumes, so that puts a hard cap of 60GB virtual memory even if the server has over a terabyte of RAM.
https://learn.microsoft.com/en-us/troubleshoot/windows-cl...
The only time 'full RAM' sizes figure into things is if you expect and want to capture a full memory dump for either crash debugging or uncompressed hibernation. Even Ubuntu recommends (and I quote) round(sqrt(RAM)) so 1TB = 32G of swap... hey there's that number again.
https://help.ubuntu.com/community/SwapFaq
And disabling overcommit doesn't mean an allocation is instantly faulted in and removed from page cache. That's two separate things. Disabling overcommit just prevents the kernel from offering more pages than exist on the system, but doesn't change how those pages are actually managed. Copy-on-write still applies, etc.
Posted Aug 18, 2023 5:59 UTC (Fri)
by timrichardson (subscriber, #72836)
[Link] (3 responses)
Swap is always slow if it is being used to consistently compensate for a lack of RAM; Windows and macOS can't magically make it faster. The drama for desktop users with Linux is the OOM killer. The kernel one was very slow to act, at least before MGLRU; you could face minutes, maybe hours, of an effectively frozen system while the OOM killer waits to see what happens.
In my stress testing, MGLRU radically improves kernel OOM handling for desktop users. user space killers are hard too; the systemd attempt based on memory pressure seems very, very hard to configure properly for the desktop, it either acts too late or kills too early. I gave up on it. I think tweaking systemd-oomd for a good out of the box experience for desktop users is going to be very, very hard.
I found the best option for desktop use, where there is room for a swapfile on low-memory systems, is zswap, dynamic swap via swapspace, no user-space killer, and MGLRU activated. I don't think there is a single distribution which defaults to that combination.
Posted Aug 18, 2023 21:00 UTC (Fri)
by intelfx (subscriber, #130118)
[Link]
I believe Arch does almost that, modulo swapspace:
$ grep -E '(LRU_GEN|ZSWAP)' config
CONFIG_ZSWAP=y
CONFIG_ZSWAP_DEFAULT_ON=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD=y
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="zstd"
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC=y
CONFIG_ZSWAP_ZPOOL_DEFAULT="zsmalloc"
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
# CONFIG_LRU_GEN_STATS is not set
Posted Aug 21, 2023 15:47 UTC (Mon)
by patrakov (subscriber, #97174)
[Link] (1 responses)
For desktop systems this is true. The reason is that MGLRU protects the working set, and the interactive applications that the user cares about are exactly the working set.
On servers the situation is different. During a disaster, the sysadmin cares about the ability to ssh in, but, during normal operation, nobody is logged in, so sshd is an unused daemon, which gets evicted first and needs to be paged in. In other words, parts of sshd that are used only during the login procedure are not a part of the working set and are not protected by MGLRU. Result: existing connections with e.g. top (if they exist) have good response time, but new connections time out because the system cannot load the swapped-out parts of sshd fast enough.
Posted Aug 21, 2023 20:41 UTC (Mon)
by atnot (subscriber, #124910)
[Link]
> Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked. Above the effective min boundary (or effective low boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
Posted Aug 18, 2023 8:36 UTC (Fri)
by eduperez (guest, #11232)
[Link] (1 responses)
I do not think that the main argument in favor of overcommit are those programs that cannot handle a "no" from malloc; returning a valid pointer, to a memory that cannot be used because there is no more memory available, is just delaying the problem. Besides, that situation should be easy to take into account by the program, in most cases.
The main argument in favor of overcommit are those programs that allocate far more memory than they really need, because (in some use cases) it is easier to allocate a large memory block, and only use some parts of it, rather than allocate multiple smaller memory blocks. With overcommit, the kernel can delay the allocation until the program really uses each block; in many cases, most of that memory will never have to be allocated at all, and can be used by other programs.
In general, OOM kicks in when a program uses a memory block, not during the malloc.
Posted Aug 25, 2023 4:18 UTC (Fri)
by ssmith32 (subscriber, #72404)
[Link]
So, they try to play games to work around the actual memory allocator(s), blissfully ignorant of the fact they've accomplished exactly zero, and the actual memory allocator just effectively ignores their requests, and goes ahead and allocates when it decides is best.
Seems odd that we should worry about that scenario, but I can see why it happens, from both sides. E.g. Java programs working around the GC, which is on top of the userspace allocator, which is on top of the kernel allocator. They'd be better off with not Java, or understanding and configuring the many layers beneath them.
Posted Aug 18, 2023 13:28 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link]
A program allocating all memory it can and just sitting on it can then cause tiny utilities to fail to allocate a reasonably sized request.
There's actually a leak in macOS of Mach ports we're running into on our CI machines that causes process launch to fail, so any kind of remediation without an already-open System Monitor is DOA (e.g., SSH can't launch `zsh` for any kind of administration or even investigation).
I think that some more structured decision making behind which process needs to go is better than "whoever happens to get unlucky". Not that Linux's OOM killer is a *good* implementation, but I'd much rather something other than the processes I'm using to try and wrangle things be eligible for being next in line.
Posted Aug 19, 2023 14:10 UTC (Sat)
by vadim (subscriber, #35271)
[Link] (12 responses)
So for instance, right now I have a firefox process taking 6 GB RAM. Lots of stuff running. If it wanted to fork(), this means the allocation of 6 GB RAM more. If it wants to say, run a tiny commandline tool then the outcome would be that you'd need to satisfy the demand for 12 GB RAM for a few ms, until execve freed the additional 6GB.
Windows doesn't have this issue because it doesn't have fork(). A process just starts from scratch, so this intermediary step of huge process having two copies of itself is never a thing that happens.
So a solution would be to kill off fork(), but it's much easier said than done. You could deal with the execve case easily, but that leaves fork() for multiprocessing. Right now I have 65 firefox processes running, which without overcommit would take about 250 GB RAM. Fixing this would be very difficult because it'd be an entirely new model. You might have to fork earlier, or explicitly mark memory as unshared, or something else along those lines.
Posted Aug 19, 2023 16:02 UTC (Sat)
by mb (subscriber, #50428)
[Link] (10 responses)
There is vfork(), and clone() with all its flags and stuff to avoid a memory COW duplication.
COW is still kind of expensive, because it requires page table duplication. So it's worth avoiding, even without overcommit in mind.
Posted Aug 19, 2023 17:05 UTC (Sat)
by vadim (subscriber, #35271)
[Link] (6 responses)
clone() is Linux specific and thus non-portable, and much more complicated to use than fork(). I'm sure there's stuff that uses it, but I think most things just won't bother without a good reason.
Posted Aug 19, 2023 17:42 UTC (Sat)
by mb (subscriber, #50428)
[Link] (5 responses)
Well, it blocks until the child calls execve(). Which is the only thing the child is supposed to do. That takes a microsecond or so (Plus two context switches).
Posted Aug 19, 2023 20:32 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (4 responses)
Well, processes almost always do other things like close FDs, set up pipes, change permissions, configure signals, configure network/pid/ipc namespaces, etc. The fact that you need to actually do things between the fork() and execve() is why stuff like posix_spawn() never goes anywhere. There's an awful lot of state that gets inherited and you need to be able to manipulate all of it before starting the new process.
Ideally you'd like a way to create a new (empty) process and be able to manipulate its execution state using the standard syscalls without actually forking and then at the last moment kick it off with the new ELF image directly. Probably some smart cookie has designed such an interface, but I don't see it taking off any time soon.
Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
Posted Aug 19, 2023 21:03 UTC (Sat)
by izbyshev (guest, #107996)
[Link]
Technically, all these things are the kernel state, not the libc state, so they can be configured after vfork() via direct syscalls without ever touching libc. But this is rarely a good option for a typical application because there are some footguns with direct syscall usage, as well as with vfork() itself.
> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
This has been discussed: https://lwn.net/Articles/908268. No news after that, I'm afraid.
Posted Aug 20, 2023 3:01 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
For 99% of cases none of that is needed.
> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
Ideally? Create a process in a suspended state, returning its file descriptor, then poke at it with process management functions that accept FDs, and finally let it continue.
Posted Aug 21, 2023 16:33 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
Posted Aug 21, 2023 16:48 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 19, 2023 20:50 UTC (Sat)
by izbyshev (guest, #107996)
[Link] (2 responses)
Last time I checked the situation across the languages was the following:
* musl and modern glibc implement posix_spawn() via vfork().
* OpenJDK has been using vfork() by default since forever.
* Modern .NET (nee .NET Core) switched to vfork() at some point.
* CPython uses vfork() in subprocess on Linux by default since 3.10 (and can use posix_spawn() in some non-default cases since 3.8).
* Go uses vfork()-equivalent clone(CLONE_VM|CLONE_VFORK).
* IIRC Rust uses posix_spawn() with a fork() fallback for cases where they need something not supported by the former.
So it's feasible to avoid fork() nowadays in most cases.
Posted Aug 21, 2023 16:59 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
But then one should be able to disable overcommit on Linux and things will work, since most apps use vfork() and that does not require the kernel to reserve the memory, as the child shares memory with the parent until exec. Or have I missed something?
Posted Aug 21, 2023 22:37 UTC (Mon)
by izbyshev (guest, #107996)
[Link]
While vfork() is used behind the scenes in many languages, I'd be surprised if most C/C++ programs migrated from fork(). And yes, posix_spawn() is suitable in many cases, but it still lacks portable options for some trivial stuff like changing the current directory (glibc and musl have posix_spawn_file_actions_addfchdir_np() extension for that). One could use a simple wrapper executable to tweak the child attributes and then execve() to the real program, but all of this is annoying.
So I'd expect that whether disabling overcommit is fine or not still depends on your set of apps heavily.
And of course, even if fork() had been the main reason to default to overcommit, there is stuff like mmap'ing lots of memory (without MAP_NORESERVE) and then not touching most of it, like those sparse data structures that people mentioned in the comments. Hyrum's law means that after all these years somebody definitely relies on this.
Posted Aug 19, 2023 18:53 UTC (Sat)
by atnot (subscriber, #124910)
[Link]
Posted Aug 21, 2023 17:22 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (2 responses)
Posted Aug 22, 2023 13:36 UTC (Tue)
by intelfx (subscriber, #130118)
[Link] (1 responses)
What are their plusses and minuses? What is the reason one might need to configure _both_?
> Plus when the compression is enabled, it is not configured with good defaults. Like not using lz4 or zstd compression algorithms, not tuning swap look ahead etc.
Could you elaborate?
Posted Aug 22, 2023 16:36 UTC (Tue)
by ibukanov (subscriber, #3942)
[Link]
My anecdotal experience is that when the memory usage is bursty, only occasionally exceeding the total RAM by 20-30% (like when compiling and linking many things in parallel), lz4 performs better than zstd. Otherwise zstd is a better choice, and it's a pity that it is not used by default.
Also note that with a good modern SSD that has a sustained read-write speed of 3 GB/s or more with hardware encryption on, and that does not suffer from sustained-write degradation as in the past, it does not make sense to use compressed memory. Even with lz4 and multiple parallel compression threads, the SSD will be faster and leave the CPU available to do the real job. But if one does not trust hardware encryption, then zram with zstd or lz4 is the way to go.
Posted Aug 31, 2023 17:26 UTC (Thu)
by tuna (guest, #44480)
[Link]
Fedora (and other distros) have put in a lot of work of improving the OOM behavior in desktop mode. Maybe you could try one of those and see if they behave better?
Posted Sep 11, 2023 16:47 UTC (Mon)
by fest3er (guest, #60379)
[Link]
It occurs to me that the problem could be turned on its side. Instead of killing a random process, implement per-process swap space that is in addition to standard swap.
The kernel has access to 'everything', so when there is memory pressure:
Per-process swap space could solve several problems. The whole system wouldn't necessarily slow down because one process is using tremendous amounts of memory and disk caching would be less affected. Huge processes would largely affect only themselves.