Out-of-memory victim selection with BPF
There are numerous ways to go hunting for a process to sacrifice when memory runs out. The process using the most memory is an obvious choice, but that process is often something important: a window-system server or a database manager, for example. So developers have naturally tried, over the years, to enable the kernel to make a better choice; see the LWN kernel index for a picture of how things have evolved over time. In current kernels, this decision comes down to a function called oom_badness() which, after exempting processes that cannot be killed for one reason or another, makes a simple calculation. A process's "OOM score" is essentially the amount of memory it uses, adjusted by that process's oom_score_adj value. By tweaking those knobs, user space can shelter some processes from the OOM killer's depredations while directing its attention toward others.
That, evidently, is not enough control for some users. The BPF patch series from Chuyi Zhou is the latest in a line of attempts to improve that control.
In current kernels, the OOM killer will iterate through all of the possible target processes, call oom_badness() on each, then target the process that is given the highest score. Zhou's patch set allows the oom_badness() check to be replaced with a call to a BPF function, which should be defined as an "fmod_ret" tracing function (meaning it is invoked on return from an internal kernel function and can change that function's return value) with this name and prototype:
int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc);
This function will be called at the beginning of the evaluation of each potential victim and, if it makes a decision on the given task, will cause the normal evaluation to be skipped. The oom_control structure describes the context in which the OOM kill is taking place; the BPF function has access to it but probably (the rules are not actually documented anywhere) should not make changes to it. That function can also look at the task under consideration and make a decision regarding its fate, as reflected in its return value:
- NO_BPF_POLICY: no policy is in effect, so the normal oom_badness() method should be used.
- BPF_EVAL_ABORT: abort the selection process entirely with no process chosen to kill.
- BPF_EVAL_NEXT: move on to the next process, passing over this one.
- BPF_EVAL_SELECT: select this process as the one to kill.
Returning BPF_EVAL_SELECT does not bring an end to the iteration through the list of processes; there will be further calls to bpf_oom_evaluate_task() if there are more processes to examine. As a result, the function can change its mind and return BPF_EVAL_SELECT again if a more appealing victim comes along later in the sequence.
It is possible to use BPF_EVAL_NEXT for some processes while using NO_BPF_POLICY for others. The end result will be to shield some processes from the OOM killer while letting the kernel make a decision in the usual way by looking at the rest. Mixing BPF_EVAL_SELECT and NO_BPF_POLICY looks like it could create surprising results, though; this combination does not appear to be intended and should, unless something changes in a future version, be avoided.
Specifically, the oom_control structure contains a pointer called chosen identifying the currently selected victim, and an integer chosen_points holding its badness score. In the absence of a BPF program, the kernel compares each process's score against chosen_points, and updates both if the new process has a higher score. Returning BPF_EVAL_SELECT sets chosen without setting chosen_points to anything. If NO_BPF_POLICY is returned for a later process, its score will be compared against a chosen_points that has no connection to the process selected earlier.
There are two related hooks provided by the patch set as well. One of them allows the name of the current victim-selection policy to be stored in the kernel; that name will be propagated through to the log when an actual kill is done. To do so, the program should define an fmod_ret function:
void bpf_set_policy_name(struct oom_control *oc);
That function, which will be called at the beginning of the OOM-kill procedure, can then turn around and call:
void set_oom_policy_name(struct oom_control *oc, const char *name, size_t sz);
where oc is the oom_control structure passed to bpf_set_policy_name(), name is the policy name to use, and sz is the length of that name. Names are limited to 16 bytes, including the terminating NUL byte.
There is also a new tracepoint, select_bad_process_end, that fires if the OOM-kill procedure fails to find a process to kill. It is intended to help developers who are working on a new OOM-kill policy.
This series is currently in its second revision. In response to the first posting, memory-management developer Michal Hocko suggested simplifying the interface somewhat. Roman Gushchin, instead, argued for taking a more general approach, where the BPF program is called once at OOM time and is expected to figure out a way to free some memory somewhere. Hocko responded that it would be better to start with "something that is good enough" and add complexity later if it seems warranted. In response to the second revision, though, Alexei Starovoitov also supported a more general callback, and Zhou has started considering the implications of such a change.
Both Hocko and Gushchin expressed worries that introducing BPF into this code, which runs when the system is in a distressed state, could further reduce the stability of an out-of-memory situation. An attempt by a BPF program to allocate a lot of memory in this situation seems likely to end in tears, for example. That is true of any code that hooks into the OOM-killer, though, and is not a problem specific to BPF.
The conversation has shown that there is some interest in the use of BPF to select victims for the out-of-memory killer. Thus far, though, there is not a clear consensus on the approach that this work should take. It would not be surprising, at this point, to see this feature go through some significant changes before it gets closer to the mainline; until then, the kernel will just have to continue choosing processes to sacrifice the old-fashioned way.
| Index entries for this article | |
|---|---|
| Kernel | BPF/Memory management |
| Kernel | Memory management/Out-of-memory handling |
Posted Aug 17, 2023 16:57 UTC (Thu)
by rrolls (subscriber, #151126)
[Link] (44 responses)
The handling of out-of-memory situations is one of the few things I still believe Windows does better than Linux, from my personal user experience; this is across Debian (my current go-to), Arch, CentOS 5-7 and a handful of other distros.
Windows sets up a default amount of swap space, and when RAM is close to exhaustion, programs tend to slow down a bit, but the system remains usable and stable, and all I need to do is realise what has happened, close enough programs to free up memory, and wait a minute or two, and it'll be as good as it was before.
Linux - or at least the distros I've used - also by default sets up an amount of swap space. However, the _moment_ it starts using any of that swap, the system becomes unusably slow - even basic interactions become virtually impossible - even switching to a text console and attempting to log in there takes so long that if it ever completes, whatever runaway process has filled up my RAM has probably eaten up all of swap by then as well. I have used some systems where it was possible to run swapoff to get the system back in working order, but otherwise, the only answer is the physical reset button. As a result, I now have a personal policy of always disabling swap on any Linux installation, resulting in the OOM killer getting invoked immediately instead, which usually still results in me needing to reset the system to recover, but at least this way I get to instantly find out what the problem is, and there's a chance there might be something left to salvage.
Personally, I've never bought the argument that overcommit provides more benefits than drawbacks. To me, the intuitive and obvious approach is that programs should be written in the first place to gracefully handle situations where malloc returns NULL, and failing that, a program should only ever get OOM-killed if that program is the one trying to allocate memory in the first place. One program calling malloc should not cause another program to get killed. (Yes, I understand that programs can cause memory to be allocated by writing to pages and so forth: I'm advocating that malloc, and its various higher-level analogues, should be the one and only point of possible OOM - and that if the system cannot guarantee memory will be available, it should not allocate the requested memory at all.)
I do wonder, overall, if all the effort spent on overcommit and OOM-selection could have been better spent designing programs to work without overcommit and handle failure gracefully. Of course, given that just about everything on Linux is currently written to assume overcommit is a thing, it'd be a mammoth task to change the status quo now, but am I really the only one to question whether, with hindsight, this was the best path to take?
Or, at least, why can't Linux at least do as well as Windows - in my experience - in this one respect?
Posted Aug 17, 2023 18:32 UTC (Thu)
by wsy (subscriber, #121706)
[Link]
You can easily do the same thing in Linux; on Debian, use the package swapspace.
Posted Aug 17, 2023 18:33 UTC (Thu)
by roc (subscriber, #30627)
[Link] (11 responses)
And even when there is something other than just exit that you'd like to do for OOM, it's really hard to implement. You have to do whatever it is without allocating more memory, which is often impossible. When it is possible, you have to thoroughly test these code paths, which is more hard work. It's hardly ever worth it.
Posted Aug 17, 2023 18:36 UTC (Thu)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Aug 17, 2023 21:20 UTC (Thu)
by flussence (guest, #85566)
[Link] (2 responses)
Posted Aug 18, 2023 10:36 UTC (Fri)
by khim (subscriber, #9252)
[Link]
And who would write such desktop for you? Currently the worst offenders which allocate tons of memory they would never use are precisely programs that show desktop, launch other programs and so on.
Posted Aug 18, 2023 23:40 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Aug 18, 2023 10:35 UTC (Fri)
by khim (subscriber, #9252)
[Link] (6 responses)
Try any program which was written before the Windows craze began, or even early Windows apps. If you don't have memory, they would politely inform you of that fact, and then you could close some documents or take some other action. Around the beginning of the XXI century, memory became cheap enough that rationing it became less important and, as usual, something that the user couldn't see started degrading in a hurry. But it's absolutely feasible and possible to do; there is just no way to make that happen in a market economy where people don't know or don't care that a program doesn't handle memory exhaustion gracefully.
Posted Aug 18, 2023 23:50 UTC (Fri)
by roc (subscriber, #30627)
[Link] (5 responses)
Personally I don't hark back to those "good old days". They were also the days when a bug in one application or a power flicker would crash the entire system and cause you to lose all work not manually saved. Modern applications continually save your data so you lose practically nothing if the application dies for any reason --- a much better approach, which also makes specialized OOM handling unnecessary.
Posted Aug 19, 2023 14:49 UTC (Sat)
by khim (subscriber, #9252)
[Link] (4 responses)
Tell that fairy tale to someone else, please. People wouldn't have clamored for phones with gigabytes of RAM if that approach actually worked. The Apple Lisa provided that capability forty years ago, and in Apple's walled garden this even works, to some degree. But still with a lot of glitches. Everywhere else? Nope. It's not hard to set aside some emergency memory which can be used for showing such message boxes. TurboVision and OWL did that and it worked. That's not entirely trivial, but certainly much easier than robustly saving the app state and restoring it after a crash. Saying that a solution to a task which nobody has been able to demonstrate is easier than something that was done routinely in the past is just useless posturing. But as I have said: at some point computers got enough memory for the problem not to become obvious in the two weeks in which you may demand your money back, and thus the incentive to write robust programs which one may actually trust disappeared. In places where it actually matters they are still written and used. History doesn't repeat itself, but it rhymes. I don't think we will ever return to the “good old days”, but after a collapse of our modern infrastructure people would, hopefully, rebuild it in a more robust fashion. Not a 100% bullet-proof one, but, hopefully, one that doesn't require throwing 1000x more resources at a problem than it intrinsically needs.
Posted Aug 20, 2023 5:12 UTC (Sun)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted Aug 20, 2023 10:02 UTC (Sun)
by khim (subscriber, #9252)
[Link] (2 responses)
> Personally I'm pretty satisfied with the Android apps and Web apps I use not losing data.

Let me try. “Reply to this comment”, type something, kill the tab (like the OOM killer would do), try to reload. “Confirm form resubmission” and no matter what you do the comment is lost. Let's try Reddit. “Comment”, kill. Same. Or maybe GitHub… nope. Or do you want to say that they are saving “before doing important work” like Turbo Pascal did 40 years ago, which could save the file before trying to compile and run it? That's not a new trick and is not a replacement for proper handling of memory. And the same with Android apps. Android assumes that it may kill them at any moment and they will just restore their state, somehow, but that's hit-and-miss, and that's why phones with many gigabytes of memory are valuable. That's mostly thanks to modern OSes which properly isolate apps. The apps themselves are not any better, and in data-retention aspects they are even worse than they were 40 years ago. You just can afford to turn on various “auto-save” options because modern SSDs are so much faster than old floppies. But those options have existed for decades and were available way before sloppy memory programming became the norm.
Posted Aug 21, 2023 15:29 UTC (Mon)
by jkingweb (subscriber, #113039)
[Link] (1 responses)
I can't speak to Web apps and closing tabs (and LWN is not a Web app, anyway), but if I type into the comment form and kill my Android browser (Vivaldi) explicitly, the text is retained when I start it back up. If I close the tab and use Vivaldi's "undo" function immediately the text is retained. If I wait for the "undo" prompt to go away and re-open the tab from the list of closed tabs the text is gone (but this is not very surprising).
Posted Aug 21, 2023 15:55 UTC (Mon)
by khim (subscriber, #9252)
[Link]
Yes. Some web apps are preserving content. Some Android apps are preserving it, too. Vivaldi is pretty good there. But, as I have said, auto-save is not a new invention and yet it was always spotty. It's still spotty thus I'm not sure what kind of “progress” is discussed here.
Posted Aug 17, 2023 19:24 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link] (2 responses)
Also, the kernel knows nothing about malloc(3). The kernel only knows about mmap(2) and brk(2).
Posted Aug 17, 2023 19:47 UTC (Thu)
by gfernandes (subscriber, #119910)
[Link] (1 responses)
Posted Aug 20, 2023 17:57 UTC (Sun)
by jmspeex (subscriber, #51639)
[Link]
Posted Aug 17, 2023 21:25 UTC (Thu)
by donald.buczek (subscriber, #112892)
[Link]
Sure. However, a memory allocation might also happen because of automatic stack expansion. Without an explicit allocation call, which could return an error, it is much more difficult for an application to gracefully handle a failure.
Posted Aug 17, 2023 22:16 UTC (Thu)
by WolfWings (subscriber, #56790)
[Link] (2 responses)
While it won't overcommit it will still allow unfaulted pages to be used for page cache AFAIK so there's really no downsides.
vm.overcommit_memory=2
vm.overcommit_ratio=100
Some may bicker about the _ratio value or using _kbytes instead, but =100 works equally regardless of swap space or lack thereof, while anything lower makes a no-swap environment (might as well for VMs) just have inaccessible RAM left on the table.
And if there's some program that refuses to load because it relies on the page-fault handler as a lazy sparse-array implementation, honestly, I wrap that in its own VM at that point.
Posted Aug 18, 2023 12:47 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Aug 22, 2023 10:17 UTC (Tue)
by WolfWings (subscriber, #56790)
[Link]
Even MS starts at "RAM / 8, max 32GB" for their page file size recommended minimum. So beyond 256GB of RAM the baseline swap space doesn't go up any further. And even if you leave it on auto it will upper-limit it to 1/8th of the system volume size, and quite often large servers such as you're describing will have a separate boot volume of say a pair of RAID-1 480GB drives in my experience and the main data on separate volumes, so that puts a hard cap of 60GB virtual memory even if the server has over a terabyte of RAM.
https://learn.microsoft.com/en-us/troubleshoot/windows-cl...
The only time 'full RAM' sizes figure into things is if you expect and want to capture a full memory dump for either crash debugging or uncompressed hibernation. Even Ubuntu recommends (and I quote) round(sqrt(RAM)) so 1TB = 32G of swap... hey there's that number again.
https://help.ubuntu.com/community/SwapFaq
And disabling overcommit doesn't mean an allocation is instantly faulted in and removed from page cache. That's two separate things. Disabling overcommit just prevents the kernel from offering more pages than exist on the system, but doesn't change how those pages are actually managed. Copy-on-write still applies, etc.
Posted Aug 18, 2023 5:59 UTC (Fri)
by timrichardson (subscriber, #72836)
[Link] (3 responses)
Swap is always slow if it is being used to consistently compensate for a lack of RAM; Windows and macOS can't magically make it faster. The drama for desktop users with Linux is the OOM killer. The kernel one was very slow to act, at least before MGLRU; you could face minutes, maybe hours, of an effectively frozen system while the OOM killer waits to see what happens.
In my stress testing, MGLRU radically improves kernel OOM handling for desktop users. user space killers are hard too; the systemd attempt based on memory pressure seems very, very hard to configure properly for the desktop, it either acts too late or kills too early. I gave up on it. I think tweaking systemd-oomd for a good out of the box experience for desktop users is going to be very, very hard.
I found the best option for desktop use, where there is room for a swapfile on low-memory systems, is zswap, dynamic swap via swapspace, no user-space killer, and MGLRU activated. I don't think there is a single distribution which defaults to that combination.
Posted Aug 18, 2023 21:00 UTC (Fri)
by intelfx (subscriber, #130118)
[Link]
I believe Arch does almost that, modulo swapspace:
$ grep -E '(LRU_GEN|ZSWAP)' config
CONFIG_ZSWAP=y
CONFIG_ZSWAP_DEFAULT_ON=y
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_DEFLATE is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZO is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_842 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4 is not set
# CONFIG_ZSWAP_COMPRESSOR_DEFAULT_LZ4HC is not set
CONFIG_ZSWAP_COMPRESSOR_DEFAULT_ZSTD=y
CONFIG_ZSWAP_COMPRESSOR_DEFAULT="zstd"
# CONFIG_ZSWAP_ZPOOL_DEFAULT_ZBUD is not set
# CONFIG_ZSWAP_ZPOOL_DEFAULT_Z3FOLD is not set
CONFIG_ZSWAP_ZPOOL_DEFAULT_ZSMALLOC=y
CONFIG_ZSWAP_ZPOOL_DEFAULT="zsmalloc"
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y
# CONFIG_LRU_GEN_STATS is not set
Posted Aug 21, 2023 15:47 UTC (Mon)
by patrakov (subscriber, #97174)
[Link] (1 responses)
For desktop systems this is true. The reason is that MGLRU protects the working set, and the interactive applications that the user cares about are exactly the working set.
On servers the situation is different. During a disaster, the sysadmin cares about the ability to ssh in, but, during normal operation, nobody is logged in, so sshd is an unused daemon, which gets evicted first and needs to be paged in. In other words, parts of sshd that are used only during the login procedure are not a part of the working set and are not protected by MGLRU. Result: existing connections with e.g. top (if they exist) have good response time, but new connections time out because the system cannot load the swapped-out parts of sshd fast enough.
Posted Aug 21, 2023 20:41 UTC (Mon)
by atnot (subscriber, #124910)
[Link]
> Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked. Above the effective min boundary (or effective low boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
Posted Aug 18, 2023 8:36 UTC (Fri)
by eduperez (guest, #11232)
[Link] (1 responses)
I do not think that the main argument in favor of overcommit are those programs that cannot handle a "no" from malloc; returning a valid pointer, to a memory that cannot be used because there is no more memory available, is just delaying the problem. Besides, that situation should be easy to take into account by the program, in most cases.
The main argument in favor of overcommit are those programs that allocate far more memory than they really need, because (in some use cases) it is easier to allocate a large memory block, and only use some parts of it, rather than allocate multiple smaller memory blocks. With overcommit, the kernel can delay the allocation until the program really uses each block; in many cases, most of that memory will never have to be allocated at all, and can be used by other programs.
In general, OOM kicks in when a program uses a memory block, not during the malloc.
Posted Aug 25, 2023 4:18 UTC (Fri)
by ssmith32 (subscriber, #72404)
[Link]
So, they try to play games to work around the actual memory allocator(s), blissfully ignorant of the fact they've accomplished exactly zero, and the actual memory allocator just effectively ignores their requests, and goes ahead and allocates when it decides is best.
Seems odd that we should worry about that scenario, but I can see why it happens, from both sides. E.g. Java programs working around the GC, which is on top of the userspace allocator, which is on top of the kernel allocator. They'd be better off with not Java, or understanding and configuring the many layers beneath them.
Posted Aug 18, 2023 13:28 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link]
A program allocating all memory it can and just sitting on it can then cause tiny utilities to fail to allocate a reasonably sized request.
There's actually a leak in macOS of Mach ports we're running into on our CI machines that causes process launch to fail, so any kind of remediation without an already-open System Monitor is DOA (e.g., SSH can't launch `zsh` for any kind of administration or even investigation).
I think that some more structured decision making behind which process needs to go is better than "whoever happens to get unlucky". Not that Linux's OOM killer is a *good* implementation, but I'd much rather something other than the processes I'm using to try and wrangle things be eligible for being next in line.
Posted Aug 19, 2023 14:10 UTC (Sat)
by vadim (subscriber, #35271)
[Link] (12 responses)
So for instance, right now I have a firefox process taking 6 GB RAM. Lots of stuff running. If it wanted to fork(), this means the allocation of 6 GB RAM more. If it wants to say, run a tiny commandline tool then the outcome would be that you'd need to satisfy the demand for 12 GB RAM for a few ms, until execve freed the additional 6GB.
Windows doesn't have this issue because it doesn't have fork(). A process just starts from scratch, so this intermediary step of huge process having two copies of itself is never a thing that happens.
So a solution would be to kill off fork(), but it's much easier said than done. You could deal with the execve case easily, but that leaves fork() for multiprocessing. Right now I have 65 firefox processes running, which without overcommit would take about 250 GB RAM. Fixing this would be very difficult because it'd be an entirely new model. You might have to fork earlier, or explicitly mark memory as unshared, or something else along those lines.
Posted Aug 19, 2023 16:02 UTC (Sat)
by mb (subscriber, #50428)
[Link] (10 responses)
There is vfork(), and clone() with all its flags and stuff to avoid a memory COW duplication.
COW is still kind of expensive, because it requires page table duplication. So it's worth avoiding, even without overcommit in mind.
Posted Aug 19, 2023 17:05 UTC (Sat)
by vadim (subscriber, #35271)
[Link] (6 responses)
clone() is Linux specific and thus non-portable, and much more complicated to use than fork(). I'm sure there's stuff that uses it, but I think most things just won't bother without a good reason.
Posted Aug 19, 2023 17:42 UTC (Sat)
by mb (subscriber, #50428)
[Link] (5 responses)
Well, it blocks until the child calls execve(). Which is the only thing the child is supposed to do. That takes a microsecond or so (Plus two context switches).
Posted Aug 19, 2023 20:32 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (4 responses)
Well, processes almost always do other things like close FDs, set up pipes, change permissions, configure signals, configure network/pid/ipc namespaces, etc. The fact that you need to actually do things between the fork() and execve() is why stuff like posix_spawn() never goes anywhere. There's an awful lot of state that gets inherited and you need to be able to manipulate all of it before starting the new process.
Ideally you'd like a way to create a new (empty) process and be able to manipulate its execution state using the standard syscalls without actually forking and then at the last moment kick it off with the new ELF image directly. Probably some smart cookie has designed such an interface, but I don't see it taking off any time soon.
Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
Posted Aug 19, 2023 21:03 UTC (Sat)
by izbyshev (guest, #107996)
[Link]
Technically, all these things are the kernel state, not the libc state, so they can be configured after vfork() via direct syscalls without ever touching libc. But this is rarely a good option for a typical application because there are some footguns with direct syscall usage, as well as with vfork() itself.
> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
This has been discussed: https://lwn.net/Articles/908268. No news after that, I'm afraid.
Posted Aug 20, 2023 3:01 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
For 99% of cases none of that is needed.
> Maybe an execve() with an io_uring-like list of syscalls to execute in the new process? Or via BPF?
Ideally? Create a process in a suspended state, returning its file descriptor, then poke at it with process management functions that accept FDs, and finally let it continue.
Posted Aug 21, 2023 16:33 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
Posted Aug 21, 2023 16:48 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 19, 2023 20:50 UTC (Sat)
by izbyshev (guest, #107996)
[Link] (2 responses)
Last time I checked the situation across the languages was the following:
* musl and modern glibc implement posix_spawn() via vfork().
* OpenJDK has been using vfork() by default since forever.
* Modern .NET (nee .NET Core) switched to vfork() at some point.
* CPython uses vfork() in subprocess on Linux by default since 3.10 (and can use posix_spawn() in some non-default cases since 3.8).
* Go uses vfork()-equivalent clone(CLONE_VM|CLONE_VFORK).
* IIRC Rust uses posix_spawn() with a fork() fallback for cases where they need something not supported by the former.
So it's feasible to avoid fork() nowadays in most cases.
Posted Aug 21, 2023 16:59 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (1 responses)
But then one should be able to disable overcommit on Linux and things will work, since most apps use vfork() and that does not require the kernel to reserve the memory, as the child shares memory with the parent until exec. Or have I missed something?
Posted Aug 21, 2023 22:37 UTC (Mon)
by izbyshev (guest, #107996)
[Link]
While vfork() is used behind the scenes in many languages, I'd be surprised if most C/C++ programs migrated from fork(). And yes, posix_spawn() is suitable in many cases, but it still lacks portable options for some trivial stuff like changing the current directory (glibc and musl have posix_spawn_file_actions_addfchdir_np() extension for that). One could use a simple wrapper executable to tweak the child attributes and then execve() to the real program, but all of this is annoying.
So I'd expect that whether disabling overcommit is fine or not still depends on your set of apps heavily.
And of course, even if fork() had been the main reason to default to overcommit, there is stuff like mmap'ing lots of memory (without MAP_NORESERVE) and then not touching most of it, like those sparse data structures that people mentioned in the comments. Hyrum's law means that after all these years somebody definitely relies on this.
Posted Aug 19, 2023 18:53 UTC (Sat)
by atnot (subscriber, #124910)
[Link]
Posted Aug 21, 2023 17:22 UTC (Mon)
by ibukanov (subscriber, #3942)
[Link] (2 responses)
Posted Aug 22, 2023 13:36 UTC (Tue)
by intelfx (subscriber, #130118)
[Link] (1 responses)
What are their plusses and minuses? What is the reason one might need to configure _both_?
> Plus when the compression is enabled, it is not configured with good defaults. Like not using lz4 or zstd compression algorithms, not tuning swap look ahead etc.
Could you elaborate?
Posted Aug 22, 2023 16:36 UTC (Tue)
by ibukanov (subscriber, #3942)
[Link]
My anecdotal experience is that when the memory usage is bursty, only occasionally exceeding the total RAM by 20-30% (like when compiling and linking many things in parallel), lz4 performs better than zstd. Otherwise zstd is a better choice, and it's a pity that it is not used by default.
Also note that with a good modern SSD that has a sustained read-write speed of 3 GB/s or more with hardware encryption on, and that does not suffer from sustained-write degradation as in the past, it does not make sense to use compressed memory. Even with lz4 and multiple parallel compression threads, the SSD will be faster and leave the CPU available to do the real job. But if one does not trust hardware encryption, then zram with zstd or lz4 is the way to go.
Posted Aug 31, 2023 17:26 UTC (Thu)
by tuna (guest, #44480)
[Link]
Fedora (and other distros) have put in a lot of work of improving the OOM behavior in desktop mode. Maybe you could try one of those and see if they behave better?
Posted Sep 11, 2023 16:47 UTC (Mon)
by fest3er (guest, #60379)
[Link]
It occurs to me that the problem could be turned on its side. Instead of killing a random process, implement per-process swap space that is in addition to standard swap.
The kernel has access to 'everything', so when there is memory pressure:
Per-process swap space could solve several problems. The whole system wouldn't necessarily slow down because one process is using tremendous amounts of memory and disk caching would be less affected. Huge processes would largely affect only themselves.