"User-space data may not be resident in RAM"

Posted Jul 8, 2020 6:34 UTC (Wed) by epa (subscriber, #39769)
Parent article: Sleepable BPF programs

I note that the most recent Fedora release no longer creates a swap partition or swap file by default. Instead there is swapping to zram. But perhaps we are nearing the day when even that won't be necessary and systems won't have swap any more. If you also got rid of demand paging for executables, and assuming for the moment that memory-mapped files need not be handled, how much kernel could be simplified out once you know that all user-space data is always in physical RAM?

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 12:32 UTC (Wed) by k3ninho (subscriber, #50375) [Link] (1 responses)

>...how much kernel could be simplified out once you know that all user-space data is always in physical RAM?
You're still going to be doing LRU eviction of files in the VFS cache, though maybe that would go away once you've got enough spare RAM to pin a whole mount point into the cache -- operating (e.g. root and home partitions) from RAM, at the cost of heavy write traffic to keep the disk in sync via fsync().

K3n.

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 12:54 UTC (Wed) by epa (subscriber, #39769) [Link]

Unless I've misunderstood, the VFS cache is not user-space data. I meant if you can assume that all of a user process's address space (apart from memory-mapped files) is in RAM and so there is no possibility of having to wait for it to be paged in.

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 13:59 UTC (Wed) by corbet (editor, #1) [Link] (15 responses)

That would be so hugely wasteful of RAM that I have a hard time seeing it ever happening. Programs have a tendency to never access large parts of their address spaces. And that's before you get into other tricks, like copy-on-write or memory shared with devices, that can cause faults. You'll be waiting a long time for that particular simplification, I think.
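To put a rough number on the gap between address space and resident memory, here is a minimal sketch, assuming Linux (it reads /proc/self/status) and Python 3. The function name is made up for this example. It compares the size of the process's virtual address space (VmSize) with the part currently held in physical RAM (VmRSS).

```python
def address_space_vs_resident(status_path="/proc/self/status"):
    """Return (VmSize, VmRSS) in KiB for the current process (Linux only)."""
    fields = {}
    with open(status_path) as status:
        for line in status:
            # Lines look like "VmRSS:      12345 kB".
            key, _, value = line.partition(":")
            if key in ("VmSize", "VmRSS"):
                fields[key] = int(value.split()[0])
    return fields["VmSize"], fields["VmRSS"]


if __name__ == "__main__":
    vm_size, vm_rss = address_space_vs_resident()
    print(f"address space: {vm_size} KiB, resident: {vm_rss} KiB")
```

On a typical system the resident figure is noticeably smaller than the address-space figure, which is the memory that would have to be backed by real RAM up front if nothing could ever fault.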

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 20:38 UTC (Wed) by epa (subscriber, #39769) [Link] (4 responses)

If you apply the rule of thumb that swap space should equal twice the physical RAM(*), then you only need triple the amount of RAM to eliminate swap altogether. That could happen in a year or two depending on how well Moore's Law applies to RAM production. Of course, software becomes more bloated at the same time, which is why we still have swap space, but I think the improvement in machine size is slowly outstripping the rate at which programmers can add bloat.

As far as I know, Android systems don't have swap space. I would expect most embedded Linux systems to do without it also. A spinning disk is too slow, and swapping to flash memory wears it out. Meanwhile supercomputers will normally run a specialized application designed to fit into physical RAM. It's the medium-sized systems that still use swap, as a kind of last resort before breaking out the OOM killer. I think this middle space is likely to shrink in the years to come.

I take your point that programs may allocate a huge address space and then use little of it. That's a slightly different issue from swap: no 64-bit system has a swap file big enough for the whole virtual address space that some process might make, and you can have huge (but mostly empty) address spaces with no swap at all. But it would invalidate the property that access to user memory can never block. If the page had been allocated but not touched, a page fault would still be needed on first use.
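A minimal sketch of that last point, assuming a Unix-like system and Python 3 (the function name is invented for the example): it maps anonymous memory, touches each page once, and reports how many minor page faults the process took while doing so, as counted by getrusage(). No swap is involved anywhere. The exact count depends on the kernel and on settings such as transparent huge pages, so no particular number is promised.

```python
import mmap
import resource


def first_touch_faults(num_pages=1024):
    """Touch freshly mapped pages; return (minor faults taken, pages written)."""
    page = mmap.PAGESIZE
    size = num_pages * page
    # Private anonymous mapping: address space is reserved, but pages are
    # normally not backed by physical memory until first use.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        for offset in range(0, size, page):
            region[offset] = 1  # first write to each page
        after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        written = sum(region[offset] for offset in range(0, size, page))
    finally:
        region.close()
    return after - before, written


if __name__ == "__main__":
    faults, written = first_touch_faults()
    print(f"wrote {written} pages, took {faults} minor faults")
```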

(*) I know this rule came from various old systems and Unix variants in simpler times. I use it as a rough example of a fairly generous allocation of swap space, not because I am advocating it for current systems. My point is that even if swap is much bigger than RAM, that's still less than an order of magnitude to overcome.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 8:14 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

My Android devices do have swap - it's backed by a zram compressed RAM device, but it's swap nonetheless.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 11:03 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Thanks for the correction. So Fedora is now catching up to Android in this regard.

I wonder whether for Android devices an old-fashioned task swapper might work better than demand paging, with an app that you've switched away from (and which does not run in the background) being swapped out to zram and swapped back in when you open it.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 11:29 UTC (Thu) by farnz (subscriber, #17727) [Link]

Android effectively has that, by requiring all apps to be capable of writing their own state to storage and exiting while backgrounded. However, swap helps because it allows more processes to sit idle in memory, ready for an on-demand "instant" switch instead of a slightly slower restore from storage.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 9:40 UTC (Thu) by gray_-_wolf (subscriber, #131074) [Link]

> eliminate swap altogether.

I like to be able to hibernate my machine though...

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 0:50 UTC (Thu) by Fowl (subscriber, #65667) [Link] (9 responses)

Maybe in a world without fork()...

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 6:28 UTC (Thu) by epa (subscriber, #39769) [Link] (8 responses)

Yes, the fork-exec idiom is a longstanding pain point for conservative memory management. If only posix_spawn() weren't so kitchen-sink-looking compared with the classical elegance of separate fork and exec primitives.

As for ordinary fork() where you won't immediately exec() afterwards, that might not spoil things, as it's only *writing* to user-space memory that can cause a copy-on-write fault, not reading it. And a system call to read bytes from a file into some memory may as well reserve the physical space in advance, rather than allocating it only once data comes back from the disk.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 6:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> As for ordinary fork() where you won't immediately exec() afterwards, that might not spoil things, as it's only *writing* to user-space memory that can cause a copy-on-write fault, not reading it.
Unless you're writing from other threads...

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 8:55 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

You've reached the limits of my knowledge. Could you explain how writing from other threads would cause a copy-on-write page fault to happen when a page is *read* rather than when it is written?

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 13:28 UTC (Thu) by Fowl (subscriber, #65667) [Link]

The fault wouldn't occur on a read, that's true, but even if the child process doesn't write, the parent is still running with all its threads, which will likely write eventually, and those can then unexpectedly fault.

So you'd have to not write in *both* the parent and the child until the child execs.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 17:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

The child and parent both become CoW, and fork() does not stop any other threads in the parent. These threads can very well keep on writing to the newly shared memory.
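A rough sketch of this, assuming a Unix-like system and Python 3 (names are invented for the example): the parent makes some pages resident, forks a child that only reads, and then rewrites the same pages. The function returns the minor-fault counts for the parent's writes before and after the fork, plus what each process finally sees in the first page. The fault counts are environment-dependent, so treat them as something to observe rather than a guaranteed result.

```python
import mmap
import os
import resource


def parent_write_faults_after_fork(num_pages=256):
    """Return (faults_before_fork, faults_after_fork, parent_view, child_view)."""
    page = mmap.PAGESIZE
    size = num_pages * page
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    def write_all(value):
        # Write one byte per page and count minor faults taken meanwhile.
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        for offset in range(0, size, page):
            region[offset] = value
        return resource.getrusage(resource.RUSAGE_SELF).ru_minflt - before

    write_all(1)                      # make every page resident
    faults_before_fork = write_all(2)  # rewriting resident pages

    release_r, release_w = os.pipe()
    result_r, result_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: never writes to the region, only reads it once the
        # parent has finished its post-fork writes.
        try:
            os.close(release_w)
            os.close(result_r)
            os.read(release_r, 1)
            os.write(result_w, bytes([region[0]]))
        finally:
            os._exit(0)

    os.close(release_r)
    os.close(result_w)
    faults_after_fork = write_all(3)  # parent writes to pages shared CoW
    os.write(release_w, b"x")
    os.close(release_w)
    child_view = os.read(result_r, 1)[0]
    os.close(result_r)
    os.waitpid(pid, 0)
    parent_view = region[0]
    region.close()
    return faults_before_fork, faults_after_fork, parent_view, child_view


if __name__ == "__main__":
    print(parent_write_faults_after_fork())
```

The child keeps seeing the data as it was at fork() time while the parent sees its own later writes, which is only possible because the parent's writes were given private copies of the pages.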

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 13:13 UTC (Thu) by Fowl (subscriber, #65667) [Link] (2 responses)

Ha, that's an interesting idea - fork_nocow(). Exist in the parent's address space until exec().

Hey, clone(2) already exists! A trampoline to spill all your registers to memory and you're done. Except for having to be careful not to leak any memory, as the memory would leak in the parent too, it's probably a friendlier environment than using regular fork - all your threads are still there. Oh and locks, who knows what's going on with that.

Oh and how does thread-local storage work? Can you move it between threads? Is it even common? Does it depend on the thread ID? Is it possible to fake the TID? I don't know.

Hmm and how do you know when you can free the stack you've just allocated for this new out-of-process-but-same-address-space thread you've just started? And I bet python and other runtime-y things are not going to be happy with two threads that think they're the same. Maybe block the parent until the child execs? I'm going to wave my magic wand and proclaim that if you did this you could then share the same stack on both sides of this new fork_nocow.

Anyway, I had fun spending 15 minutes thinking about it.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 13:20 UTC (Thu) by Fowl (subscriber, #65667) [Link] (1 responses)

Oh I'm an idiot. fork has to return - of course you can't share a stack. Copying/COWing just the stack would mess up all those RAII and scope based resource cleanup things. More likely to work with a callback, but then you've thrown away the almost-compatibility with existing code that was the whole point.

Still, fun.

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 17:06 UTC (Thu) by JGR (subscriber, #93631) [Link]

Your fork_nocow() is probably closest to vfork(2), which is approximately the same as forking but without a new stack or any CoW. However vfork is only really useful for the child to immediately call exec().

"User-space data may not be resident in RAM"

Posted Jul 9, 2020 21:39 UTC (Thu) by ibukanov (subscriber, #3942) [Link]

posix_spawn is a kitchen-sink because there is no way to create a suspended process, apply all necessary settings/syscalls to it and start it. Hence the need to communicate a lot of possible settings through a single system call.
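As an illustration of settings being passed through the one call, here is a small sketch using Python's os.posix_spawn() wrapper (Python 3.8+, Unix only). The helper name and the child command are made up for the example. The file descriptor changes that a fork/exec program would make in the child between the two calls are instead described up front as a list of file actions.

```python
import os
import sys


def spawn_and_capture(code="print('hello from child')"):
    """Run a Python one-liner via posix_spawn; return (stdout bytes, exit code)."""
    read_fd, write_fd = os.pipe()
    # Everything the child should do to its descriptors before running
    # has to be described here, since there is no "between fork and exec".
    file_actions = [
        (os.POSIX_SPAWN_DUP2, write_fd, 1),  # child's stdout -> pipe
        (os.POSIX_SPAWN_CLOSE, read_fd),
        (os.POSIX_SPAWN_CLOSE, write_fd),
    ]
    pid = os.posix_spawn(
        sys.executable,
        [sys.executable, "-c", code],
        os.environ,
        file_actions=file_actions,
    )
    os.close(write_fd)  # parent keeps only the read end
    with os.fdopen(read_fd, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return output, exit_code


if __name__ == "__main__":
    print(spawn_and_capture())
```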

"User-space data may not be resident in RAM"

Posted Jul 8, 2020 17:24 UTC (Wed) by zlynx (guest, #2285) [Link]

I think that might happen in the distant future when all storage is NVRAM. If your entire storage system *IS* RAM already, there would be no point in paging. The CPU would be handling it in hardware via level 3 or level 4 cache.