Blocking userfaultfd() kernel-fault handling
Posted May 9, 2020 1:30 UTC (Sat) by NYKevin (subscriber, #129325)
In reply to: Blocking userfaultfd() kernel-fault handling by Paf
Parent article: Blocking userfaultfd() kernel-fault handling
As far as I can tell, there are only two plausible use cases for this feature:
1. You're doing live migrations of VMs.
2. You can dynamically regenerate paged-out data faster than the OS can page it in.
(1) makes very little sense if you control all of the code in the VM, because it's far easier to just use a container instead of a VM, and start/stop instances as required (with all state living in some kind of database-like-thing, or perhaps a networked filesystem, depending on your needs). Sure, this is slightly more upfront design work, but live migration consumes an incredible amount of bandwidth once you try to scale it up, whereas container orchestration is a mature and well-understood technology. Unless you are making money per VM, it's difficult to justify the cost of live migration.
(Granted, if all of your VMs are very similar to one another, you might be able to develop a clever compression algorithm that shaves a lot of bytes off of that cost, but you're still not going to beat containers on size.)
That leaves (2). What's happening in case (2) is that you're using the page fault mechanism as a substitute for some kind of LRU cache for data that is expensive to compute, but cheaper than actually hitting the disk. But you can build an LRU cache in userspace, and it'll probably be a lot more efficient and easier to tune, since you can design it to exactly fit your specific use case. Trying to rope page faults into that problem makes no logical sense.
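For illustration, a bare-bones version of such a cache might look like the sketch below (C; expensive_compute() is a hypothetical placeholder for whatever regenerates the data, and CAP/BLOCK are arbitrary; a real implementation would use a hash table rather than a linear scan):

    /* Sketch of a userspace LRU cache for expensive-to-compute blocks.
     * expensive_compute() is a hypothetical stand-in for regenerating
     * the data; lookup is a linear scan to keep the sketch short. */
    #include <stdlib.h>
    #include <string.h>

    #define CAP   128              /* max cached entries (arbitrary) */
    #define BLOCK 4096             /* bytes per cached block (arbitrary) */

    struct entry {
        unsigned long key;
        struct entry *prev, *next; /* doubly linked; head = most recent */
        char data[BLOCK];
    };

    static struct entry *head, *tail;
    static int count;

    /* Hypothetical: regenerate the block that 'key' identifies. */
    static void expensive_compute(unsigned long key, char *out)
    {
        memset(out, (int)(key & 0xff), BLOCK);
    }

    static void unlink_entry(struct entry *e)
    {
        if (e->prev) e->prev->next = e->next; else head = e->next;
        if (e->next) e->next->prev = e->prev; else tail = e->prev;
    }

    static void push_front(struct entry *e)
    {
        e->prev = NULL;
        e->next = head;
        if (head) head->prev = e;
        head = e;
        if (!tail) tail = e;
    }

    char *cache_get(unsigned long key)
    {
        for (struct entry *e = head; e; e = e->next)
            if (e->key == key) {        /* hit: mark most recently used */
                unlink_entry(e);
                push_front(e);
                return e->data;
            }

        struct entry *e;
        if (count == CAP) {             /* miss, cache full: evict LRU */
            e = tail;
            unlink_entry(e);
        } else {
            e = malloc(sizeof(*e));
            count++;
        }
        e->key = key;
        expensive_compute(key, e->data); /* regenerate instead of faulting */
        push_front(e);
        return e->data;
    }

The point is that sizing, eviction policy, and the regeneration path are all under the application's control, with no kernel round trip.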
So, in conclusion, I'd tentatively suggest that distros consider turning the whole feature off (for example via the vm.unprivileged_userfaultfd sysctl, which restricts it to privileged processes) and see if anything breaks. Perhaps they should teach their package managers to enable this setting if, and only if, one or more installed packages really need it.
Posted May 9, 2020 1:36 UTC (Sat) by josh (subscriber, #17465)
Posted May 9, 2020 2:00 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
Client workflows often can't be interrupted at will, and even asking clients nicely to reboot their instances (so they can migrate to other hardware nodes) can take months. It's much easier to involuntarily migrate client VMs to different hardware.
Posted May 9, 2020 4:58 UTC (Sat) by wahern (subscriber, #37304)
Posted May 9, 2020 5:02 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
Live migration is very useful to move client software out of a failing node. So really this makes sense only for large cloud providers.
Posted May 9, 2020 7:52 UTC (Sat) by wahern (subscriber, #37304)
Posted May 9, 2020 15:59 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
It actually does behind the scenes with T2 and T3 instances.
Posted May 9, 2020 20:33 UTC (Sat) by NYKevin (subscriber, #129325)
I don't understand how this contradicts anything that I said...
Posted May 13, 2020 8:48 UTC (Wed) by nilsmeyer (guest, #122604)
That is true in a lot of environments, especially when you are dealing with software that manages state. It's easy to say that one can design an application so this isn't necessary (though a lot of the container/cloud-native crowd completely ignores stateful systems), but the reality is very different.
Posted May 9, 2020 5:27 UTC (Sat) by kccqzy (guest, #121854)
Posted May 9, 2020 5:37 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
Posted May 9, 2020 7:59 UTC (Sat) by Sesse (subscriber, #53779)
(de-)compression and view are different layers
Posted May 11, 2020 7:02 UTC (Mon) by gus3 (guest, #61103)
If user space handles compression, the kernel doesn't care about it at all. They aren't related.
Posted May 9, 2020 12:58 UTC (Sat) by roc (subscriber, #30627)
We have a giant omniscient database which lets us reconstruct the memory state of a process at any point in its recorded history. Sometimes we want to execute an application function "as if" the process was at some point in that history. So we create a new process, ptrace it, create mappings in it corresponding to the VMAs that existed at that point in history, and enable userfaultfd() for those mappings. Then we set the registers into the right state for the function call and PTRACE_CONT. Every time the process touches a new page, we reconstruct the contents of that page from our database. Works great.
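In outline, the mechanism roc describes looks something like the single-process toy below (a sketch, not roc's actual setup: there the ranges are registered in the ptraced child and the faults are serviced from the tracer, while here one thread services a fault taken by another; reconstruct_page() is a hypothetical stand-in for the history database):

    /* Toy userfaultfd page server: register a region for
     * missing-page faults, then satisfy the first fault with
     * reconstructed contents via UFFDIO_COPY. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Hypothetical: rebuild the page at 'addr' as of some point in
     * recorded history. Here it just fills a pattern. */
    static void reconstruct_page(char *dst, unsigned long addr, long psz)
    {
        memset(dst, 0x5a, psz);
    }

    static void *touch(void *p)   /* faults on the registered region */
    {
        printf("read byte: %#x\n", *(volatile unsigned char *)p);
        return NULL;
    }

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        char *region = mmap(NULL, 16 * psz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)region, .len = 16 * psz },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        pthread_t t;
        pthread_create(&t, NULL, touch, region);

        struct uffd_msg msg;
        read(uffd, &msg, sizeof(msg));       /* blocks until the fault */
        if (msg.event == UFFD_EVENT_PAGEFAULT) {
            unsigned long addr = msg.arg.pagefault.address & ~(psz - 1);
            char *buf = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            reconstruct_page(buf, addr, psz);
            struct uffdio_copy copy = {
                .dst = addr, .src = (unsigned long)buf, .len = psz,
            };
            ioctl(uffd, UFFDIO_COPY, &copy); /* installs page, wakes thread */
        }
        pthread_join(t, NULL);
        return 0;
    }

Compile with -pthread. Whether an unprivileged process may create the descriptor at all is governed by the vm.unprivileged_userfaultfd sysctl.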
Posted May 9, 2020 13:00 UTC (Sat) by roc (subscriber, #30627)
Posted May 17, 2020 8:54 UTC (Sun) by smooth1x (guest, #25322)
Posted Jun 17, 2020 0:48 UTC (Wed) by tobin_baker (guest, #139557)