New system calls for memory management
process_vm_mmap()
There are many use cases for quickly moving data from one process to another; message-passing applications are one example, but far from the only one. Since the 3.2 development cycle, there has been a pair of specialized, little-known system calls intended for this purpose:
    ssize_t process_vm_readv(pid_t pid,
                             const struct iovec *lvec, unsigned long liovcnt,
                             const struct iovec *rvec, unsigned long riovcnt,
                             unsigned long flags);
    ssize_t process_vm_writev(pid_t pid,
                              const struct iovec *lvec, unsigned long liovcnt,
                              const struct iovec *rvec, unsigned long riovcnt,
                              unsigned long flags);
Both calls copy data between the local address space (as described by the lvec array) and the remote space (described by rvec); they do so without moving the data through kernel space. For certain kinds of traffic they are quite efficient, but there are exceptions, especially as the amount of copied data gets large.
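As a concrete (if contrived) illustration of the calling convention, the sketch below reads one page from another process's address space; it is not from the patch discussion. The target PID and address come from the command line, and the caller is assumed to have ptrace-level access to the target (same UID or CAP_SYS_PTRACE).

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <pid> <address>\n", argv[0]);
            return 1;
        }

        pid_t pid = (pid_t)atol(argv[1]);
        void *remote_addr = (void *)strtoul(argv[2], NULL, 0);
        char buf[4096];

        /* One local segment and one remote segment; flags must be zero. */
        struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

        ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
        if (n < 0) {
            perror("process_vm_readv");
            return 1;
        }
        printf("copied %zd bytes\n", n);
        return 0;
    }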
The cover letter for a patch set from Kirill Tkhai describes the problems some have encountered with these system calls: they have to actually pass over and access all of the data while copying it. If the data of interest happens to be swapped out, it will be brought back into RAM. The same is true on the destination side; additionally, if the destination does not have pages allocated in the given address range, more memory will have to be allocated to hold the copy. Then, all of the data passes through the CPU's caches, evicting the (presumably more useful) data that was already there.
Tkhai's solution is to introduce a new system call that avoids the copying:
    int process_vm_mmap(pid_t pid, unsigned long src_addr, unsigned long len,
                        unsigned long dst_addr, unsigned long flags);
This call is much like mmap(), in that it creates a new memory mapping in the calling process's address space; that mapping (possibly) starts at dst_addr and is len bytes long. It will be populated by the contents of the memory range starting at src_addr in the process identified by pid. There are a couple of flags defined: PVMMAP_FIXED to specify an exact address for the mapping and PVMMAP_FIXED_NOREPLACE to prevent a fixed mapping from replacing an existing mapping at the destination address.
The end result of the call looks much like what would happen with process_vm_readv(), but with a significant difference. Rather than copying the data into new pages, this system call copies the source process's page-table entries, essentially creating a shared mapping of the data. Avoiding the need to copy the data and possibly allocate new memory for it speeds things considerably; this call will also avoid swapping in memory that has been pushed out of RAM.
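Since this call exists only as an unmerged proposal, it has no syscall number and no C-library wrapper; the sketch below is purely hypothetical, wrapping the proposed interface so that its intended use — mapping another process's buffer without copying it — is concrete.

    /* Hypothetical sketch: process_vm_mmap() was never merged, so the
       syscall number below is a placeholder (syscall() will simply fail
       with ENOSYS) and the wrapper is invented for illustration. */
    #include <sys/types.h>
    #include <unistd.h>

    #define __NR_process_vm_mmap (-1L)  /* placeholder; never assigned */

    static void *process_vm_mmap(pid_t pid, unsigned long src_addr,
                                 unsigned long len, unsigned long dst_addr,
                                 unsigned long flags)
    {
        return (void *)syscall(__NR_process_vm_mmap, pid, src_addr, len,
                               dst_addr, flags);
    }

    /*
     * A consumer could then obtain a zero-copy view of a producer's
     * buffer; the kernel would duplicate page-table entries rather
     * than the data itself:
     *
     *     void *view = process_vm_mmap(producer_pid, producer_buf_addr,
     *                                  buf_len, 0, 0);
     */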
The response to this patch set has been guarded at best. Andy Lutomirski didn't think the new system call would help to solve the real problems and called the API "quite dangerous and complex". Some of his concerns were addressed in the following conversation, but he is still unconvinced that the problem can't be solved with splice(). Kirill Shutemov worried that this functionality might not play well with the kernel's reverse-mapping code and that it would "introduce hard-to-debug bugs". This discussion is still ongoing; process_vm_mmap() might eventually find its way into the kernel, but there will need to be a lot of questions answered first.
Remote madvise()
There are times when one process would like to call madvise() to change the kernel's handling of another process's memory. In the case described by Oleksandr Natalenko, it is desirable to get a process to use kernel same-page merging (KSM) to improve memory utilization. KSM is an opt-in feature that is requested with madvise(); if the process in question doesn't happen to make that call, there is no easy way to cause it to happen externally.
Natalenko's solution is to add a new file (called madvise) to each process's /proc directory. Writing merge to that file will have the same effect as an madvise(MADV_MERGEABLE) call covering the entire process address space; writing unmerge will turn off KSM. Possible future enhancements include the ability to affect only a portion of the target's address space and supporting other madvise() operations.
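For illustration, a write to that file might look like the sketch below; the madvise file exists only on kernels with Natalenko's (unmerged) patch applied, and the ksm_merge() helper is invented here.

    /* Sketch of the proposed interface: writing "merge" to
       /proc/<pid>/madvise requests MADV_MERGEABLE across the target's
       whole address space; writing "unmerge" undoes it. */
    #include <stdio.h>
    #include <sys/types.h>

    static int ksm_merge(pid_t pid)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/madvise", (int)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fputs("merge", f);
        return fclose(f);
    }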
The reaction to this patch set has not been entirely enthusiastic either. Alexey Dobriyan would rather see a new system call added for this purpose. Michal Hocko agreed, suggesting that the "remote madvise()" idea briefly discussed at this year's Linux Storage, Filesystem, and Memory-Management Summit might be a better path to pursue.
process_madvise()
As it happens, Minchan Kim has come along with an implementation of the remote madvise() idea. This patch set introduces a system call that looks like this:
int process_madvise(int pidfd, void *addr, size_t length, int advice);
The result of this call is as if the process identified by pidfd (which is a pidfd file descriptor, rather than a process ID) called madvise() on the memory range identified by addr and length with the given advice. This API is relatively straightforward and easy to understand; it also only survived until the next patch in the series, which rather complicates things:
    struct pr_madvise_param {
        int size;
        const struct iovec *vec;
    };

    int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
                        struct pr_madvise_param *results,
                        struct pr_madvise_param *ranges,
                        unsigned long flags);
The purpose of this change was to allow a single process_madvise() call to make changes to many parts of the target process's address space. In particular, the behavior, results, and ranges arrays are each nr_elem elements long. For each entry, behavior is the set of madvise() flags to apply, ranges is a set of memory ranges held in the vec array, and results is an array of destinations for the results of the call on each range.
The patch set also adds a couple of new madvise() operations. MADV_COOL would cause the indicated pages to be moved to the head of the inactive list, causing them to be reclaimed in the near future (and, in particular, ahead of any pages still on the active list) if the system is under memory pressure. MADV_COLD, instead, moves the pages to the tail of the inactive list, possibly causing them to be reclaimed immediately. Both of these features, evidently, are something that the Android runtime system could benefit from.
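To make the vectored API concrete, the sketch below applies MADV_COLD (as proposed in this series) to a single range of a target process. Everything here — the syscall number, the one_cold_range() helper, and the MADV_COLD value — is an assumption for illustration, since none of it is in a mainline kernel.

    /* Hypothetical sketch of the proposed vectored process_madvise().
       The syscall number is a placeholder (the call will fail with
       ENOSYS), and MADV_COLD carries this series' proposed semantics. */
    #include <sys/mman.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef MADV_COLD
    #define MADV_COLD 20            /* assumed value, for illustration */
    #endif

    struct pr_madvise_param {
        int size;                   /* number of entries in vec */
        const struct iovec *vec;
    };

    static long one_cold_range(int pidfd, void *addr, size_t len)
    {
        long ret_val;
        int behavior = MADV_COLD;   /* one behavior per element */

        struct iovec range  = { .iov_base = addr,     .iov_len = len };
        struct iovec result = { .iov_base = &ret_val, .iov_len = sizeof(ret_val) };

        struct pr_madvise_param ranges  = { .size = 1, .vec = &range };
        struct pr_madvise_param results = { .size = 1, .vec = &result };

        return syscall(-1L /* __NR_process_madvise: never assigned */,
                       pidfd, (ssize_t)1, &behavior, &results, &ranges, 0UL);
    }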
The reaction to this proposal was warmer; when most of the comments are related to naming, chances are that the more fundamental issues have been taken care of. Christian Brauner, who has done most of the pidfd work, requested that any system call using pidfds start with "pidfd_"; he would thus like this new call to be named pidfd_madvise(). That opinion is not universally shared, though, so it's not clear that the name will actually change. There were more substantive objections to MADV_COOL and MADV_COLD, but less consensus on what the new names should be.
Hocko questioned the need for the multi-operation API, noting that madvise() operations are not normally expected (or needed) to be fast. Kim said he would come back with benchmark numbers to justify that API in a future posting.
Of the three interfaces described here, process_madvise() (or
whatever it ends up being named) seems like the most likely to proceed.
There appears to be a clear need for the ability to have one process change
how another process's memory is handled. All that is left is to hammer out
the details of how it should actually work.
Posted May 24, 2019 14:56 UTC (Fri) by hyc (guest, #124633) [Link]
Posted May 24, 2019 16:10 UTC (Fri) by dullfire (guest, #111432) [Link] (4 responses)
Additionally: if programs are going to use these new system calls, they'd have to be written to do so. It would probably be reasonable to make libc use memfd+mmap to allocate memory, as opposed to sbrk/anonymous mmap/(whatever else it might be using), which would fairly trivially provide the desired functionality.
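A minimal sketch of the allocation scheme this comment describes, assuming memfd_create() is available (Linux 3.17+, glibc 2.27+); the memfd_alloc() helper is invented here:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Back an allocation with a memfd instead of anonymous memory, so
       the fd can later be handed to another process, which can mmap()
       the same pages rather than copying the data. */
    static void *memfd_alloc(size_t len, int *fd_out)
    {
        int fd = memfd_create("heap-arena", MFD_CLOEXEC);
        if (fd < 0 || ftruncate(fd, (off_t)len) < 0)
            return MAP_FAILED;

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        *fd_out = fd;
        return p;
    }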
Posted May 24, 2019 21:48 UTC (Fri) by roc (subscriber, #30627) [Link] (3 responses)
Sharing a memfd only works if the target process arranges in advance to use memfds for all relevant memory, and it doesn't cover important cases like if the target process has mapped a file and you want access to that.
Posted May 25, 2019 0:23 UTC (Sat) by dullfire (guest, #111432) [Link] (2 responses)
Posted May 25, 2019 13:39 UTC (Sat) by scientes (guest, #83068) [Link] (1 responses)
How about adding a flag so that memfd_create can be supplied additional arguments that provide it with starting data (that will be COW). I think that would fit the VM migration case quite well.
Posted May 25, 2019 22:50 UTC (Sat) by dullfire (guest, #111432) [Link]
Posted May 24, 2019 16:54 UTC (Fri) by josh (subscriber, #17465) [Link] (1 responses)
Meanwhile, process_vm_mmap looks like a great way to set up shared memory.
Posted May 24, 2019 21:57 UTC (Fri) by roc (subscriber, #30627) [Link]
When rr checkpoints a set of processes via fork() we currently need to explicitly copy all MAP_SHARED regions so that the checkpoint is disconnected from future changes. It would be great if instead we could make a copy-on-write duplicate of those regions.
Posted Jun 9, 2019 11:29 UTC (Sun) by felix.s (guest, #104710) [Link]
My use case is a userspace DPMI host that keeps its guest process in a ptrace() sandbox to prevent it from issuing native system calls on its own. I figured I'd sometimes have to manipulate the guest's address space to respond to memory allocation requests, or to inject ‘gadgets’ that would safely invoke syscalls that would be otherwise blocked.
(I know, KVM would be better, but it's not always available.)