
New system calls for memory management

By Jonathan Corbet
May 24, 2019
Several new system calls have been proposed for addition to the kernel in a near-future release. A few of those, in particular, focus on memory-management tasks. Read on for a look at process_vm_mmap() (for zero-copy data transfer between processes), and two new APIs for advising the kernel about memory use in a different process.

process_vm_mmap()

There are many use cases for quickly moving data from one process to another; message-passing applications are one example, but far from the only one. Since the 3.2 development cycle, there has been a pair of specialized, little-known system calls intended for this purpose:

    ssize_t process_vm_readv(pid_t pid, const struct iovec *lvec,
                             unsigned long liovcnt, const struct iovec *rvec,
                             unsigned long riovcnt, unsigned long flags);

    ssize_t process_vm_writev(pid_t pid, const struct iovec *lvec,
                              unsigned long liovcnt, const struct iovec *rvec,
                              unsigned long riovcnt, unsigned long flags);

Both calls copy data between the local address space (as described by the lvec array) and the remote space (described by rvec); they do so without moving the data through kernel space. For certain kinds of traffic they are quite efficient, but there are exceptions, especially as the amount of copied data gets large.
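
As a quick illustration of how the existing calls are used, here is a minimal sketch (the helper name and error handling are mine, not from the kernel documentation) that copies a range of another process's memory into a local buffer:

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /*
     * Copy "len" bytes starting at "remote_addr" in process "pid" into the
     * local buffer "buf".  The caller needs ptrace-style permission over the
     * target (same UID or CAP_SYS_PTRACE); the return value is the number of
     * bytes copied, or -1 with errno set.
     */
    static ssize_t read_remote(pid_t pid, void *remote_addr, void *buf, size_t len)
    {
        struct iovec local  = { .iov_base = buf,         .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

        /* One local segment, one remote segment; flags must be zero. */
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }

Multiple iovec entries can be supplied on either side to scatter or gather the data, which is where these calls earn their keep in message-passing workloads.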

The cover letter for a patch set from Kirill Tkhai describes the problems some have encountered with these system calls: they must actually pass over and access all of the data while copying it. If the data of interest happens to be swapped out, it will be brought back into RAM. The same is true on the destination side; additionally, if the destination does not have pages allocated in the given address range, more memory will have to be allocated to hold the copy. Then, all of the data passes through the CPU's caches, wiping out the (presumably more useful) data already there. This leads to problems like the one described in the cover letter:

We observe similar problem during online migration of big enough containers, when after doubling of container's size, the time increases 100 times. The system resides under high IO and throwing out of useful caches.

Tkhai's solution is to introduce a new system call that avoids the copying:

    int process_vm_mmap(pid_t pid, unsigned long src_addr, unsigned long len,
			unsigned long dst_addr, unsigned long flags);

This call is much like mmap(), in that it creates a new memory mapping in the calling process's address space; that mapping (possibly) starts at dst_addr and is len bytes long. It will be populated by the contents of the memory range starting at src_addr in the process identified by pid. There are a couple of flags defined: PVMMAP_FIXED to specify an exact address for the mapping and PVMMAP_FIXED_NOREPLACE to prevent a fixed mapping from replacing an existing mapping at the destination address.

The end result of the call looks much like what would happen with process_vm_readv(), but with a significant difference. Rather than copying the data into new pages, this system call copies the source process's page-table entries, essentially creating a shared mapping of the data. Avoiding the need to copy the data and possibly allocate new memory for it speeds things considerably; this call will also avoid swapping in memory that has been pushed out of RAM.
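
Since this call exists only in the proposed patch set, there is no glibc wrapper or assigned system-call number; the sketch below is therefore purely illustrative (the __NR_process_vm_mmap value and the flag constants are placeholders I have made up, not values from the patches), but it shows the shape of a call that duplicates a remote range into the caller's address space:

    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical: the patch set never had a syscall number assigned. */
    #define __NR_process_vm_mmap   1000  /* placeholder */
    #define PVMMAP_FIXED           0x01  /* placeholder value */
    #define PVMMAP_FIXED_NOREPLACE 0x02  /* placeholder value */

    /*
     * Map "len" bytes of process "pid", starting at "src_addr", into the
     * calling process at a kernel-chosen address (dst_addr and flags left
     * at zero).  The return-value convention here is a guess.
     */
    static long map_remote(pid_t pid, unsigned long src_addr, unsigned long len)
    {
        return syscall(__NR_process_vm_mmap, pid, src_addr, len,
                       0UL /* dst_addr */, 0UL /* flags */);
    }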

The response to this patch set has been guarded at best. Andy Lutomirski didn't think the new system call would help to solve the real problems and called the API "quite dangerous and complex". Some of his concerns were addressed in the following conversation, but he is still unconvinced that the problem can't be solved with splice(). Kirill Shutemov worried that this functionality might not play well with the kernel's reverse-mapping code and that it would "introduce hard-to-debug bugs". This discussion is still ongoing; process_vm_mmap() might eventually find its way into the kernel, but there will need to be a lot of questions answered first.

Remote madvise()

There are times when one process would like to call madvise() to change the kernel's handling of another process's memory. In the case described by Oleksandr Natalenko, it is desirable to get a process to use kernel same-page merging (KSM) to improve memory utilization. KSM is an opt-in feature that is requested with madvise(); if the process in question doesn't happen to make that call, there is no easy way to cause it to happen externally.

Natalenko's solution is to add a new file (called madvise) to each process's /proc directory. Writing merge to that file will have the same effect as an madvise(MADV_MERGEABLE) call covering the entire process address space; writing unmerge will turn off KSM. Possible future enhancements include the ability to affect only a portion of the target's address space and supporting other madvise() operations.
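
Were this patch applied, driving it from another process would be simple; the sketch below is a hypothetical helper (only the file name and the "merge" keyword come from the patch description) that writes to the proposed file:

    #include <stdio.h>
    #include <sys/types.h>

    /*
     * Ask the kernel, via the proposed /proc/<pid>/madvise file, to mark all
     * of the target process's memory as KSM-mergeable.  Returns zero on
     * success, -1 on failure.
     */
    static int request_merge(pid_t pid)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/madvise", (int)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;
        if (fputs("merge", f) == EOF) {
            fclose(f);
            return -1;
        }
        return fclose(f) ? -1 : 0;
    }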

The reaction to this patch set has not been entirely enthusiastic either. Alexey Dobriyan would rather see a new system call added for this purpose. Michal Hocko agreed, suggesting that the "remote madvise()" idea briefly discussed at this year's Linux Storage, Filesystem, and Memory-Management Summit might be a better path to pursue.

process_madvise()

As it happens, Minchan Kim has come along with an implementation of the remote madvise() idea. This patch set introduces a system call that looks like this:

    int process_madvise(int pidfd, void *addr, size_t length, int advice);

The result of this call is as if the process identified by pidfd (which is a pidfd file descriptor, rather than a process ID) called madvise() on the memory range identified by addr and length with the given advice. This API is relatively straightforward and easy to understand; it also only survived until the next patch in the series, which rather complicates things:

    struct pr_madvise_param {
        int size;
        const struct iovec *vec;
    };

    int process_madvise(int pidfd, ssize_t nr_elem,
                        int *behavior,
                        struct pr_madvise_param *results,
                        struct pr_madvise_param *ranges,
                        unsigned long flags);

The purpose of this change was to allow a single process_madvise() call to make changes to many parts of the target process's address space. In particular, the behavior, results, and ranges arrays are each nr_elem elements long. For each entry, behavior is the set of madvise() flags to apply, ranges is a set of memory ranges held in the vec array, and results is an array of destinations for the results of the call on each range.
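
Since this version of the call also exists only in the patch posting, the sketch below is again illustrative rather than authoritative: the syscall number, the zero flags value, and the interpretation of the size field (taken here as the size of the structure, for extensibility) are my assumptions, not details confirmed by the patches. It shows a single call applying one piece of advice to two ranges in a target process:

    #define _GNU_SOURCE
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Hypothetical: no syscall number was ever assigned for this proposal. */
    #define __NR_process_madvise_rfc 1001   /* placeholder */

    struct pr_madvise_param {
        int size;                  /* assumed: sizeof(struct pr_madvise_param) */
        const struct iovec *vec;   /* array of ranges (or of result buffers) */
    };

    /* Apply "advice" to two ranges of the process behind "pidfd" in one call. */
    static long advise_two_ranges(int pidfd, void *a1, size_t l1,
                                  void *a2, size_t l2, int advice)
    {
        struct iovec range_vec[2] = {
            { .iov_base = a1, .iov_len = l1 },
            { .iov_base = a2, .iov_len = l2 },
        };
        long result_buf[2];
        struct iovec result_vec[2] = {
            { .iov_base = &result_buf[0], .iov_len = sizeof(result_buf[0]) },
            { .iov_base = &result_buf[1], .iov_len = sizeof(result_buf[1]) },
        };
        int behavior[1] = { advice };
        struct pr_madvise_param ranges[1] = {
            { .size = sizeof(struct pr_madvise_param), .vec = range_vec },
        };
        struct pr_madvise_param results[1] = {
            { .size = sizeof(struct pr_madvise_param), .vec = result_vec },
        };

        /* One entry in each of the behavior, ranges, and results arrays. */
        return syscall(__NR_process_madvise_rfc, pidfd, (ssize_t)1,
                       behavior, results, ranges, 0UL);
    }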

The patch set also adds a couple of new madvise() operations. MADV_COOL would cause the indicated pages to be moved to the head of the inactive list, causing them to be reclaimed in the near future (and, in particular, ahead of any pages still on the active list) if the system is under memory pressure. MADV_COLD, instead, moves the pages to the tail of the inactive list, possibly causing them to be reclaimed immediately. Both of these features, evidently, are something that the Android runtime system could benefit from.

The reaction to this proposal was warmer; when most of the comments are related to naming, chances are that the more fundamental issues have been taken care of. Christian Brauner, who has done most of the pidfd work, requested that any system call using pidfds start with "pidfd_"; he would thus like this new call to be named pidfd_madvise(). That opinion is not universally shared, though, so it's not clear that the name will actually change. There were more substantive objections to the MADV_COOL and MADV_COLD names, but less consensus on what better names would be.

Hocko questioned the need for the multi-operation API, noting that madvise() operations are not normally expected (or needed) to be fast. Kim said he would come back with benchmark numbers to justify that API in a future posting.

Of the three interfaces described here, process_madvise() (or whatever it ends up being named) seems like the most likely to proceed. There appears to be a clear need for the ability to have one process change how another process's memory is handled. All that is left is to hammer out the details of how it should actually work.


New system calls for memory management

Posted May 24, 2019 14:56 UTC (Fri) by hyc (guest, #124633) [Link]

They really need to implement the MAP_NOSYNC mmap flag that e.g. FreeBSD already supports.

http://nixdoc.net/man-pages/FreeBSD/mmap.2.html

New system calls for memory management

Posted May 24, 2019 16:10 UTC (Fri) by dullfire (guest, #111432) [Link] (4 responses)

What's the advantage of these syscalls as opposed to mmapping /proc/<pid>/mem, or sharing a memfd?

New system calls for memory management

Posted May 24, 2019 21:48 UTC (Fri) by roc (subscriber, #30627) [Link] (3 responses)

You can't mmap /proc/<pid>/mem. Mmapping /proc/<pid>/mem would be FAR more complex because what is mapped would change as the underlying process changes its mappings. You would also have to deal with deeply nested or even circular dependencies. As far as I can tell the proposed process_vm_mmap() would copy the underlying mappings rather than track them, which is much more sane.

Sharing a memfd only works if the target process arranges in advance to use memfds for all relevant memory, and it doesn't cover important cases like if the target process has mapped a file and you want access to that.

New system calls for memory management

Posted May 25, 2019 0:23 UTC (Sat) by dullfire (guest, #111432) [Link] (2 responses)

I don't know if Linux's VM areas have an easy way (as currently designed) to handle it, but it seems like making mmap() on /proc/<pid>/mem work with sane, well-thought-out semantics would be the more flexible and better design decision.
Additionally: if programs are going to use these new syscalls, they'd have to be written to do so. It'd probably be reasonable to make libc use memfd+mmap to allocate memory, as opposed to sbrk/anonymous mmap/(whatever else they might be using), which would fairly trivially provide the desired functionality.

New system calls for memory management

Posted May 25, 2019 13:39 UTC (Sat) by scientes (guest, #83068) [Link] (1 responses)

It was already explained to you that this wouldn't work because it would be expected that the mappings would change based on the other process's mmap() and munmap() calls.

How about adding a flag so that memfd_create can be supplied additional arguments that provide it with starting data (that will be COW). I think that would fit the VM migration case quite well.

New system calls for memory management

Posted May 25, 2019 22:50 UTC (Sat) by dullfire (guest, #111432) [Link]

So you are saying that something that is unimplemented now won't work, because people's expectations of the functionality (which they can't have used, since it doesn't yet exist) will not match the way it would have to actually work?

New system calls for memory management

Posted May 24, 2019 16:54 UTC (Fri) by josh (subscriber, #17465) [Link] (1 responses)

Personally, I'd love to see ways to set up copy-on-write mappings that *don't* require calling fork(). I've had *many* uses for that.

Meanwhile, process_vm_mmap looks like a great way to set up shared memory.

New system calls for memory management

Posted May 24, 2019 21:57 UTC (Fri) by roc (subscriber, #30627) [Link]

Yeah, the current semantics where memory is either private CoW or shared by everyone is unfortunately restrictive.

When rr checkpoints a set of processes via fork() we currently need to explicitly copy all MAP_SHARED regions so that the checkpoint is disconnected from future changes. It would be great if instead we could make a copy-on-write duplicate of those regions.

New system calls for memory management

Posted Jun 9, 2019 11:29 UTC (Sun) by felix.s (guest, #104710) [Link]

Seeing this proposal to introduce remote madvise() reminded me of something I wanted to accomplish a while ago, namely: is there a way to do mmap() or mprotect() on behalf of another process?

My use case is a userspace DPMI host that keeps its guest process in a ptrace() sandbox to prevent it from issuing native system calls on its own. I figured I'd sometimes have to manipulate the guest's address space to respond to memory allocation requests, or to inject ‘gadgets’ that would safely invoke syscalls that would be otherwise blocked.

(I know, KVM would be better, but it's not always available.)

