
Sharing page tables with msharefs

By Jonathan Corbet
July 15, 2022
A page-table entry (PTE) is relatively small, requiring just eight bytes to refer to a 4096-byte page on most systems. It thus does not seem like a worrisome level of overhead, and little effort has been made over the kernel's history to reduce page-table memory consumption. Those eight bytes can hurt, though, if they are replicated across a sufficiently large set of processes. The msharefs patch set from Khalid Aziz is a revised attempt to address that problem, but it is proving to be a hard sell in the memory-management community.

One of the defining characteristics of a process on Linux (or most other operating systems) is a distinct address space. As a result, the page tables that manage the state of that address space are private to each process (though threads within a process will share page tables). So if two processes have mappings to the same page in physical memory, each will have an independent page-table entry for that page. The overhead for PTEs, thus, increases linearly with the number of processes mapping each page.

Even so, this cost is not normally problematic, but there is always somebody out there doing outlandish things. As described in the cover letter from the patch series:

On a database server with 300GB SGA [Oracle system global area], a system crash was seen with out-of-memory condition when 1500+ clients tried to share this SGA even though the system had 512GB of memory. On this server, in the worst case scenario of all 1500 processes mapping every page from SGA would have required 878GB+ for just the PTEs. If these PTEs could be shared, the amount of memory saved is very significant.

Sharing those PTEs is the objective of this work; at eight bytes per 4KB page, mapping a 300GB region costs each process nearly 600MB in page-table entries, and 1,500 such processes account for the 878GB figure cited above. The work was discussed at the Linux Storage, Filesystem, Memory-Management, and BPF Summit in May; at that time, Aziz was proposing a new system call (mshare()) to manage this sharing. The current patch set has changed the interface and now requires no new system calls at all.

Even without the system call, it is still necessary for processes to explicitly request the sharing of page tables for a range of memory. The current patch set provides yet another kernel virtual filesystem — msharefs — for that purpose; it is expected to be mounted on /sys/fs/mshare. The file mshare_info in that filesystem will, when read, provide the minimum alignment required for a memory region to be able to share page tables.
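As a minimal sketch (the exact format of mshare_info is an assumption based on the description above, not a detail from the patch posting), a process could query that alignment like this:

    #include <stdio.h>

    /* Read the minimum alignment advertised by msharefs; the file is
       assumed here to contain a single decimal number. */
    static unsigned long mshare_alignment(void)
    {
        unsigned long alignment = 0;
        FILE *f = fopen("/sys/fs/mshare/mshare_info", "r");

        if (f) {
            if (fscanf(f, "%lu", &alignment) != 1)
                alignment = 0;
            fclose(f);
        }
        return alignment;
    }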

The next step is to create a file under /sys/fs/mshare with a name that means something to the application. Then, an mmap() call should be used to map that file into the process's address space. The size passed to mmap() will determine the size of the resulting shared region of memory. Your editor's reading of the code suggests that providing an explicit address for the mapping is advisable; there does not appear to be any mechanism to automatically pick an address that meets the alignment requirements. Once the region has been mapped, it can be used just like any other memory range.
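A sketch of the creating side might look like the following; the file name, the chosen address, and the use of O_CREAT are illustrative assumptions rather than details taken from the patch set:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_ADDR ((void *)0x7f0000000000UL)  /* hypothetical, suitably aligned */
    #define REGION_SIZE (512UL << 20)               /* 512MB shared region */

    /* Create an msharefs file and map it to establish the shared region. */
    static int create_mshare_region(void)
    {
        void *p;
        int fd = open("/sys/fs/mshare/mydata", O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return -1;

        p = mmap(REGION_ADDR, REGION_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        return fd;
    }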

The purpose of creating such a region is to allow other processes to map it as well. Any other processes will need to start by opening the msharefs file created by the first process, then reading a structure of this type from it:

    struct mshare_info {
        unsigned long start;
        unsigned long size;
    };

The start and size fields provide the address at which the region is mapped and its size, respectively; the new process should pass those values (and the opened msharefs file) to its own mmap() call to map the shared region. After that, the region will be mapped just like any other shared-memory area — with a couple of important exceptions, as will be described below.
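Put together, a second process might attach to the region roughly as follows; this is again a sketch under the assumptions above, with the structure read directly from the opened msharefs file as your editor understands the interface:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct mshare_info {
        unsigned long start;
        unsigned long size;
    };

    /* Attach to an existing mshare region created by another process. */
    static int attach_mshare_region(void)
    {
        struct mshare_info info;
        void *p;
        int fd = open("/sys/fs/mshare/mydata", O_RDWR);

        if (fd < 0)
            return -1;
        if (read(fd, &info, sizeof(info)) != sizeof(info)) {
            close(fd);
            return -1;
        }

        /* Map the region at the address and size chosen by its creator. */
        p = mmap((void *)info.start, info.size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        return fd;
    }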

A process's address space is described by struct mm_struct; there is one such structure for each process (other than kernel threads) in the system. The msharefs patch set changes the longstanding one-to-one relationship between this structure and its owning process by creating a new mm_struct structure for each shared region. The page tables describing this region belong to this separate structure, rather than to any process's mm_struct. Whenever a process maps this region, the associated vm_area_struct (VMA) will contain a pointer to this special mm_struct. The end result is that all processes mapping this area will share not just the memory, but also the page tables that go along with it.

That saves the memory that would have gone into duplicate page tables, of course, but it also has a couple of other, possibly surprising, results. For example, changing the protection of memory within that region with mprotect() will affect all processes sharing the area; with ordinary shared memory, only the calling process will see protection changes. Similarly, the memory region can be remapped entirely with mremap() and all users will see the change.
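As an illustration (purely hypothetical; "region" and "size" stand for an already-mapped mshare area), a call like the following made by one process would leave every sharing process without write access:

    #include <sys/mman.h>

    /* With ordinary MAP_SHARED memory, only the caller loses write access;
       with a PTE-shared region, all sharing processes do. */
    static int make_region_readonly(void *region, size_t size)
    {
        return mprotect(region, size, PROT_READ);
    }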

It appears that use of mremap() is actually part of the expected pattern for PTE-shared memory regions. The mmap() call that is required to create the region will populate that region with anonymous memory; there is no way to request that file-backed memory be used instead. But it is possible to use mremap() to dump that initial mapping and substitute file-backed memory afterward. So applications wanting to use shared page tables with file-backed memory will have to perform this extra step to set things up correctly.
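The details of that step are not spelled out in the posting; one plausible way to do it (an assumption on your editor's part, not code from the patch set) would be to map the backing file elsewhere, then move that mapping over the mshare region with MREMAP_FIXED:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Replace the anonymous memory in an mshare region with a mapping of
       the given file.  Hypothetical pattern for illustration only. */
    static int back_region_with_file(void *region, size_t size, const char *path)
    {
        void *tmp, *moved;
        int fd = open(path, O_RDWR);

        if (fd < 0)
            return -1;

        /* Map the file wherever the kernel likes... */
        tmp = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (tmp == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* ...then move that mapping on top of the mshare region. */
        moved = mremap(tmp, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, region);
        close(fd);
        return moved == MAP_FAILED ? -1 : 0;
    }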

The developers at the LSFMM session were clear that they found this whole concept to be somewhat frightening. So far, the reaction to this patch series has (from a memory-management point of view) been relatively subdued, with the exception of David Hildenbrand, who is pushing for a different sort of solution. He would rather see a mechanism that would automatically share page tables when mappings are shared, without requiring application-level changes. That would make the benefits of sharing more widely available while exposing fewer internal memory-management details.

Automatic sharing would need to have different semantics, though; otherwise applications will be surprised when an mprotect() or mremap() call in another process changes their mappings. Though it was not stated in this version of Aziz's patch posting, the sense from the LSFMM session was that the altered semantics were desirable. If that is the case, fully automatic sharing will not be possible, since applications would have to opt in to that behavior.

Either way, it looks like this particular patch set needs more work and discussion before it can find its way into the mainline. Until then, applications depending on sharing memory between large numbers of processes will continue to pay a high page-table cost.

Index entries for this article
Kernel: Memory management/Page-table sharing



Sharing page tables with msharefs

Posted Jul 15, 2022 14:32 UTC (Fri) by clugstj (subscriber, #4020) [Link] (12 responses)

Maybe Oracle should not spawn a process for each client.

Sharing page tables with msharefs

Posted Jul 15, 2022 14:38 UTC (Fri) by willy (subscriber, #9762) [Link] (11 responses)

The database can work in a threaded mode, but performance on Linux is worse than the fork model.

Sharing page tables with msharefs

Posted Jul 15, 2022 15:28 UTC (Fri) by clugstj (subscriber, #4020) [Link] (7 responses)

Well, when given the choice between a performance hit and it crashing, I'd say you should pick performance hit.

Sharing page tables with msharefs

Posted Jul 15, 2022 17:11 UTC (Fri) by malmedal (subscriber, #56172) [Link] (2 responses)

If you don't care about performance you really shouldn't be running Oracle.

Sharing page tables with msharefs

Posted Jul 15, 2022 18:11 UTC (Fri) by k8to (guest, #15413) [Link] (1 responses)

But also, if you do care about performance you really shouldn't be running Oracle.

(Sort of a joke, because no one should be. But sort of true, because they tend to only win in cases of very fragile tuning that cannot be done in a timely way. And because, of course, if you spent that same money on hardware, you'd usually be better off with PostgreSQL.)

Sharing page tables with msharefs

Posted Jul 15, 2022 20:20 UTC (Fri) by malmedal (subscriber, #56172) [Link]

I'm totally onboard with the no-one should be running Oracle sentiment :)

Several years out of the loop now, but I remember that Oracle, at least, used to be up to twice as fast as PostgreSQL and MySQL on the same hardware, and also the Magic Money Tree would provide far more money for Oracle hardware since the license was so expensive :(

Sharing page tables with msharefs

Posted Jul 16, 2022 6:20 UTC (Sat) by mokki (subscriber, #33200) [Link] (3 responses)

Doesn't PostgreSQL also use fork per client model?

I was under the impression that the thread-per-client model is faster, but less safe. In the process-per-client model, a bug in one client cannot corrupt memory in another client's process, outside the explicitly shared memory area.

Sharing page tables with msharefs

Posted Jul 16, 2022 9:44 UTC (Sat) by edeloget (subscriber, #88392) [Link] (2 responses)

> Doesn't PostgreSQL also use fork per client model?

Yes, but then, correct system architecture tells you to limit the number of clients to something that the machine can handle :)

If you cannot have 1500 clients on a single machine, then maybe you shouldn't have 1500 clients on the same machine :)

Sharing page tables with msharefs

Posted Jul 16, 2022 9:58 UTC (Sat) by flussence (guest, #85566) [Link] (1 responses)

I've seen articles about database designs that, instead of routing things through a frontend web API or whatever, expose Postgres users and use its security rules model directly. I imagine that'd cause some scaling headaches if done to a medium-large website...

Sharing page tables with msharefs

Posted Jul 16, 2022 18:32 UTC (Sat) by butlerm (subscriber, #13312) [Link]

If your database server can handle the load in terms of number of connections, and your application can afford to be tightly coupled to the database design, and the connection latency is typical of a local area network, it is probably almost always faster to connect directly rather than running through extra tiers that mostly shuffle things around en route. In so many cases that is not possible any more, though; even old-school web applications multiplex database connections across concurrent users (and often do an outstanding job of it, but that is another story).

Sharing page tables with msharefs

Posted Jul 15, 2022 19:55 UTC (Fri) by josh (subscriber, #17465) [Link] (2 responses)

What makes it lower performance when using threads? Lock contention on the process mm? Something else?

Sharing page tables with msharefs

Posted Jul 15, 2022 20:52 UTC (Fri) by bartoc (guest, #124262) [Link]

Perhaps they want to COW some regions of memory while sharing others. This is quite difficult to do without fork/clone because, as far as I can tell, there's no other good way to create a COW mapping of another COW mapping. (On Windows this is completely impossible, which is why you can't easily do things like forking game saves without implementing COW as part of your application's data structures.) I'm not 100% sure how to do this on Linux (without using fork/clone); maybe you can reflink tmpfs files together or something.

Sharing page tables with msharefs

Posted Jul 15, 2022 22:59 UTC (Fri) by nickodell (subscriber, #125165) [Link]

Yes, exactly. From the May 17 article about this patchset:

> That raises an obvious question, he said: why not just use threads? The answer was that "mmap_lock sucks". It is also not possible to change the existing behavior of MAP_SHARED, since that would break programs, so there would need to be, at a minimum, a new mmap() flag if not a new system call. Aziz said that the separate system call makes the page-table sharing explicit rather than it just being a side effect. That makes the decision to opt into this behavior explicit as well.

https://lwn.net/Articles/895217/

Sharing page tables with msharefs

Posted Jul 18, 2022 11:53 UTC (Mon) by ddevault (subscriber, #99589) [Link] (2 responses)

>A page-table entry (PTE) is relatively small, requiring just eight bytes to refer to a 4096-byte page on most systems.

This is a little bit misleading. On x86_64, there are four levels of page tables: PML4, PDPT, PD, and PT; the latter contains PTEs (you also have PDEs, PDPTEs, and PML4Es going up the chain). A virtual address essentially contains four indexes into these tables, and the CPU follows the tables to resolve the physical address for a given virtual address, such that address 0x111222333444 will look up index 0x111 from the PML4 and so on (simplified representation) to find the appropriate physical memory address. Each page table is 4 KiB and holds 512 of these 8-byte entries, be they PDEs or otherwise. If you map fewer than 512 pages, you still need at least 4K*4 = 16K bytes for the four page tables, each of which takes up a single page.

In the simple case, you can map up to 512 pages with just the four page tables -- assuming the addresses are contiguous. In the pathological case, the addresses are distributed throughout memory, in which case you might need 512*3 = 1536 page tables = 6 MiB to map all of them (*3 because you only ever need one PML4 per address space).

Sharing page tables with msharefs

Posted Jul 18, 2022 11:54 UTC (Mon) by ddevault (subscriber, #99589) [Link]

>In the pathological case, the addresses are distributed throughout memory, in which case you might need 512*3 = 1536 page tables = 6 MiB to map all of them (*3 because you only ever need one PML4 per address space).

One other note: this situation calls for 6 MiB of page tables to map 2 MiB worth of actual pages.

Sharing page tables with msharefs

Posted Jul 18, 2022 13:52 UTC (Mon) by corbet (editor, #1) [Link]

Pathological cases can always be constructed. But if you're trying to map GB of memory scattered through memory as individual pages, you're going to have other problems as well, and PTE overhead is still unlikely to be the constraining factor.

Sharing page tables with msharefs

Posted Jul 21, 2022 14:45 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

> Automatic sharing would need to have different semantics, though; otherwise applications will be surprised when an mprotect() or mremap() call in another process changes their mappings.

Wouldn't any automatic sharing solution have to spot this case and "unshare" the PTEs?

Sharing page tables with msharefs

Posted Jul 22, 2022 10:13 UTC (Fri) by flussence (guest, #85566) [Link]

It sounds like there's a lot of overlap here with Transparent Hugepages, which had to deal with similar merging and splitting.

Sharing page tables with msharefs

Posted Feb 22, 2023 22:35 UTC (Wed) by smooth1x (guest, #25322) [Link] (1 responses)

"those who do not understand unix are condemned to reinvent it poorly"

Solaris already had this since the 1990s, Intimate Shared Memory (ISM) - shmat with the SHM_SHARE_MMU flag.

Solaris even now has Dynamic Intimate Shared Memory - https://docs.oracle.com/cd/E19683-01/817-3801/whatsnew-s9...

When will Linux get with the program and get this added, only 30 years later?!!

Sharing page tables with msharefs

Posted Feb 22, 2023 23:46 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

When someone who cares does the work.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds