LWN: Comments on "The conclusion of the 5.14 merge window"

MADV_POPULATE_* and mbind()

rockeet — Sat, 16 Jul 2022 14:52:37 +0000

Is there any difference between MADV_POPULATE_* and mlock?

The conclusion of the 5.14 merge window

maxfragg — Thu, 15 Jul 2021 08:00:58 +0000

The cost would be higher in an OS less focused on portability than Linux.
Linux tends to use its own syscall dispatching rather than the older syscall mechanisms in which a hardware instruction carries an immediate syscall number; those tend to be quite limited.

For example, x86-32 used int 80h for all syscalls, while some non-portable systems might want to avoid dispatching inside the int 80h handler and instead spread syscalls over the interrupts; then, if you run out of interrupt numbers, you have a cost increase. Linux uses dispatching anyway, so there is no big cost to having a thousand syscalls, besides someone having to maintain them all and the desire to basically never break even a single one.
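
As a rough illustration of the table-based dispatch described above, here is a small user-space C sketch. The names (syscall_table, dispatch, sys_add, sys_max) are hypothetical and this is not Linux's actual entry code; it only shows why a new call costs one more table row rather than a new hardware entry point.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical two-argument "syscall" signature, for illustration only. */
typedef long (*syscall_fn)(long a0, long a1);

long sys_add(long a, long b) { return a + b; }
long sys_max(long a, long b) { return a > b ? a : b; }

/* One row per call number; adding a call means adding a row. */
static const syscall_fn syscall_table[] = {
    [0] = sys_add,
    [1] = sys_max,
};

/* Single entry point: look the number up, reject anything unknown. */
long dispatch(unsigned long nr, long a0, long a1)
{
    size_t n = sizeof syscall_table / sizeof syscall_table[0];

    if (nr >= n || syscall_table[nr] == NULL)
        return -ENOSYS;
    return syscall_table[nr](a0, a1);
}
```

With this shape, the number of calls is bounded by the table size, not by how many interrupt vectors the hardware offers.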

MADV_POPULATE_* and mbind()

abatters — Tue, 13 Jul 2021 14:41:50 +0000

Thanks for taking the time to look into this!

MADV_POPULATE_* and mbind()

david.hildenbrand — Tue, 13 Jul 2021 14:16:04 +0000

Makes sense! QEMU similarly reads+writes one byte of each page when told to preallocate guest memory; the read+write is in place to trigger COW, but to not overwrite existing data, for example, when some piece of guest memory corresponds to a virtual NVDIMM.

In the meantime, I verified that MADV_POPULATE_* and mbind() work together as expected.
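
The read-then-write touch described above can be sketched in C roughly as follows. This is not QEMU's actual code; the function name and the page_size parameter are assumptions made for the example. The store forces a writable page to be allocated, and writing back the value just read leaves existing contents intact.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Touch one byte per page: load it, then store the same value back.
 * `volatile` keeps the compiler from dropping the apparently redundant
 * store. The load/store pair is not atomic, so the range is assumed to
 * have no concurrent writers while this runs.
 */
void touch_pages_preserving(void *addr, size_t len, size_t page_size)
{
    assert(page_size > 0);

    volatile unsigned char *p = addr;
    for (size_t off = 0; off < len; off += page_size) {
        unsigned char v = p[off];
        p[off] = v;
    }
}
```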

MADV_POPULATE_* and mbind()

abatters — Tue, 13 Jul 2021 13:17:44 +0000

I just double-checked, and you are correct, my code does write to the memory, and the comment even says that it is to break the COW mapping so that the memory is actually allocated, so my previous comment was in error.

The conclusion of the 5.14 merge window

Sesse — Tue, 13 Jul 2021 11:36:11 +0000

The cost is primarily technical, not really about performance. There might be some small cache effects if you call way too much different code, but it's unlikely to be a big deal.

MADV_POPULATE_* and mbind()

david.hildenbrand — Tue, 13 Jul 2021 08:01:57 +0000

> "by looping over the allocation and reading at PAGE_SIZE-intervals"

Are you sure that you are *reading* and not writing? On anonymous memory, reading will simply populate the shared zeropage, so I'd be surprised if it (no populated page vs. populated shared zeropage) makes a real difference when later reading from that mapping (read() ...), or even when writing to it (write() ...) in your example.

mlock(), MAP_POPULATE and the new MADV_POPULATE_READ and MADV_POPULATE_WRITE options nowadays all end up calling handle_mm_fault(), the very basic fault handler that is also called on page faults on the faulting CPU. So I'd be surprised if they behave differently, but I'll double check.

Note that there are subtle differences when it comes to shared mappings: mlock() and MAP_POPULATE won't trigger COW on shared mappings. But for your example, mmap(MAP_PRIVATE | MAP_ANONYMOUS), the mbind() documentation is quite clear: "pages will be allocated only according to the specified policy when the application writes (stores) to the page. For anonymous regions, an initial read access will use a shared page in the kernel containing all zeros." And I'd assume that holds for any allocations, also when triggering writes from other CPUs, e.g., as part of a syscall.
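
For reference, asking the kernel to populate a range with the new option might look like the C sketch below. The helper name populate_write is hypothetical, and the fallback definition of MADV_POPULATE_WRITE as 23 assumes the value used by the kernel's generic uapi header on common architectures, for libc headers that predate 5.14.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: value from the kernel's asm-generic/mman-common.h. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Returns 1 if the kernel populated the range, 0 if madvise() reported
 * EINVAL (which is what a kernel that does not know the advice value
 * returns; callers are assumed to pass a page-aligned address so that
 * EINVAL is not caused by alignment), and -1 on any other error.
 */
int populate_write(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 1;
    return errno == EINVAL ? 0 : -1;
}
```

A caller that gets 0 back would need to fall back to touching the pages from userspace.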

The conclusion of the 5.14 merge window

Paf — Tue, 13 Jul 2021 04:51:52 +0000

I think if new functionality is desired and can be clearly delineated, then it’s no worse than other systems growing larger. It has costs. But nothing enormous.

The conclusion of the 5.14 merge window

JohnVonNeumann — Mon, 12 Jul 2021 23:01:38 +0000

Taken from: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

> Nowadays a new system call cost is negligible while it is way
> simpler for userspace to deal with a clear-cut system calls than with a
> multiplexer or an overloaded syscall.

I am a Kernel noob, was just wondering what/if there are downsides to increasing the number of syscalls? Is there a worry about far too much fragmentation amongst syscalls? I guess if I was to make a bad comparison, I'm aware that the x86 instruction set is massive, and people like Chris Domas have done research and found hidden instructions due to the size of the instruction set. Again, I want to reiterate that I know this is a bad example, but I'm just trying to illustrate a point.

MADV_POPULATE_* and mbind()

abatters — Mon, 12 Jul 2021 22:07:29 +0000

In some of my programs I allocate memory with specific properties:

mmap(MAP_PRIVATE | MAP_ANONYMOUS)
mbind() to a specific NUMA node
set other madvise flags (MADV_HUGEPAGE, MADV_DONTDUMP, MADV_DONTFORK, etc.)
prefault in the pages manually by looping over the allocation and reading at PAGE_SIZE-intervals

A long time ago (many kernels ago), I found that prefaulting is needed because just doing a system call like read() and passing the buffer without prefaulting from userspace doesn't always obey mbind() policy. I once tried using mlock() to prefault the pages, but that ignored the mbind() policy also (again with old kernels).
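
The allocation recipe listed above could be sketched in C along these lines. This is only an illustration under several assumptions: the function name is hypothetical, mbind() is invoked through syscall() so the example does not need libnuma's headers, MPOL_BIND is spelled out as 2 (its value in the kernel's linux/mempolicy.h), and the prefault loop writes rather than reads, since a read of fresh anonymous memory only maps the shared zero page.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* MADV_HUGEPAGE and friends, syscall() */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: MPOL_BIND's value from the kernel's <linux/mempolicy.h>. */
enum { SKETCH_MPOL_BIND = 2 };

/*
 * Map `len` bytes of private anonymous memory, request binding to NUMA
 * node `node`, apply the madvise() hints, then prefault by writing.
 * *policy_applied becomes 1 if mbind() succeeded and 0 otherwise (for
 * example without NUMA support or inside a restricted sandbox).
 * Returns the mapping, or NULL if mmap() failed.
 */
void *alloc_prefaulted_on_node(size_t len, unsigned node, int *policy_applied)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned long mask;

    /* Stay well inside a one-word node mask. */
    assert(page > 0 && node + 1 < 8 * sizeof mask);

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    mask = 1UL << node;
    long rc = syscall(SYS_mbind, p, (unsigned long)len,
                      (unsigned long)SKETCH_MPOL_BIND, &mask,
                      (unsigned long)(8 * sizeof mask), 0UL);
    if (policy_applied)
        *policy_applied = (rc == 0);

    /* Hints only: failures (e.g. no THP support) are deliberately ignored. */
    (void)madvise(p, len, MADV_HUGEPAGE);
    (void)madvise(p, len, MADV_DONTDUMP);
    (void)madvise(p, len, MADV_DONTFORK);

    /* One store per page. Fresh anonymous memory reads as zero, so
     * storing 0 changes no contents. */
    volatile unsigned char *touch = p;
    for (size_t off = 0; off < len; off += (size_t)page)
        touch[off] = 0;

    return p;
}
```

A real program would probably treat an mbind() failure as an error instead of continuing; the sketch reports it through policy_applied so the remaining steps can still be shown.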

So do these new MADV_POPULATE_* obey mbind() policy?