Sharing page tables with mshare()
Posted Dec 24, 2022 8:37 UTC (Sat) by dankamongmen (subscriber, #35141) [Link]
as things stand today:

* if i want some number of virtually-contiguous bytes, i use malloc() (which uses mmap() as a backend for sufficiently large requests, but always the same way: anonymous private).
* if i need them aligned, i use posix_memalign(3).
* if my use case maps to one of madvise(2)'s Borges-like[0] flags, i can give it the ol' college try.
** will it be a no-op this kernel version? who am i to question the will of Allah?
* mremap(2) is there, guaranteeing me job security dealing with subtle linux v freebsd differences.
* vmsplice and zerocopy and userfaultfd sometimes come over on the weekends.
* MAP_HUGETLB? MAP_HUGE_2MB? MAP_HUGE_1GB? can i provide both to fall back if one isn't available? is hugetlbfs involved? what if i map hugetlbfs without these flags? are hugepages better than superpages? are my smallpages being combined into hugepages by the kernel? is the kernel fracturing my hugepages into smallpages? can i use them without kernel command line options? yeah? without runtime configuration requiring CAP_SYS_ADMIN? so i can't configure them without privs, but i can map them without privs? oh no, i need CAP_IPC_LOCK? but i'm not doing IPC? i guess it's a "memory resource" so act like it's an mlock(2)? but i don't need that for regular pages? must every mmap() i write forevermore first try to get hugepages, then call again when that fails? how do i know whether the failure was due to hugepages? is a single one of these resource limits tied into the number of largepage TLB entries, probably the most relevant parameter for effective hugetlb use? (these are rhetorical questions; you needn't point me to the answers as of 6.1.1. the fallback dance, at least, is sketched after this list.)
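(a minimal sketch of that fallback dance, to make the ceremony concrete. everything here is illustrative: the helper name is invented, MAP_HUGE_2MB comes from linux/mman.h where sys/mman.h lacks it, and len is assumed to be a multiple of 2MB.)

    /* try the hugetlb pool first, then settle for base pages plus a
     * best-effort THP hint. */
    #include <sys/mman.h>
    #include <linux/mman.h>   /* MAP_HUGE_2MB, if sys/mman.h lacks it */
    #include <stddef.h>

    static void *map_maybe_huge(size_t len) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                       -1, 0);
        if (p != MAP_FAILED)
            return p;   /* got 2MB pages from the reserved pool */
        /* errno can't distinguish "no pool configured" from ordinary
         * memory pressure, so just retry with base pages... */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        /* ...and hint for transparent hugepages, which may well be a
         * no-op depending on kernel configuration. */
        (void)madvise(p, len, MADV_HUGEPAGE);
        return p;
    }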
the mmap(2) flag potpourri is acceptable for sharing and coarse-grained access control on a computer from the early 80s, but as a userspace developer who'd like to use memory effectively on something more complex than an ATARI i want to know:

* what are my memory hierarchies?
* what are my memory-processor-IO topologies?
* what are my cacheline sizes, cache capacities, cache associativities, and page sizes? (the sliver of this that's answerable today is sketched after this list.)
* most importantly: given a proposed working set size, where can i keep it in memory?
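(for the record, that sliver: page size is POSIX sysconf(3), while the cache numbers are glibc extensions that may legitimately report 0 or -1. a sketch, not a solution:)

    /* the introspection that does exist today: sysconf(3) queries. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("page size:         %ld\n", sysconf(_SC_PAGESIZE));
        printf("L1d line size:     %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L1d capacity:      %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1d associativity: %ld\n", sysconf(_SC_LEVEL1_DCACHE_ASSOC));
        /* hugepage sizes and NUMA topology mean scraping sysfs:
         * /sys/kernel/mm/hugepages/ and /sys/devices/system/node/. */
        return 0;
    }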
i want to be able to say "i have a dense 24MB. gimme good memory for that." or "i have 32MB of sparse garbage." or "gimme a sack of buffers that don't alias one another, without my personal study of cache details for this Garbotron 7000". or "i always want these 64KB hot in my TLB but i don't care whether it's a dainty or beefy page beyond that." or "map this file so i can change it in ram and leave things to the page cache, keep it simple, i am eight."

iouring is getting really close with buffer pools and some kernelside dataflow. some kind of buffer coloring seems like it could go a long way here. i did something similar with my libtorque allocator[1], but that project effectively died over a decade ago.
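(the registered-buffer piece, for reference; a sketch assuming liburing is installed and linked with -luring, with queue depth and buffer sizes picked arbitrarily:)

    /* register a fixed buffer pool with io_uring: the kernel pins the
     * pages once up front, and IORING_OP_{READ,WRITE}_FIXED submissions
     * then name buffers by index instead of re-pinning per operation. */
    #include <liburing.h>
    #include <stdlib.h>

    #define NBUFS 8
    #define BUFSZ (1 << 16)

    static int make_fixed_pool(struct io_uring *ring, struct iovec *iov) {
        if (io_uring_queue_init(64, ring, 0) < 0)
            return -1;
        for (int i = 0; i < NBUFS; i++) {
            if (posix_memalign(&iov[i].iov_base, 4096, BUFSZ))
                return -1;
            iov[i].iov_len = BUFSZ;
        }
        return io_uring_register_buffers(ring, iov, NBUFS);
    }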
bandwidths and latencies are also interesting, but i'm less likely to design around their precise values (and doing so would lead to a system very sensitive to disruption from other processors). non-temporal stores, IO device streaming into or out of on-die cache, IO device scatter/gather restrictions -- these are all necessary to achieve peak performance. systemwide monitoring of cycles lost to TLB misses could probably be used to configure all the hugetlb stuff better than admins+devs ever could manually. but let's get the simple stuff first.
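(non-temporal stores, for anyone who hasn't met them: a sketch of a cache-bypassing copy on x86-64, assuming SSE2, 16-byte alignment, and a size that's a multiple of 16; the function name is invented.)

    /* stream a buffer to memory without pulling it through the cache,
     * so a big one-shot write doesn't evict the actual working set. */
    #include <emmintrin.h>
    #include <stddef.h>

    static void copy_nontemporal(void *dst, const void *src, size_t n) {
        __m128i *d = dst;
        const __m128i *s = src;
        for (size_t i = 0; i < n / 16; i++)
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        _mm_sfence();   /* order the streamed stores before later writes */
    }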
i can't wait for CXL to find its place among this strange brew. there's the kernel's mm/ and hardware coherence and the POSIX+ vma/file APIs, and the first two seem more or less on the same page, but the last is at best an imprecise and incomplete means of influencing them.

sorry for the rant, but all this stuff has been the bane of my existence as a userspace hacker for a minute now. it works well enough in rarefied HPC environments, where you know and control the machine, but putting out code hoping to use just basic hugetlbs on arbitrary machines is an exercise in annoyance.
[0] https://en.wikipedia.org/wiki/Celestial_Emporium_of_Benev...
[1] https://github.com/dankamongmen/libtorque