Guard pages for file-backed memory
The purpose of a guard page is to prevent buggy (or malicious) code from overrunning a memory region. An inaccessible page placed at the end of a region will cause a segmentation fault should the running process try to read or write to it; well-placed guard pages can trap a number of common buffer overruns and similar problems. Prior to 6.13, though, the only way to put a guard page into a process's address space was to set the protections on one or more pages with mprotect(); that works, but at the cost of creating a new virtual memory area (VMA) to contain the affected page(s). Placing a lot of guard pages will create a lot of VMAs, which can slow down many memory-management functions.
The new guard-page feature addresses this problem by working at the page-table level rather than creating a new VMA. A process can create guard pages with a call to madvise(), requesting the MADV_GUARD_INSTALL operation. The indicated range of memory will be rendered inaccessible; any data that might have been stored there prior to the operation will be deleted. There is an operation (MADV_GUARD_REMOVE) to remove guard pages as well.
Placing guard pages in VMAs containing anonymous pages is the simplest case, which is why anonymous pages were supported first. These pages have no connection to any file on disk, so there are relatively few hazards involved with changing their behavior. File-backed pages bring more complexity, though, and a number of places where guard pages could cause problems. Stoakes goes through the list in detail in the patch posting.
For example, readahead is an important part of maintaining performance when a process is working sequentially through a file. As that process reads some data from a file, the kernel can guess that the process will go on to request the following data in the file in the near future. By initiating a read operation before user space gets around to asking for the data, the kernel can ensure that this data is present (or at least on its way) when the request arrives. The presence of a guard page will stop readahead cold at that point, since the page has been marked inaccessible. As Stoakes notes, this should not be a problem, since it would be unusual for a process to map a file, place a guard page, then try to read through that page.
Similar complications arise in other situations. The kernel will often try to "fault around" a page that has been faulted in, under the assumption that nearby data will be of interest; guard pages will prevent that as well. If a file is truncated, the removed portion may include guard pages, but the guard pages themselves will remain in place. And so on; in each case, Stoakes has ensured that the kernel's operation will be correct and make sense.
There are still a couple of exceptions, though, one of which was known about before the patches were posted, while the other was a surprise. The known issue is that guard pages cannot be placed in memory areas that have been locked into RAM with mlock(). The problem, as Vlastimil Babka pointed out, is that mlock() guarantees that the affected pages will not be kicked out of RAM. Installing a guard page, though, frees any data stored there, which runs counter to the mlock() promise. Stoakes is considering a new operation that would make this data destruction explicit in that case but, as David Hildenbrand said, "mlock is weird" and there are a number of other details that would have to be managed there.
The unexpected issue was raised by Kalesh Singh, who wondered how the presence of guard pages would be represented in /proc/PID/maps and /proc/PID/smaps. These files, which are documented in Documentation/filesystems/proc.rst, describe a process's VMAs in detail. Singh said:
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
It seems that banking apps running on Android are known for this sort of behavior and could run into trouble if guard pages are installed — which is something that the Android runtime might well want to do as a general hardening measure. Since those apps already read the indicated /proc files, Singh thought that would be a logical place to indicate the presence of the guard pages.
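The kind of lookup Singh describes might look roughly like the following sketch, which asks /proc/self/maps whether an address lies in a range marked readable. The function name maps_says_readable() is invented for this example, and real apps will differ in the details.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if /proc/self/maps lists addr inside a range with read
 * permission, 0 if it does not, and -1 if the file cannot be opened. */
int maps_says_readable(const void *addr)
{
    unsigned long target = (unsigned long)(uintptr_t)addr;
    unsigned long start, end;
    char line[4352], perms[5];  /* room for a full path on each line */
    int found = 0;
    FILE *maps = fopen("/proc/self/maps", "r");

    if (!maps)
        return -1;
    while (fgets(line, sizeof(line), maps)) {
        /* Each line begins with "start-end perms", for example
         * "7f1c2000-7f1c3000 r--p". */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) == 3 &&
            target >= start && target < end) {
            found = (perms[0] == 'r');
            break;
        }
    }
    fclose(maps);
    return found;
}
```

A page turned into a guard page with MADV_GUARD_INSTALL would still be reported as readable by a check like this, since the VMA and its permissions are unchanged; that is the mismatch Singh was worried about.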
This request took Stoakes by surprise, since he thought the topic had been discussed previously and the situation understood. That situation is that, since those files describe VMAs, they are not a suitable place to put information about guard pages which, by design, do not have their own VMAs. Hildenbrand quickly suggested that a bit in /proc/PID/pagemap, which provides page-level data now, would be the best way to export that information to user space. The conversation nonetheless became a little tense, seemingly mostly as a result of misunderstandings rather than true disagreement.
In the end, though, it was agreed that pagemap was the right place for this information. Suren Baghdasaryan eventually joined the conversation, saying that some work would be needed to make this information available to apps in the Android system, but that he would start on that project. Apologies and thanks were shared around, and Stoakes said that he would go ahead and implement the kernel side of the pagemap solution.
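For illustration, a user-space check against pagemap could be sketched as below. The bit position is an assumption that does not come from the discussion above: the kernel's pagemap documentation lists bit 58 as the guard-region flag as of 6.15, and older kernels leave that bit zero. The function name pagemap_says_guard() is hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Assumption: bit 58 of a pagemap entry marks a guard region (6.15+). */
#define PM_GUARD_REGION_BIT 58

/* Returns 1 if pagemap flags the page containing addr as a guard region,
 * 0 if it does not, and -1 if the entry could not be read. */
int pagemap_says_guard(const void *addr)
{
    long page = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    /* pagemap holds one 64-bit entry per virtual page, indexed by page
     * number. */
    off_t offset = (off_t)((uintptr_t)addr / (uintptr_t)page) *
                   (off_t)sizeof(entry);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return (int)((entry >> PM_GUARD_REGION_BIT) & 1);
}
```

On 32-bit systems this would need to be built with 64-bit file offsets, since pagemap offsets can exceed the range of a 32-bit off_t.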
With that issue seemingly resolved, there do not appear to be any serious obstacles to this feature heading toward the mainline in the near future. The patch series (minus the pagemap changes) is sitting in linux-next now and could conceivably go upstream as soon as the 6.15 merge window. That should result in easier and cheaper user-space hardening, which seems worth the trouble.
Posted Mar 3, 2025 20:15 UTC (Mon)
by ljsloz (subscriber, #158382)
[Link]
Posted Mar 3, 2025 22:00 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link]
Posted Mar 3, 2025 22:05 UTC (Mon)
by jokeyrhyme (subscriber, #136576)
[Link]
Posted Mar 3, 2025 22:06 UTC (Mon)
by khim (subscriber, #9252)
[Link] (13 responses)
We don't know what exactly banking apps need, but we know what they want: they just want to find out where in the address space the code of system libraries lies… to scan that code. Some would want to scan the data segment, too, but most only care about code. I don't think they even think in terms of VMAs or page tables… they just need a list of addresses they can safely peek into without triggering SIGSEGV… and currently it can be readily found in /proc/PID/maps. This patch definitely breaks that API.
P.S. Of course on ARM64 there is an additional twist to all that madness: because libraries are normally mapped execute-only on Android, not only do these apps need to find these regions via /proc/PID/maps… they also need to make them read+execute (for investigation purposes) and then they make them execute-only again (if they are courteous… many leave mappings in read+execute state). I wonder what would happen when all these hardening techniques meet in one place, though.
Posted Mar 4, 2025 1:01 UTC (Tue)
by WolfWings (subscriber, #56790)
[Link] (4 responses)
There's quite a few that won't even load if you have Developer mode enabled on Android, doesn't matter if you're using it just to keep the screen on, or to override bluetooth versions to work with an older car radio, they just straight refuse to load.
Posted Mar 4, 2025 3:50 UTC (Tue)
by wtarreau (subscriber, #51152)
[Link] (3 responses)
We're still missing a portable way to attempt a safe kernel-assisted memory copy from one area to another that would simply return EFAULT when either area is not accessible. I'm using some hacks using syscalls when I need to do that but that's ugly. I suspect one could also use vmsplice() to move the data into a pipe then from it, though I have not tried.
Posted Mar 4, 2025 8:41 UTC (Tue)
by fw (subscriber, #26023)
[Link]
Posted Mar 4, 2025 9:38 UTC (Tue)
by khim (subscriber, #9252)
[Link] (1 responses)
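Well… using write for read looks a bit… quaint, but works well in practice. pipe, then fork with one side reading with the write syscall (specify the memory that you want to look into as the buffer and the pipe as the target; the kernel will return EFAULT if the memory cannot be read) and the other getting information from the pipe… this trick is decades old and portable (even if I'm not sure how portable, but it certainly works fine with very old versions of Linux), but I don't see why it would stop working… because of fork?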
Posted Mar 5, 2025 4:36 UTC (Wed)
by wtarreau (subscriber, #51152)
[Link]
Posted Mar 4, 2025 9:28 UTC (Tue)
by vbabka (subscriber, #91706)
[Link]
But yes, the new API gives you better efficiency (fewer VMAs) with the tradeoff for /proc/pid/(s)maps visibility, and having to deal with faults when you would want to scan the memory ranges and don't know where the guard pages are.
So yes in practice that makes it harder to use the new APIs when some part of the userspace (i.e. libc or the android Zygote mechanism) would switch from PROT_NONE guard areas to the new functionality and thus break other parts of the userspace. But the kernel doesn't change anything to the existing userspace unless it opts-in to the new functionality, so "we don't break the userspace" is not affected here.
Posted Mar 4, 2025 9:56 UTC (Tue)
by ljsloz (subscriber, #158382)
[Link] (6 responses)
There's a series of assumptions being made about executable code being present and immutable, none of which are 'API'.
It may be taken to probably be the case that it's fine, but it's in essence assuming internal implementation details.
Note that executable segments are very unlikely to be reclaimed, but they might be, at which point your underlying file system may do strange things on fault (e.g. network fs etc.). It may be unlikely, certainly in an android case, but if one is going to call this an implicit 'API' you really need some solid basis to do so.
What makes this more likely are for instance, obfuscation techniques and JIT which may result in 'strange' executable code relocation.
Speaking from a philosophical standpoint, PROT_NONE is not a contract that guarantees 'hey this is the only means by which a guard region can be implemented', it is only, simply, a VMA that guarantees, at least right now, that accesses will cause a signal to arise.
So I think right now these apps are making assumptions about internal implementation details, imagining implied contracts, which happen to be the case (at least _most_ of the time) right now, but, should implementers _choose_ to use this API, will no longer be in the future.
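So speaking _generally_ about /proc/$pid/maps, ranges shown there:
- May SIGBUS.
- May fault causing file systems to possibly do strange things (they have custom hooks), that could in theory result in SIGSEGV or do other broken things....
- May trigger a uffd fault, where a broken userland app may cause an eternal sleep.
- /proc/$pid/maps is racy, and you may see things out of order as you read left to right if there are aggregate (in userland) operations being performed.
- May not exist any more, people may unmap/remap at any time.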
There is absolutely emphatically no guarantee you will not receive a signal (or brokenness) by accessing regions seen in /proc/$pid/maps. No such guarantee is documented anywhere, and cannot reasonably be relied upon by userland.
Linux isn't windows of yore, the commitment to not breaking userspace has very well been established as 'not breaking _reasonable_ userland if and only if there are users which actually specifically rely upon said reasonable interfaces'.
The discussion on-list was that all of this was made abundantly clear throughout the implementation of this feature, and it was agreed that there was some miscommunication that led to this issue not being raised.
But equally, it was agreed that it would be correct to instead of having the banking apps etc. simply rely upon _assumptions_ via /proc/$pid/maps, they could, instead, make use of an interface that explicitly provided this information. The provision of guard region information in /proc/$pid/pagemap provides the means for this. This should solve things for this very, very specific and unusual use case.
Additionally I went to great lengths to try to find whatever means by which we could find to resolve this - since this is all moot, given this is a shipped feature which again - I emphasise does NOT break the 'do not break userland' concept, nor does it break any existing API.
As @vbabka points out also, what is very clear is - this feature is opt-in. Nobody _has_ to use it, and continuing to use linux as-is will have no impact whatsoever with this feature present. So again, no API break, no never-break-userspace-break.
The benefits of this feature are huge, as the kind of kernel memory that is used upon VMA proliferation is pure memory pressure - it cannot be reclaimed or migrated. At scale this can really, really add up.
Also another point raised in the discussion is at no point could this feature exist AND something appear in /proc/$pid/maps. The feature emphatically requires the separation of page table-induced faulting behaviour vs. VMA metadata state. /proc/$pid/maps does not traverse page tables, so cannot obtain this information. It is expensive to do so. Adjustment of VMA metadata to show it here would cause VMA merge failure and thus render the feature useless.
In the end we all reached a satisfactory agreement upon how to move forward sensibly :)
People are very very keen to jump to 'oh the kernel broke userspace!' very quickly, but often things are more subtle and nuanced than they first appear. In this case it's understandable, but I would respectfully suggest you are mistaken in that assertion.
Posted Mar 4, 2025 11:04 UTC (Tue)
by PeeWee (guest, #175777)
[Link] (5 responses)
But what happens to such banking apps if they try to scan memory areas of applications that did opt-in to this? Wouldn't they then do <funky stuff> because their - quite questionable - assumptions don't hold anymore? Say, libc opts in and such an unmodified banking app wants to scan its memory, wouldn't it trip over this? So they don't need to opt-in to be affected by this, or am I missing/misunderstanding something here?
On a related note, I believe such banking apps should not exist to begin with IMHO, because they essentially break 2FA and try to work around that by making sure no "hacker tools" are present by using this (and other?) technique. It used to be that one was strongly discouraged - by those very banks mind you - from using the same mobile device for entering transaction data AND receiving things like (transaction bound) TANs to validate said transaction, because a compromised device is able to manipulate both the transaction and the TAN that is only valid for this one transaction. I am a bit fuzzy on the exact details but that's the gist of it; TL;DR: if you can simply enter a transaction and don't have to do an actual additional validation (TAN) there is no 2FA and thus no guarantee that you are doing the transaction you actually want instead of the one the baddies want you to commit.
What I am trying to say is that I think this should NOT be considered "reasonable userspace" and (maybe) SHOULD be broken, on purpose even. The kernel should definitely NOT accommodate such onerous app behaviour IMHO.
Posted Mar 4, 2025 12:01 UTC (Tue)
by tux3 (subscriber, #101245)
[Link]
I think there's a reasonable interpretation where this is giving libc the tools, and libc can do something sensible with the feature without necessarily breaking everything.
Concretely, I think for Android libc could reasonably gate this on Android API level ("if your app declares targetSdkVersion >= X, libc will make use of guard pages"). Every so often, the Android team increases the minimum SDK version required on their play store. So they will eventually be able to turn on guard pages unconditionally, but without taking the authors of innocent m̶a̶l̶w̶a̶r̶e anti-debug obfuscation features by surprise.
Posted Mar 6, 2025 0:08 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
But even if it does break, it's not much of a breakage anyway. You just get a less-useful "ptrace said no" error message than you would otherwise. It doesn't prevent you from doing any of the things that you would otherwise be able to do.
Posted Mar 6, 2025 11:58 UTC (Thu)
by PeeWee (guest, #175777)
[Link] (2 responses)
Posted Mar 9, 2025 0:51 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
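* root can ptrace anybody.
* You can ptrace your own processes. Or, to be more pedantically correct, processes running with the same UID can ptrace each other. (I have no idea if that is real UID, EUID, or some other UID-like thing entirely.)
* Nobody else can ptrace anything, unless they have a special capability.
* I imagine there might be a sysctl knob or something that applies further restrictions (such as turning off non-root ptracing altogether), but I don't know if such an interface really exists.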
My position is that none of this actually matters. What matters is that ptracing is a tool for developers to figure out why their app is broken. It is not a security mechanism. It is not, in fact, intended for random apps to ptrace each other just because they feel like it, or because somebody with the word "compliance" in their job title has decided that it's a good idea.
When you go around poking your nose into somebody else's memory, at runtime, in production, on real hardware that is owned by a real user, any breakage is entirely your own problem. Nobody ever promised that you could do that, and there are numerous hardening measures that can trivially break it in one way or another (for example, the user could put you or the other app behind a container, or even just a separate UID), plus you have to consider more mundane problems like userspace ASLR, static linking and LTO, no debug symbols, and so on. The whole idea is monstrously fragile and it's a miracle if it works at all.
Posted Mar 9, 2025 8:31 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Bank apps ptrace() _themselves_ to make sure there's nothing unusual injected into their address space.
Posted Jul 30, 2025 0:56 UTC (Wed)
by jepsis (subscriber, #130218)
[Link]
pagemap change now in -next :)
Thanks for pointing MADV_GUARD_*
more like this please :)
What happened to “we don't break the userspace” idea?
> I'm using some hacks using syscalls when I need to do that but that's ugly.
The kernel is giving userspace new APIs, but not breaking any pre-existing code; taking an old system and installing this new kernel will not by itself break the dodgy memory scanning code.
These really do sound like memory mines. Glad to see they’ll be mapped properly, as mandated by Conventions to safeguard civilspace.
Naming