LWN: Comments on "May the FOLL_FORCE not be with you"

thanks for the ptracer option

nix — Wed, 11 Sep 2024 11:17:43 +0000

Absolutely, though you have to take some special measures to avoid having to stop all your threads in the process of attachment :)

Julia

mrugiero — Fri, 16 Aug 2024 12:26:02 +0000

> Nevertheless the /proc/self/mem approach was our favored approach
> because it a) Required an attacker to be able to execute syscalls
> which is a taller order than getting memory write and b) didn't double
> the virtual address space requirements (as a dual mapping approach
> would).

I understand argument b) to some degree (although if your code section takes up a problematic portion of the address space, there is a bigger problem than just doubling it), but argument a) seems moot to me.
You don't protect your process by making the attack surface the whole system (which FOLL_FORCE does), and you don't protect that much against ROP if you have the exact sequence of instructions the attacker would need as part of your initialization (which the interpreter does).
So it doesn't protect significantly against local exploits, and it requires opening up a much bigger (system-wide) exploit space instead. Further, the claim that they do have the fallback also means that the code to enable the "just modify the memory" approach is one jump away.

thanks for the ptracer option

mrugiero — Fri, 16 Aug 2024 12:19:57 +0000

I believe some programs do that to fight reverse-engineering attempts, so I guess it's possible.

Julia

khim — Tue, 30 Jul 2024 08:56:14 +0000

> But editing code while it's still cached by another CPU is a very 90s approach to JIT.

And walking on Earth is so last millennium, right?

People are using the best approach that's available. JITs are editing code that's currently executing and there are no plans to change that.

Usually JITs are using two mappings (one readable, one writable) nowadays, but they absolutely do that, because no better approach has been invented yet.

P.S. And yes, code editing is limited in most JITs: they rewrite jump target addresses and add new code in places where previously there were just NOPs. But these two are not going away any time soon, because they are tightly coupled with the nature of a JIT: its name, quite literally, means “Just In Time” which means, essentially:

We are compiling small pieces of code at a time, thus we can't afford to waste a whole page on a tiny amount of produced code.
We are stitching code together “on the fly”, which means that calls to “compile-that-code-and-run-it” in already finished code are regularly replaced with calls to the finished, recompiled code.

If you find a way to beat Oracle Java's JIT, Google's ART JIT, Chrome V8's JIT, Firefox's Warp JIT and all other JITs that are using that approach with something better then it would be time to say that everyone should switch. Saying that everyone should stop doing what they are doing just because you don't like it — without offering any alternative, on the other hand, is just irresponsible.

thanks for the ptracer option

gray_-_wolf — Mon, 29 Jul 2024 09:04:37 +0000

Is it possible for program to attach to itself as a debugger satisfying the ptrace requirement?

Self-modifying code

khim — Sun, 28 Jul 2024 12:05:21 +0000

> In discussions like this I believe it is useful to go through many possible options and evaluate their pros and cons without being particularly wedded to any one of them.

Why? What has that approach brought you? What have you achieved by doing it?

I find that very strange. Things that we already have, things that exist, by definition, have priority. They are already here, they are done, that's enough. But any change from the status quo needs a justification.

Sure, I like to go “down memory lane” and see why the things that we have are the way they are. Because the situation of today is different from the situation of yesterday.

But no matter what, even if the thing that made us pick the original decision is no longer valid or even if the original decision was made on a whim without any rational justification… things that we have are very, very different from things that we don't have.

> Since it's an option that meets the specific constraint that you, yourself, chose to highlight, I felt that it should be included in the discussion.

One may invent bazillion crazy schemes if not constrained by anything. Talking about them would take forever unless we would limit these discussions, somehow.

“Anything new should come with an extra justification that explains why we should do it if some other solution already exists” is a very good rule if we are talking about something that we plan to implement. I don't know anyone who achieved anything significant while violating it (but note the subtle difference: if we don't yet have a solution at all, then someone who doesn't “know” that “X is simply impossible” may achieve something really cool… but when said X is not just possible in theory and we already know how to do X in practice, then the situation changes).

Well, maybe fiction writers would be an exception, but even they, when they construct their strange imaginary worlds, still play on that contrast between what “we” have and what “they” have. “Does it exist?” is still very much a central question that governs their decisions even if they imagine a world where something that we already have doesn't exist and where evolution of civilization, as a consequence, goes into a different direction.

> This approach appears to be incompatible with your style of arguing, so I'll just be ignoring you from now on.

Fine by me. I don't like to waste time on pointless discussions without any practical consequences (even if the consequence is minor, like “now that I have written that I may just refer people here instead of repeating my arguments again and again”) while you seem to regard these as the only ones worth pursuing.

Self-modifying code

malmedal — Sun, 28 Jul 2024 10:57:12 +0000

> You are proposing third (or fourth) one without explaining why it's better than what we already have.

In discussions like this I believe it is useful to go through many possible options and evaluate their pros and cons without being particularly wedded to any one of them. Since it's an option that meets the specific constraint that you, yourself, chose to highlight, I felt that it should be included in the discussion. This approach appears to be incompatible with your style of arguing, so I'll just be ignoring you from now on.

Self-modifying code

khim — Sun, 28 Jul 2024 10:00:28 +0000

> This is an answer, it has a cost, an expensive one, but it is an answer.

Sure, but as was mentioned in the very beginning, seven years ago there are already two other methods (three if you include self-ptrace).

You are proposing third (or fourth) one without explaining why it's better than what we already have.

This looks like “we have to do something”, “this is something”, “let's do it!” logic.

Such logic rarely produces good designs.

Self-modifying code

malmedal — Sun, 28 Jul 2024 09:31:59 +0000

> Why? What's the point?

You asked for this earlier:

>> maybe even teach kernel not to ever provide W+X mappings at all

This is an answer, it has a cost, an expensive one, but it is an answer.

Self-modifying code

khim — Sun, 28 Jul 2024 08:40:31 +0000

> If you don't want the kernel to provide userspace with a W+X mapping you can have the kernel keep the mapping to itself, but if userspace then can ask the kernel to do arbitrary changes, or have separate W and X mappings, you haven't gained that much

Yes, you did. You have made life for attackers harder. As explained in the Wikipedia article. And that article even includes section about JITs, too!

Security and usability are always at odds. There exists a 100% bullet-proof way to stop any attack, both local and remote: just turn the computer off and all kinds of attacks are prevented! But this “protection” is not very usable, thus we need something else.

> you want to limit what the kernel will do to the simplest thing that will work.

Yes, but now we need to determine what that “work” even is!

> The point is to stop the old code at a safe location

Did you actually read what I wrote? Just where have I written that a JIT wants/needs that? It has no such need. On the contrary, what a JIT needs is described precisely under the title Asynchronous modification under Cross-Modifying Code in the AMD Manual: the nature of the code being executed by the target thread is such that it is insensitive to the exact timing of the update.

JIT (or, heck, dynamic loader that resolves symbols lazily) initially inserts jump to the “slow path” because “fast path” doesn't exist, later, when “fast path” does exist jump is replaced. That's all, JIT doesn't care if “slow path” is used a few times after “fast path” is created, it just wants “eventual consistency” where programs stop using “slow path” after a few milliseconds.

And, as I have shown you, with references to AMD manual, CPUs actually offer enough relevant guarantees on the hardware level, description is so precisely tailored to the need of JITs that they could, as well, call it “JIT-friendly code modification” and not “asynchronous modification”.

And yet, you repeatedly invent complications that make things more problematic and, AFAICS, don't achieve anything security-wise. Why? What's the point?

Julia

roc — Sun, 28 Jul 2024 02:01:16 +0000

If you're only JITting one function per page then your memory usage is going to be terrible. Yes, in reality you can do more than one function at once but in practice you will seldom be able to fill a page before you need to start executing code in it.

And the first flip from writable to executable is already a problem.

Self-modifying code

malmedal — Sun, 28 Jul 2024 00:38:03 +0000

It's an answer to your previous request:

> maybe even teach kernel not to ever provide W+X mappings at all

If you don't want the kernel to provide userspace with a W+X mapping you can have the kernel keep the mapping to itself, but if userspace then can ask the kernel to do arbitrary changes, or have separate W and X mappings, you haven't gained that much, so you want to limit what the kernel will do to the simplest thing that will work. The point is to stop the old code at a safe location; on some CPUs you can set address-based breakpoints instead.

Self-modifying code

khim — Sat, 27 Jul 2024 22:56:06 +0000

> I was responding to your earlier statement:

Said statement was just a side comment to explain what JITs are doing, why, and how what they are doing is guaranteed to work. It was meant to show that the need of JITs to alter running code while it's running was acute enough and understood well enough that even hardware makers have already created special JIT-tailored guarantees there.

> Have the app ask for a writable alloc, fill it with code.

That's possible.

> Then the app tells the kernel make this executable and the safe interrupt point is xxx.

How exactly do you plan to do that? Would kernel take 10-20 bytes of generated code and allocate whole 4KB (or, worse, 16KB if we are talking about modern ARM) page to make it executable?

This sounds pretty wasteful.
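
To put rough numbers on that waste, here is a minimal sketch. The helper name `wasted_fraction` is hypothetical, and the 20-byte snippet with 4KB and 16KB pages are simply the figures mentioned in this comment; the only assumption is that each snippet gets its own page-granular allocation.

```python
def wasted_fraction(code_bytes, page_size):
    """Fraction of the allocated pages left unused by code_bytes of code."""
    if code_bytes <= 0 or page_size <= 0:
        raise ValueError("sizes must be positive")
    pages = -(-code_bytes // page_size)  # ceiling division
    return 1 - code_bytes / (pages * page_size)


for page_size in (4096, 16384):
    waste = wasted_fraction(20, page_size)
    print(f"20 bytes in a {page_size}-byte page: {waste:.2%} unused")
```

With one snippet per page, more than 99% of each page stays empty at either page size, which is the concern being raised here.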

> Then when the time comes to replace the running code the app tells the kernel to write a processor-specific safe stop sequence to the interrupt point.

So now you want to introduce something like stop-the-world interrupt in place where previously everything was completely lock-free?

Also: a JIT doesn't actually replace running code, it replaces a branch target in the running code, using a well-documented and guaranteed approach explicitly described in the CPU manual.

Why do you want to change that?

> on x86 this would likely be a sequence of eight int 3 instructions.

Why eight and what would it give us? Except more complications and more places to have bugs?

> This will stop the thread and the app can allocate a new writable alloc and tell the kernel to make it executable and have the thread continue there.

But app doesn't need that! App just simply wants to replace target of jump! Without all that complicated and useless machinery! Previously there was call COMPILE_ME_FOO and now there would be call FOO_JIT_COMPILED_AND_READY_TO_USE. That's all!

It's useless because eight int 3 instructions don't guarantee anything (x86 includes instructions longer than eight bytes and with redundant prefix you may force almost any instruction to be longer) and it's useless because write via /proc/self/mem (or use of two mappings) already does everything that's needed!

Why add an API that would be more convoluted and slower, yet not actually safer than the existing API?

It doesn't really make much sense! You gave us some elaborate solution to some unknown problem, but neglected to say what problem that solution is supposed to solve!

It's very hard to understand whether a proposal is good or bad if we have no idea what that proposal is even supposed to achieve.

As in: what was that dance with eight int 3 instructions and an additional syscall supposed to accomplish? What would it do better than a write to /proc/self/mem or two separate mappings (one writable, one executable) for the same chunk of memory?

Julia

willy — Sat, 27 Jul 2024 22:17:56 +0000

I don't understand why you'd want to flip a page back to writable. You do a preliminary fast JIT to address A, flip the page from writable to executable. Then you decide the function is hot enough and do a more thorough compilation to address B, and change all the references to address A to address B. Once all the CPUs are no longer executing the code in the address A page, free it. Or reuse it. But editing code while it's still cached by another CPU is a very 90s approach to JIT.

Self-modifying code

malmedal — Sat, 27 Jul 2024 21:52:33 +0000

I was responding to your earlier statement:

> On x86 CPUs there are special guarantee that it can be done safely if modified part fits fully into 8bytes segment (it probably goes back to 80486 CPUs because I have no idea how to explain that 8bytes limitation)

Anyway, for your current statement, just have the kernel do the tricky bit.

Have the app ask for a writable alloc, fill it with code.

Then the app tells the kernel make this executable and the safe interrupt point is xxx.

Then when the time comes to replace the running code the app tells the kernel to write a processor-specific safe stop sequence to the interrupt point.

on x86 this would likely be a sequence of eight int 3 instructions.

This will stop the thread and the app can allocate a new writable alloc and tell the kernel to make it executable and have the thread continue there.

Julia

NYKevin — Sat, 27 Jul 2024 20:30:05 +0000

They were already doing that. Specifically, the revert patch linked in the article speaks of opening /proc/self/mem, which is your own memory.

Self-modifying code

khim — Sat, 27 Jul 2024 20:21:52 +0000

> I don't remember if the 386 had an executable-only mode, but we certainly had writable memory that could be executed.

The main issue that we are discussing here revolves around the NX bit, which allows one to mark memory as non-executable!

On the 386 the only way to make memory non-executable was to play with segments and their limits. On a Unix-like OS the best you can do is split the 4GB of virtual address space in two: a non-executable area and an executable area.

That means that the approach that skissane talks about is just simply not possible on the 386! Except if you use an extremely weird OS which doesn't use paging, but uses segments for virtual memory.

Such OSes may exist, in theory, but I certainly know of none that actually did this in practice; that's why I became so excited when you said you did that with the 386.

But it looks more and more likely that you haven't done what we are talking here about at all and are talking about entirely different situation.

> They are recommending a sort of RCU like approach to avoid this:

> Note that since stores to the instruction stream are observed by the instruction fetcher in program order, one can do multiple modifications to an area of the target thread's code that is beyond reach of the thread's current control flow, followed by a final asynchronous update that alters the control flow to expose the modified code to fetching and execution.

That just happens with JITs automatically: once you have created an optimized version of a routine there is rarely a need to go back to the interpreter. But yeah, usually only one call/jmp instruction is patched.

> It's not a fragile thing to do, it is even supported by ld, see the -N option. Seems that interferes with shared libraries, so if you want that you need to use mprotect.

Again: that's different. Keeping something in the write+execute mode is dangerous WRT exploits, but not fragile, but playing with permissions and flipping from read+write to read+execute and back is pretty fragile because you need to ensure that code that you want to patch is not executed on the other core!

> I believe it fell out of favour because the performance advantage became much less when the 486 came with on-chip cache.

No, it fell out of favor much later, when people started caring about security and started enforcing W^X property.

First with segment limit tricks and then, later, with hardware NX bit.

Only Apple, and only on iOS, enforces it so radically as to make JITs simply impossible; other OSes provide ways for JITs to work, which is what we are discussing here.

But all this discussion is happening in a W^X world!

Why do you keep bringing up W+X examples and keep saying that you can do everything easily if only you remove that restriction… of course it's possible to do, what could be simpler?

That's simply not what we are discussing here! The idea is to ensure that W^X is strictly enforced, maybe even teach kernel not to ever provide W+X mappings at all — and yet still keep JITs working, somehow.

thanks for the ptracer option

Heretic_Blacksheep — Sat, 27 Jul 2024 17:21:23 +0000

I agree. This is one low hanging fruit down, several left to go, from a security POV. The compromise is acceptable - and hopefully distros will default to "ptrace" in the future after a transition period thus eventually forcing software that can to clean up their acts and responsibly notify users when they functionally can't how to re-enable the old behavior, much as how it worked with OpenBSD's W^X transition some years back.

Self-modifying code

malmedal — Sat, 27 Jul 2024 17:03:40 +0000

Thank you.

So, reading from page 206, you are talking about asynchronous modification. 8 bytes is not because of the 486, it is because 8 bytes is 64 bits, the size that gets atomically updated. Also, the 64 bits must be aligned.

It is basically warning about the situation where the instruction pointer is in the middle of a 64bit quad word when the quad word gets updated, if the instruction boundary changes so that the IP is not actually at the start of an intended instruction you have a problem.

They are recommending a sort of RCU like approach to avoid this:

> Note that since stores to the instruction stream are observed by the instruction fetcher in program order, one can do multiple modifications to an area of the target thread's code that is beyond reach of the thread's current control flow, followed by a final asynchronous update that alters the control flow to expose the modified code to fetching and execution.

Reading a bit further, on synchronous modification where the target thread is waiting while the other thread is writing, the rules are the same as before, you can make whatever changes you want, but the target thread must execute a serialising instruction.
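
As a small illustration of the alignment constraint described above, the sketch below checks whether a patch of a given length stays inside a single naturally aligned 8-byte word. The helper name `fits_in_aligned_qword` and the example addresses are hypothetical; the rule itself (one aligned quadword) is taken from this comment's reading of the AMD manual, not verified against it here.

```python
def fits_in_aligned_qword(addr, length):
    """True if bytes [addr, addr + length) lie within one aligned 8-byte word."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_word = addr // 8
    last_word = (addr + length - 1) // 8
    return first_word == last_word


# A 4-byte jump displacement starting at an address ending in ...4 fits:
print(fits_in_aligned_qword(0x1004, 4))
# The same displacement starting at ...5 straddles two quadwords:
print(fits_in_aligned_qword(0x1005, 4))
```

Under that rule, a JIT that wants to patch a jump target with a single store would need to place the patched bytes so that the first case holds, for example by padding the code before the instruction.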

I don't remember if the 386 had an executable-only mode, but we certainly had writable memory that could be executed.

It's not a fragile thing to do, it is even supported by ld, see the -N option. Seems that interferes with shared libraries, so if you want that you need to use mprotect.

I believe it fell out of favour because the performance advantage became much less when the 486 came with on-chip cache.

Julia

khim — Sat, 27 Jul 2024 15:36:07 +0000

> We certainly used to be able to do that. It gave a nice speedup on the 386 when you could save a register by having a mov #immediate, register and modify the immediate as needed.

386 doesn't even have “executable” bit in its page tables. Were you using segments? I guess this may work with segments since they cache permission in CPU registers. Still sounds very tricky and fragile to me.

Are you even talking about changing it from writable to executable using mprotect (and in an SMP environment) when you say “we used to be able to do that”?

> The old rule was you needed a jump or taken branch instruction to be sure the pipeline was clear after you modified executable code, later you could do any "serialising" instruction, like cpuid, instead.

I think we are talking past each other, again. I'm talking about execution of the code that you are planning to patch by some other CPU core. Were these 386 systems that you are talking about even SMP ones? On a UP system things are much, much, MUCH simpler. But on SMP systems, when you are patching code that another CPU may be executing at this precise moment, you need atomicity guarantees or a complicated and convoluted scheme that would ensure that the code you are planning to patch is not currently executing.

> Reference for this?

Look for the Asynchronous modification under Cross-Modifying Code in the AMD Manual. Intel provides more or less the same guarantees, but I don't remember which section of the manual describes that.

Julia

malmedal — Sat, 27 Jul 2024 15:04:48 +0000

> You can not do that to code that's already compiled and, more importantly, is currently executing.

We certainly used to be able to do that. It gave a nice speedup on the 386 when you could save a register by having a mov #immediate, register and modify the immediate as needed.

> On x86 CPUs there are special guarantee that it can be done safely if modified part fits fully into 8bytes segment

Reference for this? The old rule was you needed a jump or taken branch instruction to be sure the pipeline was clear after you modified executable code, later you could do any "serialising" instruction, like cpuid, instead.

Julia

khim — Sat, 27 Jul 2024 14:12:18 +0000

> Why not just write to the page normally, and then change it from writable to executable using mprotect? Wouldn’t that be simpler?

You can not do that to code that's already compiled and, more importantly, is currently executing.

> I thought that was what most JITs did.

Most “serious” JITs also do what Julia does. When you JIT-compile one function you can't be sure that the other functions called from it need to be JIT-compiled too: if they handle some exceptional conditions then they may never be called, and the JIT compiler needs to be fast, otherwise it may even be slower than the interpreter in some cases!

That's why instead of putting call to JIT-compiled function they put a call to “compile me later” thunk (or, in some JITs, to the interpreter, I'm not sure what exactly Julia uses).

But when you have fully-optimized version of function that call to “compile me later” thunk is now just pure overhead! You can go eliminate it… but for that you need to patch already compiled (and, presumably, executing!) code.
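
The “compile me later” pattern can be sketched as a loose analogy in Python, where rebinding one attribute stands in for patching a call target in machine code. All names here (`CallSite`, `compile_square`) are hypothetical, and nothing in this sketch reflects how Julia or any other JIT is actually implemented.

```python
class CallSite:
    """A call slot that starts on the slow path and patches itself once."""

    def __init__(self, name, compile_fn):
        self.name = name
        self._compile = compile_fn
        self.compilations = 0
        # Initially the slot points at the "compile me later" thunk.
        self.target = self._compile_me_later

    def _compile_me_later(self, *args):
        fast = self._compile(self.name)
        self.compilations += 1
        # The "patch": one reference is swapped, later calls skip the thunk.
        self.target = fast
        return fast(*args)

    def call(self, *args):
        return self.target(*args)


def compile_square(name):
    # Stand-in for a real compiler: returns the "optimized" function.
    return lambda x: x * x


site = CallSite("square", compile_square)
first = site.call(3)   # goes through the thunk and triggers compilation
second = site.call(4)  # goes straight to the compiled function
print(first, second, site.compilations)
```

A caller that still sees the old target for a moment simply takes the slow path once more, which is the “eventual consistency” property mentioned elsewhere in this thread.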

On x86 CPUs there are special guarantee that it can be done safely if modified part fits fully into 8bytes segment (it probably goes back to 80486 CPUs because I have no idea how to explain that 8bytes limitation). Most other CPUs don't need this trick because they can't embed the address into one instruction anyway, so they load the address from memory... but that memory has to be in the same page on some CPUs because of limitations of the instruction encoding!

You may guess how important it is for performance from just a simple fact: Intel APX added a special new jump format to make that patching easier!

And yes, Julia is not an exception, almost all JITs are doing that, they are just using two mappings because that's simple and cross-platform way of doing that.

What's in a name?

intelfx — Sat, 27 Jul 2024 11:33:28 +0000

> follow_page()

Ah, so that’s what was under all those turtles!

Thanks.

Julia

roc — Sat, 27 Jul 2024 10:11:54 +0000

I already mentioned above that it's generally harder for an exploit to perform a system call with the right parameters than to just write to the desired address.

Julia

pm215 — Sat, 27 Jul 2024 08:26:36 +0000

The commit message to the commit where Linus reverted his initial patch has a quote presumably from one of the Julia developers which gives some context here (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...) :

"We used these semantics as a hardening mechanism in the julia JIT. By
opening /proc/self/mem and using these semantics, we could avoid
needing RWX pages, or a dual mapping approach. We do have fallbacks to
these other methods (though getting EIO here actually causes an assert
in released versions - we'll updated that to make sure to take the
fall back in that case).

Nevertheless the /proc/self/mem approach was our favored approach
because it a) Required an attacker to be able to execute syscalls
which is a taller order than getting memory write and b) didn't double
the virtual address space requirements (as a dual mapping approach
would)."

Julia

mb — Sat, 27 Jul 2024 07:09:27 +0000

> while the page is writable another thread could modify it to include malicious code.

yes, well. But that's also possible with /proc/self/mem + FOLL_FORCE, as the article explains.
So switching to it doesn't solve that problem.

Julia

roc — Sat, 27 Jul 2024 01:35:07 +0000

Flipping pages between writable and executable has a couple of problems. There's performance overhead, since removing prot bits from a page requires IPIs to all other CPUs with the page mapped. And there are security issues, since while the page is writable another thread could modify it to include malicious code.
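
For reference, the flip being discussed can be sketched as follows, assuming Linux with `/proc/self/maps` available. `libc` is reached through `ctypes`; the helper `perms_of` is hypothetical and only reads back the permissions the kernel reports for the page.

```python
import ctypes
import mmap

PAGE = mmap.PAGESIZE
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mprotect.restype = ctypes.c_int
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]


def perms_of(addr):
    """Return the permission string of the mapping that contains addr."""
    with open("/proc/self/maps") as maps:
        for line in maps:
            span, perms = line.split()[:2]
            start, end = (int(x, 16) for x in span.split("-"))
            if start <= addr < end:
                return perms
    return None


# Step 1: a writable, non-executable page that the JIT fills with code.
addr = libc.mmap(None, PAGE, mmap.PROT_READ | mmap.PROT_WRITE,
                 mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
if addr is None or addr == ctypes.c_void_p(-1).value:
    raise OSError(ctypes.get_errno(), "mmap failed")
before = perms_of(addr)
ctypes.memmove(addr, b"\xc3", 1)  # placeholder byte (x86 `ret`)

# Step 2: flip the page to executable and non-writable.
if libc.mprotect(addr, PAGE, mmap.PROT_READ | mmap.PROT_EXEC) != 0:
    raise OSError(ctypes.get_errno(), "mprotect failed")
after = perms_of(addr)
print(before, "->", after)
```

Between the two steps the page is writable by every thread in the process, which is the window described above.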

Julia

skissane — Sat, 27 Jul 2024 00:30:59 +0000

Why not just write to the page normally, and then change it from writable to executable using mprotect? Wouldn’t that be simpler? I thought that was what most JITs did.

thanks for the ptracer option

roc — Fri, 26 Jul 2024 22:54:40 +0000

Phew, I'm glad the discussion arrived at a solution that supports unlimited access for ptracers. It would have been hugely problematic if people had to choose between debuggers working and no protection at all.

Julia

roc — Fri, 26 Jul 2024 22:53:08 +0000

It's for the Julia JIT. JITs want to dynamically generate code and execute it, which happens to be the same thing that exploits want to do. So Julia's JIT allocates some read-only executable pages and then writes to them using /proc/.../mem. That's safer than making the pages directly writable, because it's harder for exploits to make that appropriate write system call than to write to memory directly.
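
A minimal sketch of that kind of write, assuming Linux. It maps a read-only page and then tries to change it through `/proc/self/mem`. This is not Julia's code: the helper names are hypothetical, and whether the write is accepted depends on the kernel and its configuration, so the sketch reports both outcomes instead of assuming success.

```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]


def map_readonly_page():
    """Map one anonymous page that normal stores cannot modify."""
    addr = libc.mmap(None, PAGE, mmap.PROT_READ,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == ctypes.c_void_p(-1).value:
        raise OSError(ctypes.get_errno(), "mmap failed")
    return addr


def write_via_proc_mem(addr, data):
    """Try to write data at addr via /proc/self/mem; True if accepted."""
    try:
        fd = os.open("/proc/self/mem", os.O_RDWR)
    except OSError:
        return False
    try:
        return os.pwrite(fd, data, addr) == len(data)
    except OSError:
        # For example EIO when the kernel refuses to override protections.
        return False
    finally:
        os.close(fd)


page = map_readonly_page()
payload = b"\xc3"  # placeholder byte (x86 `ret`)
forced = write_via_proc_mem(page, payload)
contents = ctypes.string_at(page, len(payload))
print("forced write accepted:", forced, "page now holds:", contents)
```

The page never becomes writable from the process's own point of view; the only way in is the system call, which is the property described above.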

Julia

willy — Fri, 26 Jul 2024 22:42:35 +0000

It could operate on its own address space instead of on the address space of another process?

Julia

acarno — Fri, 26 Jul 2024 22:12:19 +0000

I'm no expert by any means, but it looks like Julia allows both interpreted execution (like Python) as well as just-in-time compilation. It's a consequence of being a language designed for high-performance numerical analysis - it aims to support heavy number crunching as well as quick scripts. They presumably want to be able to live-patch a process as it runs (I'm not sure how you could do this otherwise).

Julia

rgb — Fri, 26 Jul 2024 20:48:52 +0000

Could someone enlighten me what is so special about Julia that it needs this giant security hole to operate properly?

What's in a name?

pbonzini — Fri, 26 Jul 2024 19:36:22 +0000

> That flag causes the write to succeed, regardless of whether the normal memory protections at the target address would allow writing

What's in a name?

nickodell — Fri, 26 Jul 2024 18:26:55 +0000

What does FORCE stand for? What check is being ignored?

What's in a name?

abatters — Fri, 26 Jul 2024 16:26:20 +0000

"Follow", as in the prefix of bitwise-or flag arguments passed to follow_page(), the kernel function that uses page tables to look up a struct page * given a struct vma and a virtual address.

What's in a name?

intelfx — Fri, 26 Jul 2024 16:03:29 +0000

I'll bite. What does FOLL_ stand for, anyway?