The intersection of shadow stacks and CRIU

December 16, 2022

This article was contributed by Mike Rapoport

Shadow stacks are one of the methods employed to enforce control-flow integrity and thwart attackers; they are a mechanism for fine-grained, backward-edge protection. Most of the time, applications are not even aware that shadow stacks are in use. As is so often the case, though, life gets more complicated when the Checkpoint/Restore in Userspace (CRIU) mechanism is in use. Not breaking CRIU turns out to be one of the big challenges facing developers working to get user-space shadow-stack support into the kernel.

The idea behind shadow stacks is simple: in addition to the normal program stack (which holds return addresses, local variables, and more) there is a special memory area, called the "shadow stack", that stores only return addresses. Whenever a CALL instruction is executed, the return address is pushed onto both the normal and the shadow stacks. When, later, a function ends with a RET instruction, the return address that's popped from the normal stack is compared to that on the shadow stack. If they match, the execution continues; if they don't, a violation of control-flow integrity has just been detected.

Recent x86 processor models implement shadow stacks in hardware, meaning that no instrumentation is required for a program to get the protection that shadow stacks provide and that the cost of using a shadow stack is negligible. Once the feature is enabled, the CPU takes care of pushing and popping the return address on the shadow stack and comparing the return addresses. If the return addresses do not match, the CPU generates a control protection exception. To support shadow stacks, the x86 architecture has been extended with a model-specific register (MSR) that controls the use of the shadow stack and its features. There are also shadow-stack pointer MSRs (one for each possible privilege level) and a set of instructions for manipulating shadow-stack contents.
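The push-and-compare behavior described above can be illustrated with a small model. This is only a sketch in Python, not a description of the hardware: `ToyCpu`, `ControlProtectionFault`, and the addresses are hypothetical names and values invented for the illustration.

```python
class ControlProtectionFault(Exception):
    """Models the control protection exception raised on a mismatch."""


class ToyCpu:
    """Toy model of CALL/RET with a shadow stack; not real hardware behavior."""

    def __init__(self):
        self.stack = []         # normal stack (return addresses only, for brevity)
        self.shadow_stack = []  # shadow stack: return addresses only

    def call(self, return_address):
        # CALL pushes the return address onto both stacks.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self):
        # RET pops both copies and compares them.
        from_stack = self.stack.pop()
        from_shadow = self.shadow_stack.pop()
        if from_stack != from_shadow:
            raise ControlProtectionFault(
                f"return to {from_stack:#x}, shadow stack says {from_shadow:#x}"
            )
        return from_stack


def demo():
    cpu = ToyCpu()
    cpu.call(0x401000)
    cpu.call(0x401234)
    assert cpu.ret() == 0x401234   # matched pair: execution continues

    # An attacker overwrites the return address on the normal stack only.
    cpu.stack[-1] = 0xDEADBEEF
    try:
        cpu.ret()
    except ControlProtectionFault:
        return "violation detected"
    return "violation missed"
```

In the model, as in the description above, an attacker who can only write to the normal stack cannot make the two copies agree.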

The discussion about how the kernel should support shadow stacks for user space started a long time ago, but it has still not concluded. One of the difficulties in enabling this feature is the possibility that some applications will be broken by shadow stacks because they use non-standard ways to change their control flow. The list of problematic applications includes GDB, various JIT engines, and, of course, CRIU.

CRIU and shadow stacks

CRIU is known for its intimate relations with the kernel and its use of obscure kernel interfaces. Among other things, CRIU has to intervene in the control flow of the tasks to be checkpointed in order to extract the information that cannot be obtained by other means (such as from the /proc file system). When restoring a saved process, CRIU has to be able to recreate its state as it was at checkpoint time, so if the process had a shadow stack enabled, that shadow stack has to be restored exactly as it was before the checkpoint.

To checkpoint (or "dump" in CRIU jargon) a process, CRIU injects a blob with parasite code into the target to get parts of the process state that are not visible from the outside or which can only be obtained slowly and painfully. To inject the parasite, CRIU stops the task with ptrace(), finds a free area in the task's memory layout, puts the parasite code there, and makes the task jump into that code. So far, there are no conflicts with shadow-stack enforcement because, after the parasite starts running, the CALL and RET instructions within the parasite are properly paired.

There is, however, a problem when the parasite's job is done and the normal process execution should be resumed. CRIU uses the sigreturn() system call, which is normally only invoked at the end of a signal handler, to "cure" the task of the parasite and resume its normal execution. This operation could be done with ptrace(), but sigreturn() reduces the synchronization complexity between CRIU and the parasite and, more importantly, allows the task to continue even if CRIU itself fails.

The implementation of sigreturn() in the kernel takes special measures to ensure that its usage does not violate shadow-stack integrity. Whenever the kernel needs to deliver a signal to a process, it sets up the return frame that will be used when the signal handler concludes; it also pushes some data to the shadow stack and then verifies the integrity of that data when sigreturn() is called. Since CRIU uses sigreturn() directly, without any signal being delivered to the process that is being dumped, it has to tweak the shadow-stack contents to match the state expected by the kernel. The shadow-stack pointer is modified using a couple of ptrace() calls that are part of the latest API proposed for shadow-stack enablement; the shadow-stack contents can already be adjusted using existing ptrace() calls. This shadow-stack modification is performed early during parasite injection in order to preserve the ability to resume normal task execution if anything goes wrong.
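A toy model may help show why a bare sigreturn() fails and what the tracer has to arrange. It is a sketch under loud assumptions: the real data that the kernel pushes is not described here, so `SIGFRAME_TOKEN` is a hypothetical stand-in, and `SignalTask`, `deliver_signal()`, and `tracer_prepare_sigreturn()` are invented names that model behavior rather than any kernel or CRIU interface.

```python
SIGFRAME_TOKEN = "sigframe-token"  # hypothetical stand-in for the kernel's data


class ShadowStackError(Exception):
    """Models the kernel rejecting a sigreturn() it cannot validate."""


class SignalTask:
    def __init__(self, return_addresses):
        self.shadow_stack = list(return_addresses)


def deliver_signal(task):
    # On signal delivery the kernel also pushes its own data to the shadow stack.
    task.shadow_stack.append(SIGFRAME_TOKEN)


def sigreturn(task):
    # sigreturn() verifies that data before resuming the interrupted code.
    if not task.shadow_stack or task.shadow_stack[-1] != SIGFRAME_TOKEN:
        raise ShadowStackError("no valid signal data on the shadow stack")
    task.shadow_stack.pop()


def tracer_prepare_sigreturn(task):
    # What the tracer must arrange from outside (via ptrace() in the article):
    # make the shadow stack look as if a signal had been delivered.
    task.shadow_stack.append(SIGFRAME_TOKEN)


def demo():
    results = []

    normal = SignalTask([0x401000])
    deliver_signal(normal)
    sigreturn(normal)                      # ordinary signal path works
    results.append(normal.shadow_stack == [0x401000])

    dumped = SignalTask([0x401000])
    try:
        sigreturn(dumped)                  # no signal was ever delivered
        results.append(False)
    except ShadowStackError:
        results.append(True)

    tracer_prepare_sigreturn(dumped)       # shadow stack adjusted first
    sigreturn(dumped)
    results.append(dumped.shadow_stack == [0x401000])
    return results
```

The third case corresponds to the adjustment described above: once the shadow stack matches what sigreturn() expects, the task resumes with its original return addresses intact.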

Once parasite injection and removal are handled, dumping a process with a shadow stack enabled is simple. The only difference from a "normal" dump is the need to save the shadow-stack enable/disable state and the shadow-stack pointer, both of which can be easily done with the ptrace() calls. The shadow-stack memory area is saved exactly as any other anonymous memory and does not require any special care.

CRIU restore

Restoring a process with a shadow stack is slightly more involved than dumping. When CRIU restores a process tree, it creates all of the tasks and threads found in the checkpoint and then modifies them so that their state exactly matches the state that was saved at dump time. After the state of each thread is restored, CRIU sets up a sigreturn() frame for each thread, cleans up leftovers of the original CRIU process, and calls sigreturn() to restart the execution of the restored tasks. In order to restore the shadow stack, CRIU needs to be able to map the shadow-stack memory at exactly the same address as it was before the dump. CRIU also needs a way to efficiently populate the contents of the shadow stack with the saved data and the ability to set the shadow-stack control bits and pointer. Additionally, the kernel API lets the C library and program loader lock various shadow-stack features; CRIU must thus be able to ensure that these feature locks are kept after a restore.

Since shadow-stack memory is somewhat special, the virtual memory area for it should be created with the proposed map_shadow_stack() system call (described in this article) rather than with mmap(). Shadow-stack memory is read-only and it cannot be remapped. Based on the feedback from the CRIU developers, the latest version of the kernel patches that enable shadow stacks for user space allows passing a desired address to the map_shadow_stack() system call. This allows CRIU to map the shadow stack of the restored processes exactly where it was before the dump.

As a result of the way CRIU recreates the process's memory layout and restores its memory contents, mapping shadow-stack memory requires some additional care beyond having it at the correct address. To avoid conflicts between the memory layouts of CRIU and the restored process, CRIU reserves enough virtual memory to hold all of the restored process's memory areas, partially populates that memory, and then uses mremap() to map chunks of the reserved area to the appropriate addresses; it then finishes restoring the memory contents. The remapping happens late in the restore process and, since the shadow-stack memory cannot be remapped, it has to be created after the memory layout is nearly finalized; otherwise map_shadow_stack() could clobber an existing mapping.
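The ordering constraint can be sketched with a toy address space. Everything here is an assumption made for illustration: `ToyAddressSpace`, `restore_layout()`, and the addresses are hypothetical, the whole staging area is moved in one step rather than in chunks, and the model refuses an overlapping mapping (to make the conflict visible) where the text above warns that an existing mapping could be clobbered.

```python
class ToyAddressSpace:
    """Named, non-overlapping [start, end) ranges; a toy, not the kernel's VMAs."""

    def __init__(self):
        self.areas = {}

    def _conflict(self, start, end, ignore=None):
        for name, (s, e) in self.areas.items():
            if name != ignore and start < e and s < end:
                return name
        return None

    def map_at(self, name, start, length):
        clash = self._conflict(start, start + length)
        if clash is not None:
            raise ValueError(f"{name} would overlap {clash}")
        self.areas[name] = (start, start + length)

    def move(self, name, new_start):
        # Models mremap(): relocate an existing area to its final address.
        s, e = self.areas[name]
        clash = self._conflict(new_start, new_start + (e - s), ignore=name)
        if clash is not None:
            raise ValueError(f"{name} would overlap {clash}")
        self.areas[name] = (new_start, new_start + (e - s))


def restore_layout(shadow_stack_first):
    space = ToyAddressSpace()
    # Staging area reserved by the restorer; in this example it happens to
    # cover the address where the shadow stack lived before the dump.
    space.map_at("staging", 0x7000_0000, 0x20_0000)
    shstk_addr, shstk_len = 0x7010_0000, 0x1_0000

    if shadow_stack_first:
        space.map_at("shadow_stack", shstk_addr, shstk_len)  # too early

    # Ordinary memory is moved out of the staging area to its final address.
    space.move("staging", 0x5000_0000)

    # Only now is the original shadow-stack address free.
    if not shadow_stack_first:
        space.map_at("shadow_stack", shstk_addr, shstk_len)
    return space.areas["shadow_stack"]
```

Calling `restore_layout(shadow_stack_first=True)` raises `ValueError` in this model, while the late variant places the shadow stack at its original address. Ordinary areas can be created anywhere and moved later; the shadow stack cannot, so it has to wait.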

Once the shadow stack has been put into the correct place, CRIU switches the shadow-stack pointer to it using the x86 RSTORSSP and SAVEPREVSSP instructions. At this point, the shadow stack can be populated with the WRUSS instruction. After restoring the saved shadow-stack data, CRIU uses WRUSS again to set up a frame for sigreturn() that will later resume normal execution of the restored tasks.

Restoring the shadow-stack contents could also be done with ptrace(), but user-space stacks can grow quite deep and there may be a lot of threads, so restoring shadow-stack contents that way would involve complex synchronization between the CRIU control process and the tasks being restored. Additionally, filling memory with ptrace() is terribly slow. Although WRUSS is not as efficient as memcpy(), it is still much faster than ptrace(). Before WRUSS can be used, though, it must be enabled in the shadow-stack control register, where it is disabled by default. CRIU can enable WRUSS before restoring the shadow-stack memory with an arch_prctl() call that allows manipulating bits in the shadow-stack control MSR, and switch it back off before letting the restored tasks run.
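The enable, populate, and disable sequence can be modeled as follows. This is a sketch only: `ToyShadowStack` and `restore_contents()` are hypothetical names, the 8-byte word size and downward growth are assumptions of the model, and setting the `write_enabled` flag merely stands in for the arch_prctl() call mentioned above.

```python
class ToyShadowStack:
    """Word-addressed toy shadow stack with a write-enable control bit."""

    WORD = 8  # assumed word size in bytes

    def __init__(self, base, words):
        self.base = base
        self.memory = [0] * words
        self.write_enabled = False               # disabled by default
        self.pointer = base + self.WORD * words  # empty stack: just past the top

    def write_word(self, address, value):
        # Models the shadow-stack write instruction: it faults unless enabled.
        if not self.write_enabled:
            raise PermissionError("shadow-stack writes are disabled")
        self.memory[(address - self.base) // self.WORD] = value


def restore_contents(shstk, saved_words, sigreturn_frame):
    """Write saved_words at the top of the stack and the frame just below.

    Both sequences are given lowest address first.
    """
    words = list(sigreturn_frame) + list(saved_words)
    if len(words) > len(shstk.memory):
        raise ValueError("saved contents do not fit in the shadow stack")
    start = shstk.base + shstk.WORD * (len(shstk.memory) - len(words))

    shstk.write_enabled = True           # enable writes for the restore only
    try:
        for i, value in enumerate(words):
            shstk.write_word(start + shstk.WORD * i, value)
    finally:
        shstk.write_enabled = False      # switched off before the task runs
    shstk.pointer = start
```

A possible use: `restore_contents(ToyShadowStack(0x7F00_0000, 8), [0x401234, 0x401000], ["frame"])` leaves the two saved return addresses at the top of the stack with the frame below them. The `try`/`finally` reflects the point made above that write access should not remain enabled once the restored task starts running.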

The last task that CRIU has to take care of is the locking of the shadow-stack features. The GNU C Library (glibc) will enable shadow stacks for a process if it finds certain bits in the ELF header of the running program, and will disable the feature if these bits are absent. Once the shadow stack is enabled or disabled, glibc locks its state with an arch_prctl() call. The same call allows locking the state of WRUSS enablement but, at the moment, glibc does not use it. The feature locks are inherited across a clone() call so, if CRIU runs with shadow stacks enabled, it cannot restore a process that has shadow stacks disabled; similarly, if CRIU starts without the shadow stack, it has no way to enable it after clone()ing the restored tasks. To resolve this problem, the proposed kernel API introduces another arch_prctl() call that will unlock the shadow-stack features. This call is only available via ptrace(), so an attacker won't be able to disable shadow stack from within a process. With this arch_prctl() call, CRIU can control the shadow-stack feature locks for the clone()ed tasks and then reset them to the final, secure state after the shadow stack is restored.
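The locking rules can be summarized in a toy state machine. This is a sketch, not the kernel interface: `LockableTask` and its methods are hypothetical, and the `via_ptrace` flag is just a modeling device for "the request comes from a tracer rather than from the task itself".

```python
class LockableTask:
    """Toy model of the shadow-stack enable bit and its lock."""

    def __init__(self, shstk_enabled):
        self.shstk_enabled = shstk_enabled
        self.locked = False

    def set_shstk(self, enabled):
        if self.locked:
            raise PermissionError("shadow-stack state is locked")
        self.shstk_enabled = enabled

    def lock(self):
        self.locked = True

    def unlock(self, via_ptrace):
        # Unlocking is only possible from outside the task.
        if not via_ptrace:
            raise PermissionError("unlock is only available to a tracer")
        self.locked = False

    def clone(self):
        # Feature state and locks are inherited by the new task.
        child = LockableTask(self.shstk_enabled)
        child.locked = self.locked
        return child


def restore_task_without_shadow_stack():
    restorer = LockableTask(shstk_enabled=True)
    restorer.lock()                       # done by the C library at startup
    child = restorer.clone()

    blocked = []
    for attempt in (lambda: child.set_shstk(False),
                    lambda: child.unlock(via_ptrace=False)):
        try:
            attempt()
            blocked.append(False)
        except PermissionError:
            blocked.append(True)

    child.unlock(via_ptrace=True)         # the tracer lifts the lock
    child.set_shstk(False)                # state from the checkpoint
    child.lock()                          # back to the final, locked state
    return blocked, child.shstk_enabled, child.locked
```

In this model the child can neither change the feature nor lift the lock on its own; only the tracer path succeeds, after which the lock is put back.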

Conclusions

Shadow stacks on the x86 architecture provide efficient protection against return-oriented programming (ROP) and similar attacks, but their use necessitates updates of certain applications. Hopefully, CRIU's experience with shadow stacks will be useful to other projects that need to address shadow-stack compatibility issues. Enabling shadow-stack support in CRIU revealed several gaps in the earlier versions of the proposed kernel APIs, and the initial implementation of shadow-stack support in CRIU relied on API extensions that were not included in the original kernel patches. The latest version of those patches has incorporated feedback from the CRIU developers and has all the necessary knobs to support checkpoint and restore of applications with shadow stacks.

The intersection of shadow stacks and CRIU

Posted Dec 16, 2022 14:19 UTC (Fri) by nix (subscriber, #2304)

This call is only available via ptrace(), so an attacker won't be able to disable shadow stack from within a process.

Surely if you can call any syscall (say, if this was a new call) you can call ptrace() on yourself, and thus get access to this facility anyway? (Or, rather, the attacker can cause the process to call ptrace() on itself).

(But maybe, like many ptrace() operations, it can only be called on a seized process, in which case an attacker would have to provoke a new thread into seizing and unlocking all the others, and if an attacker can do *that* it's presumably already won.)

The intersection of shadow stacks and CRIU

Posted Dec 16, 2022 14:35 UTC (Fri) by mathstuf (subscriber, #69389)

It seems that self-ptracing is a fraught concept and needs to be done carefully. I think if you're forking threads in a process with this specific setup, you've already lost quite a lot.

https://yarchive.net/comp/linux/ptrace_self_attach.html

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 12:50 UTC (Sun) by smurf (subscriber, #17840)

Well, as the linked email describes, it's trivial to clone(CLONE_VFORK|CLONE_MM) into a new thread – which can use ptrace() on its parent at will.

The intersection of shadow stacks and CRIU

Posted Jan 5, 2023 12:32 UTC (Thu) by anton (subscriber, #25547)

ptrace() on the parent or other non-children has been disabled by default on (at least) Ubuntu for a number of years, which has led to breakage.

Injection bootstrapping

Posted Dec 16, 2022 15:44 UTC (Fri) by izbyshev (guest, #107996)

> To inject the parasite, CRIU stops the task with ptrace(), finds a free area in the task's memory layout, puts the parasite code there, and makes the task jump into that code.

Given that writable+executable mappings are unlikely to exist in the target process, how does this part actually work? Does CRIU bootstrap the injection by reusing the existing code, e.g. by finding the syscall instruction in executable memory and "calling" it (via ptrace()) with appropriate arguments, or something?

Injection bootstrapping

Posted Dec 16, 2022 23:32 UTC (Fri) by nix (subscriber, #2304)

It creates a new mapping by injecting a call to mmap() first :) this is even documented! <https://criu.org/Parasite_code>

Injection bootstrapping

Posted Dec 19, 2022 14:44 UTC (Mon) by izbyshev (guest, #107996)

Well, no, it's not possible to answer the question "how does CRIU bootstrap code execution in the target" by "it executes mmap() first" :)

I looked at the code, and it seems that code for the first syscall in the target is written to the location computed by get_exec_start() [1], which simply finds the first large enough executable (but not necessarily writable) mapping. And then I realized my mistake that led to my original question: I assumed that PTRACE_POKEDATA uses the same page permissions that the normal user-space code does, which would make it impossible for CRIU to write code to a read-only mapping. But no, in fact, the kernel overrides the permissions [2], so a code-reuse "attack" is not needed for CRIU.

[1] https://github.com/checkpoint-restore/criu/blob/008c2b9c7...
[2] https://elixir.bootlin.com/linux/v6.1/source/kernel/ptrac...

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 11:33 UTC (Sun) by jorgegv (subscriber, #60484)

I have never quite understood why a functionality which mainly consists of dumping all kernel data and process state to a file, and later using that for restoring, is to be done from user space.

Since the kernel is in a privileged position to dump all that data, why not do It from there?

Instead of defining hundreds of interfaces and tricks for extracting that info from US, why not just dump everything from KS to e.g. a netlink socket?

Surely a serialization format, protocol, etc. should be designed, but it seems to me this would be easier than implementing dozens of interfaces for extracting info to US.

Also, since the extracted info is so tied to the kernel, keeping the CR functionality as an integral part of the kernel source would seem like the logical thing to do...

Of course, most probably the relevant people already thought about this and decided the current way was the best instead of what I propose...

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 13:17 UTC (Sun) by mathstuf (subscriber, #69389)

As a gut check, this feels like it would make it way too easy to bake some kernel structure layout/size/whatever into the userspace ABI by accident.

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 15:36 UTC (Sun) by jorgegv (subscriber, #60484)

Ahhh, this seems a good reason to me. Since the Linux policy is to never ever break US, you are right.

The data output by the kernel with my schema should be considered an opaque blob, but it also should be examined to account for different versions, etc., which introduces the risks you mention.

I kind of get It. This seems a case where an ABI policy similar to the BSD's, where KS and US are maintained in lockstep, would make things much easier for CR features...

Thanks for the point.

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 16:12 UTC (Sun) by saladin (subscriber, #161355)

As an additional reason against having the kernel dump the data, depending on how the kernel serializes the data, it could result in leaking struct layouts if those are randomized. I'm sure the security folks wouldn't like that.

The intersection of shadow stacks and CRIU

Posted Dec 20, 2022 10:46 UTC (Tue) by adobriyan (subscriber, #30858)

Struct layout randomization is not an issue, many of the members are kernel pointers and will never be allowed to be serialised directly.

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 16:18 UTC (Sun) by Wol (subscriber, #4433)

> I kind of get It. This seems a case where an ABI policy similar to the BSD's, where KS and US are maintained in lockstep, would make things much easier for CR features...

Until the checkpoint/restore is to allow you to do an upgrade?

Cheers,
Wol

The intersection of shadow stacks and CRIU

Posted Dec 18, 2022 20:55 UTC (Sun) by jorgegv (subscriber, #60484)

That's the reason for the serialization protocol I mentioned.

The intersection of shadow stacks and CRIU

Posted Dec 20, 2022 11:05 UTC (Tue) by adobriyan (subscriber, #30858)

> I have never quite understood why a functionality which mainly consists of dumping all kernel data

You want to dump only userspace-observable data, otherwise you'll drown in complexity at restore time
not to mention restoring old programs on new kernel (which is major use case: long running program
should survive kernel upgrade and continue).

But if only user observable stuff is in the image, then restoring can (in theory) be done with userspace
accessible interfaces only, some of which exist already (think umask(2)).

> and process state to a file, and later using that for restoring, is to be done from user space.

> Since the kernel is in a privileged position to dump all that data, why not do It from there?

Some stuff is much easier from kernelspace, true.

The intersection of shadow stacks and CRIU

Posted Jan 6, 2023 6:20 UTC (Fri) by njs (subscriber, #40338)

IIRC the reason the project is called Checkpoint Restore *In Userspace* is to distinguish it from earlier attempts to do it all in the kernel, which failed. It's quite a bit ago now but if you look you can probably find the old discussions about why that approach was abandoned, by those who tried it.

Kernel-based checkpointing

Posted Jan 6, 2023 14:06 UTC (Fri) by corbet (editor, #1)

The LWN kernel index is your friend if you want to do this sort of digging; there are articles covering the development of checkpoint/restore going back to 2008.

The intersection of shadow stacks and CRIU

Posted Jan 6, 2023 9:01 UTC (Fri) by rep_movsd (guest, #100040)

Hopefully someday RAM becomes persistent (Memristors anyone?)
Then things like reboots, suspending, etc become no-ops