
Emulating Windows system calls in Linux

By Jonathan Corbet
June 25, 2020
The idea of handling system calls differently depending on the origin of each call in the process's address space is not entirely new. OpenBSD, for example, disallows system calls entirely if they are not made from the system's C library as a security-enhancing mechanism. At the end of May, Gabriel Krisman Bertazi proposed a similar mechanism for Linux, but the objective was not security at all; instead, he is working to make Windows games run better under Wine. That involves detecting and emulating Windows system calls; this can be done through origin-based filtering, but that may not be the solution that is merged in the end.

To run with any speed at all, Wine must run Windows code directly on the CPU to the greatest extent possible. That must end, though, once the Windows program makes a system call; trapping into the Linux kernel with the intent of making a Windows system call is highly unlikely to lead to good results. Traditionally, Wine has handled this by supplying its own version of the user-space Windows API that implemented the required functionality using Linux system calls. As explained in the patch posting, though, Windows applications are increasingly executing system calls directly rather than going through the API; that makes Wine unable to intercept them.

The good news is that Linux provides the ability to intercept system calls in the form of seccomp(). The bad news is that this mechanism, as found in current kernels, is not suited to the task of intercepting only system calls made from Windows code running within a larger process. Intercepting every system call would slow things down considerably, an effect that tends to make gamers particularly cranky. Tracking which parts of a process's address space make Linux system calls and which make Windows calls within the (classic) BPF programs used by seccomp() would be awkward at best and, once again, would be slow. So it seems that a new mechanism is called for.

The patch set adds a new memory-protection bit for mmap() called PROT_NOSYSCALL which, by default, does not change the kernel's behavior. If, however, a given process has turned on the new SECCOMP_MODE_MMAP mode in seccomp(), any system calls made from memory regions marked with PROT_NOSYSCALL will be trapped; the handler code can then emulate the attempted system call.
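As a rough illustration of that rule, the decision can be modeled in a few lines of Python. This is a toy model under stated assumptions, not kernel code: the `Region` class, the `syscall_is_trapped()` function, the example addresses, and the numeric value given to `PROT_NOSYSCALL` are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the proposed protection bit; the value is arbitrary.
PROT_NOSYSCALL = 0x10


@dataclass(frozen=True)
class Region:
    start: int  # first address of the mapping
    end: int    # one past the last address of the mapping
    prot: int   # protection bits that were passed to mmap()


def syscall_is_trapped(regions, mmap_mode_enabled, ip):
    """Return True if a system call issued at address ip would be trapped."""
    if not mmap_mode_enabled:
        # Without SECCOMP_MODE_MMAP the bit does not change behavior.
        return False
    for region in regions:
        if region.start <= ip < region.end:
            return bool(region.prot & PROT_NOSYSCALL)
    # Address not in any known mapping: untrapped in this toy model.
    return False


# Assumed layout: Windows code is mapped with the bit set, Wine's code is not.
regions = [
    Region(0x400000, 0x500000, PROT_NOSYSCALL),  # Windows executable
    Region(0x7F0000, 0x800000, 0),               # Wine and Linux libraries
]
```

With this layout, a call from `0x401000` is trapped only when the mode is enabled, while a call from `0x7F1000` always goes straight through.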

The cover letter notes that one should not rely on this mechanism the way OpenBSD uses its origin verification:

It goes without saying that this is in no way a security mechanism despite being built on top of seccomp, since an evil application can always jump to a whitelisted memory region and run the syscall. This is not a concern for Wine games.

seccomp() is used for this non-security feature, the text continues, because the alternative would be to duplicate much of its functionality.

The patch series generated a fair amount of discussion from developers who were not entirely comfortable with this mechanism. Kees Cook, for example, asked whether it would instead be possible to rewrite the Windows binary code at load time, replacing system calls with calls to the emulation functions. The answer, it seems, is "no". Modifying a game's code is likely to set off checks made to defeat cheaters, who also would otherwise make code modifications of their own. Wine developer Paul Gofman added that, to make such changes, Wine "would need some way to find those syscalls in the highly obfuscated dynamically generated code, the whole purpose of which is to prevent disassembling, debugging and finding things like that in it".

Matthew Wilcox, instead, suggested that the personality() mechanism could be extended to support a Windows personality. This, essentially, would create a new system-call entry point that would emulate the Windows calls. Gofman replied that this approach had been considered, but that the cost of executing the personality() call on each transition between Linux and Windows code would be too high. A possible solution here is to implement a special personality that looks at a flag, stored in user-space memory, to determine how system calls should be handled. Gofman offered to create a Wine patch using such a mechanism if an implementation existed; Krisman said that he would give it a try.
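The appeal of the flag-based variant is that switching between Linux and Windows code becomes an ordinary memory store rather than a system call. A minimal Python sketch of that idea follows; the `Process` class and the `ALLOW` and `BLOCK` values are hypothetical names chosen for the example, and nothing here reflects an actual kernel interface.

```python
ALLOW = 0  # Linux code is running: pass system calls to the kernel
BLOCK = 1  # Windows code is running: hand system calls to the emulator


class Process:
    def __init__(self):
        # One byte of ordinary user-space memory that the kernel would
        # consult on every system-call entry (an assumption of this model).
        self.selector = bytearray([ALLOW])
        self.kernel_entries = 0

    def enter_windows_code(self):
        self.selector[0] = BLOCK  # plain store, no system call needed

    def enter_linux_code(self):
        self.selector[0] = ALLOW  # plain store, no system call needed

    def syscall(self, number):
        """Model one system-call entry and report how it is handled."""
        self.kernel_entries += 1
        if self.selector[0] == BLOCK:
            return ("emulate", number)
        return ("native", number)
```

In this model only `syscall()` increments `kernel_entries`; the two transition methods cost nothing on the kernel side, which is the property Gofman was asking for.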

Andy Lutomirski had a couple of other suggestions, the first of which was a prctl() operation that would redirect all system calls through a user-space trampoline. System calls from the trampoline itself would be executed normally. In Wine's case, that trampoline could emulate system calls from Windows code while passing Linux system calls through to the kernel. Krisman indicated interest in this approach, and may implement a version of this idea as well.
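The control flow of the trampoline idea can be sketched as a pair of Python functions. This is only a model of the description above: the address ranges, the function names, and the way the trampoline recognizes Windows code (by the originating address) are assumptions made for illustration.

```python
TRAMPOLINE = range(0x1000, 0x2000)        # hypothetical trampoline page
WINDOWS_CODE = range(0x400000, 0x500000)  # hypothetical Windows mapping


def kernel_entry(ip, number, log):
    """Model of the kernel side: only the trampoline may enter directly."""
    if ip in TRAMPOLINE:
        log.append(("kernel", number))
        return "native result"
    # Any other origin is redirected to the user-space trampoline.
    return trampoline(ip, number, log)


def trampoline(origin_ip, number, log):
    """Model of the user-space trampoline that Wine would supply."""
    if origin_ip in WINDOWS_CODE:
        log.append(("emulated", number))
        return "emulated result"
    # A Linux system call: reissue it from inside the trampoline page.
    return kernel_entry(TRAMPOLINE.start, number, log)
```

A call from the Windows range never reaches the kernel branch, while a call from anywhere else is reissued from the trampoline and executed normally.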

Lutomirski's other idea was to allow a process to establish an (extended) BPF filter program for all system calls; he later extended this idea to have it handle all "architectural privilege transitions" for the process. This approach offers a lot of flexibility and may be useful far beyond Wine, but it suffers from a significant flaw: in the absence of unprivileged BPF, it could only be invoked by a privileged process, which is a show-stopper for Wine. Unless something changes, unprivileged BPF is an idea that isn't going anywhere in Linux, so the filter program does not look like a solution that Wine could use.

The end result of this discussion is that the problem is reasonably well understood and there is a shared desire to solve it. What form that solution will take is far from clear, though; there are a few approaches that need to be experimented with. Expect to see more patches in the future as the developers work to find which idea works best.

Index entries for this article
Kernel: Security/seccomp
Kernel: System calls



Emulating Windows system calls in Linux

Posted Jun 25, 2020 17:36 UTC (Thu) by rvolgers (guest, #63218) [Link] (1 responses)

I was all set to dislike this proposal, but I actually think the original idea is great, and is kind of underselling itself.

The PROT_NOSYSCALL could be turned into a remarkably elegant and effective security barrier when combined with Intel's Indirect Branch Tracking (part of CET) to remove the possibility of jumping directly to a SYSCALL instruction.

Emulating Windows system calls in Linux

Posted Jun 25, 2020 17:57 UTC (Thu) by rvolgers (guest, #63218) [Link]

To clarify the kind of design I'm thinking of, you could perhaps get something more flexible than a BPF syscall filter by making all executable pages PROT_NOSYSCALL, requiring all syscalls to go via a single non-PROT_NOSYSCALL page which performs the filtering in userspace and then performs the syscall. This assumes CET is turned on, as it's needed to ensure the filtering is done before the syscall instruction.

Obviously there are still a lot of potential problems when you have security code running in the same address space as the code it's defending against, so there might be something I'm missing. Certainly the trusted code that does the filtering would have to be very carefully written to defend against the program altering the filter code, and there are lots of potential race conditions if you'd want to check anything stored in memory.

Still, it seems like a pretty flexible approach, especially considering syscall filtering via bpf is also pretty limited, and worth thinking about?

Emulating Windows system calls in Linux

Posted Jun 25, 2020 21:00 UTC (Thu) by chris_se (subscriber, #99706) [Link] (3 responses)

I think the user-space trampoline is the most promising solution here: it requires extremely minimal code changes to the kernel and offers full flexibility to the user-space programs, because they can use any code they want for their trampoline, and that code can be extremely hand-optimized assembly for that specific use case.

And if somebody much more clever than me when it comes to assembly can figure out a way to do that without requiring a context switch for the instruction that triggers the trampoline (don't know if that's possible) then this would have the same performance as eBPF, because whether you do the comparison in ring 3 or ring 0 shouldn't actually change anything.

On the other hand, if you do need two additional context switches (one for entering the kernel and one for jumping back to the trampoline) then one would have to measure the performance impact here. (Though in that case one optimization could be for the kernel to postpone the speculative-execution mitigations until after it has determined that the syscall comes from the trampoline and should hence be executed directly, and to skip the mitigations on both entry to and exit from ring 0 if it is going to jump immediately back to the trampoline; that should reduce the cost of the context switches quite a bit.)

Emulating Windows system calls in Linux

Posted Jun 26, 2020 5:12 UTC (Fri) by luto (guest, #39314) [Link] (2 responses)

Pre-Meltdown, trivial syscalls like this were very fast. A trampoline oughtn’t hurt that much.

Emulating Windows system calls in Linux

Posted Jun 26, 2020 21:08 UTC (Fri) by roc (subscriber, #30627) [Link] (1 responses)

When will it be safe to start assuming no-PTI when designing APIs?

Emulating Windows system calls in Linux

Posted Jun 26, 2020 21:09 UTC (Fri) by roc (subscriber, #30627) [Link]

Probably better to ask "when will it be safe to assume syscalls are fast again?"

Emulating Windows system calls in Linux

Posted Jun 25, 2020 22:56 UTC (Thu) by roc (subscriber, #30627) [Link] (11 responses)

rr grapples with a similar problem. We need to intercept commonly-executed system calls and wrap them with our own processing, with minimal overhead. I think Wine could probably use our approach.

We use a seccomp filter to trap on all syscalls except for those called from a single specific trampoline page. When a library makes a syscall, the filter triggers a ptrace trap. The ptracer looks at the code around the syscall and if it matches certain common patterns, we patch it with a jump to a stub that does the extra work we need and then issues a real syscall via the trampoline. Thus, a library syscall is slow the first time and fast the rest of the time. (Another possible variant of this approach, probably faster when applicable, would be to avoid using a ptracer and have the filter trigger a SIGSYS whose handler does the patching.)

Maybe I should post this to the list...

Emulating Windows system calls in Linux

Posted Jun 25, 2020 23:24 UTC (Thu) by smcv (subscriber, #53363) [Link] (10 responses)

As mentioned in the article, patching Windows game code is unlikely to work well, because in some cases it tries to detect external modifications to itself as an anti-cheating mechanism, and it's deliberately obfuscated to make modification and tracing harder. I'm aware rr isn't usually tracing actively cooperating processes, but it isn't usually tracing a process that is actively uncooperative either.

Emulating Windows system calls in Linux

Posted Jun 26, 2020 1:39 UTC (Fri) by roc (subscriber, #30627) [Link] (9 responses)

The rr approach applied to Wine would not require patching the Windows game code, only the Wine/Linux libraries which *are* somewhat cooperative.

Discussing it on LKML, the problem with our approach for Wine is probably the issues with multiple threads potentially racing with system-call patching.

Emulating Windows system calls in Linux

Posted Jun 26, 2020 11:49 UTC (Fri) by pm215 (subscriber, #98099) [Link] (8 responses)

The patchset says it's addressing the way that "Modern Windows applications are executing system call instructions directly from the application's code without going through the WinAPI" -- so I think your approach would imply patching game code. It sounds like they already have workable approaches for apps that are traditional "call the winapi library which makes the syscall" style.

Emulating Windows system calls in Linux

Posted Jun 26, 2020 12:40 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (7 responses)

No, you would:

* patch WINE libraries (the only ones that should issue Linux system calls) to go through a trampoline page

* use seccomp-bpf to raise SIGSYS for almost all code except that single trampoline page

* now if you get SIGSYS you know it's a Windows syscall, and you handle it from the SIGSYS handler
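The address test in the second step has one wrinkle: classic BPF works on 32-bit words, so a 64-bit instruction pointer has to be compared in two halves. The Python sketch below models that comparison for a single page-aligned trampoline page. It is an illustration only; the function names and the 4096-byte page size are assumptions, and the returned strings merely stand in for the `SECCOMP_RET_ALLOW` and `SECCOMP_RET_TRAP` filter results.

```python
PAGE_SIZE = 4096
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit address into (high, low) 32-bit words."""
    return (value >> 32) & MASK32, value & MASK32


def filter_action(ip, trampoline_page):
    """Decide what the filter would return for a syscall issued at ip.

    Only 32-bit comparisons are used, as a classic BPF program would.
    trampoline_page must be page-aligned.
    """
    page_hi, page_lo = split64(trampoline_page)
    ip_hi, ip_lo = split64(ip)
    # Clear the offset within the page before comparing the low word.
    ip_page_lo = ip_lo & ~(PAGE_SIZE - 1) & MASK32
    if ip_hi == page_hi and ip_page_lo == page_lo:
        return "ALLOW"  # stands in for SECCOMP_RET_ALLOW
    return "TRAP"       # stands in for SECCOMP_RET_TRAP, which raises SIGSYS
```

Any address inside the trampoline page yields "ALLOW"; everything else, including addresses that differ only in the upper 32 bits, yields "TRAP".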

Emulating Windows system calls in Linux

Posted Jun 26, 2020 14:09 UTC (Fri) by pm215 (subscriber, #98099) [Link] (4 responses)

That would work, but it's not the approach suggested at the top of this comment thread, which includes "The ptracer looks at the code around the syscall and if it matches certain common patterns, we patch it with a jump to a stub"... (You don't need to runtime-patch the wine libraries -- wine controls that code so it can just be built to do whatever.)

Emulating Windows system calls in Linux

Posted Jun 26, 2020 15:03 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

Yes, the core idea though is the same, distinguishing trapped and pass-through system calls from the address.

Emulating Windows system calls in Linux

Posted Jun 26, 2020 21:05 UTC (Fri) by roc (subscriber, #30627) [Link] (2 responses)

Wine uses glibc and a bunch of other system libraries which do need to be patched. Those libraries aren't trying to stop us patching them, but they're not providing any hooks to avoid the need for patching, either.

Again: you don't need to patch the tricky game code with this approach ... as long as you can tolerate those syscalls being slow.

Emulating Windows system calls in Linux

Posted Jun 28, 2020 19:04 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> as long as you can tolerate those syscalls being slow.

I imagine this will depend on the game. If it's isolated into a bunch of small levels with loading screens between them, well, the loading screens will suck, but the rest of the game should basically work most of the time, assuming the game engine isn't trying to do something weird (like constantly telling the OS which pages to evict first).

But if it's an open world game that dynamically loads stuff in and out of memory all the time, then you're in trouble.

Emulating Windows system calls in Linux

Posted Jul 3, 2020 13:28 UTC (Fri) by raoni (guest, #137137) [Link]

IIRC from when I read the thread, they are OK with overhead on syscalls from Windows code; those are not the performance concern. The concern is that adding some sort of overhead to all syscalls is a bigger performance hit because of the Linux libraries and the WinAPI emulation code.

Emulating Windows system calls in Linux

Posted Jun 30, 2020 12:31 UTC (Tue) by mirabilos (subscriber, #84359) [Link]

Except that the code needed to handle it almost certainly isn’t signal-handler-safe, nor can it be made so…

Emulating Windows system calls in Linux

Posted Jul 11, 2020 13:30 UTC (Sat) by Hi-Angel (guest, #110915) [Link]

> No, you would:
>
> * patch WINE libraries (the only ones that should issue Linux system calls) to
> go through a trampoline page

You can't achieve anything here by patching WINE libs because as the prev. author
said, there's no problem with apps that go through them. The problem being
discussed is that some apps make system calls without going through WinAPI/WINE
libs. Let me quote the original mail:

> Modern Windows applications are executing system call instructions directly
> from the application's code without going through the WinAPI. This breaks Wine
> emulation, because it doesn't have a chance to intercept and emulate these
> syscalls before they are submitted to Linux.

Emulating Windows system calls in Linux

Posted Jul 2, 2020 9:31 UTC (Thu) by rwmj (subscriber, #5474) [Link] (1 responses)

Don't Windows and Linux system calls use entirely different mechanisms? Does Windows still use an IDT to enter the kernel?

Emulating Windows system calls in Linux

Posted Jul 2, 2020 10:13 UTC (Thu) by khim (subscriber, #9252) [Link]

Windows 7 uses SYSENTER and doesn't work on CPUs without it.

When applications dropped Windows XP support, they got the chance to use it, too.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds