Emulating Windows system calls in Linux
To run with any speed at all, Wine must run Windows code directly on the CPU to the greatest extent possible. That must end, though, once the Windows program makes a system call; trapping into the Linux kernel with the intent of making a Windows system call is highly unlikely to lead to good results. Traditionally, Wine has handled this by supplying its own version of the user-space Windows API that implemented the required functionality using Linux system calls. As explained in the patch posting, though, Windows applications are increasingly executing system calls directly rather than going through the API; that makes Wine unable to intercept them.
The good news is that Linux provides the ability to intercept system calls in the form of seccomp(). The bad news is that this mechanism, as found in current kernels, is not suited to the task of intercepting only system calls made from Windows code running within a larger process. Intercepting every system call would slow things down considerably, an effect that tends to make gamers particularly cranky. Tracking which parts of a process's address space make Linux system calls and which make Windows calls within the (classic) BPF programs used by seccomp() would be awkward at best and, once again, would be slow. So it seems that a new mechanism is called for.
The patch set adds a new memory-protection bit for mmap() called PROT_NOSYSCALL which, by default, does not change the kernel's behavior. If, however, a given process has turned on the new SECCOMP_MODE_MMAP mode in seccomp(), any system calls made from memory regions marked with PROT_NOSYSCALL will be trapped; the handler code can then emulate the attempted system call.
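The trapping rule described above can be modeled in a few lines of user-space C. This is only a sketch of the decision, not kernel code: PROT_NOSYSCALL and SECCOMP_MODE_MMAP exist only in the proposed patch set, and the names region_model and syscall_would_trap, along with the choice to let calls from unmapped addresses through, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model of one mapping; not a kernel structure. */
struct region_model {
    uintptr_t start;     /* first address of the mapping (inclusive) */
    uintptr_t end;       /* one past the last address (exclusive) */
    bool      nosyscall; /* stands in for the proposed PROT_NOSYSCALL bit */
};

/*
 * Returns true if a system call issued from address `ip` would be trapped
 * under the proposed rules: the process must have opted in to the new
 * seccomp mode, and `ip` must lie in a region marked as "no syscall".
 */
static bool syscall_would_trap(const struct region_model *map, size_t n,
                               bool mmap_mode_enabled, uintptr_t ip)
{
    assert(map != NULL || n == 0);

    if (!mmap_mode_enabled)
        return false;    /* by default the bit changes nothing */

    for (size_t i = 0; i < n; i++) {
        if (ip >= map[i].start && ip < map[i].end)
            return map[i].nosyscall;
    }
    return false;        /* assumption: unmapped addresses are not trapped */
}
```

In Wine's case, the regions holding Windows code would carry the bit while Wine's own libraries would not, so only the former would end up in the emulation handler.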
The cover letter notes that one should not rely on this mechanism the way OpenBSD uses its origin verification:
seccomp() is used for this non-security feature, the text continues, because the alternative would be to duplicate much of its functionality.
The patch series generated a fair amount of discussion from developers who were not entirely comfortable with this mechanism. Kees Cook, for example, asked whether it would instead be possible to rewrite the Windows binary code at load time, replacing system calls with calls to the emulation functions. The answer, it seems, is "no". Modifying a game's code is likely to set off checks made to defeat cheaters, who also would otherwise make code modifications of their own. Wine developer Paul Gofman added that, to make such changes, Wine "would need some way to find those syscalls in the highly obfuscated dynamically generated code, the whole purpose of which is to prevent disassembling, debugging and finding things like that in it".
Matthew Wilcox, instead, suggested that the personality() mechanism could be extended to support a Windows personality. This, essentially, would create a new system-call entry point that would emulate the Windows calls. Gofman replied that this approach had been considered, but that the cost of executing the personality() call on each transition between Linux and Windows code would be too high. A possible solution here is to implement a special personality that looks at a flag, stored in user-space memory, to determine how system calls should be handled. Gofman offered to create a Wine patch using such a mechanism if an implementation existed; Krisman said that he would give it a try.
Andy Lutomirski had a couple of other suggestions, the first of which was a prctl() operation that would redirect all system calls through a user-space trampoline. System calls from the trampoline itself would be executed normally. In Wine's case, that trampoline could emulate system calls from Windows code while passing Linux system calls through to the kernel. Krisman indicated interest in this approach, and may implement a version of this idea as well.
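What such a trampoline might do once it has control can be sketched as follows. This is a rough illustration under stated assumptions: the proposed prctl() operation does not exist, so the origin of the call is simply passed in as a parameter, and enum origin, emulate_windows_syscall(), trampoline_dispatch() and EMU_NOT_IMPLEMENTED are all hypothetical names. Only the pass-through path uses a real interface, the C library's syscall() wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for the syscall() declaration */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical: how the trampoline learns where a call came from. */
enum origin { ORIGIN_LINUX, ORIGIN_WINDOWS };

/* Hypothetical status returned by the stub emulator below. */
#define EMU_NOT_IMPLEMENTED (-1L)

/* Placeholder for Wine's emulation of a Windows system call. */
static long emulate_windows_syscall(long nr, long a0, long a1, long a2)
{
    (void)nr; (void)a0; (void)a1; (void)a2;
    return EMU_NOT_IMPLEMENTED;
}

/*
 * Model of the trampoline: Windows calls are emulated in user space,
 * while Linux calls are passed through to the kernel unchanged.
 */
static long trampoline_dispatch(enum origin from, long nr,
                                long a0, long a1, long a2)
{
    assert(from == ORIGIN_LINUX || from == ORIGIN_WINDOWS);

    if (from == ORIGIN_WINDOWS)
        return emulate_windows_syscall(nr, a0, a1, a2);

    /* Issued from the trampoline itself, so it would execute normally. */
    return syscall(nr, a0, a1, a2);
}
```

A call such as trampoline_dispatch(ORIGIN_LINUX, SYS_getpid, 0, 0, 0) reaches the kernel as usual, while anything tagged ORIGIN_WINDOWS never does.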
Lutomirski's other idea was to allow a process to establish an (extended) BPF filter program for all system calls; he later extended this idea to have it handle all "architectural privilege transitions" for the process. This approach offers a lot of flexibility and may be useful far beyond Wine, but it suffers from a significant flaw: in the absence of unprivileged BPF, it could only be invoked by a privileged process, which is a show-stopper for Wine. Unless something changes, unprivileged BPF is an idea that isn't going anywhere in Linux, so the filter program does not look like a solution that Wine could use.
The end result of this discussion is that the problem is reasonably well understood and there is a shared desire to solve it. What form that solution will take is far from clear, though; there are a few approaches that need to be experimented with. Expect to see more patches in the future as the developers work to find which idea works best.
Posted Jun 25, 2020 17:36 UTC (Thu)
by rvolgers (guest, #63218)
[Link] (1 responses)
The PROT_NOSYSCALL could be turned into a remarkably elegant and effective security barrier when combined with Intel's Indirect Branch Tracking (part of CET) to remove the possibility of jumping directly to a SYSCALL instruction.
Posted Jun 25, 2020 17:57 UTC (Thu)
by rvolgers (guest, #63218)
[Link]
Obviously there are still a lot of potential problems when you have security code running in the same address space as the code it's defending against, so there might be something I'm missing. Certainly the trusted code that does the filtering would have to be very carefully written to defend against the program altering the filter code, and there are lots of potential race conditions if you'd want to check anything stored in memory.
Still, it seems like a pretty flexible approach, especially considering syscall filtering via bpf is also pretty limited, and worth thinking about?
Posted Jun 25, 2020 21:00 UTC (Thu)
by chris_se (subscriber, #99706)
[Link] (3 responses)
And if somebody much more clever than me when it comes to assembly can figure out a way to do that without requiring a context switch for the instruction that triggers the trampoline (don't know if that's possible) then this would have the same performance as eBPF, because whether you do the comparison in ring 3 or ring 0 shouldn't actually change anything.
On the other hand, if you do need two additional context switches (one for entering the kernel and one for jumping back to the trampoline) then one would have to measure the performance impact here. (Though in that case one optimization could be for the kernel to postpone the speculative execution mitigations until after it has determined the syscall comes from the trampoline and should hence be used directly, and don't do any of the mitigations both for entering and exiting ring 0 if it wants to jump immediately back to the trampoline, that should reduce the cost of the context switches quite a bit.)
Posted Jun 26, 2020 5:12 UTC (Fri)
by luto (guest, #39314)
[Link] (2 responses)
Posted Jun 26, 2020 21:08 UTC (Fri)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Jun 26, 2020 21:09 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Jun 25, 2020 22:56 UTC (Thu)
by roc (subscriber, #30627)
[Link] (11 responses)
We use a seccomp filter to trap on all syscalls except for those called from a single specific trampoline page. When a library makes a syscall, the filter triggers a ptrace trap. The ptracer looks at the code around the syscall and if it matches certain common patterns, we patch it with a jump to a stub that does the extra work we need and then issues a real syscall via the trampoline. Thus, a library syscall is slow the first time and fast the rest of the time. (Another possible variant of this approach, probably faster when applicable, would be to avoid using a ptracer and have the filter trigger a SIGSYS whose handler does the patching.)
Maybe I should post this to the list...
Posted Jun 25, 2020 23:24 UTC (Thu)
by smcv (subscriber, #53363)
[Link] (10 responses)
Posted Jun 26, 2020 1:39 UTC (Fri)
by roc (subscriber, #30627)
[Link] (9 responses)
Discussing it on LKML, the problem with our approach for Wine is probably the issues with multiple threads potentially racing with system-call patching.
Posted Jun 26, 2020 11:49 UTC (Fri)
by pm215 (subscriber, #98099)
[Link] (8 responses)
Posted Jun 26, 2020 12:40 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (7 responses)
* patch WINE libraries (the only ones that should issue Linux system calls) to go through a trampoline page
* use seccomp-bpf to raise SIGSYS for almost all code except that single trampoline page
* now if you get SIGSYS you know it's a Windows syscall, and you handle it from the SIGSYS handler
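The filter in the second step above could look roughly like the classic-BPF program below. It is a minimal sketch, not anybody's actual implementation: it assumes a little-endian 64-bit target with 4 KiB pages, build_trampoline_filter() is a hypothetical helper name, and it only builds the program without installing it. The struct seccomp_data layout, the BPF_STMT()/BPF_JUMP() macros and the SECCOMP_RET_* values come from the kernel's UAPI headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define PAGE_MASK_LO          0xfffff000u  /* assumes 4 KiB pages */
#define TRAMPOLINE_FILTER_LEN 7

/*
 * Fill `out` (at least TRAMPOLINE_FILTER_LEN entries) with a filter that
 * allows system calls whose instruction pointer lies in the page at `page`
 * and raises SIGSYS (SECCOMP_RET_TRAP) for everything else. Classic BPF
 * works on 32-bit words, so the 64-bit address is compared in two halves.
 */
static size_t build_trampoline_filter(struct sock_filter *out, uint64_t page)
{
    assert(out != NULL);

    const uint32_t lo = (uint32_t)page & PAGE_MASK_LO;
    const uint32_t hi = (uint32_t)(page >> 32);
    /* Low word first: this offset arithmetic assumes little-endian. */
    const uint32_t ip_off =
        (uint32_t)offsetof(struct seccomp_data, instruction_pointer);

    const struct sock_filter prog[TRAMPOLINE_FILTER_LEN] = {
        /* 0: A = low 32 bits of the instruction pointer */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_off),
        /* 1: A &= page mask */
        BPF_STMT(BPF_ALU | BPF_AND | BPF_K, PAGE_MASK_LO),
        /* 2: low half differs -> skip 3 instructions to the trap at 6 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, lo, 0, 3),
        /* 3: A = high 32 bits of the instruction pointer */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_off + 4),
        /* 4: high half differs -> skip 1 instruction to the trap at 6 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 1),
        /* 5: call came from the trampoline page */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* 6: call came from anywhere else */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    };

    for (size_t i = 0; i < TRAMPOLINE_FILTER_LEN; i++)
        out[i] = prog[i];
    return TRAMPOLINE_FILTER_LEN;
}
```

To use it, the array would be wrapped in a struct sock_fprog and installed with seccomp(SECCOMP_SET_MODE_FILTER, ...) after setting PR_SET_NO_NEW_PRIVS. A real filter would also want to check the arch field and think about calls issued outside the trampoline by the signal-return path, such as rt_sigreturn.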
Posted Jun 26, 2020 14:09 UTC (Fri)
by pm215 (subscriber, #98099)
[Link] (4 responses)
Posted Jun 26, 2020 15:03 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link]
Posted Jun 26, 2020 21:05 UTC (Fri)
by roc (subscriber, #30627)
[Link] (2 responses)
Again: you don't need to patch the tricky game code with this approach ... as long as you can tolerate those syscalls being slow.
Posted Jun 28, 2020 19:04 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I imagine this will depend on the game. If it's isolated into a bunch of small levels with loading screens between them, well, the loading screens will suck, but the rest of the game should basically work most of the time, assuming the game engine isn't trying to do something weird (like constantly telling the OS which pages to evict first).
But if it's an open world game that dynamically loads stuff in and out of memory all the time, then you're in trouble.
Posted Jul 3, 2020 13:28 UTC (Fri)
by raoni (guest, #137137)
[Link]
Posted Jun 30, 2020 12:31 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link]
Posted Jul 11, 2020 13:30 UTC (Sat)
by Hi-Angel (guest, #110915)
[Link]
> * patch WINE libraries (the only ones that should issue Linux system calls) to
> go through a trampoline page
You can't achieve anything here by patching WINE libs because as the prev. author said, there's no problem with apps that go through them. The problem being discussed is that some apps make system calls without going through WinAPI/WINE libs. Let me quote the original mail:
> Modern Windows applications are executing system call instructions directly
> from the application's code without going through the WinAPI. This breaks Wine
> emulation, because it doesn't have a chance to intercept and emulate these
> syscalls before they are submitted to Linux.
Posted Jul 2, 2020 9:31 UTC (Thu)
by rwmj (subscriber, #5474)
[Link] (1 responses)
Posted Jul 2, 2020 10:13 UTC (Thu)
by khim (subscriber, #9252)
[Link]
When applications dropped WINXP support they got the chance to use it, too.