
VMX virtualization runs afoul of split-lock detection

By Jonathan Corbet
April 7, 2020
One of the many features merged for the 5.7 kernel is split-lock detection for the x86 architecture. This feature has encountered a fair amount of controversy over the course of its development, with the result that the time between its initial posting and its appearance in a released kernel will end up being over two years. As it happens, split-lock detection faces another hurdle even now that it has been merged into the mainline: the feature threatens to create problems for a number of virtualization solutions, and it is not yet clear what the best solution will be.

To review quickly: a "split lock" occurs when a processor instruction locks a range of memory that crosses a cache-line boundary. Implementing such locks requires locking the entire memory bus, with unpleasant effects on the performance of the system as a whole. Most architectures do not allow split locks at all, but x86 does; only recently have some x86 processors gained the ability to generate a trap when a split lock is requested.

Kernel developers are interested in enabling split-lock detection as a way of eliminating a possible denial-of-service attack vector as well as just getting rid of a performance problem that could be especially problematic for latency-sensitive workloads. In short, there is a desire for x86 to be like other architectures in this regard. The implementation of this change has evolved considerably over time; in the patch that was merged, there is a new boot-time parameter (split_lock_detect=) that can have one of three values. Setting it to off disables this feature, warn causes a warning to be issued when user-space code executes a split lock, and fatal causes a SIGBUS signal to be sent. The default value is warn.
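
For the curious, a split lock is easy to provoke from user space. The following sketch (illustrative only, not from the kernel tree) performs a locked read-modify-write on a 32-bit value deliberately placed across a 64-byte cache-line boundary; on a kernel booted with split_lock_detect=warn it should produce the warning, while split_lock_detect=fatal should kill it with SIGBUS. Note that misaligned atomic operations are technically undefined behavior in C; this depends on how GCC and Clang implement the builtin on x86.

    /* Provoke a split lock on x86; assumes 64-byte cache lines. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Align the buffer to a cache line, then place a 32-bit value
           at offset 62 so that it straddles the line boundary. */
        _Alignas(64) static char buf[128];
        uint32_t *p = (uint32_t *)(buf + 62);

        /* The compiler emits a LOCK-prefixed add; since the operand
           crosses a cache line, the processor takes a split lock. */
        __sync_fetch_and_add(p, 1);
        printf("value after locked add: %u\n", *p);
        return 0;
    }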

The various discussions around split-lock detection included virtualization, which raises some interesting questions of its own. A system running virtualized guests is a logical place to enable split-lock detection, since a hostile guest could use split locks to disrupt the others. But a host that turns on split-lock detection risks breaking guests that are unprepared for it; the guest operating system, in particular, will be directly exposed to the alignment-check traps that split-lock detection generates. The administrator of the host may have no way of knowing whether the guest workloads are ready for that or not. So various kernel developers wondered what the best policy regarding virtualization should be.

It seems that some of that discussion fell by the wayside as the final patch was being prepared, leading to an unpleasant surprise. Kenneth Crudup first reported that split-lock detection caused VMware guests to crash, but the problem turns out to be a bit more widespread than that.

Intel's "virtual machine extensions" (VMX, also referred to as "VT-x") implements hardware-supported virtualization on x86 processors. A VMLAUNCH instruction places the processor in the virtualized mode, where the client's system software can (mostly) behave like it is running on bare hardware while being contained within its sandbox. It turns out that, if split-lock detection is enabled and code running within a virtual machine attempts a split lock, the processor will happily deliver an alignment-check trap to a thread running in the VMX mode; what happens next depends on the hypervisor. And most hypervisors are not prepared for this to happen; they will often just forward the trap into the virtual machine, which, not being prepared for it, will likely crash. Any hypervisor using VMX is affected by this issue.

Thomas Gleixner responded to the problem with a short patch series trying to cause the right things to happen. One of the affected hypervisors is KVM; since it is a part of the kernel, the right solution is to just make KVM handle the trap properly. Gleixner included a patch causing KVM to check to see whether the machine was configured to receive an alignment-check trap and only deliver it if so. That patch is likely to be superseded by a different series written by Xiaoyao Li, but the core idea (make KVM handle the trap correctly) is uncontroversial.
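
The idea at the core of both patches can be sketched as follows (simplified, with a hypothetical function name; see the posted patches for the real code): a legacy alignment-check trap can only legitimately be destined for the guest if the guest enabled alignment checking itself, which requires running at CPL 3 with both CR0.AM and EFLAGS.AC set. Any other #AC must have come from split-lock detection and belongs to the host.

    /* Sketch of the KVM-side check (function name hypothetical).
     * Forward #AC into the guest only when the guest could have
     * enabled the legacy alignment-check machinery itself. */
    static bool guest_expects_alignment_check(struct kvm_vcpu *vcpu)
    {
            /* Legacy #AC fires only at CPL 3 with CR0.AM and EFLAGS.AC set. */
            return vmx_get_cpl(vcpu) == 3 &&
                   kvm_read_cr0_bits(vcpu, X86_CR0_AM) &&
                   (kvm_get_rflags(vcpu) & X86_EFLAGS_AC);
    }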

The real question is what should be done the rest of the time. All of the other VMX-using hypervisors are out of tree, so they cannot be fixed directly. Gleixner's original patch took a blunt approach: it disabled split-lock detection globally if a hypervisor module was loaded into the kernel. But, since modules don't come with a little label saying "this is a hypervisor", the patch instead read through each module's executable code at load time in search of a VMLAUNCH instruction. Should such an instruction exist, the module is deemed to be a hypervisor. Unless a special flag ("sld_safe") is set in the module info area, the hypervisor will be assumed to be unready for split-lock detection, and the feature will be turned off.
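
As a simplified illustration of the concept (this sketch is not the actual patch, which uses the kernel's instruction decoder rather than a raw byte search), a scan for the three opcode bytes of VMLAUNCH (0f 01 c2) might look like this; note that a naive byte-level match can hit data or the tail of another instruction, a point raised in the comments below.

    /* Naive sketch: search a module's executable text for the VMLAUNCH
     * opcode bytes (0f 01 c2).  A byte-level match can land inside data
     * or a longer instruction, so real code must decode instruction
     * boundaries; see the decoder-based sketch in the comments. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    static const unsigned char vmlaunch[] = { 0x0f, 0x01, 0xc2 };

    static bool text_has_vmlaunch(const unsigned char *text, size_t len)
    {
            size_t i;

            for (i = 0; i + sizeof(vmlaunch) <= len; i++)
                    if (!memcmp(text + i, vmlaunch, sizeof(vmlaunch)))
                            return true;
            return false;
    }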

It is not at all clear that this approach will be adopted. Among other things, it turns out that not all VMX hypervisors include VMLAUNCH instructions in their code. As Gleixner noted later in the discussion, VirtualBox doesn't directly contain any of the VMX instructions; those are loaded separately by the VirtualBox module, outside of the kernel's module-loading mechanism. "This 'design' probably comes from the original virtualbox implementation which circumvented GPL that way", Gleixner observed. Other modules use VMXON rather than VMLAUNCH.

Eventually these sorts of problems could be worked around, but there is another concern with this approach that was expressed, in typical style, by Christoph Hellwig:

This is just crazy. We have never cared about any out of tree module, why would we care here where it creates a real complexity. Just fix KVM and ignore anything else.

There is a fair amount of sympathy for this approach in kernel-development circles, but there is still a reluctance to ship something that is certain to create unexpected failures for end users even if it is not seen as a regression in the usual sense. So a couple of other ideas for how to respond to this problem have been circulating.

One of those is to continue scanning module code for instructions that indicate hypervisor functionality. But, rather than disabling split-lock detection on the system as a whole, the kernel would simply refuse to load the module. There are concerns about the run-time cost of scanning through module code, but developers like Peter Zijlstra also see an opportunity to prevent the loading of modules that engage in other sorts of unwelcome behavior, such as directly manipulating the CPU's control registers. A patch implementing such checks has subsequently been posted.

An alternative, suggested by Hellwig, is to find some other way to break the modules in question and prevent them from being loaded. Removing some exported symbols would be one way to do that. Zijlstra posted one attempt at "fixing" the problem that way; Hellwig has a complementary approach as well.

As of this writing, it's not clear which approach will be taken; the final 5.7 kernel could be released with either of them, or with some as-yet-unseen third technique. Then, just maybe, the long story of x86 split-lock detection will come to some sort of conclusion.

Index entries for this article
Kernel: Architectures/x86
Kernel: Virtualization



Posted Apr 8, 2020 11:47 UTC (Wed) by cesarb (subscriber, #6266)

> but developers like Peter Zijlstra also see an opportunity to prevent the loading of modules that engage in other sorts of unwelcome behavior, such as directly manipulating the CPU's control registers.

I wonder if the end game won't be something like requiring all out-of-tree modules to be compiled to WASM or similar.

Posted Apr 8, 2020 12:37 UTC (Wed) by amacater (subscriber, #790)

Or just kill out of tree modules as unfixable and blacklist them from linking ... that would kill off the unfixable cheap appliances that are always out of tree :)

Posted Apr 8, 2020 13:35 UTC (Wed) by dottedmag (subscriber, #18590)

Well, it will cause these appliances to be stuck on ancient kernels with out-of-tree modules.

Posted Apr 8, 2020 14:18 UTC (Wed) by Tov (subscriber, #61080)

Which is essentially status quo...

Posted Apr 8, 2020 18:01 UTC (Wed) by imMute (guest, #96323)

It'll also completely screw over people who use out-of-tree modules and *up-to-date* kernels too.

Posted Apr 9, 2020 4:08 UTC (Thu) by jcm (subscriber, #18262)

What isn’t spelled out clearly anywhere is just how badly they screwed up the split cache line lock implementation. There isn’t one “memory bus” in a modern system, there are many cache coherent agents talking to one another and via home agents (contained within home nodes in Intel parlance) to external memory. There are so many nasty ways that you could implement “locking the bus”.

Posted Apr 9, 2020 13:42 UTC (Thu) by jkingweb (subscriber, #113039)

For clarity, who is "they"?

Posted Apr 10, 2020 5:40 UTC (Fri) by jcm (subscriber, #18262)

Intel. This ought to have been implemented using a TSX-like concept by putting both cache lines into, effectively, a write set consisting of lines exclusively held by a single core at a time, not by locking the uncore of the chip.

Posted Apr 10, 2020 15:44 UTC (Fri) by cesarb (subscriber, #6266)

> This ought to be have been implemented using a TSX-like concept

Given how many problems they have had with TSX (which includes some so bad that they had to completely disable TSX on the affected processors in a microcode update), it's a good thing they didn't do it that way.

It makes sense that they chose the simplest possible implementation: other than legacy MS-DOS era software, no software should do atomic operations on values which are not naturally aligned in memory. Their mistake was to take so long to implement a trap on these misaligned atomic operations.

Posted Apr 11, 2020 22:10 UTC (Sat) by nix (subscriber, #2304)

> Given how many problems they have had with TSX (which includes some so bad that they had to completely disable TSX on the affected processors in a microcode update), it's a good thing they didn't do it that way.

After a huge amount of effort to get TSX working for locks in glibc, actual benchmarking was done. TSX was only slightly faster than no-TSX, and algorithmic changes made no-TSX massively faster than TSX. Whoops. (TSX remains dead useful for attackers trying to trigger speculative attacks that then erase the evidence of their execution.)

Posted Apr 9, 2020 19:23 UTC (Thu) by kmweber (guest, #114635)

As someone who knows nothing about this, I have a question: does this essentially mean putting a disassembler in the kernel? Because presumably that particular sequence of bytes that constitutes the opcode could appear in other contexts, right? E.g. as initialized data, as the end of one instruction opcode followed by the beginning of another, etc. Since x86 instructions are variable in length, is there a way to reliably determine this short of disassembling the entire thing from the beginning?

Like I said, I know nothing about this, so this isn't meant as a commentary on whether it is a good solution or not, whether the benefits outweigh the costs, and so forth--I'm just trying to better understand what's going on.

Posted Apr 10, 2020 13:06 UTC (Fri) by nix (subscriber, #2304)

> I have a question: does this essentially mean putting a disassembler in the kernel?

There's been a disassembler in the kernel for ages (or actually many, for many architectures: used e.g. for kprobes and uprobes etc. to analyze function prologues to drop probe points there). This is just one of many users.
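
For illustration, a boundary-respecting version of the earlier text_has_vmlaunch() sketch, built on the kernel's decoder (declared in arch/x86/include/asm/insn.h), might look roughly like this; the details are simplified:

    /* Sketch: walk x86 text one decoded instruction at a time, so an
     * opcode match cannot land inside data or another instruction. */
    #include <asm/insn.h>

    static bool text_has_vmlaunch(const u8 *text, const u8 *end)
    {
            struct insn insn;

            while (text < end) {
                    insn_init(&insn, text, end - text, 1 /* 64-bit mode */);
                    insn_get_length(&insn);
                    if (!insn_complete(&insn))
                            return false;   /* undecodable: give up */
                    /* VMLAUNCH is opcode 0f 01 with ModRM byte c2. */
                    if (insn.opcode.bytes[0] == 0x0f &&
                        insn.opcode.bytes[1] == 0x01 &&
                        insn.modrm.bytes[0] == 0xc2)
                            return true;
                    text += insn.length;
            }
            return false;
    }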

Posted Apr 12, 2020 21:48 UTC (Sun) by hmh (subscriber, #3838)

Not to mention the emulators... Yes, the kernel can, and does, have to emulate (a few instructions of) the running processor sometimes...

Posted Apr 19, 2020 21:09 UTC (Sun) by robbe (guest, #16131)

How it has worked before with VMware Workstation: the product ships a ton of pre-built kernel modules, but these are invariably not for my distribution, and even if they were, they get old pretty fast, if you are tracking Linus releases. You can fall back on compiling from the provided module sources, but if your kernel is 6-12 months newer than your release of Workstation, this will usually fail and needs manual fixing.

So… the module breaking regularly is par for the course, at least when considering VMware.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds