VMX virtualization runs afoul of split-lock detection
To review quickly: a "split lock" occurs when a processor instruction locks a range of memory that crosses a cache-line boundary. Implementing such locks requires locking the entire memory bus, with unpleasant effects on the performance of the system as a whole. Most architectures do not allow split locks at all, but x86 does; only recently have some x86 processors gained the ability to generate a trap when a split lock is requested.
Kernel developers are interested in enabling split-lock detection as a way of eliminating a possible denial-of-service attack vector as well as just getting rid of a performance problem that could be especially problematic for latency-sensitive workloads. In short, there is a desire for x86 to be like other architectures in this regard. The implementation of this change has evolved considerably over time; in the patch that was merged, there is a new boot-time parameter (split_lock_detect=) that can have one of three values. Setting it to off disables this feature, warn causes a warning to be issued when user-space code executes a split lock, and fatal causes a SIGBUS signal to be sent. The default value is warn.
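As a concrete illustration of both the operation and the new knob, here is a minimal user-space C sketch (not from the article) that performs an atomic increment on a value deliberately placed across a 64-byte cache-line boundary. On a processor with the detection feature, booting with split_lock_detect=warn would log a kernel warning when this runs; with fatal, the process would die with SIGBUS.

    /*
     * Minimal sketch (hypothetical): force an atomic increment on a
     * 4-byte value that straddles a 64-byte cache-line boundary.
     */
    #include <stdio.h>
    #include <stdint.h>

    /* Align the buffer to a cache line; offset 62 makes a 4-byte
     * value cross the 64-byte boundary. */
    static char buf[128] __attribute__((aligned(64)));

    int main(void)
    {
        volatile uint32_t *p = (uint32_t *)(buf + 62);

        /* A LOCK-prefixed operation on a line-crossing operand is
         * exactly what constitutes a "split lock". */
        __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);

        printf("split-locked increment done: %u\n", *p);
        return 0;
    }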
The various discussions around split-lock detection included virtualization, which has always raised some interesting questions. A system that runs virtualized guests is a logical place to enable split-lock detection, since a guest can disrupt others with hostile locking behavior. But a host that turns on split-lock detection risks breaking guests that are unprepared for it; this problem extends to the guest operating system, which will be directly exposed to the alignment-check traps caused by split-lock detection. It may not be possible for the administrator of the host to even know whether the guest workloads are ready or not. So various kernel developers wondered what the best policy regarding virtualization should be.
It seems that some of that discussion fell by the wayside as the final patch was being prepared, leading to an unpleasant surprise. Kenneth Crudup first reported that split-lock detection caused VMware guests to crash, but the problem turns out to be a bit more widespread than that.
Intel's "virtual machine extensions" (VMX, also referred to as "VT-x") implements hardware-supported virtualization on x86 processors. A VMLAUNCH instruction places the processor in the virtualized mode, where the client's system software can (mostly) behave like it is running on bare hardware while being contained within its sandbox. It turns out that, if split-lock detection is enabled and code running within a virtual machine attempts a split lock, the processor will happily deliver an alignment-check trap to a thread running in the VMX mode; what happens next depends on the hypervisor. And most hypervisors are not prepared for this to happen; they will often just forward the trap into the virtual machine, which, not being prepared for it, will likely crash. Any hypervisor using VMX is affected by this issue.
Thomas Gleixner responded to the problem with a short patch series trying to make the right things happen. One of the affected hypervisors is KVM; since KVM is part of the kernel, the right solution is simply to make it handle the trap properly. Gleixner included a patch causing KVM to check whether the guest was configured to receive an alignment-check trap and to deliver it only if so. That patch is likely to be superseded by a different series written by Xiaoyao Li, but the core idea (make KVM handle the trap correctly) is uncontroversial.
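The architectural rule behind that check is simple to state. The sketch below is a hypothetical, self-contained restatement of the logic as described, not the actual KVM patch: a legacy alignment-check (#AC) exception is only ever raised at CPL 3 with both CR0.AM and EFLAGS.AC set, so an #AC arriving under any other conditions must have come from split-lock detection and should not simply be forwarded to the guest.

    /* Hypothetical restatement of the check described above; the
     * real KVM patch works on the vcpu state directly. */
    #include <stdbool.h>
    #include <stdint.h>

    #define X86_CR0_AM    (1ULL << 18)  /* CR0: alignment mask */
    #define X86_EFLAGS_AC (1ULL << 18)  /* RFLAGS: alignment check */

    static bool guest_expects_ac_trap(uint64_t cr0, uint64_t rflags,
                                      unsigned int cpl)
    {
        /* A legacy #AC can only occur at CPL 3 with both bits set;
         * anything else must be a split-lock trap, which the
         * hypervisor should handle itself. */
        return cpl == 3 && (cr0 & X86_CR0_AM) && (rflags & X86_EFLAGS_AC);
    }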
The real question is what should be done the rest of the time. All of the other VMX-using hypervisors are out of tree, so they cannot be fixed directly. Gleixner's original patch took an approach that was arguably uncharacteristic for him: it disabled split-lock detection globally if a hypervisor module was loaded into the kernel. But, since modules don't come with a little label saying "this is a hypervisor", his patch instead reads through each module's executable code at load time in search of a VMLAUNCH instruction; if such an instruction is found, the module is deemed to be a hypervisor. Unless a special flag ("sld_safe") is set in the module info area, the hypervisor is assumed to be unready for split-lock detection and the feature is turned off.
It is not at all clear that this approach will be adopted. Among other things, it turns out that not all VMX hypervisors include VMLAUNCH instructions in their code. As Gleixner noted later in the discussion, VirtualBox doesn't directly contain any of the VMX instructions; those are loaded separately by the VirtualBox module, outside of the kernel's module-loading mechanism. "This 'design' probably comes from the original virtualbox implementation which circumvented GPL that way", Gleixner observed. Other modules use VMXON rather than VMLAUNCH.
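To make the scan concrete: VMLAUNCH encodes as the byte sequence 0f 01 c2, so a naive version of the check amounts to a pattern search over the module's text, as in the sketch below (an illustration only, not Gleixner's actual patch). As the discussion above shows, such a scan misses hypervisors that load their VMX-using code separately or that use VMXON instead, and raw byte matching can also produce false positives when the pattern occurs as data or as the tail of another instruction.

    /* Illustrative only: a naive scan of a module's executable text
     * for the VMLAUNCH opcode (0f 01 c2). */
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    static bool text_contains_vmlaunch(const unsigned char *text, size_t len)
    {
        static const unsigned char vmlaunch[3] = { 0x0f, 0x01, 0xc2 };

        for (size_t i = 0; i + sizeof(vmlaunch) <= len; i++)
            if (memcmp(text + i, vmlaunch, sizeof(vmlaunch)) == 0)
                return true;
        return false;
    }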
Eventually these sorts of problems could be worked around, but there is another concern with this approach, expressed in typical style by Christoph Hellwig, who saw little reason for the kernel to go out of its way to accommodate out-of-tree hypervisors in the first place.
There is a fair amount of sympathy for that position in kernel-development circles, but there is still a reluctance to ship something that is certain to create unexpected failures for end users, even if those failures are not seen as regressions in the usual sense. So a couple of other ideas for how to respond to this problem have been circulating.
One of those is to continue scanning module code for instructions that indicate hypervisor functionality. But, rather than disabling split-lock detection on the system as a whole, the kernel would simply refuse to load the module. There are concerns about the run-time cost of scanning through module code, but developers like Peter Zijlstra also see an opportunity to prevent the loading of modules that engage in other sorts of unwelcome behavior, such as directly manipulating the CPU's control registers. A patch implementing such checks has subsequently been posted.
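A hypothetical sketch of that policy follows; the helper name and return codes are invented for illustration. The idea is to extend the same scan to other suspect opcodes, such as MOV to a control register (0f 22 /r), and to fail the module load rather than weakening split-lock detection system-wide. A production version would presumably use the kernel's instruction decoder rather than raw byte matching, for the false-positive reasons noted above.

    /* Hypothetical sketch of the refuse-to-load policy. */
    #include <errno.h>
    #include <stddef.h>

    static int check_module_text(const unsigned char *text, size_t len)
    {
        for (size_t i = 0; i + 2 <= len; i++) {
            /* VMLAUNCH (0f 01 c2): module implements a hypervisor. */
            if (i + 3 <= len && text[i] == 0x0f &&
                text[i + 1] == 0x01 && text[i + 2] == 0xc2)
                return -EPERM;
            /* MOV to CRn (0f 22 /r): direct control-register write. */
            if (text[i] == 0x0f && text[i + 1] == 0x22)
                return -EPERM;
        }
        return 0;   /* nothing objectionable found; allow the load */
    }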
An alternative, suggested by Hellwig, is to find some other way to break the modules in question and prevent them from being loaded. Removing some exported symbols would be one way to do that. Zijlstra posted one attempt at "fixing" the problem that way; Hellwig has a complementary approach as well.
As of this writing, it's not clear which approach will be taken; the final 5.7 kernel could be released with both of them, or with some yet-unseen third technique. Then, just maybe, the long story of x86 split-lock detection will come to some sort of conclusion.
Index entries for this article:
Kernel: Architectures/x86
Kernel: Virtualization
Posted Apr 8, 2020 11:47 UTC (Wed) by cesarb (subscriber, #6266)
I wonder if the end game won't be something like requiring all out-of-tree modules to be compiled to WASM or similar.
Posted Apr 10, 2020 15:44 UTC (Fri) by cesarb (subscriber, #6266)
Given how many problems they have had with TSX (which includes some so bad that they had to completely disable TSX on the affected processors in a microcode update), it's a good thing they didn't do it that way.
It makes sense that they chose the simplest possible implementation: other than legacy MS-DOS era software, no software should do atomic operations on values which are not naturally aligned in memory. Their mistake was to take so long to implement a trap on these misaligned atomic operations.
Posted Apr 11, 2020 22:10 UTC (Sat) by nix (subscriber, #2304)
After a huge amount of effort to get TSX working for locks in glibc, actual benchmarking was done. TSX was only slightly faster than no-TSX, and algorithmic changes made no-TSX massively faster than TSX. Whoops. (TSX remains dead useful for attackers trying to trigger speculative attacks that then erase the evidence of their execution.)
Posted Apr 9, 2020 19:23 UTC (Thu) by kmweber (guest, #114635)
I have a question: does this essentially mean putting a disassembler in the kernel?
Like I said, I know nothing about this, so this isn't meant as a commentary on whether it is a good solution or not, whether the benefits outweigh the costs, and so forth--I'm just trying to better understand what's going on.
Posted Apr 10, 2020 13:06 UTC (Fri) by nix (subscriber, #2304)
There's been a disassembler in the kernel for ages (or actually many, for many architectures: used e.g. for kprobes and uprobes etc to analyze function prologues to drop probe points there). This is just one of many users.
Posted Apr 19, 2020 21:09 UTC (Sun) by robbe (guest, #16131)
So… the module breaking regularly is par for the course, at least when considering VMware.