Breaking the application binary interface (ABI) between the kernel and user
space is a well-known taboo for Linux. That line may seem a little
blurrier to some when it comes to the ABI for tools like perf that ship
with the kernel. As a recent discussion on the linux-kernel mailing list
shows, though, Linus Torvalds and others still have that line in sharp focus.
The issue stems from what appears to be a fairly serious bug in some x86
processors. Back in
July, David Ahern reported
that KVM-based virtual machines would crash when recording certain
events on the host. On some x86 processors, the "Precise Events
Based Sampling" (PEBS) mechanism can be used to gather precise counts of
events like CPU cycles. Unfortunately, PEBS and hardware virtualization
don't play nicely together.
As Ahern reported, running:
perf record -e cycles:p -ag -- sleep 10
on the host would reliably crash all of the guests. That
particular command will record the events specified, CPU
cycles in this case, to a file; more information about perf
. It turns out that PEBS
incorrectly treats the contents of the Data Segment (DS) register as a guest address,
rather than as a host address. That leads to memory
corruption in the guest, which will crash all of the virtual machines on the
" (precise) attribute on the cycles
event (which can be
repeated for higher precision levels as in cycles:pp
) asks for more
which leads to PEBS being used. Without that attribute, the
cycle counts measured are less accurate, but do not cause the VM crashes.
That problem led Peter Zijlstra to change
perf_event.c in the kernel to disallow precise measurements
has been specifically excluded. Using the ":H" (host-only)
attribute will still allow precise measurements as perf will
set the exclude_guest flag on the event. That flag will inhibit
PEBS activity while in the guest. In addition, Ahern changed
perf so that exclude_guest would be automatically
selected if the "precise" attribute was set. There's just one problem with those solutions: existing
perf binaries do not set exclude_guest, so users
would get an EOPNOTSUPP error.
It turns out that one of those existing users is Torvalds, who complained that:
perf record -e cycles:pp
no longer worked for him. Ahern suggested
", but that elicited an annoyed response
from Torvalds. Why should he
have to add a new flag to deal with virtualization, when he isn't running
it? "That whole 'exclude_guest' test is insane when there isn't any
virtualization going on.
Ahern countered that it's worse to have VMs
explode because someone runs a precise perf. But that's beside
the point, as Torvalds pointed out:
You broke the WORKING case for old binaries in order to give an error
return in a case that NEVER EVEN WORKED with those binaries. Don't you
see how insane that is?
The 'H' flag is totally the wrong way around. Exactly because it only
"fixes" a case that was already working, and makes a case that never
worked anyway now return an error value. That's not sane. Since the
old broken case never worked, nobody can have depended on it. See why
I'm saying that it's the people who use virtualization who should be
forced to use the new flag, not the other way around?
Forcing existing perf binary users to change their habits is the
crux of the matter. Beyond breaking the ABI, which is clearly
not allowed, it makes perf break for real users as Ingo Molnar said: "Old, working binaries are actually our _most_
important usecase: it's 99.9% of our current installed base ...".
While it is certainly a problem that older kernels can have all their
guests crashed with a simple command, the proper solution is not to require
either upgrading perf or changing the flags (which could well be
buried in scripts or other automation).
Existing perf binaries set the exclude_guest flag to
zero, while binaries that have Ahern's change set it to one.
That means newer kernels that seek to fix the crashing
guest bug cannot rely on a particular value for that flag. The "proper"
way to have handled the problem is to use a new include_guest
flag (or similar), which defaults to zero. Older binaries cannot change
that flag (since they don't know about it), so the kernel code can use it
to exclude the precise flag for guests on x86 systems. Other architectures
may not suffer from the same restriction.
Beyond that, Torvalds argues that if the
user asks for a precise measurement but doesn't specify either the
"H" or "G" (include
guests) attribute, the code should try to do the right thing. That means it
should measure both the host and guests on systems that support it, while
backing off to just the host for x86. Meanwhile it could return
EOPNOTSUPP if the user explicitly asks for a broken combination
(e.g. precise and include guests on x86). Molnar concurred. Ahern seemed a
bit unhappy about things, but said
that he would start working on a patch that has not appeared yet.
It is worth noting that Torvalds admitted
that he could trivially recompile perf to get around the whole
problem; it was a principle that he was standing up for. Even though some
like perf are distributed with the kernel tree, that does not
relax the "no regressions" rule. Some critics of the move to add tools to
the kernel tree were concerned that it would facilitate ABI changes that
could be glossed over by requiring keeping the tools and kernel in
sync. This discussion clearly shows that not to be the case.
Having a way to crash all the VMs on a system is clearly undesirable, but
as Torvalds pointed out, that had been true for quite some time.
Undesirable behavior does not rise to the level of allowing ABI breakage,
In addition, distributions and administrators can always limit access to
to the root user—though that obviously may still lead to unexplained
Ahern noted. Molnar pointed out that the virtualization use case
much smaller piece of the pie, so making everyone else pay for a problem they
may never encounter just doesn't make sense. Either through a patch or a
revert, it would seem that the
"misbehavior" will disappear before 3.8 is released.
to post comments)