From: Thomas Gleixner <tglx-AT-linutronix.de>
To: Zachary Amsden <zach-AT-vmware.com>
Subject: Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree
Date: Tue, 06 Mar 2007 11:59:34 +0100
Cc: Ingo Molnar <mingo-AT-elte.hu>, akpm-AT-linux-foundation.org, ak-AT-suse.de,
 Daniel Hecht <dhecht-AT-vmware.com>,
 Virtualization Mailing List <virtualization-AT-lists.osdl.org>,
 Jeremy Fitzhardinge <jeremy-AT-goop.org>,
 Rusty Russell <rusty-AT-rustcorp.com.au>,

On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
> > a proper CE device also has the added bonus of making high-res timers
> > guests work automatically. It should be simple: just pass it through to
> > your hypervisor, a hyper-CE-device, like a hyper-clocksource device has
> > essentially no guest-side complexity.
> It is not so simple. In theory it works great. In reality, the i386
> implementation is completely hardwired to work the way hardware works,
> and breaking the clockevent code out of the deep ties to the APIC is
> extremely non-trivial. We tried, and could not accomplish it for 2.6.21
> because the hrtimers integration was complex, and introduced many bugs
> for us.
Why is this so non-trivial? All you have to do is _NOT_ register
PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
which uses the hypervisor timer emulation instead of real hardware.
clockevents breaks the hardwired assumptions of the old timer code and
allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
/* Disable PIT. */
outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
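
To illustrate the registration scheme described above, here is a minimal user-space sketch in C. It is not code from this thread or from the kernel: `struct ce_device` is a simplified stand-in for the kernel's clock event device, and `hv_set_oneshot_timer()`, `hv_advance()`, `ce_register()` and `guest_time_init()` are hypothetical names that mock a hypervisor timer and a per-CPU registry. The only point it tries to show is that the guest registers one device per CPU whose "program next event" callback forwards to the hypervisor, and never touches PIT/HPET/APIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4 /* hypothetical number of virtual CPUs */

/* Simplified stand-in for a clock event device (assumption, not the kernel struct). */
struct ce_device {
    const char *name;
    int rating; /* higher rating wins when several devices compete */
    int cpu;
    int (*set_next_event)(uint64_t delta_ns, struct ce_device *dev);
    void (*event_handler)(struct ce_device *dev);
};

/* Mock hypervisor state: one one-shot deadline per virtual CPU. */
static uint64_t hv_now_ns;
static uint64_t hv_deadline_ns[NR_CPUS];
static int hv_armed[NR_CPUS];

/* Hypothetical hypercall: arm a one-shot timer for this virtual CPU. */
static int hv_set_oneshot_timer(int cpu, uint64_t delta_ns)
{
    hv_deadline_ns[cpu] = hv_now_ns + delta_ns;
    hv_armed[cpu] = 1;
    return 0;
}

/* The guest side only forwards the request; no hardware is programmed. */
static int hyper_ce_set_next_event(uint64_t delta_ns, struct ce_device *dev)
{
    return hv_set_oneshot_timer(dev->cpu, delta_ns);
}

static struct ce_device hyper_ce[NR_CPUS];
static struct ce_device *percpu_ce[NR_CPUS]; /* active device per CPU */
static unsigned int ticks[NR_CPUS];

static void tick_handler(struct ce_device *dev)
{
    ticks[dev->cpu]++;
}

/* Mock registry: keep the highest-rated device for each CPU. */
static void ce_register(struct ce_device *dev)
{
    assert(dev->set_next_event != NULL);
    assert(dev->cpu >= 0 && dev->cpu < NR_CPUS);
    if (percpu_ce[dev->cpu] == NULL || dev->rating > percpu_ce[dev->cpu]->rating)
        percpu_ce[dev->cpu] = dev;
}

/* Guest time init: register only the hyper-CE-device, one per CPU. */
void guest_time_init(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        hyper_ce[cpu] = (struct ce_device){
            .name = "hyper-ce",
            .rating = 300, /* arbitrary value for this sketch */
            .cpu = cpu,
            .set_next_event = hyper_ce_set_next_event,
            .event_handler = tick_handler,
        };
        ce_register(&hyper_ce[cpu]);
    }
}

/* Mock hypervisor: advance time and deliver expired one-shot timers. */
void hv_advance(uint64_t ns)
{
    hv_now_ns += ns;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (hv_armed[cpu] && hv_deadline_ns[cpu] <= hv_now_ns) {
            hv_armed[cpu] = 0; /* one-shot: disarm before delivery */
            if (percpu_ce[cpu] != NULL)
                percpu_ce[cpu]->event_handler(percpu_ce[cpu]);
        }
    }
}
```

In this sketch, calling `guest_time_init()` and then `percpu_ce[1]->set_next_event(1000, percpu_ce[1])` arms a timer for CPU 1 only; the handler runs once the mock hypervisor clock has advanced by 1000 ns via `hv_advance()`. A real implementation would also have to deal with device modes, interrupt delivery and CPU hotplug, which are left out here.
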
> We worked around this by keeping NO_IDLE_HZ support, which now
> you deprecated. So now we are using NO_HZ without a hyper-CE device,
> and it is working fine. We understand the benefits of moving to the CE
> model - but it cannot be done overnight.
This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
irq_enter() and irq_exit() so the clockevents code is actually invoked.
I have not looked closely enough to understand why this works at all.
I have the feeling that "working fine" means something like "does not
We really want to fix this now instead of pushing some "do not know why
it works" hack into the kernel.