

SmartOS: virtualization with ZFS and KVM

September 21, 2011

This article was contributed by Koen Vervloesem

On August 15, at KVM Forum 2011, Bryan Cantrill, VP Engineering at Joyent, gave a presentation entitled "Experiences Porting KVM to SmartOS." The SmartOS in the title is Joyent's illumos-based operating system that is the foundation of its public cloud and its SmartDataCenter product. With this talk, Cantrill essentially announced that Joyent has ported KVM to the illumos (Solaris) kernel.

Thanks to its illumos base, Joyent's SmartOS already had several key features for a cloud operating system, such as the ZFS file system, the dynamic tracing possibilities of DTrace, network virtualization with Crossbow, and operating system-level virtualization (Zones) to isolate virtual operating systems, all running on the same kernel. However, one essential piece was missing in this puzzle of enterprise technologies: hardware virtualization. Granted, a few years ago OpenSolaris had Xen Dom0 support (called xVM), even with hardware virtualization, but the project was abandoned even before Oracle walked away from OpenSolaris.

Joyent (which is a member of the Open Virtualization Alliance dedicated to the awareness and adoption of KVM) believes in the thesis that the best hypervisor is the host operating system itself, because anyone attempting to implement a thin hypervisor would end up retracing the history of operating systems. This is exactly the vision of KVM, so when Joyent decided in the fall of last year that it needed to port KVM to SmartOS, this was a natural (but not trivial) choice.

Because its resources were constrained, Joyent decided to focus exclusively on KVM support for Intel processors. More specifically, a machine running KVM on illumos needs an Intel processor with VT-x and EPT (Extended Page Tables), such as the Nehalem Core i3/i5/i7. However, the developers made sure that they didn't make decisions that would impede later AMD support. Also, only x86-64 hosts and x86 and x86-64 guests are supported. Apart from these constraints, one of the design goals was that the KVM port to illumos would maintain compatibility with the QEMU/KVM interface as much as possible.

Porting an unportable component

At first sight, it seems impossible to port a component that is essentially not designed to be portable: KVM is very specific to Linux. Some of the Linux-specific facilities that KVM uses could be emulated in the illumos kernel, but, because of the big differences between the Linux and the illumos kernel, this would not be a clean solution. Joyent engineer Max Bruning started working on the port in the fall of 2010 by copying the KVM bits from the stable Linux 2.6.34 source and getting them to compile on illumos; in April 2011 he was joined by Robert Mustacchi and Bryan Cantrill. As Cantrill explained in his presentation, DTrace (co-invented by Cantrill when he was working at Sun) was essential in the porting process: it let them see how heavily the not-yet-ported code was exercised by virtual machines.

In his blog post KVM on illumos, Cantrill explains why the porting effort was so grueling:

Because KVM is so tightly integrated into the Linux kernel, it was difficult to determine dependencies - and hard to know when to pull in the Linux implementation of a particular facility or function versus writing our own or making a call to the illumos equivalent (if any).

According to Joyent's measurements, the illumos KVM port performs as well as Linux KVM, at bare-metal speeds for entirely CPU-bound workloads. For other workloads, such as a MySQL benchmark, the performance is obviously not at full bare-metal speed, but the Linux and illumos implementations of KVM don't diverge much from each other. Tested guest operating systems include SmartOS, 64-bit Linux, Windows Server 2008 R2, FreeBSD, OpenBSD, QNX, Plan 9 and Haiku. As for VM resources, the same limitations as for Linux KVM apply: up to 256 virtual CPUs per virtual machine and up to 2 TB virtual memory.

Limitations and enhancements

Illumos KVM diverges from Linux KVM in some areas. For starters, apart from the focus on Intel and x86/x86-64 there are some limitations in functionality. As a cloud provider, Joyent doesn't believe in overselling memory, so it locks down guest memory in KVM: if the host doesn't have enough free memory to lock down all the memory the virtual machine needs, the guest fails to start. Using the same argument, Joyent also didn't implement kernel same-page merging (KSM), the Linux functionality for memory deduplication. According to Cantrill's presentation, it's technically possible to implement this in illumos, but Joyent doesn't see an acute need for it. Another limitation is that illumos KVM doesn't support nested virtualization.
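The no-overcommit policy described above can be illustrated with a toy model. This is only a sketch of the admission rule, not Joyent's implementation: the Host class, its method names, and the memory figures are all hypothetical.

```python
class Host:
    """Toy model of a host that locks down all guest memory up front."""

    def __init__(self, total_mib):
        self.total_mib = total_mib
        self.locked = {}  # guest name -> MiB locked for that guest

    @property
    def free_mib(self):
        # Memory not yet locked down by any running guest.
        return self.total_mib - sum(self.locked.values())

    def start_guest(self, name, mem_mib):
        # No overcommit: refuse to start unless the whole allocation fits.
        if mem_mib > self.free_mib:
            raise MemoryError(
                f"cannot lock {mem_mib} MiB for {name}: "
                f"only {self.free_mib} MiB free"
            )
        self.locked[name] = mem_mib


if __name__ == "__main__":
    host = Host(total_mib=8192)
    host.start_guest("guest1", 4096)
    host.start_guest("guest2", 2048)
    try:
        host.start_guest("guest3", 4096)  # would oversubscribe the host
    except MemoryError as err:
        print("start failed:", err)
```

The point of the design is predictability: a guest that starts is guaranteed its memory, at the cost of running fewer guests per host than an overcommitting setup would.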

However, tying KVM to illumos instead of Linux also makes some interesting enhancements possible. For instance, you can create ZFS volumes for your virtual machine images. At first sight, this looks like just a convenient way to store your VM images, but it really improves the virtualization experience. Because ZFS can clone volumes in constant time, you can provision new KVM guests nearly instantly if you already have a reference image. Moreover, ZFS remote replication with the zfs send and zfs receive commands makes an efficient foundation for remote cloning and migration of virtual machines over the network. To streamline this, Cantrill intends to integrate QEMU migration with ZFS remote replication. Also, because ZFS has a unified adaptive replacement cache (ARC), guest I/O will be efficiently cached in the host, resulting in better random I/O performance.
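The provisioning and replication workflow described above can be sketched as a short sequence of zfs commands. The Python below only builds and prints the command lines; it does not run them. The pool name zones, the dataset names, the snapshot names, and the host backup-host are hypothetical; the subcommands themselves (create -V, snapshot, clone, send, receive) are the standard zfs ones.

```python
import shlex


def provision_commands(pool, ref_image, guest, size="10G"):
    """Return zfs command lines that clone a reference volume for a new guest."""
    ref = f"{pool}/{ref_image}"
    snap = f"{ref}@golden"  # hypothetical snapshot name
    return [
        ["zfs", "create", "-V", size, ref],          # one-time: reference volume
        ["zfs", "snapshot", snap],                   # freeze the reference image
        ["zfs", "clone", snap, f"{pool}/{guest}"],   # constant-time copy per guest
    ]


def replicate_pipeline(pool, dataset, snapshot, remote_host):
    """Return a shell pipeline that streams a snapshot to another host."""
    src = f"{pool}/{dataset}@{snapshot}"
    send = ["zfs", "send", src]
    recv = ["ssh", remote_host, "zfs", "receive", f"{pool}/{dataset}"]
    return f"{shlex.join(send)} | {shlex.join(recv)}"


if __name__ == "__main__":
    for cmd in provision_commands("zones", "ref-image", "guest1"):
        print(shlex.join(cmd))
    print(replicate_pipeline("zones", "guest1", "migrate", "backup-host"))
```

In practice the reference volume would be populated with a guest installation before the snapshot is taken, and the guest volume would need its own snapshot (here assumed to be named migrate) before it can be sent.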

Another improvement that illumos makes possible is the use of Zones, the OS virtualization feature. While the illumos KVM implementation doesn't require it, SmartOS runs KVM guests in a local zone, with the QEMU process as the only process running in the zone. While zones were originally used for resource management, quality of service, I/O throttling, and so on, containing QEMU in its own zone also improves security by reducing the attack surface for QEMU exploits. If an exploit 'breaks out of the virtual machine', it's still contained in the local zone and has no access to the global zone of the virtualization host or the local zones of the other virtualization guests.

Another interesting feature of illumos is network virtualization, also known as Crossbow. With a few commands, you can create a virtual network interface card (VNIC) per KVM guest. The SmartOS developers have written some glue code to connect this feature to virtio and have been able to attain 1Gbit/second data rates to/from a KVM guest. VNICs also make it easier to manage a virtual machine's network usage. Thanks to flow management, guests can be capped at specified levels of bandwidth, and guests can be confined to specified IP addresses, thereby making IP spoofing impossible.
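As a rough illustration of the "few commands" involved, the sketch below builds (but does not execute) dladm command lines that create a VNIC, cap its bandwidth, and restrict it to one IP address. The link name e1000g0, the VNIC name vnic0, and the address are hypothetical, and the link properties used (maxbw, protection, allowed-ips) are the ones documented in dladm(1M) as far as I know; this is not necessarily how the SmartOS glue code configures guests.

```python
import shlex


def vnic_commands(physical_link, vnic, max_bandwidth, allowed_ip):
    """Return dladm command lines for a capped, spoof-protected VNIC."""
    return [
        # Create the VNIC on top of a physical link.
        ["dladm", "create-vnic", "-l", physical_link, vnic],
        # Cap bandwidth; a suffix of K, M, or G selects Kbps, Mbps, or Gbps.
        ["dladm", "set-linkprop", "-p", f"maxbw={max_bandwidth}", vnic],
        # Enable IP anti-spoofing and list the only address the guest may use.
        ["dladm", "set-linkprop", "-p", "protection=ip-nospoof", vnic],
        ["dladm", "set-linkprop", "-p", f"allowed-ips={allowed_ip}", vnic],
    ]


if __name__ == "__main__":
    for cmd in vnic_commands("e1000g0", "vnic0", "100M", "192.0.2.10"):
        print(shlex.join(cmd))
```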

And last but not least, the dynamic tracing possibilities of the DTrace framework let system administrators understand the workload characteristics of their virtual machines and help with troubleshooting. For this purpose, Joyent has added some DTrace probes to QEMU and KVM to examine the behavior of KVM guests. They have even integrated several of these metrics into their Cloud Analytics tool to visualize the KVM guest behavior in graphs. In his presentation, Cantrill even suggests that, thanks to the better visibility into guest behavior, DTrace will help in finding performance improvements for KVM, which will likely carry over from illumos KVM to the original Linux implementation.

The source

Joyent was already a big contributor to illumos, the successor to the OpenSolaris community. However, their KVM port is the first major addition of functionality to the illumos source since Oracle let the OpenSolaris community die. The source code is published in two parts: a GitHub repository for KVM itself (illumos-kvm) and one with some minor patches to QEMU 0.14.1, all of which they intend to upstream (illumos-kvm-cmd). Other KVM-specific tooling (such as the kvmstat command for monitoring of KVM statistics) has already been upstreamed to illumos itself. According to Joyent, this port is at or near production quality.

Joyent hasn't pushed any bug fixes back to KVM, but the reason for this is quite simple: they didn't find any bugs in KVM. Cantrill explains this in an email interview:

I was actually surprised by this: while I knew that KVM broadly worked, I would have assumed that we would have found some bug at some point in KVM – but all of the bugs we found in the course of the project were essentially self-inflicted wounds. The high quality of KVM is a tribute to both Avi Kivity and to the KVM engineering team – but also to the folks who have put together the automated testing for KVM: after having met Lucas Rodrigues and the KVM autotest team, it's clear that the quality in KVM is due at least in part to a superlative verification effort.

As for the enhancements Joyent's team made to illumos KVM: all of them are specific to illumos features like ZFS, DTrace, Zones, Crossbow, kstats, and the like. As these features do not exist on a Linux host, it doesn't make any sense to upstream them.

But of course there have been a lot of changes to Linux KVM since 2.6.34, the version on which illumos KVM is based. Cantrill is not very concerned about this; he explains:

In part because KVM is so rock-solid, we are less concerned about being based on Linux 2.6.34: we are monitoring patches against 2.6.34, and will incorporate those patches into our implementation as appropriate, but we don't feel a desire to track Linux KVM any more tightly than that. The features that have been implemented since 2.6.34 are not ones that we feel strongly about integrating. For example, nested virtualization adds a tremendous amount of complexity but brings essentially no value for us – we are using KVM in a datacenter environment where nested virtualization is of dubious utility. All of this is not to say that we won't revisit this in the future – but for now we are 2.6.34-based.

The license

Of course some questions arose about the license: Joyent has copied the GPL-ed KVM code from Linux, while the illumos kernel uses the CDDL (Common Development and Distribution License). However, according to Cantrill this doesn't pose any problems. On his blog he answers a question from a reader about the issue:

Our KVM port remains GPL and its own work (and lives in its own repo) - the illumos kernel is CDDL but is in no way a derived work of our KVM port.

And on Hacker News he clarifies that their KVM port doesn't use the hooks that Linux KVM has into the Linux kernel (which are marked as EXPORT_SYMBOL_GPL in the Linux kernel): "Actually, our port does not use these hooks – there were zero mods to the illumos kernel to support KVM per se." So, although there seem to be some questions about the legality of the KVM module in illumos, the developers are fairly confident that the problems don't apply because the illumos kernel (CDDL-licensed) is not a derived work of the illumos KVM module (GPL-licensed).

SmartOS and OpenIndiana

SmartOS, for which you can find the source on GitHub (smartos-live), can be downloaded as an ISO or USB image, and it's a minimal live distribution meant to run as a virtualization server. SmartOS just boots from a USB stick or CD-ROM and runs from RAM. It's not meant to be installable, and Joyent doesn't intend to develop an installer, but with some elaborate commands it's possible to install SmartOS on a hard disk.

When running SmartOS from RAM, changes made on the running system naturally don't persist across boots. This doesn't have to be a big issue, as long as you don't want to change your operating system's configuration. Just create one or more ZFS pools with zpool create to initialize your hard disks and to be able to store data on them. However, because of the transient nature of SmartOS, you have to manually import all pools with the zpool import command after each boot.
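A minimal sketch of that routine follows: a pool is created once, then re-imported after every boot. As with any such sketch, the pool name tank and the device names are hypothetical; the code only prints the zpool command lines rather than running them.

```python
import shlex


def create_pool(pool, devices, mirror=False):
    """Return the one-time zpool command that initializes the disks."""
    layout = ["mirror", *devices] if mirror else list(devices)
    return ["zpool", "create", pool, *layout]


def import_pools(pool=None):
    """Return the command to run after each boot; -a imports every pool found."""
    return ["zpool", "import", pool] if pool else ["zpool", "import", "-a"]


if __name__ == "__main__":
    print(shlex.join(create_pool("tank", ["c0t0d0", "c0t1d0"], mirror=True)))
    print(shlex.join(import_pools()))
```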

There's some scarce and fragmented documentation on the SmartOS wiki, with some help about creating zones, creating virtual machines, and other tasks. If you're not comfortable with the Solaris commands, you can also read the topic Finding Your Way Around a SmartMachine on the wiki of SmartMachines, Joyent's commercial cloud offering based on SmartOS.

As SmartOS is really stripped down to only the minimal bits needed to work as a virtualization host, many tools for other purposes are missing out of the box. To install extra software, you can use pkgsrc, NetBSD's portable package manager, by downloading the pkgsrc bootstrap image and unpacking it.

If you really want to have illumos KVM installed on hard disk instead and don't want to get your hands dirty with the manual installation of SmartOS, there's another option. OpenIndiana, the spiritual successor of the OpenSolaris distribution, recently released its development build 151a exactly one year after the initial release of the distribution. It's the first build based on the illumos core and it also has integrated Joyent's KVM support. Installing OpenIndiana oi_151a gives you a GNOME desktop for a workstation or a text-based installation for headless servers, and the KVM bits can be installed with the pkg install command of the package manager IPS (Image Packaging System). The OpenIndiana wiki shows you the needed commands.


If anyone doubted that illumos would be able to build enough momentum, Joyent's KVM port to illumos and the subsequent illumos-based OpenIndiana development release have surely answered these doubts. Illumos appears to be here to stay, and it offers a lot of interesting technology, such as ZFS, DTrace, Crossbow, Zones, and now KVM. For Linux users who were interested in these Solaris technologies but wouldn't want to lose their favorite hypervisor KVM, SmartOS and OpenIndiana are now able to offer the best of both worlds.


Brief items

Distribution quotes of the week

I'm a bit concerned that the future of linux on the desktop is going to be one where your choices are things like Android, ChromeOS, Ubuntu, Gnome OS, or a "KDE OS." Each one would have its own package managers, repositories, distros, APIs, etc. Clearly there is some benefit from the vertical integration (Android and ChromeOS have a very high level of polish, and Ubuntu is approaching this often by just writing their own stuff). Instead of working to influence other projects (which can be frustrating), big distros are looking to just eliminate dependencies outside of themselves.

This will be a big challenge for a smaller distro like Gentoo. Obviously we can't just go write our own Wayland replacement, even if we did essentially make our own "systemd" of sorts.

-- Rich Freeman

On a more general note, while I agree that it's preferable to get packages into Ubuntu by way of Debian, it's important to recognize that the two distros have very different workflows, schedules, tools, and cultures. Both work very well when you're mainly concentrating on one distro or the other, but there are still a lot of rough spots when crossing over between the two. It's true that Ubuntu benefits greatly from the work done in Debian, and I think it would be valuable to find ways for Debian to benefit more from the work done in Ubuntu. Somehow it just seems there should be better ways to share bugs, branches, reputation, reviews, and experience.
-- Barry Warsaw


Tumbleweed backs off on systemd for now

The openSUSE "Tumbleweed" distribution has removed systemd (which had been available as an optional package) for now. "Due to a number of interdependancies on packages that are not ready for Tumbleweed, and other interactions with the system that are causing problems for some users, I'm going to remove systemd from Tumbleweed today to allow the developers to spend more time on getting it stable for Factory and 12.1 instead of having to chase down problems that are specific to Tumbleweed only." Interestingly, almost all of the followup discussion is about whether this change should be announced more widely or kept quiet.


Distribution News

Debian GNU/Linux

Upcoming Debian point releases and call for test

The Debian Project has announced that the upcoming point releases for Debian 5 "Lenny" and Debian 6 "Squeeze" are scheduled for October 1st and October 8th respectively. Interested users are encouraged to test these packages before their release, especially on systems that use the updated drivers.


Newsletters and articles of interest

Distribution newsletters


Arch Linux - "It is what you make it" (The H)

The H looks at the simplicity of Arch Linux and its usefulness as a base for custom distributions. "Every Arch installation is different to every other Arch installation and is defined by its technical elegance and its adherence to the demands of the individual user or set of users. The aim is not a Linux for every man, but a Linux that is moulded to fit the demands of the individual user. So, as the Arch wiki expresses it, simplicity in the context of Arch Linux means "without unnecessary additions, modifications, or complications. In short; an elegant, minimalist approach.""


Fresh wind for openSUSE (The H)

The H talks with Andreas Jaeger at the openSUSE Conference in Nuremberg. "However, 'zero versions' reportedly tend to be regarded as 'major updates' and generate a high level of expectation among users. Apparently, this has repeatedly given rise to the opinion that a version wasn't ready for release, and that it contained too few new features for a zero release. Internally, the openSUSE team has never treated the zero versions as major releases, said Jaeger. The developers therefore decided to skip the 'zero release' and not release a version 12.0."


Poortvliet: openSUSE Conference Fun!

Jos Poortvliet wraps up last week's openSUSE Conference. "Lots of talks and discussions ranging from development and low-level kernel tools to social and marketing sessions have taken place over the last four days, all focused on world domination of course. There was a large number of sessions around packaging, both focusing on teaching as well as improving current packaging quality and more steam lined maintenance of our repositories. Robert Schwelkert's talk on "Where do we improve?" proposed a lot of changes like improved translations, documentation, separating the bugzilla and getting rid of Novell's iChain. The openSUSE Project Meeting discussed a number of interesting ideas and developments including the current status of the openSUSE Foundation and upcoming elections. The board said it was working on the foundation but it is a slow process. We want to have a long-term solution with buy-in from all parties. As Attachmate has just joined this process it has taken time to get them up to speed but there is progress now. "


Zacchiroli: why there are so many debian derivatives

Stefano Zacchiroli considers the advantages of using Debian as a base for derivative distributions. "But there are also a couple of "political" reasons for basing derivatives on Debian. One is quite subtle and applies mostly to commercial distributions. If you are designing one such commercial distro, you have to be based on an independent distro with no commercial interests, lest risking that petty (technical or otherwise) choices might be made just to undermine your business. Among "popular" GNU/Linux distros, Debian is essentially the only one which is both volunteer-based and not ascribable to any specific company." (Thanks to Paul Wise)


Page editor: Rebecca Sobol

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds