
NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

May 12, 2010

This article was contributed by Koen Vervloesem

On May 6, NLUUG held its Spring Conference with the theme System Administration. There were many talks about very specific tools or case studies, but one struck your author because it married conceptual simplicity with a useful goal: Minimizing service windows on servers using NanoBSD + ZFS + jails by Paul Schenkeveld. Over the last four years, Paul has searched for ways to upgrade applications on a server with minimal downtime. The system he implemented is now in production on various servers, which require only a few seconds of downtime for an application upgrade, and the same amount of time for a rollback if the upgrade fails.

Jail time

System administrators have to keep their servers up-to-date, but when they upgrade the operating system or applications, the server may be unavailable to users for some time, ranging from a few seconds to much longer. If the administrator is lucky, the upgrade can be scheduled after office hours when nobody uses the server, but if that's not possible, there will be some noticeable downtime for users. In the latter case, it's extremely important that the downtime remain minimal, or the system administrator will face some angry shouting.

And that's where Murphy comes around the corner: upgrades are notorious ways to introduce unexpected problems. If an upgrade breaks things, the system administrator has to roll back to the previous upgrade. But this introduces some additional downtime, which is often longer than the upgrade downtime because rollbacks are not as easy as upgrades.

To minimize the risk of upgrades, there has been a trend to use one (virtual) server per application. Doing this, a problematic upgrade will impact only one application at a time. As a FreeBSD user, Paul obviously chose to isolate applications in jails, a lightweight form of operating-system-level virtualization. Each jail can be upgraded separately, minimizing the risk that other applications on the same server have to be brought down to fix one application's failed upgrade.

ZFS snapshots to the rescue

The root filesystem of each jail in Paul's system is in fact a read-only snapshot of the filesystem of a template jail (Paul called it a "prototype jail"). This allows the system administrator to prepare upgrades offline: upgrade the operating system and applications (FreeBSD ports) in the template jail and create a new snapshot of its filesystem (with the zfs snapshot command). Upgrading a production jail then involves stopping the jail, changing its root filesystem to the new snapshot and then restarting it. This upgrade takes only a few seconds because it has been prepared offline. Using snapshots rather than just multiple root filesystem images has the advantage that it saves space because the filesystems of all the jails are mostly identical.
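One way to realize this with standard ZFS and jail commands is to clone the new snapshot and mount the clone as the jail's root. This is only a sketch: the pool, dataset, and jail names are illustrative, not taken from Paul's setup, and the rc.d invocations assume FreeBSD 8's jail handling.

```sh
# Prepare the upgrade offline in the template jail, then freeze it:
zfs snapshot tank/jails/template@upgrade-2010-05

# Swap a production jail onto the new root (seconds of downtime):
/etc/rc.d/jail stop www
zfs clone tank/jails/template@upgrade-2010-05 tank/jails/www-new
zfs set mountpoint=none tank/jails/www-old       # park the old root
zfs set mountpoint=/jails/www tank/jails/www-new # mount the new one
/etc/rc.d/jail start www
```

Because the clone shares all unmodified blocks with the snapshot, the per-jail space cost stays small no matter how many jails run from the same template.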

When unexpected problems crop up, rolling back is as easy as stopping the jail, changing its root filesystem to the previous snapshot and then restarting it. The system administrator can then investigate and fix the problem offline, without any unneeded downtime.
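Under the same hypothetical names, the rollback is the mirror image of the upgrade: the previous clone still exists untouched, so it is simply moved back into place.

```sh
/etc/rc.d/jail stop www
zfs set mountpoint=none tank/jails/www-new       # park the broken root
zfs set mountpoint=/jails/www tank/jails/www-old # previous root, intact
/etc/rc.d/jail start www
```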

Obviously, a read-only filesystem is not enough: the applications that run inside the jail need to be able to write their data. Therefore, the directory tree for a jail is set up as a combination of the read-only snapshot of the template jail's root filesystem with some read-write filesystems for the applications, such as /var and /home.
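A sketch of that layout, again with hypothetical dataset names: the root comes read-only from the template, while per-jail writable datasets are mounted on top of it.

```sh
# Writable data lives in its own datasets, mounted inside the jail root:
zfs create -o mountpoint=/jails/www/var  tank/jaildata/www-var
zfs create -o mountpoint=/jails/www/home tank/jaildata/www-home
```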

ZFS, which has been deemed production-ready since FreeBSD 8, easily handles many filesystems, has a flexible quota system, and offers fast snapshots (thanks to the copy-on-write nature of the filesystem). That makes it a perfect match for building a robust upgrade-and-rollback scheme, something that has also been used by Nexenta.

Inspired by embedded systems

Combining jails and ZFS is straightforward, but Paul went further and thought about the underlying operating system. He was inspired by the way embedded systems are operated, which is why he looked at NanoBSD. This is a toolkit that comes with the FreeBSD base system and creates a FreeBSD system image for embedded applications, suitable for use on a Compact Flash card.

NanoBSD is an interesting server choice for a number of reasons. First, it has the same functionality as FreeBSD, unless specific features are explicitly removed from the NanoBSD image when it is created. Moreover, every application that exists as a FreeBSD port or package can be installed and used in NanoBSD. The main differences are that the complete operating system is built and upgraded offline and that the root filesystem is mounted read-only.
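Building such an image uses the stock nanobsd.sh script from the FreeBSD source tree. The configuration file below is a minimal illustrative example, not Paul's actual configuration.

```sh
# myserver.nano -- illustrative NanoBSD configuration
NANO_NAME=myserver
NANO_DRIVE=da0        # target flash device
NANO_KERNEL=GENERIC   # kernel configuration to build

# Build the image (run as root; this does a full buildworld):
cd /usr/src/tools/tools/nanobsd
sh nanobsd.sh -c myserver.nano
```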

The drive where NanoBSD is installed is divided into three partitions by default: two image partitions and one configuration partition. All of them are mounted read-only. The /etc and /var directories are memory disks (i.e. RAM disks). After the system boots, the configuration partition is briefly mounted read-only under the /cfg directory and all files in it are copied to /etc. If the system administrator wants to make persistent changes to a file (say, /etc/resolv.conf), they have to mount the configuration partition under /cfg, copy the modified files from /etc to /cfg, and unmount the configuration partition. The whole system is set up this way to minimize the number of writes to the flash drive and prolong its life. And with this completely read-only system, there is no need to run fsck after a non-graceful shutdown, so the system reboots with minimal downtime.
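In practice, persisting a change looks like this (the /cfg mount point is NanoBSD's convention; the file is just an example):

```sh
# Persist an edited /etc/resolv.conf across reboots:
mount /cfg                  # config partition, normally left unmounted
cp /etc/resolv.conf /cfg/
umount /cfg                 # unmount again to spare the flash
```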

The update process of NanoBSD is also well thought-out. While the running NanoBSD is installed on one of the image partitions, a newly built NanoBSD image is written to the other image partition. Then the system is rebooted and started from the newly installed partition. If anything goes wrong, the system can be rebooted back into the previous partition, which still contains the old, working image. The system administrator can then investigate and fix the problem offline.
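A hand-driven sketch of that swap; the device and file names are illustrative, and NanoBSD also generates helper scripts (updatep1/updatep2) in the image that automate the same steps.

```sh
# Write the freshly built image to the inactive (second) partition:
dd if=/path/to/new-image.s2 of=/dev/da0s2 bs=64k
boot0cfg -s 2 da0           # boot from slice 2 on the next reboot
reboot

# If the new image misbehaves, fall back to the old one:
boot0cfg -s 1 da0
reboot
```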

Tying it all together

Combining these three technologies (NanoBSD, ZFS, and jails), Paul reached his goal of setting up a FreeBSD server that can be upgraded with minimal downtime. All user-visible applications run in jails. Underneath the jails, a minimal FreeBSD operating system runs, built using the NanoBSD script. This holds the kernel, some low-level services, and the tools for building a new system image for upgrading the operating system. The NanoBSD system image can be put on a partition of a regular disk drive, but Paul prefers to put it on a separate flash drive, because NanoBSD is specifically designed for that, and using a separate drive for the operating system makes recovery easier when the hard drives holding the jails fail.

System administration on this system is much the same as on a regular FreeBSD system, except for software maintenance. Of course the embedded roots of NanoBSD mean that the system administrator needs to be aware of the differences from a regular server operating system. The volatile nature of /etc is one example: it's easy to forget to copy all changed configuration files to /cfg to preserve the changes after a reboot.

The directory /var is also a memory disk, so by default NanoBSD doesn't keep log files, which is not helpful to the system administrator. One solution is to put /var on a hard disk instead of using a memory disk, but then the operating system depends on the hard disk, which Paul wanted to avoid. Therefore, he chose another solution: telling syslog to log to a syslog daemon on another host or to a syslog daemon inside a jail on the same system.
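The syslogd side of that is a one-line configuration change; the log host name below is illustrative.

```sh
# /etc/syslog.conf -- forward all messages to a remote log host
# (which may also be a syslogd running inside a jail on this machine)
*.*    @loghost.example.org
```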

A general architecture

Each system administrator has their own way of configuring servers. In the beginning of his talk, Paul warned that he didn't mean to provide the best solution for every situation; he just wanted to describe the way that he builds and maintains FreeBSD servers. Although his talk was specifically about the combination of NanoBSD, ZFS, and jails, the architecture he described was general enough to be usable elsewhere. The same ideas can be implemented with other minimal operating systems, other filesystems with snapshot abilities, and other forms of operating-system-level virtualization.

While FreeBSD jails are probably the most well-known type of operating-system-level virtualization, other operating systems have it too. OpenSolaris for instance has Zones, which are even more flexible than FreeBSD jails. Linux has similar solutions, such as OpenVZ and LXC (Linux containers). In particular, OpenVZ (and its proprietary variant Virtuozzo Containers) is popular among providers of virtual private servers. So on the virtualization level, the same architecture that Paul uses is perfectly possible on Linux.

The second important component in Paul's scheme is filesystem snapshots. Although ZFS is not available for Linux (at least not in kernel space), there are many other snapshot technologies. For example, LVM (Logical Volume Manager) has a snapshot facility, as does Btrfs (which we looked at a while back). So, for example, with OpenVZ and LVM, Linux should be perfectly capable of creating OpenVZ containers based on snapshots of some template container. Proxmox already makes use of LVM snapshots to create a backup of a container without downtime.
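A rough Linux equivalent of the jail upgrade might look like the following; the volume group, volume names, and container ID are hypothetical. Note that LVM snapshots can be created writable, so the snapshot itself can serve as the new container's root without a separate clone step.

```sh
# Create a writable snapshot of the template container's volume:
lvcreate --snapshot --size 2G --name ct101-root /dev/vg0/template-root

# Mount it as the private area of an OpenVZ container and start it
# (assumes a container 101 has already been configured with vzctl):
mount /dev/vg0/ct101-root /var/lib/vz/private/101
vzctl start 101
```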

The last step Paul took is maybe the most interesting one: at a time when operating systems, even for servers, are getting more and more bloated, it's refreshing to see him take some inspiration from the embedded world. While at first the limitations of an embedded operating system on read-only storage seem too cumbersome, they actually help a lot in clearly separating the operating system from its configuration and applications, which can only be good. Moreover, the dual-image approach and fast reboots make upgrades to the host operating system more robust and less intrusive.

Because Linux shines in the embedded world, it's not very far-fetched to try this approach in the Linux world too. When leaving Paul's talk, your author heard a couple of people in the audience thinking out loud that it would be interesting to have such a system using Linux. A quick search didn't turn up anything useful, but it's clear that it should be possible, most likely with LVM snapshots and OpenVZ containers.



NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

Posted May 13, 2010 16:55 UTC (Thu) by bronson (subscriber, #4806) [Link]

Funny, I'm evaluating basically this setup using LXC and btrfs on 2.6.32. Verdict: not bad. There are some wrinkles, like LXC can't tell when guests have restarted or shut down, and sometimes it takes effort to figure out which guest is the one hammering system resources. I also expect that btrfs will get cranky once I go past 50% disk utilization but I haven't seen any problems there yet.

I'm trying F12 (333MB), Debian Sid (211MB), and Ubuntu Lucid (152MB) as guests, all set up with debootstrap/febootstrap. They're all fairly bloated. I'd love something like NanoBSD that still feels like a regular Linux server.

Overall, it's been nice and stable. Want to create a new guest? Just create a btrfs snapshot, copy a config file, and lxc-start it. Takes all of 10 seconds.

Not sure I'd recommend this for a production environment yet but I'm very happy with it so far. I'm excited to watch LXC continue to mature.

NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

Posted May 14, 2010 19:28 UTC (Fri) by pspinler (subscriber, #2922) [Link]

Is the full presentation available anywhere online?

-- Pat

NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

Posted May 17, 2010 9:00 UTC (Mon) by biged (subscriber, #50106) [Link]

> Is the full presentation available anywhere online?

I think this is it:

video: http://www.youtube.com/watch?v=LT85JTFJGxM

handout: http://www.psconsult.nl/talks/AsiaBSDcon2010-Servers/hand...

NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

Posted May 17, 2010 16:56 UTC (Mon) by jeremiah (subscriber, #1221) [Link]

I think I'd be inclined to do it the following way to keep it a little more commodity distribution based. KVM and btrfs.
1) snapshot the running server.
2) copy the snapshotted image and start it up as a new running instance
3) upgrade new instance
4) verify instance
5) restart the original server, pointing it at the new instance image.
5.a) maybe shut down original instance and rsync data directories over
5.b) maybe copy new image over onto original image.

I'm sure a decent portion of this could be scriptable as well. Anyway, just my two cents. This is similar to what I do now, except a number of things have to be shut down before the image can be made. btrfs snapshots seem like they would be good for taking a snapshot of an image while at rest, to make sure there aren't corruption issues, without the need for all of that extra space.

NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

Posted May 18, 2010 18:51 UTC (Tue) by bronson (subscriber, #4806) [Link]

Can multiple KVM instances run well sharing a single btrfs filesystem? That's what's so great about containers/jails/etc over full virtualization -- there are no problems sharing the same filesystem.

Maybe you meant snapshotting using LVM? I've done that before. It works fine, and you can boot different machines onto different snapshots of the same filesystem. For some reason though my attempts to use LVM usually end up in tears.

Or maybe you're proposing doing everything offline and never booting more than one machine on the btrfs filesystem. That would work, but I'm not sure what btrfs would buy you over using a more mature filesystem on LVM. Ultimately, I'm skeptical the minor benefits would be worth the downtime and effort it would take to set that up!

NLUUG: Minimizing downtime on servers using NanoBSD, ZFS, and jails

Posted May 18, 2010 22:21 UTC (Tue) by jeremiah (subscriber, #1221) [Link]

I'm suggesting using multiple disk images sitting on a single btrfs system. That way an image can be copied without shutting down the instance, because we made a snapshot of the whole btrfs system before we started the instance, or instances. The only reason I'm suggesting btrfs in place of LVM is that I don't care for LVM, and I keep hearing people say btrfs is the future of Linux filesystems.

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds