On May 6, NLUUG held its
Spring Conference with the theme System
Administration. There were a lot of talks about very specific tools or
case studies, but one struck your author because it married conceptual
simplicity with a useful goal: "Minimizing
service windows on servers using NanoBSD + ZFS + jails" by Paul
Schenkeveld. Over the last four years, Paul has searched for methods to
upgrade applications on a server with minimal downtime. The system he
implemented is in production now on various servers, which require only a few seconds of downtime for an application upgrade and the same amount of time for a rollback if the upgrade fails.
System administrators have to keep their servers up-to-date, but when they upgrade the operating system or applications, the server may be unavailable to users for some time, varying from a few seconds to much longer. If the administrator is lucky, they can schedule the upgrade after office hours when nobody uses the server, but if that's not possible, there will be some noticeable downtime for the users. In the latter case, it's extremely important that the downtime remains minimal, or the system administrator will face some angry shouting.
And that's where Murphy comes around the corner: upgrades are notorious
ways to introduce unexpected problems. If an upgrade breaks things, the
system administrator has to roll back to the previous version. But this
introduces some additional downtime, which is often longer than the upgrade downtime because rollbacks are not as easy as upgrades.
To minimize the risk of upgrades, there has been a trend to use one (virtual) server per application. Doing this, a problematic upgrade will impact only one application at a time. As a FreeBSD user, Paul obviously chose to isolate applications in jails, a lightweight form of operating-system-level virtualization. Each jail can be upgraded separately, minimizing the risk that other applications on the same server have to be brought down to fix one application's failed upgrade.
ZFS snapshots to the rescue
The root filesystem of each jail in Paul's system is in fact a read-only snapshot of the filesystem of a template jail (Paul called it a "prototype jail"). This allows the system administrator to prepare upgrades offline: upgrade the operating system and applications (FreeBSD ports) in the template jail and create a new snapshot of its filesystem (with the zfs snapshot command). Upgrading a production jail then involves stopping the jail, changing its root filesystem to the new snapshot and then restarting it. This upgrade takes only a few seconds because it has been prepared offline. Using snapshots rather than just multiple root filesystem images has the advantage that it saves space because the filesystems of all the jails are mostly identical.
When unexpected problems crop up, rolling back is as easy as stopping the jail, changing its root filesystem to the previous snapshot and then restarting it. The system administrator can then investigate and fix the problem offline, without any unneeded downtime.
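The upgrade and rollback steps described above can be sketched as a small command planner. This is only an illustration, not Paul's actual tooling: the dataset name tank/jails/template, the jail name www, the mount point /jails/www, and the snapshot labels are hypothetical. It also assumes a FreeBSD release where "service jail stop|start <name>" manages a single jail and where a snapshot can be attached with "mount -t zfs". The functions only build command lists; nothing is executed.

```python
from typing import List

Command = List[str]


def snapshot_command(template: str, label: str) -> Command:
    """Command that freezes the prepared template filesystem as template@label."""
    return ["zfs", "snapshot", f"{template}@{label}"]


def switch_root_plan(jail: str, template: str, label: str, root: str) -> List[Command]:
    """Commands that restart `jail` on top of the snapshot template@label.

    The same plan serves for an upgrade (label = new snapshot) and for a
    rollback (label = previous snapshot); only the label differs.
    """
    return [
        ["service", "jail", "stop", jail],                    # downtime starts
        ["umount", root],                                     # detach the current root
        ["mount", "-t", "zfs", f"{template}@{label}", root],  # attach the chosen snapshot
        ["service", "jail", "start", jail],                   # downtime ends
    ]
```

Calling switch_root_plan("www", "tank/jails/template", "v2", "/jails/www") yields the upgrade sequence; passing "v1" instead yields the rollback. Both plans have the same four steps, which matches the observation that a rollback costs about as much downtime as the upgrade.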
Obviously, a read-only filesystem is not enough: the applications that run inside the jail need to be able to write their data. Therefore, the directory tree for a jail is set up as a combination of the read-only snapshot of the template jail's root filesystem with some read-write filesystems for the applications, such as /var and /home.
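The resulting directory tree can be pictured as a list of mounts per jail. The sketch below is a hypothetical layout, not the one from the talk: the /jails directory, the tank/jaildata parent dataset, and the per-jail naming scheme are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence


class Mount(NamedTuple):
    source: str     # ZFS snapshot or dataset providing the data
    target: str     # directory on the host where it is mounted
    writable: bool  # False for the shared template snapshot


def jail_layout(jail: str, template_snapshot: str,
                jails_dir: str = "/jails",
                data_parent: str = "tank/jaildata",
                writable_dirs: Sequence[str] = ("var", "home")) -> List[Mount]:
    """Describe the mounts that make up one jail's directory tree."""
    root = f"{jails_dir}/{jail}"
    # The root comes first: a read-only snapshot shared with the template.
    layout = [Mount(template_snapshot, root, False)]
    # Each writable directory is a separate per-jail dataset mounted on top.
    for name in writable_dirs:
        layout.append(Mount(f"{data_parent}/{jail}/{name}", f"{root}/{name}", True))
    return layout
```

Because the root is read-only, the mount points for the writable filesystems (here /var and /home) have to exist as directories in the template before the snapshot is taken.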
ZFS, which has been deemed production
ready since FreeBSD 8,
easily handles many filesystems, has a flexible quota system, and offers
fast snapshots (thanks to the copy-on-write nature of the
filesystem). Therefore, it's a perfect match for building a robust
upgrade and rollback mechanism, something that has also been used by Nexenta.
Inspired by embedded systems
Combining jails and ZFS is straightforward, but Paul went further and
thought about the underlying operating system. He was inspired by the way
embedded systems are operated, and that's why he looked at NanoBSD. This
is a toolkit that comes with the FreeBSD base system and that creates a
FreeBSD system image for embedded applications that is suitable for use on a Compact Flash card.
NanoBSD is an interesting server choice for a number of reasons. First, it has the same functionality as FreeBSD, unless specific features are explicitly removed from the NanoBSD image when it is created. Moreover, every application that exists as a FreeBSD port or package can be installed and used in NanoBSD. The main differences are that the complete operating system is built and upgraded offline and that the root filesystem is mounted read-only.
The drive where NanoBSD is installed is divided into three partitions by
default: two image partitions and one configuration partition. All of them
are mounted read-only. The /etc and /var directories are memory
disks (i.e. RAM disks). After the system boots, the configuration partition is briefly mounted read-only under the /cfg directory and all files in it are copied to /etc. If the system administrator wants to make persistent changes to a file (say, /etc/resolv.conf), they have to mount the configuration partition under /cfg, copy the modified files from /etc to /cfg and unmount the configuration partition. The whole system is set up in this way to minimize the number of writes to the flash drive and prolong its life. And with this completely read-only system, there is no need to run fsck after a non-graceful shutdown, so the system reboots with minimal downtime.
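The mount, copy, unmount routine for persisting a configuration change can be sketched as a command planner. This is an illustration under assumptions: it relies on /cfg having an entry in /etc/fstab so that a bare "mount /cfg" works, and the function name persist_plan is made up for this example. It only builds the command list and does not run anything.

```python
import posixpath
from typing import Iterable, List


def persist_plan(changed_files: Iterable[str],
                 etc_dir: str = "/etc",
                 cfg_dir: str = "/cfg") -> List[List[str]]:
    """Commands that copy edited files from the volatile /etc to /cfg."""
    plan = [["mount", cfg_dir]]
    for path in changed_files:
        rel = posixpath.relpath(path, etc_dir)
        # Only files below /etc belong on the configuration partition.
        if rel == ".." or rel.startswith("../"):
            raise ValueError(f"{path} is not below {etc_dir}")
        plan.append(["cp", "-p", path, posixpath.join(cfg_dir, rel)])
    # Unmount again so the flash drive is not left mounted writable.
    plan.append(["umount", cfg_dir])
    return plan
```

For the /etc/resolv.conf example, persist_plan(["/etc/resolv.conf"]) produces three commands: mount /cfg, copy the file, and unmount /cfg. Files in subdirectories of /etc would additionally need the matching directory to exist under /cfg.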
The update process of NanoBSD is also well thought-out. While the running NanoBSD is installed on one of the image partitions, a newly built NanoBSD image is written to the other image partition. Then the system is rebooted and started from the newly installed partition. If anything goes wrong, the system can be rebooted back into the previous partition, which still contains the old, working image. The system administrator can then investigate and fix the problem offline.
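The dual-image scheme is easy to model. The following toy model is not NanoBSD code; the class name, the slot numbering, and the version labels are invented purely to show why the old image survives an update and why a rollback is just another reboot.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DualImageDisk:
    images: Dict[int, str]  # slot number (0 or 1) -> label of the image stored there
    active: int = 0         # slot the system is currently running from

    @property
    def standby(self) -> int:
        """The image slot that is not in use."""
        return 1 - self.active

    @property
    def running(self) -> str:
        return self.images[self.active]

    def write_update(self, label: str) -> None:
        """Write a new image to the standby slot; the running image is untouched."""
        self.images[self.standby] = label

    def reboot_into_standby(self) -> None:
        """Switch slots. Used both to activate an update and to roll it back."""
        self.active = self.standby
```

Starting from DualImageDisk({0: "old", 1: "older"}), calling write_update("new") followed by reboot_into_standby() leaves the system running "new" while "old" is still intact in slot 0. A second reboot_into_standby() returns to it.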
Tying it all together
Combining these three technologies (NanoBSD, ZFS, and jails), Paul reached his goal of setting up a FreeBSD server that can be upgraded with minimal downtime. All user-visible applications run in jails. Underneath the jails, a minimal FreeBSD operating system runs, built using the NanoBSD script. This holds the kernel, some low-level services, and the tools for building a new system image for upgrading the operating system. The NanoBSD system image can be put on a partition of a regular disk drive, but Paul prefers to put it on a separate flash drive, because NanoBSD is specifically designed for it and using a separate drive for the operating system makes it easier for the system administrator when the hard drives with the jails fail.
System administration on this system is much the same as on a regular FreeBSD system, except for software maintenance. Of course the embedded roots of NanoBSD mean that the system administrator needs to be aware of the differences from a regular server operating system. The volatile nature of /etc is one example: it's easy to forget to copy all changed configuration files to /cfg to preserve the changes after a reboot.
The directory /var is also a memory disk, so by default NanoBSD
doesn't keep log files, which is not helpful to the system
administrator. One solution is to put /var on a hard disk instead
of using a memory disk, but then the operating system depends on the hard
disk, which Paul wanted to avoid. Therefore, he chose another solution:
telling syslog to log to a syslog daemon on another host or to a
syslog daemon inside a jail on the same system.
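In syslog.conf terms, forwarding comes down to a rule whose action is "@" followed by the name of the log host. A minimal sketch that renders such a rule follows; the host name loghost.example.org is a placeholder, and the catch-all selector "*.*" is an assumption rather than the configuration used in the talk.

```python
def forward_rule(loghost: str, selector: str = "*.*") -> str:
    """Render a syslog.conf line that forwards matching messages to loghost."""
    if not loghost or any(ch.isspace() for ch in loghost):
        raise ValueError("loghost must be a non-empty name without whitespace")
    # Selector and action are separated by a tab; "@host" means "forward".
    return f"{selector}\t@{loghost}"


if __name__ == "__main__":
    print(forward_rule("loghost.example.org"))
```

The receiving syslog daemon, whether on another host or inside a jail, also has to be configured to accept messages from the network.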
A general architecture
Each system administrator has their own way of configuring servers. At
the beginning of his talk, Paul warned that he didn't mean to provide the
best solution for every situation; he just wanted to describe the way that
he builds and maintains FreeBSD servers. Although his talk was specifically
about the combination of NanoBSD, ZFS, and jails, the architecture he
described was general enough to be usable elsewhere. The same ideas can be implemented with other minimal operating systems, other filesystems with snapshot abilities, and other forms of operating-system-level virtualization.
While FreeBSD jails are probably the best-known type of
operating-system-level virtualization, other operating systems have it
too. OpenSolaris, for instance, has Zones,
which are even more flexible than FreeBSD jails. Linux has similar
solutions, such as OpenVZ and LXC (Linux containers). In
particular, OpenVZ (and its proprietary variant Virtuozzo Containers) is popular among providers of virtual private servers. So on the virtualization level, the same architecture that Paul uses is perfectly possible on Linux.
The second important component in Paul's scheme is the filesystem
snapshot. Although ZFS is not available for Linux (at least not in kernel
space), there are many other snapshot technologies. For example, LVM (Logical Volume Manager) has a
snapshot facility, as does Btrfs (which we looked at a while back). So for example with OpenVZ and LVM, Linux should be perfectly capable of creating OpenVZ containers based on read-only snapshots of some template container. Proxmox already makes use of LVM snapshots to create a backup of a container without downtime.
The last step Paul took is maybe the most interesting one: at a time
when operating systems, even for servers, get more and more bloated, it's refreshing to see him take some inspiration from the embedded world. While at first the limitations of an embedded operating system on read-only storage seem too cumbersome, they actually help a lot in clearly separating the operating system from its configuration and applications, which can only be good. Moreover, the dual-image approach and fast reboots have the advantage of more robust and less intrusive upgrades to the host operating system.
Because Linux shines in the embedded world, it's not very far-fetched to try this approach in the Linux world too. When leaving Paul's talk, your author heard a couple of people in the audience thinking out loud that it would be interesting to have such a system using Linux. A quick search didn't turn up anything useful, but it's clear that it should be possible, most likely with LVM snapshots and OpenVZ containers.