Etcd and fleet
Etcd
"Etcd" stands for "/etc distributed"; it is meant to be a highly reliable configuration mechanism that provides a uniform view across a cluster of machines. It offers "sequential consistency," meaning that changes are visible in the same order on all machines — though not necessarily at the same time. Unlike competing projects (like Zookeeper), Brandon said, etcd allows for runtime reconfigurability.
Clusters built around etcd are highly interconnected (but not necessarily completely so). There is one system that is elected to be the "leader"; all others are "followers." At its core, etcd is a simple key/value data store. Along with the usual store, fetch, and delete operations, etcd provides an atomic compare-and-swap operation; there is a compare-and-delete operation as well. The etcdctrl command-line tool can be used to make changes to the etcd database or to watch for changes made by others.
The sequential consistency feature ensures that all etcd changes appear in
the same order on all machines in the cluster. Changes may not happen at
the same time; some machines may see a specific change before others do.
So there may be periods of time where a pair of machines may disagree on
the value of a specific parameter stored in etcd. Changes have sequence
numbers attached to them; those numbers can be used to wait until a given
change has been distributed throughout the cluster. Sequence numbers are
unique and are not reused; getting a value at a specific sequence number
will always return the same result, anywhere in the cluster.
Etcd supports the notion of a "quorum get," which is guaranteed to return the latest version of a given parameter. Quorum gets will be a bit slower, though, since they must be run via the cluster leader. It is also possible to wait for a change in a given parameter; the HTTP "long poll" mechanism is used to implement this operation.
With regard to availability, etcd will, on a cluster with 2f+1 systems, continue working in the presence of f failures. Once that point has been exceeded, changes to the database will no longer be accepted. The loss of the leader system will cause the service to be briefly unavailable until the remaining systems can run an election and continue under the new leader. The recovery time, Brandon said, is about ten times the round-trip time between systems in the cluster.
Beyond service configuration, there are a number of practical applications for etcd's functionality. The "locksmith" system is intended to bring order to system updates. When an update needs to be rolled out, it would be nice to avoid rebooting all of the systems in the cluster at once. Locksmith uses etcd to implement a cluster-wide reboot lock. Each system, when it gets ready to reboot, will decrement a semaphore stored in etcd. Once it has successfully rebooted, it increments the semaphore, letting the next system in line perform its update.
Kubernetes is a cluster management system from Google that was built on top of etcd. The CoreOS "fleet" utility (about which more will be said shortly) performs a similar task: it schedules tasks on machines in the cluster. The work description for a given task is written into an etcd key; agents on the other machines then pick up work assignments from etcd and use it to report their results.
At the core of etcd is a consensus algorithm called
raft. Consensus algorithms
have been around for some time; the most popular among them is a system
called "Paxos."
But Paxos is complex and difficult to understand, Brandon said, so they
came up with adopted raft, which he described as "an
engineer's approach to Paxos."
Raft was originally developed
by Diego Ongaro and John Ousterhout; it is, Brandon said, far more
understandable and easy to work with than Paxos.
There were a number of mistakes made while developing etcd. The first of those was log files, which, he said, are hard to do right. If you are not careful, filesystems will corrupt or truncate data. There need to be checksums within a logfile to ensure that it is sane. Brandon expressed a wish that the kernel developers would simply document the best way to safely write log files.
Another problem had to do with naming in etcd; they trusted their users to come up with proper, unique names. "Never trust users," Brandon said. The result was a lot of misconfigured systems and, presumably, a lot of phone calls for the CoreOS support team. Now naming is handled by generating a unique ID number when a cluster member starts up.
Plans for the future, and the upcoming 1.0 release in particular, include the addition of nonblocking snapshots. An MVCC data store is on the list. There is a general effort to improve the scalability of read and write operations. There will be a read-only proxy system that can handle all get requests locally; that should help users avoid the use of expensive watch operations. Finally, there will be a "fast promotion" mechanism that allows a hot-standby node to join a cluster quickly if need be.
Fleet
Etcd is good for passing information to tasks on a clustered system, but what about starting, stopping, and managing those tasks in general? That is the responsibility of the fleet tool. Fleet, too, was developed in house, but the developers did not have to start from scratch; instead, fleet is based on systemd, which, Brandon said, gives them "a lot of good stuff." Systemd handles resource limits, control groups, and the secure computing (seccomp) subsystem; it can also take care of process monitoring and the notification of dependent processes. It is, he said, the first time they have had an init system driven by an API that gives them control over the system as a whole.
Fleet is a cluster scheduler, meaning that its job is to distribute tasks across the machines in a cluster. It needs to respond to events like a machine going down and reschedule tasks as needed. The fleet scheduler gets its marching orders (the "manifest") via etcd, then gets systemd to do the real work. It is thus not surprising that fleet's commands look a lot like systemd commands. One uses "fleetctl start" to start a task in the cluster, for example; it will cause the named service file to be scheduled on some remote machine.
Fleet can handle a number of requirements attached to the tasks it runs. Some tasks, for example, need to be run together on the same machine; others need to run on a specific system within the cluster. Information about such requirements goes into the systemd unit files, using the special "X sections" that are ignored by systemd itself.
Fleet works well, but, naturally, there is a list of desired enhancements. At the top of the list is an official stable API release so other programs can know how to talk to fleet. A signed schedule mechanism will add security to the system. And, while the service discovery mechanism built into the system now works, a better one is planned for the future.
To summarize, etcd and fleet form a part of the core of CoreOS. Between
them, they allow the specification and coordination of tasks run across a
cluster of systems. While these tools were developed for use within
CoreOS, it would not be surprising to see them expand to uses in other
settings as well.
| Index entries for this article | |
|---|---|
| Conference | LinuxCon Europe/2014 |
