Distributing filesystem images and updates with casync
Recently, Lennart Poettering announced a new tool called casync for efficiently distributing filesystem and disk images. Deploying virtual machines or containers often requires distributing just such an image to the target systems. These images typically contain most or all of an operating system and its requisite data files, so they can be quite large. The images also often need updates, which can consume considerable bandwidth depending on how efficient the update mechanism is. Poettering developed casync as an efficient tool for distributing such filesystem images, as well as their updates.
Poettering found that none of the existing system-image delivery mechanisms suited his requirements. He wanted to conserve bandwidth when updating the images, minimize disk space usage on both the server and clients, make downloads work well with content delivery networks (CDNs), and keep the mechanism simple to use. Poettering considered Docker's layered tarballs, OSTree's direct file delivery via HTTP with packed deltas for updates, and other systems that deliver entire filesystem images.
Docker's approach of "layers" of updates on top of an initial tarball required tracking revisions and history, which Poettering believes a deployment should not be burdened with. OSTree's method of serving individual files would be detrimental to the performance of content distribution networks if there were a plethora of small files, as synchronization will hammer the CDN with multiple HTTP GET requests. Delivering entire filesystem images repeatedly for every update would be an unacceptably high use of bandwidth and server disk space, even though the delivery would be simple to implement. In the end, Poettering decided that, while existing systems have their merits, he had to roll his own solution optimized for the use case of filesystem image delivery with frequent updates. Casync was inspired by rsync (which copies and syncs files based on deltas) and Git (which provides content-addressable storage based on hashing).
Casync can be used to distribute directory trees as well as raw disk images. When operating on directories, all data in the target directory is serialized into a stream of bytes, much like the tar utility does. Poettering created his own serialization as the output of tar varies from implementation to implementation; he required consistent output without being dependent on any particular flavor of tar.
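To see why a reproducible serialization matters, consider a minimal sketch of the idea in Python. It is not casync's actual .catar format, just an illustration: entries are visited in a fixed, sorted order and written with a fixed binary header, so the same directory tree always yields the same byte stream no matter which tool created the files.

```python
import os
import stat
import struct

def serialize_tree(root):
    """Deterministically serialize a directory tree into a byte stream.

    This is NOT casync's .catar format, only an illustration: entries are
    visited in sorted order and written with a fixed binary header, so the
    same tree always produces exactly the same stream.
    """
    out = bytearray()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                       # fix the traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):  # sketch only handles regular files
                continue
            rel = os.path.relpath(path, root).encode()
            with open(path, "rb") as f:
                data = f.read()
            # fixed-width header: path length, mode, file size
            out += struct.pack("<IIQ", len(rel), st.st_mode, len(data))
            out += rel + data
    return bytes(out)
```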
When invoked on either a directory or a disk image, casync will create a repository of data that mirrors the original, but reorganized such that it is broken into data chunks of similar size, stored inside a directory called a "chunk store", together with an index file that stores the metadata for the repository. The directory and index file can both be served via a web server (or with another network file transfer protocol) to a client, which can use casync to reassemble the original data.
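From the client's point of view, reassembly is conceptually simple. The sketch below assumes a made-up plain-text index that just lists chunk digests in stream order (casync's real .caidx file is a binary format) and a chunk store served from a hypothetical URL by an ordinary web server; the digests are the SHA-256 hashes of the chunks, described below.

```python
import hashlib
import lzma
import urllib.request

BASE_URL = "https://example.com/image"      # hypothetical server layout

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def reassemble(output_path):
    # Hypothetical index: one SHA-256 digest per line, in stream order.
    # casync's real .caidx file is a binary format; this is only a sketch.
    digests = fetch(BASE_URL + "/image.index").decode().split()
    with open(output_path, "wb") as out:
        for digest in digests:
            # Each chunk is a separate, cacheable HTTP object named by its
            # digest; assume LZMA compression as a stand-in for the real one.
            compressed = fetch(BASE_URL + "/store/" + digest)
            chunk = lzma.decompress(compressed)
            assert hashlib.sha256(chunk).hexdigest() == digest
            out.write(chunk)
```

Because each chunk is an ordinary, immutable HTTP object, it can be cached by a CDN and fetched with plain GET requests.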
Chunking
Casync works by cutting the target data, treated as a stream of bytes, into a set of variable-sized chunks, though the sizes do not vary by much. Chunking helps reduce the bandwidth consumed when a user synchronizes their repository, since the index file can be used to determine which chunks have changed or been added, and only those chunks need to be downloaded.
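In other words, an update is computed by comparing two indexes rather than two data sets. A sketch, using hypothetical, abbreviated digest lists rather than casync's real index format:

```python
def chunks_to_download(old_index, new_index):
    """Return digests from the new index that are not already present locally."""
    have = set(old_index)            # chunks referenced by the image we already have
    return [d for d in new_index if d not in have]

# Example with made-up digests: only two of five chunks changed,
# so only those two need to be fetched from the server.
old = ["aa11", "bb22", "cc33", "dd44", "ee55"]
new = ["aa11", "ff66", "cc33", "dd44", "9977"]
print(chunks_to_download(old, new))  # ['ff66', '9977']
```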
The chunking algorithm will create the same chunks for the same data, even when that data appears at different offsets. To accomplish this, casync makes use of a cyclic polynomial hash (also known as Buzhash) to find the offsets for identical data. Buzhash is a hash-based search algorithm that can be used to find patterns in a stream of bytes more efficiently than brute-force scanning.
The basic idea of Buzhash is that, given two strings of data where one may contain the other, it is possible to search for the target by looking at a few bytes (called the window) at every byte offset and hashing them (this is called a "rolling hash"). The resulting hash is compared against the hash of a search key of the same size as the window; a match provides a strong indicator that the rest of the string might also match the other data, and a full hash at the indicated offset can be executed to confirm it. An advantage of a rolling hash is that it can be performed quickly since the next hash can be generated from the previous one by a computationally cheap operation.
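A minimal sketch of a Buzhash-style rolling hash in Python shows why the update is cheap: sliding the window by one byte needs only a rotation and two table lookups, yet produces the same value as hashing the new window from scratch. The substitution table and test data here are arbitrary stand-ins, not casync's actual parameters.

```python
import random

random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]   # stand-in substitution table
WINDOW = 48                                            # casync hashes a 48-byte window

def rotl32(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def buzhash(window):
    """Hash a full window from scratch."""
    h = 0
    for i, b in enumerate(window):
        h ^= rotl32(TABLE[b], len(window) - 1 - i)
    return h

def roll(h, outgoing, incoming, size=WINDOW):
    """Slide the window one byte: drop `outgoing` from the front, add `incoming`."""
    return rotl32(h, 1) ^ rotl32(TABLE[outgoing], size) ^ TABLE[incoming]

data = bytes(random.getrandbits(8) for _ in range(200))
h = buzhash(data[:WINDOW])
for i in range(1, len(data) - WINDOW + 1):
    h = roll(h, data[i - 1], data[i + WINDOW - 1])
    # The cheap rolling update gives the same result as rehashing the window.
    assert h == buzhash(data[i:i + WINDOW])
```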
When chunking for the first time, the Buzhash algorithm is used with a 48-byte window across the stream, moving one byte at a time and calculating a hash. A chunk boundary is placed whenever the value of the calculated hash, h, satisfies the following equation:
h mod k == k - 1
The constant k is chosen to reflect the intended average chunk size. Assuming an even distribution of the hash h, the probability that the function h mod k will yield a specific value, in this case k-1, is 1/k. Therefore, for roughly every k bytes read, the equation will evaluate to true and a boundary will be placed. To guarantee the chunks are neither too small nor too large, the algorithm also enforces hard limits on the chunk size. The approach is similar to what rsync does when chunking, except that rsync uses a smaller window, works with individual files rather than a serialized directory tree, and uses the Adler-32 checksum as its rolling hash.
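Putting the boundary rule together with the size limits, a content-defined chunker looks roughly like the following sketch. The average, minimum, and maximum sizes are made-up values, and for clarity the window hash is recomputed with CRC32 at every offset instead of rolling a Buzhash forward; only the structure of the decision is the point.

```python
import os
import zlib

WINDOW = 48              # bytes hashed at each offset (casync uses a 48-byte window)
AVG = 64 * 1024          # the constant k: intended average chunk size (made-up value)
MIN_SIZE = 16 * 1024     # hard lower limit on chunk size (made-up value)
MAX_SIZE = 256 * 1024    # hard upper limit on chunk size (made-up value)

def chunk(data):
    """Split data into content-defined chunks.

    A boundary is placed where hash(window) % AVG == AVG - 1, but never before
    MIN_SIZE bytes and always by MAX_SIZE bytes. For clarity the window hash is
    recomputed at every offset; a real implementation rolls a Buzhash forward
    one byte at a time instead.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        window = data[max(start, i - WINDOW + 1):i + 1]
        h = zlib.crc32(window)
        boundary = size >= MIN_SIZE and h % AVG == AVG - 1
        if boundary or size >= MAX_SIZE or i == len(data) - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    return chunks

sizes = [len(c) for c in chunk(os.urandom(1_000_000))]
print(len(sizes), "chunks, sizes between", min(sizes), "and", max(sizes))
```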
Each time a chunk is created, its contents are hashed with SHA-256 to create a digest for the chunk, which is then recorded in an index together with the chunk size and a filename. The chunks are kept in compressed form in the chunk store. If a chunk arriving in the store hashes to the same digest as an existing one in the index, the chunk need not be added. This gives the chunk store deduplication for data, which is particularly efficient for filesystem images that do not differ much between versions. The chunks can then be delivered over HTTP, along with the index, and they can be reassembled on the client side.
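At its heart, the chunk store is a content-addressed dictionary. The following sketch compresses with Python's lzma as a stand-in for whatever compressor the real store uses, and writes chunks into a flat directory rather than casync's actual on-disk layout:

```python
import hashlib
import lzma
import os

def store_chunk(store_dir, chunk):
    """Write a chunk into the store, keyed by the SHA-256 digest of its contents.

    If a chunk with the same digest already exists, nothing is written, so data
    that is identical between image versions is stored only once. The digest is
    returned so the caller can record it, with the chunk size, in the index.
    """
    digest = hashlib.sha256(chunk).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):                 # deduplication
        with open(path, "wb") as f:
            f.write(lzma.compress(chunk))
    return digest

def build_index(store_dir, chunks):
    """Store every chunk and return (digest, size) pairs in stream order."""
    return [(store_chunk(store_dir, c), len(c)) for c in chunks]
```

Because each stored file is named by the digest of its contents, a second image version that shares most of its data with the first adds only the chunks that actually changed.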
Serializing directories requires the preservation of metadata such as ownership and access control lists (ACLs). A user can specify what metadata to save in the chunk archive when running the tool. Casync will also store extended attributes, file capabilities, ACLs, Linux chattr file attributes, and FAT file attributes. Casync recognizes pseudo-filesystems such as /proc and /sys; it will not include them when creating an archive. Additionally, if the underlying filesystem supports reflinks, which save space by sharing disk blocks for identical files (with copy-on-write semantics), then casync can take advantage of this: rather than creating duplicate copies of identical files, it will reflink them. Casync also supplies a FUSE facility for read-only mounting of filesystem or disk images directly from the HTTP source.
Trying it out
There are packages for casync created by third parties available for Ubuntu, Arch Linux, and Fedora. I tried it out by compiling it from the GitHub repository, which requires the Meson build system. The installed binaries let you create chunked repositories and reconstruct them, both locally and over HTTP, FTP, or SFTP. The README contains a list of commands you can run to try out the various features of casync.
Future work
Casync is not intended as a replacement for rsync or zsync, as it is aimed at filesystem delivery rather than fine-grained, file-based backup. It also does not attempt to find the optimal deduplication and smallest deltas, but uses a "good enough" heuristic to save bandwidth and storage. It is a welcome addition in the space of filesystem delivery, where something like rsync would be useful but its fine-grained, per-file granularity is not required.
Poettering has stated that he has "concrete plans" for adding encryption to casync, so that it could be used as a backup tool like restic, BorgBackup, or Tarsnap. He also intends to automate GPG validation of data, so that chunks can be signed and verified without user intervention. Casync does not expose an API for third party tools, although it is designed to be able to do so eventually. This will enable things such as GNOME's GVfs to access casync repositories, and make it modular enough so that components like the HTTP delivery mechanism can be replaced with customized implementations. Other plans are support for local network caches of chunks and automated home-directory synchronization.
Casync only works on Linux at the moment, but Poettering says he is open to accepting patches for portability that do not interfere with the fundamental operation of casync. Currently, casync is developed mainly by Poettering, with a few other contributors. The project is not yet completely stable, although it is usable and has many features implemented already. There may be changes to the file formats down the road, so any index or serialized files made with the current version might break in the future.
Conclusion
Casync is a new option to complement tools like rsync, which may prove useful to anyone who needs to distribute large filesystem images that also need to be regularly updated. The granularity of "chunks" that casync uses is reminiscent of BitTorrent, but the fact that it is network protocol independent should make the distribution of data friendlier to firewalls and content distribution networks. It should be a useful tool for cloud providers, software distributions, developers sharing customized virtual machine images, and anyone else who needs an efficient way of providing large and constantly updated bundles of data.
[I would like to thank Lennart Poettering for his help in clarifying some of the inner workings of casync.]
Posted Jun 29, 2017 8:22 UTC (Thu) by mezcalero (subscriber, #45103):
"Buzhash is a hash-based search algorithm". Well, buzhash is a hash function. People use it for different things, one prominent use-case being searching. But you can use it for other stuff as well, for example chunking. But saying that buzhash itself was a "search algorithm" isn't precisely right.
Thank you very much for putting together this article!
Lennart
Posted Jun 30, 2017 12:33 UTC (Fri) by walters (subscriber, #7396):
Basically I think this sentence should just be deleted:
> OSTree's method of serving individual files would be detrimental to the performance of content distribution networks if there were a plethora of small files, as synchronization will hammer the CDN with multiple HTTP GET requests.
Now, a definite OSTree issue is *managing* deltas, but for the OS update case, it's really really easy to just run `ostree static-delta generate` before publishing. When you involve multiple branches, managing deltas between them gets a little harder but it's just a matter of scripting.