Distributing filesystem images and updates with casync
Recently, Lennart Poettering announced a new tool called casync for efficiently distributing filesystem and disk images. Deploying virtual machines or containers often requires distributing just such an image to the target systems. These images typically contain most or all of an operating system and its requisite data files, so they can be quite large. The images also often need updates, which can consume considerable bandwidth depending on how efficient the update mechanism is. Poettering developed casync as an efficient tool for distributing such filesystem images, as well as their updates.
Poettering found that none of the existing system-image delivery mechanisms suited his requirements. He wanted to conserve bandwidth when updating the images, minimize disk space usage on both the server and clients, make downloads work well with content delivery networks (CDNs), and keep the mechanism simple to use. Poettering considered Docker's layered tarballs, OSTree's direct file delivery via HTTP with packed deltas for updates, and other systems that deliver entire filesystem images.
Docker's approach of "layers" of updates on top of an initial tarball required tracking revisions and history, which Poettering believes a deployment should not be burdened with. OSTree's method of serving individual files would be detrimental to the performance of content distribution networks if there were a plethora of small files, as synchronization will hammer the CDN with multiple HTTP GET requests. Delivering entire filesystem images repeatedly for every update would be an unacceptably high use of bandwidth and server disk space, even though the delivery would be simple to implement. In the end, Poettering decided that, while existing systems have their merits, he had to roll his own solution optimized for the use case of filesystem image delivery with frequent updates. Casync was inspired by rsync (which copies and syncs files based on deltas) and Git (which provides content-addressable storage based on hashing).
Casync can be used to distribute directory trees as well as raw disk images. When operating on directories, all data in the target directory is serialized into a stream of bytes, much like the tar utility does. Poettering created his own serialization as the output of tar varies from implementation to implementation; he required consistent output without being dependent on any particular flavor of tar.
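To see why a reproducible serialization matters, consider a minimal sketch of the idea in Python. It is not casync's actual .catar format, just an illustration: entries are visited in a fixed, sorted order and written with a fixed binary header, so the same directory tree always yields the same byte stream no matter which tool created the files.

```python
import os
import stat
import struct

def serialize_tree(root):
    """Deterministically serialize a directory tree into a byte stream.

    This is NOT casync's .catar format, only an illustration: entries are
    visited in sorted order and written with a fixed binary header, so the
    same tree always produces exactly the same stream.
    """
    out = bytearray()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                       # fix the traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):  # sketch only handles regular files
                continue
            rel = os.path.relpath(path, root).encode()
            with open(path, "rb") as f:
                data = f.read()
            # fixed-width header: path length, mode, file size
            out += struct.pack("<IIQ", len(rel), st.st_mode, len(data))
            out += rel + data
    return bytes(out)
```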
When invoked on either a directory or a disk image, casync will create a repository of data that mirrors the original, but reorganized such that it is broken into data chunks of similar size, stored inside a directory called a "chunk store", together with an index file that stores the metadata for the repository. The directory and index file can both be served via a web server (or with another network file transfer protocol) to a client, which can use casync to reassemble the original data.
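From the client's point of view, reassembly is conceptually simple. The sketch below assumes a made-up plain-text index that just lists chunk digests in stream order (casync's real .caidx file is a binary format) and a chunk store served from a hypothetical URL by an ordinary web server; the digests are the SHA-256 hashes of the chunks, described below.

```python
import hashlib
import lzma
import urllib.request

BASE_URL = "https://example.com/image"      # hypothetical server layout

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def reassemble(output_path):
    # Hypothetical index: one SHA-256 digest per line, in stream order.
    # casync's real .caidx file is a binary format; this is only a sketch.
    digests = fetch(BASE_URL + "/image.index").decode().split()
    with open(output_path, "wb") as out:
        for digest in digests:
            # Each chunk is a separate, cacheable HTTP object named by its
            # digest; assume LZMA compression as a stand-in for the real one.
            compressed = fetch(BASE_URL + "/store/" + digest)
            chunk = lzma.decompress(compressed)
            assert hashlib.sha256(chunk).hexdigest() == digest
            out.write(chunk)
```

Because each chunk is an ordinary, immutable HTTP object, it can be cached by a CDN and fetched with plain GET requests.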
Chunking
Casync works by cutting the target data, treated as a stream of bytes, into a set of variable-sized chunks, though the sizes do not vary by much. Chunking helps reduce the bandwidth consumed when a user synchronizes their repository, since the index file can be used to determine which chunks have changed or been added, and only those chunks need to be downloaded.
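In other words, an update is computed by comparing two indexes rather than two data sets. A sketch, using hypothetical, abbreviated digest lists rather than casync's real index format:

```python
def chunks_to_download(old_index, new_index):
    """Return digests from the new index that are not already present locally."""
    have = set(old_index)            # chunks referenced by the image we already have
    return [d for d in new_index if d not in have]

# Example with made-up digests: only two of five chunks changed,
# so only those two need to be fetched from the server.
old = ["aa11", "bb22", "cc33", "dd44", "ee55"]
new = ["aa11", "ff66", "cc33", "dd44", "9977"]
print(chunks_to_download(old, new))  # ['ff66', '9977']
```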
The chunking algorithm will create the same chunks for the same data, even when that data appears at different offsets. To accomplish this, casync makes use of a cyclic polynomial hash (also known as Buzhash) to find the offsets for identical data. Buzhash is a hash-based search algorithm that can be used to find patterns in a stream of bytes more efficiently than brute-force scanning.
The basic idea of Buzhash is that, given two strings of data where one may contain the other, it is possible to search for the target by looking at a few bytes (called the window) at every byte offset and hashing them (this is called a "rolling hash"). The resulting hash is compared against the hash of a search key of the same size as the window; a match provides a strong indicator that the rest of the string might also match the other data, and a full hash at the indicated offset can be executed to confirm it. An advantage of a rolling hash is that it can be performed quickly since the next hash can be generated from the previous one by a computationally cheap operation.
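A minimal sketch of a Buzhash-style rolling hash in Python shows why the update is cheap: sliding the window by one byte needs only a rotation and two table lookups, yet produces the same value as hashing the new window from scratch. The substitution table and test data here are arbitrary stand-ins, not casync's actual parameters.

```python
import random

random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]   # stand-in substitution table
WINDOW = 48                                            # casync hashes a 48-byte window

def rotl32(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def buzhash(window):
    """Hash a full window from scratch."""
    h = 0
    for i, b in enumerate(window):
        h ^= rotl32(TABLE[b], len(window) - 1 - i)
    return h

def roll(h, outgoing, incoming, size=WINDOW):
    """Slide the window one byte: drop `outgoing` from the front, add `incoming`."""
    return rotl32(h, 1) ^ rotl32(TABLE[outgoing], size) ^ TABLE[incoming]

data = bytes(random.getrandbits(8) for _ in range(200))
h = buzhash(data[:WINDOW])
for i in range(1, len(data) - WINDOW + 1):
    h = roll(h, data[i - 1], data[i + WINDOW - 1])
    # The cheap rolling update gives the same result as rehashing the window.
    assert h == buzhash(data[i:i + WINDOW])
```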
When chunking for the first time, the Buzhash algorithm is used with a 48-byte window across the stream, moving one byte at a time and calculating a hash. A chunk boundary is placed whenever the value of the calculated hash, h, satisfies the following equation:
h mod k == k - 1
The constant k is chosen to reflect the intended average chunk size. Assuming an even distribution of the hash h, the probability that the function h mod k will yield a specific value, in this case k-1, is 1/k. Therefore, for roughly every k bytes read, the equation will evaluate to true and a boundary will be placed. To guarantee the chunks are neither too small nor too large, the algorithm also enforces hard limits on the chunk size. The approach is similar to what rsync does when chunking, except that rsync uses a smaller window, works with individual files rather than a serialized directory tree, and uses the Adler-32 checksum as its rolling hash.
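Putting the boundary rule together with the size limits, a content-defined chunker looks roughly like the following sketch. The average, minimum, and maximum sizes are made-up values, and for clarity the window hash is recomputed with CRC32 at every offset instead of rolling a Buzhash forward; only the structure of the decision is the point.

```python
import os
import zlib

WINDOW = 48              # bytes hashed at each offset (casync uses a 48-byte window)
AVG = 64 * 1024          # the constant k: intended average chunk size (made-up value)
MIN_SIZE = 16 * 1024     # hard lower limit on chunk size (made-up value)
MAX_SIZE = 256 * 1024    # hard upper limit on chunk size (made-up value)

def chunk(data):
    """Split data into content-defined chunks.

    A boundary is placed where hash(window) % AVG == AVG - 1, but never before
    MIN_SIZE bytes and always by MAX_SIZE bytes. For clarity the window hash is
    recomputed at every offset; a real implementation rolls a Buzhash forward
    one byte at a time instead.
    """
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        window = data[max(start, i - WINDOW + 1):i + 1]
        h = zlib.crc32(window)
        boundary = size >= MIN_SIZE and h % AVG == AVG - 1
        if boundary or size >= MAX_SIZE or i == len(data) - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    return chunks

sizes = [len(c) for c in chunk(os.urandom(1_000_000))]
print(len(sizes), "chunks, sizes between", min(sizes), "and", max(sizes))
```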
Each time a chunk is created, its contents are hashed with SHA-256 to create a digest for the chunk, which is then recorded in an index together with the chunk size and a filename. The chunks are kept in compressed form in the chunk store. If a chunk arriving in the store hashes to the same digest as an existing one in the index, the chunk need not be added. This gives the chunk store deduplication for data, which is particularly efficient for filesystem images that do not differ much between versions. The chunks can then be delivered over HTTP, along with the index, and they can be reassembled on the client side.
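At its heart, the chunk store is a content-addressed dictionary. The following sketch compresses with Python's lzma as a stand-in for whatever compressor the real store uses, and writes chunks into a flat directory rather than casync's actual on-disk layout:

```python
import hashlib
import lzma
import os

def store_chunk(store_dir, chunk):
    """Write a chunk into the store, keyed by the SHA-256 digest of its contents.

    If a chunk with the same digest already exists, nothing is written, so data
    that is identical between image versions is stored only once. The digest is
    returned so the caller can record it, with the chunk size, in the index.
    """
    digest = hashlib.sha256(chunk).hexdigest()
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):                 # deduplication
        with open(path, "wb") as f:
            f.write(lzma.compress(chunk))
    return digest

def build_index(store_dir, chunks):
    """Store every chunk and return (digest, size) pairs in stream order."""
    return [(store_chunk(store_dir, c), len(c)) for c in chunks]
```

Because each stored file is named by the digest of its contents, a second image version that shares most of its data with the first adds only the chunks that actually changed.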
Serializing directories requires the preservation of metadata such as ownership and access control lists (ACLs). A user can specify what metadata to save in the chunk archive when running the tool. Casync will also store extended attributes, file capabilities, ACLs, Linux chattr file attributes, and FAT file attributes. Casync recognizes pseudo-filesystems such as /proc and /sys; it will not include them when creating an archive. Additionally, if the underlying filesystem supports reflinks, which save space by sharing disk blocks for identical files (with copy-on-write semantics), then casync can take advantage of this: rather than creating duplicate copies of identical files, it will reflink them. Casync also supplies a FUSE facility for read-only mounting of filesystem or disk images directly from the HTTP source.
Trying it out
There are packages for casync created by third parties available for Ubuntu, Arch Linux, and Fedora. I tried it out by compiling it from the GitHub repository, which requires the Meson build system. The installed binaries let you create chunked repositories and reconstruct them, both locally and over HTTP, FTP, or SFTP. The README contains a list of commands you can run to try out the various features of casync.
Future work
Casync is not intended as a replacement for rsync or zsync, as it is aimed at filesystem delivery rather than fine-grained, file-based backup. It also does not attempt to find the optimal deduplication and smallest deltas, but uses a "good enough" heuristic to save bandwidth and storage. It is a welcome addition in the space of filesystem delivery, where something like rsync would be useful but its fine-grained, per-file granularity is not required.
Poettering has stated that he has "concrete plans" for adding encryption to casync, so that it could be used as a backup tool like restic, BorgBackup, or Tarsnap. He also intends to automate GPG validation of data, so that chunks can be signed and verified without user intervention. Casync does not expose an API for third party tools, although it is designed to be able to do so eventually. This will enable things such as GNOME's GVfs to access casync repositories, and make it modular enough so that components like the HTTP delivery mechanism can be replaced with customized implementations. Other plans are support for local network caches of chunks and automated home-directory synchronization.
Casync only works on Linux at the moment, but Poettering says he is open to accepting patches for portability that do not interfere with the fundamental operation of casync. Currently, casync is developed mainly by Poettering, with a few other contributors. The project is not yet completely stable, although it is usable and has many features implemented already. There may be changes to the file formats down the road, so any index or serialized files made with the current version might break in the future.
Conclusion
Casync is a new option to complement tools like rsync, which may prove useful to anyone who needs to distribute large filesystem images that also need to be regularly updated. The granularity of "chunks" that casync uses is reminiscent of BitTorrent, but the fact that it is network protocol independent should make the distribution of data friendlier to firewalls and content distribution networks. It should be a useful tool for cloud providers, software distributions, developers sharing customized virtual machine images, and anyone else who needs an efficient way of providing large and constantly updated bundles of data.
[I would like to thank Lennart Poettering for his help in clarifying some of the inner workings of casync.]
Posted Jun 29, 2017 8:22 UTC (Thu) by mezcalero (subscriber, #45103):
"Buzhash is a hash-based search algorithm". Well, buzhash is a hash function. People use it for different things, one prominent use-case being searching. But you can use it for other stuff as well, for example chunking. But saying that buzhash itself was a "search algorithm" isn't precisely right.
Thank you very much for putting together this article!
Lennart
Posted Jun 30, 2017 12:33 UTC (Fri) by walters (subscriber, #7396):
Basically I think this sentence should just be deleted:
> OSTree's method of serving individual files would be detrimental to the performance of content distribution networks if there were a plethora of small files, as synchronization will hammer the CDN with multiple HTTP GET requests.
Now, a definite OSTree issue is *managing* deltas, but for the OS update case, it's really really easy to just run `ostree static-delta generate` before publishing. When you involve multiple branches, managing deltas between them gets a little harder but it's just a matter of scripting.