Git + FUSE + Python = GitFS

By Jake Edge
August 12, 2015

EuroPython

Typically, filesystems are developed in C, often as part of a kernel like Linux. But the folks at Presslabs took an unconventional approach by developing a filesystem based on Git as a user-space filesystem written in Python. Vlad Temian and Manu Danci, who both work for the company, came to EuroPython 2015 in Bilbao, Spain to talk about that filesystem, which is called GitFS.

Presslabs is a Romanian WordPress hosting company whose hosted sites see as many as 2.2 billion monthly page views. One of the problems its customers have encountered, though, is conflicts between site publishers (i.e. site owners) and their developers. Publishers will try to change the code that developers have written, but many don't have the knowledge required, so the site breaks. Nobody knows who did what to the code base, which results in a "big pile of chaos", Danci said.

So Presslabs came up with the idea of a self-versioning filesystem that uses Git behind the curtains. But in order to be user-friendly, GitFS turns the Git version tree into something more accessible using the timestamp of each commit. The mount point has two directories: current, which contains the latest snapshot, and history, which contains directories of other snapshots based on the date and time of the commit:

    mnt/history/
        2015-08-09/
            14-32-12-eb18db93bc9/
            14-37-39-e256395058f/
            16-03-23-25b39c058fe/
        2015-08-10/
            11-17-48-ef1fdf9fbff/
        ...

Users can read and write to files in the current directory, but the history directories are all read-only. The idea is to "humanize" the contents of the .git directory.
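
The mapping from a commit to its history directory can be sketched in a few lines of Python. This is only an illustration: the function name history_path, the use of UTC, and the 11-character hash prefix (chosen to match the listing above) are assumptions, not details given in the talk.

```python
from datetime import datetime, timezone


def history_path(commit_time, commit_hash, prefix_len=11):
    """Build a 'YYYY-MM-DD/HH-MM-SS-<hash prefix>' path for one commit.

    commit_time is seconds since the epoch; UTC is assumed here, though
    a real implementation might use the committer's time zone instead.
    """
    ts = datetime.fromtimestamp(commit_time, tz=timezone.utc)
    return f"{ts:%Y-%m-%d}/{ts:%H-%M-%S}-{commit_hash[:prefix_len]}"
```

Grouping the results by their first path component gives the per-day directories, with one subdirectory for each commit made that day.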

[Manu Danci]

Mounting the filesystem is done with the gitfs command, as Temian and Danci demonstrated live. The command requires the remote repository and a mount point; there are also options that can be specified for timeouts and other parameters. When someone makes a change to the repository (e.g. a developer commits changes), GitFS updates the current directory and adds a new entry into history in the mounted filesystem. Similarly, when someone makes a change in current, a commit is generated after a specified timeout occurs (to try to package up multiple changes into a single commit).

The filesystem is written entirely in Python and has been released under the Apache 2.0 license. There is more that could be done with GitFS, Danci said. He suggested that anyone interested contribute to the project so that "maybe we can grow it further".

As part of the research that eventually led to GitFS, the team came up with some requirements. There were two main hurdles to overcome. The first was how to handle Git objects efficiently, both in terms of speed and memory usage. The other was to be able to implement the filesystem operations in an efficient manner.

Python-based solutions were found for parts of the problem. GitFS uses pygit2 for its Git bindings. Pygit2 is written in C on top of the libgit2 library, so it calls a C API directly and doesn't waste time spawning a shell to run Git commands. GitFS also uses fusepy for access to the Filesystem in Userspace (FUSE) API.

[Vlad Temian]

Temian then took over to describe some of the architecture for GitFS. The main concept in the filesystem is that of a "view". There are several types of view, depending on what kind of object is being accessed. For example, an open() call goes to a router that chooses between the view types: CurrentView, HistoryView, CommitView, or IndexView. There are also some higher-level view classes, including PassthroughView (used for current) and ReadOnlyView (which is the parent of the history views).
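
A rough sketch of how such a router might dispatch on the path follows. The view class names come from the talk, but everything else is assumed for illustration: the regular expressions, the route() function, the idea that IndexView serves the mount-point root, and the exact inheritance of CommitView and IndexView.

```python
import errno
import re


class View:
    """Base class; keeps the parameters extracted from the path."""
    def __init__(self, **params):
        self.params = params


class PassthroughView(View):
    """Would forward operations to the working tree (used for current)."""


class ReadOnlyView(View):
    """Parent of the history views; refuses any write."""
    def write(self, path, data, offset):
        raise OSError(errno.EROFS, "read-only file system")


class CurrentView(PassthroughView):
    pass


class HistoryView(ReadOnlyView):
    pass


class CommitView(ReadOnlyView):
    pass


class IndexView(ReadOnlyView):
    pass


DATE = r"(?P<date>\d{4}-\d{2}-\d{2})"
TIME = r"(?P<time>\d{2}-\d{2}-\d{2})"

# Hypothetical routing table; the most specific pattern comes first.
ROUTES = [
    (re.compile(rf"/history/{DATE}/{TIME}-(?P<sha>[0-9a-f]+)(?P<relpath>/.*)?"),
     CommitView),
    (re.compile(rf"/history(?:/{DATE})?/?"), HistoryView),
    (re.compile(r"/current(?P<relpath>/.*)?"), CurrentView),
    (re.compile(r"/"), IndexView),
]


def route(path):
    """Return a view instance for the first pattern that matches path."""
    for pattern, view_class in ROUTES:
        match = pattern.fullmatch(path)
        if match:
            return view_class(**match.groupdict())
    raise FileNotFoundError(errno.ENOENT, "no view for path", path)
```

With a table like this, a write under history fails with EROFS regardless of which history view handles it, since the behavior lives in the shared parent class.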

Eventually there will be conflicts that need to be resolved between different versions of the files. GitFS has adopted a simple strategy called "always accept mine" that is safest for Presslabs's customers. It effectively takes the local changes and applies them on top of any changes to the remote repository. That strategy is not implemented in pygit2, so it was added as a plugin. Others can write their own strategy to be used with the filesystem.
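
The effect of "always accept mine" can be illustrated at the level of whole files. The sketch below is not how GitFS implements it (the real strategy operates on Git commits); it simply models three snapshots as dictionaries mapping paths to contents, and the function name accept_mine is made up for this example.

```python
def accept_mine(base, local, remote):
    """Apply local changes (relative to base) on top of the remote snapshot.

    Each argument maps a file path to its contents. Whenever the local
    side touched a path, the local version wins, even if remote also
    changed it; paths the local side left alone take the remote version.
    """
    merged = dict(remote)
    for path in set(base) | set(local):
        if local.get(path) == base.get(path):
            continue  # untouched locally: keep whatever remote has
        if path in local:
            merged[path] = local[path]  # local edit or new file wins
        else:
            merged.pop(path, None)      # local deletion wins
    return merged
```

For instance, if both sides edited the same file, the result holds the local contents, while files that only the remote side changed or added come through unmodified.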

One of the first problems encountered with GitFS was its performance on real-world repositories. As a test, the WordPress repository with some 17,000 commits was used, but it took 34 minutes to do a directory listing on history. Some profiling led to the addition of three levels of caching. After that, the directory listing completed in around three seconds.
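
The talk did not spell out what the three caching levels were, so the following shows only the general idea behind one plausible level: memoizing the history listing, keyed on the head commit. The names walk_history and list_history are hypothetical, and the walk itself is a stand-in for the expensive traversal of every commit.

```python
from functools import lru_cache

CALLS = {"walk": 0}  # counts how often the expensive walk really runs


def walk_history(head):
    """Stand-in for an expensive walk over all commits reachable from head."""
    CALLS["walk"] += 1
    return tuple(f"{head}-entry-{n}" for n in range(3))


@lru_cache(maxsize=128)
def list_history(head):
    """Cached listing; a new head commit is a new key, so no explicit
    invalidation is needed when the repository moves forward."""
    return walk_history(head)
```

Because commits are immutable, keying the cache on a commit identifier means a cached listing can never go stale; it is just no longer looked up once the head changes.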

If you unzip 1000 files, you don't want to do 1000 commits for them, Temian said, so a smarter mechanism for upstream synchronization is needed. That led to some additional components in GitFS: FUSE threads, a commit queue, sync worker, and fetch worker. The FUSE threads are not directly under the control of GitFS — FUSE determines how many there are, for example. Those threads add commit jobs to a commit queue when they have completed writing a file.

That commit queue is consumed by the "sync worker" process, which batches up the changes based on the sync timeout that has been specified for the filesystem. When it is making a commit, it locks out any other writes to the filesystem by notifying the FUSE threads, waits for completion of the merge and push, then unlocks the filesystem. There is also the "fetch worker" that periodically (also dependent on a timeout) checks for changes on the remote repository. It is also told to pause during sync operations.
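
The batching behavior of the commit queue and sync worker might look something like the sketch below, built on the standard queue and threading modules. It is a simplification under stated assumptions: the names collect_batch and SyncWorker, the idle-timeout batching rule, and the single write lock are all illustrative, and the commit callable stands in for the real commit, merge, and push.

```python
import queue
import threading


def collect_batch(jobs, idle_timeout):
    """Block for the first job, then keep collecting until the queue
    has been idle for idle_timeout seconds."""
    batch = [jobs.get()]
    while True:
        try:
            batch.append(jobs.get(timeout=idle_timeout))
        except queue.Empty:
            return batch


class SyncWorker:
    def __init__(self, jobs, idle_timeout, commit):
        self.jobs = jobs
        self.idle_timeout = idle_timeout
        self.commit = commit  # hypothetical callable: commit, merge, push
        # Writers (the FUSE threads) would take this lock before writing,
        # so holding it here keeps them out during the sync.
        self.write_lock = threading.Lock()

    def run_once(self):
        """Turn one burst of queued writes into a single commit."""
        batch = collect_batch(self.jobs, self.idle_timeout)
        with self.write_lock:
            self.commit(batch)
        return len(batch)
```

Unzipping 1000 files would then enqueue 1000 jobs in quick succession, but as long as they arrive faster than the timeout, run_once() hands them to commit as one batch.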

Danci then took back the microphone for some concluding remarks. First off, there is an Ubuntu personal package archive for GitFS that is maintained by Presslabs, as well as community-maintained versions for Fedora and Arch Linux. There is also a version available for Mac OS X.

There are a number of things that attendees should take away from the talk, Danci said. You can build a production filesystem in Python. Presslabs did it and has been using it in production for almost a year now. People suggested doing it in C, but the team preferred Python and it is working well.

Building a FUSE-based filesystem is straightforward, he said, but getting it right can be challenging. The current model was not the first the team came up with. It took a number of refinements to get to where it is today. His final point was that by combining appropriate technology, you can "put powerful tools in the hands of non-technical users". The main purpose of GitFS is to make people's lives easier by putting Git in the hands of regular users.


Index entries for this article
Conference: EuroPython/2015



Posted Aug 16, 2015 9:02 UTC (Sun) by kugel (subscriber, #70540)

Unfortunately I failed to install the Arch packages. It seems to have dependencies that don't readily exist in Arch (or at a too old version).

Posted Aug 18, 2015 15:37 UTC (Tue) by mikedep333 (subscriber, #91993)

As a user of ClearCase (which provides MVFS), I welcome this. One of MVFS's issues is that it does not use FUSE, but instead an out-of-tree kernel module.

Posted Aug 18, 2015 15:39 UTC (Tue) by mikedep333 (subscriber, #91993)

My bad, this should have been a separate comment.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds