By Jake Edge
December 8, 2010
As its introduction says, git-annex
sounds like something of a paradox. It uses Git to manage files that are larger
than Git can easily handle—without checking them into the
repository. But git-annex provides ways to track those files using much of
the same infrastructure as Git, so that moving or deleting
those files can all be tracked in much the same way as committed files.
In addition, git-annex allows for branches and distributed clones of its
trees.
Developer Joey Hess lists
two use cases for git-annex that will appeal to folks who juggle
many large files on multiple storage devices, frequently move between
different locations
and computers, or some combination thereof. Because git-annex tracks the
locations of the actual data files, which may not be locally present, it
can act like a hierarchical storage manager. The filenames will be
present in the repository, but their content may need to be fetched from
elsewhere or from a
currently offline disk. git-annex will fetch the data if it can find it in
an online repository or ask
that a particular repository be made available.
In addition, git-annex ensures that there is at least one copy—though
it can be configured to keep more than one—of a file's
contents
available before dropping the file from a local repository. That way, the
user can drop a large file (or files) from their laptop, say, while knowing
that the contents are still available on some other repository that
git-annex was able to contact. For "The Archivist", which is one of Hess's
use cases, that is essential: archivists can reorganize their files at
will, while knowing that the contents can't be accidentally deleted.
But those same attributes are useful to "The Nomad" (Hess's other use
case):
When she has 1 bar on her cell, Alice queues up interesting files on her
server for later. At a coffee shop, she has git-annex download them to her
USB drive. High in the sky or in a remote cabin, she catches up on
podcasts, videos, and games, first letting git-annex copy them from her USB
drive to the netbook (this saves battery power).
When she's done, she tells git-annex which to keep and which to
remove. They're all removed from her netbook to save space, and Alice
[knows] that next time she syncs up to the net, her changes will be synced
back to her server.
It does all this via a git-annex binary that is built from Haskell
sources. Because Git treats an executable named git-annex on the user's
PATH as a subcommand, using it is as
simple as "git annex ...". Unlike many free software
utilities,
git-annex also comes with fairly extensive documentation, including a man page and a walk-through. As
might be expected, the code is available via a Git
repository—though Debian unstable users can apt-get install
it.
When files are added to git-annex, their content is moved to a
.git/annex/objects directory and a symbolic link is created
using the original filename and pointing to the content. Those symbolic
links are handled by Git directly, while git-annex arranges for
the content to be present as requested. Creating a repository is pretty
straightforward:
$ mkdir ~/annextst
$ cd ~/annextst
$ git init
$ git annex init "desktop repo"
The "git annex init" command gives the annex a name that can be used to
identify the repository later on. One then adds files to the repository in
a fairly obvious way:
$ cp /tmp/big_file .
$ git annex add .
add big_file ok
$ git commit -a -m "added big_file"
The last command may seem a bit surprising, but Git is what will track the
symbolic link(s) that "git annex add" created. As the walk-through shows,
that Git repository can be cloned elsewhere (on another
machine or a removable USB device for example) and then each of those
repositories can be added as remote repositories (i.e. "git remote")
of each other. The only additional step for turning it into a git-annex
repository is to do:
$ git annex init "some other repo"
in the cloned directory.
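What a clone receives from Git is only the symbolic links, not the content behind them. The arrangement can be imitated with ordinary shell commands. This is a rough sketch rather than what git-annex actually runs: no real Git repository is created, and the key name KEY-big_file is made up for illustration.

```shell
#!/bin/sh
# Illustrative only: mimic "content in an objects directory, symlink in the tree".
# The key name KEY-big_file is hypothetical, and no real Git repository is created.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p repo/.git/annex/objects
printf 'pretend this is huge\n' > repo/big_file

# Move the content out of the way and leave a relative symlink behind.
mv repo/big_file repo/.git/annex/objects/KEY-big_file
ln -s .git/annex/objects/KEY-big_file repo/big_file
cat repo/big_file            # reading through the link still works

# A stand-in for a clone: it carries the link but not the content,
# so the link dangles until the content is fetched.
mkdir clone
ln -s .git/annex/objects/KEY-big_file clone/big_file
if [ -L clone/big_file ] && [ ! -e clone/big_file ]; then
    echo "clone/big_file: link present, content missing"
fi
```

The dangling link in the stand-in clone is the state that fetching the content is meant to resolve.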
Getting file content is as simple as doing:
$ git annex get some_file
while removing files is done with:
$ git annex drop some_file
though that may fail if git-annex cannot find another copy in the
repositories it can currently contact (a check that can, of course, be
overridden). Syncing between repositories is done with the usual
"git pull" command. Another nice feature of git-annex is that
it works seamlessly with
files that are already present in the Git repository, so handling a
combination of giant and normal-sized files is easy.
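The safety check behind "git annex drop" can be pictured roughly as follows. This is a toy model under stated assumptions, not git-annex's implementation: plain directories stand in for the repositories that can currently be contacted, and the function name can_drop and the key KEY-big_file are invented for the example.

```shell
#!/bin/sh
# Toy model of "refuse to drop unless enough other copies are reachable".
# can_drop REQUIRED KEY DIR...  succeeds (exit status 0) if at least REQUIRED
# of the given directories (stand-ins for other repositories) hold a file
# named KEY. All names here are hypothetical.
can_drop() {
    required=$1
    key=$2
    shift 2
    found=0
    for repo in "$@"; do
        if [ -e "$repo/$key" ]; then
            found=$((found + 1))
        fi
    done
    [ "$found" -ge "$required" ]
}

tmp=$(mktemp -d)
mkdir "$tmp/usb" "$tmp/server"
: > "$tmp/server/KEY-big_file"      # only the "server" holds the content

if can_drop 1 KEY-big_file "$tmp/usb" "$tmp/server"; then
    echo "one copy required: safe to drop"
fi
if ! can_drop 2 KEY-big_file "$tmp/usb" "$tmp/server"; then
    echo "two copies required: refusing to drop"
fi
```

Raising the required count corresponds to configuring git-annex to keep more than one copy, as mentioned earlier.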
There are several types of storage backends that
git-annex can use to
store the key-value pairs that relate the filename to its contents. The
default is WORM (write once, read many), which is also the least expensive
because it assumes that file contents do not change once they have been stored.
The SHA1 backend stores the file content object based on its SHA1 hash,
which can be an expensive operation on very large files, but will track
changes to the contents. There is also a
URL backend that fetches the content from an external URL (as the name
implies).
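The trade-off between a cheap, metadata-based key and a content hash can be seen in a small sketch. The key formats below are invented for illustration and are not git-annex's actual ones; the function names metadata_key and content_key are likewise hypothetical, and the sha1sum utility from GNU coreutils is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical key derivations; the formats are NOT git-annex's real ones.
set -e

# Cheap: built from metadata only, so the file's bytes are never hashed.
# It cannot notice a change that leaves the name and size untouched.
metadata_key() {
    size=$(wc -c < "$1" | tr -d ' ')
    printf 'META-%s-%s\n' "$size" "$(basename "$1")"
}

# Expensive: reads every byte, but any change to the content changes the key.
content_key() {
    printf 'SHA1-%s\n' "$(sha1sum < "$1" | cut -d' ' -f1)"
}

f=$(mktemp)
printf 'aaaa' > "$f"
m1=$(metadata_key "$f"); c1=$(content_key "$f")
printf 'bbbb' > "$f"                 # same size, different content
m2=$(metadata_key "$f"); c2=$(content_key "$f")

if [ "$m1" = "$m2" ]; then echo "metadata key unchanged"; fi
if [ "$c1" != "$c2" ]; then echo "content key changed"; fi
```

On a multi-gigabyte file, the hashing step is where the time goes, which is the cost the article attributes to the SHA1 backend.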
This only scratches the surface of git-annex and what it can do, so
interested readers should take a wander through the documentation that Hess
provides. In the announcement of git-annex, Hess also points to two other
projects that he calls "software tools that use git in ways that were
never intended". The first is mr, which treats a set of
repositories in various repository formats (svn, git, cvs, hg, bzr, ...) as
if they were one combined repository. The other, etckeeper, hooks into
package managers like apt and yum to commit changes to files in
/etc when they are changed by a package update. One of the
advantages of free software is that it allows folks to do things that were
unanticipated by the original developer; it would certainly seem that Hess
has done just that.
Brief items
Also, anytime you are creating a new commit with the same changes
as another commit, you are destroying `git blame`'s ability to tell
you who to flog publicly. And as we all know, public floggings are
the lifeblood of software development teams.
-- Paul Stadig
Many of the economic arguments in favor of releasing code as open
source, and dedicating a significant fraction of an engineer's time
to serve as a OSS project maintainer or kernel subsystem
maintainer, are ones that make much more sense at a very large
company like Google or IBM. That's not because startups are evil,
or deficient in any way; just the economic realities that at a
successful startup, everything has to be subordinated to the
central goal of proving that they have a sustainable, scalable
business model and that they have a good product/market fit.
Everything else, and that includes participating in an open source
community, is very likely a distraction from that central goal.
-- Ted Ts'o
The results over the last year have been really amazing. Between
the two of us Andrew [Bartlett] and I have pushed over 2500 patches to the
Samba master repository over a year of pair programming, which is
more than twice what we managed in the previous year. I find it
really interesting that despite only one of us typing at a time, we
get much more done with pair programming than when we work
separately. The results are even more notable when you take into
account that in the last year Andrew has been rebuilding his house
and looking after a new baby!
I think the reason it works so well is that it tends to minimise
procrastination. When I code alone and I'm stuck on a bit of code,
I often find myself drifting off to read slashdot or muck about
with some new application that I've found. That happens a lot less
when someone else is watching over your shoulder on VNC. We discuss
how we're going to solve the problem and then we solve it, without
the hours of procrastination in between.
-- Andrew Tridgell
The GRUB project has announced its decision to add ZFS support to the GRUB
bootloader, despite the facts that (1) Oracle has not assigned
copyright to the FSF, and (2) ZFS is not thought to carry a
GPL-compatible license. "The ZFS code that has been imported into
GRUB derives from the OpenSolaris version of GRUB Legacy. On one hand,
this code was released to the public under the terms of the GNU GPL. On
the other, binary releases of Solaris included this modified GRUB, and as a
result Oracle/Sun is bound by the GPL." (Thanks to Luis Rodriguez)
KDE SC 4.6 Beta 2 has been released. "KDE SC 4.6 Beta2 is targeted at testers
and those that would like to have an early look at what's coming to their
desktops and netbooks this summer. KDE is now firmly in beta mode, meaning
that the primary focus is on fixing bugs and preparing the stable release of
the software compilation this summer. Since the release of the first
beta two weeks ago 1318 bugs have been reported and 1176 bugs have been closed."
KDE.News carries the
news that KOffice has been rebranded as "the Calligra Suite" and given
a wider focus. "The Calligra Suite introduces the Calligra Office
Engine which makes it easy for developers to create new user experiences,
target new platforms and create specialized versions for new kinds of
users. Currently, there are two main user experiences: the desktop UI with
the applications mentioned above, and FreOffice which is the only free
mobile office suite in existence."
Psycopg, which is a PostgreSQL adapter for Python, has released version 2.3.1. It is simply a fix for a CentOS build bug in the unannounced 2.3.0 version. Major new features in 2.3.0 are:
- dict to hstore adapter and hstore to dict typecaster, using both 9.0
and pre-9.0 syntax.
- Two-phase commit protocol support as per DBAPI specification.
- Support for payload in notifications received from the backend.
- namedtuple-returning cursor.
- Query execution cancel.
PublicSQL, which is a new way to handle SQL queries from within web applications by storing the data in tables in JavaScript code, has been announced. The tables are generated from the query and will be loaded automatically into the web page. This allows for web applications that don't require a database server, but can still provide SQL services.
Newsletters and articles
It would seem that there is a brewing conflict between the development community for
Hudson, which is an open source continuous integration server, and Oracle, which owns the trademark to the name, over where the code and development infrastructure will be hosted. Over at the Hudson Labs blog, R. Tyler Croy
lays out a timeline of the disagreement, along with some of his opinions on what's going on. "The fundamental issue here is that the developers want to make a change in how they contribute to Hudson, and have made their voices heard to that end. From the users' perspective, such a change would have literally zero impact on them, which makes Oracle's conflation of the two sides of Hudson particularly frustrating." (Thanks to Croy and Christof Damian for bringing it to our attention).
Henrik Ingo has posted the
results of a study on project governance concluding that the key factor
distinguishing large and successful projects is the existence of a
nonprofit governing foundation. "There appears to be a glass ceiling
for single vendor projects prohibiting their growth from the Large category
upwards. To truly reach their fullest potential, open source projects are
recommended to consider the proven governance model of a non-profit
foundation around which participants collaborate."
Page editor: Jonathan Corbet