March 2, 2005
This article was contributed by Frank Pohlmann
There was a time when there were only a few open source version control systems:
CVS and
RCS
were the most prominent examples and there was little else.
Since the late 1990s a huge number of Source Code Management
(SCM) systems have come into
existence.
GNU Arch,
Subversion and
Monotone
are some of the more prominent projects, but there seems to be no
consensus as to what constitutes a good approach to Source Code
Management. As a result, open source SCMs fill a huge number of
niches, although - as
Larry McVoy
pointed out some time ago - except for systems that scale well to hundreds of users, there is little money to be made from consultancy or support. Famously, Linus Torvalds uses Larry's commercial package
BitKeeper.
Architecture and Features
GNU Arch is a distributed version control management system, i.e. it allows
the "cloning" of a directory tree containing the source or binary files stored in a
local or remote repository. The word "directory" is used advisedly here,
since Arch creates new repositories and archives by creating new
directories on ftp, sftp or WebDAV servers. There is no database or
special file format underlying GNU Arch; as the documentation
points out, "remote archives do not require an Arch specific server." GNU
Arch setup is therefore remarkably simple.
Tom Lord designed and
wrote GNU Arch.
In keeping with the fractious history of open source SCM tools, GNU Arch
spawned its own secessionist project named
ArX, which was written in C++
and is being led by
Walter Landry.
Tom Lord started the GNU Arch project as a shell script collection to avoid
having to use CVS; CVS uses a client-server model and does not support
certain types of merge operations, among other things.
Since each branch has its own version of the source tree, and all commands work across local and remote versions of the source tree, someone with read access to a remote branch can merge the changes committed there by a different user into her own source tree: no centralized server is necessary.
Commits to source trees are always atomic; the changesets in Arch record a wide variety of changes, for instance symbolic link additions, directory changes and, very importantly, renames.
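To make the idea of an atomic commit on a plain directory tree concrete, here is a minimal sketch. It assumes the common stage-then-rename technique and is not taken from the Arch sources; the function name `atomic_commit`, the `,,staging-` prefix and the flat file layout are all invented for illustration.

```python
import os
import shutil
import tempfile


def atomic_commit(archive_dir, revision, files):
    """Publish `files` (name -> bytes) as directory `revision` in one step.

    Hypothetical helper: an illustration of stage-then-rename, not Arch code.
    """
    final_path = os.path.join(archive_dir, revision)
    if os.path.exists(final_path):
        raise FileExistsError(f"revision {revision!r} already exists")
    # Stage everything in a scratch directory on the same file system.
    staging = tempfile.mkdtemp(prefix=",,staging-", dir=archive_dir)
    try:
        for name, data in files.items():
            with open(os.path.join(staging, name), "wb") as handle:
                handle.write(data)
        # A single rename makes the complete revision visible, so readers
        # never observe a half-written one.
        os.rename(staging, final_path)
    except BaseException:
        # On any failure, discard the partial staging directory.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final_path
```

A reader listing the archive directory sees either no trace of the new revision or all of it, which is the property the word "atomic" refers to here.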
Revisions are always uniquely and globally identifiable. It is perfectly possible to remove and add the same changes to permit experimentation with the code. The merging process will forgive such cruelty, recording the change history and even making the subsets of changes viewable by other developers.
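One way to picture why globally unique revision names make repeated merging safe is the sketch below. It assumes each tree keeps a log of the revision names it already contains; the revision names and the helpers `missing_patches` and `merge` are made up for illustration and say nothing about Arch's actual on-disk format.

```python
def missing_patches(local_log, remote_history):
    """Return the remote revisions, in order, that the local tree lacks."""
    return [rev for rev in remote_history if rev not in local_log]


def merge(local_log, remote_history):
    """Record every missing revision in the log; return those just applied.

    Applying the changesets themselves is left out of this sketch.
    """
    applied = missing_patches(local_log, remote_history)
    local_log.update(applied)
    return applied


# Hypothetical revision names; only their uniqueness matters here.
remote = [
    "alice@example.org--2005/hello--main--1.0--base-0",
    "alice@example.org--2005/hello--main--1.0--patch-1",
    "alice@example.org--2005/hello--main--1.0--patch-2",
]
local = {remote[0]}

first = merge(local, remote)   # picks up patch-1 and patch-2
second = merge(local, remote)  # nothing left to apply
```

Because every revision has a name that is unique everywhere, a second merge from the same branch finds nothing new, and the log doubles as a change history that other developers can inspect.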
Atomic commits make it possible for changes to propagate to all repositories. If the committer is working from an http repository, access is read-only: the remote user can only accept changes, and the committer cannot write
changes to the remote repository. If all users of GNU Arch use
ftp, sftp or WebDAV, the committer can work from whatever repository he chooses, since he is likely to have cloned the master repository. Once he has finished working, he can propagate the changes to the master repository, or he can just make them available to all members of the project.
It helps that GNU Arch is built on standard Unix utilities, since the files Arch is working with essentially consist of a number of
tar files saved in a Unix directory tree with a few control files thrown in
for good measure. All commits and imports just send compressed tar files to
the remote repository. This, as Tom Lord explains in some depth, could
lead to performance problems. GNU Arch tries to shift the
performance load mostly onto client machines, and it also takes
advantage of the fact that disk space is a lot cheaper
(in terms of cost and performance) than bandwidth.
There are several mechanisms to cope with this problem.
One is cached revisions. The user can choose a reasonably
spaced interval at which a full cached revision is stored
in the master or local repository.
This avoids the problem of sucking down dozens of changesets during
a major update, and the heavy network load that comes with it. After comparing the size of the compressed source tree revision with the number and size of changesets, the user can choose a caching policy. Not all users consider this an advantage, and high-traffic development sites might find this feature problematic.
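The trade-off behind cached revisions can be sketched in a few lines. This is only an illustration under simple assumptions: revisions are numbered consecutively from 0, revision 0 is always available as a full tree, and the names `plan_checkout` and `worth_caching` are invented here rather than taken from Arch.

```python
def plan_checkout(target, cached):
    """Return (starting_revision, changesets_to_apply) for `target`.

    `cached` lists the revision numbers for which a full tree is stored.
    """
    # Start from the newest full tree that is not newer than the target.
    candidates = [0] + [rev for rev in cached if rev <= target]
    start = max(candidates)
    return start, list(range(start + 1, target + 1))


def worth_caching(full_tree_bytes, changeset_bytes):
    """True when one cached tree is smaller than the changesets it replaces."""
    return full_tree_bytes < sum(changeset_bytes)


start, todo = plan_checkout(57, cached=[20, 40, 60])
# With cached trees at 20, 40 and 60, revision 57 is built from the
# tree cached at 40 plus the 17 changesets numbered 41 through 57.
```

Without any cached revisions the same checkout would have to fetch and apply all 57 changesets, which is the bandwidth burden described above; `worth_caching` expresses the size comparison a user might make before picking an interval.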
Another approach is to use so-called read-only archive mirrors.
It is perfectly possible to store revisions and changesets at special
archive mirror locations. This can lessen the load on the master
repository, and simplify the work of a developer who is making
many different changes.
A final - and completely client-side - feature of GNU Arch configuration is
called a revision library. Again, by using local disk
space, pre-built copies of read-only source tree revisions are stored locally, but files that have been left unmodified between revisions are shared. Some file-linking magic keeps the files changed by a new changeset private to the newly patched tree, while everything else stays shared with earlier revisions.
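The sharing trick can be illustrated with ordinary hard links. The sketch below assumes hard links are the linking mechanism and uses flat, single-directory trees to stay short; `add_revision` and the directory names are hypothetical, not Arch's own layout.

```python
import os


def add_revision(library, previous, new, changed):
    """Create tree `new` in `library` from `previous` plus `changed` files.

    `changed` maps file names to their new contents (bytes).
    """
    prev_dir = os.path.join(library, previous)
    new_dir = os.path.join(library, new)
    os.makedirs(new_dir)
    for name in os.listdir(prev_dir):
        if name not in changed:
            # Unmodified: share the same on-disk data via a hard link.
            os.link(os.path.join(prev_dir, name), os.path.join(new_dir, name))
    for name, data in changed.items():
        # Modified or new: write a private copy for this revision only.
        with open(os.path.join(new_dir, name), "wb") as handle:
            handle.write(data)
    return new_dir
```

This also shows why the library trees need to stay read-only: a hard-linked file is one piece of data under several names, so editing it in place would silently alter every revision that shares it.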
Other features make GNU Arch truly shine, in particular with regard to merging, although it has to be said that low-level work with GNU Arch
can be demanding. It has an extremely complex command set, allowing a
level of control and granularity that is unusual, even for source code
management professionals.
It is not easy to compare GNU Arch to other OSS version control management systems, unless one is willing to
compare it to other distributed architectures.
Neither CVS nor Subversion falls into that category.
For anyone migrating from CVS or Subversion, it is possible to
feel at home, since the base command sets are similar.
It is useful to budget some time for the migration, since
GNU Arch documentation is not entirely comprehensive.
All in all, it is a very fast, very powerful version control management system, well suited to the distributed world of open source development.