May 14, 2008
This article was contributed by Diego Pettenò
One of the important rights that Free Software gives you is the ability
to take the source code of any software, modify it, and release it again
under a compatible Free Software license.
It is a very important freedom, as it allows not only users to
customize the software they use to better suit their requirements,
but also enables distributions to patch software to build in their
environment. Environmental changes include new architectures and
different versions of system tools and libraries.
As with other important freedoms, this ability can prove to be a huge
problem if not handled properly. There can be problems for
the original author, the person doing the fork, and the users of the
various versions of the software.
The story of Free Software is full of good examples of forks handled correctly,
like the EGCS
fork that transformed the GNU C Compiler into the
GNU Compiler Collection (GCC),
or more recently the replacement of
Jörg Schilling's
cdrtools
with the cdrkit
package that is now found in most distributions.
Unfortunately, the list of bad examples is longer.
Historically, forking a project was a difficult task for most single developers: handling version control repositories (especially with CVS)
was not something done easily. It limited the task of forking to
experienced developers, who usually had enough common sense to know
when forking was not an option.
Nowadays, forking is much easier:
Subversion
allows developers to easily fetch the whole history of a project.
Distributed version control systems (DVCS) like git, Mercurial,
Bazaar-NG and others remove the need for a central repository, making
forking and branching two very similar activities.
Recently, the GitHub hosting site
has made this action even more prominent by adding a "fork" button on the
pages for the repositories hosted on its servers, allowing anybody to
create a new branch (or fork) of a project with a single mouse click.
The Downsides of Forking
Forking is not always the best option. It should probably be considered
the last resort. Forking divides efforts
as the two projects often take slightly different turns.
The result of the fork is that the two versions of the code diverge, even
though they share the same interface and most
of the background logic.
This creates a series of technical problems that reflect
on the non-technical attributes of a program.
A forked project reuses a big part of the code from the original
project. This causes code duplication, with its usual problems, and one
in particular: security risks. A forked project is usually
vulnerable to the problems the original project had, unless that part
of the code has been rewritten or modified with time.
As the forks evolve, authors often miss the security issues fixed by
their ancestor, making it harder for developers to track the issues down.
Another common problem is the division of users' contributions.
Users usually just report issues to one project, the one they use.
So either the developers of the two projects exchange information about
the bugs they fix in the common code, or the problems will likely be
ignored by one of the two projects, making the distance between the
projects increase.
You can find this very problem with software like
Ghostscript, the
omnipresent PostScript processor, used to generate, view and convert
PostScript files. Its development is currently divided into multiple
forks which do not always give their code back to the originating
project.
You can find one version released under the AFPL (Aladdin Free Public
License), one released under the GPL, a commercial/proprietary one,
and one version that used to be developed by Easy Software
Products, the authors of the CUPS printing system.
The reasons for the forks here were mostly related to licensing issues.
And, in the case of ESP, to better support CUPS.
In the end, the development of different bloodlines for the project
caused, and still causes, problems for distribution maintainers.
Distribution issues include keeping packages aligned, which means
doubling the effort needed to fix the code if it breaks or if it
doesn't follow policy.
Another case where dividing the development effort has caused problems
is in the universe of Logitech mouse control software.
The
lmctl
project was started as a tool to control some
settings of Logitech devices, like resolution and cordless channels.
The code has to know which devices have which settings available.
To do this, it keeps a table of USB identifiers. As new devices started
appearing on the market and Linux users started using them, the table
became outdated.
Distributions patched this up, but in different ways, creating
inconsistent tables. Some users started releasing their own modified
version of lmctl with an extended table to support different devices.
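To make the problem of inconsistent tables concrete, here is a minimal sketch of how separately patched device tables could be compared. It is not lmctl's code: the function name, the table layout, the product IDs and the capability names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def merge_device_tables(tables):
    """Merge several {usb_product_id: set_of_settings} tables.

    Returns (merged, conflicts): the entries on which every source that
    lists the device agrees, and the IDs on which sources disagree,
    mapped to what each source claims.
    """
    claims = defaultdict(dict)  # product id -> {source name: settings}
    for source, table in tables.items():
        for product_id, settings in table.items():
            claims[product_id][source] = settings

    merged, conflicts = {}, {}
    for product_id, by_source in claims.items():
        distinct = {frozenset(settings) for settings in by_source.values()}
        if len(distinct) == 1:
            merged[product_id] = set(next(iter(distinct)))
        else:
            conflicts[product_id] = by_source
    return merged, conflicts


# Made-up product IDs and capability names, for illustration only.
tables = {
    "upstream": {0xC001: {"resolution"}},
    "distro_a": {0xC001: {"resolution"}, 0xC002: {"resolution", "channel"}},
    "distro_b": {0xC001: {"resolution"}, 0xC002: {"resolution"}},
}
merged, conflicts = merge_device_tables(tables)
```

In this made-up data the two distributions disagree about the second device, so it ends up in `conflicts` rather than in `merged`. That is the kind of divergence that a single, shared upstream table would avoid.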
While explicit forks of entire projects have problems, the fact that
they delineate where they took the code from makes it easier to track
down the source of bugs and handle security vulnerabilities. On the
other hand, when a project borrows some code and imports it in its
source distribution, this kind of tracking becomes more difficult.
Free Software licenses explicitly allow, and push for, importing code
between projects; cross-pollination also improves general code quality
over time.
For most distributions, an internal imported copy of a library inside
another project is also a violation of policy. For this reason the
developers will most likely try to make the project use a shared, external
copy of the code.
This works fine when the other library is simply bundled together
untouched, but it becomes a nuisance if there are subtle changes
which might not be apparent at a first impression.
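One way a packager might check whether a bundled copy is really untouched is to compare it, file by file, against a pristine upstream tree of the same version. The following is only a sketch under that assumption; the function names and the shape of the report are hypothetical and do not come from any distribution's tooling.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_trees(upstream, bundled):
    """Report how a bundled copy differs from the pristine upstream tree."""
    up, bun = file_digests(upstream), file_digests(bundled)
    return {
        "modified": sorted(p for p in up.keys() & bun.keys() if up[p] != bun[p]),
        "only_upstream": sorted(up.keys() - bun.keys()),
        "only_bundled": sorted(bun.keys() - up.keys()),
    }
```

An empty "modified" list suggests that the copy can be swapped for the shared library without surprises. Any entry in it is a local change that somebody has to understand before unbundling.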
If you want to keep an internal copy of a library, treat it as an
untouchable piece of code.
Instead of spending time fixing bugs inside that copy of the code, the
developers should try to fix the bugs in the original sources, so that
everybody (including themselves) can make use of the improvement.
In the real world, one example of this can be the
FFmpeg source code.
FFmpeg is imported by many different Free Software projects in the area
of multimedia: xine, MPlayer, GStreamer. While it is a very wide common ground for all these projects, as well as for some others that aren't
importing a copy of it like VLC, some of the imports change the source
code, in more or less subtle ways. In the case of xine, the whole build
system is replaced to integrate it with the automake-based build system
used by the rest of the library. Further patching is done to the
sources themselves so that they behave in a slightly different way than
the original. The code rots quickly and bugs that were already
fixed in the in-development sources of FFmpeg still sprout in xine-lib.
Maintaining such an import is a difficult and boring task, to the point
that the developers, in the past two years, have spent a lot of energy
toward the goal of not using an internal copy of FFmpeg anymore.
The result is that the difference between the original FFmpeg and the
internal copy is now much smaller, mostly limited to the build system.
Where the project once advised against using an external copy of FFmpeg,
it now advises against using the internal one. For the next minor version of
xine-lib, FFmpeg is being used pristine, entirely unpatched, and it will
probably not even be bundled with the library in the near future.
Successful Forks
Of course it's not all bad. There are successful forks in Free Software,
and many of them are now more famous than their parents. I've already named
the GNU Compiler Collection, which is the GCC that almost all Free Software
users have at hand at the moment. Most people use GCC version 3
and later, which started as a fork of the other GCC (the GNU C Compiler), version 2. The original development of GCC was, like many other GNU projects, very closed to the community.
In the terms of Eric S. Raymond's book The Cathedral and the
Bazaar, it was Cathedral-style development, which often prepares the ground for forks, and this was no exception. Multiple forks of the GCC
code were created. Their goals, while different, often didn't clash, and could easily have been worked on at the same time. Some of the forks were
then merged into the EGCS project, which eventually replaced the original
GCC.
Again citing GNU's Cathedral-style of development, it's difficult not to
talk about GNU Emacs
and its brother XEmacs.
Created originally to
support one particular product, the XEmacs project is nowadays a mostly
standalone project. XEmacs is kept at an arm's length from GNU Emacs,
mostly because of licensing and copyright assignment issues.
Neither version can be considered a superset of the other because they
both implement features in their own way.
Better is the state of
Claws Mail,
started as a different
branch of Sylpheed,
with the name Sylpheed Claws. Originally the intention was
to develop new features that could one day find their way back to the
original code. Claws Mail has since declared itself independent and
is now a stand-alone project. In this case, the exchange of code between the two projects has basically halted, as the code bases have diverged so
much that they retain very little in common.
In the case of the Ultima Online server emulators, forks became daily
events, and cross-pollination had grown to the point where at least five
projects were linked by family ties.
The UOX3 source code has been
forked, reused, imported and cut down so that it is present in WolfPack,
LoneWolf, NoX-Wizard and Hypnos.
Almost all of the UOX3 forks involved re-writing parts of the code,
as it had stratified to the point of not being maintainable.
The forks continued copying one from the other to make use of the best
features available.
Forking vs. Branching
There are a few good reasons why you might want to detach, temporarily,
from a given development track: development of experimental features, new
interfaces, backend rewrites or resurrection of a project whose original authors are unavailable.
In most of these cases, forking is not the best solution but
branching most likely is. Although the border between these two
actions has started to blur thanks to distributed VCS, branching
usually doesn't involve setting up a new web page for the project,
changing its name or finding a new goal. And a branch is usually
related, tightly or not, to the original project. Merges between
the two code bases often happen at more or less regular intervals,
and ideas and bug reports are shared.
Branches are usually meant to be merged into the main
development track, sooner for small, testing branches, or later for huge
rewrites. They don't usually divide efforts, as the
fixes for problems affecting the main branch are propagated to the
other branches when they merge the original code back in.
One common problem with developing through branches involved bad support
in the Subversion version control system. In Subversion the branches are
represented as a different path in the repository, with almost no help
for branches in the merge operations.
With a modern distributed VCS, branches are so cheap that
any checkout is, from some points of view, a different branch, and the
merge operations are one of the main focuses.
Projects like the Linux kernel or xine-lib rely heavily on an
above-average number of branches. These are often short-lived and
used for testing purposes.
Looking to the Future
Forks will never end in Free Software as they are supported by one of
the freedoms that make Free Software what we all want it to be.
The future will, of course, bring new forks.
Recently there has been a lot of talk about
Funpidgin,
a fork of the widespread Pidgin Instant Messaging client (formerly Gaim).
Again it seems like it was the Cathedral-style development of the original
code that motivated a fork that could give (some of) the users what
they wanted.
And even though GNU Emacs opened its process quite a lot, its forks
haven't stopped sprouting. This is despite the fact that
Richard Stallman, original author and mastermind behind the GNU project,
stepped down as maintainer, putting in place Stefan Monnier and Chong Yidong.
The Aquamacs Emacs is still diverging from the original GNU
Emacs for supporting Apple's Mac OS X, while different versions
are being developed to support the multiple user interfaces one can use
on that operating system. Similarly, although the Windows port of Emacs
is already pretty solid, there are extensions being written to make it
easier for users to adapt it to the Microsoft environment.
Forks are usually the effect of a closed-circle development, a Cathedral,
where some of the developers or users can't see their objective being
fulfilled, with all their energy being poured in. So just look for the
projects that don't seem to be getting much love from a community, and
you might find a fork starting to make its first leaves.
Then there is the
Poppler project,
which merged together the modified versions of the XPDF code imported by
projects like GNOME and KDE for their PDF viewers.
Poppler is soon going to be a nearly omnipresent PDF viewer on Free
Software desktops and beyond.
This summer's milestone KDE 4.1 release will include the release of
the new oKular document viewer, which will use Poppler for PDF rendering
on the (stable) KDE users' desktops.
Conclusions
I'd suggest that anybody thinking about creating a fork should think
twice. Forking is rarely a good choice. Better choices can be
branching or, if you need just part of the code, working together as the
Poppler developers did to split out and share the common parts.
When you want to make some changes to a software project, propose
branching it, show the results to the original developers and discuss
with them how to improve the code. Most of the time you'll find
authors are open to the changes.
A fork is a grave matter. It might bring innovation to the Free Software
community, but it could also separate developers that could otherwise
work together, maybe in a better way. In this light, GitHub's one-click
forking capability seems like a dangerous feature.
The ever-increasing ease of forking everything, from small projects
to parts of, or even entire, distributions (think about Debian's
repositories and Gentoo's overlays) is increasing the fragmentation of
Free Software projects. Biodiversity in software can be a very good
thing, just like in nature, but people should first try their best to
work together, rather than one against the other.