The Freedom of Fork
One of the important rights that Free Software gives you is the ability to take the source code of any software, modify it, and release it again under a compatible Free Software license. This is a very important freedom: it allows not only users to customize the software they use to better suit their requirements, but also distributions to patch software so that it builds in their environments. Environmental changes include new architectures and different versions of system tools and libraries. As with other important freedoms, this ability can prove to be a huge problem if not handled properly, causing trouble for the original author, the person doing the fork, and the users of the various versions of the software.
The story of Free Software is full of good examples of forks handled correctly, like the EGCS fork that transformed the GNU C Compiler into the GNU Compiler Collection (GCC), or more recently the replacement of Jörg Schilling's cdrtools with the cdrkit package that is now found in most distributions. Unfortunately, the list of bad examples is longer.
Historically, forking a project was a difficult task for most individual developers: handling version control repositories (especially with CVS) was not easily done. This limited forking to experienced developers, who usually had enough common sense to know when it was not an option.
Nowadays, forking is much easier: Subversion allows developers to easily fetch the whole history of a project. Distributed version control systems (DVCSs) like git, Mercurial, Bazaar-NG and others remove the need for a central repository, making forking and branching two very similar activities. Recently, the GitHub hosting site has made this action even more prominent by adding a "fork" button on the pages of the repositories hosted on its servers, allowing anybody to create a new branch (or fork) of a project with a single mouse click.
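To see how thin the line between forking and branching has become with a DVCS, consider this sketch with git (the repository paths, file names and branch names below are throwaway examples, not taken from any real project): a "fork" is just a clone that carries the full history, and a branch inside that clone costs one command.

```shell
# Stand-in "upstream" repository; all paths and names are illustrative.
rm -rf /tmp/upstream /tmp/myfork
git init -q /tmp/upstream
cd /tmp/upstream
echo 'hello' > README
git add README
git -c user.name=demo -c user.email=demo@example.org commit -qm 'initial import'

# "Forking" with a DVCS is just cloning: the clone carries the whole history.
git clone -q /tmp/upstream /tmp/myfork
cd /tmp/myfork

# Branching inside the clone is equally cheap, which is why the distinction
# between fork and branch becomes one of intent rather than mechanics.
git checkout -q -b experimental
git log --oneline
```

From here, publishing the clone somewhere public is all it takes to turn a private branch into a public fork.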
The Downsides of Forking
Forking is not always the best option; it should probably be considered the last resort. Forking divides efforts, as the two projects often take slightly different turns: the two versions of the code diverge even though they share the same interface and most of the background logic. This creates a series of technical problems that also reflect on the non-technical attributes of a program.
A forked project reuses a big part of the code from the original project. This causes code duplication, with its usual problems, and one in particular: security risks. A forked project usually remains vulnerable to the problems the original project had, unless that part of the code has since been rewritten or modified. As the forks evolve, their authors often miss security fixes applied in the ancestor project, making the issues harder to track down.
Another common problem is the division of users' contributions. Users usually report issues only to the project they use. So either the developers of the two projects exchange information about the bugs they fix in the common code, or the problems will likely be ignored by one of the two projects, widening the gap between them.
You can find this very problem with software like Ghostscript, the omnipresent PostScript processor, used to generate, view and convert PostScript files. Its development is currently divided into multiple forks which do not always give their code back to the originating project. You can find one version released under the AFPL (Aladdin Free Public License), one released under the GPL, a commercial/proprietary one, and one version that used to be developed by Easy Software Products, the authors of the CUPS printing system.
The reasons for the forks here were mostly licensing issues and, in the case of ESP, the desire to better support CUPS. In the end, the development of different bloodlines for the project caused, and still causes, problems for distribution maintainers: keeping the packages aligned doubles the effort needed to fix the code if it breaks or doesn't follow policy.
Another case where dividing the development effort has caused problems is the universe of Logitech mouse control software. The lmctl project was started as a tool to control some settings of Logitech devices, such as resolution and cordless channels. The code has to know which devices offer which settings, so it keeps a table of USB identifiers. As new devices appeared on the market and Linux users adopted them, the table became outdated. Distributions patched it up, but in different ways, creating inconsistent tables, and some users started releasing their own modified versions of lmctl with extended tables to support additional devices.
While explicit forks of entire projects have problems, the fact that they state clearly where they took the code from makes it easier to track down the source of bugs and handle security vulnerabilities. On the other hand, when a project borrows some code and imports it into its source distribution, this kind of tracking becomes more difficult. Free Software licenses explicitly allow, and even encourage, importing code between projects; cross-pollination also improves general code quality over time.
For most distributions, an internal copy of a library imported into another project is also a policy violation, so their developers will most likely try to make the project use a shared, external copy of the code. This works fine when the library is simply bundled untouched, but it becomes a nuisance if there are subtle changes that might not be apparent at first glance. If you do want to keep an internal copy of a library, treat it as an untouchable piece of code: instead of spending time fixing bugs inside that copy, the developers should fix them in the original sources, so that everybody (including themselves) can make use of the improvement.
In the real world, one example of this is the FFmpeg source code. FFmpeg is imported by many different Free Software projects in the multimedia area: xine, MPlayer, GStreamer. While it is very wide common ground for all these projects, as well as for others that don't import a copy of it, like VLC, some of the imports change the source code in more or less subtle ways. In the case of xine, the whole build system is replaced to integrate with the automake-based build system used by the rest of the library, and further patching is done to the sources themselves so that they behave slightly differently from the original. The code rots quickly, and bugs that were already fixed in the in-development sources of FFmpeg still resurface in xine-lib.
Maintaining such an import is a difficult and boring task, to the point that the developers have spent a lot of energy over the past two years on no longer using an internal copy of FFmpeg. As a result, the difference between the original FFmpeg and the internal copy is now much smaller, mostly limited to the build system, and rather than advising against an external copy of FFmpeg, the developers now advise against using the internal one. For the next minor version of xine-lib, FFmpeg is being used pristine, entirely unpatched, and it will probably not even be bundled with the library in the near future.
Successful Forks
Of course it's not all bad. There are successful forks in Free Software, and many of them are now more famous than their parents. I've already named the GNU Compiler Collection, which is the GCC that almost all Free Software users have at hand today. Most people use GCC version 3 or later, which started as a fork of version 2 of the other GCC, the GNU C Compiler. The original development of GCC was, like that of many other GNU projects, very closed to the community.
As Eric S. Raymond put it in his book The Cathedral and the Bazaar, Cathedral-style development often prepares the ground for forks, and this was no exception. Multiple forks of the GCC code were created; their goals, while different, often didn't clash and could easily have been worked on at the same time. Some of the forks were then merged into the EGCS project, which eventually replaced the original GCC.
Again citing GNU's Cathedral style of development, it's difficult not to talk about GNU Emacs and its brother XEmacs. Created originally to support one particular product, the XEmacs project is nowadays a mostly standalone one, kept at arm's length from GNU Emacs mainly because of licensing and copyright-assignment issues. Neither version can be considered a superset of the other, because each implements features in its own way.
The state of Claws Mail is better. It started as a separate branch of Sylpheed under the name Sylpheed Claws, originally intended to develop new features that could one day find their way back into the original code. Claws Mail has since declared itself independent and is now a stand-alone project; the exchange of code between the two projects has basically halted, as the code bases have diverged so much that they retain very little in common.
In the case of the Ultima Online server emulators, forks became daily events, and cross-pollination grew to the point where at least five projects were linked by family ties. The UOX3 source code has been forked, reused, imported and cut down so that it is present in WolfPack, LoneWolf, NoX-Wizard and Hypnos. Almost all of the UOX3 forks involved rewriting parts of the code, as it had stratified to the point of being unmaintainable, and the forks kept copying from one another to make use of the best features available.
Forking vs. Branching
There are a few good reasons why you might want to detach, temporarily, from a given development track: development of experimental features, new interfaces, backend rewrites, or the resurrection of a project whose original authors are unavailable. In most of these cases, forking is not the best solution; branching most likely is. Although the border between the two actions has been blurring thanks to distributed VCSs, branching usually doesn't involve setting up a new web page for the project, changing its name or finding a new goal. A branch remains related, tightly or loosely, to the original project: merges between the two code bases happen at more or less regular intervals, and ideas and bug reports are shared.
Branches usually aim to be merged into the main development track, sooner for small testing branches, later for huge rewrites. They don't usually divide efforts, as fixes for problems affecting the main branch propagate to the other branches when they merge back the original code.
One common problem with developing through branches used to be poor support in the Subversion version control system, where branches are represented as different paths in the repository, with almost no help for them in merge operations. With a modern distributed VCS, branches are so cheap that any checkout is, from some points of view, a different branch, and merge operations are a main focus. Projects like the Linux kernel or xine-lib rely heavily on an above-average number of branches, often short-lived and used for testing purposes.
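The short-lived testing branches mentioned above can be sketched in a few git commands (the repository path, file name and branch name here are made up for illustration): creating the branch is instantaneous and requires no server round-trip, and merging it back is a first-class operation.

```shell
# Throwaway repository; path, file and branch names are illustrative only.
rm -rf /tmp/branch-demo
git init -q /tmp/branch-demo
cd /tmp/branch-demo
main=$(git symbolic-ref --short HEAD)   # default branch name varies by setup
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m 'initial'

# A short-lived testing branch: creation costs a single local command.
git checkout -q -b testing
echo 'experimental change' > feature.txt
git add feature.txt
git -c user.name=demo -c user.email=demo@example.org \
    commit -qm 'try a new feature'

# Merging back is a first-class operation, in contrast with Subversion's
# path-copy branches, where merge bookkeeping was largely manual.
git checkout -q "$main"
git merge -q testing
```

Because the branch lives only in the local checkout, it can be created, tested and thrown away without anybody else ever seeing it.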
Looking to the Future
Forks will never end in Free Software as they are supported by one of the freedoms that make Free Software what we all want it to be. The future will, of course, bring new forks. Recently there has been a lot of talk about Funpidgin, a fork of the widespread Pidgin Instant Messaging client (formerly Gaim). Again it seems like it was the Cathedral-style development of the original code that motivated a fork that could give (some of) the users what they wanted.
And even though GNU Emacs has opened up its process quite a lot, its forks haven't stopped sprouting. This is despite the fact that Richard Stallman, original author and mastermind behind the GNU project, stepped down as maintainer, putting Stefan Monnier and Chong Yidong in his place. Aquamacs Emacs is still diverging from the original GNU Emacs to support Apple's Mac OS X, while different versions are being developed to support the multiple user interfaces one can use on that operating system. Similarly, although the Windows port of Emacs is already pretty solid, extensions are being written to make it easier for users to adapt it to the Microsoft environment.
Forks are usually the effect of closed-circle development, a Cathedral, where some of the developers or users can't see their objectives being fulfilled despite all the energy being poured in. So just look for the projects that don't seem to be getting much love from a community, and you might find a fork putting out its first leaves.
Then there is the Poppler project, which merged together the modified versions of the XPDF code imported by projects like GNOME and KDE for their PDF viewers. Poppler will soon be a nearly omnipresent PDF rendering library on Free Software desktops and beyond: this summer's milestone KDE 4.1 release will include the new oKular document viewer, which uses Poppler for PDF rendering on the (stable) KDE users' desktops.
Conclusions
I'd suggest that anybody thinking about creating a fork think twice. Forking is rarely a good choice; better options are branching or, if you need just part of the code, working together as the Poppler developers did to split out and share the common parts.
When you want to make changes to a software project, propose branching it, show the results to the original developers and discuss with them how to improve the code. Most of the time you'll find the authors open to the changes.
A fork is a grave matter. It might bring innovation to the Free Software community, but it could also separate developers that could otherwise work together, maybe in a better way. In this light, GitHub's one click forking capability seems like a dangerous feature.
The ever-increasing ease of forking everything, from small projects to parts of distributions, or even entire ones (think of Debian's repositories and Gentoo's overlays), is increasing the fragmentation of Free Software projects. Biodiversity in software can be a very good thing, just like in nature, but people should first try their best to work together, rather than against one another.
