The Freedom of Fork

May 14, 2008

This article was contributed by Diego Pettenò

One of the important rights that Free Software gives you is the ability to take the source code of any software, modify it, and release it again under a compatible Free Software license. It is a very important freedom: it not only allows users to customize the software they use to better suit their requirements, but also enables distributions to patch software so that it builds in their environment. Environmental changes include new architectures and different versions of system tools and libraries. As with other important freedoms, this ability can prove to be a huge problem if not handled properly. There can be problems for the original author, the person doing the fork, and the users of the various versions of the software.

The story of Free Software is full of good examples of forks handled correctly, like the EGCS fork that transformed the GNU C Compiler into the GNU Compiler Collection (GCC), or more recently the replacement of Jörg Schilling's cdrtools with the cdrkit package that is now found in most distributions. Unfortunately, the list of bad examples is longer.

Historically, forking a project was a difficult task for most single developers: handling version control repositories (especially with CVS) was not something done easily. It limited the task of forking to experienced developers, who usually had enough common sense to know when forking was not an option.

Nowadays, forking is much easier: Subversion allows developers to easily fetch the whole history of a project. Distributed version control systems (DVCS) like git, Mercurial, Bazaar-NG and others remove the need for a central repository, making forking and branching two very similar activities. Recently, the GitHub hosting site has made this action even more prominent by adding a "fork" button on the pages for the repositories hosted on its servers, allowing anybody to create a new branch (or fork) of a project with a single mouse click.

The Downsides of Forking

Forking is not always the best option. It should probably be considered the last resort. Forking divides efforts as the two projects often take slightly different turns. The result of the fork is that the two versions of the code diverge, even though they share the same interface and most of the background logic. This creates a series of problems, of a technical nature, that reflects on the non-technical attributes of a program.

A forked project reuses a big part of the code from the original project. This causes code duplication, with its usual problems, and one in particular: security risks. A forked project is usually vulnerable to the problems the original project had, unless that part of the code has been rewritten or modified with time. As the forks evolve, authors often miss the security issues fixed by their ancestor, making it harder for developers to track the issues down.

Another common problem is the division of users' contributions. Users usually just report issues to one project, the one they use. So either the developers of the two projects exchange information about the bugs they fix in the common code, or the problems will likely be ignored by one of the two projects, making the distance between the projects increase.

You can find this very problem with software like Ghostscript, the omnipresent PostScript processor, used to generate, view and convert PostScript files. Its development is currently divided into multiple forks which do not always give their code back to the originating project. You can find one version released under the AFPL (Aladdin Free Public License), one released under the GPL, a commercial/proprietary one, and one version that used to be developed by Easy Software Products, the authors of the CUPS printing system.

The reasons for the forks here were mostly related to licensing issues. And, in the case of ESP, to better support CUPS. In the end, the development of different bloodlines for the project caused, and still causes, problems for distribution maintainers. Distribution issues include keeping packages aligned, which means doubling the effort needed to fix the code if it breaks or if it doesn't follow policy.

Another case where dividing the development effort has caused problems is in the universe of Logitech mouse control software. The lmctl project was started as a tool to control some settings of Logitech devices, like resolution and cordless channels. The code has to know which devices have which settings available. To do this, it keeps a table of USB identifiers. As new devices appeared on the market and Linux users started using them, the table became outdated. Distributions patched this up, but in different ways, creating inconsistent tables. Some users started releasing their own modified versions of lmctl with an extended table to support different devices.
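
To make the table problem concrete, here is a minimal sketch of how two diverged copies of such a device table might be reconciled. It is not lmctl's actual code or data layout: the product IDs, the setting names, the table shape and the `merge_tables` helper are all assumptions made up for illustration. Only the Logitech USB vendor ID (0x046d) is real.

```python
"""Sketch: reconciling two diverged copies of a device table.

Everything here is hypothetical except the Logitech vendor ID.
"""

LOGITECH_VENDOR_ID = 0x046D  # Logitech's USB vendor ID

# Hypothetical tables as two distributions might have patched them.
# Keys are made-up USB product IDs; values are the settings each
# device is assumed to support.
distro_a = {
    0xC001: {"resolution"},
    0xC501: {"resolution", "channel"},
}
distro_b = {
    0xC001: {"resolution"},
    0xC501: {"channel"},      # disagrees with distro_a
    0xC50E: {"resolution"},   # only known to distro_b
}


def merge_tables(a, b):
    """Return (merged_table, conflicting_product_ids).

    Devices known to only one table are copied over. Where both tables
    list a device but disagree, the union of settings is kept and the
    product ID is reported so that a human can check it.
    """
    merged = {}
    conflicts = []
    for pid in sorted(a.keys() | b.keys()):
        in_a, in_b = a.get(pid), b.get(pid)
        if in_a is not None and in_b is not None and in_a != in_b:
            conflicts.append(pid)
        merged[pid] = (in_a or set()) | (in_b or set())
    return merged, conflicts


merged, conflicts = merge_tables(distro_a, distro_b)

for pid, settings in merged.items():
    flag = " (needs review)" if pid in conflicts else ""
    print(f"{LOGITECH_VENDOR_ID:04x}:{pid:04x} {sorted(settings)}{flag}")
```

Taking the union on disagreement is only one possible policy. The point is that every downstream copy of the table forces somebody to do this reconciliation by hand, which is exactly the effort that a single, shared upstream table would avoid.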

While explicit forks of entire projects have problems, the fact that they delineate where they took the code from makes it easier to track down the source of bugs and handle security vulnerabilities. On the other hand, when a project borrows some code and imports it in its source distribution, this kind of tracking becomes more difficult. Free Software licenses explicitly allow, and push for, importing code between projects; cross-pollination also improves general code quality over time.

For most distributions, an internal imported copy of a library inside another project is also a violation of policy. For this reason the developers will most likely try to make the project use a shared, external copy of the code. This works fine when the other library is simply bundled together untouched, but it becomes a nuisance if there are subtle changes which might not be apparent at first glance. If you want to keep an internal copy of a library, treat it as an untouchable piece of code. Instead of spending time fixing bugs inside that copy of the code, the developers should try to fix the bugs in the original sources, so that everybody (including themselves) can make use of the improvement.

In the real world, one example of this is the FFmpeg source code. FFmpeg is imported by many different Free Software projects in the area of multimedia: xine, MPlayer, GStreamer. While it is a very wide common ground for all these projects, as well as for some others that don't import a copy of it, like VLC, some of the imports change the source code in more or less subtle ways. In the case of xine, the whole build system is replaced to integrate it with the automake-based build system used by the rest of the library. Further patching is done to the sources themselves so that they behave in a slightly different way than the original. The code rots quickly, and bugs that were already fixed in the in-development sources of FFmpeg still sprout in xine-lib.

Maintaining such an import is a difficult and boring task, to the point that the developers, in the past two years, have spent a lot of energy toward the goal of not using an internal copy of FFmpeg anymore. The result is that the difference between the original FFmpeg and the internal copy is now much smaller, mostly limited to the build system. Where the project once advised against using an external copy of FFmpeg, it now advises against using the internal one. For the next minor version of xine-lib, FFmpeg is being used pristine, entirely unpatched, and it will probably not even be bundled with the library in the near future.

Successful Forks

Of course it's not all bad. There are successful forks in Free Software, and many of them are now more famous than their parents. I've already named the GNU Compiler Collection, which is the GCC that almost all Free Software users have at hand at the moment. Most people use GCC version 3 and later, which started as a fork of the other GCC (the GNU C Compiler), version 2. The original development of GCC was, like many other GNU projects, very closed to the community.

As Eric S. Raymond defined it in his book The Cathedral and the Bazaar, it was a Cathedral-style development, the kind that often prepares the ground for forks, and this was no exception. Multiple forks of the GCC code were created. Their goals, while different, often didn't clash, and could easily have been worked on at the same time. Some of the forks were then merged into the EGCS project, which eventually replaced the original GCC.

Again citing GNU's Cathedral-style of development, it's difficult not to talk about GNU Emacs and its brother XEmacs. Created originally to support one particular product, the XEmacs project is nowadays a mostly standalone project. XEmacs is kept at an arm's length from GNU Emacs, mostly because of licensing and copyright assignment issues. Neither version can be considered a superset of the other because they both implement features in their own way.

Better is the state of Claws Mail, started as a different branch of Sylpheed, with the name Sylpheed Claws. Originally the intention was to develop new features that could one day find their way back to the original code. Claws Mail has since declared itself independent and is now a stand-alone project. In this case, the exchange of code between the two projects has basically halted, as the code bases have diverged so much that they retain very little in common.

In the case of the Ultima Online server emulators, forks became daily events, and cross-pollination had grown to the point where at least five projects were linked by family ties. The UOX3 source code has been forked, reused, imported and cut down so that it is present in WolfPack, LoneWolf, NoX-Wizard and Hypnos. Almost all of the UOX3 forks involved re-writing parts of the code, as it had stratified to the point of not being maintainable. The forks continued copying one from the other to make use of the best features available.

Forking vs. Branching

There are a few good reasons why you might want to detach, temporarily, from a given development track: development of experimental features, new interfaces, backend rewrites, or resurrection of a project whose original authors are unavailable. In most of these cases, forking is not the best solution but branching most likely is. Although the border between these two actions has been getting thinner thanks to distributed VCS, branching usually doesn't involve setting up a new web page for the project, changing its name or finding a new goal. And a branch is usually related, tightly or not, to the original project. Merges between the two code bases often happen at more or less regular intervals, and ideas and bug reports are shared.

Branches usually aim at being merged into the main development track, sooner for small, testing branches, or later for huge rewrites. They don't usually require dividing the effort, as fixes for problems affecting the main branch are propagated to the other branches when those merge the original code back in.

One common problem with developing through branches involved bad support in the Subversion version control system. In Subversion the branches are represented as a different path in the repository, with almost no help for branches in the merge operations. With a modern distributed VCS, branches are so cheap that any checkout is, from some points of view, a different branch, and the merge operations are one of the main focuses. Projects like the Linux kernel or xine-lib rely heavily on an above-average number of branches. These are often short-lived and used for testing purposes.

Looking to the Future

Forks will never end in Free Software as they are supported by one of the freedoms that make Free Software what we all want it to be. The future will, of course, bring new forks. Recently there has been a lot of talk about Funpidgin, a fork of the widespread Pidgin Instant Messaging client (formerly Gaim). Again it seems like it was the Cathedral-style development of the original code that motivated a fork that could give (some of) the users what they wanted.

And even though GNU Emacs opened its process quite a lot, its forks haven't stopped sprouting. This is despite the fact that Richard Stallman, original author and mastermind behind the GNU project, stepped down as maintainer, putting in place Stefan Monnier and Chong Yidong. Aquamacs Emacs is still diverging from the original GNU Emacs in order to support Apple's Mac OS X, while different versions are being developed to support the multiple user interfaces one can use on that operating system. Similarly, although the Windows port of Emacs is already pretty solid, there are extensions being written to make it easier for users to adapt it to the Microsoft environment.

Forks are usually the effect of a closed-circle development, a Cathedral, where some of the developers or users can't see their objective being fulfilled, with all their energy being poured in. So just look for the projects that don't seem to be getting much love from a community, and you might find a fork starting to sprout its first leaves.

Then there is the Poppler project, which merged together the modified versions of the XPDF code imported by projects like GNOME and KDE for their PDF viewers. Poppler is soon going to be a nearly omnipresent PDF rendering library on Free Software desktops and beyond. This summer's milestone KDE 4.1 release will include the new oKular document viewer, which will use Poppler for PDF rendering on the (stable) KDE users' desktops.

Conclusions

I'd suggest that anybody thinking about creating a fork should think twice. Forking is rarely a good choice. Better choices can be branching or, if you need just part of the code, working together as the Poppler developers did to split out and share the common parts.

When you want to make some changes to a software project, propose branching it, show the results to the original developers and discuss with them how to improve the code. Most of the time you'll find authors are open to the changes.

A fork is a grave matter. It might bring innovation to the Free Software community, but it could also separate developers that could otherwise work together, maybe in a better way. In this light, GitHub's one-click forking capability seems like a dangerous feature.

The ever-increasing ease of forking everything, from small projects to parts of distributions or even entire ones (think about Debian's repositories and Gentoo's overlays), is increasing the fragmentation of Free Software projects. Biodiversity in software can be a very good thing, just like in nature, but people should first try their best to work together, rather than one against the other.




some notes on EGCS

Posted May 15, 2008 4:20 UTC (Thu) by JoeBuck (guest, #2330) [Link]

The EGCS effort was successful in part because it was carefully designed to make a re-merge, or even a takeover of FSF GCC development, possible, as well as a very careful effort to keep people with completely incompatible desires happy. I have to admit that there was a considerable amount of spin involved: we had to simultaneously persuade people to believe completely opposite things about what EGCS was: a temporary experimental branch that would eventually get merged back, therefore no threat? Complete liberation from dealing with RMS ever again? Freedom for Cygnus management to control GCC? A hacker playground? An effort to put ESR's CatB book into practice? There were people who wanted to believe each of those things.

But the goal from a very early stage, before the public ever heard about egcs, was gcc3: not a fork, but a takeover, so that there would be only one GCC, based on EGCS. For this to be possible, we needed to require papers assigning all the code to the FSF, and we followed the GNU style rules. Even so, it took about a year of negotiation for the FSF to accept EGCS as the new GCC and give the EGCS team control of the compiler.

The Freedom of Fork

Posted May 15, 2008 4:55 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

with distributed VCS systems the difference between a branch and a fork boils down to intent
about as much as any technical matters.

the distributed VCS systems also make it much easier to merge branches and forks together
(with git as the extreme case where there is no difference, and no technical distinction
between the 'original' and the 'fork')

with this in mind the GitHub 'fork' button isn't nearly as bad as if the same thing was
being done with CVS or subversion.

The Freedom of Fork

Posted Aug 30, 2008 23:28 UTC (Sat) by Dieter_be (guest, #53677) [Link]

Exactly.
Technically, a fork on github is actually a git clone.
People doing 'forks' on github will in 99% of the cases try to get their code merged in the mainline (issuing a 'pull request' of the fork). The other 0.9% just wants his own codebase to play with and also does not intend a real fork in the classic meaning.
So this is something totally different than forks in the classic terminology.

Also, git makes it easy to import changes after branches have diverged a lot, so that should promote the 'giving back to the original project/other forks' part.

Besides, I think that just because the barrier is so low, people who want to fork in the classic sense but without thinking things through will abandon their fork pretty soon / won't achieve results that pull users away anyway. Not a big loss for the broader public there... A fork can only be successful and get much adoption if it's well thought out and in that case we don't have much to fear.
just my 2c.

Ghostscript

Posted May 15, 2008 7:08 UTC (Thu) by DeletedUser32991 ((unknown), #32991) [Link] (1 responses)

GPL Ghostscript never was a fork (just somewhat behind AFPL Ghostscript) and even more laudably the publisher has long since switched the licensing to GPL altogether.

Ghostscript

Posted May 15, 2008 12:31 UTC (Thu) by nix (subscriber, #2304) [Link]

Ghostscript has also merged the ESP Ghostscript changes, with the result that that fork, too,
is now dead.

The Freedom of Fork

Posted May 15, 2008 12:26 UTC (Thu) by nix (subscriber, #2304) [Link]

I'm not sure cdrkit really is a good example. Maintenance isn't exactly very active, one might
say. (Of course, that could be because it already works well enough for almost everyone.)

The Freedom of Fork

Posted May 18, 2008 15:17 UTC (Sun) by oak (guest, #2786) [Link] (2 responses)

This forgot an important fork: XFree vs. Xorg...

The Freedom of Fork

Posted May 22, 2008 4:39 UTC (Thu) by roelofs (guest, #2599) [Link]

A quick review of some of the history behind the current crop of BSDs might have been nice, too.

Greg

The Freedom of Fork

Posted May 25, 2008 1:27 UTC (Sun) by Flameeyes (guest, #51238) [Link]

I sincerely can't think of Xorg too much as a fork sincerely, mostly 
because I haven't seen XFree doing anything in a long time. I was looking 
at forks that worked somewhat in parallel for a while at least. It's a 
bit like Ethereal and Wireshark, while technically a fork, the latter is 
considered more like a continuation with a different name.

For what concerns BSDs... they'd be worth a long article on their own, 
their history is quite contorted. Other names are missing, sure, like 
Wine and its derivatives, or OpenOffice and NeoOffice (on OSX). I suppose 
it could be a good topic for a series, looking deeply at the history of a 
specific project and its forks/children.


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds