Time recently published an article entitled
Getting
rich off those who work for free which, among other things, talked
about free software this way:
Open-source, volunteer-created computer software like the Linux
operating system and the Firefox Web browser have also established
themselves as significant and lasting economic realities.
It is not uncommon to see Linux referred to as a volunteer-created system,
as opposed to the corporate-sponsored, proprietary alternatives. There has
been little research, however, into how much work on Linux is truly
"volunteer" - done on a hacker's spare, unpaid time. In general, the
assumption that Linux is created by volunteers is simply accepted.
Determining the real provenance of free software can be a daunting task.
There is a wealth of information available for those who look, however. In
an attempt to shine some light in this area, your editor hacked up some
scripts to do a lot of digging around in the kernel git repository. The
idea was that, by looking at who is putting changes into the kernel, we can
get a sense for where our source is coming from.
Who got patches into 2.6.20
This study looked at the stream of patches that changed the 2.6.19 kernel
into the current 2.6.20 release. There were, as it turns out 4983
non-merge changesets in this release, contributed by 741 different
developers. (Merge changesets mark where the contents of other
repositories were pulled into the mainline, but they do not carry any code
changes, so the analysis skipped them).
These patches added 286,439 lines of code and removed 159,812
others, for a total growth of 126,627 lines over the 2.6.20 development
cycle.
Your editor's scripts looked over every non-merge commit in 2.6.20. For
each, the developer listed as the "author" was given credit for the patch.
This approach is not entirely fair, since one developer will, in some
cases, be submitting code written by a group of people. In general,
though, there is no easy way of getting around this problem - the true
breakdown of authorship of a joint work simply is not available in the
mainline repository. Your editor believes that this inaccuracy affects the
accounting of a relatively small portion of the patches merged into the
mainline.
Beyond that, how one generates statistics from a patch stream is an
interesting question. How does one measure the productivity of
programmers? One possibility is to look at the number of changesets
merged. By that metric, this is the list of the most prolific contributors
to 2.6.20:
| Developers with the most changesets |
| Al Viro | 241 | 4.8% |
| Andrew Morton | 92 | 1.8% |
| Jiri Slaby | 92 | 1.8% |
| Adrian Bunk | 87 | 1.7% |
| Gerrit Renker | 79 | 1.6% |
| Josef Sipek | 79 | 1.6% |
| Avi Kivity | 68 | 1.4% |
| Tejun Heo | 67 | 1.3% |
| Patrick McHardy | 63 | 1.3% |
| Ralf Baechle | 61 | 1.2% |
| Randy Dunlap | 59 | 1.2% |
| Alan Cox | 58 | 1.2% |
| Mariusz Kozlowski | 57 | 1.1% |
| Andrew Victor | 53 | 1.1% |
| Paul Mundt | 52 | 1.0% |
| Stefan Richter | 49 | 1.0% |
| David S. Miller | 48 | 1.0% |
| Russell King | 44 | 0.9% |
| Benjamin Herrenschmidt | 44 | 0.9% |
| Akinobu Mita | 43 | 0.9% |
Looking at patch counts rewards developers who put in large numbers of
small patches. Al Viro's patches include a vast number of code annotations
(to enable better checking with sparse), include file fixups,
etc. Many of the changes are small - many do not affect the resulting
kernel executable at all - but there are a lot of them. Even so, as the
biggest contributor, Al generated less than 5% of the
total changesets added to the kernel. The top 20 contributors, all
together, generated 28% of the total changesets in 2.6.20.
One could make the argument that a better way to look at the problem is by
the number of lines affected by a patch. In this way, a contributor's
portion of the whole will not depend on whether it has been split into a
long series of small patches or not. On the other hand, simply renaming a
file can make it look like a developer has touched a large amount of code.
Be that as it may, by looking at lines changed (defined
as the greater of the number of lines added or removed by each individual
changeset), one gets a table like this:
| Developers with the most changed lines |
| Jeff Garzik | 20712 | 6.0% |
| Patrick McHardy | 15024 | 4.3% |
| Jiri Slaby | 13917 | 4.0% |
| Avi Kivity | 11726 | 3.4% |
| Andrew Victor | 9710 | 2.8% |
| Amit S. Kale | 9537 | 2.7% |
| Stephen Hemminger | 9120 | 2.6% |
| Geoff Levand | 8396 | 2.4% |
| Michael Chan | 8307 | 2.4% |
| Chris Zankel | 8099 | 2.3% |
| Mauro Carvalho Chehab | 7390 | 2.1% |
| Adrian Bunk | 6138 | 1.8% |
| Yoshinori Sato | 5232 | 1.5% |
| Al Viro | 4981 | 1.4% |
| Benjamin Herrenschmidt | 4588 | 1.3% |
| Thierry MERLE | 4549 | 1.3% |
| Dan Williams | 4516 | 1.3% |
| Jonathan Corbet | 3924 | 1.1% |
| Gerrit Renker | 3857 | 1.1% |
| Jiri Kosina | 3805 | 1.1% |
Jeff Garzik comes out on top of this particular measurement by virtue of
having deleted the long-unmaintained floppy tape subsystem. Patrick
McHardy's work includes a number of additions to the netfilter subsystem,
Jiri Slaby did a great deal of driver cleanup work, Avi Kivity was
the contributor of the KVM
virtualization code, and Andrew Victor contributed a number of
ARM-related patches and the Atmel AT91 i2c driver. (The contributions made
by other authors can be found by searching out their name in the 2.6.20 short-form changelog).
Most of the developers in the above list got there by adding code to the
kernel. It can be said, however, that the true heroes in the development
community are those who remove code and make the kernel smaller. The
developers who were best at removing more code than they added were:
| Developers with the most lines removed |
| Jeff Garzik | 19862 | 12.4% |
| Chris Zankel | 5608 | 3.5% |
| Adrian Bunk | 5528 | 3.5% |
| Arnd Bergmann | 2224 | 1.4% |
| Linus Torvalds | 1739 | 1.1% |
| Atsushi Nemoto | 1425 | 0.9% |
| Thierry MERLE | 911 | 0.6% |
| David Gibson | 878 | 0.5% |
| Dominik Brodowski | 528 | 0.3% |
| Stefan Richter | 509 | 0.3% |
Once again, Jeff Garzik's removal of ftape comes out on top, by far. Chris
Zankel cleaned up the Xtensa architecture, removing a number of files in
the process. Adrian Bunk worked on the ftape removal, got rid of the frame
diverter code, removed an old, broken block driver, and generally performed
cleanups all over the tree. Mr. Bunk is, in fact, the bane of old code;
over the last year (since 2.6.16) he has removed a full 127,000 lines from
the kernel source tree. Arnd Bergman got rid of a bunch of
syscall*() macros. Linus Torvalds removed the broken x86 stack
unwinder code.
Finally, one could look at a different measure entirely: the number of
patches signed off by each developer. A Signed-off-by: line is an
indication that the person involved believes that the code is suitable for
merging into the kernel; it implies that some degree of attention has been
paid to the patch. Authors sign off their code, as do the subsystem
maintainers who pass it up the chain. The top signers-off in 2.6.20 were:
| Developers with the most signoffs |
| Andrew Morton | 1422 | 13.7% |
| Linus Torvalds | 1366 | 13.2% |
| David S. Miller | 483 | 4.7% |
| Jeff Garzik | 331 | 3.2% |
| Greg Kroah-Hartman | 269 | 2.6% |
| Al Viro | 241 | 2.3% |
| Paul Mackerras | 232 | 2.2% |
| Andi Kleen | 177 | 1.7% |
| Mauro Carvalho Chehab | 170 | 1.6% |
| Russell King | 166 | 1.6% |
| Adrian Bunk | 120 | 1.2% |
| Arnaldo Carvalho de Melo | 119 | 1.1% |
| Ralf Baechle | 117 | 1.1% |
| James Bottomley | 109 | 1.1% |
| Patrick McHardy | 96 | 0.9% |
| Jiri Slaby | 94 | 0.9% |
| Avi Kivity | 87 | 0.8% |
| Josef Sipek | 79 | 0.8% |
| Paul Mundt | 78 | 0.8% |
| Gerrit Renker | 78 | 0.8% |
There were a total of 10,354 signoff lines in the 2.6.20 patch stream, so
each changeset, on average, was signed off just over two times. It is interesting that
Linus, who ultimately merges every patch, only signed off 13% of them. It
seems that most patches, these days, go directly into the mainline from
subsystem repositories without a signoff from Linus or Andrew. Most of the
other names on that list, with just a few exceptions, are the maintainers
of subsystem or architecture trees.
Who paid them
So now we have a sense for who got their fingers on the code which went
into 2.6.20. But one interesting question still has not been answered: to
what extent was that code contributed by volunteers (or "hobbyists")?
Finding an answer to that question is somewhat trickier than looking at who
wrote the patches, mostly because very few developers say "I wrote this on
behalf of my employer."
The approach taken by your editor was relatively simplistic, but, perhaps,
the best that is practical. Any patch whose author's given email address
indicates a corporate affiliation is assumed to have been developed by an
employee of that corporation. So any patch posted by somebody with an
ibm.com email address is accounted as having been done by an IBM
employee. Things are complicated by the fact that many people who work for
companies do not use corporate
addresses; it is not unheard-of for companies to have policies explicitly
prohibiting code contributions associated with their domains. Your editor
has coped with this problem by filling in the relevant developer's
affiliation whenever it is known to him; in some cases, the developer was
asked for this information.
This method has the effect of crediting all of an employee's work to
his or her employer. In many cases, the situation is probably more
complicated than that; one assumes, for example, that a certain kernel
hacker's employer has not directed him to hack on
Battle for Wesnoth. When looking only at kernel code, however,
crediting all work to the employer is probably relatively safe.
Using this approach, the top sources of changesets were:
| Top changeset contributors by employer |
| (Unknown) | 1244 | 25.0% |
| Red Hat | 636 | 12.8% |
| (None) | 383 | 7.7% |
| IBM | 368 | 7.4% |
| Novell | 295 | 5.9% |
| Linux Foundation | 261 | 5.2% |
| Intel | 178 | 3.6% |
| Oracle | 126 | 2.5% |
| Google | 97 | 1.9% |
| University of Aberdeen | 79 | 1.6% |
| HP | 78 | 1.6% |
| Qumranet | 71 | 1.4% |
| Nokia | 67 | 1.3% |
| SGI | 64 | 1.3% |
| Astaro | 63 | 1.3% |
| MIPS Technologies | 61 | 1.2% |
| SANPeople | 53 | 1.1% |
| Miracle Linux | 43 | 0.9% |
| MontaVista | 41 | 0.8% |
| Broadcom | 39 | 0.8% |
Looking instead at the number of lines of code changed, the results become:
| Top lines changed by employer |
| (Unknown) | 66154 | 19.0% |
| Red Hat | 44527 | 12.8% |
| (None) | 38099 | 11.0% |
| IBM | 25244 | 7.3% |
| Astaro | 15306 | 4.4% |
| Linux Foundation | 13638 | 3.9% |
| Qumranet | 12108 | 3.5% |
| Novell | 11930 | 3.4% |
| Intel | 11652 | 3.4% |
| SANPeople | 9888 | 2.8% |
| NetXen | 9607 | 2.8% |
| Sony | 8497 | 2.4% |
| Broadcom | 8349 | 2.4% |
| Tensilica | 8195 | 2.4% |
| Nokia | 5581 | 1.6% |
| MontaVista | 4394 | 1.3% |
| University of Aberdeen | 4324 | 1.2% |
| LWN.net | 3975 | 1.1% |
| Secretlab | 3370 | 1.0% |
| HP | 3211 | 0.9% |
[Note that these tables have been updated once since the article was
originally published; the curious can see what the original versions looked like.]
In these tables, the line marked "(Unknown)" is exactly that: patches for
which existence of a supporting employer could not be determined. The line
marked "(None)", instead, indicates the patches from developers
known to be working on their own time.
Either way, the results come out about the same: at least 65% of the code
which went into 2.6.20 was created by people working for companies. If the
entire "unknown" group turns out to be developers working on a volunteer
basis - an unlikely result - then just over 1/3 of the 2.6.20 patch stream
was written by volunteers. The real number will be lower, but it still
shows that a significant portion of the code we run is written by
developers who are donating their time.
One year's worth of changes
Looking at a single kernel release is instructive, but it can also be
deceptive. The relatively short release cycle used by the kernel project
makes it fairly easy for prolific developers to see few of their patches go
into a specific release. In an attempt to gain a longer-term perspective,
your editor forced his suffering system to crank through the entire history
from 2.6.16 (released almost exactly one year ago) to the present. Some
28,000 non-merge changesets have been added to the mainline (by 1,961
developers) over that time,
replacing 1.26 million lines of old code with 2.01 million lines
of new code - the kernel grew by 754,000 lines.
The developers who touched the most lines over that time were:
| Developers with the most changed lines |
| Adrian Bunk | 134021 | 5.3% |
| Jeff Garzik | 87847 | 3.5% |
| Andrew Vasquez | 75195 | 3.0% |
| Mauro Carvalho Chehab | 68568 | 2.7% |
| David Teigland | 46607 | 1.9% |
| Ralf Baechle | 38559 | 1.5% |
| David S. Miller | 35958 | 1.4% |
| Andrew Victor | 35594 | 1.4% |
| Bryan O'Sullivan | 33901 | 1.4% |
| Paul Mundt | 27041 | 1.1% |
| Dave Kleikamp | 26615 | 1.1% |
| Lennert Buytenhek | 25192 | 1.0% |
| Haavard Skinnemoen | 24372 | 1.0% |
| Ben Dooks | 23207 | 0.9% |
| Patrick McHardy | 23175 | 0.9% |
| Ingo Molnar | 22456 | 0.9% |
| James Bottomley | 22205 | 0.9% |
| David Howells | 19168 | 0.8% |
| Jiri Slaby | 18335 | 0.7% |
| Divy Le Ray | 17909 | 0.7% |
The results for employers were:
| Top lines changed by employer |
| (Unknown) | 740990 | 29.5% |
| Red Hat | 361539 | 14.4% |
| (None) | 239888 | 9.6% |
| IBM | 200473 | 8.0% |
| QLogic | 91834 | 3.7% |
| Novell | 91594 | 3.6% |
| Intel | 78041 | 3.1% |
| MIPS Technologies | 58857 | 2.3% |
| Nokia | 39676 | 1.6% |
| SANPeople | 36038 | 1.4% |
| SteelEye | 36021 | 1.4% |
| Freescale | 35034 | 1.4% |
| Linux Foundation | 34163 | 1.4% |
| MontaVista | 30211 | 1.2% |
| Simtec | 26166 | 1.0% |
| Atmel | 25975 | 1.0% |
| HP | 23714 | 0.9% |
| SGI | 22057 | 0.9% |
| Oracle | 21251 | 0.8% |
| Open Grid Computing | 20505 | 0.8% |
The end result of all this is that a number of the widely-expressed
opinions about kernel development turn out to be true. There really are
thousands of developers - at least, almost 2,000 who put in at least one
patch over the course of the last year. Linus Torvalds is directly
responsible for a very small portion of the code which makes it into the
kernel. Contemporary kernel development is spread out among a broad group
of people, most of whom are paid for the work they do. Overall, the
picture is of a broad-based and well-supported development community.
There are many other interesting things to be learned by looking at the
kernel's development history. Expect more articles along these lines as
your editor finds the time to improve his scripts.
(
Log in to post comments)