
Who made Gentoo Linux, and when? A commit analysis

October 10, 2007

This article was contributed by Donnie Berkholz

Since LWN has published statistics on who wrote the Linux kernel, I thought readers might also be interested in who's writing other major open-source projects. I recently obtained the entire CVS repository history for Gentoo Linux, courtesy of Robin Johnson <robbat2 -AT- gentoo -dot- org>. Some of the code has recently moved to Subversion or Git, so these numbers may not be 100% accurate, but the techniques used to analyze commits should be generally useful for understanding the progress of, and the contributors to, any project.

First, I wanted to understand the developer community. How much experience do our developers have with Gentoo, and how has that changed over time? To do this, I created a number called "lifetime": the length of time between a developer's first and last commits. Then I scanned across each month, computing the average developer lifetime. For developers still active in the month being scanned, I used that month in place of their last commit, so the result reflects each developer's experience at that time, not the developer's experience today.
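
The lifetime scan described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the article: it assumes the CVS log has already been reduced to (developer, date) pairs, and the function names `first_last_commits` and `mean_experience` are hypothetical.

```python
from datetime import date


def first_last_commits(commits):
    """Map each developer to the dates of their first and last commits."""
    spans = {}
    for dev, day in commits:
        first, last = spans.get(dev, (day, day))
        spans[dev] = (min(first, day), max(last, day))
    return spans


def mean_experience(spans, month_start):
    """Average experience in days, as of month_start, over the developers
    whose commit lifetime covers that date (assumed meaning of "active")."""
    ages = [
        (month_start - first).days
        for first, last in spans.values()
        if first <= month_start <= last
    ]
    return sum(ages) / len(ages) if ages else 0.0


# Made-up sample data, not taken from the Gentoo repository.
commits = [
    ("alice", date(2003, 1, 10)),
    ("alice", date(2005, 6, 1)),
    ("bob", date(2004, 1, 1)),
    ("bob", date(2004, 12, 31)),
]
spans = first_last_commits(commits)
experience = mean_experience(spans, date(2004, 7, 1))
print(experience)
```

Calling `mean_experience` once per month and plotting the results would give a curve of the kind shown in the graph below.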

Developer lifetimes

What you can see is that the lifetimes go up roughly as a function of time since CVS history begins. This shows that the "average Gentoo developer" joins and stays involved for more than a year. Over a span of 3 years, the average lifetime increases from 1 year to 2 years.

Another way to look at this is to ask how many active and retired developers there are today as a function of when they gained commit access. The majority of active developers joined in 2005 and 2006, while most retired developers joined in 2003 and 2004. This again shows that the average lifetime is around 2 years.
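
A rough sketch of this active-versus-retired breakdown is given below. The article does not say how "active" was defined, so the 180-day idle cutoff here is purely an assumption, as are the function name `cohorts_by_join_year` and the sample data.

```python
from collections import Counter
from datetime import date, timedelta


def cohorts_by_join_year(spans, today, idle_days=180):
    """Count active and retired developers by the year of their first commit.

    ASSUMPTION: a developer is "active" if their last commit falls within
    idle_days of today. The article does not state its actual rule.
    """
    cutoff = today - timedelta(days=idle_days)
    active, retired = Counter(), Counter()
    for first, last in spans.values():
        bucket = active if last >= cutoff else retired
        bucket[first.year] += 1
    return dict(active), dict(retired)


# Made-up (first commit, last commit) pairs, for illustration only.
spans = {
    "alice": (date(2003, 3, 1), date(2004, 8, 1)),
    "bob": (date(2005, 5, 1), date(2007, 9, 10)),
    "carol": (date(2005, 7, 1), date(2006, 1, 1)),
}
active, retired = cohorts_by_join_year(spans, today=date(2007, 9, 22))
print(active, retired)
```

The choice of cutoff matters: a short one will label developers who commit infrequently as retired, which would inflate the retired counts for recent years.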

Active developers as a function of experience

The developer count at any given time is also of interest. I found this by scanning across months again, counting the developers whose commit lifetimes include each month.
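
One way to do this count is sketched below, assuming each developer has already been reduced to a (first commit, last commit) pair. The names `month_index` and `active_per_month` are illustrative, and the sample data is made up.

```python
from datetime import date


def month_index(day):
    """Number the months consecutively so that ranges are easy to walk."""
    return day.year * 12 + (day.month - 1)


def active_per_month(spans):
    """Return {(year, month): count} of developers whose commit lifetime
    (first commit through last commit) includes that month."""
    counts = {}
    for first, last in spans.values():
        for idx in range(month_index(first), month_index(last) + 1):
            key = (idx // 12, idx % 12 + 1)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up sample data, for illustration only.
spans = {
    "alice": (date(2003, 11, 5), date(2004, 2, 20)),
    "bob": (date(2004, 1, 15), date(2004, 1, 30)),
}
counts = active_per_month(spans)
for month in sorted(counts):
    print(month, counts[month])
```

Note that this counts a developer as present in every month between the first and last commit, even in months with no commits at all, which matches the "lifetime" definition used above.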

Total committers over time

The most interesting part is a sharp decline starting in early 2006. I wanted to attribute this in part to the addition of Subversion, which happened right around that time, but that would only account for it if the developers committing to Subversion no longer committed to CVS. That certainly isn't the case for more than 100 people, since the main package tree remains in CVS.

Instead, I now attribute this drop to Gentoo's developer population returning toward an equilibrium after an explosive, uncontrolled growth. The Gentoo structure and governance could not scale quickly enough to deal with all the new developers, but it took some time to normalize and continues to do so.

Now that we've learned something about our developers, how about our code? The next three graphs show commits per month to each CVS module. The "gentoo-x86" module contains all of the ebuilds (the packages). There's nothing particularly unusual about it, except for a huge peak in early 2006, which I suspect came from someone accidentally branching the entire repository. Interestingly, commits haven't declined as much as you might expect, given that the number of developers dropped by more than a third. Apparently, the developers who quit weren't the ones actively committing. The "gentoo" module contains the website files, some projects such as the installer and the Catalyst LiveCD creator, and patchsets for more complex packages. The website is fairly stable at this point, and many of the projects in this repository have reached maturity, so development has slowed down. The "gentoo-src" module contains a number of projects as well, but the huge drop near the beginning of 2006 indicates a move of active development to Subversion.

Commits to the gentoo-x86 module

Commits to the gentoo module

Commits to the gentoo-src module

And finally, let's tie the developers and the code together with a histogram. This shows the number of commits each developer has made, with a bin size of 100. You can see the incredibly long tail of the most active committers, with most developers under 20,000 (note the scale) but the top developer at 120,000 commits.
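
The binning behind such a histogram could look roughly like this. It is a sketch under the assumption that per-developer commit totals are already available; `commit_histogram` is a hypothetical name and the counts are invented.

```python
from collections import Counter


def commit_histogram(commits_per_dev, bin_size=100):
    """Group developers into bins of bin_size commits.

    Keys are the lower edge of each bin (0 means 0-99 commits, 100 means
    100-199, and so on); values are the number of developers in that bin.
    """
    bins = Counter(
        (n // bin_size) * bin_size for n in commits_per_dev.values()
    )
    return dict(sorted(bins.items()))


# Invented commit totals, for illustration only.
commits_per_dev = {"alice": 42, "bob": 99, "carol": 100, "dave": 2350}
hist = commit_histogram(commits_per_dev)
for lower_edge, developers in hist.items():
    print(lower_edge, developers)
```

Empty bins are simply absent from the result, so a plotting step would need to treat missing keys as zero.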

Histogram of commits per developer

Now let's take a closer look at the long tail of the developers with the largest commit counts. The tables show any developer with at least 1% of the total commits.

All-time commits
Developer Percentage
Mike Frysinger 6.08
Chris White 4.72
Aron Griffis 4.34
Diego Pettenò 3.08
Robin H. Johnson 1.98
Michael Cummings 1.95
Michael Sterrett 1.80
Gustavo Zacarias 1.71
Jeremy Huddleston 1.64
Dan Armak 1.63
Seemant Kulleen 1.58
Markus Rothe 1.58
Daniel Robbins 1.54
Bryan Østergaard 1.47
Chris Gianelloni 1.28
Donnie Berkholz 1.15
Martin Holzer 1.03
Mamoru Komachi 1.01
Total 39.57

About 40% of the all-time commits to Gentoo come from just 18 developers. Unfortunately, I only had access to the number of commits, not their size, so I couldn't rank developers by changed lines of code. One thing to be wary of is very small commits, such as those indicating that a package works on a given architecture. But this list is not dominated by architecture developers.
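
A table like the one above can be produced with a few lines of code. The sketch below assumes one author name per commit as input; the function name `top_committers` and the sample data are hypothetical.

```python
from collections import Counter


def top_committers(authors, threshold_pct=1.0):
    """Return (developer, percentage of all commits) pairs, largest first,
    keeping only developers at or above threshold_pct."""
    counts = Counter(authors)
    total = sum(counts.values())
    rows = [(dev, 100.0 * n / total) for dev, n in counts.most_common()]
    return [(dev, pct) for dev, pct in rows if pct >= threshold_pct]


# Invented data: one list entry per commit.
authors = ["alice"] * 150 + ["bob"] * 49 + ["carol"] * 1
rows = top_committers(authors)
for dev, pct in rows:
    print(f"{dev} {pct:.2f}")
print(f"Total {sum(pct for _, pct in rows):.2f}")
```

Restricting the input to commits from a single year before calling the function would give a per-year table in the same format.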

2007 commits
Developer Percentage
Raúl Porcel 6.60
Diego Pettenò 4.50
Mike Frysinger 3.91
Michael Sterrett 3.57
Piotr Jaroszynski 3.04
Christian Faulhammer 3.04
Gustavo Zacarias 2.97
Michael Cummings 2.62
Markus Rothe 2.52
Jeroen Roovers 2.25
Samuli Suominen 2.18
Markus Ullmann 1.98
Tobias Scherbaum 1.75
Petteri Räty 1.66
Chris Gianelloni 1.62
Steve Dibb 1.48
Andrej Kacian 1.45
Christian Heim 1.40
Marius Mauch 1.36
Christoph Mende 1.33
Bryan Østergaard 1.21
Donnie Berkholz 1.10
Gysbert Wassenaar 1.06
Roy Marples 1.03
Stefan Schweizer 1.03
Joseph Jezak 1.02
Total 57.68

In 2007 so far, 26 developers accounted for about 58% of commits. Unlike the all-time list, a significant fraction of these developers are architecture developers, including the top committer.

This analysis was mostly automated, using a combination of awk, bash shell, Python and gnuplot. The scripts are available upon request to the author <dberkholz -AT- gentoo -dot- org>.



Who made Gentoo Linux, and when? A commit analysis

Posted Oct 11, 2007 13:38 UTC (Thu) by ekj (guest, #1524) [Link]

I now attribute this drop to Gentoo's developer population returning toward an equilibrium after an explosive, uncontrolled growth. The Gentoo structure and governance could not scale quickly enough to deal with all the new developers, but it took some time to normalize and continues to do so.

Gentoo's popularity seems to be following a similar curve; for example, the search volume on Google for "gentoo" peaks around the same time as your curve for active developers.

My guess is that Gentoo -was- at the time seen as a new and interesting distribution, so many new developers arrived. That is no longer the case. Today Gentoo is, essentially, a niche distribution. Ubuntu and family have taken over as the biggest "up and coming".

Who made Gentoo Linux, and when? A commit analysis

Posted Oct 11, 2007 14:24 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (3 responses)

“I now attribute this drop to Gentoo's developer population returning toward an equilibrium after an explosive, uncontrolled growth. The Gentoo structure and governance could not scale quickly enough to deal with all the new developers, but it took some time to normalize and continues to do so.”

Um, maybe I'm not being very bright, but isn't the most likely explanation for this part of the chart simply that you don't have a list of future commits yet to happen?

The reason I thought this was because of another chart with the same characteristic shape I've seen. The Potaroo automated address exhaustion site includes a chart of IPv4 address allocation vs usage against time, and whenever you look at it, there appears to be a very low ratio of usage for the last few months, which would suggest a disaster in Internet governance. But in reality if you look at the same period on the chart in a year's time the anomaly vanishes, because by then those allocated addresses are in use, it just takes time after they're allocated for that to happen.

I suspect that the same thing is wrong with your chart. Some of the people who haven't committed in July, August or September will pop up in November and thus retrospectively make the chart's "lifetime" indication wrong.

Who made Gentoo Linux, and when? A commit analysis

Posted Oct 11, 2007 15:56 UTC (Thu) by incase (guest, #37115) [Link] (1 responses)

Your comment would make sense if only applied to the developer lifetime charts. But if you take a look at the total number of commits, you also see a noticeable drop in the number of commits. Certainly not as noticeable as the developer lifetime charts seem to indicate, but still it is there.

cu, Sven

Who made Gentoo Linux, and when? A commit analysis

Posted Oct 11, 2007 17:45 UTC (Thu) by dberkholz (guest, #23346) [Link]

There will be a little weirdness on the very last data point, because it's Sept. 1-22 rather than the full month.

Who made Gentoo Linux, and when? A commit analysis

Posted Oct 11, 2007 23:00 UTC (Thu) by dberkholz (guest, #23346) [Link]

That's an intriguing idea. Any suggestions for ways to figure out whether this is happening, besides waiting a year?

Who made Gentoo Linux, and when? A commit analysis

Posted Oct 13, 2007 18:05 UTC (Sat) by eris23 (guest, #3632) [Link]

There's also the issue of Gentoo metadistributions. I, for instance, use SabayonLinux. Others can be found at http://en.wikipedia.org/wiki/List_of_Linux_distributions#.... Plus the various Gentoo overlays. To check the health of Gentoo, one should check the commits to all of these.

Please use "long tail" correctly!

Posted Oct 14, 2007 17:31 UTC (Sun) by dthurston (guest, #4603) [Link]

You have the notion of "long tail" exactly backwards. A long tail in probability means that unlikely events have a large total probability; in your context, a "long tail" would mean that the developers with very few commits nevertheless contributed a substantial portion of the project. You haven't given us enough data to see if it happens, but it's certainly not the term you want to use.


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds