

Visualizing open source projects and communities

April 7, 2010

This article was contributed by Koen Vervloesem

Visualization is a critical tool for exploring and understanding large amounts of data. Thanks to the computing power of the 21st century, it has become possible to visualize ever-expanding amounts of data. Because the open source development model is massively decentralized and network-centric, it is by its nature a perfect domain for graph-based visualizations. Connections or dependencies between projects, communities, and code commits can be explored and displayed in many ways. These visualizations can give us a unique perspective on open source projects and communities, such as fundamental differences in their approach.

Your author has a longstanding interest in visualizations, especially of non-numerical information. The classic books about visualizing complex data are of course Edward Tufte's works, beginning with The Visual Display of Quantitative Information. Recently, your author has enjoyed reading more programming-oriented books like Ben Fry's Visualizing Data and Toby Segaran's and Jeff Hammerbacher's Beautiful Data. What's even better is that a lot of open source software exists to put this theory into practice. We'll look at a few of the most interesting open source visualization programs and their application to open source projects and communities.


[Apache code_swarm]

Michael Ogawa, a Ph.D. student in the Visualization & Interface Design Innovation group of UC Davis, has conducted some interesting research on software visualization. The purpose of this research is to help understand the relationship between communication among developers and the evolution of the source code. In 2007, Michael published a paper about Visualizing Social Interaction in Open Source Software Projects [PDF]. In 2008, he presented StarGate [PDF], a system that groups the developers of a software project visually into clusters corresponding to the areas of the file repository they work on the most. In both visualization methods, he used Apache and PostgreSQL as case studies. Interested readers should consult the papers for some illustrative insights into these projects.

Michael's most popular visualization method is code_swarm, which shows the history of commits in an open source project as a video. Both developers and files are represented as moving elements. When a developer commits a file, it lights up and flies towards that developer. Files are colored according to their purpose, such as whether they are source code or a document. If files or developers have not been active for a while, they fade away. The design of code_swarm is explained in the paper code_swarm: A Design Study in Organic Software Visualization [PDF], which shows some case studies of Python, Eclipse, Apache, and PostgreSQL. Videos generated by code_swarm for these projects are also available on the web site.

The code for code_swarm, written in Ben Fry's Java-based open source programming environment Processing, is available under the GPL v3. It supports various types of repositories: Subversion, CVS, Git, Mercurial, Perforce, VSS, Starteam, and Darcs. The wikiswarm add-on even allows visualizing Wikipedia page histories and user contributions. By downloading and executing the code, anyone can create their own software visualizations, and there's a mailing list for help. The project's wiki also has some documentation, such as a step-by-step guide on how to generate a video, a FAQ, and a gallery of third-party code_swarm videos.


[Python Gource]

At the end of 2009, New Zealand software developer Andrew Caudwell presented his software visualization project Gource (a play on Source and Gorse) on his computer graphics blog The Alpha Blenders. Gource takes the logs from a version control system of a software project and displays them as an animated tree with the root directory of the project at its center. Directories appear as branches with files as "leaves", represented by spheres that are colored according to their file extension. Developers currently contributing to the project can be seen floating near the files they are modifying. The whole visualization looks organic and is interactive, as the user can rotate the view and move the camera position.
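The tree structure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Gource's implementation: the log format, the names `LOG`, `build_tree`, and `extension_groups`, and the sample paths are all hypothetical.

```python
from collections import defaultdict
import os

# Hypothetical, simplified commit log: (timestamp, author, path) tuples.
# This is NOT the input format Gource itself reads.
LOG = [
    (1, "alice", "src/main.c"),
    (2, "bob", "src/util/str.c"),
    (3, "alice", "docs/README.txt"),
    (4, "bob", "src/main.c"),
]


def build_tree(log):
    """Return nested dicts: directories map to sub-dicts, files map to None."""
    root = {}
    for _when, _author, path in log:
        node = root
        parts = path.split("/")
        # Walk (and create) one branch per directory component.
        for directory in parts[:-1]:
            node = node.setdefault(directory, {})
        # The file itself is a leaf on that branch.
        node[parts[-1]] = None
    return root


def extension_groups(log):
    """Group file paths by extension, the property a renderer could map to a color."""
    groups = defaultdict(set)
    for _when, _author, path in log:
        ext = os.path.splitext(path)[1] or "(none)"
        groups[ext].add(path)
    return dict(groups)


tree = build_tree(LOG)
groups = extension_groups(LOG)
```

A renderer would then lay out `tree` radially from the root and pick one color per key in `groups`; the animation and layout are the hard part and are left out here.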

The code for Gource is available under the GPL v3. It's designed for use with Git, Mercurial, or Bazaar, but it also has scripts to support CVS and Subversion. It needs a 3D accelerated video card and uses OpenGL for rendering. The wiki has some documentation, such as how to show Gravatar images for developers or how to change the appearance. The wiki also explains how to produce a video and shows some example videos and screenshots. In January, Andrew showed some of his visualizations at

In the last few months, several enthusiasts have been experimenting with Gource. For example, Michael DeHaan used Gource to create a visualization of Red Hat's provisioning server Cobbler and he explained that it can be really useful to evaluate an open source project:

When evaluating OSS software for use in business, you always need to know if the community is solid and self sustaining. [Gource] allows you to watch a short video and find out. Coupled with looking through the mailing list archives, that's a pretty good check. It can also help identify interesting patterns of large scale refactoring, new development, or stagnation.

Michael's visualization inspired Daniel Berrange to do the same exercise for libvirt:

It is clear from the video just how much development of libvirt has been expanding over the past 4 years, particularly with the expansion to cover VirtualBox and VMWare ESX server as hypervisor targets.

Daniel also produced a visualization of libvirt using code_swarm, which makes it easy to compare the merits of both methods.


[CPAN Explorer]

A third, more research-oriented and more general graph visualization tool is Gephi, which is Java-based and is distributed under the GPL v3. Users of this tool call it "like PhotoShop for graphs", or should we say "like GIMP for graphs"? It's a very powerful and interactive tool for exploring, manipulating and visualizing graphs: users can manipulate the structure, shapes, colors, locations of the nodes, and so on, but they can also find the shortest path between nodes and compute graph metrics, find clusters, and conduct a lot of advanced graph analyses. Gephi was originally created in 2007 at WebAtlas, a French non-governmental organization involved in mapping the web and data mining, but is now developed by an international consortium of open source developers.

Gephi is not oriented specifically towards visualizing software projects: the wiki page about data sets that can be explored and visualized with Gephi gives examples such as the structure of the Internet, the topology of the Western States Power Grid of the United States, airlines, a network of disorders and disease genes linked by known disorder-gene associations, and a couple of social networks.

However, the social networks section shows some interesting data sets of open source projects, which have all been visualized in Gephi by Franck Cuny, a Perl hacker working at the French social media agency Linkfluence. He developed the CPAN Explorer web site, an interactive visualization to analyze relationships between developers and packages of CPAN (Comprehensive Perl Archive Network). The authors page shows relationships between developers inside CPAN: each developer is represented by a node with a size proportional to the number of modules the developer has released on CPAN. An edge between two developers is created when one developer uses a module from the other developer. The more uses of a developer's modules by other developers, the bigger the label.

One can deduce some interesting facts about developers from this graph: for example, Adam Kennedy gets a big node with a small label, because he has released a lot of modules on CPAN, but few of them are used by other CPAN developers. In contrast, Gisle Aas has a small node with a big label, because he has few modules, but some very popular ones like LWP and URI.
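The rules behind the authors graph (node size from released modules, edges and label size from usage by others) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind CPAN Explorer: the author names, module names, and the `author_graph` helper are hypothetical, and the "label size" here is simply a count of uses by other authors.

```python
from collections import Counter

# Hypothetical data: modules released by each author, and modules each author's code uses.
RELEASED = {
    "alice": ["A::One", "A::Two", "A::Three"],
    "bob": ["B::Web"],
    "carol": ["C::App"],
}
USES = {
    "alice": ["B::Web"],
    "bob": [],
    "carol": ["B::Web", "A::One"],
}


def author_graph(released, uses):
    """Return (node_size, label_size, edges) for the author graph."""
    # Map every module back to the author who released it.
    owner = {module: author for author, modules in released.items() for module in modules}
    # Node size: number of modules the author has released.
    node_size = {author: len(modules) for author, modules in released.items()}
    edges = set()
    used_by_others = Counter()
    for user, modules in uses.items():
        for module in modules:
            target = owner.get(module)
            if target is None or target == user:
                continue  # unknown module, or an author using their own code
            edges.add((user, target))
            used_by_others[target] += 1
    # Label size: how often the author's modules are used by other authors.
    label_size = {author: used_by_others[author] for author in released}
    return node_size, label_size, edges


node_size, label_size, edges = author_graph(RELEASED, USES)
```

In this toy data, "alice" ends up with a large node and a small label, while "bob" gets a small node and a large label, the same two patterns discussed above.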

The community page shows web sites that write about Perl. Each web site is represented by a node with a size dependent on the number of inbound links. A hyperlink between two web sites is represented by an edge. The official Perl community web sites get a purple node, bloggers a green node, open source web sites a red node, companies a black node, and CPAN author pages a blue node. The last visualization on CPAN Explorer shows the dependencies between CPAN module distributions. The CPAN Explorer developers offer static PDF and SVG versions of all these graphs and dynamic JavaScript visualizations, but also the original data sets in the GEXF file format to explore in Gephi.
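For readers who want to feed their own data to Gephi, a small graph can be written out as GEXF with nothing but the Python standard library. The sketch below is an assumption-laden minimum: the namespace and version string are those of the GEXF 1.2 draft, which may differ from the version used for the CPAN Explorer data sets, and the `to_gexf` helper and the sample nodes are hypothetical. The GEXF specification remains the authority on the format.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the GEXF 1.2 draft schema.
GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(nodes, edges):
    """Serialize nodes ({id: label}) and edges ([(source, target, weight)]) as GEXF text."""
    ET.register_namespace("", GEXF_NS)
    gexf = ET.Element(f"{{{GEXF_NS}}}gexf", version="1.2")
    graph = ET.SubElement(gexf, f"{{{GEXF_NS}}}graph", defaultedgetype="directed")

    nodes_el = ET.SubElement(graph, f"{{{GEXF_NS}}}nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, f"{{{GEXF_NS}}}node", id=str(node_id), label=label)

    edges_el = ET.SubElement(graph, f"{{{GEXF_NS}}}edges")
    for index, (source, target, weight) in enumerate(edges):
        ET.SubElement(
            edges_el,
            f"{{{GEXF_NS}}}edge",
            id=str(index),
            source=str(source),
            target=str(target),
            weight=str(weight),
        )
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical two-node graph with one weighted edge.
document = to_gexf({"a": "Alice", "b": "Bob"}, [("a", "b", 2)])
```

Writing `document` to a file with a `.gexf` extension should give something Gephi can open, though visual attributes such as color and position are not covered here.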

Last month, Franck introduced a new project, Github Explorer. He explained that this was a very natural choice:

I wanted to do something similar again, but not with the same data. So I took a look at what could be a good subject. One of the things that we saw from the map of the websites is the importance github is gaining inside the Perl community. Github provides a really good API, so I started to play with it.

This time, Franck didn't aim for the Perl community only, but the whole community of users of Github, a popular web-based hosting service for projects that use the Git revision control system. He warns that Github doesn't represent the whole open source community and that he has collected only a selection of all user profiles, but nevertheless it gives us a good picture. Each profile is represented by a node, and a link between two profiles is represented by an edge. The weight of the edge is incremented each time the person forks code from the target profile.
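The edge-weight rule is easy to express in code. The following sketch assumes the fork events have already been collected as (forker, owner) pairs; the sample data and the `fork_edges` helper are hypothetical and do not reflect the Github API or Franck's actual scripts.

```python
from collections import Counter

# Hypothetical fork events: (profile that forked, profile that owns the forked code).
FORKS = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
]


def fork_edges(forks):
    """Return a Counter mapping directed (forker, owner) edges to their weight."""
    weights = Counter()
    for forker, owner in forks:
        if forker == owner:
            continue  # ignore self-forks, which would only add loops
        # Each fork from the same target profile strengthens the edge by one.
        weights[(forker, owner)] += 1
    return weights


weights = fork_edges(FORKS)
```

Edges are kept directed here, so a fork from "bob" to "alice" does not add weight to the edge from "alice" to "bob"; whether the original data treats them that way is not stated.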

[Github Explorer PHP]

On his blog, Franck shows some Github visualizations he made with Gephi, colored by country and split according to programming language. He has some thought-provoking analyses of some of the languages. For example, the Perl community is clearly split between the 'west' and Japan. In the Python community, there is clearly one main project, the web framework Django. PHP is the only community on Github where the visualization shows clusters of people working together on a specific project. The Ruby graph looks like a big ball of yarn with a couple of isolated countries. In other visualizations, Franck split the graphs by country. He offers the data he gathered for use in Gephi, he has published all the graphs on Flickr, and he will soon offer printed A2 and A1 posters for sale.

Towards a better understanding of open source communities

Many of these visualizations are beautiful, but that's not the point. To paraphrase Richard Hamming's dictum "The goal of computation is not numbers, but insight", your author would say: "The goal of visualization is not beauty, but understanding." Visualization tools can help us understand the internals and the dynamics of open source projects and communities. While code_swarm and Gource can show users a lot about patterns and evolution in the development of a specific project, including how the developer community works together, CPAN Explorer and Github Explorer are about visualizing global connections between many open source projects, which is also an important factor in open source communities. Now we just have to wait for some creative minds to visualize the SourceForge or Launchpad communities.

Comments (1 posted)

Brief items

Quotes of the week

Q: O Great Rabbi, Perl has so many precedence laws I feel I shall never learn them all. Which is the most important of these commandments?

A: As it was in the beginning, is now, and ever shall be, the First Commandment is the Law of Algebraic Precedence:


The Second Commandment is to think of those who come after you, most preferably before they do so:


Follow these two Commandments and all the days of your life will be blessed, for your code shall be ever right[eous] and all shall love you for it.

-- Tom Christiansen

Since Emacs is just an editor, not a god, it cannot do miracles.
--Richard Stallman

Comments (2 posted)

Firefox 3.6.3 security update now available

Mozilla has released Firefox 3.6.3 to address a critical security issue that could allow remote code execution.

Full Story (comments: 3)

Grease 0.2 Released

Grease is a Python-based 2D game engine and development framework, focused on quick development, good performance, and fun. The project's documentation includes a detailed tutorial on the creation of an asteroids-style game. "Grease does not attempt to provide one-size-fits-all solutions. Instead it provides pluggable components and systems than can be configured, adapted and extended to fits the particular needs at hand."

Full Story (comments: none)

Notmuch 0.1 released

Version 0.1 of the "Notmuch" email client (recently reviewed on LWN) has been released. "In trying to get notmuch to grow up a little bit, I've just added a version number (0.1 initially) and have started doing releases." More informative release notes are promised for the future.

Full Story (comments: 7)

StretchPlayer 0.500

The initial release of StretchPlayer is available. StretchPlayer is an audio file player with time-stretching and pitch-shifting features. The intended audience would appear to be musicians who want to slow down a song to learn how to play with it. More information can be found on the project's home page.

Full Story (comments: none)

A proposed Subversion vision and roadmap

A group of Subversion developers recently met in New York in an attempt to come up with a plan for the future development of this source code management system; a summary of that meeting has now been posted. "Subversion has no future as a DVCS tool. Let's just get that out there. At least two very successful such tools exist already, and to squeeze another horse into that race would be a poor investment of energy and talent. What's more, huge classes of users remain categorically opposed to the very tenets on which the DVCS systems are based. They need centralization. They need control. They need meaningful path-based authorization. They need simplicity. In short, they desperately need Subversion. It's this class of user -- the corporate developer -- that stands to benefit hugely from what Subversion brings to the party." Read the whole thing for details on how they plan to meet that developer's needs.

Full Story (comments: 227)

UFRaw 0.17 released

UFRaw is a utility for the processing of raw images from digital cameras. The biggest addition in the 0.17 release would appear to be the incorporation of the lensfun library, allowing UFRaw to correct for lens distortion using a database of hundreds of lenses. Also in this release are a new despeckling algorithm, hot pixel elimination by default, better zoom support, and more.

Full Story (comments: none)

X server 1.9 release thoughts

Fresh from the X.Org server 1.8 release, Keith Packard is pondering making some changes for the next time around. At the top of his list is shortening the release cycle to something closer to three months as a way of getting new hardware support to users more quickly. That proposal is not universally loved, though, so it's not clear if it will be adopted or not. He is proposing that the 1.9 release happen in late August. "I don't think there are any major changes planned for this release, so this shorter merge window seems like it should be sufficient. Nor do I necessarily think that this would also mean that the release date should be moved in; having the X server ready *before* the release seems like a good idea to me."

Full Story (comments: none)

Newsletters and articles

Development newsletters from the last week

Comments (none posted)

Georges Auberger: Songbird Singing A New Tune

Georges Auberger reports that the Songbird media player is dropping Linux support. "After careful consideration, we've come to the painful conclusion that we should discontinue support for the Linux version of Songbird. Some of you may wonder how a company with deep roots in Open Source could drop Linux and we want you to know it isn't without heartache. We have a small engineering team here at Songbird, and, more than ever, must stay very focused on a narrow set of priorities. Trying to deliver a raft of new features around all media types, and across a growing list of devices, we had to make some tough choices." An untested and unsupported version of Songbird for Linux will still be available for developers.

Comments (9 posted)

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds