Visualization is a critical tool for exploring and understanding large
amounts of data. Thanks to the computing power of the 21st century, it has
become possible to visualize ever-expanding amounts of data. Because the
open source development model is massively decentralized and network-centric, it is by its nature the perfect domain for graph-based visualizations. Connections or dependencies between projects, communities, and code commits can be explored and displayed in many ways. These visualizations can give us a unique perspective on open source projects and communities, for example by revealing fundamental differences in their approach.
Your author has a longstanding interest in visualizations, especially of
non-numerical information. The classic books about visualizing complex data
are of course Edward Tufte's works, beginning with The Visual Display of
Quantitative Information. Recently, your author has enjoyed reading more
programming-oriented books like Ben Fry's Visualizing Data and
Toby Segaran's and Jeff Hammerbacher's Beautiful Data. What's
even better is that a lot of open source software exists to put this theory
into practice. We'll look at a few of the most interesting open source
visualization programs and their application to open source projects and communities.
Michael Ogawa, a Ph.D. student in the Visualization & Interface Design Innovation group of UC Davis, conducted some interesting research about software visualization. The purpose of this research is to help understand the relationship between developer communication and the evolution of the source code. In 2007, Michael published a paper titled "Visualizing Social Interaction in Open Source Software Projects". In 2008, he presented StarGate, a system that visually groups the developers of a software project into clusters corresponding to the areas of the file repository they work on the most. For both visualization methods, he used Apache and PostgreSQL as case studies. Interested readers should consult the papers for some illustrative insights into these projects.
Michael's most popular visualization method is code_swarm, which shows the history of commits in an open source project as a video. Both developers and files are represented as moving elements. When a developer commits a file, it lights up and flies towards that developer. Files are colored according to their purpose, such as whether they are source code or a document. If files or developers have not been active for a while, they fade away. The design of code_swarm is explained in the paper "code_swarm: A Design Study in Organic Software Visualization", which presents case studies of Python, Eclipse, Apache, and PostgreSQL. Videos generated by code_swarm for these projects are also available on the web site.
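The light-up-and-fade behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code_swarm's actual implementation: the `Element` class, the linear fade, the `FADE_AFTER` constant, and the sample commits are all hypothetical.

```python
from dataclasses import dataclass

FADE_AFTER = 3  # hypothetical: idle time steps before an element has faded out


@dataclass
class Element:
    """A developer or a file, remembered by the last step it was active."""
    name: str
    last_active: int

    def brightness(self, now: int) -> float:
        """1.0 right after activity, falling linearly to 0.0 after FADE_AFTER steps."""
        idle = now - self.last_active
        return max(0.0, 1.0 - idle / FADE_AFTER)


def replay(commits):
    """Replay (time_step, developer, file_path) tuples sorted by time step."""
    developers, files = {}, {}
    for step, dev, path in commits:
        developers[dev] = Element(dev, step)  # the developer is active again
        files[path] = Element(path, step)     # the committed file "lights up"
    return developers, files


# Hypothetical commit history
commits = [(0, "alice", "src/main.c"), (1, "bob", "README"), (4, "alice", "src/util.c")]
developers, files = replay(commits)

now = 4
visible = sorted(path for path, f in files.items() if f.brightness(now) > 0)
```

With this sample history, only the most recently committed file is still visible at step 4; the older files and the inactive developer have faded to zero brightness.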
The code for code_swarm, written in Ben Fry's Java-based open source programming environment Processing, is available under the GPL v3. It supports various types of repositories: Subversion, CVS, Git, Mercurial, Perforce, VSS, Starteam, and Darcs. The wikiswarm add-on even allows visualizing Wikipedia page histories and user contributions. By downloading and executing the code, anyone can create their own software visualizations, and there's a mailing list for help. The project's wiki also has some documentation, such as a step-by-step guide on how to generate a video, a FAQ, and a gallery of third-party code_swarm videos.
At the end of 2009, New Zealand software developer Andrew Caudwell presented his software visualization project Gource (a play on Source and Gorse) on his computer graphics blog The Alpha Blenders. Gource takes the logs from the version control system of a software project and displays them as an animated tree with the root directory of the project at its center. Directories appear as branches with files as "leaves", represented by spheres that are colored according to their file extension. Developers currently contributing to the project can be seen floating near the files they are modifying. The whole visualization looks organic and is interactive, as the user can rotate the view and move the camera position.
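To make the tree structure concrete, here is a minimal Python sketch that turns a list of file paths into nested directories with colored leaves. It is an illustration only and does not reproduce Gource's code: the palette, the `build_tree` and `color_for` helpers, and the sample paths are assumptions.

```python
import posixpath

# Hypothetical palette; Gource's real color assignment is not reproduced here.
PALETTE = ["red", "green", "blue", "orange", "purple"]


def color_for(filename, assigned):
    """Give every file extension its own color, reusing it for later files."""
    ext = posixpath.splitext(filename)[1]
    if ext not in assigned:
        assigned[ext] = PALETTE[len(assigned) % len(PALETTE)]
    return assigned[ext]


def build_tree(paths):
    """Nested dicts are directories (branches); string values are leaf colors."""
    root, assigned = {}, {}
    for path in paths:
        *dirs, filename = path.split("/")
        node = root
        for directory in dirs:
            node = node.setdefault(directory, {})  # descend, creating branches
        node[filename] = color_for(filename, assigned)
    return root


# Hypothetical file paths, as they might appear in a version control log
paths = ["README", "src/main.c", "src/util.c", "src/lib/io.h"]
tree = build_tree(paths)
```

Files with the same extension end up with the same color, and every directory on a path becomes a branch, which is the structure the animation then lays out around the root.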
The code for Gource is available under the GPL v3. It's designed for use with Git, Mercurial, or Bazaar, but it also has scripts to support CVS and Subversion. It needs a 3D accelerated video card and uses OpenGL for rendering. The wiki has some documentation, such as how to show Gravatar images for developers or how to change the appearance. The wiki also explains how to produce a video and shows some example videos and screenshots. In January, Andrew showed some of his visualizations at linux.conf.au.
In the last few months, several enthusiasts have been experimenting with Gource. For example, Michael DeHaan used Gource to create a visualization of Red Hat's provisioning server Cobbler, and he explained that it can be really useful for evaluating an open source project:
When evaluating OSS software for use in business, you always need to know if the community is solid and self sustaining. [Gource] allows you to watch a short video and find out. Coupled with looking through the mailing list archives, that's a pretty good check. It can also help identify interesting patterns of large scale refactoring, new development, or stagnation.
Michael's visualization inspired Daniel Berrange to do the same exercise for libvirt:
It is clear from the video just how much development of libvirt has been expanding over the past 4 years, particularly with the expansion to cover VirtualBox and VMWare ESX server as hypervisor targets.
Daniel also produced a visualization of libvirt using code_swarm, which makes it easy to compare the merits of both methods.
A third, more research-oriented and more general graph visualization tool is Gephi, which is Java-based and is distributed under the GPL v3. Users of this tool call it "like Photoshop for graphs", or should we say "like GIMP for graphs"? It's a very powerful and interactive tool for exploring, manipulating, and visualizing graphs. Users can change the structure of a graph and the shapes, colors, and locations of its nodes, but they can also find the shortest path between nodes, compute graph metrics, find clusters, and conduct many other advanced graph analyses. Gephi was originally created in 2007 at WebAtlas, a French non-governmental organization involved in mapping the web and data mining, but is now developed by an international consortium of open source developers.
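Two of the analyses mentioned above, shortest paths and a simple graph metric, can be illustrated in plain Python. This sketch is independent of Gephi and makes no claim about how Gephi computes them; it uses a breadth-first search on a small, hypothetical undirected graph, and the degree of a node as the metric.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical undirected graph as an adjacency list
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

degree = {node: len(neighbors) for node, neighbors in graph.items()}
path = shortest_path(graph, "a", "e")
```

Breadth-first search only gives shortest paths when all edges count equally; for weighted edges a different algorithm, such as Dijkstra's, would be needed.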
Gephi is not oriented specifically towards visualizing software projects: the wiki page about data sets that can be explored and visualized with Gephi gives examples such as the structure of the Internet, the topology of the Western States Power Grid of the United States, airlines, a network of disorders and disease genes linked by known disorder-gene associations, and a couple of social networks.
However, the social networks section shows some interesting data sets of
open source projects, which have all been visualized in Gephi
by Franck Cuny, a Perl hacker working at the French social media agency Linkfluence. He developed the CPAN Explorer web site, an interactive visualization to analyze relationships between developers and packages of CPAN (Comprehensive Perl Archive Network). The authors page shows relationships between developers inside CPAN: each developer is represented by a node with a size proportional to the number of modules the developer has released on CPAN. An edge between two developers is created when one developer uses a module from the other developer. The more uses of a developer's modules by other developers, the bigger the label.
One can deduce some interesting facts about developers from this graph: for example, Adam Kennedy gets a big node with a small label, because he has released a lot of modules on CPAN, but few of them are used by other CPAN developers. In contrast, Gisle Aas has a small node with a big label, because he has few modules, but some very popular ones like LWP and URI.
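The rules behind the authors graph are simple enough to sketch in Python. The data below is entirely hypothetical (it is not taken from CPAN), and the variable names are illustrative; the sketch only shows how node size, edges, and label size follow from who released and who uses which modules.

```python
from collections import defaultdict

# Hypothetical data: which modules each developer released and which they use
released = {
    "alice": ["A::One", "A::Two", "A::Three"],
    "bob": ["B::Http"],
    "carol": ["C::App"],
}
uses = {
    "alice": ["B::Http"],
    "bob": [],
    "carol": ["B::Http", "A::One"],
}

# Look up the developer behind each module
owner = {module: dev for dev, modules in released.items() for module in modules}

# Node size: number of modules the developer has released
node_size = {dev: len(modules) for dev, modules in released.items()}

edges = set()                  # (user, author) pairs
label_size = defaultdict(int)  # how often others use a developer's modules

for dev, modules in uses.items():
    for module in modules:
        author = owner.get(module)
        if author is not None and author != dev:
            edges.add((dev, author))
            label_size[author] += 1
```

In this toy data set, "alice" gets the largest node (three modules) while "bob" gets the largest label (his single module is used twice), which mirrors the contrast described above.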
Last month, Franck introduced a new project, Github Explorer. He explained that this was a very natural choice:
I wanted to do something similar again, but not with the same data. So I took a look at what could be a good subject. One of the things that we saw from the map of the websites is the importance github is gaining inside the Perl community. Github provides a really good API, so I started to play with it.
This time, Franck didn't limit himself to the Perl community, but looked at the whole community of users of Github, a popular web-based hosting service for projects that use the Git revision control system. He warns that Github doesn't represent the whole open source community and that he has collected only a selection of all user profiles, but the result nevertheless gives us a good picture. Each profile is represented by a node, and a link between two profiles is represented by an edge. The weight of the edge is incremented each time the person forks code from the target profile.
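The edge weighting described here amounts to counting fork events per pair of profiles. The following Python sketch uses hypothetical fork data rather than anything fetched from the Github API, and writes the result as an edge list with Source, Target, and Weight columns, a layout assumed here to suit a spreadsheet-style edge import in a graph tool.

```python
import csv
import io
from collections import Counter

# Hypothetical fork events: (profile that forked, profile that was forked from)
forks = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "dave"),
]

# Each fork increments the weight of the directed edge forker -> target
weights = Counter(forks)

# Write the weighted edges as CSV text (column names are an assumption)
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
for (source, target), weight in sorted(weights.items()):
    writer.writerow([source, target, weight])

edge_csv = buffer.getvalue()
```

Keeping the edges directed preserves who forked from whom; summing the weights per target profile would then show whose code is forked most often.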
On his blog, Franck shows some Github visualizations he made with Gephi, colored by country and split according to programming language. He has some thought-provoking analyses about some of the languages. For example, the Perl community is clearly split between the 'west' and Japan. In the Python community, there is clearly one main project, the web framework Django. PHP is the only community on Github where the visualization shows clusters of people working together on a specific project. The Ruby graph looks like a big ball of yarn with a couple of isolated countries. In other visualizations, Franck split the graphs by country. He offers the data he gathered for use in Gephi, he has published all the graphs on Flickr, and he plans to offer printed A2 and A1 posters for sale soon.
Towards a better understanding of open source communities
Many of these visualizations are beautiful, but that's not the point. To paraphrase Richard Hamming's dictum "The purpose of computing is insight, not numbers", your author would say "The goal of visualization is not beauty, but understanding." Visualization tools can help us understand the internals and the dynamics of open source projects and communities. While code_swarm and Gource can show users a lot about patterns and evolutions in the development of a specific project, including how the developer community works together, CPAN Explorer and Github Explorer visualize global connections between many open source projects, which are also an important factor in open source communities. Now we just have to wait for some creative minds to visualize the SourceForge or Launchpad communities.