Visualizing open source projects and communities
Visualization is a critical tool for exploring and understanding large amounts of data. Thanks to the computing power of the 21st century, it has become possible to visualize ever-expanding amounts of data. Because the open source development model is massively decentralized and network-centric, it is by its nature a perfect domain for graph-based visualizations. Connections or dependencies between projects, communities, and code commits can be explored and displayed in many ways. These visualizations can give us a unique perspective on open source projects and communities, for example by revealing fundamental differences in their approach.
Your author has a longstanding interest in visualizations, especially of non-numerical information. The classic books about visualizing complex data are of course Edward Tufte's works, beginning with The Visual Display of Quantitative Information. Recently, your author has enjoyed reading more programming-oriented books like Ben Fry's Visualizing Data and Toby Segaran and Jeff Hammerbacher's Beautiful Data. Even better, a lot of open source software exists to put this theory into practice. We'll look at a few of the most interesting open source visualization programs and their application to open source projects and communities.
Code_swarm
![[Apache code_swarm]](https://static.lwn.net/images/2010/code_swarm-apache-sm.png)
Michael Ogawa, a Ph.D. student in the Visualization & Interface Design Innovation group of UC Davis, conducted some interesting research about software visualization. The purpose of this research is to help understand the relationship between the communication among developers and the evolution of the source code. In 2007, Michael published a paper about Visualizing Social Interaction in Open Source Software Projects [PDF]. In 2008, he presented StarGate [PDF], a system that visually groups the developers of a software project into clusters corresponding to the areas of the file repository they work on the most. For both visualization methods, he used Apache and PostgreSQL as case studies. Interested readers should consult the papers for some illustrative insights into these projects.
Michael's most popular visualization method is code_swarm, which shows the history of commits in an open source project as a video. Both developers and files are represented as moving elements. When a developer commits a file, it lights up and flies towards that developer. Files are colored according to their purpose, such as whether they are source code or a document. If files or developers have not been active for a while, they fade away. The design of code_swarm is explained in the paper code_swarm: A Design Study in Organic Software Visualization [PDF], which shows some case studies of Python, Eclipse, Apache, and PostgreSQL. Videos generated by code_swarm for these projects are also available on the web site.
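The fading behavior described above can be pictured with a small model. This is only a sketch under stated assumptions: the linear decay, the `fade_after` window, and the function names are all hypothetical and are not taken from the code_swarm source.

```python
def brightness(last_active, now, fade_after=30.0):
    """Return 1.0 for an item that was just active, falling linearly to 0.0.

    Assumption: linear decay over `fade_after` time units. code_swarm's
    real decay curve may well differ.
    """
    idle = now - last_active
    return max(0.0, 1.0 - idle / fade_after)


def visible_items(last_commit, now, fade_after=30.0):
    """Keep only the items that have not fully faded at time `now`."""
    levels = {name: brightness(t, now, fade_after) for name, t in last_commit.items()}
    return {name: level for name, level in levels.items() if level > 0.0}


# Hypothetical timestamps of the most recent commit touching each file.
last_commit = {"httpd.c": 100.0, "README": 85.0, "old.c": 10.0}
frame = visible_items(last_commit, now=100.0)
```

Rendering one such frame per time step, with files and developers drawn at the computed brightness, gives the basic effect of inactive elements disappearing from the animation.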
The code for code_swarm, written in Ben Fry's Java-based open source programming environment Processing, is available under the GPL v3. It supports various types of repositories: Subversion, CVS, Git, Mercurial, Perforce, VSS, Starteam, and Darcs. The wikiswarm add-on even allows visualizing Wikipedia page histories and user contributions. By downloading and executing the code, anyone can create their own software visualizations, and there's a mailing list for help. The project's wiki also has some documentation, such as a step-by-step guide on how to generate a video, a FAQ, and a gallery of third-party code_swarm videos.
Gource
![[Python Gource]](https://static.lwn.net/images/2010/gource-python-sm.png)
At the end of 2009, New Zealand software developer Andrew Caudwell presented his software visualization project Gource (a play on Source and Gorse) on his computer graphics blog The Alpha Blenders. Gource takes the logs from the version control system of a software project and displays them as an animated tree with the root directory of the project at its center. Directories appear as branches with files as "leaves", represented by spheres that are colored according to their file extension. Developers currently contributing to the project can be seen floating near the files they are modifying. The whole visualization looks organic and is interactive, as the user can rotate the view and move the camera position.
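The tree structure described above, with directories as branches and files as leaves grouped by extension, can be sketched in a few lines. The function names and sample paths below are hypothetical; this illustrates the data structure only and is not how Gource itself is implemented.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def build_tree(paths):
    """Nest file paths into dicts: directories map to dicts, files to their extension."""
    root = {}
    for path in paths:
        parts = PurePosixPath(path).parts
        node = root
        for directory in parts[:-1]:
            # Create the branch on first sight, then descend into it.
            node = node.setdefault(directory, {})
        node[parts[-1]] = PurePosixPath(path).suffix or "(none)"
    return root


def files_by_extension(paths):
    """Group files by extension, the property used to pick a leaf color."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).suffix or "(none)"].append(path)
    return dict(groups)


# Hypothetical file list, as it might be extracted from a commit log.
paths = ["README", "src/main.c", "src/util.c", "docs/index.html"]
tree = build_tree(paths)
groups = files_by_extension(paths)
```

Replaying the commit log in time order and adding each touched file to such a tree yields the growing structure that the animation lays out around the root directory.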
The code for Gource is available under the GPL v3. It's designed for use with Git, Mercurial, or Bazaar, but it also has scripts to support CVS and Subversion. It needs a 3D-accelerated video card and uses OpenGL for rendering. The wiki has some documentation, such as how to show Gravatar images for developers or how to change the appearance. The wiki also explains how to produce a video and shows some example videos and screenshots. In January, Andrew showed some of his visualizations at linux.conf.au.
In the last few months, several enthusiasts have been experimenting with Gource. For example, Michael DeHaan used Gource to create a visualization of Red Hat's provisioning server Cobbler and he explained that it can be really useful to evaluate an open source project:
Michael's visualization inspired Daniel Berrange to do the same exercise for libvirt:
Daniel also produced a visualization of libvirt using code_swarm, which makes it easy to compare the merits of both methods.
Gephi
![[CPAN Explorer]](https://static.lwn.net/images/2010/cpan-explorer-authors-sm.png)
A third, more research-oriented and more general graph visualization tool is Gephi, which is Java-based and is distributed under the GPL v3. Users of this tool call it "like Photoshop for graphs", or should we say "like GIMP for graphs"? It's a very powerful and interactive tool for exploring, manipulating, and visualizing graphs: users can manipulate the structure, shapes, colors, and locations of the nodes, but they can also find the shortest path between nodes, compute graph metrics, find clusters, and conduct a lot of advanced graph analyses. Gephi was originally created in 2007 at WebAtlas, a French non-governmental organization involved in mapping the web and data mining, but is now developed by an international consortium of open source developers.
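To give an idea of the kind of analysis mentioned above, here is a minimal sketch of a shortest-path search and a simple graph metric on a toy graph. It uses plain breadth-first search on an unweighted graph; the graph data is hypothetical, and nothing here reflects Gephi's own implementation.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return a shortest path as a list of nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical directed graph as an adjacency list.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
path = shortest_path(graph, "a", "e")

# Out-degree per node, one of the simplest graph metrics.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
```

An interactive tool adds the layout and rendering on top, but the underlying questions (how are two nodes connected, which nodes have the most links) are of this form.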
Gephi is not oriented specifically towards visualizing software projects: the wiki page about data sets that can be explored and visualized with Gephi gives examples such as the structure of the internet, the topology of the Western States Power Grid of the United States, airlines, a network of disorders and disease genes linked by known disorder-gene associations, and a couple of social networks.
However, the social networks section shows some interesting data sets of open source projects, which have all been visualized in Gephi by Franck Cuny, a Perl hacker working at the French social media agency Linkfluence. He developed the CPAN Explorer web site, an interactive visualization to analyze relationships between developers and packages of CPAN (Comprehensive Perl Archive Network). The authors page shows relationships between developers inside CPAN: each developer is represented by a node with a size proportional to the number of modules the developer has released on CPAN. An edge between two developers is created when one developer uses a module from the other developer. The more uses of a developer's modules by other developers, the bigger the label.
One can deduce some interesting facts about developers from this graph: for example, Adam Kennedy gets a big node with a small label, because he has released a lot of modules on CPAN, but few of them are used by other CPAN developers. In contrast, Gisle Aas has a small node with a big label, because he has few modules, but some very popular ones like LWP and URI.
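The rules behind the authors graph can be sketched as follows. The author and module names are made up for illustration, and the function is only one plausible reading of the description above, not CPAN Explorer's actual code.

```python
from collections import Counter

# Hypothetical data: which author released which module ...
module_author = {
    "Mod::A1": "alice", "Mod::A2": "alice", "Mod::A3": "alice",
    "Mod::B1": "bob",
}
# ... and which modules each author uses in their own code.
uses = {
    "alice": ["Mod::B1"],
    "carol": ["Mod::B1", "Mod::A1"],
    "dave": ["Mod::B1"],
}


def build_author_graph(module_author, uses):
    """Return (node_size, label_size, edges) for the author graph."""
    # Node size: number of modules an author has released.
    node_size = Counter(module_author.values())
    # Label size: number of uses of an author's modules by other authors.
    label_size = Counter()
    edges = set()
    for user, modules in uses.items():
        for module in modules:
            owner = module_author[module]
            if owner == user:
                continue  # using your own module creates no edge
            edges.add((user, owner))
            label_size[owner] += 1
    return node_size, label_size, edges


node_size, label_size, edges = build_author_graph(module_author, uses)
```

In this toy data, "alice" ends up with a big node and a small label (three modules, one use by others), while "bob" gets a small node and a big label (one module, three uses), mirroring the two kinds of profile discussed above.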
The community page shows web sites that write about Perl. Each web site is represented by a node with a size dependent on the number of inbound links. A hyperlink between two web sites is represented by an edge. The official Perl community web sites get a purple node, bloggers a green node, open source web sites a red node, companies a black node, and CPAN author pages a blue node. The last CPAN visualization on CPAN Explorer shows the dependencies between CPAN module distributions. The CPAN Explorer developers offer static PDF and SVG versions of all these graphs and dynamic JavaScript visualizations, but also the original data sets in the GEXF file format to explore in Gephi.
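For readers who want to prepare their own data for Gephi, the sketch below writes a tiny graph in the XML-based gexf format using only Python's standard library. The namespace and version string are assumptions based on the GEXF 1.2 draft (the CPAN Explorer files may use a different version), and the node and edge data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: GEXF 1.2 draft namespace; adjust to the version your tool expects.
GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(nodes, edges):
    """Serialize {id: label} nodes and (source, target) edges to a GEXF string."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", {"id": node_id, "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el, "edge",
            {"id": str(index), "source": source, "target": target},
        )
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical two-node graph.
xml_text = to_gexf({"a": "Author A", "b": "Author B"}, [("a", "b")])
```

Saving `xml_text` to a file with a `.gexf` extension should give something a GEXF-aware tool can open, though the result has not been checked against any particular Gephi release.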
Last month, Franck introduced a new project, Github Explorer. He explained that this was a very natural choice:
This time, Franck didn't aim for the Perl community only, but the whole community of users of Github, a popular web-based hosting service for projects that use the Git revision control system. He warns that Github doesn't represent the whole open source community and that he has collected only a selection of all user profiles, but nevertheless it gives us a good picture. Each profile is represented by a node, and a link between two profiles is represented by an edge. The weight of the edge is incremented each time the person forks code from the target profile.
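The edge-weight rule described above amounts to counting fork events per pair of profiles. A minimal sketch, with made-up profile names and a hypothetical `fork_edges` helper:

```python
from collections import Counter


def fork_edges(fork_events):
    """Map (forker, target) pairs to a weight: the number of forks between them."""
    weights = Counter()
    for forker, target in fork_events:
        # Each fork from `target` by `forker` increments the edge weight by one.
        weights[(forker, target)] += 1
    return weights


# Hypothetical fork events as (person who forks, profile forked from).
events = [("ann", "ben"), ("ann", "ben"), ("cy", "ben"), ("ben", "ann")]
weights = fork_edges(events)
```

Edges are directed here, so a fork by "ben" from "ann" is counted separately from forks in the other direction; whether Github Explorer treats the direction the same way is not stated in the description.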
![[Github Explorer PHP]](https://static.lwn.net/images/2010/github-explorer-php-sm.jpg)
On his blog, Franck shows some Github visualizations he made with Gephi, colored by country and split according to the programming language. He has some thought-provoking analyses about some of the languages. For example, the Perl community is clearly split between the 'west' and Japan. In the Python community, there is clearly one main project, the web framework Django. PHP is the only community on Github where the visualization shows clusters of people working together on a specific project. The Ruby graph looks like a big ball of yarn with a couple of isolated countries. In other visualizations, Franck split the graphs according to their country. He offers the data he gathered for use in Gephi, he has published all the graphs on Flickr, and he will offer a printed version on posters of size A2 and A1 for sale soon.
Towards a better understanding of open source communities
Many of these visualizations are beautiful, but that's not the point: to paraphrase Richard Hamming's dictum "The goal of computation is not numbers, but insight", your author would say "The goal of visualization is not beauty, but understanding." Visualization tools can help us understand the internals and the dynamics of open source projects and communities. While code_swarm and Gource can show users a lot about patterns and evolutions in the development of a specific project, including how the developer community works together, CPAN Explorer and Github Explorer are about visualizing global connections between a lot of open source projects, which is also an important factor in open source communities. Now we just have to wait for some creative minds to visualize the SourceForge or Launchpad communities.
Index entries for this article | |
---|---|
GuestArticles | Vervloesem, Koen |