The open source development model has many interesting properties, so it's not surprising that it has also been applied in domains other than software. In his talk at FOSDEM (Free and Open Source Software Developers' European Meeting) 2012 in Brussels, Ryan Lane explained how the Wikimedia Foundation is treating its infrastructure as an open source project, which enables the community to help run the Wikimedia web sites, including the popular Wikipedia.
Ryan Lane is an Operations Engineer at the Wikimedia Foundation and the Project Lead of Wikimedia Labs, a project aimed at improving the involvement of volunteers in operations and software development for Wikimedia projects. These projects, like Wikipedia, Wikibooks, and Wikimedia Commons, are well-known because of their large community of volunteers contributing content. Moreover, MediaWiki, the wiki software originally developed for Wikipedia and now also used in many other wikis, is an open source project.
In the early days, Wikimedia volunteers had a say not only in content and software, but also in infrastructure. There was no operations staff; the server infrastructure was managed entirely by volunteers. Since then, however, operations has been professionalized, and now it's all done by staff. Ryan's message in his talk was: "We want to change this, because operations is currently a bottleneck: it doesn't scale as well as software. That's why we had the idea to re-open our infrastructure to volunteers." But how do you give volunteers access to an infrastructure?
Wikimedia has already shared a lot of knowledge about its infrastructure on wikitech. This public wiki describes their network and server infrastructure in detail, including the open source software they use, such as Ubuntu, Apache, Squid, PowerDNS, Memcached, MySQL, and the configuration management tool Puppet to maintain a consistent configuration for all their servers.
Ryan's approach to opening up Wikimedia's infrastructure even more was twofold. First, Wikimedia's system administrators spent a few weeks cleaning up Wikimedia's Puppet configuration. After stripping all private and sensitive information, they published the Puppet files in a public Git repository. The sensitive material was moved to a private repository that is only available to Wikimedia staff and volunteers with root access.
But Ryan wanted more than just sharing knowledge about how Wikimedia manages its servers (the information in wikitech and the public Puppet repository): he wanted to treat operations as a real open source project where community members could edit Wikimedia's server architecture just like they did with Wikimedia's content and software. So he had to build a self-sustaining operations community around Wikimedia. For this to happen without sacrificing the reliability of Wikimedia's servers, a group of volunteers created a clone of the production cluster, which is mostly set up now. Thanks to this, staff and community operations engineers can push their changes to a test branch of the Puppet repository to try out new things on the cloned cluster. After a code review of the changes by the staff operations engineers, the code is evaluated by a test suite. If the code passes the tests, the changes are pushed to the production branch of the Puppet repository and hence the production systems are managed by the new Puppet configuration.
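The gating described above (test branch first, then staff review, then the test suite, then production) can be sketched in a few lines of Python. This is only an illustrative model, not Wikimedia's actual tooling: the `Change` and `PuppetRepo` classes, their fields, and the `test_suite` callable are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Change:
    """A proposed Puppet change (hypothetical model for this sketch)."""
    author: str
    description: str
    approved_by_staff: bool = False


@dataclass
class PuppetRepo:
    """Toy stand-in for a repository with a test and a production branch."""
    test_branch: List[Change] = field(default_factory=list)
    production_branch: List[Change] = field(default_factory=list)

    def push_to_test(self, change: Change) -> None:
        # Anyone, staff or volunteer, may push to the test branch.
        self.test_branch.append(change)

    def promote(self, change: Change, test_suite: Callable[[Change], bool]) -> bool:
        """Move a change to production only if it passed review and the tests."""
        if change not in self.test_branch:
            raise ValueError("change must be tried on the test branch first")
        if not change.approved_by_staff:
            return False  # still waiting for code review
        if not test_suite(change):
            return False  # rejected by the test suite
        self.production_branch.append(change)
        return True
```

The point of the design is that volunteers never touch production directly: a change only reaches the production branch after both gates have passed, so the review step becomes the single place where staff attention is needed.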
Wikimedia Labs is using OpenStack as a private cloud to run their server instances (virtual machines). At the moment, there are 83 instances running in the test cluster, managed by various Puppet classes, including base (for the configuration that applies to every server instance), exim::simple-mail-sender (for every server that has to send email), nfs::server (for an NFS server), misc::apache2 (for a web server), and so on.
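To illustrate how such classes compose, here is a minimal Python sketch in which every instance gets base and then picks up additional classes depending on what it does. The class names come from the talk; the role names and the `classes_for` helper are hypothetical and only serve the illustration. In a real setup this assignment is expressed in Puppet itself, not in Python.

```python
# Applied to every server instance, as described in the talk.
BASE_CLASSES = ["base"]

# Hypothetical role names mapped to the Puppet classes named in the talk.
ROLE_CLASSES = {
    "mail-sender": ["exim::simple-mail-sender"],
    "nfs-server": ["nfs::server"],
    "web-server": ["misc::apache2"],
}


def classes_for(roles):
    """Return the ordered, de-duplicated class list for an instance."""
    classes = list(BASE_CLASSES)
    for role in roles:
        for puppet_class in ROLE_CLASSES[role]:
            if puppet_class not in classes:
                classes.append(puppet_class)
    return classes


# Example: a hypothetical web server that also has to send email.
print(classes_for(["web-server", "mail-sender"]))
```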
There are also 47 projects defined in the Wikimedia Labs project, each of them implementing a specific task such as adding a new feature, adding monitoring, or puppetizing infrastructure that has been set up manually in the past. For instance, there are projects for bots, the continuous integration tool Jenkins, Nginx, Search, Deployment-prep which implements the clone of the production infrastructure, and so on. Each project has a project page on the wiki with documentation, the group members, and other information.
The interesting thing about these project pages is that most of the information is automatically generated. For example, when a server instance is running for a project, this instance is automatically shown at the bottom of the wiki page. And when someone types the command !log <project> <message> on the #wikimedia-labs IRC channel, the message is automatically logged on the project page under the heading "Server Admin Log", which is subdivided by day. That way, volunteer server administrators can explain what they did, so other volunteers who may be living in a different timezone on the other side of the world can follow what is happening in the project.
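How such a logging bot might work can be sketched as follows. This is a rough illustration, not the code Wikimedia runs: the regular expression, the `ServerAdminLog` class, and the rendered text layout are assumptions; only the `!log <project> <message>` syntax, the "Server Admin Log" heading, and the per-day grouping come from the talk.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Matches "!log <project> <message>"; the exact pattern is an assumption.
LOG_COMMAND = re.compile(r"^!log\s+(?P<project>\S+)\s+(?P<message>.+)$")


def parse_log_command(line):
    """Return (project, message) for a !log line, or None for anything else."""
    match = LOG_COMMAND.match(line.strip())
    if match is None:
        return None
    return match.group("project"), match.group("message").strip()


class ServerAdminLog:
    """Collects !log entries per project, grouped by day (hypothetical)."""

    def __init__(self):
        # project -> "YYYY-MM-DD" -> list of formatted entries
        self._entries = defaultdict(lambda: defaultdict(list))

    def record(self, nick, line, when):
        """Store the entry if the line is a !log command; report success."""
        parsed = parse_log_command(line)
        if parsed is None:
            return False
        project, message = parsed
        day = when.strftime("%Y-%m-%d")
        entry = f"{when.strftime('%H:%M')} {nick}: {message}"
        self._entries[project][day].append(entry)
        return True

    def render(self, project):
        """Render the log as plain text, newest day first."""
        lines = ["Server Admin Log"]
        days = self._entries.get(project, {})
        for day in sorted(days, reverse=True):
            lines.append(day)
            lines.extend(f"  {entry}" for entry in days[day])
        return "\n".join(lines)


log = ServerAdminLog()
log.record("alice", "!log bots restarted the bot host",
           datetime(2012, 2, 4, 10, 30, tzinfo=timezone.utc))
print(log.render("bots"))
```

Timestamps are kept in UTC here so that readers in different timezones see the same day boundaries; the nick, project, and message in the usage example are made up.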
The power of the community
So now that anyone has been able to push changes from ideas to
production on Wikimedia's cluster for a couple of months, what are the
results? According to Ryan, there are 105 users now in the Wikimedia Labs
project who have contributed a variety of Puppet configurations:
"One volunteer puppetized our existing Nagios monitoring setup (which was not managed by Puppet) in a very neat way. The bot infrastructure has also been improved much by volunteers. And at the San Francisco hackathon in January 2012 we had a project created, implemented, tested, and deployed to production during the hackathon. We have a custom UDP logging module written for nginx, and it had a couple of bugs in the format. Abe Music built an instance, installed our nginx source package, added the change to fix the formatting, then pushed them up for review. We reviewed the change, then pushed it to production. All of this happened during the hackathon."
So has this ambitious experiment been successful? According to Ryan, the original goal to lessen the bottleneck of the operations team definitely succeeded. However, he points out that the bottleneck has shifted: "We have to do these code reviews now, but fortunately it takes less time to review code than it does to make a lot of changes." Another issue Ryan sees is trust: "Giving out root to volunteers is dangerous, so we have to audit our infrastructure often. Moreover, there's always the danger of social engineering: newcomers can try to build trust to have us give them sensitive information about our infrastructure." But luckily the staff can count on a core of community people whom they trust to do these code reviews and audits.
All in all, Ryan thinks that the model Wikimedia Labs uses can also be applied in other organizations to set up a volunteer-driven infrastructure. In particular, non-profits or software development projects that rely on a big infrastructure could profit from treating operations as an open source project. In addition to tapping into the potential of technical talent in the community, opening operations is also a great way to identify skilled and passionate people to hire for a staff position.