Google's project hosting service

August 9, 2006

This article was contributed by Stacey Quandt

Google used the recent O'Reilly Open Source Convention (OSCON) to announce that it is launching a project hosting service. The two primary features of this service are Subversion hosting and a brand new take on managing bug reports.

Google has seven Subversion developers on staff who are building a new storage back-end for Subversion to store data in a "Bigtable." A Bigtable is a system for storing and managing very large amounts of structured data. The system is designed to manage several petabytes of data distributed across thousands of machines, with very high update and read request rates coming from thousands of simultaneous clients. This architecture allows Google to scale Subversion up to meet the demands of storage and concurrency it believes will be needed to serve its members. According to Google's Greg Stein, “The existing two back-ends for Subversion (Berkeley DB and flat files) just do not have the capability to scale to our needs. The Bigtable system also gives us things like failover, monitoring, and performance tuning capabilities that are not present in the standard Subversion back-ends.” More information on Google's version of Subversion can be found on the FAQ.

When asked if Google intends to contribute its Bigtable code back to Subversion, Greg Stein responds: “We're certainly not opposed to the concept, but the devil is in the details.” The issue is that the code that interacts directly with Bigtable cannot be contributed back to the Subversion project since Google has no plans to publish the source code to Bigtable at this time. Stein explains, “We have made a number of changes in the functional tests, and a couple higher level libraries that we are going to contribute back.” However, source code changes that are highly specific to Google's environment will not be contributed back to the Subversion project because, as Stein says, “It would not make sense...[since]... those changes would needlessly pollute the code base with no measurable benefit for others.” In essence, Stein isn't opposed to contributing source code back to the community and stresses that “We've got to figure out what the best line is that helps the public code base.”

One potential solution is to publish a non-working copy of the back-end database simply to see if there is some interest in the open source community for reviewing Google's model. Stein says: “The lessons learned and control/data flow patterns might be helpful for other, future back-ends.” Since Google started work on a version of Subversion that could be integrated with Google's technology, “We have been heads-down getting the service built and delivered to the public,” claims Stein. He further states, “We have much more work that we want to do, but it may be time for a breather to review what we've done and figure out the best options to get some pieces published.”

Google's ability to contribute the source code for its issue tracker back to the open source community falls under constraints similar to those it faces with Subversion. Stein explains, “When you subtract the Bigtable code, the search technology, and a few of the other proprietary pieces, then there is actually very little left.” Stein asserts Google has talked about this right from the start. In the event that someone should want to replicate Google's issue tracker, Stein says, “We'd happily consult with that community about what we've done. There may be a couple pieces we can provide (under the Apache license).”

As for the architecture of the issue tracker, Google disregarded the idea of a heavily structured database and replaced it with a free-form system based on Google's search technology. Issues can be arbitrarily labeled to note version information, operating system, milestones, priority, or other project-specific information. Users can query across all of the descriptions, comments, and labels to find the relevant issues. Advanced search allows a user to search just the labels or just the status of an issue. On top of this new model for storing and querying issues, Google built an Ajax-based interface to make it very easy for users to interact with. Issues are listed in a standard list format, but users can make basic changes to the user interface, including adjusting the columns and sorting.
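The free-form label model described above can be illustrated with a small sketch. This is not Google's implementation, which is built on proprietary search technology; the `Issue` class, the `search` function, the label spellings, and the plain substring matching are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical issue record: free-form labels instead of fixed fields."""
    issue_id: int
    summary: str
    description: str
    status: str = "New"
    labels: set = field(default_factory=set)
    comments: list = field(default_factory=list)


def matches(issue, query, scope="all"):
    """Return True if every whitespace-separated query term occurs in the scope.

    scope="labels" or scope="status" restricts the search, mirroring the
    "advanced search" the article mentions. The default scope covers the
    description, comments, and labels; including the summary is an assumption.
    """
    if scope == "labels":
        parts = list(issue.labels)
    elif scope == "status":
        parts = [issue.status]
    else:
        parts = [issue.summary, issue.description, *issue.comments, *issue.labels]
    haystack = " ".join(parts).lower()
    return all(term in haystack for term in query.lower().split())


def search(issues, query, scope="all"):
    """Return the issues matching the query, preserving input order."""
    return [issue for issue in issues if matches(issue, query, scope)]


issues = [
    Issue(1, "Crash on startup", "Segfault when launching",
          labels={"Priority-High", "OpSys-Linux"}),
    Issue(2, "Typo in docs", "README misspells linux", status="Fixed",
          labels={"Priority-Low"}),
]
linux_labelled = search(issues, "linux", scope="labels")
```

Because nothing in the record is a fixed column, a project can invent a label such as `Milestone-1.0` without any schema change; the cost is that consistency of label names is left to the project's developers.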

Google has also made it simpler to submit a bug report. Stein says, “Today a user is typically faced with a crazy set of drop-downs and fields covering everything from priority, to software components, to the target milestones.” Stein asks the logical question: “How is the user supposed to know any of this? They just wanted to use that screaming mp3 server, and have no idea whether the affected component is Foo or Bar.” Google addresses this potential problem by only requiring the user to specify a summary and description. The user can also optionally attach files and indicate that they want updates as developers work on the bug report. Project developers can add, remove, or alter labels, assign owners, and change the status of an existing bug report; when they create a new issue in Google's issue tracker, they can add these labels as part of creating the bug report.
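The split between what a reporter must supply and what developers manage afterwards might be sketched as follows. The names `BugReport`, `file_report`, and `triage`, along with the status values, are hypothetical and are not taken from Google's service.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BugReport:
    # Supplied by the reporter: only the first two are required.
    summary: str
    description: str
    attachments: list = field(default_factory=list)
    wants_updates: bool = False
    # Managed by project developers.
    labels: set = field(default_factory=set)
    owner: Optional[str] = None
    status: str = "New"  # hypothetical default status


def file_report(summary, description, attachments=(), wants_updates=False):
    """Create a report from the two required fields plus the optional extras."""
    if not summary.strip() or not description.strip():
        raise ValueError("summary and description are required")
    return BugReport(summary.strip(), description.strip(),
                     list(attachments), wants_updates)


def triage(report, add_labels=(), remove_labels=(), owner=None, status=None):
    """Developer-side update: adjust labels, assign an owner, change status."""
    report.labels |= set(add_labels)
    report.labels -= set(remove_labels)
    if owner is not None:
        report.owner = owner
    if status is not None:
        report.status = status
    return report


report = file_report("Server crashes", "It dies when I press play")
triage(report, add_labels=["Priority-High"], owner="some-developer",
       status="Accepted")
```

The reporter never sees priority, component, or milestone choices; those become labels that a developer attaches once the report has been read.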

Stein claims, “Most open source groups don't require the heavy structure or workflow that is present in today's issue trackers.” Still, Stein concedes that there are some large groups that do need these features, but they are typically in the minority. By focusing on the majority's needs, Google's take on bug reports could turn out to be beneficial for the open source community.

Google's Project Hosting enters a crowded space with alternative services from not only Sourceforge.net but also Savannah and Debian's Alioth, among others. This leads to the question of how easy it is to import a project, or to export it and move it somewhere else in the future. According to Stein, the answer is “Not very easy”. This is because at present there is no way to upload or download a Subversion dump file. Google engineers are working on both of these efforts. Stein says, “For upload, we'll maybe do something in combination with a file upload/download feature or rely on the revision of Subversion 1.4's sync/replay feature when it is released and after we upgrade the servers.”

Download is a different story. Google plans to make the dump file available to project owners so they can always access their complete information. Stein states, “We know how important it is to open source groups to know that they are not locked into a hosting service.” Google does not support the data export capability today but it does plan on allowing for the export of all information. The import and export functionality is not defined yet and Google plans to investigate using some simple APIs for this. Stein voices some concern about this approach and says: “I have a natural wariness with APIs. If you get them wrong then you can paint yourself into a corner.”

A question on some people's minds is: will Google project hosting offer the same services as Sourceforge? Google project hosting is similar to Sourceforge in its goal to encourage open source projects and foster productive open source communities. Aside from architectural considerations, another difference between the two services is that the new Google service will not include Web site hosting and will initially target smaller projects. Since Google has no plans to make it easy to move a project from other hosting sites, it appears that Sourceforge.net does not have to worry about losing its share of current users.

Stein stresses: “Sourceforge is one of the major cornerstones of the open source community, and we have zero interest in damaging that foundation.” Stein recognizes that people may develop tools on their own, especially once the Google project hosting system has a better import system, but he says, “We have no plans to be an instigator for that.” If you try to create a project at Google Code using the name of a Sourceforge project, Google will stop the process and note the conflict. An email will be sent to the owner of the Sourceforge project requesting approval (or denying the project creation). Google wants to prevent malicious impersonation or accidental name conflicts and worked with Sourceforge to get a list of all hosted projects and email addresses of the owners. Google is also working with other hosting sites such as tigris.org, java.net and Codehaus to avoid naming conflicts.
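The name-conflict check described above could look roughly like the sketch below. The registry layout, the case-insensitive comparison, and the names `request_project` and `CreationResult` are assumptions for illustration; the article does not say how Google stores or compares the project lists it obtained from other sites.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CreationResult:
    created: bool
    conflict_with: Optional[str] = None  # site that already uses the name
    notify_email: Optional[str] = None   # owner to ask for approval, if known


def request_project(name, local_names, external_registry):
    """Try to reserve a project name.

    local_names: set of names already hosted locally (lower-cased).
    external_registry: {site: {project_name: owner_email}} built from lists
    shared by other hosting sites (hypothetical structure).
    """
    key = name.strip().lower()  # assumption: names compare case-insensitively
    if key in local_names:
        return CreationResult(created=False, conflict_with="local")
    for site, projects in external_registry.items():
        if key in projects:
            # Stop the process; the caller would email the existing owner.
            return CreationResult(False, site, projects[key])
    local_names.add(key)
    return CreationResult(created=True)


registry = {"sourceforge": {"exampleproj": "owner@example.org"}}
hosted = set()
blocked = request_project("ExampleProj", hosted, registry)
accepted = request_project("fresh-project", hosted, registry)
```

Returning the owner's address instead of sending mail directly keeps the check free of side effects, so the approval step can be handled (and tested) separately.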

Google has set initial storage limits at 100 MB for Subversion, and 50 MB for issue attachments. Stein says, “These limits will be more than enough for open source projects, but we can individually adjust them for valid projects.” The limits are designed to prevent spam or abusive projects from inappropriately using Google's services to host content which is unrelated to free software projects or not freely redistributable.

The first step in getting started is creating a Gmail account, which is required for project owners and members. Owners have the ability to reconfigure projects, add/remove other owners and members, and manage basic metadata about the project. Members can commit to the repository and can change metadata on bug reports. To file a bug report or comment on one, a user only needs a Google account with a verified email address. A Google account can be associated with any email address; a Gmail account is not required for this purpose. A valid email address is required so that the project members can get in touch with the person filing the bug report in the event that further clarification is required.

Google requires a Gmail account for project owners and members in an attempt to obtain a higher certainty that they are not bots that could use the project space for spam or other malicious purposes. The fact that all owners and members use a Gmail account may also help Google in future integration efforts.
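The account and role rules above can be summarized as a small permission table. This is only a reading of the article, not Google's code: the action names are invented, and letting owners inherit the members' permissions is an assumption the article does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    MEMBER = "member"
    USER = "user"  # anyone signed in who is neither owner nor member


@dataclass(frozen=True)
class Account:
    email: str
    verified: bool  # the address has been confirmed by email
    is_gmail: bool


USER_ACTIONS = {"file_issue", "comment"}
MEMBER_ACTIONS = USER_ACTIONS | {"commit", "edit_issue_metadata"}
# Assumption: owners can do everything members can, plus administration.
OWNER_ACTIONS = MEMBER_ACTIONS | {"configure_project", "manage_members",
                                  "edit_project_metadata"}

PERMISSIONS = {
    Role.USER: USER_ACTIONS,
    Role.MEMBER: MEMBER_ACTIONS,
    Role.OWNER: OWNER_ACTIONS,
}


def can(account, role, action):
    """Return True if the account, acting in the given role, may do the action."""
    if role in (Role.OWNER, Role.MEMBER) and not account.is_gmail:
        return False  # owners and members must use a Gmail account
    if role is Role.USER and not account.verified:
        return False  # reporters need a verified Google account
    return action in PERMISSIONS[role]


developer = Account("dev@gmail.com", verified=True, is_gmail=True)
reporter = Account("reporter@example.org", verified=True, is_gmail=False)
```

With these definitions, `can(reporter, Role.USER, "file_issue")` is true while `can(reporter, Role.MEMBER, "commit")` is false, which matches the distinction Stein draws in the comments below the article.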

It is clear that Google wants to participate in the free software development process and provide a viable alternative to other open source project repositories. Less clear is whether Google hosting is merely a goodwill exercise with the open source community or whether its goal is to be a profit-making venture, either via advertising revenue or by encouraging more Gmail usage. Regardless, Google's new offering will no doubt be a useful service to open source developers and a challenge for other hosting sites to improve the services offered to their users. As we all know, competition is a good thing.


Google's project hosting service

Posted Aug 10, 2006 1:48 UTC (Thu) by ewan (subscriber, #5533) [Link]

...source code changes that are highly specific to Google's environment will not be contributed back to the Subversion project because as Stein says, "It would not make sense...[since]... those changes would needlessly pollute the code base with no measurable benefit for others."

This is no reason to keep the code secret, Google could publish it all, and leave it to the subversion developers to include it in main line or not. They don't have to take it if it's not useful, but they (or someone else) could use some or all of it. This approach takes those choices away.

This says either "We don't trust you to make sensible decisions." or "This isn't our real reason, but we're going to fob you off with it."

Google's project hosting service

Posted Aug 10, 2006 12:07 UTC (Thu) by hummassa (subscriber, #307) [Link]

<AOL>I agree</AOL>
But we must take into account _why_ Ggl wants to keep Bigtable
proprietary... :-(

Google's project hosting service

Posted Aug 14, 2006 23:21 UTC (Mon) by gstein (guest, #3612) [Link]

Note that we have seven Subversion developers on staff (nine in a few weeks). We have a very good idea of what is interesting for the public codebase, and what isn't :-) It isn't that we don't trust the other folks to make the right decision, it is that we've been working with them for many years and know they won't want to commit useless code into the tree.

Seriously, when you strip out the pieces that connect into Bigtable, there isn't anything all that useful left. There wouldn't be anything to put into the main line. It would just be some code sitting there as a non-functional adjunct to the FSFS and BDB backends.

And all that said... look at the article again -- first sentence, fourth paragraph. I'd like to see if we can strip the sensitive bits and publish the rest. It may still be interesting to somebody out there. For example, by taking a different look at the atomicity requirements (as we had to for Bigtable), there are some new things that could be done. The FSFS backend basically took that approach, greatly simplifying the code relative to the BDB backend. So while we can't provide a new backend to the community, somebody might be able to build a new one based on the concepts in our Bigtable-based one (or at least, from the pieces we'd be able to publish).

Bug reporting

Posted Aug 10, 2006 9:12 UTC (Thu) by shane (subscriber, #3335) [Link]

Google has also made it simpler to submit a bug report. Stein says, "Today a user is typically faced with a crazy set of drop-downs and fields covering everything from priority, to software components, to the target milestones." Stein asks the logical question: "How is the user supposed to know any of this? They just wanted to use that screaming mp3 server, and have no idea whether the affected component is Foo or Bar."

Now that sounds like an improvement! In my last company, I rejected exposing our Bugzilla to the public for this very reason. Instead, we just accepted e-mail with bug reports and our engineers created the appropriate Bugzilla entries.

I know I have avoided submitting bug reports for simple things because of the pain of having to figure out the details of the report. Sure, it's a "bad thing", but I'm lazy and have other things to do with my life other than figure out whether "priority" or "severity" means it's a minor issue. :)

bugzilla

Posted Aug 17, 2006 18:27 UTC (Thu) by j1m+5n0w (guest, #20285) [Link]

Nothing says "go away" quite like bugzilla. Let's look at the firefox bugzilla for a wonderful example of how to prevent users from filing bugs, shall we?

There are four crucial steps that you should follow before filing your first bug report:

1. Use the latest nightly build of Firefox with a new profile

Translation: We don't care about bugs in the version of firefox people are actually using. We also don't care about new versions trashing an old profile.

2. Check if you can reproduce the bug in Mozilla (Application Suite)

Translation: After we just asked you to install one extra web browser aside from the one you actually use, we're going to ask you to install another. (Notice we were nice enough not to ask you to do a binary search to find the first version wherein your bug first appears, which would have required log(N) browser installations.)

3. Check if the bug is already filed

Translation: It's your job as a user to be able to navigate our database of thousands of bugs and find the one just like yours, if it exists.

4. Finally, read the official Bug Writing Guidelines

This step at least seems reasonable.

Oh, and by the way, we'll ask for your email address and then post it publicly for all to see.

Perhaps I'm being cynical, but last time I filed a firefox bug, bugzilla seemed rather more like a challenge to be overcome than a hospitable friend welcoming me into the firefox developer community.

I can understand that the Firefox developers don't want to read bug reports like "MY INTERNETS ARE BROKE PLZ FIX", and good reports make bug fixing easier. However, if users are deterred by the process, the developers might not ever find out when they break things.

I think that one of the biggest problems for the open source community right now is poor communication between developers and users (and many users are probably too polite to complain about broken things that they didn't pay for). Open source developers should be actively encouraging their users to complain whenever they see something about the software they don't like, no matter how frivolous.

It looks like Google is doing a good job of making user-developer communication easy, by keeping the interface simple. Requiring a gmail account to post a comment seems a bit onerous, but I guess I can't really blame them for that.

bugzilla

Posted Aug 17, 2006 22:39 UTC (Thu) by gstein (guest, #3612) [Link]

Thanks, but one clarification: you only need a Google Account to file a bug or post a comment. A Google Account can be associated with any email address.

To be an Owner or Member of a project, however, you'll need a Gmail account.

bugzilla

Posted Aug 18, 2006 6:55 UTC (Fri) by j1m+5n0w (guest, #20285) [Link]

Thanks for the clarification, I didn't realize those were separate things.

Google's project hosting service

Posted Aug 10, 2006 18:53 UTC (Thu) by mmarsh (subscriber, #17029) [Link]

*sigh*

Another bug-reporting system that requires casual users to have an account. While I already have a Google account, if I needed to get one to report a bug or request a feature, I wouldn't. That's one reason why I've never filed a bug report against Firefox or Thunderbird. Telling users that they have to go through the effort of getting another account and remembering another password to tell you that your program has a bug is arrogant. To me, it's equivalent to saying, "We don't want feedback from our users." I'd much rather see a challenge-response system, if a project really has an issue with spam-by-bug-report.

why google does it and the infinite accounts issue

Posted Aug 10, 2006 19:27 UTC (Thu) by berntsen (guest, #4650) [Link]

First, I believe I know why google makes all these efforts that require you to have google or gmail account: personalised data.

If you decide to use one of their many nice services, you will have to get an account which will most likely create you a cookie that will follow you for all their services, including searching. Now they will know a lot about you and can make better services for you, because they know what you want (at least they ought to be able to make an educated guess). Knowing a lot about you will make it more difficult for newcomers to compete, since they do not have the amount of data making up the difference when attempting to produce _personally relevant_ services.

I came to this conclusion reading the oreillynet article on web 2.0: http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/...

I may be wrong about what they do technically with cookies (as I haven't bothered to investigate), but I still believe very much that their drive is to gather data about users to let them make better services (and this way make more nice money :)

Second, as most of us, I am tired of the infinite number of accounts you are required to make, in order to 'participate' on sites on the web. I stumbled upon the proposal, openid, http://www.openid.net/ while reading blogs on planetrdf, and it seems a neat solution to me. Someone has voiced security concerns (of course), but the idea _is_ very neat I think.

Happy computing,
/\/ikolaj (blog.efef.dk)

why google does it and the infinite accounts issue

Posted Aug 14, 2006 23:28 UTC (Mon) by gstein (guest, #3612) [Link]

Euh... no. We have very specific rules against that. For example: we were accidentally logging some user IDs in our Subversion logs, and had to go fix some code to eliminate that. (yes, we have a team that specifically reviews what gets logged, and ensures that information that might identify a person is *NOT* logged).

We require a person to sign in with a "verified" Google Account. That means we know the person (at some point in time) could receive email there. That ensures that the Owners/Members of the project can contact the person filing the bug report.

Spam bug reports and spam comments are a serious pain. And even more so for a site in the "google.com" domain. Over time, might we find other ways to fix this? Possibly. But we have so many other interesting ideas and features to bring to users and developers, that tackling the problem is very low on our list.

why google does it and the infinite accounts issue

Posted Aug 15, 2006 9:09 UTC (Tue) by berntsen (guest, #4650) [Link]

I assume you are affiliated with google, and if so I appreciate your response.

But I think you misunderstand me a bit. When I say personalised I do not mean that you find out which person is behind a user id (a cookie) and provide better services because you know the 'person' (by looking at interests on the persons home page, e.g).

What I mean is, e.g., that if a person searches a lot for java and always follows the coffee links rather than the programming language links, you will alter the rating for the cookie such that coffee related java links will appear on top. Likewise you will present him coffee related ads rather than programming related ads.

Of course I can think paranoid thoughts (in these days where conspiracy theories bloom in the cinemas) that Google has research teams writing algorithms to automatically find the person behind any cookie, and that NSA or whatever has full access to your logs and uses it for whatever, ;-) But that was not my point at all.

Happy happy,
/\/ikolaj

gmail account needed

Posted Aug 10, 2006 22:19 UTC (Thu) by joey (subscriber, #328) [Link]

The bug tracker ideas sound interesting, but the whole business of needing a gmail account to work on a free software project hosted there is somewhere between annoying and scary.

We've already seen gmail account offer spam issues drown out useful conversation on some mailing lists. (Well, I haven't in a while, since I have a mailfilter to drop any mail that looks like one in /dev/null.) Heaven help a project where all the developers need to obtain a gmail account before doing work, and that spam becomes on-topic and required!

gmail account needed

Posted Aug 11, 2006 23:33 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

I don't follow you. What is a "gmail account offer spam issue" and how would having a gmail account make spam on-topic and required?

One thing that isn't at all clear from the article is why Google requires the gmail account. The article alludes to another kind of account called a Google account (which one can use to submit a bug report), and I guess that's not sufficient for Google's purposes for project membership.

gmail account needed

Posted Aug 14, 2006 23:33 UTC (Mon) by gstein (guest, #3612) [Link]

A Google Account can be associated with any email address. We also require that it is "verified", meaning that we send the address an email, and the person needs to click on a URL in there to say "yup. received it." A plain old Google Account is needed for entering issues or commenting on them; this is to ensure that the Owners/Members of a project can contact that person later.

To be a project Owner or Member, we strengthen it to say a Gmail account. As a start, it means that I can be referred to as "gstein" rather than a full email address. Real Soon, we'll be providing alternate handles so that I don't have to expose my gmail inbox. You can see these alternates in use on Picasa Web Albums.

In the future, we can also use the Gmail accounts to tie in with other Google services to enhance the entire project hosting service.

gmail account needed

Posted Sep 1, 2006 6:59 UTC (Fri) by bjornen (guest, #38874) [Link]

> What is a "gmail account offer spam issue" and how would having a gmail account make spam on-topic and required?

You can only get a Gmail account if you're invited by someone who already has one. Thus all the "gmail account offer spam".

Subversion at Google

Posted Aug 14, 2006 9:29 UTC (Mon) by dion (subscriber, #2764) [Link]

Personally I think the more interesting aspect of this is that Google has a lot of SVN developers on the payroll and that they are porting it to bigtable.

Could this mean that Google might be planning to move their internal source from Perforce to Subversion?

If they do move to svn then it would certainly be a great win for the svn project and I'm willing to bet that tons of features and fixes would flow from their use.

Subversion at Google

Posted Aug 14, 2006 23:37 UTC (Mon) by gstein (guest, #3612) [Link]

Daniel Berlin has been working on the much-sought-after "merge tracking" feature for Subversion. So we've already made some very direct contributions. There are a couple other features that our guys are working on for Subversion 1.5. The ra_serf option in 1.4 was built by one of our interns (Justin Erenkrantz).

We need to figure out what portions of our Subversion/Bigtable implementation to move to the public tree. All of us here are aware of the benefits to do that... it's just a matter of getting to it.

Google's project hosting service

Posted Aug 14, 2006 19:40 UTC (Mon) by Tet (subscriber, #5433) [Link]

The strange decision here is the use of Subversion. I mean, I can understand it being an easy stepping stone up from CVS. But Subversion is anything but state of the art where revision control is concerned, and the decision to use it for a new project hosting system seems backwards to say the least. One of the distributed systems (darcs, bzr, monotone, mercurial, etc.) would have been a much better choice. There is an almost unlimited range to choose from, in various states ranging from experimental to stable and ready for production use. But surely one of them could have met the requirements that Google have, and wouldn't have hit the scalability problems that they've mentioned with svn.

Google's project hosting service

Posted Aug 14, 2006 23:52 UTC (Mon) by gstein (guest, #3612) [Link]

We have seven people on staff that are committers on the Subversion project, growing to nine in a few weeks. They are some of the foremost experts in the world on Subversion. That is more people than some of those alternatives' entire development team. With that kind of experience, why would we NOT use it?

Second, Subversion's user model and design fit very well with Google's model of providing (HTTP-based) services. We knew it would fit, and we had the depth of experience to make it work very well.

And for scalability: no version control system on the planet can meet the scaling needs that we want to target. Even in a distributed case, people still need to fetch a master copy from one central project repository to begin their work. We have to be able to support that load, and nothing out there can do it. That's why we built a new backend for Subversion. We knew that it could, given an upgraded storage system.

"State of the art" is debatable. There are lots of interesting alternatives [to centralized version control] being developed today, but I do not believe any of them have the maturity, robustness, tool support, documentation, history of deployment, and more that Subversion has. None of them. In a few years? Sure, I hope that the fragmentation will be reduced and there will be a clear and solid winner in the distributed-repository camp which answers "yes" to all of those points.

Regardless: the question of centralized versus distributed repositories is not "obvious". One is not better than the other. Both models are valid for various scenarios. We chose one model based on our experience and what we thought would make the most sense for the largest group. If that does not suit some projects' needs, then they can simply use another service. We want to do the best job possible, and that means some compromises must be made. And that means we aren't going to have everybody's favorite tool.

Google's project hosting service

Posted Sep 1, 2006 7:08 UTC (Fri) by bjornen (guest, #38874) [Link]

(Thanks for talking to us Greg) gstein wrote:
> We have seven people on staff that are committers on the Subversion project, growing to nine in a few weeks. They are some of the foremost experts in the world on Subversion. That is more people than some of those alternatives' entire development team. With that kind of experience, why would we NOT use it?

I think we all assumed that Google hired seven Subversion developers *because* they wanted to use Subversion on their new service. It's actually a coincidence?

"all those nassssty fields"

Posted Aug 21, 2006 0:59 UTC (Mon) by Baylink (subscriber, #755) [Link]

> As for the architecture of the issue tracker, Google disregarded the idea of a heavily structured database and replaced it with a free-form system based on Google's search technology.

Oh, for *phuque's* sake.

I'm sorry people, but those fields are there *so us poor slobs who have to fix things* know what we're dealing with. They're there to *force* the reporter to provide enough information to make it possible to *actually fix their problem*: if you don't fill them in, then we're just going to have to ask you all those questions anyway.

As someone who designs this sort of thing (and then has end users dumb it down to the point of uselessness), this sort of viewpoint just drives me right straight up a tree. People applauding it gives me hives.

Sure, make it easier to search. But don't take out the information that will make *finding* anything possible.

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds