The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 1:57 UTC (Tue) by jhs (guest, #12429)
Parent article: The Cr-48 and Chrome OS: Google's vision of the net

Thank you for the detailed review.

"Data in the cloud" systems have security concerns of their own (it would be nice if a Chrome OS system could be backed up by providers other than Google, for example)

A Chrome OS system can be backed up by providers other than Google. The answer is to use services which treat data the same way the free software movement treats code.

That is exactly the vision of my company, CouchOne. CouchDB is a Free Software database and application server. Its native protocol is HTTP and its primary feature is peer-to-peer replication. CouchDB is the kitchen sync for the web—the filesystem of the Internet.

At the latest Ubuntu Developer Summit, my standard quip was this: CouchDB sucks at everything. Except sync. And incidentally, sync is the most important feature developers will care about in the future.

I come from a strong free software background. When I interviewed at CouchOne, I was skeptical, thinking they simply hawked yet another immature "big data" NoSQL server. But I realized they think about data freedom the way I think about software freedom. Now I run our CouchDB hosting service.

That is why, unlike some free software leaders, I am excited about a more web-based, cloud-based software future—done correctly!


The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 2:12 UTC (Tue) by akumria (subscriber, #7773) [Link]

Thank you for the advertisement of your product.

One of your product claims, namely:

> A Chrome OS system can be backed up by providers other than Google. The answer is to use services which treat data the same way the free software movement treats code.

Could you provide additional material to support this claim, preferably some kind of working implementation I can install on a Chrome OS system (be it virtual machine or hardware).

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 2:56 UTC (Tue) by jhs (guest, #12429) [Link]

You're welcome! I tried to identify the free-software-like philosophy without sounding like a shill. (For example, I removed all external links.)

Data freedom is becoming a foremost concern of the free software movement leadership. I hope to promote emerging projects which are congruent with the GNU philosophy but concerning data freedom, even if I risk looking petty.

To answer your question, the Couch App ecosystem is comparatively tiny, with no flagship, general-purpose products. CouchApp development can be tedious and slow, attractive mostly to early-adopter developers, like GNU was. The exciting thing is the philosophy behind the software, like GNU was.

For example, the Kabul War Diary is a CouchApp based on the Afghanistan WikiLeaks data. It is a web application like any other: AJAX, data-driven, Google Maps UI, etc. However, you can replicate (anonymously, over HTTP) the entire free data set to your own server. One of the records in this database is the web app itself, which is free software and runs on CouchDB. Now you have a private copy—free software and free data.

I hope my enthusiasm persuades you that I am not advertising; I just happen to work in a company with a worthwhile core philosophy.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 12:42 UTC (Tue) by marduk (subscriber, #3831) [Link]

But you didn't provide an example of how one could now back up their personal data hosted elsewhere (unless you consider WikiLeaks data as "personal data" ;-)

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 19:36 UTC (Tue) by jhs (guest, #12429) [Link]

With CouchDB, a database is at a URL, such as https://my.personal.oauth-protected.domain.com/photos. You can trigger push replication to any other URL, or pull replication from any URL (assuming you are authorized to do so). You can also specify a filter policy to replicate only a subset of the data as needed.
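
As a rough sketch of what triggering replication can look like: CouchDB exposes a `_replicate` resource that accepts a JSON body naming a source and a target, optionally with a filter. The local server address, the `my_backup` database, the `photos/by_owner` filter and the helper name below are all made up for illustration, and the requests are only built here, not sent.

```python
import json
import urllib.request


def build_replication_request(server, source, target, filter_name=None, query_params=None):
    """Build (but do not send) a POST to a CouchDB server's /_replicate resource."""
    body = {"source": source, "target": target}
    if filter_name is not None:
        # "designdoc/filtername": a filter function stored in a design document.
        body["filter"] = filter_name
        if query_params:
            body["query_params"] = query_params
    return urllib.request.Request(
        url=server.rstrip("/") + "/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Pull: the local CouchDB copies a remote database into a local one.
pull = build_replication_request(
    "http://localhost:5984",
    source="https://awesome-app.com/your_username",
    target="my_backup",  # hypothetical local database
)

# Push: send a filtered subset of a local database to another URL.
push = build_replication_request(
    "http://localhost:5984",
    source="photos",
    target="https://my.personal.oauth-protected.domain.com/photos",
    filter_name="photos/by_owner",  # hypothetical filter
    query_params={"owner": "your_username"},
)
```

Passing either request to `urllib.request.urlopen` would start the replication, assuming a reachable server and suitable credentials.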

Therefore a web app which respects your freedom allows/encourages you to replicate your data to your own systems, in the same way a developer who respects your freedom allows/encourages you to take the source code and use it as you see fit. For example, you might pull all your data from https://awesome-app.com/your_username and keep it on your laptop's encrypted partition.

Where to replicate to/from, and what the filter policy does is application-specific. The replication plumbing is complete and useful; however, I concede that general-purpose applications are only now being undertaken. The point is, it's encouraging that there is free software which enables "free data" in the cloud-based future of applications.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 22, 2011 1:20 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

> A Chrome OS system can be backed up by providers other than Google. The answer is to use services which treat data the same way the free software movement treats code.

What you're saying is that Chrome OS could hypothetically back up its data to providers other than Google if Chrome OS were based on applications that use CouchDB databases.

But the point from the article is about the Chrome OS that is actually available. For mail, for example, it uses Gmail. CouchDB notwithstanding, the user's mail cannot be backed up by someone other than Google.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 22, 2011 5:57 UTC (Sat) by jhs (guest, #12429) [Link]

You are correct, I should have said "could hypothetically." Or as an optimist, I might say, "will, one day soon."

And it's not simply using CouchDB. The application must also permit users to replicate. For example, a Couch app might be allowed for browser access but replication is blocked by a firewall. The developer must both use CouchDB *and* respect your freedom. Still, I think that day is coming.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 19, 2011 15:19 UTC (Wed) by oever (subscriber, #987) [Link]

The term "Couch App" for a browser application that stores data on a Couch server is nice. Is there a JavaScript API that lets these applications also work offline from browser storage, which is then synced with the Couch server once it becomes reachable again?

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 19, 2011 15:52 UTC (Wed) by jhs (guest, #12429) [Link]

There have been projects to add Couch semantics to browser storage; however, I do not think any are production-ready. Progress usually halts when it's time to implement the replication protocol. It's not rocket surgery; however, as I will say in another comment, the only "documentation" is the Erlang reference implementation.

However, the situation you describe is pretty much the primary objective of much of the CouchDB leadership, so I'm optimistic that this will happen. In the meantime, people simply run CouchDB on the local device or even as a browser plugin. Ubuntu does that, and CouchDB can be embedded in Android and other mobile apps. Optimizing for size is only beginning; however, they think they can make CouchDB quite small and painless.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 7:27 UTC (Tue) by butlerm (subscriber, #13312) [Link]

> The answer is to use services which treat data the same way the free software movement treats code.

I am not sure that is much of a tagline. A handful of important exceptions notwithstanding, most people prefer their data to be kept considerably more private than a typical open source code base.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 18, 2011 11:45 UTC (Tue) by AndreE (subscriber, #60148) [Link]

I'm pretty sure he's talking about personal access and user rights.

A cloud service that respects the rights (freedoms) of users and allows them unfettered control and access is the ideal.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 19, 2011 4:23 UTC (Wed) by njs (guest, #40338) [Link]

Speaking of which, is there a nice technical write-up of CouchDB's synchronization algorithms anywhere? I've been thinking for years that the obvious way to do open web services is to use the tricks we discovered when inventing distributed VCS's to store our data. I keep meaning to design a mail storage format that supports efficient and well-behaved syncing as a proof of concept, but never have gotten around to it, so I'm curious what other people are doing...

In particular, the problem of merging many sorts of structured data (e.g., filesystem trees, contact books, mailboxes, etc.) in a provably correct way was pretty much solved[1], but since the solutions never got picked up by the mainstream DVCS's, I don't know if anyone actually *knows* about this, and I'm very curious how the algorithms that people are actually using in practice compare.

[1] See: http://www.monotone.ca/docs/Mark_002dMerge.html, http://thread.gmane.org/gmane.comp.version-control.monoto... for the core algorithm, and the extension to complex data structures is, uh, in my head and the monotone code base.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 22, 2011 6:02 UTC (Sat) by jhs (guest, #12429) [Link]

Unfortunately, I do not think there is a writeup. There are two reference implementations, but I believe that is all.

You inspired me to write up at least all the aspects of it which I know. Please excuse the cross-post, but I thought the greater development community would benefit so I placed it on Stack Overflow.

The CouchDB replication protocol

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 22, 2011 18:57 UTC (Sat) by njs (guest, #40338) [Link]

Thanks for taking the effort to write it up!

...I'm really struck by the idea that it doesn't allow "merge" nodes in its history graph. That seems to make correct merging/replication quite hopeless in common cases? Say you have a data store A, containing revision A1, which you then replicate to create data store B. Then both are modified, so we have:
A contains: A1 -> A2
B contains: A1 -> B2

Now we do a pull from B to A, and "merge". IIUC this creates:
A contains: A1 -> A2 -> A3, A1 -> B2 (deleted)
where A3 contains the modifications made in B2, but this isn't recorded anywhere in the data model.

Then B gets modified some more:
B contains A1 -> B2 -> B3

And now we pull from B to A again. As far as A is concerned, B3 is the child of a deleted revision. What do we do with it? Throw it out? Resurrect the divergence, and eventually merge B3 and A3 "from scratch", ignoring their common history (B2)?

I suppose I should take this to some mailing list...

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 23, 2011 1:04 UTC (Sun) by jhs (guest, #12429) [Link]

I am investigating the answer to this. Hopefully I can reply with a good answer.

At this time, it seems more likely that I have an error in my understanding than that Couch has a glaring architectural misfeature that nobody noticed all these years.

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 23, 2011 1:40 UTC (Sun) by jhs (guest, #12429) [Link]

Okay, I think I've got it. Two points:

1. Revision histories which are "deleted" are simply marked as such. They are not removed from the history graph.
2. Marking a revision "deleted" is itself a revision update. Maybe a way to think of it is a Git annotated tag.

Conceptually, a delete could look like this:

A1 -> B2 -> B2_Deleted

So yes, if you pull again and you see B3, no problem. Just put it on the revision tree. B3 and B2_Deleted are siblings with parent B2.
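
A toy model may make the two points above concrete. This is only a sketch of the idea, not CouchDB's actual data structure: the `RevTree` class is invented here, and the short names stand in for real revision IDs.

```python
class RevTree:
    """Toy revision tree: each revision has one parent; deletes are tombstone revisions."""

    def __init__(self):
        self.parent = {}      # revision id -> parent id (None for the root)
        self.deleted = set()  # revision ids that are tombstones

    def add(self, rev, parent=None, deleted=False):
        self.parent[rev] = parent
        if deleted:
            self.deleted.add(rev)

    def leaves(self):
        """Revisions that no other revision names as its parent."""
        parents = set(self.parent.values())
        return sorted(r for r in self.parent if r not in parents)

    def live_leaves(self):
        """Leaves that are not tombstones; more than one means an open conflict."""
        return [r for r in self.leaves() if r not in self.deleted]


tree = RevTree()
tree.add("A1")
tree.add("A2", parent="A1")
tree.add("B2", parent="A1")
tree.add("A3", parent="A2")                        # merge result, stored as a plain edit
tree.add("B2_Deleted", parent="B2", deleted=True)  # losing branch is marked, not removed
tree.add("B3", parent="B2")                        # arrives on the next pull

print(tree.leaves())       # ['A3', 'B2_Deleted', 'B3']
print(tree.live_leaves())  # ['A3', 'B3']
```

Two live leaves remain, so the application sees a conflict between A3 and B3, while B2 stays in the graph as their shared history.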

You do merge A3 and B3 "from scratch." This is the difference with Git. In Git, the history is the whole point. But with arbitrary records and application-specific merge semantics, it's not as meaningful that B3 and A3 have a common parent. The application is still probably more interested in the difference between the *data* rather than the ancestral history. (Not to mention, CouchDB is difficult enough without HTML/JavaScript programmers having to negotiate dependency trees in their code.)

For example, what if B3 stores "delete_account: true" which means in this application that the user wishes to completely remove his account and delete all of his data. Clearly that fact is dominant in the merge/conflict-resolution strategy and it hardly matters what the history graph looks like.
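
An application-level resolver along those lines might look like the sketch below. The `resolve` function, the `updated_at` field and the "latest edit wins" fallback are assumptions for illustration; only the `delete_account` flag comes from the example above.

```python
def resolve(conflicting_docs):
    """Pick a winner among conflicting versions of one record (app-specific policy)."""
    # Policy 1: an account-deletion request dominates everything else.
    for doc in conflicting_docs:
        if doc.get("delete_account") is True:
            return doc
    # Policy 2 (assumed fallback): keep the most recently updated version.
    return max(conflicting_docs, key=lambda doc: doc.get("updated_at", ""))


a3 = {"_id": "user:alice", "email": "alice@example.org", "updated_at": "2011-01-22T10:00:00Z"}
b3 = {"_id": "user:alice", "delete_account": True, "updated_at": "2011-01-21T09:00:00Z"}

winner = resolve([a3, b3])  # b3 wins even though a3 is newer
```

Nothing in this policy looks at the ancestry of the two versions; it only inspects their contents.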

Finally, the history graph is always there for the client (replicator). There is nothing stopping an application-level `git-rerere` implementation--perhaps as a library which all developers could use.

(Note, I'm not saying explicit merge info isn't useful. For all I know it was overlooked in the original architecture. I'm just saying it's not a showstopper.)

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 23, 2011 2:24 UTC (Sun) by njs (guest, #40338) [Link]

> So yes, if you pull again and you see B3, no problem. Just put it on the revision tree. B3 and B2_Deleted are siblings with parent B2.

Sure, that makes sense.

> The application is still probably more interested in the difference between the *data* rather than the ancestral history. [...] For example, what if B3 stores "delete_account: true" which means in this application that the user wishes to completely remove his account and delete all of his data. Clearly that fact is dominant in the merge/conflict-resolution strategy and it hardly matters what the history graph looks like.

Hrm, so what I'm basically hearing is that CouchDB's data/synchronization model in practice is:
-- You have a bunch of records which must be independent (there's no way that synchronization can respect any kind of referential integrity)
-- When synchronizing, CouchDB will automatically identify the "latest" version of any given record (via "fast-forward merge"); if anything more complicated happens, then the app is on its own. And if the app *wanted* to use a proper merge algorithm to, say, notice that the user added a phone number to this contact on their phone and also added a user picture on their computer and those edits can easily be combined -- then it's sort of doomed, because the needed history information just isn't recorded; it's expected that apps will mostly do the equivalent of two-way merge, or that divergence will be rare enough that even a crippled three-way-merge will be good enough.

Does that sound right?
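
To make the contact example concrete, here is a minimal field-level three-way merge. It is a sketch of the general technique, not of anything CouchDB provides; the function name and the sample records are invented.

```python
def three_way_merge(base, ours, theirs):
    """Field-level three-way merge of flat records; returns (merged, conflicts)."""
    missing = object()  # sentinel for "field absent"
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, missing)
        o = ours.get(key, missing)
        t = theirs.get(key, missing)
        if o == t:      # both sides agree (including both removing the field)
            value = o
        elif o == b:    # only "theirs" changed this field
            value = t
        elif t == b:    # only "ours" changed this field
            value = o
        else:           # both changed it differently: a real conflict
            conflicts.append(key)
            value = o   # arbitrary pick; a real app would apply its own policy
        if value is not missing:
            merged[key] = value
    return merged, conflicts


base = {"name": "Alice"}                                 # common ancestor
phone = {"name": "Alice", "phone": "555-0100"}           # edited on the phone
computer = {"name": "Alice", "picture": "alice.png"}     # edited on the computer

merged, conflicts = three_way_merge(base, phone, computer)
```

Without `base`, a two-way comparison cannot tell whether `phone` was added on one side or removed on the other, which is why the common ancestor matters.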

My wild guess is that in practice this architecture works fine for data that's inherently loosely coupled, and where edits are rare relative to the size of the data store and the synchronization frequency. This probably covers all the main data people are replicating these days -- bookmarks, contact lists, mail with their phone -- but I'm not sure how far you can stretch it. And it's possible to do *much* better, without adding much -- if any -- complexity.

Monotone, for instance, implements a distributed data store with complex structure (a tree of files/directories, each of which can have an arbitrary set of attributes), referential integrity constraints (can't have two files with the same names, no directory loops, etc.), and a very cheap, fully history-sensitive automated merger with a rigorous mathematical foundation ("mark merge").

> There is nothing stopping an application-level `git-rerere` implementation--perhaps as a library which all developers could use.

Sure, but git-rerere is a total hack whose purpose is to prevent Linus from seeing merge nodes. I don't imagine your end-users really have any aesthetic preferences about the shape of the graph buried inside CouchDB :-). Wouldn't it make even *more* sense for CouchDB to just store the relevant information out of the box?

Your point about not wanting to confuse developers is well-taken, but I feel like they'd be better off overall if they had better tools, instead of having to implement these things themselves.

So, hmm. Thanks for giving me something to think about!

The Cr-48 and Chrome OS: Google's vision of the net

Posted Jan 23, 2011 2:45 UTC (Sun) by jhs (guest, #12429) [Link]

I think your assessment is close enough so that further nitpicking of the minutiae wouldn't be productive. Two final thoughts:
  1. If an improvement to the architecture is possible, the community would likely be very open to that. It would be a big change, requiring good justification, but the community is flexible. I think the main problem right now is that tooling ("merge" libraries, development/debugging tools) is so much more primitive than the core database that the question is somewhat moot. The wiki's description of conflict resolution is a piece of pseudocode. In the future, it will say "If you use C, use this library; if you use Ruby, use this other." When that happens, the pain point of the history graph may become dominant.
  2. Tracking true "merges" is possible in "user space" if you will. Like a shadow government, the client can simply track its own history graph using its own mechanism. (In this case, it's just like Couch except merges are recorded as such.) The data is simply a normal key/val part of the record. If the algorithm proves to be superior, it could be baked into couch. (The advantage of Couch's revision tree is, like Unix dotfiles, it is not transmitted to the client unless explicitly requested. Otherwise it's a normal key/val datum called IIRC `revs_info`.)
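
A sketch of the "user space" idea in point 2, assuming the application keeps its own merge history in an ordinary field. The `merge_log` field and both helper functions are hypothetical; CouchDB would treat the field as plain data.

```python
def record_merge(doc, merged_fields, parent_revs):
    """Return a new version of doc whose body notes which revisions it merged."""
    new_doc = dict(doc)
    new_doc.update(merged_fields)
    # "merge_log" is an ordinary field invented for this sketch.
    log = list(doc.get("merge_log", []))
    log.append({"parents": sorted(parent_revs)})
    new_doc["merge_log"] = log
    return new_doc


def already_merged(doc, rev):
    """True if rev was recorded as a parent of some earlier merge."""
    return any(rev in entry["parents"] for entry in doc.get("merge_log", []))


a2 = {"_id": "contact:alice", "name": "Alice", "picture": "alice.png"}
# A3 merges in the phone number from B2 and records both parents.
a3 = record_merge(a2, {"phone": "555-0100"}, parent_revs=["A2", "B2"])
```

When a later revision descending from B2 arrives, the application can check `already_merged(a3, "B2")` and use B2 as the common ancestor rather than merging from scratch.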

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds