1. Revisions which are "deleted" are simply marked as such. They are not removed from the history graph.
2. Marking a revision "deleted" is itself a revision update. One way to think of it is as a Git annotated tag.
Conceptually, a delete could look like this:
A1 -> B2 -> B2_Deleted
So yes, if you pull again and you see B3, no problem. Just put it on the revision tree. B3 and B2_Deleted are siblings with parent B2.
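As a rough illustration of the tree described above, here is a minimal Python sketch. The `RevTree` class and its methods are hypothetical, not CouchDB's actual data structure or API; the revision names are the ones used in this thread.

```python
class RevTree:
    """Toy revision tree: each revision id maps to its parent and a deleted flag."""

    def __init__(self):
        self.parent = {}   # rev id -> parent rev id (None for the root)
        self.deleted = {}  # rev id -> True if this revision is a deletion marker

    def add(self, rev, parent=None, deleted=False):
        if parent is not None and parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        self.parent[rev] = parent
        self.deleted[rev] = deleted

    def children(self, rev):
        return sorted(r for r, p in self.parent.items() if p == rev)

    def leaves(self):
        parents = set(self.parent.values())
        return sorted(r for r in self.parent if r not in parents)


tree = RevTree()
tree.add("A1")
tree.add("B2", parent="A1")
tree.add("B2_Deleted", parent="B2", deleted=True)  # the delete is just another revision
tree.add("B3", parent="B2")                        # arrives on a later pull

siblings = tree.children("B2")
live_leaves = [r for r in tree.leaves() if not tree.deleted[r]]
```

Here `siblings` holds both `B2_Deleted` and `B3`, and nothing was removed from the graph to make room for `B3`.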
You do merge A3 and B3 "from scratch." This is the difference from Git. In Git, the history is the whole point. But with arbitrary records and application-specific merge semantics, it's not as meaningful that B3 and A3 have a common parent. The application is still probably more interested in the difference between the *data* than in the ancestral history. (Not to mention, CouchDB is difficult enough without HTML/JavaScript programmers having to negotiate dependency trees in their code.)
For example, what if B3 stores "delete_account: true" which means in this application that the user wishes to completely remove his account and delete all of his data. Clearly that fact is dominant in the merge/conflict-resolution strategy and it hardly matters what the history graph looks like.
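A data-driven resolver of that kind might look like the following sketch. The `resolve` function and the sample documents are made up for illustration; only the `delete_account` field comes from the example above, and the fallback (a naive field-wise union) is an assumption, not anything CouchDB prescribes.

```python
def resolve(conflicting_revs):
    """Pick or build a winner from conflicting revision bodies (plain dicts)."""
    # Dominant fact: any revision requesting account deletion wins outright,
    # regardless of where it sits in the history graph.
    for doc in conflicting_revs:
        if doc.get("delete_account") is True:
            return {"delete_account": True}
    # Otherwise fall back to a naive union; later entries override earlier ones.
    merged = {}
    for doc in conflicting_revs:
        merged.update(doc)
    return merged


a3 = {"name": "Alice", "email": "alice@example.com"}
b3 = {"name": "Alice", "delete_account": True}
winner = resolve([a3, b3])
```

Note that the resolver never looks at ancestry; it only inspects the data in the conflicting revisions.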
Finally, the history graph is always there for the client (replicator). There is nothing stopping an application-level `git-rerere` implementation--perhaps as a library which all developers could use.
(Note, I'm not saying explicit merge info isn't useful. For all I know it was overlooked in the original architecture. I'm just saying it's not a showstopper.)
The Cr-48 and Chrome OS: Google's vision of the net
Posted Jan 23, 2011 2:24 UTC (Sun) by njs (guest, #40338)
> So yes, if you pull again and you see B3, no problem. Just put it on the revision tree. B3 and B2_Deleted are siblings with parent B2.
Sure, that makes sense.
> The application is still probably more interested in the difference between the *data* rather than the ancestral history. [...] For example, what if B3 stores "delete_account: true" which means in this application that the user wishes to completely remove his account and delete all of his data. Clearly that fact is dominant in the merge/conflict-resolution strategy and it hardly matters what the history graph looks like.
Hrm, so what I'm basically hearing is that CouchDB's data/synchronization model in practice is:
-- You have a bunch of records which must be independent (there's no way that synchronization can respect any kind of referential integrity)
-- When synchronizing, CouchDB will automatically identify the "latest" version of any given record (via "fast-forward merge"); if anything more complicated happens, then the app is on its own. And if the app *wanted* to use a proper merge algorithm to, say, notice that the user added a phone number to this contact on their phone and also added a user picture on their computer and those edits can easily be combined -- then it's sort of doomed, because the needed history information just isn't recorded; it's expected that apps will mostly do the equivalent of two-way merge, or that divergence will be rare enough that even a crippled three-way-merge will be good enough.
Does that sound right?
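To make the contact example concrete, here is a sketch of the field-wise three-way merge that needs the common ancestor. The function name and the flat-dict record shape are assumptions for illustration; a missing key is treated the same as `None`.

```python
def three_way_merge(base, ours, theirs):
    """Field-wise three-way merge of flat dicts. Returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o            # both sides agree
        elif o == b:
            value = t            # only theirs changed this field
        elif t == b:
            value = o            # only ours changed this field
        else:
            conflicts.append(key)  # both changed it differently
            continue
        if value is not None:      # None means the field is absent or deleted
            merged[key] = value
    return merged, conflicts


base = {"name": "Bob"}
on_phone = {"name": "Bob", "phone": "555-0100"}
on_computer = {"name": "Bob", "picture": "bob.png"}
merged, conflicts = three_way_merge(base, on_phone, on_computer)
```

Without `base`, a two-way comparison cannot tell whether the phone added `phone` or the computer deleted it, which is the information loss being described.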
My wild guess is that in practice this architecture works fine for data that's inherently loosely coupled, and where edits are rare relative to the size of the data store and the synchronization frequency. This probably covers all the main data people are replicating these days -- bookmarks, contact lists, mail with their phone -- but I'm not sure how far you can stretch it. And it's possible to do *much* better, without adding much -- if any -- complexity.
Monotone, for instance, implements a distributed data store with complex structure (a tree of files/directories, each of which can have an arbitrary set of attributes), referential integrity constraints (can't have two files with the same name, no directory loops, etc.), and a very cheap, fully history-sensitive automated merger with a rigorous mathematical foundation ("mark merge").
> There is nothing stopping an application-level `git-rerere` implementation--perhaps as a library which all developers could use.
Sure, but git-rerere is a total hack whose purpose is to prevent Linus from seeing merge nodes. I don't imagine your end-users really have any aesthetic preferences about the shape of the graph buried inside CouchDB :-). Wouldn't it make even *more* sense for CouchDB to just store the relevant information out of the box?
Your point about not wanting to confuse developers is well-taken, but I feel like they'd be better off overall if they had better tools, instead of having to implement these things themselves.
So, hmm. Thanks for giving me something to think about!
The Cr-48 and Chrome OS: Google's vision of the net
Posted Jan 23, 2011 2:45 UTC (Sun) by jhs (guest, #12429)
I think your assessment is close enough that further nitpicking of the minutiae wouldn't be productive. Two final thoughts:
If an improvement to the architecture is possible, the community would likely be very open to it. It would be a big change requiring good justification, but the community is open and flexible. I think the main problem right now is that tooling ("merge" libraries, development/debugging tools) is so much more primitive than the core database that the question is somewhat moot. The wiki's description of conflict resolution is a piece of pseudocode. In the future, it will say "If you use C, use this library; if you use Ruby, use this other." When that happens, the pain point of the history graph may become dominant.
Tracking true "merges" is possible in "user space" if you will. Like a shadow government, the client can simply track its own history graph using its own mechanism. (In this case, it's just like Couch except merges are recorded as such.) The data is simply a normal key/val part of the record. If the algorithm proves to be superior, it could be baked into Couch. (The advantage of Couch's revision tree is, like Unix dotfiles, it is not transmitted to the client unless explicitly requested. Otherwise it's a normal key/val datum called IIRC `_revs_info`.)
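A minimal sketch of that user-space bookkeeping, assuming a made-up `merged_from` field stored alongside the application data. The field name, helper functions, and revision ids are hypothetical and not part of CouchDB.

```python
def make_merge_doc(body, merged_revs):
    """Return a new document body that records which revisions it merged."""
    doc = dict(body)
    doc["merged_from"] = sorted(merged_revs)  # hypothetical field, ordinary key/val data
    return doc


def already_merged(doc, rev):
    """True if this document says it already incorporates the given revision."""
    return rev in doc.get("merged_from", [])


merged_body = {"name": "Bob", "phone": "555-0100", "picture": "bob.png"}
doc = make_merge_doc(merged_body, ["3-b3", "3-a3"])
```

Because the merge parents travel with the record like any other field, a client library could consult them on the next conflict without any change to the database itself.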