Posted Sep 11, 2012 23:55 UTC (Tue) by martin.langhoff (subscriber, #61417)
Parent article: Bazaar on the slow track
Back when the BitKeeper debacle hit, Bazaar was Bazaar-NG, starting off as a fork of Tom Lord's TLA. At the time, I was struggling to maintain a series of private trees with a small team of programmers, and paying a lot of attention to what was happening in the VCS space.
When the git wave hit, it looked as if Bazaar took a while to figure out the internal data structures of git, and eventually learned a lot from them. They are not identical, of course, but the TLA data structures were a horridly bad fit for the job.
There was a long period of confusion -- circa 2006. All the while git's usability was, um, bad, but its storage was rock-solid. Bazaar's storage was not considered very reliable, even by Martin Pool. This was the time when X.org, Mozilla and other high-profile / large repo projects were looking to migrate.
Once git's CLI UI started getting a bit more polished, around git 1.5, the "slightly better UI" justification for many of the other VCSs started to dry up. VCSs are specialized tools, so when you invest in learning one (and you have to), a slightly steeper learning curve is very often worth it. So all git needed to do was to get close in usability to the others -- and it did. At that point, git's sheer flexibility and power closed the deal.
One VCS I do miss is darcs. Its internal data structures were flawed, but the UI was oh so elegant, especially for those of us suffering with TLA. Not sure if it was original, but I do think it set the UI standard for Mercurial and Bazaar (and git, once it got to the "make it usable" stage).
I am actually surprised that the Bazaar people haven't traded their core engine for git. If you want to support features git won't give you (explicit renames, for example :-) ) you can attach extra bits of metadata to trees and commit objects.
You can use that trick to prototype DVCS feature ideas pretty quickly, and you could implement a very good "usable-first" DVCS. There are many git "wrappers" that aim to preserve git purity (that is, they are compatible with standard git usage). That puts a lot of limitations on what you can do. If you break that taboo, git can be an outstanding storage engine for a very fancy DVCS.
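To make the "extra bits of metadata" idea concrete, here is a minimal Python sketch that builds a git-style commit object carrying one additional header line and computes its object ID. The header name `x-rename`, the helper names and the sample author string are all invented for this illustration; stock git knows nothing about such a header, and a real tool would have to define its own conventions.

```python
import hashlib


def git_object_id(obj_type, body):
    """Hash an object the way git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit(tree, parents, author, message, extra_headers=()):
    """Serialize a commit body; extra_headers is a hypothetical extension point."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}"]
    # Assumption: extra metadata rides along as additional header lines.
    lines += [f"{key} {value}" for key, value in extra_headers]
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# The ID of the empty tree, computed rather than hard-coded.
EMPTY_TREE = git_object_id("tree", b"")
AUTHOR = "A U Thor <author@example.com> 1347400000 +0000"  # made-up identity

plain = build_commit(EMPTY_TREE, [], AUTHOR, "move file")
annotated = build_commit(
    EMPTY_TREE, [], AUTHOR, "move file",
    extra_headers=[("x-rename", "old.c new.c")],  # hypothetical header
)

print(git_object_id("commit", plain))
print(git_object_id("commit", annotated))
```

Because the extra header is part of the hashed body, the annotated commit gets a different ID from the plain one, which is exactly why a tool doing this stops being interchangeable with plain git usage.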
{ I am the author of importers to git and of git-cvsserver, as well as patches to git and the long-obsoleted cg. I helped several high-profile projects evaluate git and import their repos (with mixed results). And I am always looking for time to try to fix some git usability pet peeves. }
Posted Sep 12, 2012 3:48 UTC (Wed) by pabs (subscriber, #43278)
Speaking of git importers, I really miss good interop between git and other VCSen. For example git-remote-bzr is a bit buggy, needs a new maintainer in Debian and has some gaps in its mappings between git and bzr concepts. git-remote-hg still isn't merged into git upstream and git-remote-darcs doesn't exist.
Bazaar on the slow track -- history notes
Posted Sep 13, 2012 1:19 UTC (Thu) by martin.langhoff (subscriber, #61417)
Please do report bugs on the git mailing list. IME, git has the best interop tools I've seen -- they only get better if you make noise on the mailing list about the bugs you find...
Bazaar on the slow track -- history notes
Posted Sep 12, 2012 5:06 UTC (Wed) by abentley (subscriber, #22064)
Bazaar-NG was never a fork of Tom Lord's TLA. That fork was Baz, which was called "Bazaar" at the time. Bazaar-NG was always a from-scratch design intended to support lossless imports from Baz.
I don't think Martin considered Bazaar-NG's storage unreliable. Bazaar-NG was self-hosting in March after 3 months of development, and Martin wouldn't have done that if he didn't trust it. His original web site warned "This is pre-release unstable code. Keep backups of any important information.", but I think this was just an overabundance of caution.
One of the reasons Bazaar(-NG) didn't switch to git's core was because git didn't provide a library. And even if it had, it would have been in C, not Python. But we also wanted something that worked with our data model. We felt we could do at least as well as git in storing data, and I've never had reason to doubt the 2a format's efficiency.
Bazaar on the slow track -- history notes
Posted Sep 13, 2012 1:51 UTC (Thu) by martin.langhoff (subscriber, #61417)
My memory was fuzzy on the baz/Bazaar-NG distinction. Thanks for clarifying.
On the reliability of storage and Martin Pool's regard for it... I have an anecdote :-)
I was sitting in Martin Pool's presentation at linux.conf.au 2006 (Dunedin, NZ). From the back of the room, in the Q&A part of the session, someone asked: "so, is it ready for real work? You see, I have this large codebase that's been developed for 25+ years. After several VCS migrations, it's in CVS with a messy repo due to migrations. We are a widely distributed team, and we are hurting. Should I be migrating to bzr now?"
Martin looked rather uncomfortable with the question, and muttered something like "not really, not yet". He had already been less than reassuring when I had asked whether Bazaar storage was delta-centric (darcs-like) or snapshot centric (git-like).
The "is it ready for real work?" question had come from Jim Gettys, whom I did not know personally at the time. After the talk I asked him whether he had been talking about X.org and whether he could give me access to those messy X.org CVS repos. I would try importing them into git, and we could see if he liked the outcome.
It was the start of a long hard road -- it led to many improvements to git-cvsimport, yet the migration was done with parsecvs (written by Keith Packard).
I was at linux.conf.au to run a workshop on git; Linus joined us, so it stretched from 2 to 4 hours. We had a much smaller room assigned than Bazaar, but you could feel we were rocking and rolling :-) I believe Matt Mackall was there too, talking about Mercurial, but I missed it.
This happened long ago -- and this is how I remember it. Quotes are as best as I can recall.
In my view, 2006/2007 was the time when the overall trends in the DVCS space got established; X.org migrated to git, Mozilla ran high-profile bakeoffs between DVCSs, etc. And at that time Bazaar was unfortunately on unsure footing (bad timing!). As a result, Git and Mercurial generally stole the show...
Bazaar on the slow track -- history notes
Posted Sep 13, 2012 7:34 UTC (Thu) by mbp (subscriber, #2737)
I think my discomfort would have been about performance for the size of the tree and history they were talking about, not about reliability. bzr was not ready for big trees at that time in 2006.
bzr has always had snapshot storage and never been darcs-like.
I reject, and resent, the implication that I publicly advocated something I privately didn't think was reliable.
Bazaar on the slow track -- history notes
Posted Sep 13, 2012 13:50 UTC (Thu) by martin.langhoff (subscriber, #61417)
My apologies. I did not mean to cause offense. Time has its way of distorting memories, perhaps you or others have a different recollection?
My impression after your talk back then was that perhaps Bazaar-NG was performing or planning internal storage changes (or something like that) and that at that particular time those were awkward questions. Not that you did not trust or promote Bazaar, but that you were stating "not right now".
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 12, 2012 16:38 UTC (Wed) by walex (subscriber, #69836)
I like your discussion, and in particular the emphasis on storage structure as well as functionality. One of the big issues with SVN for example is the enormous number of small state files it creates in the working copy, and the rather inefficient repository storage too, once they deprecated the DB files.
I think that most recent Bazaar storage structures are not too bad, but that the Mercurial one is pretty terrible, the Git one is so-so, and by far the best is that used by Monotone, which is a single Sqlite file per repository. That makes tree searches, backups, and in general all whole-repository and filetree oriented operations a lot faster and easier.
Also Git and Monotone are implemented fairly well in compiled languages, and can be much faster than Python-implemented Bazaar and Mercurial, even if a bit more care with the latter two has improved the situation.
Interestingly Monotone, which inspired a lot of the design of Git, is also functionally rather complete, and works well, and I think that it is for most projects the most appropriate VCS, followed by Git itself, then Bazaar and not far behind Mercurial.
It is a pity that TLA and DARCS are more often mentioned than Monotone, which has a pretty deliberate, careful design and implementation, even if it is one of the Gang Of Four major modern DVCSes.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 12, 2012 17:20 UTC (Wed) by BlueLightning (subscriber, #38978)
I can't speak to the current state of Monotone - it's not unlikely it has improved, but at the time when the OpenEmbedded project used it (some years ago now) it was agonisingly slow with large codebases. In fact it was so bad that instead of doing the initial fetch of the repository we got people to download a snapshot and then update from that.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 12, 2012 23:15 UTC (Wed) by akupries (subscriber, #4268)
> storage structures [...] by far the best is that used by Monotone, which is a single Sqlite file per repository.
Richard Hipp, SQLite's author, wrote an SCM that uses a single SQLite file per repository as well. It is called Fossil, and SQLite's own source is now managed with it.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 13, 2012 1:58 UTC (Thu) by martin.langhoff (subscriber, #61417)
AIUI, Monotone using SQLite is unfortunately a disaster in terms of performance. Linus tried it early, before starting git; liked some of the design decisions, abhorred others (signing everything, SQLite).
Monotone using SQLite is a boon for programmers. SQL is easier to wrestle than complex on-disk and in-memory data structures, especially if you are changing the layout. But git's design learned from many sources (including Monotone) and had a pretty set data structure from the beginning.
With that clearly-defined data structure, Linus and other kernel hackers cranked out very efficient code. IIRC, Monotone used to take hours to import _one_ snapshot of a kernel, where git could do it in <10s.
See the very very early emails in the git list, by Linus, on his design research and early tests with monotone.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 13, 2012 7:54 UTC (Thu) by graydon (subscriber, #5009)
No, monotone was never that slow. It handled kernel trees (at least at the time I was working on it; those and gcc trees were the benchmark testcases). Linus likes to exaggerate this matter, along with chastising us for our terrible "object oriented C++" code and various other hyperbole. I assume he didn't actually read it.
That said, monotone was unusably slow _when compared to git_, and as project histories and development parallelism have grown, that delta has become an easy and correct criterion for picking git for production in most cases. Git also picked a more sensible branch-naming model (local, per-repo, no PKI; less ambitious but easier and more powerful), embraced history-rewriting early and aggressively, had the benefit of hindsight in most algorithms, declined to bother tracking object identity (turns out to cost more performance than it's worth), figured out submodules, etc. etc. Git won this space hands down. There's no point competing with it anymore, imo.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 15, 2012 20:54 UTC (Sat) by cmccabe (guest, #60281)
I always get a chuckle when people try to compare the speed of most DVCSes to git. It's like hearing someone say his bicycle can out-perform your Ferrari -- if there's a good tail-wind, and the road is downhill, and maybe if Lance Armstrong is the rider... Kudos to you for being realistic about the issue. And monotone was cool, back in the day.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 17, 2012 15:16 UTC (Mon) by zooko (subscriber, #2589)
If I recall correctly on the day (weekend?) that Linus tried monotone, the then-current release of monotone had some diagnostic/debugging/profiling code compiled in which caused it to have superlinear runtime for some computation or other. Correct me if I'm wrong, Graydon, as I think what I'm recalling is from something you wrote shortly thereafter.
It's one of those "for want of a nail the horseshoe was lost" kinds of moments in history -- if monotone had been fast enough for Linus to use at that time then presumably he never would have invented git.
And while *most* of the good stuff that the world has learned from git is stuff that git learned from monotone, I do feel a bit of relief that we have git's current branch naming scheme. Git's approach is basically to not try to solve it, and make it Someone Else's Problem. That sucks, it leads to ad-hoc reliance on DNS/PKI, and it probably contributes to centralization e.g. github, but at least there is an obvious spot where something better could be plugged in to replace it. If we had monotone's deeper integration into DNS/PKI (http://www.monotone.ca/docs/Branches.html), it might be harder for people to understand what the problem is and how to change it.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 15:25 UTC (Tue) by graydon (subscriber, #5009)
I don't think it was just a matter of a missing nail in a horseshoe. If I _had_ to point to a single matter, it would center on identity-tracking (that is, not "just dumb content tracking"). We initially just did content-tracking alone -- which was orders of magnitude faster -- and found that we were stapling together too many ad-hoc merge "algorithms" to reconstruct events like file and directory renames, and Java users were complaining about the inaccuracy of those, so we wound up building a large (and as it turns out, computationally expensive) secondary layer of logic concerned with file and directory object lifecycles. That's probably the source of the lion's share of the costs; but even if we hadn't done that, I'm sure the amount of in-memory transformations and verification, data re-parsing, crypto, and simple buffer-copying / sqlite-IO would probably have doomed us going up against kernel engineers. They think in a much closer to "zero copies and never calculate anything twice" mode. Very hard to compete with, given my background and coding style. I'm happy to yield defeat on implementation here; git _flies_. Very impressive implementation (though I do wish it'd integrate rolling-checksum fragment-consolidation in its packfiles, a la bup).
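The trade-off between pure content tracking and identity tracking can be seen in a toy example. The Python sketch below guesses renames between two snapshots purely by matching content hashes; it is not monotone's or git's actual algorithm (git's real rename detection also scores similar, non-identical content), and the function names and sample paths are invented for illustration.

```python
import hashlib


def snapshot(files):
    """Map each path to the SHA-1 of its content (a 'dumb content' snapshot)."""
    return {path: hashlib.sha1(data).hexdigest() for path, data in files.items()}


def guess_renames(old, new):
    """Pair removed and added paths whose content hashes are identical."""
    removed = {p: h for p, h in old.items() if p not in new}
    added = {p: h for p, h in new.items() if p not in old}
    by_hash = {}
    for path, digest in removed.items():
        by_hash.setdefault(digest, []).append(path)
    renames = []
    for path, digest in sorted(added.items()):
        candidates = by_hash.get(digest)
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames


# A pure move is recovered after the fact, with no stored identity.
before = snapshot({"Foo.java": b"class Foo {}", "README": b"hello"})
moved = snapshot({"src/Foo.java": b"class Foo {}", "README": b"hello"})
print(guess_renames(before, moved))

# A move combined with an edit is invisible to exact matching.
edited = snapshot({"src/Foo.java": b"class Foo { int x; }", "README": b"hello"})
print(guess_renames(before, edited))
```

The second case is the kind of miss that heuristics have to paper over, and that explicit identity tracking avoids at the price of extra bookkeeping on every operation.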
All that's a distraction though, at this stage. Git won; but there's more to do. I agree with you that the residual/next/larger issue is PKI and naming. Or rather, getting _rid_ of PKI-as-we-have-tried-it and deploying something pragmatic, decentralized and scalable in its place for managing names-and-trust. The current system of expressing trust through x.509 PKI is a joke in poor taste, and git (rightly) rejects most of that in favour of the three weaker more-functional models: the "DNS and soon-to-be-PKI DNSSEC+DANE" model of global-name disambiguation, the "manual ssh key-exchange with sticky-key-fingerprints" model of endpoint transport security, and the (imo strictly _worse_) "GPG web of trust" model for long-lived audit-trails. The three of these systems serve as modest backstops to one another but I still feel there's productive work to do exploring the socio-technical nexus of trust-and-naming as a more integrated, simplified, decentralized and less random, more holistic level (RFCs 2693 and 4255 aside). There are still too many orthogonal failure modes, discontinuities and security skeuomorphisms; the experience of naming things, and trusting the names you exchange, at a global scale, still retains far too much of the sensation of pulling teeth. We wind up on IRC with old friends pasting SHA-256 fingerprints of things back and forth and saying "this one? no? maybe this one?" far too often.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 18:59 UTC (Tue) by jackb (subscriber, #41909)
> All that's a distraction though, at this stage. Git won; but there's more to do. I agree with you that the residual/next/larger issue is PKI and naming. Or rather, getting _rid_ of PKI-as-we-have-tried-it and deploying something pragmatic, decentralized and scalable in its place for managing names-and-trust. The current system of expressing trust through x.509 PKI is a joke in poor taste, and git (rightly) rejects most of that in favour of the three weaker more-functional models: the "DNS and soon-to-be-PKI DNSSEC+DANE" model of global-name disambiguation, the "manual ssh key-exchange with sticky-key-fingerprints" model of endpoint transport security, and the (imo strictly _worse_) "GPG web of trust" model for long-lived audit-trails. The three of these systems serve as modest backstops to one another but I still feel there's productive work to do exploring the socio-technical nexus of trust-and-naming as a more integrated, simplified, decentralized and less random, more holistic level (RFCs 2693 and 4255 aside). There are still too many orthogonal failure modes, discontinuities and security skeuomorphisms; the experience of naming things, and trusting the names you exchange, at a global scale, still retains far too much of the sensation of pulling teeth. We wind up on IRC with old friends pasting SHA-256 fingerprints of things back and forth and saying "this one? no? maybe this one?" far too often.
My theory is that PKI doesn't work because it is based on a flawed understanding of what identity actually means.
The fraction of the population that really understands what it means to assign cryptographic trust to a key is statistically indistinguishable from "no one". Maybe the reason that the web of trust we've been promised since the 90s hasn't appeared yet is because the model itself is broken.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 19:43 UTC (Tue) by hummassa (subscriber, #307)
> The fraction of the population that really understands what it means to assign cryptographic trust to a key is statistically indistinguishable from "no one". Maybe the reason that the web of trust we've been promised since the 90s hasn't appeared yet is because the model itself is broken.
Ok, but... what is the alternative?
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 20:05 UTC (Tue) by jackb (subscriber, #41909)
Now that people are carrying mobile, internet-connected computers around with them basically all the time, key signing can be automated.
The question of "does the person standing in front of me control a particular private key" can be answered by having each person's smartphone sign a challenge and exchange keys via QR codes (or Bluetooth, NFC, etc.). This step should require very little human interaction.
That question, however, does not establish an identity as we humans understand it. Identity between social creatures is a set of shared experiences. The way that you "know" your friends is because of your memories of interacting with them.
Key signing should be done in person and mostly handled by an automated process. Identity formation is done by having the users verify facts about other people based on their shared experiences.
If properly implemented the end result would look a lot like a social network that just happens to produce a cryptographic web of trust as a side effect.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 20:23 UTC (Tue) by graydon (subscriber, #5009)
I agree. My hunch (currently exploring in code) is that a more useful model involves defining trust in reference to cross-validation between multiple private small-group communication-histories. Put another way: identity should adhere to evidence concerning communication-capability (and the active verification thereof), not evidence of decrypting long-lived keys. Keys should always be ephemeral. They'll be broken, lost or stolen anyways; best to treat them as such.
(Keep in mind how much online-verification comes out in the details of evaluating trust in our key-oriented PKI system anyways. And how often "denying a centralized / findable verification service" features in attack scenarios. Surprise surprise.)
So, I also expect this will require -- or at least greatly benefit from -- a degree of "going around" current network infrastructure. Or at least a willingness to run verification traffic over a comfortable mixture of channels, to resist whole-network-controlling MITMs (as the current incarnation of the internet seems to have become).
But lucky for our future, communication bandwidth grows faster than everything else, and most new devices have plenty of unusual radios.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 20:25 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
PKI is a failure on all levels, starting from technical and going up to the social/management level.
For example, is there anybody here who can claim enough ASN.1 knowledge to parse encoded certificates and keys? I certainly can't; every time I need to generate a CSR or a key, I go to Google and search for the required command line to make OpenSSL spit out the magic binhex block.
Then there's a problem with lack of delegation. It's not possible to create a master cert for "mydomain.com" which I then can use to sign "host1.mydomain.com" and "host2.mydomain.com".
And so on. I'd gladly help a project to replace all this morass with clean JSON-based certificates with clear human-readable encoding.
Bazaar on the slow track -- Monotone gets too little attention
Posted Sep 18, 2012 21:16 UTC (Tue) by jackb (subscriber, #41909)
I think there are two components necessary to build a web of trust that real people will actually use. The first is the automated in-person key signing that I described in an earlier post. The second is an online database of facts about a particular identity.
The database would consist of one table that associates arbitrary text strings with public key IDs, and another table containing cryptographically-signed affirmations or refutations of the entries in the first table.
An example of an arbitrary text string could be a legal name, an email address, "inventor of the Linux kernel", "CEO of Acme, Inc.", etc.
Everybody is free to claim anything they want, and everyone else is free to confirm or refute it. A suitable algorithm would be used to sort out these statements based on the user's location in the web of trust to estimate the veracity of any particular statement.
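A rough sketch of the two tables and a toy scoring rule, using Python's built-in `sqlite3` module, might look like the following. Everything here is an assumption made for illustration: the table and column names, the `trust` weights standing in for "the user's location in the web of trust", and above all the `signature` column, which holds a placeholder and is never verified.

```python
import sqlite3

SCHEMA = """
CREATE TABLE claims (
    id        INTEGER PRIMARY KEY,
    key_id    TEXT NOT NULL,      -- public key the statement is about
    statement TEXT NOT NULL       -- arbitrary text, e.g. a legal name
);
CREATE TABLE attestations (
    claim_id  INTEGER NOT NULL REFERENCES claims(id),
    signer    TEXT NOT NULL,      -- key ID of whoever affirms or refutes
    verdict   INTEGER NOT NULL CHECK (verdict IN (-1, 1)),
    signature BLOB NOT NULL,      -- placeholder only; NOT checked in this sketch
    PRIMARY KEY (claim_id, signer)
);
"""


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def add_claim(db, key_id, statement):
    cur = db.execute(
        "INSERT INTO claims (key_id, statement) VALUES (?, ?)", (key_id, statement)
    )
    return cur.lastrowid


def attest(db, claim_id, signer, verdict, signature=b"unsigned-placeholder"):
    """Record an affirmation (+1) or refutation (-1); one per signer per claim."""
    db.execute(
        "INSERT OR REPLACE INTO attestations VALUES (?, ?, ?, ?)",
        (claim_id, signer, verdict, signature),
    )


def score(db, claim_id, trust):
    """Sum verdicts weighted by how much the viewer trusts each signer."""
    rows = db.execute(
        "SELECT signer, verdict FROM attestations WHERE claim_id = ?", (claim_id,)
    )
    return sum(trust.get(signer, 0.0) * verdict for signer, verdict in rows)


db = open_db()
claim = add_claim(db, "KEY_A", "inventor of the Linux kernel")
attest(db, claim, "KEY_B", +1)
attest(db, claim, "KEY_C", -1)
print(score(db, claim, {"KEY_B": 1.0, "KEY_C": 0.25}))
```

The same claim scores differently for different viewers, since each viewer supplies their own `trust` weights; signers the viewer has no path to simply count for nothing.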
The value of the web of trust depends on getting people to actually use it, so the tools for managing it would need to be enjoyable to work with instead of painful. That's one reason I think the user interface should be similar to a social network: the empirical evidence suggests that people like using Facebook more than they like using GPG or OpenSSL. The other reason is that social networks better model how people actually interact in real life, so making the web of trust operate that way is more intuitive.
Bazaar on the slow track -- history notes
Posted Sep 17, 2012 11:52 UTC (Mon) by douglasbagnall (subscriber, #62736)
> I am actually surprised that the Bazaar people haven't traded their core engine for git.
The other day I heard a Canonical employee advocating bzr as a git front-end. The argument was that nobody suffers if you use Bazaar locally and Git remotely, so bzr people should just do that and stop fussing. As you suggest, they may have been glossing over incompatibilities in the models, or perhaps they haven't hit them in practice.