
Re: Index/hash order

From:  Linus Torvalds <torvalds-AT-osdl.org>
To:  "H. Peter Anvin" <hpa-AT-zytor.com>
Subject:  Re: Index/hash order
Date:  Wed, 13 Apr 2005 10:24:53 -0700 (PDT)
Cc:  Ingo Molnar <mingo-AT-elte.hu>, git-AT-vger.kernel.org



On Wed, 13 Apr 2005, H. Peter Anvin wrote:
> 
> I see what you mean.  Do remember, however, that the fact that the blobs 
> are compressed is part of the argument as to why there is no need to do 
> xdelta-type incremental storage.

No.

The reason for not doing deltas is not "because we compress stuff we don't 
need it".

The reason for not doing deltas is purely about consistency, speed, and 
distribution. Compression is not it.

The reason I rejected deltas out-of-hand in the design was:

 - I want top-of-tree to be fast. And by "fast" I mean so frigging 
   unbelievably fast that I feel confident that nothing that gives the
   same kind of consistency guarantees can top it (that said, I'll also 
   freely admit that my definition of "fast" is "fast for things _I_ care
   about" ;)

   This means that a delta format just isn't acceptable. Either you have 
   to build up the top based on history (forward-moving deltas), which
   clearly does not scale performance-wise, or you have to re-base the
   deltas and keep the top up-to-date and make the slowdown happen for 
   old revisions.

   Making older revisions slower is fine by me, but it fails my second 
   basic requirement:

 - I want things to distribute well. This means that it has to be based 
   on an "append data" model, where historical data never changes, and you 
   only append on top of it (either by adding totally new files, or by 
   just letting the files grow).

   This works in a forward-delta environment (which is fundamentally based 
   on the notion of "we know the old version, we're adding new stuff on
   top of it"), but does _not_ work in the backwards model of "we keep the
   old history as a delta against the new" model.
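
To make the contrast in the two points above concrete, here is a minimal
Python sketch. It is an illustration under stated assumptions, not how git
or any real SCM stores data: AppendOnlyStore, BackwardDeltaStore, and the
toy prefix-based delta are all hypothetical names invented for this example.

```python
import hashlib


def make_delta(old: bytes, new: bytes):
    """Toy delta (hypothetical): shared-prefix length plus the rest of `old`."""
    n = 0
    while n < min(len(old), len(new)) and old[n] == new[n]:
        n += 1
    return n, old[n:]


def apply_delta(delta, new: bytes) -> bytes:
    """Rebuild the older revision from the newer one and a toy delta."""
    n, tail = delta
    return new[:n] + tail


class AppendOnlyStore:
    """Whole snapshots keyed by content hash; stored entries never change."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self.objects.setdefault(key, content)  # never overwrites an entry
        return key


class BackwardDeltaStore:
    """Only the newest revision is whole; each older revision is stored
    as a delta against its successor."""

    def __init__(self):
        self.records = []  # ("full", bytes) or ("delta", (n, tail))

    def add(self, content: bytes) -> list:
        rewritten = []
        if self.records:
            prev = len(self.records) - 1
            old = self.records[prev][1]
            # The previous top must be re-encoded against the new top.
            self.records[prev] = ("delta", make_delta(old, content))
            rewritten.append(prev)
        self.records.append(("full", content))
        return rewritten  # indices of existing records that were modified

    def get(self, index: int) -> bytes:
        # Reading the top is immediate; older revisions walk the chain.
        content = self.records[-1][1]
        for i in range(len(self.records) - 2, index - 1, -1):
            content = apply_delta(self.records[i][1], content)
        return content
```

In this sketch, every add() on the backward-delta store modifies a record
that was already stored, so a mirror cannot get by with fetching only new
data. The append-only store only ever gains keys.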

In other words, I don't dislike deltas per se. But they are fundamentally
incompatible with the very purpose of "git", so git does not use them.

Now, it's quite possibly a _wonderful_ idea to use deltas for git-to-git
synchronization. For example, one of the nice properties of "git" is
exactly the fact that the data involved _fully_ determines all objects. So
let's say that you already have the parent version of a commit: you do not
have to send the full object database to synchronize, you really _can_
send just the diff of the data and the file structure (modes) and the
exact commit object (*), and the receiving side can then re-create the
rest from the git database it already has. 

And this is all possible exactly because git does not pollute the git 
objects with _anything_ other than their contents and has a fixed method 
for re-creating them.
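
As a rough illustration of "the data fully determines the object", here is
a small Python sketch of content-derived naming. It assumes the blob naming
used by present-day git (SHA-1 over a "blob <size>" header, a NUL byte, and
the content); the exact scheme in the code at the time of this message may
have differed. The helper names blob_id and receive_blob are hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Name a blob purely from its content (present-day git blob format)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def receive_blob(content: bytes, expected_id: str) -> bytes:
    """Hypothetical receiver step: re-create the name locally and check it
    against the name the sender referred to."""
    actual = blob_id(content)
    if actual != expected_id:
        raise ValueError("content does not match object name " + expected_id)
    return content


if __name__ == "__main__":
    data = b"int main(void) { return 0; }\n"
    name = blob_id(data)
    # However the bytes arrived (whole file or rebuilt from a diff),
    # the same content always yields the same name.
    receive_blob(data, name)
    print(name)
```

Because the name depends on nothing but the content, a receiver that
rebuilds a file from a diff can check for itself that it ended up with
exactly the object the sender meant.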

So if you want to do a "git-sync" protocol that sends deltas back and 
forth, that is quite possible, and is totally independent of the fact that 
the git database itself is designed to be totally stable.

In fact, the total stability of the git database is a huge boon. It means 
that while a "git-sync" is going on, the synchronization process in _no_ 
way needs to worry about any writes happening to the git database on 
either end. In other words, you can synchronize a git database with no 
locking _whatsoever_.
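
The no-locking argument can be sketched in a few lines of Python. This is
a toy model, not a real protocol: plain dicts stand in for the two object
databases, and put and sync are hypothetical helpers. The key assumption is
the one stated above, that objects are immutable and named by their
content, so a write can only ever add a key.

```python
import hashlib


def put(db: dict, content: bytes) -> str:
    """Add an object under a content-derived name; never overwrites."""
    key = hashlib.sha1(content).hexdigest()
    db.setdefault(key, content)
    return key


def sync(src: dict, dst: dict) -> int:
    """Copy every object that dst lacks. No locks are taken: an object
    that already exists can never change, so copying it is always safe."""
    copied = 0
    for key in list(src):  # snapshot of the names present right now
        if key not in dst:
            dst[key] = src[key]
            copied += 1
    return copied


if __name__ == "__main__":
    a, b = {}, {}
    put(a, b"first object")
    put(a, b"second object")
    print(sync(a, b))           # copies both objects
    put(a, b"written later")    # a write landing "during" or after a sync
    print(sync(a, b))           # the next pass picks up only the new one
```

A write that lands while a sync is running is at worst missed by that pass
and picked up by the next one. Nothing already copied can become stale.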

Trust me, not needing locking is a huge boon. I don't think people realize
just how much thought I've put into my database selection and what the
implications are.

It's perfect, I tell you.

		Linus



Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds