
Support large repositories!

Posted Apr 3, 2010 22:35 UTC (Sat) by RCL (guest, #63264)
Parent article: A proposed Subversion vision and roadmap

There's one thing that current DVCSes really suck at: working with large repositories, which are common in gamedev and similar areas.

E.g. recently I tried creating a git/bzr/hg repo of 22GB of files (that's not huge; real repositories that hold all the resources needed to build a game, including branches etc, are an order of magnitude larger... games themselves are several GBs of *compressed* data, don't forget...).

All three failed.

It seems like git/bzr/hg all assume that every file they're dealing with can be loaded directly into memory. That's probably why git spits "malloc returned NULL" errors when asked to add a file larger than the amount of memory available.

SHA calculation doesn't help either... Running "git status" on even a relatively small (~1GB) repository becomes uncomfortably slow.

Yet Perforce handles all this easily.

Subversion guys, do something about this (svn can handle multi-GB repos, but it is much slower than Perforce) and you'll have an edge over the DVCSes!



Support large repositories!

Posted Apr 3, 2010 23:20 UTC (Sat) by marcH (subscriber, #57642) [Link] (36 responses)

Most VCS are indeed designed for human-generated, line-oriented text files. At the core of every single VCS sit "diff" and "patch", totally useless for binaries. For binaries most people use regular filesystems and file servers, especially when the binaries are massive. Why don't you? Why insist on using the same tool for two totally different jobs?

Support large repositories!

Posted Apr 4, 2010 0:21 UTC (Sun) by brouhaha (subscriber, #1698) [Link]

Because it is desirable to have the same revision history tracking mechanism for the large binary files, rather than having to use some completely different ad-hoc mechanism.

Once you've told the VCS that a file is binary, it doesn't need to do textual diffs and other text-related operations on it, so there's no obvious reason why the VCS shouldn't be able to handle it in a reasonable manner.

Support large repositories!

Posted Apr 4, 2010 0:28 UTC (Sun) by RCL (guest, #63264) [Link] (15 responses)

Because data should also be versioned?

Your code should always be synchronized to the particular state of the data it works with. That's so common that I'm surprised that you are asking "why". Here are a few reasons:

1) Using bytecode-compiled scripts which rely on the particular layout and sizeof of structures used by the C/C++ code, and which are usually embedded in the data (maps, actors) they operate on.

2) Using binary formats which are tied to the particular code that loads/uses them (e.g. if you are making a console game, you don't have the resources for hierarchically structured XML-like formats, which need additional memory to load and/or cannot be read with a single I/O operation - a showstopper if you are streaming directly from DVD; your only realistic option is to load large binary chunks, fix up a few pointers here and there, and that's it).

3) Having the ability to branch/tag a particular state of the whole game so it can be later ported/patched/otherwise updated independently of ongoing development...

etc etc etc

Basically there are as many reasons to have your binary data versioned as there are reasons to have your plaintext data versioned.

Support large repositories!

Posted Apr 4, 2010 7:25 UTC (Sun) by fdr (guest, #57064) [Link] (6 responses)

I don't think the grandparent was suggesting you not version data, just that
it be done using some other mechanism.

I do think there's some room for improvement here, but I must admit: often
the large resources are not terribly coupled to code changes (ex: texture
tweaking), and I really, really like the fact that the "log" and "diff"
operators are local and blazing fast for textual artifacts. In the ideal
world I could have all my binary blobs, too, however....

I think some UI slickness could be of big help here. It would also be nice
for managing third-party code/dependencies. At the same time, I do imagine
that the right kind of simple script could fetch the resources needed, at
least as conveniently as a P4/SVN install, while not losing access to git's
advantages when manipulating textual artifacts.

Support large repositories!

Posted Apr 4, 2010 8:18 UTC (Sun) by ikm (guest, #493) [Link] (1 responses)

> I don't think the grandparent was suggesting you not version data, just that it be done using some other mechanism.

mechanism other than a VCS (version control system)? Maybe, like, a hammer, or a shovel? But not a VCS. Definitely not. VCSes are not for versioning data, no, no. Shovels are for that. Dig a hole, snatch your data, mark the place with a rock. And you're done.

Support large repositories!

Posted Apr 5, 2010 21:40 UTC (Mon) by fdr (guest, #57064) [Link]

In my mind, there's a real use case for different handling of smaller,
textual artifacts and larger blobs. Maybe one could call the DVCSs as they
exist now incomplete, but I would say the same for the centralized systems
where I feel my ability to interrogate textual history is agonizingly slow
(and, of course, requires network access).

No need to lean so heavily on the expansion of a TLA to make biting remarks.
It's somewhat silly.

Support large repositories!

Posted Apr 4, 2010 12:04 UTC (Sun) by RCL (guest, #63264) [Link] (3 responses)

I can't imagine what other [convenient] mechanisms could be used for data.
In gamedev data is a first-class citizen, and data commits (from 60+
artists) are usually much more frequent than code commits (from 10+
coders) yet they are interdependent (data depends on editor and game
depends on data), so they should be versioned together...

And by the way, being "binary" is not the deciding factor. Some
intermediate data formats may be textual (id software even tried that for
production formats, but it failed miserably on consoles). Things don't get
much better if you deal with gigabytes of text files.

The basic problem seems to be the time needed to detect changes. Perforce
relies on explicit help from the user by requiring that you "p4 open" a
file before editing it (more or less enforced by keeping files read-only),
but that makes "p4 diff" blazingly fast. SVN tries to detect changes
itself, and while that is convenient, it slows things down.
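
For illustration, the Perforce edit cycle looks roughly like this (the file name is hypothetical); because edits are declared up front, "p4 diff" only has to look at files that are open for edit:

p4 edit assets/level01.pak      # declare the edit; the file becomes writable
p4 diff assets/level01.pak      # fast: only opened files are examined
p4 submit -d "tweak level 1"    # commit the opened files...
p4 revert assets/level01.pak    # ...or throw the change away instead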

Git seems to be ideologically incompatible with the very idea of a
workflow where code is a tiny fragment of the overall versioned data.
DVCSes get everything wrong here: local history (which would require
several TBs) is not needed, the ability to lock a file is missing, and
detecting changes by scanning the files is detrimental.

There are some ways in which a DVCS might be more convenient (but only for
coders, who are a tiny part of the team); that's why Perforce introduced
the "shelving" concept, which brings the main advantage of a DVCS into a
traditional VCS. Perhaps Subversion should do the same...

Support large repositories!

Posted Apr 5, 2010 11:01 UTC (Mon) by CaddyCorner (guest, #64998) [Link]

It would be nice if large binary formats used e.g. tar as a part of their spec. This obviously doesn't address the current situation.

Perhaps something that does address the present situation is simply treating files whose modification date has changed as changed, and building a diff index in the background. If it were possible to transparently wrap the binary blob in a paged format, then modification dates could be tracked per block/page. The cost is that the VCS injects itself into the workflow; perhaps that injection could be made optional.
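
As a rough sketch of the mtime idea (the .last-scan marker file is invented for illustration), a background job could do something like:

find . -type f -newer .last-scan > changed-files   # everything modified since the last scan
touch .last-scan                                   # move the marker forward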

Support large repositories!

Posted Apr 5, 2010 11:31 UTC (Mon) by marcH (subscriber, #57642) [Link]

> In gamedev data is a first-class citizen, and data commits (from 60+ artists) are usually much more frequent than code commits (from 10+ coders)

Now I feel glad not to be a second-class citizen in game development. I would hate to have to deal with tens of gigabytes every time I want to test a one-line fix in isolation (just because no one cares to modularize these tens of gigabytes...)

> Git seems to be ideologically incompatible with the very idea of workflow, where code is a tiny fragment of overall versioned data.

Yes, because it is designed and optimized for the opposite use case.

> DVCSes get all the things wrong here: local history (which would require several TBs) is not needed here, ability to lock the file is missing, ability to detect changes by scanning the files is detrimental.

These are all desired features for a distributed, "text-optimized" VCS.

Thanks for sharing your experience with versioning binaries. It lets you highlight, better than anyone else could, how optimizing for binaries is different from, and incompatible with, optimizing for text.

Support large repositories!

Posted Apr 8, 2010 16:33 UTC (Thu) by Spudd86 (subscriber, #51683) [Link]

What do you mean, like gigabytes in one text file? If so, you're doing the text file part wrong...

Support large repositories!

Posted Apr 5, 2010 22:18 UTC (Mon) by marcH (subscriber, #57642) [Link] (6 responses)

> That's so common that I'm suprised that you are asking "why".

That's so common that only Perforce seems to handle large binaries well?

I am afraid what is actually common is to generate large binaries from source.

Support large repositories!

Posted Apr 6, 2010 1:19 UTC (Tue) by bronson (subscriber, #4806) [Link] (5 responses)

It's clear you don't work on projects with any appreciable graphics or sound. Really, given the number of mobile, game, and Flash devs nowadays, you might want to rethink your use of the word "common".

Besides, when it takes six hours for an optimized compile (this was the 90s), or when the dev tools cost $25,000/seat, then hell yes you check binaries into revision control. Right next to the source code.

Support large repositories!

Posted Apr 6, 2010 9:09 UTC (Tue) by marcH (subscriber, #57642) [Link]

> Besides, when it takes six hours for an optimized compile (this was the 90s), or when the dev tools cost $25,000/seat, then hell yes you check binaries into revision control. Right next to the source code.

As a matter of fact, I work daily with binaries that I cannot compile myself. Hell no, they are not checked in right next to the source code - that would slow revision control operations down to a crawl for no good reason.

Support large repositories!

Posted Apr 6, 2010 18:52 UTC (Tue) by avik (guest, #704) [Link] (3 responses)

The correct way to handle generated binaries is with a ccache-style shared repository. This way the first person to compile takes the hit, the rest reuse the generated binaries, and the source control doesn't need to be aware of it.
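
For what it's worth, a minimal sketch of that setup with ccache (the shared cache path and source file are hypothetical):

export CCACHE_DIR=/net/buildcache/ccache   # point every developer at the same cache
ccache gcc -O2 -c engine.c -o engine.o     # first build populates the cache, later builds hit it
ccache -s                                  # show hit/miss statistics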

Support large repositories!

Posted Apr 7, 2010 23:19 UTC (Wed) by cmccabe (guest, #60281) [Link] (2 responses)

> The correct way to handle generated binaries is with a ccache-style shared
> repository. This way the first person to compile takes the hit, the rest
> reuse the generated binaries, and the source control doesn't need to be
> aware of it.

Amen to that.

Checking in large blobs of "mystery meat" on the honor system just leads to chaos.

Support large repositories!

Posted Apr 8, 2010 23:13 UTC (Thu) by bronson (subscriber, #4806) [Link] (1 responses)

Mystery meat is mystery meat no matter where it's stored. When you give someone commit rights it becomes an honor system no matter what software you're using.

I notice you guys are ignoring my main points about audio and video files, and cross compilers that cost a lot of dough per seat. OK, fine, let's restrict this discussion to just native compiling. Even in this specialized case, anyone who's kept a distributed ccache up and running might be skeptical of Avi's advice.

Executables are WAY more backward compatible than object files. If you can ensure that everyone is running the exact same minor version of gcc and libraries, ccache would probably work. In most dev shops, where there's a crazy mix of personal favorite Linux distros plus a bunch of custom-compiled shared libs, I'm pretty sure trying to keep everyone on ccache will cost you a lot more time than it saves. (Spoken from my bitter experience of trying to do this in 2006.)

Different strokes, right? You use whichever technique is best for your shop. That might be ccache, custom scripts pulling binaries off fileservers, or just checking them right into source control. Each one has its place.

Support large repositories!

Posted Apr 30, 2010 18:44 UTC (Fri) by cmccabe (guest, #60281) [Link]

> Mystery meat is mystery meat no matter where it's stored. When you give
> someone commit rights it becomes an honor system no matter what software
> you're using.

When you check in _code_, a skilled coder can look at your change and figure out what it is doing. When you check in a _binary_, there is no obvious way to figure out how it differs from the binary that was previously there. Sure, you could disassemble it and run a detailed analysis, but realistically, that's not going to happen. Hence, it's "mystery meat."

> I notice you guys are ignoring my main points about audio and
> video files

No, I totally agree with your points regarding audio and video. I hope that git will be extended to support working with these large files more effectively.

> Executables are WAY more backward compatible than object files. If
> you can ensure that everyone is running the exact same minor version
> of gcc and libraries, ccache would probably work. In most dev shops,
> where there's a crazy mix of personal favorite Linux distros is plus
> a bunch of custom-compiled shared libs, I'm pretty sure trying to
> keep everyone on ccache will cost you a lot more time than it saves.
> (spoken from my bitter experience of trying to do this in 2006).

You are doing it wrong. Set up a chroot environment with the proper libraries and compiler. Look up "cross compiling with gcc."

Support large repositories!

Posted Apr 6, 2010 16:53 UTC (Tue) by Spudd86 (subscriber, #51683) [Link]

Don't store the bytecode of the script, store the source?

Why are you putting the bytecode in the repository? If it's coupled to changes in the C/C++ source, it should be built at the same time as all your native code...

Support large repositories!

Posted Apr 4, 2010 10:04 UTC (Sun) by epa (subscriber, #39769) [Link] (3 responses)

Listening to those telling Subversion users that they shouldn't
want to store large binary files in version control is like
hearing an svn diehard ask 'why do you want to mess around with
branches all the time?'.

Clearly, it's desirable that a VCS support big files. The fact
that some popular ones don't is no reason to dismiss this need.

Support large repositories!

Posted Apr 7, 2010 23:35 UTC (Wed) by cmccabe (guest, #60281) [Link] (2 responses)

For projects where pretty pictures and movies are a big part of the data, it seems like any version control system really needs to handle large binaries.

So far I've heard two reasons why git is slow on binaries:

1. git normally rescans the entire file during operations like "git diff".
For huge binaries, this gets expensive.

I wonder if git could use the file's mtime to determine whether to scan it for changes. Or does it already?

2. The git format still has some limitations with large files.

Those seem fixable. I wonder if anyone is working on this.
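
For what it's worth, git's index already caches per-file stat data (size and mtime), so unchanged files are normally skipped, and you can also tell it explicitly never to rescan a given file. A small sketch, with a hypothetical asset path:

git update-index --assume-unchanged assets/world.pak     # don't stat or rescan this file
git update-index --no-assume-unchanged assets/world.pak  # undo it when the file really changes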

Support large repositories!

Posted Apr 8, 2010 0:31 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

1. I believe that git will first compare the stored hashes of the two files (actually of the two trees, so if the trees are the same it doesn't bother checking the individual files); only if those differ will it actually do the diff.

2. This has been discussed, and most of the design work has been done for a couple of different possible solutions.

The first is to store the files separately and just have a reference, inside the existing git records, for how to get the file. The design work has been mostly done, but nobody has taken the time to code it (GSoC interest, anyone? ;-)

The second is to modify the pack format to handle larger things. There are people working on this, but since this would be a very invasive set of changes they are trying to anticipate all possible improvements, and so it is moving very slowly.
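
The first approach can already be prototyped by hand with git's filter drivers: a "clean" filter that ships the blob off to external storage and checks in only a small reference, and a "smudge" filter that fetches it back on checkout. A sketch, where fat-clean and fat-smudge are hypothetical helper scripts:

echo "*.pak filter=fat" >> .gitattributes
git config filter.fat.clean  fat-clean    # reads the blob on stdin, stores it elsewhere, writes a short reference to stdout
git config filter.fat.smudge fat-smudge   # reads the reference on stdin, writes the real blob back to stdout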

Support large repositories!

Posted Apr 8, 2010 16:35 UTC (Thu) by Spudd86 (subscriber, #51683) [Link]

In fact I think it first checks the mtime of the file before even computing the hash

Support large repositories!

Posted Apr 4, 2010 10:37 UTC (Sun) by ikm (guest, #493) [Link] (13 responses)

> At the core of every single VCS sit "diff" and "patch", totally useless for binaries.

You can 'diff' and 'patch' binaries. What you can't usually do is 'merge' them. Nevertheless, the need for versioning them exists, even if they aren't mergeable. By the way, to that end SVN supports locking, so that only one person works on a binary at a time. That would be quite weird for a DVCS, but centralized SVN can afford this.

Support large repositories!

Posted Apr 4, 2010 14:43 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (4 responses)

Agreed that binary files (sometimes) want revision control, and that digging a hole is not the solution.

Perhaps though it's wrong to accept the "can't merge" outcome, particularly in the video game context. It's my understanding that video games almost always have someone in the "toolsmith" role. A toolsmith's job could include providing merge for binary formats where that seems like a meaningful operation.

A really simple example is image merge. Given three layered "source images" - one the "original", one with an improved stone effect layer made by artist A, and one with the newly agreed "bad guy logo" in the emboss layer from lead designer B - it ought to be possible to take the changes made by A and by B and merge them to produce a new image which has the nicer stone AND the new logo. This is a mechanical operation (load both images, keep the layers that changed in each, save), but it requires insight into the format of the data.

But even non-mechanical merges are useful. Maybe the two unrelated changes to the intro level can't be merged by a machine, but the level design tool could be tricked out with a feature that can load another level and show it as a copyable ghost on top of the first. That takes merge from a painful and lossy operation worth avoiding at any cost (svn locking) to a relatively mundane occurrence, possible whenever necessary but not to be encouraged.

Support large repositories!

Posted Apr 4, 2010 17:15 UTC (Sun) by RCL (guest, #63264) [Link] (2 responses)

I haven't seen even visual diffing, let alone merging, for these formats
(binary or textual, doesn't matter) - I'm not talking about images, but
about animations, geometry etc (usually stored in proprietary and/or ad
hoc formats because of efficiency requirements).

What you are proposing is a nice idea, but it would take an enormous
amount of work to be generally applicable (merging two skinned characters
with different numbers of bones, anyone?) and it would still be error-prone.

Moreover, merging between two different data sets is not solely a
technical problem; it requires an artistic eye, because even the correctly
merged result of two great-looking changes may still look like shit.

It's so much easier to just lock the files, really!

Support large repositories!

Posted Apr 5, 2010 3:52 UTC (Mon) by martinfick (subscriber, #4455) [Link] (1 responses)

"Moreover, merging between two different data sets is not solely a
technical problem, it requires artistic eye because even correctly merged
result of two great-looking changes may still look like shit."

This can very well be true for code also...doesn't meant that a merge tool
to help the process isn't/wouldn't be useful. But, naturally the right way
to support this would be to have your VCS support multiple merge tools via a
plugin mechanism.

Support large repositories!

Posted Apr 5, 2010 9:28 UTC (Mon) by dlang (guest, #313) [Link]

which git does support

Support large repositories!

Posted Apr 6, 2010 20:43 UTC (Tue) by vonbrand (subscriber, #4458) [Link]

True, as far as it goes. But note that the diff + patch mechanism used to merge isn't infallible either: consider a repo containing a function named foo and several uses of it. Now on one branch rename foo to bar, and on another introduce further uses of foo. When you merge, even if the merge is successful (i.e., no changed chunks intersect), the result is still inconsistent.

What is really happening is that we use text as a representation for source code, which has a rather richer structure than just "lines" (but not so rich that it makes the above completely useless). We saw that with a student who worked (essentially) on a VCS for XML files representing projects. The simple line-based diff/merge turned out not to be enough; a somewhat smarter set of operations was needed.

That takes us again to the source of the (otherwise unreasonable) success of Unix: Use text files, not random binary formats unless strictly required. Binary formats are much harder to handle, and each of them will require its own set of operations. To add to the fun, many binary formats include their own (rudimentary) VCS...

Support large repositories!

Posted Apr 4, 2010 20:35 UTC (Sun) by marcH (subscriber, #57642) [Link] (7 responses)

> You can 'diff' and 'patch' binaries

I was not thinking of "diff-the-concept" but of "diff-the-tool".

You can design a tool that will pretend to handle both text and binaries the same way, but it will only pretend to. Inside the box you will actually find two different tools.

Support large repositories!

Posted Apr 4, 2010 21:31 UTC (Sun) by nix (subscriber, #2304) [Link] (6 responses)

No you won't. hg's delta algorithm is binary, as is svn's; git can
transform its series of commit snapshots in all sorts of ways without
changing visible behaviour, and can detect commonality between entirely
unrelated files merged from completely different source repositories. The
diff you see when you do 'git diff' is completely unrelated to the
delta-compression algorithm. (IMHO, this is one of git's biggest
architectural strengths: it can change its on-disk representation almost
beyond recognition without changing anything the user sees or breaking
compatibility in any way.)

Support large repositories!

Posted Apr 5, 2010 11:35 UTC (Mon) by marcH (subscriber, #57642) [Link] (5 responses)

> hg's delta algorithm is binary, as is svn's; git can transform its series of commit snapshots in all sorts of ways without changing visible behaviour,

Many binary formats are compressed by default. This usually prevents computing deltas. Are these tools clever enough to transparently uncompress revisions before comparing?

Support large repositories!

Posted Apr 5, 2010 16:24 UTC (Mon) by nix (subscriber, #2304) [Link] (2 responses)

Of course the source formats being compressed doesn't prevent computing
deltas, but it does mean that the deltas might be larger than they would
otherwise be. (If they end up too large, you'll just end up storing a
sequence of snapshots.)

Support large repositories!

Posted Apr 5, 2010 21:47 UTC (Mon) by marcH (subscriber, #57642) [Link] (1 responses)

> Of course the source formats being compressed doesn't prevent computing deltas, but it does mean that the deltas might be larger than they would otherwise be.

Every time I tried this, the delta was almost as big as the file itself. Would you have counter-examples?

Support large repositories!

Posted Apr 5, 2010 23:14 UTC (Mon) by nix (subscriber, #2304) [Link]

No. You'll end up with a lot of snapshots rather than deltas, currently.

Support large repositories!

Posted Apr 5, 2010 17:25 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

The ability is there in git to have the delta algorithm uncompress revisions before comparing them.

This has been discussed several times (especially in the context of handling things like .odf files that are compressed XML). What needs to be done to handle formats like this is well understood. Git even has a mechanism to flag files as being of a specific type and call arbitrary tools (external scripts/programs) to handle different file types.

Unfortunately, nobody has good, simple examples of this that I am aware of. It's possible, but will take some git-fu to set up.
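
One simple example of that "flag the file type and call an external tool" mechanism, at least on the human-readable diff side, is a textconv driver; the snippet below uses git's gitattributes/textconv machinery and assumes odt2txt is installed as the converter:

echo "*.odt diff=odt" >> .gitattributes
git config diff.odt.textconv odt2txt   # "git diff" now shows changes in the extracted text instead of "binary files differ"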

Support large repositories!

Posted Apr 6, 2010 17:00 UTC (Tue) by Spudd86 (subscriber, #51683) [Link]

What would be nice is if someone wrote the code to handle the common files of this type and just included it as part of git (or at least posted it somewhere)

Support large repositories!

Posted Apr 6, 2010 19:44 UTC (Tue) by vonbrand (subscriber, #4458) [Link]

> Most VCS are indeed designed for human-generated, line-oriented text files.

Right, that is their primary use.

> At the core of every single VCS sit "diff" and "patch", totally useless for binaries.

Wrong. At least, git does not use any kind of "diff" + "patch" at its core. It (optionally!) uses xdelta's format to represent differences between similar file contents to get a more compact representation, and that works fine on binaries.

Support large repositories!

Posted Apr 4, 2010 20:07 UTC (Sun) by dlang (guest, #313) [Link] (10 responses)

As I understand it, the problem is not large repositories.

There are two separate problems

1. large files (individual files > 4G or larger than you can reasonably mmap on your system)

this is a real, acknowledged problem that is discussed every 6 months or so on the git list.

2. files that don't diff, and therefore make the repository large and therefore slow to copy, etc.

shallow clones (i.e. don't pull all the history) are the work-around for this.
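
A shallow clone is a one-liner (repository URL hypothetical):

git clone --depth 1 git://example.com/game-assets.git   # fetch only the most recent revision, not the full history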

Support large repositories!

Posted Apr 5, 2010 11:36 UTC (Mon) by cortana (subscriber, #24596) [Link] (9 responses)

Large repositories are a problem as well (at least with Git). Git occasionally decides to 'repack' a
large number of objects in its database into a smaller number of much larger 'packfiles'. As soon
as the size of one of these packfiles goes over 2.1GB, it fails with all kinds of obtuse error
messages.

Support large repositories!

Posted Apr 5, 2010 17:25 UTC (Mon) by dlang (guest, #313) [Link] (1 responses)

If you rename a pack file and add .keep to the end of the name, git will never try to repack that file.

Support large repositories!

Posted Apr 5, 2010 19:02 UTC (Mon) by nix (subscriber, #2304) [Link]

Uh, not quite. You want to create a *new* file with the same name as the
packfile but ending in .keep instead of .pack. (The content can be
anything; a reason why you never want to repack this packfile, or nothing
at all).
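
Concretely, something like this (the pack name is a placeholder):

cd .git/objects/pack
ls                                                   # find the big pack, e.g. pack-<sha1>.pack
echo "huge import, never repack" > pack-<sha1>.keep  # same basename, .keep extension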

Support large repositories!

Posted Apr 5, 2010 19:01 UTC (Mon) by nix (subscriber, #2304) [Link] (6 responses)

Yeah, big packfiles pretty much require a 64-bit box right now (and, given
the near-zero popularity of new 32-bit boxes in non-embedded roles,
probably always will).

Support large repositories!

Posted Apr 5, 2010 23:15 UTC (Mon) by cortana (subscriber, #24596) [Link] (5 responses)

And I am *eagerly* awaiting a 64-bit git client for Windows. :)

Support large repositories!

Posted Apr 5, 2010 23:36 UTC (Mon) by dlang (guest, #313) [Link] (4 responses)

The code is 64-bit clean; does it not work if you compile it with a 64-bit compiler on Windows?

Or are you saying that git needs to provide pre-compiled binaries for 64-bit Windows?

Support large repositories!

Posted Apr 6, 2010 8:49 UTC (Tue) by cortana (subscriber, #24596) [Link] (3 responses)

Pre-compiled binaries are necessary. I really don't have enough time to work out how to navigate
the maze that is preparing a mingw/msys environment, then working out how to build git,
package it into an installer, etc.

Support large repositories!

Posted Apr 8, 2010 16:39 UTC (Thu) by Spudd86 (subscriber, #51683) [Link] (2 responses)

Err, compiling git on MSYS should be just:

./configure
make
make install

you may or may not have to call configure with --prefix=/usr to get a usable result... but I don't think git has much in the way of dependencies... unless you want things like gitk or git-svn

('course I am assuming you already have MSYS installed...)

Support large repositories!

Posted Apr 8, 2010 16:40 UTC (Thu) by Spudd86 (subscriber, #51683) [Link]

Oh yeah, if you do what I said you don't need an installer; just fire up your MSYS shell and use git from there.

Support large repositories!

Posted Apr 8, 2010 16:48 UTC (Thu) by cortana (subscriber, #24596) [Link]

But then I'm using git in msys, not the nicely-packaged msysgit. And I would need to have a 64-bit
msys environment installed on the developers' systems. Does that even exist?

Support large repositories!

Posted Apr 5, 2010 16:49 UTC (Mon) by SEMW (guest, #52697) [Link] (1 responses)

For what it's worth, both git-bigfiles (a fork of git) and a Mercurial Bigfiles extension exist. (Whether either of them are any use I have no idea).

Support large repositories!

Posted Apr 5, 2010 17:33 UTC (Mon) by chad.netzer (subscriber, #4257) [Link]

Also recently discussed is "bup", which is a backup system for large files:
http://lwn.net/Articles/380983/
http://github.com/apenwarr/bup#readme

The main clever bit is that "bup" writes packfiles directly, rather than trying to diff-and-pack the large blobs after the fact. This, combined with splitting and reconstituting large files automatically, might be a good way for git to natively grow efficient large-file support (with built-in binary data "deduping").
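
Basic bup usage looks roughly like this (the directory name is hypothetical):

bup init                                 # create the backing repository
bup index /data/game-assets              # stat the tree and record what has changed
bup save -n assets /data/game-assets     # split files into chunks and write packfiles directly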

Support large repositories!

Posted Apr 5, 2010 17:30 UTC (Mon) by tan2 (guest, #42953) [Link]

Mercurial (hg) has a "bigfiles" extension for this situation.

http://mercurial.selenic.com/wiki/BigfilesExtension

Files larger than a certain size, say 10MB, are stored in an external place. The file names and md5 checksums are stored and versioned in the repository. An "hg update -r REV" will update files tracked by the normal repository and by the external place to the specified revision.

Support large repositories!

Posted Apr 6, 2010 0:23 UTC (Tue) by tosiabunio (guest, #65014) [Link]

I have first-hand experience using Subversion for game development, and
that was for versioning data, not code.

The Witcher game used SVN during its development. The working directory was
20+ GB in size and the whole repository was 200+ GB in size when the game
reached gold.

I have to admit that SVN handled this quite well, although performance was
the biggest problem. 50+ people updating 50K+ files every morning took some
time - many hours wasted just to finally update a few hundred files. Also, many
gigabytes of local space were wasted on the local pristine copy of each file,
which is only needed for revert operations (and those could be done from the
server anyway).

Support large repositories!

Posted Apr 8, 2010 20:06 UTC (Thu) by jools (guest, #65116) [Link]

Yes! Rightly or wrongly, game developers have large repositories, with lots of binary files in them. For instance, here are the sizes of the repositories I'm working with:

Accurev: 22GB workspace in 130,000 files on disk, history going back to 2003,
Perforce: 115GB workspace in 50,000 files on disk, a couple of years of history

Both of these systems are well able to cope with this. For Subversion, as a centralised VCS, this is the competition.

For what it's worth, these are the advantages that being centralised can bring, IMO:

* nothing but the files locally: at those sizes you don't want to have to keep the whole repository locally (optionally would be fine!), or the pristine copies that Subversion currently keeps.

* central backup

* central access to all committed code on all branches

Being centralised shouldn't stop the painless branching and merging that a DVCS has. I've used Accurev a lot, and IMO it's competitive with Mercurial for this, although the command line is clunkier. (Use the GUI!)

OTOH a DVCS could bring all of the features above. As far as I know right now nothing does :( I'd love to be proved wrong on this!

Those sizes of repository also suggest why some systems, e.g. Perforce, use the checkout-before-edit model: it greatly reduces the file scanning required, and so speeds up some client operations considerably.

