LWN: Comments on "Preserving the global software heritage"

Preserving the global software heritage

mina86 — Sun, 07 Aug 2016 15:22:33 +0000

I wonder how they're planning to cope with projects whose original GitHub repository stopped being maintained and the project is now developed in a “forked” repository. Perhaps the long term solution is to store everything?

Another observation is that people who accidentally pushed their password upstream are now even more screwed. :P Not that one shouldn't immediately change their password in such a case anyway.

Preserving the global software heritage

nix — Fri, 05 Aug 2016 19:41:43 +0000

> we looked and found thousands of different file names (not even considering the full paths) for the textual version of the GPL3 license, many of which appear to be randomly generated names.

I'm curious as to just what on earth might be choosing to rename a COPYING file to a randomly-generated name. This isn't Subversion text-base/ directories or something, is it?

Preserving the global software heritage

zack — Sun, 24 Jul 2016 16:46:30 +0000

> So all revisions are stored as separate files? One would think storing differences (with RCS for example) would make more sense for textual source files, to greatly reduce the space usage (better than zipping the versions separately).

It's in fact a trade-off between compressing well at the individual "project" scale, and compressing well across tens of millions of "projects". For now, we've chosen the large-scale end of that spectrum, and it's working very well for us; we'll see how it goes in the long run.

As a first approximation, our low-level file storage is a content-addressable storage, which we use to deduplicate files coming from any software origin we track. If some content is pointed to by a file called foo.txt in one project and bar.txt in another, the content will be stored only once. In addition to the storage we do have metadata remembering all file names and paths, but the low-level storage is completely agnostic to that. That's useful to trivially deduplicate across different software origins, but it also means that when a new (version of a new) file comes in, the content storage won't know what to diff it against to store it more efficiently. Unless you go deeper and do stuff like sub-file block-level deduplication, which we currently don't do, but that (at least architecturally) won't be hard to add.
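A minimal sketch of the deduplication idea described above, not the actual Software Heritage storage code. The class name `ContentStore`, its attributes, and the choice of SHA-1 as the key are assumptions made purely for illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: blobs are keyed by their SHA-1 digest."""

    def __init__(self):
        self.blobs = {}   # hex digest -> bytes; each distinct content stored once
        self.names = []   # (origin, path, hex digest) metadata, kept separately

    def add(self, origin, path, data):
        digest = hashlib.sha1(data).hexdigest()
        # Keying on the digest deduplicates across origins and file names.
        self.blobs.setdefault(digest, data)
        self.names.append((origin, path, digest))
        return digest


store = ContentStore()
a = store.add("project-a", "foo.txt", b"same bytes\n")
b = store.add("project-b", "bar.txt", b"same bytes\n")
```

Both calls return the same digest and only one blob is kept, while the metadata list remembers both names. The blob table never sees a path, which is why it has no natural candidate to diff a new blob against.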

Git itself, which at its bottom level is a content-addressable storage, has similar problems optimizing the size of packfiles. It gets around them with heuristics based on file names: "these two blobs are pointed to by the same path, let's store one of them as a diff against the other one". But Git has it easier, because filenames in a single repository are very stable over time and renames are local. At our scale the same blob will be pointed to by millions of different paths; as a fun fact, we looked and found thousands of different file names (not even considering the full paths) for the textual version of the GPL3 license, many of which appear to be randomly generated names.
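To make the "same path, store a diff" idea concrete, here is a toy illustration. It is not Git's actual packing algorithm: the helper names `make_delta`, `apply_delta` and `pack` are hypothetical, and the delta format is a simple copy/insert list built with Python's `difflib`.

```python
import difflib


def make_delta(base, target):
    """Encode target as copy/insert instructions relative to base (lists of lines)."""
    ops = []
    matcher = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse lines base[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # new lines stored literally
        # "delete" needs no instruction: those base lines are just not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the target lines from base plus the instructions."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def pack(entries):
    """entries: (path, lines) in arrival order; diff against the last blob at the same path."""
    last_by_path = {}
    packed = []
    for path, lines in entries:
        base = last_by_path.get(path)
        if base is None:
            packed.append((path, "full", lines))
        else:
            packed.append((path, "delta", make_delta(base, lines)))
        last_by_path[path] = lines
    return packed


v1 = ["a\n", "b\n", "c\n"]
v2 = ["a\n", "B\n", "c\n", "d\n"]
packed = pack([("COPYING", v1), ("COPYING", v2)])
```

The heuristic only helps when the path is a good predictor of similar content. With thousands of unrelated names pointing at the same blob, the `last_by_path` lookup would rarely find a useful base.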

Yes, we could try some smart heuristics to "cluster" together contents that are similar by some metric, and store contents belonging to the same cluster as diffs against some base version, but it didn't seem worth the extra design complexity.

> Anyway, this is a fantastic project, I wish it luck.

Thanks :-)

Preserving the global software heritage

Wol — Thu, 14 Jul 2016 23:37:34 +0000

> Copyright is inherently controversial. It's supposed to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries, but both extremities don't work (or so it seems).

Except that is a minority view. It's the justification used in the American Constitution, but that doesn't apply to most of the world. It's based on the Queen Anne act (can't remember the details) which was meant to provide a monopoly for printers.

And it's been shown that that "limited times" should be about 10 years - nearly all the value in almost any work will have been extracted in that ten years.

I can accept extending it for the authors, but the current system is just totally unjustifiable ...

Cheers,
Wol

Preserving the global software heritage

flussence — Mon, 11 Jul 2016 22:09:27 +0000

I'm at a loss as to what the real security issue of weak hashes on a public dataset is. Can you give examples?

Preserving the global software heritage

hkario — Mon, 11 Jul 2016 14:22:50 +0000

The problem is that malicious users can create SHA-1 collisions. RIPEMD-160 is not much better (yes, it moves the problem a few years into the future, but it does not eliminate it).

You simply should not use any kind of 160-bit hash nowadays, especially for a project that is just being deployed.

Preserving the global software heritage

zack — Mon, 11 Jul 2016 10:33:19 +0000

> SHA-1, while still absolutely enough for this kind of application, seems a strange choice today. Is git compatibility an issue?

To clarify, we offer only SHA1 as a lookup mechanism in the current (very minimal for now) Web UI, but we do not rely on the assumption that we will not encounter SHA1 collisions in the wild. (Even though I personally do agree that SHA1 is still absolutely enough for this kind of application, we are trying to be future-proof and we know we will eventually need to move away from SHA1 even for integrity checking purposes.)

Internally in our DB we currently use 3 kinds of checksums: SHA1, SHA2 (256), and "salted" SHA1 (à la git hash-object). We cross-check them to spot a collision on any single one of them.
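A minimal sketch of what computing and cross-checking those three checksums could look like, using only Python's `hashlib`. The function names, dictionary keys and the conflict rule are assumptions for illustration, not the project's actual schema or code. The "salted" variant follows the way `git hash-object` hashes a blob: a `blob <size>` header and a NUL byte are prepended to the content before SHA1 is applied.

```python
import hashlib


def checksums(data):
    """Return the three digests described above for one blob of bytes."""
    # Prefix that git hash-object adds for blobs: b"blob <size>\0"
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha1_git": hashlib.sha1(header + data).hexdigest(),
    }


def find_conflicts(known, new):
    """Names of digests that match while at least one other digest differs."""
    same = {name for name in new if known[name] == new[name]}
    # All equal: same content. None equal: unrelated content. Anything in
    # between suggests a collision on the digests that do match.
    return same if 0 < len(same) < len(new) else set()


empty = checksums(b"")
```

For the empty input, `empty["sha1_git"]` is `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the well-known Git ID of the empty blob, which differs from the plain SHA1 of the same bytes.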

We would like to add SHA3 to the mix (possibly dropping SHA2), but for that we were waiting for a stable SHA3 implementation to land in Python 3.x (we're currently on 3.4).

Hope this clarifies.
/me, wearing his Software Heritage hat

Preserving the global software heritage

flussence — Sat, 09 Jul 2016 00:48:04 +0000

If size is a concern, RIPEMD-160 has the same digest size as SHA1 while being a bit less broken, and it is widely available. SHA1 has hardware acceleration though, which is probably significant for a dataset this huge.

Preserving the global software heritage

robbe — Fri, 08 Jul 2016 21:07:00 +0000

> The web interface will likely obfuscate addresses

Aren’t we past this stage already in the spamfight? Do reasonable people expect the mail addresses they upload into a public forum to remain out of the hands of spammers?

> and the archive API may rate-limit requests.

That’s never a bad idea.

Preserving the global software heritage

robbe — Fri, 08 Jul 2016 21:00:26 +0000

Oh, interesting …
/me runs sha1sum on https://github.com/jmechner/Prince-of-Persia-Apple-II/blo...
9b870d5108539a195401b611061e188cd1d1411d, yep, it’s already in the heritage archive.

Preserving the global software heritage

robbe — Fri, 08 Jul 2016 20:26:39 +0000

sha-2 256, I guess. But that would also bloat their postgres DB…

Preserving the global software heritage

khim — Fri, 08 Jul 2016 19:47:39 +0000

I think they eventually will. They are veeeeery slow, but such things are getting published. MS DOS here, Prince of Persia there… I only wish they would do it somewhat earlier… I'm pretty sure so much is already lost…

Preserving the global software heritage

khim — Fri, 08 Jul 2016 19:43:36 +0000

Copyright is inherently controversial. It's supposed to promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries, but both extremities don't work (or so it seems). If these “limited Times” are too short then apparently nobody will bother to create anything (although I'm not sure if anyone ever tested that theory). If that time is pushed to millions of years then all the “Writings and Discoveries” would eventually become unavailable to anyone who does not want to spend exorbitant sums of money.

Libraries (both physical and digital) just sit at the center of that whole dilemma…

Preserving the global software heritage

jospoortvliet — Fri, 08 Jul 2016 15:50:39 +0000

My first thought was... Will MS, Apple and others contribute ancient versions of Windows (eg 1.0, 1.1), DOS (1-5 perhaps), Mac OS and so on...???

Preserving the global software heritage

amarao — Fri, 08 Jul 2016 13:12:31 +0000

I do not understand. They say 'close electronic libraries, they infringe our precious copyright', and this is 'piracy' and a 'bad thing'. Next they say 'preserve culture' and 'give access to stored media', and this is a good thing and so on.

Why not legalize those libraries in the first place? An amazing collection of existing books, available for search, with free access for everyone who needs it. But no, this is 'piracy'.

I do not understand you, mankind species.

Preserving the global software heritage

smitty_one_each — Fri, 08 Jul 2016 12:50:11 +0000

"seems a strange choice today"

What else might one recommend?

Preserving the global software heritage

eru — Fri, 08 Jul 2016 07:18:24 +0000

> All of the imported archives are stored as flat files in a standard filesystem, including all of the revisions of each file.

So all revisions are stored as separate files? One would think storing differences (with RCS for example) would make more sense for textual source files, to greatly reduce the space usage (better than zipping the versions separately). It's even possible to search for strings inside a set of versions stored in this format without unpacking the file.

Anyway, this is a fantastic project, I wish it luck.

Preserving the global software heritage

robbe — Fri, 08 Jul 2016 06:51:35 +0000

> Users can search for specific files by their SHA-1 hashes, but
> cannot browse.
To be specific: all you can get currently is a yes/no answer to the question: „is that SHA-1 hash contained in the archive?“

(It was not clear to me.)

SHA-1, while still absolutely enough for this kind of application, seems a strange choice today. Is git compatibility an issue?