Preserving the global software heritage
The Software Heritage initiative is an ambitious new effort to amass an organized, searchable index of all of the software source code available in the world (ultimately, including code released under free-software licenses as well as code that was not). Software Heritage was launched on June 30 with a team of just four employees but with the support of several corporate sponsors. So far, the Software Heritage software archive has imported 2.7 billion files from GitHub, the Debian package archive, and the GNU FTP archives, but that is only the beginning.
In addition to the information on the Software Heritage site, Nicolas Dandrimont gave a presentation about the project on July 4 at DebConf; video [WebM] is available. In the talk, Dandrimont noted that software is not merely pervasive in the modern world, but it has cultural value as well: it captures human knowledge. Consequently, it is as important to catalog and preserve as are books and other media—arguably more so, because electronic files and repositories are prone to corruption and sudden disappearance.
Thus, the goal of Software Heritage is to ingest all available software source code, index it in a meaningful way, and provide front-ends for the public to access it. At the beginning, that access will take the form of searching, but Dandrimont said the project hopes to empower research, education, and cultural analysis in the long term. There are also immediate practical uses for a global software archive: the tracking of security vulnerabilities, assisting in license compliance, and helping developers discover relevant prior art.
The project was initiated by Inria, the French Institute for Research in Computer Science and Automation (which has a long history of supporting free-software development), and as of launch time it has picked up Microsoft and Data Archiving and Networked Services (DANS) as additional sponsors. Dandrimont said that the intent is to grow Software Heritage into a standalone non-profit organization. For now, however, there is a small team of full-time employees working on the project, with the assistance of several interns.
The project's servers are currently hosted at Inria, utilizing about a dozen virtual machines and a 300TB storage array. At the moment, there are backups at a separate facility, but there is not yet a mirror network. The archive itself is online, though it is currently accessible only in limited form. Users can search for specific files by their SHA-1 hashes, but cannot browse.
Indices
It does not take much contemplation to realize that Software Heritage's stated goal of indexing all available software is both massive in raw numbers and complicated by the vast assortment of software sources involved. Software Heritage's chief technology officer (CTO) is Stefano Zacchiroli, a former Debian Project Leader who has recently devoted his attention to Debsources, a searchable online database of every revision of every package in the Debian archive.
Software Heritage is an extension of the Debsources concept (which, no doubt, had some influence in making the Debian archive one of the initial bulk imports). In addition to the Debian archive, at launch time the Software Heritage archive also included every package available through the GNU project's FTP site and an import of all public, non-fork repositories on GitHub. Dandrimont mentioned in his talk that the Software Heritage team is currently working with Google to import the Google Code archive and with Archive Team to import its Gitorious.org archive.
Of the three existing sources, the GitHub data set is the largest, accounting for 22 million repositories and 2.6 billion files. For comparison, in 2015, Debsources was reported to include 11.7 million files in just over 40,000 packages. Google Code included around 12 million projects and Gitorious around 2 million.
But those collections account for just a handful of sites where software can be found. Moving forward, Software Heritage wants to import the archives for the other public code-hosting services (like SourceForge), every Linux distribution, language-specific sites like the Python Package Index, corporate and personal software repositories, and (ultimately) everywhere else.
Complicating the task is that this broad scope, by its very nature, will pull in a lot of software that is not open-source or free software. In fact, as Zacchiroli confirmed in an email, the licensing factor is already a hurdle, since so many repositories have no licensing information:
The way I like to think about this is: we want to protect the entire Software Commons. Free/Open Source Software is the largest and best curated part of it; so we want to protect all of FOSS. Given the long-term nature of Software Heritage, we simply go for all publicly available source code (which includes all of FOSS but is larger), as it will become part of the Software Commons one day too.
For now, Zacchiroli said, the Software Heritage team is focused on finalizing the database of the current software and on putting a reliable update mechanism in place. GitHub, for example, is working with the team to enable ongoing updates of the already imported repositories, as well as adding new repositories as they are created. The team is also writing import tools for use in ingesting files from a variety of version-control systems (old and new).
Access
Although the Software Heritage archive's full-blown web interface has yet to be launched, Dandrimont's talk provided some details on how it will work, as well as how the underlying stack is designed.
All of the imported archives are stored as flat files in a standard filesystem, including all of the revisions of each file. A PostgreSQL database tracks each file by its SHA-1 hash, with directory-level manifests of which files are in which directory. Furthermore, each release of each package is stored in the database as a directed acyclic graph of hashes, and metadata is tracked on the origin (e.g., GitHub or GNU) of each package and various other semantic properties (such as license and authorship). At present, he said, the archive consists of 2.7 billion files occupying 120TB, with the metadata database taking up another 3.1TB. "It is probably the biggest distributed version-control graph in existence," he added.
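A rough sketch of that model may help (the names below are made up for illustration; the real schema lives in the project's PostgreSQL database and the code at forge.softwareheritage.org). Content is addressed purely by its hash, and the higher layers are just mappings between hashes:

    import hashlib
    import os

    OBJECT_DIR = "objects"  # flat-file store: one file per unique content

    def store(data: bytes) -> str:
        """Write content under its SHA-1 name; duplicates collapse to one file."""
        digest = hashlib.sha1(data).hexdigest()
        path = os.path.join(OBJECT_DIR, digest)
        if not os.path.exists(path):
            os.makedirs(OBJECT_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        return digest

    def contains(digest: str) -> bool:
        """The query the current archive answers: is this SHA-1 present?"""
        return os.path.exists(os.path.join(OBJECT_DIR, digest))

    # A directory is a manifest mapping names to content hashes; a release
    # points at a manifest hash, so the archive as a whole forms a graph
    # (a DAG) of hashes, with origin metadata tracked alongside.
    manifest = {"README": store(b"docs\n"), "main.c": store(b"int main(){}\n")}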
Browsing through the web interface and full-text searching are the next features on the roadmap. After that comes downloading, including an interface to grab projects with git clone. Further out, the project's plans are less specific, in part because it hopes to attract input from researchers and users to help determine which features are of interest.
At the moment, he said, the storage layer is fairly basic in its design. He noted that the raw number of files "broke Git's storage model" and that the small file sizes (3kB on average) posed their own set of challenges. He then invited storage experts to get involved in the project, particularly as the team starts exploring database replication and mirroring. The code used by the project itself is free software, available at forge.softwareheritage.org.
Because the archive contains so many names and email addresses, Zacchiroli said that steps were being taken to make it difficult for spammers to harvest addresses in bulk, while still making it possible for well-behaved users to access files in their correct form. "There is a tension here," he explained. The web interface will likely obfuscate addresses and the archive API may rate-limit requests.
The project clearly has a long road ahead of it; in addition to the large project-hosting sites and FTP archives, collecting all of the world's publicly available software entails connecting to thousands if not millions of small sites and individual releases. But what Software Heritage is setting out to do seems to offer more value than a plain "file storage" archive like those offered by Archive Team and the Internet Archive. Providing a platform for learning, searching, and researching software has the potential to attract more investments of time and financial resources, two quantities that Software Heritage is sure to need in the years ahead.
Posted Jul 8, 2016 6:51 UTC (Fri)
by robbe (guest, #16131)
[Link] (6 responses)
> cannot browse.

To be specific: all you can get currently is a yes/no answer to the question: „is that SHA-1 hash contained in the archive?“ (It was not clear to me.)

SHA-1, while still absolutely enough for this kind of application, seems a strange choice today. Is git compatibility an issue?
Posted Jul 8, 2016 12:50 UTC (Fri)
by smitty_one_each (subscriber, #28989)
[Link] (4 responses)
What else might one recommend?
Posted Jul 8, 2016 20:26 UTC (Fri)
by robbe (guest, #16131)
[Link] (3 responses)
Posted Jul 9, 2016 0:48 UTC (Sat)
by flussence (guest, #85566)
[Link] (2 responses)
Posted Jul 11, 2016 14:22 UTC (Mon)
by hkario (subscriber, #94864)
[Link] (1 responses)
you simply should not use any kind of 160-bit hash these days, especially for a project that is just being deployed
Posted Jul 11, 2016 22:09 UTC (Mon)
by flussence (guest, #85566)
[Link]
Posted Jul 11, 2016 10:33 UTC (Mon)
by zack (subscriber, #7062)
[Link]
/me, wearing his Software Heritage hat

To clarify, we offer only SHA1 as a lookup mechanism in the current (very minimal for now) Web UI, but we do not rely on the fact that we will not encounter SHA1 collisions in the wild. (Even though I personally do agree that SHA1 is still absolutely enough for this kind of application, we are trying to be future-proof and we know we will eventually need to move away from SHA1 even for integrity-checking purposes.)
Internally in our DB we currently use 3 kinds of checksums—SHA1, SHA2 (256), and "salted" SHA1 (à la git hash-object)—and we do cross-checks to spot collisions in any single one of them.
We would like to add SHA3 in the mix (possibly dropping SHA2), but for that we were waiting for a stable SHA3 implementation to land in Python 3.x (we're currently on 3.4).
Hope this clarifies.
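For readers curious how those three checksums differ, here is a minimal sketch using only Python's standard hashlib module (the function name is illustrative, not the project's actual code); the "salted" variant prepends the same blob header that git hash-object uses:

    import hashlib

    def content_checksums(data: bytes) -> dict:
        """Compute plain SHA-1, SHA-256, and a git-style "salted" SHA-1."""
        # `git hash-object` hashes a "blob <length>\0" header plus the content.
        git_blob = "blob {}\0".format(len(data)).encode() + data
        return {
            "sha1": hashlib.sha1(data).hexdigest(),
            "sha256": hashlib.sha256(data).hexdigest(),
            "sha1_git": hashlib.sha1(git_blob).hexdigest(),
        }

    # sha1_git of b"hello world\n" is 3b18e512dba79e4c8300dd08aeb37f8e728b8dad,
    # matching `git hash-object` on the same content.
    print(content_checksums(b"hello world\n"))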
Posted Jul 8, 2016 7:18 UTC (Fri)
by eru (subscriber, #2753)
[Link] (2 responses)
> All of the imported archives are stored as flat files in a standard filesystem, including all of the revisions of each file.

So all revisions are stored as separate files? One would think storing differences (with RCS, for example) would make more sense for textual source files, to greatly reduce the space usage (better than zipping the versions separately). It's even possible to search for strings inside a set of versions stored in this format without unpacking the file.
Anyway, this is a fantastic project, I wish it luck.
Posted Jul 24, 2016 16:46 UTC (Sun)
by zack (subscriber, #7062)
[Link] (1 responses)
It's in fact a trade-off between compressing well at the individual "project" scale and compressing well across tens of millions of "projects". For now, we've chosen the large-scale end of that spectrum, and it's working very well for us; we'll see how it goes in the long run.
As a first approximation, our low-level file storage is a content-addressable store, which we use to deduplicate files coming from any software origin we track. If some content is pointed to by a file called foo.txt in one project and bar.txt in another, the content will be stored only once. In addition to the storage we do have metadata remembering all file names and paths, but the low-level storage is completely agnostic to that. That's useful to trivially deduplicate across different software origins, but it also means that when a new (version of a new) file comes in, the content storage won't know what to diff it against to store it more efficiently. Unless you go deeper and do things like sub-file, block-level deduplication, which we currently don't do, but which (at least architecturally) wouldn't be hard to add.
Git itself, which at the bottom level is a content-addressable store, has similar problems optimizing the size of packfiles. It gets around them with heuristics based on file names: "these two blobs are pointed to by the same path, let's store one of them as a diff against the other". But Git has it easier, because filenames in a single repository are very stable over time and renames are local. At our scale the same blob will be pointed to by millions of different paths; as a fun fact, we looked and found thousands of different file names (not even considering the full paths) for the textual version of the GPL3 license, many of which appear to be randomly generated names.
Yes, we could try some smart heuristics to "cluster" together contents that are similar by some metric, and store contents belonging to the same cluster as diffs against some base version, but it didn't seem worth the extra design complexity.
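As a toy illustration of that name-agnostic deduplication (a sketch, not Software Heritage's actual code; all names are made up): one stored copy per unique content, with a separate metadata table recording every origin and path that pointed at it:

    import hashlib

    objects = {}      # content SHA-1 -> bytes: one copy per unique content
    occurrences = {}  # content SHA-1 -> list of (origin, path) sightings

    def ingest(origin: str, path: str, data: bytes) -> None:
        digest = hashlib.sha1(data).hexdigest()
        objects.setdefault(digest, data)                      # stored only once
        occurrences.setdefault(digest, []).append((origin, path))

    license_text = b"GNU GENERAL PUBLIC LICENSE\nVersion 3 ...\n"
    ingest("github.com/alice/one", "COPYING", license_text)
    ingest("github.com/bob/two", "legal/x7f3q9.txt", license_text)

    # Same bytes under two names: one stored object, two recorded occurrences.
    assert len(objects) == 1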
> Anyway, this is a fantastic project, I wish it luck.
Thanks :-)
Posted Aug 5, 2016 19:41 UTC (Fri)
by nix (subscriber, #2304)
[Link]
> we looked and found thousands of different file names (not even considering the full paths) for the textual version of the GPL3 license, many of which appear to be randomly generated names.

I'm curious as to just what on earth might be choosing to rename a COPYING file to a randomly-generated name. This isn't Subversion text-base/ directories or something, is it?
Posted Jul 8, 2016 13:12 UTC (Fri)
by amarao (guest, #87073)
[Link] (2 responses)
Why not legalize those libraries in the first place? An amazing collection of existing books, available for search, with free access for everyone who needs it. But no, this is 'piracy'.
I do not understand you, mankind species.
Posted Jul 8, 2016 19:43 UTC (Fri)
by khim (subscriber, #9252)
[Link] (1 responses)
Copyright is inherently controversial. It's supposed to “promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries”, but neither extreme works (or so it seems). If these “limited Times” are too short then apparently nobody will bother to create anything (although I'm not sure anyone ever tested that theory). If that time is pushed to millions of years then all the “Writings and Discoveries” eventually become unavailable to anyone who does not want to spend exorbitant sums of money. Libraries (both physical and digital) sit right at the center of that whole dilemma…
Posted Jul 14, 2016 23:37 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Except that is a minority view. It's the justification used in the American Constitution, but that doesn't apply to most of the world. It's based on the Queen Anne act (can't remember the details) which was meant to provide a monopoly for printers.
And it's been shown that that "limited times" should be about 10 years - nearly all the value in almost any work will have been extracted in that ten years.
I can accept extending it for the authors, but the current system is just totally unjustifiable ...
Cheers,
Wol
Posted Jul 8, 2016 15:50 UTC (Fri)
by jospoortvliet (guest, #33164)
[Link] (2 responses)
Posted Jul 8, 2016 19:47 UTC (Fri)
by khim (subscriber, #9252)
[Link] (1 responses)
I think eventually they'll do. They are veeeeery slow, but such things are getting published. MS DOS here, Prince of Persia there… I only wish they would do it somewhat earlier… I'm pretty sure so much is already lost…
Posted Jul 8, 2016 21:00 UTC (Fri)
by robbe (guest, #16131)
[Link]
/me runs sha1sum on https://github.com/jmechner/Prince-of-Persia-Apple-II/blo...

9b870d5108539a195401b611061e188cd1d1411d, yep, it's already in the heritage archive.
Posted Jul 8, 2016 21:07 UTC (Fri)
by robbe (guest, #16131)
[Link]
Aren’t we past this stage already in the spamfight? Do reasonable people expect the mail addresses they upload into a public forum to remain out of the hands of spammers?
> and the archive API may rate-limit requests.
That’s never a bad idea.
Posted Aug 7, 2016 15:22 UTC (Sun)
by mina86 (guest, #68442)
[Link]
Another observation is that people who accidentally pushed their password upstream are now even more screwed. :P Not that it changes much: in such a case one should immediately change their password anyway.