LWN: Comments on "Preserving the global software heritage" https://lwn.net/Articles/693471/ This is a special feed containing comments posted to the individual LWN article titled "Preserving the global software heritage". Mon, 03 Nov 2025 01:19:06 +0000
Preserving the global software heritage https://lwn.net/Articles/696623/ mina86 <div class="FormattedComment"> I wonder how they're planning to cope with projects whose original GitHub repository stopped being maintained and which are now developed in a “forked” repository. Perhaps the long-term solution is to store everything? <br> <p> Another observation is that people who accidentally pushed their password upstream are now even more screwed. :P Not that one shouldn't immediately change their password in such a case anyway. <br> </div> Sun, 07 Aug 2016 15:22:33 +0000
Preserving the global software heritage https://lwn.net/Articles/696569/ nix <blockquote> we looked and found thousands of different file names (not even considering the full paths) for the textual version of the GPL3 license, many of which appear to be randomly generated names. </blockquote> I'm curious as to just what on earth might be choosing to rename a COPYING file to a randomly-generated name. This isn't Subversion text-base/ directories or something, is it? Fri, 05 Aug 2016 19:41:43 +0000
Preserving the global software heritage https://lwn.net/Articles/695254/ zack <div class="FormattedComment"> <font class="QuotedText">&gt; So all revisions are stored as separate files? 
One would think storing differences (with RCS for example) would make more sense for textual source files, to greatly reduce the space usage (better than zipping the versions separately).</font><br> <p> It's in fact a trade-off between compressing well at the individual "project" scale, and compressing well across tens of millions of "projects". For now, we've chosen the large-scale end of that spectrum, and it's working very well for us; we'll see how it goes in the long run.<br> <p> As a first approximation, our low-level file storage is a content-addressable storage, which we use to deduplicate files coming from any software origin we track. If some content is pointed to by a file called foo.txt in one project and bar.txt in another, the content will be stored only once. In addition to the storage we do have metadata remembering all file names and paths, but the low-level storage is completely agnostic to that. That's useful to trivially deduplicate across different software origins, but it also means that when a new file (or a new version of a file) comes in, the content storage won't know what to diff it against to store it more efficiently. Unless you go deeper and do stuff like sub-file block-level deduplication, which we currently don't do, but that (at least architecturally) won't be hard to add.<br> <p> Git itself, which at its bottom level is a content-addressable storage, has similar problems when optimizing the size of packfiles. It gets around them with heuristics based on file names: "these two blobs are pointed to by the same path, let's store one of them as a diff against the other one". But Git has it easier, because filenames in a single repository are very stable over time and renames are local. 
At our scale the same blob will be pointed to by millions of different paths; as a fun fact, we looked and found thousands of different file names (not even considering the full paths) for the textual version of the GPL3 license, many of which appear to be randomly generated names.<br> <p> Yes, we could try some smart heuristics to "cluster" together contents that are similar by some metric, and store contents belonging to the same cluster as diffs against some base version, but it didn't seem worth the extra design complexity.<br> <p> <font class="QuotedText">&gt; Anyway, this is a fantastic project, I wish it luck.</font><br> <p> Thanks :-)<br> </div> Sun, 24 Jul 2016 16:46:30 +0000
Preserving the global software heritage https://lwn.net/Articles/694555/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Copyright is inherently controversial. It's supposed to To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries, but both extremities don't work (or so it seems).</font><br> <p> Except that is a minority view. It's the justification used in the American Constitution, but that doesn't apply to most of the world. It's based on the Queen Anne act (can't remember the details) which was meant to provide a monopoly for printers.<br> <p> And it's been shown that that "limited times" should be about 10 years - nearly all the value in almost any work will have been extracted in that ten years.<br> <p> I can accept extending it for the authors, but the current system is just totally unjustifiable ...<br> <p> Cheers,<br> Wol<br> </div> Thu, 14 Jul 2016 23:37:34 +0000
Preserving the global software heritage https://lwn.net/Articles/694156/ flussence <div class="FormattedComment"> I'm at a loss as to what the real security issue of weak hashes on a public dataset is. 
Can you give examples?<br> </div> Mon, 11 Jul 2016 22:09:27 +0000
Preserving the global software heritage https://lwn.net/Articles/694064/ hkario <div class="FormattedComment"> the problem is that malicious users can create SHA-1 collisions, and RIPEMD-160 is not much better (yes, it moves the problem a few years into the future, but it does not eliminate it)<br> <p> you simply should not use any kind of 160-bit hash nowadays, especially for a project that is just being deployed<br> </div> Mon, 11 Jul 2016 14:22:50 +0000
Preserving the global software heritage https://lwn.net/Articles/694048/ zack <div class="FormattedComment"> <font class="QuotedText">&gt; SHA-1, while still absolutely enough for this kind of application, seems a strange choice today. Is git compatibility an issue?</font><br> <p> To clarify, we offer only SHA1 as a lookup mechanism in the current (very minimal for now) Web UI, but we do not rely on the assumption that we will not encounter SHA1 collisions in the wild. 
(Even though I personally do agree that SHA1 is still absolutely enough for this kind of application, we are trying to be future-proof and we know we will eventually need to move away from SHA1 even for integrity checking purposes.)<br> <p> Internally in our DB we currently use 3 kinds of checksums: SHA1, SHA2 (256), and "salted" SHA1 (à la git hash-object); we do cross checks to spot collisions on a single one of them.<br> <p> We would like to add SHA3 to the mix (possibly dropping SHA2), but for that we were waiting for a stable SHA3 implementation to land in Python 3.x (we're currently on 3.4).<br> <p> Hope this clarifies.<br> /me, wearing his Software Heritage hat<br> </div> Mon, 11 Jul 2016 10:33:19 +0000
Preserving the global software heritage https://lwn.net/Articles/693932/ flussence <div class="FormattedComment"> If size is a concern, RIPEMD-160 is the same size as SHA1 while being a bit less broken and widely available. SHA1 has hardware acceleration though, which is probably significant for a dataset this huge.<br> </div> Sat, 09 Jul 2016 00:48:04 +0000
Preserving the global software heritage https://lwn.net/Articles/693913/ robbe <div class="FormattedComment"> <font class="QuotedText">&gt; The web interface will likely obfuscate addresses</font><br> <p> Aren’t we past this stage already in the spamfight? Do reasonable people expect the mail addresses they upload into a public forum to remain out of the hands of spammers?<br> <p> <font class="QuotedText">&gt; and the archive API may rate-limit requests. 
</font><br> <p> That’s never a bad idea.<br> </div> Fri, 08 Jul 2016 21:07:00 +0000
Preserving the global software heritage https://lwn.net/Articles/693912/ robbe <div class="FormattedComment"> Oh, interesting …<br> /me runs sha1sum on <a href="https://github.com/jmechner/Prince-of-Persia-Apple-II/blob/master/01%20POP%20Source/Source/GRAFIX.S">https://github.com/jmechner/Prince-of-Persia-Apple-II/blo...</a><br> 9b870d5108539a195401b611061e188cd1d1411d, yep, it’s already in the heritage archive. <br> </div> Fri, 08 Jul 2016 21:00:26 +0000
Preserving the global software heritage https://lwn.net/Articles/693911/ robbe <div class="FormattedComment"> sha-2 256, I guess. But that would also bloat their postgres DB…<br> </div> Fri, 08 Jul 2016 20:26:39 +0000
Preserving the global software heritage https://lwn.net/Articles/693909/ khim <p>I think eventually they will. They are veeeeery slow, but such things <b>are</b> getting published. <a href="http://www.computerhistory.org/atchm/microsoft-ms-dos-early-source-code/">MS DOS here</a>, <a href="https://github.com/jmechner/Prince-of-Persia-Apple-II">Prince of Persia there</a>… I only wish they would do it somewhat earlier… I'm pretty sure so much is already lost…</p> Fri, 08 Jul 2016 19:47:39 +0000
Preserving the global software heritage https://lwn.net/Articles/693907/ khim <p>Copyright is inherently controversial. It's supposed to <i>To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries</i>, but both extremities don't work (or so it seems). If these “limited Times” are too short then apparently nobody will bother to create anything (although I'm not sure if anyone ever tested that theory). 
If that time is pushed to millions of years then all the “Writings and Discoveries” would eventually become unavailable to anyone who does not want to spend exorbitant sums of money.</p> <p>Libraries (both physical and digital) just sit at the center of that whole dilemma…</p> Fri, 08 Jul 2016 19:43:36 +0000
Preserving the global software heritage https://lwn.net/Articles/693879/ jospoortvliet <div class="FormattedComment"> My first thought was... Will MS, Apple and others contribute ancient versions of Windows (e.g. 1.0, 1.1), DOS (1-5 perhaps), Mac OS and so on...???<br> </div> Fri, 08 Jul 2016 15:50:39 +0000
Preserving the global software heritage https://lwn.net/Articles/693856/ amarao <div class="FormattedComment"> I do not understand. They say 'close electronic libraries, they infringe our precious copyright', and this is 'piracy' and a 'bad thing'. Next they say 'preserve culture' and 'give access to stored media', and this is a good thing and so on.<br> <p> Why not legalize those libraries in the first place? An amazing collection of existing books, available for search, with free access for everyone who needs it. But no, this is 'piracy'.<br> <p> I do not understand you, mankind species.<br> </div> Fri, 08 Jul 2016 13:12:31 +0000
Preserving the global software heritage https://lwn.net/Articles/693854/ smitty_one_each <div class="FormattedComment"> "seems a strange choice today"<br> <p> What else might one recommend?<br> </div> Fri, 08 Jul 2016 12:50:11 +0000
Preserving the global software heritage https://lwn.net/Articles/693814/ eru <i>All of the imported archives are stored as flat files in a standard filesystem, including all of the revisions of each file.</i> <p> So all revisions are stored as separate files? 
One would think storing differences (with RCS for example) would make more sense for textual source files, to greatly reduce the space usage (better than zipping the versions separately). It's even possible to search for strings inside a set of versions stored in this format without unpacking the file. <p> Anyway, this is a fantastic project, I wish it luck. Fri, 08 Jul 2016 07:18:24 +0000
Preserving the global software heritage https://lwn.net/Articles/693811/ robbe <div class="FormattedComment"> <font class="QuotedText">&gt; Users can search for specific files by their SHA-1 hashes, but</font><br> <font class="QuotedText">&gt; cannot browse.</font><br> To be specific: all you can get currently is a yes/no answer to the question: „is that SHA-1 hash contained in the archive?“<br> <p> (It was not clear to me.)<br> <p> SHA-1, while still absolutely enough for this kind of application, seems a strange choice today. Is git compatibility an issue?<br> </div> Fri, 08 Jul 2016 06:51:35 +0000