Leading items
Preserving the global software heritage
The Software Heritage initiative is an ambitious new effort to amass an organized, searchable index of all of the software source code available in the world (ultimately, including code released under free-software licenses as well as code that was not). Software Heritage was launched on June 30 with a team of just four employees but with the support of several corporate sponsors. So far, the Software Heritage software archive has imported 2.7 billion files from GitHub, the Debian package archive, and the GNU FTP archives, but that is only the beginning.
In addition to the information on the Software Heritage site, Nicolas Dandrimont gave a presentation about the project on July 4 at DebConf; video [WebM] is available. In the talk, Dandrimont noted that software is not merely pervasive in the modern world, but it has cultural value as well: it captures human knowledge. Consequently, it is as important to catalog and preserve as are books and other media—arguably more so, because electronic files and repositories are prone to corruption and sudden disappearance.
Thus, the goal of Software Heritage is to ingest all of the software source code available, index it in a meaningful way, and provide front-ends for the public to access it. At the beginning, that access will take the form of searching, but Dandrimont said the project hopes to empower research, education, and cultural analysis in the long term. There are also immediate practical uses for a global software archive: tracking security vulnerabilities, assisting in license compliance, and helping developers discover relevant prior art.
The project was initiated by Inria, the French Institute for Research in Computer Science and Automation (which has a long history of supporting free-software development), and as of launch time it has picked up Microsoft and Data Archiving and Networked Services (DANS) as additional sponsors. Dandrimont said that the intent is to grow Software Heritage into a standalone non-profit organization. For now, however, there is a small team of full-time employees working on the project, with the assistance of several interns.
The project's servers are currently hosted at Inria, utilizing about a dozen virtual machines and a 300TB storage array. At the moment, there are backups at a separate facility, but there is not yet a mirror network. The archive itself is online, though it is currently accessible only in limited form. Users can search for specific files by their SHA-1 hashes, but cannot browse.
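Looking up a file this way keys on the SHA-1 of the file's raw bytes, not its name or path. The archive's exact hashing and lookup scheme is not documented here, so the helper below is only an illustrative sketch of the idea:

```python
import hashlib

def content_sha1(data: bytes) -> str:
    """Return the hex SHA-1 digest of a file's raw contents."""
    return hashlib.sha1(data).hexdigest()

# Identical files imported from different origins hash to the same
# identifier, so they collapse into a single archive entry.
print(content_sha1(b"hello world\n"))
# → 22596363b3de40b06f981fb85d82312e8c0ed511
```

A user who has a copy of a file can thus check whether the archive holds it by hashing the local copy and querying for the digest.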
Indices
It does not take much contemplation to realize that Software Heritage's stated goal of indexing all available software is both massive in raw numbers and complicated by the vast assortment of software sources involved. Software Heritage's chief technology officer (CTO) is Stefano Zacchiroli, a former Debian Project Leader who has recently devoted his attention to Debsources, a searchable online database of every revision of every package in the Debian archive.
Software Heritage is an extension of the Debsources concept (which, no doubt, had some influence in making the Debian archive one of the initial bulk imports). In addition to the Debian archive, at launch time the Software Heritage archive also included every package available through the GNU project's FTP site and an import of all public, non-fork repositories on GitHub. Dandrimont mentioned in his talk that the Software Heritage team is currently working with Google to import the Google Code archive and with Archive Team to import its Gitorious.org archive.
Among the three existing sources, the GitHub data set is the largest, accounting for 22 million repositories and 2.6 billion files. For comparison, in 2015, Debsources was reported to include 11.7 million files in just over 40,000 packages. Google Code included around 12 million projects and Gitorious around 2 million.
But those collections account for just a handful of sites where software can be found. Moving forward, Software Heritage wants to import the archives for the other public code-hosting services (like SourceForge), every Linux distribution, language-specific sites like the Python Package Index, corporate and personal software repositories, and (ultimately) everywhere else.
Complicating the task is that this broad scope, by its very nature, will pull in a lot of software that is not open-source or free software. In fact, as Zacchiroli confirmed in an email, the licensing factor is already a hurdle, since so many repositories have no licensing information:
The way I like to think about this is: we want to protect the entire Software Commons. Free/Open Source Software is the largest and best curated part of it; so we want to protect all of FOSS. Given the long-term nature of Software Heritage, we simply go for all publicly available source code (which includes all of FOSS but is larger), as it will become part of the Software Commons one day too.
For now, Zacchiroli said, the Software Heritage team is focused on finalizing the database of the current software and on putting a reliable update mechanism in place. GitHub, for example, is working with the team to enable ongoing updates of the already imported repositories, as well as adding new repositories as they are created. The team is also writing import tools for ingesting files from a variety of version-control systems, old and new.
Access
Although the Software Heritage archive's full-blown web interface has yet to be launched, Dandrimont's talk provided some details on how it will work, as well as how the underlying stack is designed.
All of the imported archives are stored as flat files in a standard filesystem, including all of the revisions of each file. A PostgreSQL database tracks each file by its SHA-1 hash, with directory-level manifests of which files are in which directory. Furthermore, each release of each package is stored in the database as a directed acyclic graph of hashes, and metadata is tracked on the origin (e.g., GitHub or GNU) of each package and various other semantic properties (such as license and authorship). At present, he said, the archive consists of 2.7 billion files occupying 120TB, with the metadata database taking up another 3.1TB. "It is probably the biggest distributed version-control graph in existence," he added.
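That directory-manifest structure is essentially a Merkle DAG: a directory's identifier is a hash over the names and identifiers of its entries, so identical subtrees share a single node no matter how many packages contain them. The serialization below is invented for illustration; the talk did not describe Software Heritage's actual on-disk format:

```python
import hashlib

def file_id(data: bytes) -> str:
    # Files are identified by the SHA-1 of their contents.
    return hashlib.sha1(data).hexdigest()

def dir_id(entries: dict) -> str:
    # A directory's identifier hashes its sorted (name, child-id) pairs,
    # so two directories with identical contents get the same identifier
    # and are stored only once (hypothetical manifest format).
    manifest = "".join(f"{name}:{oid}\n" for name, oid in sorted(entries.items()))
    return hashlib.sha1(manifest.encode()).hexdigest()

readme = file_id(b"A sample project\n")
tree = dir_id({"README": readme})
release = dir_id({"src": tree, "COPYING": file_id(b"GPL text here")})
```

Releases, in turn, point at directory nodes, which is what makes the whole archive one large version-control-style graph rather than a flat pile of tarballs.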
Browsing through the web interface and full-text searching are the next features on the roadmap. After that comes downloading, including an interface to grab projects with git clone. Further out, the project's plans are less specific, in part because it hopes to attract input from researchers and users to help determine which features are of interest.
At the moment, he said, the storage layer is fairly basic in its design. He noted that the raw number of files "broke Git's storage model" and that the small file sizes (3kB on average) posed their own set of challenges. He then invited storage experts to get involved in the project, particularly as the team starts exploring database replication and mirroring. The code used by the project itself is free software, available at forge.softwareheritage.org.
Because the archive contains so many names and email addresses, Zacchiroli said that steps were being taken to make it difficult for spammers to harvest addresses in bulk, while still making it possible for well-behaved users to access files in their correct form. "There is a tension here," he explained. The web interface will likely obfuscate addresses and the archive API may rate-limit requests.
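Neither measure has been specified publicly; as a purely hypothetical sketch, per-client rate limiting of the API could use a classic token bucket, which lets well-behaved users browse normally while slowing bulk harvesters:

```python
import time

class TokenBucket:
    """Hypothetical per-client limiter: sustain `rate` requests per
    second, with bursts of up to `capacity` requests."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(10)]  # a burst of 10 requests
```

In this sketch, a burst exhausts the bucket after five requests; further requests are refused until tokens trickle back at one per second.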
The project clearly has a long road ahead of it; in addition to the large project-hosting sites and FTP archives, collecting all of the world's publicly available software entails connecting to thousands if not millions of small sites and individual releases. But what Software Heritage is setting out to do seems to offer more value than a plain "file storage" archive like those offered by Archive Team and the Internet Archive. Providing a platform for learning, searching, and researching software has the potential to attract more investments of time and financial resources, two quantities that Software Heritage is sure to need in the years ahead.
Mozilla Servo arrives in nightly demo form
The Firefox codebase dates back to 2002, when the browser was unbundled from the Mozilla Application Suite—although much of its architecture predates even that split. Major changes have been rare over the years, but recently several long-running Mozilla efforts have started to see the light of day. The most recent of these is the Servo web-rendering engine, for which the first standalone test builds were released on June 30. Although the Servo builds are not full-blown browsers, they enable users to download and test the engine on live web sites for the first time. Servo is designed with speed and concurrency in mind, and if all goes according to plan, the code may work its way into Firefox in due course.
Servo, for those unfamiliar, is a web rendering engine—roughly analogous to Gecko in the current Firefox architecture and WebKit or Blink in other browsers. It does not execute JavaScript, but it is responsible for interpreting HTML and CSS and for performing the vast majority of page-layout operations.
The interesting facets of Servo are that it is written to be extensively parallel in execution and that it is designed to be intrinsically secure against the most common security bugs that plague browsers (and other application software). This security comes by virtue of being developed in the Rust language, which has a variety of built-in memory-safety features. Rust also offers concurrency features that Servo can leverage to do parallel page rendering. As a practical matter, this should enable faster rendering on today's multi-core hardware.
In 2015, we covered a talk at LinuxCon Japan by Lars Bergstrom and Mike Blumenkrantz that explored Servo's design. In that talk, the two speakers cautioned that Servo is a research project and that it is not scheduled to be a drop-in replacement for Gecko—at least, not on the desktop—although they did indicate that certain parts of Servo may be migrated into Gecko.
The June 30 announcement marked the release of a series of pre-built binaries that wrap the Servo engine with a minimalist browser GUI (based on Browser.html). The binaries are automatically built nightly, and are initially provided for Mac OS X and x86_64 Linux only, although there are tracking bugs available for users to see when Windows and Android builds will be available. It is also possible to build the nightlies from source; the Servo wiki includes a page about building for Linux on ARM.
Because the nightly builds are not full browsers, the interface leaves out most of the traditional browser chrome. Instead, the browser's start page presents a set of eight tiles linking to some well-known web sites, four tiles linking to graphics-intensive demos, and a URL entry bar that doubles as a DuckDuckGo search bar, plus forward and back buttons and a "new tab" button. The same start page is accessible through other browsers. In some non-scientific testing, it is easy to see that Servo loads certain pages faster than recent Firefox releases; Wikipedia and Hacker News stand out among the eight tiles, for instance. On my four-core desktop machine, the difference was about twofold, although to provide a truly fair test, one should compare against Gecko (or another engine) with ad-blocking and tracker-blocking disabled, and with a clear cache.
Or, to be more precise, one could say that Servo begins to show the page sooner than Firefox. In many cases, Firefox takes a long pause to fully lay out the page content before anything is displayed, while Servo begins placing page elements on screen almost immediately, even if it still takes several additional seconds before the page-load progress bar at the top of the window indicates success. That is in keeping with Bergstrom and Blumenkrantz's comments about slow-to-load sites in Firefox: many pages are built with frames and <div> elements that are fetched separately, so loading them concurrently is where much of the time is saved.
The speed difference on the graphics demos was more drastic; the Firefox 47 build I used could barely even animate the Moire and Transparent Rectangle demos, while they ran smoothly on Servo.
The engine already provides good coverage of older HTML and CSS elements, with a few exceptions (frames and form controls, for example). Newer web specifications, including multimedia and web-application-driven standards like Service Workers, tend to be less fully supported. Here again, the Servo wiki provides a page to track the project's progress.
Based on these early test builds, Servo looks promising. There were several occasions where it locked up completely, which would not be too surprising on any nightly build. But it is encouraging to see that it is already faster at rendering certain content than Gecko—and well before the project turns its attention to optimization.
Lest anyone get too excited about Servo's potential to replace Gecko, for the time being there is no such plan on the books. But the plan to patch some Servo components into Gecko or other Firefox modules still appears to be on the roadmap. Tracking bugs exist for a few components, such as Servo's URL parser and CSS style handling. The plan also notes, however, that Servo is being looked at as a replacement for Gecko on Android, and as a reusable web-rendering engine, a use case Mozilla has not addressed for quite some time.
Although that work still appears to be many releases removed from end users, it is worth noting that Firefox has moved forward on several other long-term projects in the past few months. In June, the first Firefox builds using Electrolysis, Mozilla's project to refactor Firefox for multi-process operation, were made available in the Beta channel. Recent Firefox releases have also successfully moved from the old extensions APIs to WebExtensions. Both of those changes are substantial, and both (like Servo) should provide improved security and performance.
Over the years, Mozilla has taken quite a bit of criticism for the aging architecture of Firefox—although, one must point out that Mozilla also takes quite a bit of criticism whenever it changes Firefox. If anything, the new Servo demos provide an opportunity for the public to see that some of Mozilla's research projects can have tangible benefits. One way or another, Firefox will reap benefits from Servo, as may other free-software projects looking for a modern web-rendering engine.
A leadership change for nano
The nano text editor has a long history as a part of the GNU project, but its lead developer recently decided to sever that relationship and continue the project under its own auspices. As often happens in such cases, the change raised concerns from many in the free-software community, and prompted questions about maintainership and membership in large projects.
Nano past
Nano was created in 1999 as a GPL-licensed replacement for the Pico editor, which was originally a component of the Pine email client. Pico and Pine were developed at the University of Washington and, at the time, were distributed under a unique license that was regarded as incompatible with the GPL. Nano's creator, Chris Allegretta, formally moved the project into GNU in 2001.
Like Pico, nano is a text-mode editor optimized for use in terminal emulators. As such, it has amassed a healthy following over the years, particularly as a lightweight alternative to Emacs and vi. Often when one logs into a remote machine to make a small change to a text configuration file, nano can seem to be the wise choice for editing; it loads and runs quickly, is free of extraneous features, and the only keyboard commands one needs to alter a file are helpfully displayed right at the bottom of the window.
But nano has not stayed still. Over the years, it has gained new features like color syntax highlighting, automatic indentation, toggle-able line numbering, and so on. Other programmers have led nano's development for most of its recent history, although Allegretta has served as the GNU package maintainer for the past several years (after having taken a few years off from that duty in the mid-2000s).
Nano present
Over those past few years, Benno Schulenberg has made the most code contributions (by a comfortable margin), and as Allegretta determined that he no longer had the time or inclination to act as maintainer, conversation naturally turned to a formal transition. Much of that conversation took place privately, however, which may have led to the confusion that erupted in late June when the nano site seemed to proclaim that the project was no longer part of GNU.
Specifically, the change went public on June 17, which was the date of the 2.6.0 release; the release notes on the News page signaled the departure.
In addition, the ASCII-art logo on the project home page changed from reading "The GNU nano Text Editor Homepage" to "The nano Text Editor homepage" (see the Wayback Machine's archived copy for comparison). The code was also changed to remove the GNU branding, in a June 13 commit by Schulenberg.
Within a few days, the change had been noticed by people outside the project; discussion threads popped up on Hacker News (HN) and Reddit. Those discussions took the move to be an acrimonious fork by Schulenberg, an interpretation perhaps fueled by GNU project member Mike Gerwitz's comment early on that "Nano has _not_ left the GNU project" and that "Benno decided to fork the project. But he did so with hostility: he updated the official GNU Nano website, rather than creating a website for the fork." Gerwitz reported that the incident was fallout from a dispute between Allegretta and Schulenberg. Specifically, Allegretta had wanted to add Schulenberg as a co-maintainer, but Schulenberg had refused to accept the GNU project's conditions of maintainership.
As it turns out, though, the sequence of events that led up to the 2.6.0 release was more nuanced. In May, Schulenberg had asked to roll a new release incorporating several recent changes. Allegretta was slow to respond, and cited several concerns with recent development processes, starting with the fact that GNU requires outside contributors to assign copyright to the Free Software Foundation (FSF) when the copyright on the package in question is already held by the FSF, which was the case for nano.
Developers working on GNU packages are not, in general, required to assign copyright to the FSF (although the FSF encourages copyright assignment in order to better enable license-compliance enforcement efforts). Schulenberg was unwilling to make the FSF copyright assignment (or any other copyright assignment) or to take on other formal GNU maintainer duties. But the crux of the issue for Allegretta seemed to be that the project was stuck in a state of noncompliance: as a GNU project, it should adhere to the GNU project's rules, but in practice it had not done so for years.
In the email linked-to above, Allegretta proposed that, if the active developers were not interested in following the GNU guidelines, the project could move from GNU Savannah to GitHub or another hosting service and drop the GNU moniker. He then reframed the discussion by starting a new mailing-list thread titled "Should nano stay a GNU program." In that email, he said it "is fine" if Schulenberg is not interested in following the GNU guidelines, but that "we just need to figure out a solution".
In reply, nano developers Mark Majeres, Jordi Mallach, David Ramsey, and Mike Frysinger all said that whether the project continued under the GNU banner would not impact their future participation (although Mallach and Ramsey indicated that staying with GNU would be their preference). In the end, Schulenberg made the commit that removed the GNU branding, and no one objected.
Nano future
After news of the change hit HN and Reddit, Allegretta posted a clarification on his blog, describing the project as "peacefully transitioning" to a new maintainer and clarifying that he, not Schulenberg, had redirected the project's domain name to point to the new web server. Schulenberg filed a support ticket at Savannah asking to have the nano project moved from the gnu to the nongnu subdomain.
There is still the lingering question of whether or not anyone at GNU will wish to continue to develop a GNU-hosted version of nano (and, if so, how downstream users would handle the naming conflict). But it appears to be a hypothetical concern. Although he is active in several GNU projects, it is not clear that Gerwitz is involved in the nano project, and the active nano maintainers all seem to have continued participating as before.
Ultimately, the entire series of events was over long before it became news. Allegretta handed maintainership duties to Schulenberg and the project changed its affiliation. But the various discussion threads on the topic make for interesting reading nonetheless. There seems to be a fair amount of lingering confusion about the GNU project's copyright-assignment practices and what it means for a project to be affiliated with GNU, as well as disagreement over what exactly the role of project maintainer is.
For instance, as HN commenters pointed out, if GNU has been the home of a project for a decade and a half, many would say that an individual forking the project should be obligated to change its name. Conversely, user zx2c4 pointed out that Schulenberg had made considerably more commits in recent years than anyone else. To a lot of participants in the Reddit and HN threads, that fact entitled Schulenberg to make unilateral calls about the direction of the project, even though someone else was the official maintainer.
Maintaining a project means more than merely checking in commits, of course—a fact that most free-software developers readily acknowledge—and, for the record, it is something that Schulenberg has proven quite comfortable doing in nano's recent past. But the brief public uproar over nano's transition does reveal that, for at least some portion of the community, the number of lines of code committed seems to count for more than formal membership in an umbrella project like GNU. Whether GNU will take any action to address that issue remains to be seen.
Page editor: Nathan Willis
