
Google, Reader, and hard lessons about migration

By Nathan Willis
July 10, 2013

As has been widely reported already, Google discontinued Reader, its RSS and Atom feed-reading tool, at the beginning of July. In the weeks preceding the shutdown, scores of replacement services popped up hoping to attract disgruntled Reader refugees. But most of them focused squarely on the news-delivery features of Reader; a closer look illustrates several additional lessons about the drawbacks of web services—beyond the simple question of where one's data is stored.

Takeout, again?

First, Google had advertised that users would be able to extract their account information from Reader ahead of the shutdown. But the reality is that the available data migration tools are often not all that they are cracked up to be, particularly when they are offered by the service provider. Reader had always allowed users to export their list of feed subscriptions in Outline Processor Markup Language (OPML) format, of course. But access to the rest of an account's Reader data required visiting Google Takeout, the company's extract-and-download service (which is run by a team within Google called the Data Liberation Front). Takeout allowed users to extract additional data like the lists of starred and shared items, notes attached to feed items, and follower/following information.
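As a rough illustration of what an OPML subscription export contains, the following Python sketch (standard library only) lists the feeds in such a file. The sample document, its URLs, and the function name `list_feeds` are invented for illustration; a real Reader export may nest folders differently or carry additional attributes.

```python
import xml.etree.ElementTree as ET


def list_feeds(opml_text):
    """Return (title, feed URL) pairs for every feed in an OPML document."""
    root = ET.fromstring(opml_text)
    feeds = []
    # Feeds are <outline> elements that carry an xmlUrl attribute;
    # outlines without one are folders that merely group feeds.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            title = outline.get("title") or outline.get("text") or ""
            feeds.append((title, url))
    return feeds


# Hypothetical sample data, not taken from a real export.
SAMPLE = """<opml version="1.0">
  <head><title>Example subscriptions</title></head>
  <body>
    <outline text="News" title="News">
      <outline text="Example News" title="Example News" type="rss"
               xmlUrl="http://example.org/news/rss"
               htmlUrl="http://example.org/"/>
    </outline>
    <outline text="Example Blog" type="rss"
             xmlUrl="http://example.com/feed.xml"/>
  </body>
</opml>"""

if __name__ == "__main__":
    for title, url in list_feeds(SAMPLE):
        print(title, url)
```

A list like this is all that most replacement services need for an import, which is why the subscription list moves easily while starred items, notes, and feed history do not.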

However, Takeout does not preserve the historical contents of subscribed feeds, the existence of which is one of the more valuable aspects of always accessing news items at a single location: it is what enables full-text and title search of cached entries. Obviously, there are copyright issues that could understandably make Google shy away from offering downloads of other sites' content—although it could be argued that the company was already retaining that content and offering access to it in a variety of products, from Reader's cache to the "cached" items in Google Search. In any event, in the weeks preceding the Reader shutdown, several tools sprang up to retrieve the cached item store, from the open source Reader Is Dead (RID) to the commercial (and Mac-only) Cloudpull.

Both Cloudpull and RID utilized the unpublished Reader API to fetch and locally store an account's entire feed history. By sheer coincidence, I stumbled across the existence of RID a few days before the shutdown deadline, and used it to successfully pull down several years' worth of feed items on June 30. The archive consumes about 30 GB of space (uncompressed), although about half of that is wasted on high-traffic feeds without any historic value, such as local weather updates and Craigslist categories.

For the rest, however, the backup is akin to a local Wayback Machine. Initially the RID project was working on its own web application called reader_browser to access and search these archives; that program is still under development with limited functionality at present, but in the first week of July the project rolled out a stop-gap solution called zombie_reader as well. zombie_reader starts a local web server on port 8074, and presents a clone of the old Reader interface using the cached archive as storage. The legality of the clone may be questionable, since it employs a modified copy of the Reader JavaScript and CSS. But there is little long-term value in developing it further anyway, since outside of search and browsing, few of the old application's features make sense for an archive tool. Developer Mihai Parparita is continuing to work on reader_browser and on an accompanying command-line tool.

The silo

Of course, maintaining a standalone archive of old news items puts an additional burden on the user; at some point the news is too old to be of sufficient value. A better long-term solution would be to merge the extracted data into a replacement feed reader. That illustrates another difficulty with migrating away from an application service provider: importing the extracted data elsewhere is problematic, if it is possible at all.

Copying in an OPML subscription list is no problem, of course, but other web-based feed-reader services will understandably not support a full history import (much less one 30GB in size). Self-hosted free software tools like ownCloud News and Tiny Tiny RSS are an option, although the official reception from Tiny Tiny RSS to such ideas has been less than enthusiastic. The Tiny Tiny RSS feature-request forum even lists asking for Google Reader features as a "bannable offense".

Outside contributors may still manage to build a working import tool for RID archives (there is one effort underway on the Tiny Tiny RSS forum). Regardless, the main factor that makes RID just a short-term fix is the fact that only those users who made an archive before Reader closed can use it. Once Google deactivated Reader, it was no longer possible to extract any more cached account data. That left quite a few confused users who did not complete their exports before the July 1 shutdown, and it puts a hard upper limit on the number of RID users and testers.

The reason archive export no longer works, quite simply, is that Google switched off the Reader API with the application itself. That is an understandable move, perhaps. But there is still another shutdown looming: even the ability to export basic information (i.e., OPML subscriptions) will vanish on July 15—which is a perplexingly short deadline, considering that users can still snag their old Google Buzz and Google Notebook data through official channels, several years after those services were shuttered. So despite the efforts of the Data Liberation Front, it seems, the company can still be arbitrarily unhelpful when it comes to specific services.

Why it still matters

The moral of the Reader shutdown (and resulting headaches) is that it is often impossible to predict which portions of your data are the valuable ones until you actually attempt to migrate away to a new service provider. Certainly Google Reader had a great many users who did not care about the ability to search through old feed item archives. But some did, and the limitations of the service's export functionality only brought that need to light when they tried to move elsewhere.

For the future, the obvious lesson is that one should not wait until a service is deactivated to attempt migration. It is easy to lapse into complacency and think that leaving Gmail will be simple if and when the day comes. But, as is the case with restoring from backups, Murphy's Law is liable to intervene in one form or another, and it is better to discover how in advance. There are certainly other widely-used Google services that exhibit the same problematic symptoms as Reader, starting with not allowing access to full data. Many of these services are for personal use only, but others are important from a business standpoint.

The most prominent example is probably Google Analytics, which is used for site traffic analysis by millions of domains. Analytics allows users to download summary reports, but not the raw numbers behind them. On the plus side, there are options for migrating the data into the open source program Piwik. However, without the original data there are limits to the amount and types of analysis that can be performed on the summary information alone. Most other Google products allow some form of export, but the options are substantially better when there is an established third-party format available, such as iCalendar. For services without clear analogues in other applications or providers (say, Google+ posts or AdWords accounts), generic formats like HTML are the norm, which may or may not be of immediate use outside of the service.

The Data Liberation Front is an admirable endeavor; no doubt, without it, the task of moving from one service provider to another would be substantially more difficult for many Google products. And the Reader shutdown is precisely the kind of major disruption that the advocates of self-hosted and federated network services (such as the Autonomo.us project) have warned free software fans about for years. But the specifics are instructive in this case as well: perhaps few Reader users recognized that the loss of their feed history would matter to them in time to export everything with RID, and perhaps more than a few are still unaware that Google Takeout will drop its Reader export functionality completely on July 15.

Ultimately, the question of how to maintain software freedom with web services divides people into several camps. Some argue that users should never use proprietary web services in the first place, but always maintain full control themselves. Others say that access to the data and the ability to delete one's account is all that really matters. The Autonomo.us project, for example, argues in its Franklin Street Statement that "users should control their private data" and that public data should be available under free and open terms. One could argue that Reader met both of those requirements, though. Consequently, if it signifies nothing else, Reader's shutdown illustrates that however admirable data portability conditions may be, those conditions are still complex ones, and there remains considerable latitude for their interpretation.



Right To Serve

Posted Jul 11, 2013 3:39 UTC (Thu) by filteredperception (guest, #5692) [Link] (1 responses)

obligatory "right to serve" spam link, don't mind me - http://lwn.net/Articles/556680/

Right To Serve

Posted Aug 10, 2013 19:52 UTC (Sat) by jrn (subscriber, #64214) [Link]

I do mind. What does this reply have to do with the article?

Google, Reader, and hard lessons about migration

Posted Jul 11, 2013 8:55 UTC (Thu) by gnuman (guest, #72460) [Link]

Another great rss reader: www.g2reader.com

Google, Reader, and hard lessons about migration

Posted Jul 13, 2013 16:55 UTC (Sat) by TRS-80 (guest, #1804) [Link] (2 responses)

Archive Team did a grab of cached reader items, I don't think they've been posted publicly yet though.

Google, Reader, and hard lessons about migration

Posted Jul 18, 2013 21:33 UTC (Thu) by ivank (guest, #91957) [Link] (1 responses)

> Archive Team did a grab of cached reader items, I don't think they've been posted publicly yet though.

All ~8800GB are up at http://archive.org/details/archiveteam_greader

The files are in WARC format, which can be read with several tools including hanzo's warc-tools. Others are listed at http://www.archiveteam.org/index.php?title=The_WARC_Ecosy...

For those who want to avoid downloading 8TB, the .cdx files in the "Archive Team Google Reader Grab" items can be used to construct an index of feed URL -> (WARC filename, .warc.gz seek index). This index can then be used to seek directly to a cached feed in a .warc.gz at archive.org using normal HTTP range requests.
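The approach described above might be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a tested tool for the actual grab: it assumes each .cdx file begins with a legend line whose first character is the field delimiter, and that the legend uses the conventional letters `a` (original URL), `g` (file name), `V` (compressed offset), and `S` (compressed record size). The sample lines and the function names are invented, and the URL column in the real files may need further processing to recover the feed URL.

```python
def build_index(cdx_lines):
    """Map original URL -> (WARC filename, offset, length) from CDX lines."""
    lines = iter(cdx_lines)
    header = next(lines).rstrip("\n")
    delim = header[0]                 # first character is the delimiter
    legend = header[1:].split(delim)  # e.g. ["CDX", "N", "b", "a", ...]
    if legend[0] != "CDX":
        raise ValueError("missing CDX legend line")
    cols = {letter: i for i, letter in enumerate(legend[1:])}
    index = {}
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        if len(fields) != len(cols):
            continue                  # skip blank or malformed lines
        index[fields[cols["a"]]] = (
            fields[cols["g"]],
            int(fields[cols["V"]]),
            int(fields[cols["S"]]),
        )
    return index


def range_header(offset, length):
    """Build the HTTP Range header covering one compressed record."""
    # Byte ranges are inclusive at both ends.
    return {"Range": "bytes=%d-%d" % (offset, offset + length - 1)}


# Hypothetical sample data in the assumed layout.
SAMPLE_CDX = [
    " CDX N b a m s k r M S V g",
    "com,example)/feed.xml 20130630120000 http://example.com/feed.xml "
    "text/xml 200 ABCDEF - - 2048 1024 example-00000.warc.gz",
]

if __name__ == "__main__":
    index = build_index(SAMPLE_CDX)
    name, offset, length = index["http://example.com/feed.xml"]
    print(name, range_header(offset, length))
```

Sending the resulting header with a request for the named .warc.gz file should return only that record's bytes, which can then be decompressed on their own if each record was written as a separate gzip member.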

Google, Reader, and hard lessons about migration

Posted Aug 10, 2013 12:28 UTC (Sat) by ersi (guest, #64521) [Link]

Just a short note on why ArchiveTeam uploaded the data to archive.org: Anyone can upload to archive.org! So please contribute the RID archive data you've downloaded to archive.org, just like ArchiveTeam did for their grabs.

Together, we can stitch some parts of history together again!

Google, Reader, and hard lessons about migration

Posted Jul 18, 2013 2:42 UTC (Thu) by mlinksva (guest, #38268) [Link]

> Some argue that users should never use proprietary web services in the first place, but always maintain full control themselves. Others say that access to the data and the ability to delete one's account is all that really matters. The Autonomo.us project, for example, argues in its Franklin Street Statement that "users should control their private data" and that public data should be available under free and open terms. One could argue that Reader met both of those requirements, though.
The Franklin Street Statement also says that the software that runs services should be released as free software. Reader definitely did not meet that requirement.

But hard lessons about migration remain even with the service software and a copy of all the relevant data. (Indeed, we could skip "service"; migration is hard period, and nobody has written software to migrate among lots of free software applications that have user state and one might want help switching between, eg photo managers and media players.)


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds