Wait, what?

Posted Oct 13, 2009 22:36 UTC (Tue) by cjb (guest, #40354)
In reply to: Wait, what? by nybble41
Parent article: WikiReader: OpenMoko's "Project B"

> Wikipedia is *huge*; just the raw English articles in pure HTML really do take up some 200GB in uncompressed form. I actually had to create a loopback filesystem image to hold it, as my normal root filesystem, created with the default settings, didn't even have enough inodes for that many files.

The technique they're using, which is also the one we used for our offline Wikipedia snapshot at OLPC, is to keep all of the content in a single compressed archive, together with an index mapping each article title to a block number and a tool that quickly decompresses only the requested block from the archive.
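
As a rough illustration of that layout (not the actual WikiReader or OLPC code), here is a minimal Python sketch. Everything in it is an assumption made for the example: the comment does not say which compressor or index format was used, so zlib stands in for the compressor, and the names build_archive, read_article and BLOCK_SIZE are hypothetical. The index here also records an offset and length within the block, which goes slightly beyond the title-to-block-number mapping described above.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical target size of one uncompressed block


def build_archive(articles):
    """Pack {title: text} into independently compressed blocks.

    Returns (archive, block_offsets, title_index), where
    block_offsets[n] is (start, length) of compressed block n and
    title_index[title] is (block_number, offset_in_block, length).
    """
    archive = bytearray()
    block_offsets = []
    title_index = {}
    pending = bytearray()  # uncompressed contents of the block being filled

    def flush():
        # Compress the pending block on its own so it can be read alone later.
        if not pending:
            return
        compressed = zlib.compress(bytes(pending))
        block_offsets.append((len(archive), len(compressed)))
        archive.extend(compressed)
        pending.clear()

    for title, text in articles.items():
        data = text.encode("utf-8")
        # Start a new block if this article would overflow the current one.
        if pending and len(pending) + len(data) > BLOCK_SIZE:
            flush()
        title_index[title] = (len(block_offsets), len(pending), len(data))
        pending.extend(data)
    flush()
    return bytes(archive), block_offsets, title_index


def read_article(archive, block_offsets, title_index, title):
    """Decompress only the block holding `title` and slice the article out."""
    block_no, offset, length = title_index[title]
    start, size = block_offsets[block_no]
    block = zlib.decompress(archive[start:start + size])
    return block[offset:offset + length].decode("utf-8")
```

Looking up one article then costs a dictionary lookup plus decompressing a single block, regardless of how large the whole archive is. A real implementation would keep the archive and index on disk and seek to the block rather than holding everything in memory.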



Wait, what?

Posted Oct 14, 2009 21:11 UTC (Wed) by nybble41 (subscriber, #55106)

Right, I've seen it done that way. However, I wanted to be able to access the articles as separate files without first pre-processing the archive to create a block index and writing a custom FUSE adapter to extract the files on demand. SquashFS is similar to an indexed archive, except that (a) it's more structured, (b) it's a more general solution, and (c) you don't need special software to read the filesystem image, as SquashFS is available by default in recent Linux kernels (with backports available for older ones).

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds