
Wait, what?


Posted Oct 13, 2009 21:45 UTC (Tue) by nybble41 (subscriber, #55106)
In reply to: Wait, what? by popey
Parent article: WikiReader: OpenMoko's "Project B"

Mine was also imageless and English-only. I imagine this device just stores the raw text, perhaps with some short formatting codes, which would save space over full HTML pages. (Not all that much, however, given that both versions are compressed.)

Wikipedia is *huge*; just the raw English articles in pure HTML really do take up some 200GB in uncompressed form. I actually had to create a loopback filesystem image to hold it, as my normal root filesystem, created with the default settings, didn't even have enough inodes for that many files.
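
For anyone wanting to check for the inode problem before unpacking millions of small files, here is a minimal sketch using Python's os.statvfs (Unix only). The function name inode_report and the files_needed parameter are made up for illustration. Some filesystems allocate inodes dynamically and report zero totals, so treat the result as a hint rather than a guarantee.

```python
import os


def inode_report(path, files_needed):
    """Compare the free inodes on the filesystem holding `path` with a
    planned file count. `files_needed` is the caller's own estimate."""
    st = os.statvfs(path)
    return {
        "total": st.f_files,       # inodes the filesystem was created with
        "available": st.f_favail,  # inodes free for unprivileged users
        # Filesystems that report 0 total inodes allocate them on demand,
        # so this comparison is only meaningful when "total" is nonzero.
        "enough": st.f_favail >= files_needed,
    }


if __name__ == "__main__":
    # The file count here is an arbitrary example, not a measured figure.
    print(inode_report(".", files_needed=1_000_000))
```

If "enough" comes back False, the options are the ones described above: put the files in a separate filesystem image created with more inodes, or avoid one-file-per-article storage altogether.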



Wait, what?

Posted Oct 13, 2009 22:36 UTC (Tue) by cjb (guest, #40354)

> Wikipedia is *huge*; just the raw English articles in pure HTML really do take up some 200GB in uncompressed form. I actually had to create a loopback filesystem image to hold it, as my normal root filesystem, created with the default settings, didn't even have enough inodes for that many files.

The technique they're using, which is also the technique we used for our offline wikipedia snapshot at OLPC, is to have a single compressed archive containing all of the content, an index from article title into block number, and a tool for uncompressing (only) a specified block number from the archive quickly.
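
To make the idea concrete, here is a rough sketch of that layout in Python. It is not the actual WikiReader or OLPC code. The choice of zlib, the block size, and the names build_archive and read_article are all assumptions made for illustration. The index here also records each article's offset and length within its block, which goes slightly beyond a bare title-to-block-number mapping.

```python
import io
import zlib

BLOCK_SIZE = 64 * 1024  # assumed uncompressed block size, chosen arbitrarily


def build_archive(articles, block_size=BLOCK_SIZE):
    """Pack (title, text) pairs into independently compressed blocks.

    Returns (archive_bytes, block_table, title_index), where
    block_table[n] is (offset, size) of compressed block n in the archive
    and title_index[title] is (block_number, start_in_block, length).
    """
    archive = io.BytesIO()
    block_table = []
    title_index = {}
    pending = bytearray()  # uncompressed contents of the block being filled

    def flush():
        if not pending:
            return
        compressed = zlib.compress(bytes(pending))
        block_table.append((archive.tell(), len(compressed)))
        archive.write(compressed)
        pending.clear()

    for title, text in articles:
        data = text.encode("utf-8")
        # The block currently being filled will get the next block number.
        title_index[title] = (len(block_table), len(pending), len(data))
        pending.extend(data)
        if len(pending) >= block_size:
            flush()
    flush()
    return archive.getvalue(), block_table, title_index


def read_article(archive, block_table, title_index, title):
    """Decompress only the block that holds `title` and slice the text out."""
    block_no, start, length = title_index[title]
    offset, size = block_table[block_no]
    block = zlib.decompress(archive[offset:offset + size])
    return block[start:start + length].decode("utf-8")
```

Larger blocks compress better but cost more time and memory per lookup, since the whole block has to be decompressed to read one article.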

Wait, what?

Posted Oct 14, 2009 21:11 UTC (Wed) by nybble41 (subscriber, #55106)

Right, I've seen it done that way. However, I wanted to be able to access the articles as separate files without first pre-processing the archive to create a block index and writing a custom FUSE adapter to extract the files on demand. SquashFS is similar to an indexed archive, except that (a) it's more structured, (b) it's a more general solution, and (c) you don't need special software to read the filesystem image, as SquashFS is available by default in recent Linux kernels (with backports available for older ones).

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds