What ever happened to chunkfs?
Posted Jun 26, 2009 22:54 UTC (Fri) by roelofs
In reply to: What ever happened to chunkfs?
Parent article: What ever happened to chunkfs?
Probably not "search engine", at least in the conventional Internet search engine sense.
ObDisclosure: I work for one...
They surely index a lot of data, but probably store less of it, ...
Hard to index it if you don't store it. ;-) Life isn't just an inverted index, after all; you need to be able to generate dynamic summaries on the fly.
... and wouldn't need long-term backups of most of it. After all, the
data that was indexed for searching should for the most part be still
there on the net to reindex, a process that likely wouldn't take much
longer than restoring a backup anyway, and regardless, by the time they
finished the restore, the data would be stale, so a live re-index is going
to be more effective anyway.
That's true as far as it goes, but we're not talking about long-term backups, either. Search engines are more about robustness: think replication, failover, and low (sub-second) latencies. How much data is involved depends on which part you're talking about (tracked [webmap] vs. crawled vs. indexed). But when the document count ranges from dozens to hundreds of billions, the node count ranges from tens of thousands to hundreds of thousands (as reported by Google quite a few years ago), and the failure rate is dozens to thousands of nodes per day (also reported by Google not too long ago, IIRC), you can probably see where disk-based petabyte storage might come into play, and why recrawling isn't a realistic option for point failures.
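For a rough feel of the scale, here is a back-of-envelope sketch. The document count, node count, and failure rate are picked from inside the ranges above; the average stored size per document and the replication factor are purely illustrative assumptions, not figures reported by anyone, and the function names are made up for this example.

```python
# Back-of-envelope sketch. Every constant here is an illustrative
# assumption, not a figure reported by any search engine.

def stored_petabytes(documents, avg_doc_bytes, replicas):
    """Raw disk needed to keep `replicas` copies of every document, in PB (10**15 bytes)."""
    return documents * avg_doc_bytes * replicas / 10**15

def daily_loss_terabytes(total_pb, nodes, failed_nodes_per_day):
    """Data held on the nodes that fail in one day, assuming an even spread, in TB."""
    return total_pb * 1000 * failed_nodes_per_day / nodes

AVG_DOC_BYTES = 20_000  # assumed: ~20 KB stored per page
REPLICAS = 3            # assumed replication factor

# 100 billion documents, 100,000 nodes, 100 node failures per day
total = stored_petabytes(100 * 10**9, AVG_DOC_BYTES, REPLICAS)
lost = daily_loss_terabytes(total, nodes=100_000, failed_nodes_per_day=100)
print(f"{total:.1f} PB stored, {lost:.1f} TB on failed nodes per day")
```

With those made-up inputs the result is 6 PB on disk and roughly 6 TB sitting on failed nodes every day. Change the assumptions and the numbers move, but the order of magnitude is the point: that much data is far easier to re-replicate from surviving copies than to recrawl.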