
What ever happened to chunkfs?

Posted Jun 26, 2009 10:15 UTC (Fri) by Duncan (guest, #6647)
In reply to: What ever happened to chunkfs? by roelofs
Parent article: What ever happened to chunkfs?

Probably not "search engine", at least in the conventional Internet search
engine sense. They surely index a lot of data, but probably store less of
it, and wouldn't need long-term backups of most of it. After all, the
data that was indexed for searching should for the most part be still
there on the net to reindex, a process that likely wouldn't take much
longer than restoring a backup anyway, and regardless, by the time they
finished the restore, the data would be stale, so a live re-index is going
to be more effective anyway. (Of course this says nothing about the other
non-search services such entities provide, many of which WILL need
backups.)

Rather, these huge petabyte class storage systems occur, based on my
reading on the topic, in a handful of "write mostly" situations. Of
course the "write mostly" bit is nearly a given: at that kind of data
volume, once past a certain point, reading the data back for further
processing simply isn't going to happen for at least the greatest
portion of it.

The archetypical example would be the various movie studios, which
have been primarily digital for many years now. As storage capacities
grew, they not only shot/generated and processed all that data
digitally, but stored it, and
not just the theater and "studio cut" editions, but the products generated
at each step of the process. Of course consumer resolutions keep
growing, and production resolutions are several times that, with
several more bits of color depth besides. And where a production is
entirely or primarily CGI, as the technology improves, so do the
detail and data size of the generated product. So they have several
factors all growing at (literally, for a number of them) geometric
rates. But the advantage is that just as they can remaster music, and
just as they've been "restoring" the movie classics for some time, now
it's all digital: with a "simple" retrieval of the originals from the
archived backups (not so simple given the mountain of data, even with
a good index), they can remaster from the bit-perfect originals.

AFAIK that was the original usage for petabyte class storage, and it
could well be the first usage for exabyte class systems as well. But
now that petabyte class storage is increasingly within reach of the
"common" corporation or government entity, it's actually coming to be
required by law there as well. Sarbanes-Oxley had the effect of
requiring many US companies to log vast amounts of information. Many
readers here are also no doubt familiar with the various ISP and other
data-logging requirements many nations have legislated or tried to,
with more lining up to try. Over time that's going to add up to
petabytes of information. Indeed, many in the technical community have
drawn a connection between drive sizes outgrowing the needs of the
typical consumer and the lobbying for the various mandatory
data-logging initiatives, alleging it's no accident these laws are
being passed just as drives get big enough that most ordinary
consumers no longer need to buy bigger ones every couple of years.

Obama's electronic medical records legislation will certainly add to this,
tho many medical entities likely already electronically archive vast
quantities of information for defensive legal purposes, if nothing else.

Then there's of course usage such as that of the Internet Archive,
which would certainly need backups, tho until I just checked
Wikipedia, I had no idea what their data usage was (3 petabytes as of
2009, growing at 100 terabytes/month, compared to 12 terabytes/month
of growth in 2003; that's apparently for the Wayback Machine alone,
not including their other archives).
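
Just to put that growth rate in perspective, here's a quick projection
in Python from the Wikipedia figures above, assuming (probably too
conservatively) that the 100 terabytes/month rate merely holds steady
rather than continuing to accelerate the way it did from 2003 to 2009:

    # Projection from the figures above: 3 PB in 2009, growing at
    # ~100 TB/month. Assumes the growth rate stays flat, which is
    # probably conservative given it went from 12 TB/month in 2003
    # to 100 TB/month in 2009.
    BASE_PB = 3.0
    GROWTH_TB_PER_MONTH = 100

    for years in (1, 3, 5):
        total_pb = BASE_PB + GROWTH_TB_PER_MONTH * 12 * years / 1000.0
        print("after %d year(s): ~%.1f PB" % (years, total_pb))

Even flat growth puts them near 10 petabytes within five years.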

The same goes (tho it is said to be smaller than the IA) for the
Library of Congress, and for various other similar sites. See the
Similar Projects section of the Wikipedia Internet Archive entry for
one list of such sites.

Then there are the various social sites, tho based on the MySpace
image archive torrent (17 GB, I torrented a copy) from a year or so
ago, they likely range in the terabytes, not petabytes.

But think of someone social-video based, like YouTube. Even tho they're
not archiving the per-product level of data the studios are archiving, and
what they /are/ archiving is heavily compressed, they're getting content
from a vastly LARGER number of submitters, and must surely be petabyte
class by now (it's hard to believe it was founded only about 4 years ago,
2005-02, first video 2005-04). Wikipedia was no help on storage
capacity, and a quick Google search isn't helping much either (45
terabytes in 2006...
great help that is for 2009), but I do see figures of 10, scratch that,
15, scratch that, 20 hours of video uploaded /per/ /minute/! Even at the
compression rates they use, that's a LOT of video and therefore a LOT of
storage.
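
For a rough sense of scale, here's a back-of-envelope calculation in
Python; the 20-hours-per-minute figure is the one quoted above, while
the average stored bitrate is purely my guess:

    # Back-of-envelope: storage growth implied by 20 hours of video
    # uploaded per minute (the figure quoted above). The average
    # stored bitrate is an assumption, not a published number.
    UPLOAD_HOURS_PER_MINUTE = 20
    ASSUMED_BITRATE_MBIT_S = 0.5   # guess at an average stored bitrate

    video_seconds_per_day = UPLOAD_HOURS_PER_MINUTE * 3600 * 60 * 24
    bytes_per_day = video_seconds_per_day * ASSUMED_BITRATE_MBIT_S * 1e6 / 8

    print("~%.1f TB of new video per day" % (bytes_per_day / 1e12))
    print("~%.2f PB of new video per year" % (bytes_per_day * 365 / 1e15))

Even at half a megabit per second for a single stored copy, that works
out to a couple of petabytes of new video a year, before counting
multiple encodings or any replication.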

Duncan



What ever happened to chunkfs?

Posted Jun 26, 2009 22:54 UTC (Fri) by roelofs (guest, #2599)

> Probably not "search engine", at least in the conventional Internet search engine sense.

ObDisclosure: I work for one...

> They surely index a lot of data, but probably store less of it, ...

Hard to index it if you don't store it. ;-) Life isn't just an inverted index, after all; you need to be able to generate dynamic summaries on the fly.
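
As a toy sketch of that point in Python (entirely made-up documents, nothing like a production engine): the inverted index alone only gets you from terms to document IDs; building a query-dependent snippet still needs the stored page text.

    from collections import defaultdict

    # Hypothetical stored copies of crawled pages.
    docs = {
        1: "chunkfs splits a filesystem into independently checkable chunks",
        2: "fsck time grows with filesystem size unless you can check per chunk",
    }

    # The inverted index: term -> set of document IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def search(query):
        terms = query.lower().split()
        hits = set.intersection(*(index.get(t, set()) for t in terms))
        for doc_id in hits:
            # The snippet needs the stored text, not just the index.
            words = docs[doc_id].split()
            pos = next((i for i, w in enumerate(words) if w.lower() in terms), 0)
            print("doc %d: ...%s..." % (doc_id, " ".join(words[max(0, pos - 3):pos + 4])))

    search("chunk")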

> ... and wouldn't need long-term backups of most of it. After all, the data that was indexed for searching should for the most part still be there on the net to reindex, a process that likely wouldn't take much longer than restoring a backup, and regardless, by the time they finished the restore, the data would be stale, so a live re-index is going to be more effective anyway.

That's true as far as it goes, but we're not talking about long-term backups, either. Search engines are more about robustness--think replication and failover and low (sub-second) latencies. How much data depends on which part you're talking about (tracked [webmap] vs. crawled vs. indexed), but when the document count ranges from dozens to hundreds of billions, the node count ranges from tens of thousands to hundreds of thousands (as reported by Google quite a few years ago), and the failure rate is dozens to thousands of nodes per day (also reported by Google not too long ago, IIRC), you can probably see where disk-based petabyte storage might come into play and why recrawling isn't a realistic option for point failures.
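
Some rough arithmetic on those numbers (the node count and failure rate are from the ranges cited above; the per-node disk capacity is purely my assumption) shows why replication, rather than recrawling, is the answer to point failures:

    # Back-of-envelope on the failure rates cited above. Node count and
    # failures/day come from the quoted ranges; per-node capacity is an
    # assumption for illustration.
    NODES = 100000             # "tens of thousands to hundreds of thousands"
    FAILURES_PER_DAY = 1000    # "dozens to thousands of nodes per day"
    TB_PER_NODE = 1.0          # assumed usable disk per node

    total_pb = NODES * TB_PER_NODE / 1000.0
    offline_tb_per_day = FAILURES_PER_DAY * TB_PER_NODE

    print("cluster capacity: ~%.0f PB" % total_pb)
    print("data knocked offline per day: ~%.0f TB (%.1f%% of nodes)"
          % (offline_tb_per_day, 100.0 * FAILURES_PER_DAY / NODES))
    # With each block replicated a few times across nodes, every failure
    # is repaired from surviving replicas; recrawling a terabyte-scale
    # slice of the web every day would never keep up.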

Greg

