Not logged in
Log in now
Create an account
Subscribe to LWN
LWN.net Weekly Edition for May 23, 2013
An "enum" for Python 3
An unexpected perf feature
LWN.net Weekly Edition for May 16, 2013
A look at the PyPy 2.0 release
It's not possible to create a _fast_ distributed index to replicate Google's performance.
Web Search By The People, For The People: YaCy 1.0
Posted Nov 29, 2011 1:59 UTC (Tue) by yarikoptic (subscriber, #36795)
Posted Nov 29, 2011 18:00 UTC (Tue) by martinfick (subscriber, #4455)
Ever heard of a DHT?
Posted Nov 29, 2011 18:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
Then there's the next problem - you can't make everything to be a lookup in a DHT.
Posted Nov 29, 2011 20:57 UTC (Tue) by martinfick (subscriber, #4455)
I know you keep saying that, but do you care to enlighten us as to why it's not possible ("I couldn't do it" isn't very convincing)?
Not to mention that there are other ways to make distributed indexes, such as... the way google does. :)
Posted Nov 29, 2011 21:23 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
Let's imagine that you're doing a query for 'tasty peanuts'. First you need to find out _whom_ you are going to send this query. Your node can't hold the list of all DHT nodes.
So you immediately at least two RTTs of latency - from your computer to the coordination server and from coordination server to the nodes in the DHT. And you really can't have one coordination server serving all requests, so you probably need at least one or two levels in your DHT. So we're looking at 3-5 RTTs for the simplest lookups - and just that takes about 1 second in itself.
Suppose that you've sent a request to the desired node. What next? You need to get a list of documents containing the words "tasty" and then the list of documents for "peanuts" and then join them. But both these lists are way too large to be transmitted over WAN links.
So you need to offload joining to the nodes themselves. In essence you need to ask nodes holding "tasty" documents to check if their documents also contain "peanuts". It's possible, but your nodes also would have to expend quite a lot of CPU times responding to queries.
Then there's a question of partitioning - each individual node won't be able to contain all documents with the word 'tasty'. I'm not aware of algorithms that can solve this problem in a true distributed network with untrusted nodes.
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds