LWN.net Logo

Web Search By The People, For The People: YaCy 1.0

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 18:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
In reply to: Web Search By The People, For The People: YaCy 1.0 by martinfick
Parent article: Web Search By The People, For The People: YaCy 1.0

I have even implemented one. That's why I'm saying it's not possible to have a fast distributed search.

Then there's the next problem - you can't make everything to be a lookup in a DHT.


(Log in to post comments)

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 20:57 UTC (Tue) by martinfick (subscriber, #4455) [Link]

> I have even implemented one. That's why I'm saying it's not possible to have a fast distributed search.

I know you keep saying that, but do you care to enlighten us as to why it's not possible ("I couldn't do it" isn't very convincing)?

Not to mention that there are other ways to make distributed indexes, such as... the way google does. :)

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 21:23 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

OK, I mean 'fast' as in 'takes less than 15 seconds' which is already way too slow compared to Google's latency of several tens of milliseconds.

Let's imagine that you're doing a query for 'tasty peanuts'. First you need to find out _whom_ you are going to send this query. Your node can't hold the list of all DHT nodes.

So you immediately at least two RTTs of latency - from your computer to the coordination server and from coordination server to the nodes in the DHT. And you really can't have one coordination server serving all requests, so you probably need at least one or two levels in your DHT. So we're looking at 3-5 RTTs for the simplest lookups - and just that takes about 1 second in itself.

Suppose that you've sent a request to the desired node. What next? You need to get a list of documents containing the words "tasty" and then the list of documents for "peanuts" and then join them. But both these lists are way too large to be transmitted over WAN links.

So you need to offload joining to the nodes themselves. In essence you need to ask nodes holding "tasty" documents to check if their documents also contain "peanuts". It's possible, but your nodes also would have to expend quite a lot of CPU times responding to queries.

Then there's a question of partitioning - each individual node won't be able to contain all documents with the word 'tasty'. I'm not aware of algorithms that can solve this problem in a true distributed network with untrusted nodes.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds