LWN.net Logo

Web Search By The People, For The People: YaCy 1.0

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 0:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
Parent article: Web Search By The People, For The People: YaCy 1.0

Not gonna fly.

It's not possible to create a _fast_ distributed index to replicate Google's performance.


(Log in to post comments)

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 1:59 UTC (Tue) by yarikoptic (subscriber, #36795) [Link]

Indeed -- they need to add some value (not just decentralization feature mortals do not generally care about). From top of the head -- I might have preferred to use it IF it was seamlessly integrated with a "desktop search" (documents, applications -- installed or available, etc) and possibly using that knowledge to customize my searches.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 18:00 UTC (Tue) by martinfick (subscriber, #4455) [Link]

> It's not possible to create a _fast_ distributed index to replicate Google's performance.

Ever heard of a DHT?

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 18:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

I have even implemented one. That's why I'm saying it's not possible to have a fast distributed search.

Then there's the next problem - you can't make everything to be a lookup in a DHT.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 20:57 UTC (Tue) by martinfick (subscriber, #4455) [Link]

> I have even implemented one. That's why I'm saying it's not possible to have a fast distributed search.

I know you keep saying that, but do you care to enlighten us as to why it's not possible ("I couldn't do it" isn't very convincing)?

Not to mention that there are other ways to make distributed indexes, such as... the way google does. :)

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 21:23 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

OK, I mean 'fast' as in 'takes less than 15 seconds' which is already way too slow compared to Google's latency of several tens of milliseconds.

Let's imagine that you're doing a query for 'tasty peanuts'. First you need to find out _whom_ you are going to send this query. Your node can't hold the list of all DHT nodes.

So you immediately at least two RTTs of latency - from your computer to the coordination server and from coordination server to the nodes in the DHT. And you really can't have one coordination server serving all requests, so you probably need at least one or two levels in your DHT. So we're looking at 3-5 RTTs for the simplest lookups - and just that takes about 1 second in itself.

Suppose that you've sent a request to the desired node. What next? You need to get a list of documents containing the words "tasty" and then the list of documents for "peanuts" and then join them. But both these lists are way too large to be transmitted over WAN links.

So you need to offload joining to the nodes themselves. In essence you need to ask nodes holding "tasty" documents to check if their documents also contain "peanuts". It's possible, but your nodes also would have to expend quite a lot of CPU times responding to queries.

Then there's a question of partitioning - each individual node won't be able to contain all documents with the word 'tasty'. I'm not aware of algorithms that can solve this problem in a true distributed network with untrusted nodes.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds