Web Search By The People, For The People: YaCy 1.0

From:  Martin Husovec <husovec-AT-fsfe.org>
To:  press-release-AT-fsfeurope.org
Subject:  [FSFE PR][EN] Web Search By The People, For The People: YaCy 1.0
Date:  Mon, 28 Nov 2011 11:27:33 +0100
Message-ID:  <1322476053.2345.9.camel@Hutko.fsfe.org>

= Web Search By The People, For The People: YaCy 1.0 =

The YaCy project is releasing version 1.0 of its peer-to-peer Free
Software search engine. The software takes a radically new approach to
search. YaCy does not use a central server. Instead, its search results
come from a network of currently over 600 independent peers. In such a
distributed network, no single entity decides what gets listed, or in
which order results appear.

The YaCy search engine runs on each user's own computer. Search terms
are encrypted before they leave the user and the user's computer.
Unlike conventional search engines, YaCy is designed to protect
users' privacy. A user's computer creates its individual search indexes
and rankings, so that results better match what the user is looking for
over time. YaCy also makes it easy to create a customised search portal
with a few clicks.

"Most of what we do on the Internet involves search. It's the vital link
between us and the information we're looking for. For such an essential
function, we cannot rely on a few large companies, and compromise our
privacy in the process," says Michael Christen, YaCy's project leader.
"YaCy's free search is the vital link between free users and free
information. YaCy hands control over search back to us, the users." 

Each YaCy user is part of a large search network. YaCy is already in use
on websites such as sciencenet.kit.edu, yacy.geocaching-portal.com, or
fsfe.org, to provide a site-wide search function that respects users'
privacy. It contains a peer-to-peer network protocol to exchange search
indexes with other YaCy search engines. 

"We are moving away from the idea that services need to be centrally
controlled. Instead, we are realising how important it is to be
independent, and to create infrastructure that doesn't have a single
point of failure," says Karsten Gerloff, President of the Free Software
Foundation Europe. "In the future world of distributed, peer-to-peer
systems, Free Software search engines like YaCy are a vital building
block."

Everyone can try out the search engine at http://search.yacy.net. Users
can become part of YaCy's network by installing the software on their
own computers. YaCy is Free Software, so anyone can use, study, share
and improve it. It is currently available for GNU/Linux, Windows and
MacOS. The project is also looking for developers and other
contributors.


= Links:  =
  
  YaCy homepage: http://yacy.net
  
  YaCy search portal: http://search.yacy.net/
  
  How to contribute: http://yacy.net/en/Join.html
  
  Free Software Foundation Europe: http://fsfe.org


= Contacts = 

 Michael Christen
 YaCy Project Leader
 Tel. +49 177 6424235
 Email: mc@yacy.net
 
 Karsten Gerloff
 President, Free Software Foundation Europe
 Tel. +49 176 9690 4298
 Email: gerloff@fsfeurope.org


== About the Free Software Foundation Europe ==
  
  The Free Software Foundation Europe (FSFE) is a non-profit
  non-governmental organisation active in many European countries and
  involved in many global activities. Access to software determines
  participation in a digital society. To secure equal participation in
  the information age, as well as freedom of competition, the Free
  Software Foundation Europe (FSFE) pursues and is dedicated to the
  furthering of Free Software, defined by the freedoms to use, study,
  modify and copy. Founded in 2001, creating awareness for these issues,
  securing Free Software politically and legally, and giving people
  Freedom by supporting development of Free Software are central issues
  of the FSFE.

  http://fsfe.org/
_______________________________________________
Press-release mailing list
Press-release@fsfeurope.org
https://mail.fsfeurope.org/mailman/listinfo/press-release


Web Search By The People, For The People: YaCy 1.0

Posted Nov 28, 2011 22:10 UTC (Mon) by ballombe (subscriber, #9523) [Link]

So I jump to yacy.net for instruction, and I am greeted by a screen that says 'Please switch to a browser with native H.264 support or install Adobe flash player'. Not the best way to attract free software activists.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 0:02 UTC (Tue) by kragilkragil2 (guest, #76172) [Link]

F$DS, will the complaining that people use popular sites like vimeo to host their videos never stop? Sure Youtube would have been better, but come on, without a h264 decoder you miss out on so many videos that you shouldn't be bothered by one more.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 15:20 UTC (Tue) by clint (subscriber, #7076) [Link]

I think you may have missed the point.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 30, 2011 17:02 UTC (Wed) by shmerl (guest, #65921) [Link]

Vimeo supports WebM. I never missed H264 support in Firefox.

"Seeks", a similar and very promising project

Posted Nov 28, 2011 23:01 UTC (Mon) by dodji (guest, #49817) [Link]

There is a similar project called Seeks: http://www.seeks-project.info/site. For now they are operating as a decentralized meta-search engine, re-using other search engines, I guess because they didn't perceive the crawling part of the job as being the one that adds the most value. But I wouldn't be surprised if a longer-term goal were to have their own crawling services as well.

There are several Seeks nodes that are up and running at the moment. You can consider installing your own, or using an existing one just for testing.

I personally use the node http://www.seeks.fr.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 28, 2011 23:17 UTC (Mon) by josh (subscriber, #17465) [Link]

I tried a few test searches, and didn't seem to get many useful results at all. Searching for [debian] did not produce any debian.org results anywhere on the first page. Similarly, searching for [google] did not produce google.com (or any other google domain) on the first page. Searching for [lwn] produced one random LWN comment, but nothing else. Searching for [linux] produced a page full of links to the Wikipedia articles on Linux in numerous different languages, in no sensible order.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 10:58 UTC (Tue) by rsidd (subscriber, #2582) [Link]

Afaict, it searches only sites where it is installed. Presumably debian.org hasn't installed it. (So, no, it won't be a google killer. But there may still be a market for it, if it does local searches fast and well.)

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 0:53 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Not gonna fly.

It's not possible to create a _fast_ distributed index to replicate Google's performance.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 1:59 UTC (Tue) by yarikoptic (subscriber, #36795) [Link]

Indeed -- they need to add some value (not just a decentralization feature that mere mortals generally do not care about). Off the top of my head -- I might have preferred to use it IF it was seamlessly integrated with a "desktop search" (documents, applications -- installed or available, etc) and possibly used that knowledge to customize my searches.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 18:00 UTC (Tue) by martinfick (subscriber, #4455) [Link]

> It's not possible to create a _fast_ distributed index to replicate Google's performance.

Ever heard of a DHT?

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 18:03 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

I have even implemented one. That's why I'm saying it's not possible to have a fast distributed search.

Then there's the next problem - you can't make everything a lookup in a DHT.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 20:57 UTC (Tue) by martinfick (subscriber, #4455) [Link]

> I have even implemented one. That's why I'm saying it's not possible to have a fast distributed search.

I know you keep saying that, but do you care to enlighten us as to why it's not possible ("I couldn't do it" isn't very convincing)?

Not to mention that there are other ways to make distributed indexes, such as... the way google does. :)

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 21:23 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

OK, I mean 'fast' as in 'takes less than 15 seconds' which is already way too slow compared to Google's latency of several tens of milliseconds.

Let's imagine that you're doing a query for 'tasty peanuts'. First you need to find out to _whom_ you are going to send this query. Your node can't hold the list of all DHT nodes.

So you immediately incur at least two RTTs of latency - from your computer to the coordination server and from the coordination server to the nodes in the DHT. And you really can't have one coordination server serving all requests, so you probably need at least one or two levels in your DHT. So we're looking at 3-5 RTTs for the simplest lookups - and just that takes about 1 second in itself.
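
As a rough check of that arithmetic, here is a minimal sketch. The 200 ms round-trip time is an assumed figure chosen only for illustration (it is not from the comment above), and `lookup_latency_ms` is a hypothetical helper name.

```python
def lookup_latency_ms(round_trips, rtt_ms):
    """Latency spent purely on sequential network round trips."""
    return round_trips * rtt_ms


# ASSUMPTION: 200 ms per round trip between peers on consumer links.
ASSUMED_RTT_MS = 200

for round_trips in (3, 4, 5):
    total = lookup_latency_ms(round_trips, ASSUMED_RTT_MS)
    print(f"{round_trips} RTTs -> {total} ms before any index work starts")
```

Under that assumption, the 3-5 round trips cost between 600 ms and 1000 ms, which is in line with the "about 1 second" estimate.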

Suppose that you've sent a request to the desired node. What next? You need to get a list of documents containing the word "tasty" and then the list of documents for "peanuts" and then join them. But both these lists are way too large to be transmitted over WAN links.

So you need to offload joining to the nodes themselves. In essence you need to ask nodes holding "tasty" documents to check if their documents also contain "peanuts". It's possible, but your nodes also would have to expend quite a lot of CPU time responding to queries.

Then there's a question of partitioning - each individual node won't be able to contain all documents with the word 'tasty'. I'm not aware of algorithms that can solve this problem in a true distributed network with untrusted nodes.
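
The join-offloading idea described above can be sketched in a few lines of Python. This is a toy, in-process model and not YaCy's protocol: the `Node` class, the `node_for_term` placement rule (SHA-1 of the word modulo the node count) and the "start with the rarest term" heuristic are all assumptions made for illustration. It also ignores the partitioning and trust problems raised in the last paragraph, since each word lives on exactly one honest node.

```python
import hashlib


def node_for_term(term, num_nodes):
    """Hypothetical placement rule: hash the word, pick a node."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Node:
    """One peer holding the posting lists for the words hashed to it."""

    def __init__(self):
        self.postings = {}  # word -> set of document ids

    def add(self, term, doc_id):
        self.postings.setdefault(term, set()).add(doc_id)

    def count(self, term):
        # Cheap to send over the network: a single integer.
        return len(self.postings.get(term, ()))

    def lookup(self, term):
        return set(self.postings.get(term, ()))

    def filter_candidates(self, term, candidates):
        # The join runs on the node; only the survivors travel back.
        return candidates & self.postings.get(term, set())


def build_network(docs, num_nodes):
    nodes = [Node() for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            nodes[node_for_term(term, num_nodes)].add(term, doc_id)
    return nodes


def search(nodes, query):
    """AND-query: fetch the rarest word's list, let other nodes filter it."""
    def owner(term):
        return nodes[node_for_term(term, len(nodes))]

    terms = sorted(set(query.lower().split()), key=lambda t: owner(t).count(t))
    if not terms:
        return set()
    candidates = owner(terms[0]).lookup(terms[0])
    for term in terms[1:]:
        if not candidates:
            break
        candidates = owner(term).filter_candidates(term, candidates)
    return candidates


docs = {
    1: "tasty peanuts",
    2: "tasty soup",
    3: "salted peanuts",
    4: "Tasty roasted peanuts",
}
nodes = build_network(docs, num_nodes=4)
print(sorted(search(nodes, "tasty peanuts")))
```

Even in this toy version, every extra query word adds one more sequential hop, and the node owning a common word does intersection work for every query that mentions it, which is the CPU cost the comment warns about.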

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 7:35 UTC (Tue) by slashdot (guest, #22014) [Link]

Obviously this can only work if the number of people using this is massive, due to the colossal bandwidth/disk/CPU requirements (and even then, feasibility seems quite dubious).

And the only chance to do that is to have it initially mostly use Google and/or Bing under the covers, and provide some added value on top.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 16:53 UTC (Tue) by bjartur (guest, #67801) [Link]

When I finally found a useful search pattern that returned a result (yes, singular), the response to navigation was HTTP 404.
Count me as skeptical. YaCy might, though, fill the niche of searching a homogeneous set of mildly cooperative sites, i.e. specialized and customizable search for nerds.

Imagine if YouTube and Vimeo standardized on the YaCy protocol instead of (or in addition to) the in-house HTTP+JSON/XML mess of a search protocol Google uses for video search, and whatever protocol Vimeo uses, if any stable at all. Then virtually all video sites would have to implement it in a compatible manner, allowing authors to make video playlisters gathering videos from any of a multitude of providers (with the user having the theoretical option of adding their own).

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 17:48 UTC (Tue) by b7j0c (subscriber, #27559) [Link]

tried it, it seems interesting and full of potential

it may be years before anything like this matches the results from a real search engine, but i love the experimental approach. if it only serves to further inform developers about the latent processing network available on users' computers, i say it's a win

It'll never match "a real search engine"

Posted Nov 29, 2011 20:16 UTC (Tue) by khim (subscriber, #9252) [Link]

This design will never work. The best way to find something on lwn.net is to use Google with a "site:lwn.net" restriction. Why? Because even when you restrict the search to a single site, "a real search engine" still uses metainformation from the whole world wide web.

YaCy just does not have enough information to ever match "a real search engine". The best result it can ever hope to get is a mostly unsorted list of links. Relevancy sorting is just impossible - by design (protection of users' privacy).

This is in addition to the technical problems: these are also enormous, but they can be solved, at least in theory. The principled refusal to use available information (in the name of privacy) cripples the project from the start and can never be changed (without changes in stated goals, obviously).

It'll never match "a real search engine"

Posted Nov 29, 2011 21:33 UTC (Tue) by bjartur (guest, #67801) [Link]

No, but the privacy benefit is largely moot as soon as you send your query, which you have to do unless all you want to do is full-text search over your local cache. That's not really where YaCy excels, although it does support a number of complicated formats grep can't handle.
What YaCy provides is a DHT protocol for distributed keyword search. It has the potential to solve the biggest problem with replacing existing web indexes: lack of Internet Archive class bandwidth and storage. In fact, YaCy seems to do it quite well. It is a great step up from the Common Name Resolution Protocol and HTML-form assisted HTTP querying.
There is still much left to improve, with ranking being one. But client-side ranking provides benefits already: customization beyond what Google allows you to do in its belief that it knows your interests better than you do (and in what language you prefer your content, solely based on your location).
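
To make "client-side ranking" concrete, here is a minimal sketch of re-ordering results locally according to the user's own stated preferences. It is not YaCy's ranking code: the result fields (`url`, `score`, `lang`, `host`), the `rerank` function and the bonus weights are all hypothetical and chosen only for illustration.

```python
def rerank(results, preferred_langs, boosted_hosts,
           lang_bonus=1.0, host_bonus=2.0):
    """Re-order results on the client using preferences that never leave it.

    Each result is a dict with a base 'score' supplied by whoever served the
    index entry; the client adds its own bonuses on top of that score.
    """
    def local_score(result):
        score = result["score"]
        if result.get("lang") in preferred_langs:
            score += lang_bonus
        if result.get("host") in boosted_hosts:
            score += host_bonus
        return score

    return sorted(results, key=local_score, reverse=True)


results = [
    {"url": "http://x.example/de", "score": 1.0, "lang": "de", "host": "x.example"},
    {"url": "http://lwn.net/a", "score": 0.5, "lang": "en", "host": "lwn.net"},
]
# The user, not the server's guess from an IP address, decides the language.
for r in rerank(results, preferred_langs={"en"}, boosted_hosts={"lwn.net"}):
    print(r["url"])
```

The sketch only covers the cheap part: it adjusts the order of candidates that have already been found and given some base score, so it does not answer where a trustworthy base score comes from.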

And this is the only part that matters...

Posted Nov 29, 2011 21:51 UTC (Tue) by khim (subscriber, #9252) [Link]

> There is still much left to improve, with ranking being one.

This is the only part that matters. Sure, to create a search engine you really need beefy hardware, and it costs a pretty penny, but... it's only large for an individual. You need a few million dollars to build a datacenter comparable to Google's - there are a lot of individuals and organizations that can afford that. But then you need to rank the results - and this is where the task becomes hard.

> But client-side ranking provides benefits already: customization beyond what Google allows you to do in its belief that it knows your interests better than you do (and in what language you prefer your content, solely based on your location).

Sure, but this is icing on the cake. Show me how and when you'll get the cake - then we'll have a meaningful discussion. Google sorts documents once and then uses the result millions of times (this is a simplification, of course: nowadays it alters the existing ranking "on the fly" and does not rebuild the index each week like it did years ago, but these are minor details) - this makes the whole thing affordable. How you plan to achieve that with client-side ranking is the question.

Indexing: Done; Crawling: WIP; Ranking: TD

Posted Dec 3, 2011 15:49 UTC (Sat) by bjartur (guest, #67801) [Link]

Ranking seems relatively cheap when all indexers are trusted. Not so in a trust-nothing peer system. YaCy has already pushed the state of the art of P2P further than I expected in the coming years, by a use of DHT that is obvious in retrospect. It's still just a sort-of working proof of concept with great potential for evolutionary enhancement and reworking. In YaCy, indexes can be provided either by website-run YaCy servers or by a distributed network of crawlers. The former works already for the few sites that run YaCy, and with the publicity YaCy has now, distributed crawling might get close to usable soon.
Yes, ranking is hard - but so was distributed indexing. It became easier. For now ranking has to be done by a trusted site. But perhaps the most important product of the YaCy project may become standardization upon a common protocol allowing searchers to more easily aggregate search results from multiple rankers and for rankers to aggregate indexes from even more crawlers.

If Google refuses to rank Yahoo Mail above GMail, then boo-hoo. If Google omits a site from their index, then shit. It happens so rarely that their results are overall far superior to those of any other engine. Microsoft has crawled an astounding number of pages, but doesn't yet have all those esoteric pages whose authors have long forgotten about them and which are only linked to by that other esoteric page, perhaps not served as HTML. It doesn't even matter whether Bing's ranking algorithm is better than Google's, if Google's is just good enough to allow their maturity to keep and attract users. And Google didn't exactly stop at the original PageRank.

But YaCy allows decoupling of indexing and ranking. Even if ranking practically can not be done in a fully distributed fashion, it allows for ranking to be outsourced far more easily than the current mess of custom HTML soup results. Note that a standardized format for search results (uri-list, RSS, Atom or a semantic HTML dialect) would achieve the same, but such a standard will not be adhered to until a few major players have shown support for it and classified it as the New Deal(tm).

Disclaimer: I do not run a YaCy crawler yet for lack of bandwidth. Google is the search engine used by this stable build of Opera, but Bing by the customized experimental build, for it can't cope with the insanity that is the latest Yahoo-esque revision of Google's search result page.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 21:13 UTC (Tue) by elanthis (guest, #6227) [Link]

I look forward to when p2p search allows adolescent dorks to insert goatse links into random search queries. The Internet will surely be better then than it is with evil companies like Google at the helm now.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 23:21 UTC (Tue) by cmccabe (guest, #60281) [Link]

> I look forward to when p2p search allows adolescent dorks to insert
> goatse links into random search queries.

Luckily for you, those happy days are already here:
http://www.ex-parrot.com/~pete/upside-down-ternet.html

You just need to modify that script a bit.

Web Search By The People, For The People: YaCy 1.0

Posted Nov 29, 2011 23:26 UTC (Tue) by cmccabe (guest, #60281) [Link]

Er, "happy" is intended to be sarcastic here. I keep forgetting that sarcasm doesn't work on the internet. Anyway, don't trust random access points that you find out there!

Web Search By The People, For The People: YaCy 1.0

Posted Nov 30, 2011 5:42 UTC (Wed) by elanthis (guest, #6227) [Link]

I actually have run that before. I think it's great.

That's not at all the same thing as relying on a p2p network where you can get bogus, malevolent results while using the service properly and appropriately. "Stealing" my bandwidth is one thing, and screwing with those people is hilarious, but I don't expect to be mistreated when I do a search on google.com.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds