YaCy: A peer-to-peer search engine
Developers in the "free network services" movement often highlight Google services like Gmail, Google Maps, and Google Docs when they speak about the shortcomings of centralized software-as-a-service products, but they rarely address the software behemoth's core product: web search. That may be changing now that the decentralized search engine YaCy has made its 1.0 release. But while the code may be 1.0, the search results may not scream "release quality."
The rationale given for YaCy is that a decentralized, peer-to-peer search service precludes a central point of control and the problems that come with it: censorship, user tracking, commercial players skewing search results, and so forth. The project site also notes that a decentralized service eliminates a central point of failure, and that load-balancing across peers offers the potential for faster responses. But user control is the core issue: YaCy users determine which sites and pages get indexed, and it is possible to thoroughly explore the search index from a YaCy client.
In addition to the "basic" mode that indexes the entire web, individual administrators can point YaCy's crawler at specific content, using it to create a search portal limited in scope to a particular topic, a specific domain, or a company intranet. On one hand, this design decision enables custom "search appliances" built on free software and, with the same code base, puts indexing choices directly into the hands of the search engine's users. On the other hand, the portion of the web indexed by the federated network of YaCy installations is much smaller than the existing indexes of Google and other commercial services — and perhaps more importantly, it grows more slowly as well.
The 1.0 release of YaCy is available for download in pre-built packages for Windows, Mac OS X, and Linux, as well as from an APT repository for Debian and Ubuntu. The package requires OpenJDK version 6 or later; once installed, it provides a web interface at http://localhost:8090/ that includes searching, administration and configuration, and control over local index creation. You can also register the local YaCy instance as a toolbar search engine in Firefox, although the local instance must be running for searches to work.
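The same web interface can also be queried directly over HTTP, which makes scripted access easy. Here is a minimal Java sketch assuming the commonly documented yacysearch.json servlet and its query parameter; treat the endpoint name as an assumption rather than a guaranteed API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class YaCyQuery {
    public static void main(String[] args) throws Exception {
        // The JSON servlet name is an assumption based on YaCy's
        // documented interface; the port matches the default install.
        String query = URLEncoder.encode("linux kernel", "UTF-8");
        URL url = new URL("http://localhost:8090/yacysearch.json?query=" + query);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // raw JSON result list
        }
        in.close();
    }
}
```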
In a relatively new move, the project is also running a public web search portal at search.yacy.net. This portal accesses the primary "freeworld" network of public YaCy peers. There are other, independent YaCy networks that do not attempt to index the entire web, such as the scientific research index sciencenet.kit.edu. These portals make it possible to test out YaCy's coverage and results without installing the client software locally.
The peer-to-peer architecture
Broadly speaking, a network of YaCy peers (such as freeworld) maintains a single, shared reverse word index for all of the crawled pages (in other words, a database of matching URLs, keyed on the words that would make up likely search terms). Unlike a centralized engine, though, the index is sharded among the peers in a distributed hash table (DHT). Whenever a peer indexes a new set of pages, it writes updates to the DHT. Shards are replicated between multiple peers to bolster lookup speed and availability.
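As a rough illustration of the data structure, here is a toy reverse word index in Java (YaCy's own implementation language), with a naive shard-selection function standing in for YaCy's word-hash-to-peer mapping; all class and method names are invented for the example:

```java
import java.util.*;

// A toy reverse word index: each word maps to the set of URLs that
// contain it. In YaCy the map is not held on one machine; the key
// space is split into shards, each living on the peers whose DHT
// position covers it.
public class ReverseWordIndex {
    private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    public void addPage(String url, Collection<String> words) {
        for (String word : words) {
            Set<String> urls = index.get(word);
            if (urls == null) {
                urls = new HashSet<String>();
                index.put(word, urls);
            }
            urls.add(url);
        }
    }

    // Which shard a word belongs to; a stand-in for YaCy's own
    // word-hash-based peer assignment.
    public static int shardFor(String word, int shardCount) {
        return (word.hashCode() & 0x7fffffff) % shardCount;
    }
}
```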
In practice, YaCy's DHT is a bit more complicated. The DHT does not store the full URL reference for every matching page — the complete entry includes not just the source URL, but metadata about it such as the last crawl time, the language, and the filetype (all of which might be important to the user performing the search). That information is stored in a local database on the peer that crawled the page, and replicated to several other peers. The peers' data is kept in a custom-written NoSQL database using AVL trees, self-balancing search trees that give logarithmic lookup, insert, and delete operations.
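A sketch of what such a record and store might look like appears below; the field names are invented for the example, and java.util.TreeMap (a red-black tree) stands in for YaCy's hand-written AVL-tree database, since both offer the same logarithmic bounds:

```java
import java.util.TreeMap;

// Sketch of the locally stored URL reference. YaCy keeps records like
// this in its own AVL-tree-backed store; TreeMap is only a stand-in
// with equivalent O(log n) lookup, insert, and delete.
public class LocalReferenceStore {
    static class UrlReference {
        String url;          // the full source URL
        long lastCrawl;      // when the page was last fetched
        String language;     // detected document language
        String filetype;     // e.g. "html", "pdf"
    }

    private final TreeMap<String, UrlReference> store =
            new TreeMap<String, UrlReference>();

    public void put(String urlHash, UrlReference ref) { store.put(urlHash, ref); }
    public UrlReference get(String urlHash) { return store.get(urlHash); }
}
```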
Each record in the DHT's index for a particular word contains a hash of the URL and a hash of the peer where the full reference is stored. Those two hashes are computed and stored separately, so that it is simple to determine when two entries refer to the same URL. That saves time because a URL that matches multiple search terms only has to be fetched from a peer once for a particular search.
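A minimal sketch of that deduplication, assuming entries arrive as (URL hash, peer hash) pairs gathered from the per-word lookups:

```java
import java.util.*;

// Deduplicating DHT entries by URL hash: because the URL hash is stored
// separately from the peer hash, entries returned for different search
// terms can be recognized as the same page before any peer is contacted.
public class ResultDeduper {
    static class IndexEntry {
        final String urlHash;   // identifies the page
        final String peerHash;  // identifies the peer holding the full record
        IndexEntry(String u, String p) { urlHash = u; peerHash = p; }
    }

    // Collect one fetch task per distinct URL, whichever terms matched it.
    public static Map<String, String> toFetch(List<IndexEntry> matches) {
        Map<String, String> urlToPeer = new LinkedHashMap<String, String>();
        for (IndexEntry e : matches) {
            if (!urlToPeer.containsKey(e.urlHash)) {
                urlToPeer.put(e.urlHash, e.peerHash);
            }
        }
        return urlToPeer;
    }
}
```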
Finally, for common terms, the number of URL references for a given word can become unwieldy, so YaCy partitions the DHT not just on full word-entry boundaries, but splits each word entry across multiple peers. That complicates the search process because matches for each word need to be retrieved from multiple peers, but it balances the load.
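The vertical split can be pictured as deriving the partition from the URL rather than the word, so a single word's posting list scatters across several peers. A simplified sketch (YaCy's actual hash functions differ):

```java
// Vertical partitioning sketch: a common word's posting list is split
// across several peers by hashing the URL side of each entry, so no
// single peer holds the whole list and the load is spread out.
public class WordPartitioner {
    static final int PARTITIONS = 16;   // configurable; 16 on freeworld

    // Derive the partition from the URL, not the word: entries for one
    // word scatter, and a searcher must ask all partitions for matches.
    public static int partitionOf(String url) {
        return (url.hashCode() & 0x7fffffff) % PARTITIONS;
    }
}
```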
![Network visualization](https://static.lwn.net/images/2011/yacy-visualization-sm.png)
YaCy lead developer Michael Christen said that the freeworld network currently partitions word entries into 16 parts (although this is configurable, and he said the freeworld network may soon scale up). The peer network is visualized as a circle, with the 16 shards evenly "spaced" around it. Two mirrors are created for each shard, placed in the positions adjacent to the "left" and "right" of the primary location. The freeworld network claims around 1000 active peers and about 1500 passive peers (i.e., those not contributing to the public index); together they are currently indexing just under 888 million pages. You can see a live visualization of the network on the YaCy home page, and the YaCy client allows you to explore it more thoroughly (due to the DHT's circular design, the live visualization resembles a borderline-creepy pulsating human eye; it is still not clear to me whether or not this is intentional...).
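The geometry can be sketched as follows, reducing DHT positions to integers around the ring; real placements are derived from word hashes, so this only shows the primary-plus-two-adjacent-mirrors idea:

```java
// Placing shards on the DHT circle: the primary position comes from the
// word hash, and two mirrors sit at the adjacent positions to either
// side. Positions are simplified here to integers 0..ringSize-1.
public class RingPlacement {
    public static int[] positions(int primary, int ringSize) {
        int left  = (primary - 1 + ringSize) % ringSize;
        int right = (primary + 1) % ringSize;
        return new int[] { primary, left, right };
    }
}
```

With the primary plus two mirrors, a lookup for any given shard can thus be served by any of three peers.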
Search me...
Because a search involves several rounds of requests to multiple peers (first getting the matching URL/peer hashes from the DHT, then querying peers for the URL records), the lag time for a YaCy search is potentially much higher than for sending a query to a single search engine's datacenter. However, because each YaCy peer also stores its own portion of the full index in a local database, the system uses that existing data structure to speed things up.
First, for every search, a query to the local database is started concurrently with the remote search query. Second, each search's results are cached locally, so that subsequent searches return more hits from the local query without taxing the network. This works best when two back-to-back searches are part of a single search session — such as a search for "Linux" followed quickly by a refinement for "Linux" and "kernel." A YaCy user is also likely to have crawled pages of particular interest to them, so there is a better-than-average chance that relevant results are already stored in the local database.
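A sketch of that two-pronged query, using a thread pool to run the local and remote lookups concurrently; the method names are placeholders, not YaCy's API:

```java
import java.util.*;
import java.util.concurrent.*;

// Concurrent search sketch: the local database is queried on one thread
// while remote peers are queried on another, and the results are merged
// and cached for later searches in the same session.
public class ConcurrentSearch {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    public List<String> search(final String term) throws Exception {
        Future<List<String>> local = pool.submit(new Callable<List<String>>() {
            public List<String> call() { return queryLocalDatabase(term); }
        });
        Future<List<String>> remote = pool.submit(new Callable<List<String>>() {
            public List<String> call() { return queryRemotePeers(term); }
        });
        List<String> merged = new ArrayList<String>(local.get());
        merged.addAll(remote.get());
        cacheLocally(term, merged);   // later searches hit the local copy
        return merged;
    }

    // Placeholders for the real lookups.
    private List<String> queryLocalDatabase(String term) { return new ArrayList<String>(); }
    private List<String> queryRemotePeers(String term)   { return new ArrayList<String>(); }
    private void cacheLocally(String term, List<String> results) { }
}
```

A real implementation would stream results as they arrive rather than waiting on both futures, but the structure is the same.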
The search process in YaCy is complex not only because of the distributed storage system, but also because no central server assembles and sorts the results; the local YaCy web application must do so instead. It collects results from the local database (including cached results from previous queries) and from remote peers, and sorts them together in a "Reverse Word Index Queue." The ranking algorithm used is similar to the PageRank used by Google, though Christen describes it as simpler. As you continue to use YaCy, however, it observes your actions and uses those to refine future search results.
YaCy next fetches the contents of each matching page. Christen admits that this is time-consuming, but it is done to prevent spamming and gaming the system — by fetching the page, the local web application can verify that the search terms actually appear in the page. The results are loaded as HTTP chunks so that they appear on screen as fast as possible.
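In miniature, the verification step amounts to re-fetching each candidate page and checking that the terms actually occur in it, as in this deliberately bare-bones sketch of the idea:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Spam-resistance check sketch: fetch the candidate page and confirm the
// search terms actually appear in it before presenting the result.
public class ResultVerifier {
    public static boolean containsTerms(String pageUrl, String[] terms) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(pageUrl).openStream(), "UTF-8"));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line.toLowerCase()).append(' ');
        }
        in.close();
        for (String term : terms) {
            if (body.indexOf(term.toLowerCase()) < 0) {
                return false;   // drop results that cannot be verified
            }
        }
        return true;
    }
}
```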
![Search results](https://static.lwn.net/images/2011/yacy-searchresults-sm.png)
From the user's standpoint, the YaCy search interface is a good deal more complicated than the minimalist design sported by Google and its big-name competitors. The search box is clean, but the results page displays more detail for every entry than a production search engine might: there is a link to the metadata database entry for each hit, another link to details on the indexing and parsing history of the entry, and floating buttons to promote, demote, and bookmark each link. There are also "filetype," "domain," and "author" navigators for the results page as a whole, along with a tag cloud.
Harvest season
As interesting as the query and replication design may be, the entire system would be useless without a sufficiently large set of indexed pages. As mentioned earlier, index creation is a task left up to the peers as well. The YaCy client software includes several tools for producing and managing the local peer's portion of the global index.
The current release includes seven methods for adding to the index: a site crawler that limits itself to a single domain; an "expert" crawler that will follow links to any depth requested by the user; a network scanner that looks for locally-accessible servers; an RSS importer that indexes individual entries in a feed; an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) importer; and two specialized crawlers, one for MediaWiki sites and one for phpBB3 forums.
YaCy can also be configured to index all URLs visited by the local browser. In this mode, it puts several safeguards in place to protect against indexing personal and protected pages: it skips pages fetched via POST or with GET parameters, pages requiring HTTP password authorization, and so on. You can also investigate the local database, pulling up records by word entry or hash (if you happen to know the hash of the word), and edit or delete the stored metadata. Problematic sites can be blacklisted with regular expression matching; there is a shared YaCy Blacklist Engine available to all clients, although there is no documentation on its contents.
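A condensed sketch of such a filter, combining the skip-pages-with-parameters rule with a regular-expression blacklist; the rules are simplified from the behavior described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Indexing-safeguard sketch: skip URLs carrying query parameters (often
// personal or session-bound pages) and anything matching the local
// regular-expression blacklist. Real YaCy applies more rules than this.
public class IndexFilter {
    private final List<Pattern> blacklist = new ArrayList<Pattern>();

    public void addBlacklistPattern(String regex) {
        blacklist.add(Pattern.compile(regex));
    }

    public boolean shouldIndex(String url) {
        if (url.contains("?")) {
            return false;               // GET parameters present; skip
        }
        for (Pattern p : blacklist) {
            if (p.matcher(url).matches()) {
                return false;           // blacklisted site or path
            }
        }
        return true;
    }
}
```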
However you add to the global index, the actions you take on the search results also contribute to the search experience: you can promote or demote any entry using the +/- buttons on the results page. There is also an advanced-configuration tool with which you can tweak the weight of about two dozen separate factors in how your search results are sorted locally, including word frequency, placement on the page, appearance in named anchors, and so on. These customizations are local, though; they do not affect which pages the peers send in response to your queries.
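The local sort can be pictured as a weighted sum over those factors; the factor names below are invented for illustration and do not correspond to YaCy's internal ranking attributes:

```java
// Local re-ranking sketch: each result gets a score that is a weighted
// sum of ranking factors; the weights correspond to the sliders in the
// advanced-configuration tool.
public class LocalRanker {
    static class Factors {
        double wordFrequency;    // how often the terms occur in the page
        double positionOnPage;   // earlier occurrences score higher
        double inAnchorText;     // 1.0 if a term appears in a named anchor
    }

    public static double score(Factors f, double wFreq, double wPos, double wAnchor) {
        return wFreq   * f.wordFrequency
             + wPos    * f.positionOnPage
             + wAnchor * f.inAnchorText;
    }
}
```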
Options and the future
The description above outlines the default, "basic" mode of YaCy usage. The administrative interface allows you to configure the YaCy client in several other modes, including intranet indexing and serving as a single-site search portal. You can also set up user accounts for use with a password-protected search engine (as might be the case for a company intranet), load "suggestion dictionaries," tweak the filetypes that YaCy will index, and adjust the memory usage and process priority.
Another wrinkle in YaCy administration is that the peer can be configured to also load search results culled from the Scroogle and Blekko search engines. Scroogle is an engine that scrapes Google, but removes user-tracking data from the results, while Blekko is a search engine dedicated to publishing its search algorithms and optimizations for public consumption.
Both speak to the need to "bootstrap" YaCy's global index. A glance at reader comments on other YaCy stories (such as LWN's announcement and Slashdot's coverage) indicates that many people have tried YaCy and found the ordering of the results to be lacking. The topic comes up repeatedly on the YaCy discussion boards, although Christen noted in an FSCONS 2010 talk that YaCy already has more pages in its index than Google did when Google launched.
Nevertheless, the YaCy team has recently been promoting a new idea to boost the size of the index: interoperability with the Apache Solr search platform.
From a practical standpoint, this is probably a good move. YaCy alone is not yet indexing enough of the web to be competitive with commercial search engines. Some modest tests of my own roughly match the experiences of the LWN and Slashdot commenters: YaCy can find big and obvious pages for popular topics, but the real meat of web search is the ability to find the difficult-to-discover content. From one point of view, YaCy is like any other crowd-sourced data initiative: the more people who participate, the better it gets. However, it is drastically different from Wikipedia or OpenStreetMap in one key regard: the partial coverage available during the ramp-up phase of the project makes the system unusable for real work. You can map your own home town and the map will be useful to you on a daily basis — but if you index your own web site, that does not help you find most of your search targets. Better interoperability with Solr and other open source search engines could help, as would a concerted effort to index important un-covered areas of the web (a replacement for Google Code Search comes to mind).
Still, the developers are quick to admit that YaCy is not a production service. At this point, the team is concerned with tackling the tricky problem of distributing indexing, searching, and page ranking over a peer-to-peer network. That is an original problem, even if the current state of the index is not a major challenge to Google.
Comments

Posted Dec 1, 2011 2:41 UTC (Thu) by alogghe (subscriber, #6661):
I think this looks fascinating as a tool for -memory-, I'd love to have it index every page I'VE read.
I very frequently waste time going -back- to google for some fact or thing I've already read.
This could be really killer for my personal use in this sense.
I could also see doing more interesting stuff with search that -your immediate peers- have read. Working peers that share your goals and interests. What pages come up for searches of things my workgroup has read? Perhaps integrate this with some active hash tagging and active process that people using the system together use.
Cool project, I think the cynicism about it is sad.
Posted Dec 1, 2011 13:52 UTC (Thu) by pcampe (guest, #28223):
Zotero might be your friend.
Posted Dec 8, 2011 6:01 UTC (Thu) by gdt (subscriber, #6284):
If the program was made more friendly to ISPs then many would be willing to run the harvester for their network and peered networks. Own and peered traffic can be made free by running the harvester outside of peak usage periods. The major change would be to bias the DHT based upon BGP attributes (community, AS Path length) so that the traffic on transit links is minimised.

Importantly, although ISPs would be happy to run one harvester they would be unhappy to run a second. So there would need to be some interoperable data or search for other FOSS search engines to use. It would be well worth the authors' time to design an offering to ISPs and to contact the NANOG organisers to give a presentation.

Many ISPs would be delighted to assist a FOSS search engine purely to make a small breach in the current oligopoly, a market situation which carries risks even for the largest ISPs. ISPs are particularly FOSS-friendly, with a long Unix heritage and Linux heavily used and deeply understood. Many directly assist the FOSS community by running software mirrors.