Developers in the "free network services" movement often highlight Google services like GMail, Google Maps, and Google Docs when they speak about the shortcomings of centralized software-as-a-service products, but they rarely address the software behemoth's core product: web search. That may be changing now that the decentralized search engine YaCy has made its 1.0 release. But while the code may be 1.0, the search results may not scream "release quality."
The rationale given for YaCy is that a decentralized, peer-to-peer search service prohibits a central point-of-control and the problems that come with it: censorship, user tracking, commercial players skewing search results, and so forth. Elsewhere on the project site, the fact that a decentralized service also eliminates a central point of failure comes up, as does the potential for faster responses through load-balancing. But user control is the core issue: YaCy users determine which sites and pages get indexed, and it is possible to thoroughly explore the search index from a YaCy client.
In addition to "basic" search functionality indexing the entire web,
individual administrators can point YaCy's crawler towards specific
content, using it to create a search portal limited in scope to a
particular topic, a specific domain, or a company intranet. On one hand,
this design decision allows both custom "search appliances" using free
software, and, with the same code base, puts the indexing choices directly
into the hands of the search engine's users. On the other hand, the
portion of the web indexed by the federated network of YaCy installations
is much smaller than the existing indexes of Google and other commercial
services — and perhaps more importantly, it grows more slowly as well.
The 1.0 release of YaCy is available for download in pre-built packages
for Windows, Mac OS X, and Linux, as well as an Apt repository for Debian and Ubuntu. The package requires OpenJDK version 6 or later; once installed it provides a web interface at http://localhost:8090/ that includes searching, administration and configuration, and control over your local index creation. You can also register the local YaCy instance as a toolbar search engine in Firefox, although it must be running in order to function.
In a relatively new move, the project is also running a public web search portal at search.yacy.net. This portal accesses the primary "freeworld" network of public YaCy peers. There are other, independent YaCy networks that do not attempt to index the entire web, such as the scientific research index sciencenet.kit.edu. These portals make it possible to test out YaCy's coverage and results without installing the client software locally.
The architecture among peers
Broadly speaking, a network of YaCy peers (such as freeworld) maintains
a single, shared reverse-word-index for all of the crawled pages (in other
words, a database of matching URLs, ordered on the words that would make up
likely search terms). The difference is that the index is sharded among
the peers in a distributed hash table (DHT). Whenever a peer indexes a new set of pages, it writes updates to the DHT. Shards are replicated between multiple peers to bolster lookup speed and availability.
In practice, YaCy's DHT is a bit more complicated. The DHT does not
store the full URL reference for every matching page — the complete
entry includes not just the source URL, but metadata about it such as the
last crawl time, the language, and the filetype (all of which might be
important to the user performing the search). That information is stored
in a local database on the peer that crawled the page, and replicated to
several other peers. The peers' data is kept in a custom-written NoSQL database using AVL trees, a self-balancing search tree that gives logarithmic lookup, insert, and delete operations.
Each record in the DHT's index for a particular word contains a hash of the URL and a hash of the peer where the full reference is stored. Those two hashes are computed and stored separately, so that it is simple to determine that two matching hashes are entries for the same URL. That saves time because a URL that matches multiple search terms only has to be fetched from a peer once for a particular search.
Finally, for common terms, the number of URL references for a given word can become unwieldy, so YaCy partitions the DHT not just on full word-entry boundaries, but splits each word entry across multiple peers. That complicates the search process because matches for each word need to be retrieved from multiple peers, but it balances the load.
YaCy lead developer Michael Christen said that the freeworld network
currently partitions word entries into 16 parts (although this is configurable, and he
said the freeworld network may soon scale-up). The peer network is visualized as a circle, and the 16 shards evenly "spaced" around the circle. Two mirrors are created for each shard, and are placed in adjacent positions to the "left" and "right" of the primary location. The freeworld network claims around 1000 active peers with about 1500 passive peers (i.e., those not contributing to the public index); together they are currently indexing just under 888 million pages. You can see a live visualization of the network on the YaCy home page, and the YaCy client allows you to explore it more thoroughly (due to the DHT's circular design, the live visualization resembles a borderline-creepy pulsating human eye; it is still not clear to me whether or not this is intentional...).
Because a search involves sending several stages of request to multiple peers (i.e., getting the matching URL/peer hashes from the DHT, then querying peers for the URL records), the lag time for a YaCy search is potentially much higher than it is for sending a query to a single search engine datacenter. However, because each YaCy peer is also storing its own portion of the full index in a local database, the system makes use of that existing data structure to speed things up.
First, for every search, a query to the local database is started
concurrently with the remote search query. Secondly, each search's results
are cached locally, so that on subsequent searches the local query will
return more hits and not tax the network. This works best for the scenario
when two searches performed in a row are part of a single search session
— such as a search for "Linux" followed quickly by a refinement for
"Linux" and "kernel." It is also presumed that a YaCy user is likely to
have crawled pages that are of particular interest to them, so there is a better-than-average chance that relevant results will already be stored in the local database.
The search process in YaCy is complex not only because of the distributed storage system, but because a central server does not assemble and sort the results. Instead, the local YaCy web application must do so. It collects results from the local database (including cached results from previous queries) and from remote peers and sorts them together in a "Reverse Word Index Queue." The ranking algorithm used is similar to the PageRank used by Google, though Christen describes it as simpler. As you continue to use YaCy, however, it observes your actions and uses those to refine future search results.
YaCy next fetches the contents of each matching page. Christen admits that this is time-consuming, but it is done to prevent spamming and gaming the system — by fetching the page, the local web application can verify that the search terms actually appear in the page. The results are loaded as HTTP chunks so that they appear on screen as fast as possible.
From the user's standpoint, the YaCy search interface is a good deal
more complicated than the minimalist design sported by Google and its
big-name competitors. The search box is clean, but the results page
displays more detail for every entry than a production search engine might:
there is a link to the metadata database entry for each hit, another link
to details on the indexing and parsing history of the entry, and floating
buttons to promote, demote, and bookmark each link. There is also a
"filetype navigator," "domain navigator," and "author navigator" for the
results page as a whole, along with a tag cloud.
As interesting as the query and replication design may be, the entire system would be useless without a sufficiently large set of indexed pages. As mentioned earlier, index creation is a task left up to the peers as well. The YaCy client software includes several tools for producing and managing the local peer's portion of the global index.
The current release includes seven methods for adding to the index: a site crawler that limits itself to a single domain, an "expert" crawler that will follow links to any depth requested by the user, a network scanner that looks for locally-accessible servers, an RSS importer that indexes individual entries in a feed, an OpenArchives Protocol for Metadata Harvesting (PMH) importer, and two specialized crawlers: one for MediaWiki sites, and one for phpBB3 forums.
YaCy can also be configured to index all URLs visited by the local
browser. In this mode, it places several safeguards to protect against
indexing personal and protected pages. It skips all pages fetched using
POST or GET parameters, those requiring HTTP password authorization, and so
on. You can also investigate the local database, pulling up records by word entry or hash (if you happen to know the hash of the word), and edit or delete the metadata stored. You can also blacklist problematic sites with regular expression matching; there is a shared YaCy Blacklist Engine available to all clients, although there is no documentation on its contents.
How ever you add to the global index, the actions you take on the search
results also contribute to the search experience: you can promote or demote
any entry using the +/- buttons on the results page. There is also an advanced-configuration tool with which you can
tweak the weight of about two dozen separate factors in how your search
results are sorted locally, including word frequency, placement on the
page, appearance in named anchors, and so on. These customizations are
local, though; they do not affect which pages the peers send in response to
Options and the future
The description above outlines the default, "basic" mode of YaCy usage. The administrative interface allows you to configure the YaCy client in several other modes, including intranet indexing and serving as a single-site search portal. You can also set up user accounts for use with a password-protected search engine (as might be the case for a company intranet), load "suggestion dictionaries," tweak the filetypes that YaCy will index, and adjust the memory usage and process priority.
Another wrinkle in YaCy administration is that the peer can be configured to also load search results culled from the Scroogle and Blekko search engines. Scroogle is an engine that scrapes Google, but removes user-tracking data from the results, while Blekko is a search engine dedicated to publishing its search algorithms and optimizations for public consumption.
Both speak to the need to "bootstrap" YaCy's global index. A glance at
reader comments to other YaCy stories (such as LWN's announcement and Slashdot's coverage)
indicates that many people have tried YaCy and found the ordering of the
results to be lacking. The topic comes up repeatedly on the YaCy
discussion boards, although Christen noted in an FSCONS 2010 talk that YaCy already has more pages in its index than Google did at the time that it launched.
Nevertheless, the YaCy team has recently been promoting a new idea to boost the size of the index: interoperability with the Apache Solr search platform.
From a practical standpoint, this is probably a good move. YaCy alone is not yet indexing enough of the web to be competitive with commercial search engines. Some modest tests of my own roughly match the experiences of the LWN and Slashdot commenters: YaCy can find big and obvious pages for popular topics, but the real meat of web search is the ability to find the difficult-to-discover content. From one point of view, YaCy is like any other crowd-sourced data initiative: the more people who participate, the better it gets. However, it is drastically different from Wikipedia or OpenStreetMap in one key regard: the partial coverage available during the ramp-up phase of the project makes the system unusable for real work. You can map your own home town and the map will be useful to you on a daily basis — but if you index your own web site, that does not help you find most of your search targets. Better interoperability with Solr and other open source search engines could help, as would a concerted effort to index important un-covered areas of the web (a replacement for Google Code Search comes to mind).
Still, the developers are quick to admit that YaCy is not a production service. At this point, the team is concerned with tackling the tricky problem of distributing indexing, searching, and page ranking over a peer-to-peer network. Which is an original problem, even if the current state of the index is not a major challenge to Google.
to post comments)