The Lucene Search Suite
The Lucene project lets you index the documents on your filesystem or web server so you can run combined full text and metadata searches. A full text search takes one or more words of a human language as a query and should return the documents that are "most relevant" to those words. Web searches are a classic example of full text searches. Metadata searches should be familiar to anyone who has used the find command; for example, looking for all files that have been modified in the last week.
The primary goal of Lucene is to provide a fast index and query implementation and to specify an interface to that implementation -- how to send queries to it and how to get your results back as quickly as possible. Lucene is not, by itself, designed to be a complete user-facing index solution but rather to provide the heart of such a system. There are also higher level projects which use one of the Lucene implementations to provide search capabilities, for example, KDE4's strigi desktop search. If you just want to add a search capability to something, you might like to explore these higher level tools first and save yourself the time of writing a program that uses the Lucene API directly.
It is tempting to think of building a full text index as just a filesystem traversal where you read each file and shove its byte contents into the index. Normally you want to extend this to allow conversions too, such as extracting the plain text of PDF files and indexing the extracted human readable text instead of the bytes that comprise the PDF file. The metadata associated with a document is entirely up to you; for example, you might extract the Vorbis artist, album and track comments from FLAC audio files and add them as metadata.
Using Lucene to index your Web site lets you offer a text search feature - like a Google search box - to handle searches like "Wakelocks embedded". This is only the beginning though, because you can also offer advanced searches by bringing metadata into the query. If you build a Lucene index for each registered user, the personalized search you can offer is hard to beat: for example, finding pages about "locking" that contain a link to a specific web site in the article comments, or any article on "locking" that contains a comment by any one of your friends.
Lucene is actually an umbrella project with implementations in Java, C++, Ruby and PHP among others. Probably the most widely known implementation of Lucene is the original one, written in Java. More recently, implementations in C++ (CLucene) and PHP (Zend_Search_Lucene) have become available; there are also implementations in Perl and Ruby, see the full list for details. The CLucene page states that its primary goal is to be faster than the Java version. It would appear that the PHP implementation was primarily driven by the desire to be homogeneous with the PHP environment.
These full text and metadata search types normally call for different queries and thus different implementations to resolve the queries efficiently. For example, it might be quite common to want to search for a range in a metadata query, like all the documents added to the index in December, whereas a full text query might demand ranking of documents that contain the strings "DDR3" and "latency".
You don't really need to know what Lucene does on its side of the API to build and search indexes with it, though a high level knowledge of what happens in the implementation can help you understand how to make efficient use of the API.
Abstractly, a Lucene index consists of many Document objects, each of which contains one or more fields. A field is a key-value pair, for example, a key of "indexed-on" and a value of "Wed Dec 17, 2008 @ 3:58 PM". The full text content of a document is itself added to the Lucene index as just another field of that document.
Fields can be stored verbatim in the index, or have an index created for them, or both. You might want to both index and store the URL that a document was retrieved from, but only index the document text, because storing the text verbatim might make the index too large for your application. An index on the contents of a file is likely to be much smaller than the file itself, and if you have access to the original file you don't really want to store a verbatim copy in the Lucene index too. A field can also be tokenized or stored atomically (a so-called keyword). You would want to tokenize the text content of a file but probably want the date it was indexed to remain an atomic value.
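To make this concrete, here is a minimal sketch of adding a document, assuming the Java Lucene API circa the 2.4 release; the field names, the index path and the text are all illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class AddDocument {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/tmp/lucene-index"),
                    new StandardAnalyzer(), true,
                    IndexWriter.MaxFieldLength.UNLIMITED);

            Document doc = new Document();
            // Stored verbatim and indexed as one atomic token (a keyword).
            doc.add(new Field("url", "http://example.com/article.html",
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Tokenized and indexed for full text search, but not stored
            // verbatim -- the original file already holds the content.
            doc.add(new Field("contents", "the extracted plain text ...",
                              Field.Store.NO, Field.Index.ANALYZED));
            // Stored only: retrievable with results, not searchable.
            doc.add(new Field("indexed-on", "Wed Dec 17, 2008 @ 3:58 PM",
                              Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);
            writer.close();
        }
    }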
Normally you would have Lucene tokenize the text of a file and build an inverted file arrangement for the tokens. For example, the word "token" would have a list of the document numbers that contain that word, along with other metadata such as how often the term appears in each document relative to the length of the document. This way, queries looking for "token" and "lucene" can be resolved by merging the lists for those two tokens.
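From the query side of the Java API, that two-token search is expressed as a boolean query with two required terms; Lucene merges the postings lists behind this call. The field name and index path carry over from the previous sketch:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class SearchTwoTerms {
        public static void main(String[] args) throws Exception {
            // Matching documents must contain both "token" and "lucene".
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("contents", "token")),
                      BooleanClause.Occur.MUST);
            query.add(new TermQuery(new Term("contents", "lucene")),
                      BooleanClause.Occur.MUST);

            IndexSearcher searcher = new IndexSearcher("/tmp/lucene-index");
            TopDocs hits = searcher.search(query, null, 10);
            System.out.println(hits.totalHits + " matching documents");
            searcher.close();
        }
    }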
A great deal of attention has been paid in Lucene to avoiding locks on index data. This way, the index can be updated in the background while it is actively being used to service searches, and readers never have to wait on the update process. Only a single update can be running against an index at any time, but many clients can be reading the index while that update is occurring.
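In the Java API that looks roughly like the following sketch: a searcher opened against the index keeps serving the point-in-time snapshot of segments it saw when opened, while the single writer appends new segments alongside it.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ReadWhileWriting {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.getDirectory("/tmp/lucene-index");

            // Sees only the segments that existed when it was opened.
            IndexSearcher searcher = new IndexSearcher(dir);

            // The single writer adds documents into new segments; the
            // open searcher is unaffected until it is reopened.
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                    false, IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("contents", "newly added text",
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            searcher.close();
        }
    }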
A Lucene index is made up of one or more segments. Each segment is fully independent of any other segment and is stored in one or more files. Concurrency without locking is achieved by writing any new or changed data to a new segment. One way to speed up indexing and create fewer segments is to have Lucene cache added documents in RAM and flush out a single, large segment on a less frequent basis.
For Java Lucene, setRAMBufferSizeMB() sets how much RAM can be used before a new segment is written; its default is only 16 MB. Creating larger segments during indexing means it will take slightly longer before clients can see new documents (because the new segment has not been written and is thus not accessible) but will make for fewer, larger segments and thus less need to merge segments later.
Instead of flushing a new segment when enough RAM has been used, you can force a segment to be flushed every X documents with setMaxBufferedDocs(). By default, flushing happens when the buffered RAM size is reached; there is no default maximum number of documents before a flush.
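Both knobs live on the IndexWriter in the Java API. A minimal sketch; the 64 MB and 10,000 document figures are arbitrary choices for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class TuneFlushing {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/tmp/lucene-index"),
                    new StandardAnalyzer(), false,
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // Buffer up to 64 MB of added documents in RAM before a new
            // segment is flushed (the default is 16 MB).
            writer.setRAMBufferSizeMB(64);

            // Or flush after every 10,000 buffered documents instead;
            // there is no document-count trigger by default.
            // writer.setMaxBufferedDocs(10000);

            // ... add documents here ...
            writer.close();
        }
    }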
Segments are merged either periodically during the adding of documents or by calling one of the optimize() methods. If an index is to remain constant for a period of time, it is a good idea to optimize it so that multiple segments are converted into a single segment. Optimization has the side benefit that, if your filesystem is not too full, writing out a new single-segment Lucene index should also mean that the index is stored in a single filesystem extent.
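In the Java API this is a single call on the writer once bulk indexing is done:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            // Open the existing index for appending (create == false).
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/tmp/lucene-index"),
                    new StandardAnalyzer(), false,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            // Merge all segments down to one.
            writer.optimize();
            writer.close();
        }
    }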
Adding segments and merging segments are very similar operations. To merge segments, all of the data is copied from the old segments into a new segment and the old segments are then discarded. The currently active segments are listed in the "segments" file. Depending on the Lucene implementation you are using, a commit lock might protect the segments file while it is being updated. At any rate, as the segments file just lists the file names and other metadata about segments, it can be updated very quickly.
I mentioned at the outset that Lucene specializes in full text indexing. There are some issues with numerical and date metadata that make those datatypes more complex to handle than just shoving full text into the index; terms are compared as strings, so a date or number must be encoded so that its string order matches its natural order.
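One common workaround, sketched here with the Java API's DateTools helper, is to encode dates as lexicographically sortable keyword strings so that an ordinary term range query behaves as a date range query; the field name and dates are illustrative:

    import java.util.Date;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    public class DateMetadata {
        public static void main(String[] args) {
            // Encode today as "yyyyMMdd", e.g. "20081217", so that
            // string order matches date order.
            String day = DateTools.dateToString(new Date(),
                                                DateTools.Resolution.DAY);
            Document doc = new Document();
            doc.add(new Field("added", day,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));

            // All documents added in December 2008:
            RangeQuery december = new RangeQuery(
                    new Term("added", "20081201"),
                    new Term("added", "20081231"),
                    true); // inclusive bounds
            System.out.println(december);
        }
    }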
Knowing the Lucene API, and how to add information to a Lucene index and search for it, lets you develop many kinds of applications. Hopefully the glimpse behind the API that I've included can help you get started writing applications that use Lucene efficiently. Because there are implementations of Lucene in PHP, C++, C#, Java and other languages, you can apply general knowledge of Lucene to applications ranging from Web development to embedded coding.