March 11, 2009
This article was contributed by Ben Martin
The Lucene project lets you index the documents on your filesystem or
web server so you can run combined full text and metadata searches. A full
text search takes one or more
words of a human language as a query and should return documents which
are the "most relevant" for those words. Web searches are a classic
example of full text searches. Metadata searches should be familiar
to anyone who has used the find command; for example,
looking for all files that have been modified in the last week.
The primary goal of Lucene is to provide a fast index and query
implementation and to specify an interface to the index implementation
-- how to send queries to it and get your results back as fast as
possible. Lucene is not, by itself, designed to be a complete user-facing
index solution but rather to provide the heart of such a
system. There are also higher level projects which use one of the
Lucene implementations to provide search capabilities, for example,
KDE4's strigi desktop search. If you just want to add a search
capability to something then you might like to explore these higher level
tools to see if you can save the time of writing a program that uses
the Lucene API directly.
It is tempting to think of adding full text to an index as just a
filesystem traversal where you read each file and shove the byte
contents into the index. Normally you want to extend this to allow
conversions too, such as extracting the plain text of PDF files
and indexing the extracted human readable text instead of the bytes
that comprise the PDF file. The metadata associated with a document is
entirely up to you, for example, extracting the Vorbis artist, album
and track comments from FLAC audio files and adding them as metadata.
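To make the traversal-plus-conversion idea concrete, here is a minimal sketch in plain Python rather than against any Lucene API. The `EXTRACTORS` registry, the `collect_documents` function and the field names are all assumptions made up for illustration; a real system would plug a PDF or FLAC extractor into the registry.

```python
from pathlib import Path

def extract_plain(path: Path) -> str:
    """Treat the file as text; undecodable bytes are replaced."""
    return path.read_bytes().decode("utf-8", errors="replace")

# Hypothetical registry: file suffix -> function returning human readable text.
# A PDF or FLAC extractor would be registered here in a real system.
EXTRACTORS = {".txt": extract_plain, ".html": extract_plain}

def collect_documents(root: Path, extractors=EXTRACTORS):
    """Walk root and yield one dict of fields per file that has an extractor."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        extractor = extractors.get(path.suffix.lower())
        if extractor is None:
            continue  # no converter registered: skip rather than index raw bytes
        yield {
            "path": str(path),                # metadata
            "modified": path.stat().st_mtime,  # metadata
            "contents": extractor(path),       # full text to be indexed
        }
```

Each yielded dictionary stands in for what would become one document in the index, with the extracted text and the metadata kept as separate fields.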
Using Lucene to index your Web site lets you offer a text search feature - like
a Google search box - for servicing searches like "Wakelocks embedded".
This is only the beginning though, because you can also offer advanced
searches by combining metadata into the search. If you build a Lucene
index for each registered user, the personalized search you can offer
is hard to beat. For example, a user could find pages about "locking" that
contain a link to a specific web site in the article comments, or any
article on "locking" that contains a comment by any one of their
friends.
Lucene is actually an umbrella project with implementations
in Java, C++, Ruby and PHP, among others.
Probably the most widely known implementation of Lucene is the original one,
written in Java. More recently, implementations in C++ (CLucene) and PHP
(Zend_Search_Lucene) have become available. There are also implementations
in Perl and Ruby; see the full list for details.
The CLucene page states that its primary goal is to be
faster than the Java version. It would appear that the PHP
implementation was primarily driven
by the desire to be homogeneous with the PHP environment.
Full text and metadata searches
normally call for different queries and thus different
implementations to best resolve those queries. For example, it might be
quite common to want to search for a range in a metadata query, like
all the documents added to the index in December, whereas a full text
query might demand ranking of documents that contain the strings
"DDR3" and "latency".
You don't really need to know what Lucene does on its side of the API
to build and search indexes with it, though a high level knowledge of
what happens in the implementation can help you understand how to make
efficient use of the API.
Abstractly, a Lucene index consists of many Document objects, each of
which contains one or more fields. A
field is a key-value pair, for example, the key of "indexed-on" and a
value "Wed Dec 17, 2008 @ 3:58 PM". The full text content of a
document is also added to the Lucene index as a field of that
document.
Fields can be stored verbatim in the index, or have an index created
for them, or both. You might want to index and store the URL that a
document was retrieved from, but might want to only index the document
text because storing it verbatim might make the index too large for
your application. An index on the contents of a file is likely to be
much smaller than the file itself. If you have access to the original
file you don't really want to store it in the Lucene index verbatim
too. A field can also be tokenized or stored atomically (as a so-called
keyword). You
would want to tokenize the text content of a file but probably want
the date it was indexed to remain an atomic value.
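The three choices (stored, indexed, tokenized) can be modelled in a few lines. The sketch below is plain Python, not the Lucene API; the `Field` and `Document` classes and their flag names are assumptions chosen to mirror the description above, and the tokenizer is nothing more than a lowercase whitespace split.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Field:
    name: str
    value: str
    stored: bool = False     # keep the verbatim value so it can be returned with hits
    indexed: bool = True     # make the value searchable
    tokenized: bool = True   # split into words; False keeps one atomic keyword

    def terms(self):
        """Return the terms this field contributes to the index."""
        if not self.indexed:
            return []
        if self.tokenized:
            return self.value.lower().split()
        return [self.value]

@dataclass
class Document:
    fields: list = field(default_factory=list)

    def add(self, f: Field):
        self.fields.append(f)
        return self  # allow chaining

    def stored_fields(self):
        """Only stored fields can be read back verbatim from the index."""
        return {f.name: f.value for f in self.fields if f.stored}

doc = (
    Document()
    .add(Field("url", "http://example.com/a", stored=True, tokenized=False))
    .add(Field("indexed-on", "Wed Dec 17, 2008 @ 3:58 PM", stored=True, tokenized=False))
    .add(Field("contents", "Lucene builds an inverted index"))  # indexed, not stored
)
```

Here the URL and date are kept whole and can be returned with a hit, while the text is searchable word by word but cannot be recovered from the index.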
Normally you would have Lucene tokenize the text of a file and build
an inverted file arrangement
for the tokens. For example, the word "token" would have a list of
which document numbers contain that word along with other metadata
relating to how often that term appears in each document relative to
the length of the document. This way queries looking for "token" and
"lucene" can be resolved by merging the two lists for each token.
Lucene pays a great deal of attention to avoiding
locks on the data in the index. As a result,
the index can be updated in the background while it is
actively being used to service searches,
and searches do not have to wait on the background process.
You can only have a single update running for an index at
any time, but many clients can be reading the index while that update
is occurring.
A Lucene index is made up of one or more segments. Each segment is
fully independent of any other segment and is stored in one or more
files. Concurrency without locking is achieved by writing any new or
changed data to a new segment. One way to speed up indexing
and create fewer segments is to have Lucene cache as many of the added
documents in RAM as possible and flush out a single, large segment on a less
frequent basis.
In Java Lucene, setRAMBufferSizeMB
sets how much RAM can be used before a new segment is
written; its default is only 16MB. Creating larger segments during
indexing means it will take slightly longer before clients can see new
documents (because a segment is not accessible until it has been
written) but will make for fewer, larger segments and thus less
need to merge segments later.
Instead of flushing a new segment when enough RAM has been used, you
can force a segment to be flushed every X documents with setMaxBufferedDocs. By
default, flushing is done when the buffered RAM size is reached and
there is no default maximum number of documents before a flush.
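The two flush triggers can be illustrated with a small stand-in for an index writer. The class below is a hypothetical Python sketch, not Java Lucene's IndexWriter: the parameter names only echo setRAMBufferSizeMB and setMaxBufferedDocs, "RAM used" is approximated by the UTF-8 size of the text, and I have assumed that whichever limit is reached first causes the flush.

```python
class BufferedSegmentWriter:
    """Buffer added documents in memory and flush them as one segment."""

    def __init__(self, ram_buffer_bytes=16 * 1024 * 1024, max_buffered_docs=None):
        self.ram_buffer_bytes = ram_buffer_bytes    # None disables the RAM trigger
        self.max_buffered_docs = max_buffered_docs  # None disables the count trigger
        self.buffer = []
        self.buffered_bytes = 0
        self.segments = []  # flushed segments, the only data readers can see

    def add_document(self, text):
        self.buffer.append(text)
        self.buffered_bytes += len(text.encode("utf-8"))
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        if (self.max_buffered_docs is not None
                and len(self.buffer) >= self.max_buffered_docs):
            return True
        return (self.ram_buffer_bytes is not None
                and self.buffered_bytes >= self.ram_buffer_bytes)

    def flush(self):
        """Write the buffered documents out as a single new segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0
```

A larger buffer means fewer entries in `segments`, at the cost of documents sitting invisibly in `buffer` for longer.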
Segments are merged either periodically as documents are added or by
calling one of the many optimize
methods. If an index is to remain constant for a period of time it is
a good idea to optimize it so that multiple segments are converted
into a single segment. Optimization has the additional side benefit
that if your filesystem is not full, writing a new single-segment
Lucene index should also mean that the index is stored in a single
filesystem extent.
Adding segments and merging segments are very similar operations.
To merge segments, all of the data is copied from the old segments
into a new segment and the old segments are then discarded.
The currently active segments are listed in the "segments" file.
Depending on how the
implementation of Lucene you are using operates, the segments file
might use a commit lock to protect it while it is being updated.
At any rate, as the segments file just lists the file names and other
metadata about segments, it can be updated very quickly.
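The copy-then-swap pattern behind merging can be sketched as follows. This is an illustrative Python model under my own assumptions, not Lucene's file format: a segment is just a dictionary from term to document numbers, the `active` tuple plays the role of the "segments" file, and the `_0`, `_1` naming is made up.

```python
def merge_segments(segments):
    """Copy all postings from the given segments into one new segment."""
    merged = {}
    for seg in segments:
        for term, doc_ids in seg.items():
            merged.setdefault(term, []).extend(doc_ids)
    return {term: sorted(set(ids)) for term, ids in merged.items()}

class Index:
    def __init__(self):
        self.segments = {}  # segment name -> segment data (the segment files)
        self.active = ()    # stand-in for the "segments" file: names of live segments
        self._counter = 0

    def _new_name(self):
        name = f"_{self._counter}"
        self._counter += 1
        return name

    def add_segment(self, seg):
        name = self._new_name()
        self.segments[name] = seg
        self.active = self.active + (name,)  # replaced wholesale, never edited in place
        return name

    def optimize(self):
        """Merge every live segment into one, then discard the old ones."""
        old = self.active
        if len(old) <= 1:
            return
        name = self._new_name()
        self.segments[name] = merge_segments([self.segments[n] for n in old])
        self.active = (name,)  # the quick update of the segment list
        for n in old:
            del self.segments[n]

    def search(self, term):
        hits = []
        for name in self.active:
            hits.extend(self.segments[name].get(term, []))
        return sorted(hits)
```

The expensive part is copying the postings; switching `active` over to the new segment is a single small update, which is why the real segments file can be changed quickly.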
I mentioned at the outset that Lucene specializes in full text
indexing. There are some issues when using Lucene for numerical
and date
metadata which make using those datatypes a more complex task than
just shoving full text into the index.
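One commonly encountered issue of this kind, which I am assuming here rather than quoting from the Lucene documentation, is that index terms are compared as strings, so a range over raw numbers or free-form dates does not behave as expected. A usual workaround is to encode values so that string order matches numeric or chronological order. The helper names below are hypothetical and the sketch is plain Python.

```python
from datetime import datetime

def pad_number(n, width=10):
    """Encode a non-negative integer so that string order equals numeric order."""
    if n < 0:
        raise ValueError("this simple encoding only handles non-negative values")
    return str(n).zfill(width)

def date_term(dt):
    """Encode a datetime as YYYYMMDDHHMMSS, which sorts chronologically as text."""
    return dt.strftime("%Y%m%d%H%M%S")

def range_query(terms, low, high):
    """Inclusive range over terms using plain string comparison."""
    return sorted(t for t in terms if low <= t <= high)

raw = ["9", "10", "100"]
padded = [pad_number(int(s)) for s in raw]

indexed_on = [
    date_term(datetime(2008, 11, 30, 9, 0)),
    date_term(datetime(2008, 12, 17, 15, 58)),
    date_term(datetime(2009, 1, 2, 8, 30)),
]
december = range_query(indexed_on, "20081201000000", "20081231235959")
```

Sorting `raw` as strings puts "9" last, while the padded forms sort in numeric order, and the December range picks out only the middle date.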
Knowing the Lucene API and how to include and search for information
in a Lucene index can allow you to develop many applications.
Hopefully the glimpse behind
the API that I've included can help you get started writing
applications that use Lucene efficiently. Because there are
implementations of Lucene in PHP, C++, C#, Java and other languages
you can apply general knowledge of Lucene to applications ranging from
Web development to embedded coding.