
Development

The Lucene Search Suite

March 11, 2009

This article was contributed by Ben Martin

The Lucene project lets you index the documents on your filesystem or web server so you can run combined full text and metadata searches. A full text search takes one or more words of a human language as a query and should return documents which are the "most relevant" for those words. Web searches are a classic example of full text searches. Metadata searches should be familiar to anyone who has used the find command; for example, looking for all files that have been modified in the last week.

The primary goal of Lucene is to provide a fast index and query implementation and to specify an interface to the index implementation -- how to send queries to it and get your results back as fast as possible. Lucene is not, by itself, designed to be a complete user-facing index solution but rather to provide the heart of such a system. There are also higher level projects which use one of the Lucene implementations to provide search capabilities, for example, KDE4's strigi desktop search. If you just want to add a search capability to something then you might like to explore these higher level tools to see if you can save the time of writing a program that uses the Lucene API directly.

It is tempting to think of adding full text to an index as just a filesystem traversal where you read each file and shove the byte contents into the index. Normally you want to extend this to allow conversions too, such as extracting the plain text of PDF files and indexing the extracted human readable text instead of the bytes that comprise the PDF file. The metadata associated with a document is entirely up to you, for example, extracting the Vorbis artist, album and track comments from FLAC audio files and adding them as metadata.
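As a rough illustration of that traversal-plus-conversion idea, here is a minimal sketch in plain Python. It does not use any Lucene API; the CONVERTERS table, the toy strip_tags converter and the collect_documents function are all hypothetical names invented for this example, and a real system would plug PDF or FLAC extractors into the same slot.

```python
import os

def plain_text(data: bytes) -> str:
    """Default converter: treat the bytes as UTF-8 text."""
    return data.decode("utf-8", errors="replace")

def strip_tags(data: bytes) -> str:
    """Toy HTML converter (hypothetical): drop anything between < and >."""
    out, in_tag = [], False
    for ch in data.decode("utf-8", errors="replace"):
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
            out.append(" ")
        elif not in_tag:
            out.append(ch)
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(out).split())

# Map file extensions to converters (an assumption of this sketch).
CONVERTERS = {".txt": plain_text, ".html": strip_tags}

def collect_documents(root):
    """Walk root and yield (path, text, metadata) for each convertible file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lower()
            converter = CONVERTERS.get(ext)
            if converter is None:
                continue  # no extractor known for this type; skip it
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                text = converter(fh.read())
            # Which metadata to record is entirely up to the application.
            metadata = {"path": path, "size": str(os.path.getsize(path))}
            yield path, text, metadata
```

Each yielded tuple is what would then be handed to the index: the extracted human readable text rather than the raw bytes, plus whatever metadata the application chose to record.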

Using Lucene to index your Web site lets you offer a text search feature, like a Google search box, for servicing searches like "Wakelocks embedded". This is only the beginning, though, because you can also offer advanced searches by combining metadata into the search. If you build a Lucene index for each registered user, the personalized search you can offer is hard to beat: for example, finding pages about "locking" that contain a link to a specific web site in the article comments, or any article on "locking" that contains a comment by any one of your friends.

Lucene is actually an umbrella project with implementations in Java, C++, Ruby and PHP, among others. Probably the most widely known implementation of Lucene is the original one, written in Java. In recent times, implementations in C++ (CLucene) and PHP (Zend_Search_Lucene) have become available. There are also implementations in Perl and Ruby; see the full list for details. The CLucene page states that its primary goal is to be faster than the Java version. It would appear that the PHP implementation was primarily driven by the desire to be homogeneous with the PHP environment.

Implementing these full text and metadata search types normally calls for different queries and thus different implementations to best resolve the queries. For example, it might be quite common to want to search for a range in a metadata query, like all the documents added to the index in December, whereas a full text query might demand ranking of documents that contain the strings "DDR3" and "latency".
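To make the range case concrete, the following sketch resolves an "added in December" query with a binary search over sorted keys. It is plain Python rather than Lucene code; the by_date list, its dates and the docs_in_range function are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# (indexed-on date as an ISO string, document id), kept sorted by date.
# ISO dates have the convenient property of sorting correctly as strings.
by_date = sorted([
    ("2008-11-30", 1),
    ("2008-12-02", 2),
    ("2008-12-17", 3),
    ("2009-01-05", 4),
])

def docs_in_range(entries, low, high):
    """Return ids of documents whose key lies in [low, high]."""
    keys = [key for key, _ in entries]
    start = bisect_left(keys, low)    # first key >= low
    end = bisect_right(keys, high)    # one past the last key <= high
    return [doc_id for _, doc_id in entries[start:end]]

december = docs_in_range(by_date, "2008-12-01", "2008-12-31")
print(december)
```

Nothing here needs a relevance score: a document is either inside the range or not, which is why this kind of query wants a different structure from a ranked full text search.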

You don't really need to know what Lucene does on its side of the API to build and search indexes with it, though a high level knowledge of what happens in the implementation can help you understand how to make efficient use of the API.

Abstractly, a Lucene index consists of many Document objects, each of which contains one or more fields. A field is a key-value pair, for example, the key of "indexed-on" and a value "Wed Dec 17, 2008 @ 3:58 PM". The full text content of a document is also added to the Lucene index as a field property of a document.

Fields can be stored verbatim in the index, or have an index created for them, or both. You might want to index and store the URL that a document was retrieved from, but might want to only index the document text because storing it verbatim might make the index too large for your application. An index on the contents of a file is likely to be much smaller than the file itself. If you have access to the original file you don't really want to store it in the Lucene index verbatim too. A field can also be tokenized or stored atomically (a so-called keyword). You would want to tokenize the text content of a file but probably want the date it was indexed to remain an atomic value.
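The stored, indexed and tokenized choices can be modelled in a few lines. This is a toy model in plain Python, not the Lucene Field API: the Field and ToyIndex classes and their flag names are hypothetical, and the example.com URL is a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    stored: bool = False     # keep the value verbatim so it can be returned
    indexed: bool = True     # make the value searchable
    tokenized: bool = True   # split into words, or index as a single keyword

class ToyIndex:
    def __init__(self):
        self.postings = {}   # (field name, term) -> set of document ids
        self.stored = []     # per-document dict of verbatim stored values

    def add_document(self, fields):
        doc_id = len(self.stored)
        self.stored.append({f.name: f.value for f in fields if f.stored})
        for f in fields:
            if not f.indexed:
                continue
            # Tokenized fields are split on whitespace; keywords stay whole.
            terms = f.value.lower().split() if f.tokenized else [f.value]
            for term in terms:
                self.postings.setdefault((f.name, term), set()).add(doc_id)
        return doc_id

    def search(self, name, term):
        return sorted(self.postings.get((name, term), ()))

index = ToyIndex()
doc = index.add_document([
    Field("url", "http://example.com/a", stored=True, tokenized=False),
    Field("indexed-on", "Wed Dec 17, 2008 @ 3:58 PM", stored=True, tokenized=False),
    Field("contents", "Lucene builds an inverted index"),  # indexed, not stored
])
```

In this model the document text can be found by any of its words but cannot be read back from the index, while the date can be read back but matches only as one atomic value.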

Normally you would have Lucene tokenize the text of a file and build an inverted file arrangement for the tokens. For example, the word "token" would have a list of which document numbers contain that word along with other metadata relating to how often that term appears in each document relative to the length of the document. This way queries looking for "token" and "lucene" can be resolved by merging the two lists for each token.
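A small sketch of that arrangement follows. It is a simplification in plain Python and makes no claim about Lucene's actual file format or scoring: build_inverted_index and intersect are hypothetical names, tokenizing is just a whitespace split, and the "score" is a naive sum of relative term frequencies.

```python
from collections import Counter

def build_inverted_index(docs):
    """Map each token to a list of (doc number, relative frequency).

    Documents are visited in order, so every list is sorted by doc number.
    """
    index = {}
    for doc_no, text in enumerate(docs):
        tokens = text.lower().split()
        for token, count in sorted(Counter(tokens).items()):
            index.setdefault(token, []).append((doc_no, count / len(tokens)))
    return index

def intersect(a, b):
    """Merge two postings lists sorted by doc number, keeping common docs."""
    i = j = 0
    hits = []
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            hits.append((a[i][0], a[i][1] + b[j][1]))  # naive combined score
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return hits

docs = [
    "lucene stores each token in an inverted file",
    "a token is a word",
    "lucene merges the list for each token token",
]
index = build_inverted_index(docs)
both = intersect(index["token"], index["lucene"])
ranked = sorted(both, key=lambda hit: hit[1], reverse=True)
```

Because both lists are sorted by document number, the merge walks each list once, so the cost of the query grows with the length of the two lists rather than with the size of the whole collection.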

A great deal of attention has been paid to not locking data in the index with Lucene. This way, the index can undergo updating in the background while it is actively being used to service searches. This eliminates the need to wait on the background process. You can only have a single update running for an index at any time, but many clients can be reading the index while that update is occurring.

A Lucene index is made up of one or more segments. Each segment is fully independent of any other segment and is stored in one or more files. Concurrency without locking is achieved by writing any new or changed data to a new segment. One way to speed up indexing documents and create fewer segments is to have Lucene cache as many of the added documents in RAM as possible and flush out a single, large segment on a less frequent basis.

For Java Lucene, the setRAMBufferSizeMB method sets how much RAM can be used before a new segment is written; its default is only 16MB. Creating larger segments during indexing means it will take slightly longer before clients can see new documents (because the new segment has not yet been written and is thus not accessible), but it makes for fewer, larger segments and thus less need to merge segments later.

Instead of flushing a new segment when enough RAM has been used, you can force a segment to be flushed every X documents with setMaxBufferedDocs. By default, flushing is done when the buffered RAM size is reached and there is no default maximum number of documents before a flush.
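The trade-off between the two flush triggers can be seen in a toy writer. This is not the Java IndexWriter: the ToyWriter class and its ram_buffer_bytes and max_buffered_docs parameters are hypothetical stand-ins that only loosely mirror setRAMBufferSizeMB and setMaxBufferedDocs.

```python
class ToyWriter:
    """Buffers added documents and flushes them as immutable segments."""

    def __init__(self, ram_buffer_bytes=16 * 1024 * 1024, max_buffered_docs=None):
        self.ram_buffer_bytes = ram_buffer_bytes
        self.max_buffered_docs = max_buffered_docs  # None: no document limit
        self.buffer = []
        self.buffered_bytes = 0
        self.segments = []  # each flushed segment is a tuple of documents

    def add_document(self, text):
        self.buffer.append(text)
        self.buffered_bytes += len(text.encode("utf-8"))
        too_big = self.buffered_bytes >= self.ram_buffer_bytes
        too_many = (self.max_buffered_docs is not None
                    and len(self.buffer) >= self.max_buffered_docs)
        if too_big or too_many:
            self.flush()

    def flush(self):
        """Write the buffered documents out as one new segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0

    def visible_documents(self):
        """Readers only see documents that have reached a segment."""
        return [doc for segment in self.segments for doc in segment]

small = ToyWriter(ram_buffer_bytes=10)
large = ToyWriter(ram_buffer_bytes=1000)
for writer in (small, large):
    for i in range(6):
        writer.add_document("doc%02d" % i)  # five bytes per document
```

With the small buffer every second document produces a new segment and becomes visible immediately; with the large buffer nothing is visible until flush() is called, but only one segment results.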

Segments are merged either periodically during the adding of documents or by calling one of many optimize methods. If an index is to remain constant for a period of time it is a good idea to optimize it so that multiple segments are converted into a single segment. Optimization has the additional side benefit that if your filesystem is not full, writing a new single-segment Lucene index should also mean that the index is stored in a single filesystem extent.

Adding segments and merging segments are very similar operations. To merge segments, all of the data is copied from the old segments into a new segment and the old segments are then discarded. The currently active segments are listed in the "segments" file. Depending on how the implementation of Lucene you are using operates, the segments file might use a commit lock to protect it while it is being updated. At any rate, as the segments file just lists the file names and other metadata about segments, it can be updated very quickly.
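The copy-then-swap pattern described above might be sketched like this. It is a plain Python model, not Lucene's on-disk format: merge_segments is a hypothetical function, the segment names are illustrative, and document ids are assumed to be unique across segments to keep the example short.

```python
def merge_segments(segments, live_names, to_merge, new_name):
    """Copy the data of the named segments into one new segment.

    segments   -- dict mapping segment name to {term: sorted doc ids}
    live_names -- list of active segment names (the "segments" file)
    The old segments are left untouched; the new list of active names
    is returned so the caller can swap it in as a final, quick step.
    """
    merged = {}
    for name in to_merge:
        for term, doc_ids in segments[name].items():
            merged.setdefault(term, set()).update(doc_ids)
    segments[new_name] = {term: sorted(ids) for term, ids in merged.items()}
    return [n for n in live_names if n not in to_merge] + [new_name]

segments = {
    "_0": {"lucene": [0], "token": [0, 1]},
    "_1": {"token": [2], "index": [2, 3]},
}
live = ["_0", "_1"]

new_live = merge_segments(segments, live, ["_0", "_1"], "_2")
old_live, live = live, new_live   # the only step readers could contend on
for name in old_live:
    if name not in live:
        del segments[name]        # discard the old segments afterwards
```

All of the expensive copying happens before the list of active segments changes, which is why the swap itself can be so quick.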

I mentioned at the outset that Lucene specializes in full text indexing. There are some issues when using Lucene for numerical and date metadata which make using those datatypes a more complex task than just shoving full text into the index.
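The article does not spell those issues out. One commonly encountered example, assuming index terms are compared as strings, is that numbers do not sort numerically unless they are encoded with a fixed width. The pad_number helper below is a hypothetical illustration in plain Python, not a Lucene function.

```python
def pad_number(n, width=10):
    """Encode a non-negative integer so string order equals numeric order."""
    if n < 0 or n >= 10 ** width:
        raise ValueError("value out of range for this toy encoding")
    return str(n).zfill(width)

sizes = [9, 10, 200, 1500]

# Compared as plain strings, "10" sorts before "9" and "1500" before "200".
naive = sorted(str(n) for n in sizes)

# Zero-padded to a fixed width, string order matches numeric order again.
padded = sorted(pad_number(n) for n in sizes)
```

A range query over the naive strings would return the wrong documents, whereas the padded form behaves as expected at the cost of choosing a maximum width up front.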

Knowing the Lucene API and how to include and search for information in a Lucene index can allow you to develop many applications. Hopefully the glimpse behind the API that I've included can help you get started writing applications that use Lucene efficiently. Because there are implementations of Lucene in PHP, C++, C#, Java and other languages you can apply general knowledge of Lucene to applications ranging from Web development to embedded coding.

Comments (none posted)

System Applications

Database Software

Firebird 2.1.2 RC2 released

Version 2.1.2 release candidate 2 of the Firebird DBMS has been announced. "This is the second release candidate of the Firebird version 2.1.2 patch release. It is a BETA whose purpose is for FIELD TESTING. It is recommended that you test it before deploying it into production."

Comments (none posted)

MySQL Community Server 5.1.32 released

Version 5.1.32 of MySQL Community Server has been announced. "MySQL Community Server 5.1.32, a new version of the popular Open Source Database Management System, has been released. MySQL 5.1.32 is recommended for use on production systems."

Full Story (comments: none)

PostgreSQL Weekly News

The March 8, 2009 edition of the PostgreSQL Weekly News is online with the latest PostgreSQL DBMS articles and resources.

Full Story (comments: none)

Embedded Systems

BusyBox 1.13.3 released

Version 1.13.3 of BusyBox, a collection of command line utilities for embedded systems, has been announced. "1.13.3 is a bug fix release. It has fixes for awk, depmod, init, killall, mdev, modprobe, printf, syslogd, tar, top, unzip, wget."

Comments (none posted)

New Online Community for Developers of Embedded Linux Devices (LinuxElectrons)

LinuxElectrons looks at Meld, a new on-line community for embedded Linux. Meld, which is sponsored by MontaVista, takes some of the ideas of social networks and applies them to help embedded Linux developers collaborate. "'Linux is based on the idea of sharing knowledge, and there are strong underpinnings of this throughout the Linux community, yet there isn't a place for embedded Linux developers to go to collaborate and experience that sense of community,' said Joerg Bertholdt, Vice President of Marketing at MontaVista Software. 'Now, through Meld we want all embedded Linux device developers to come together to share their knowledge, collaborate with one another, and speed the design of innovative, commercial solutions running on embedded Linux. A strong community benefits all of its members and we believe this forum will allow Linux to grow and prosper in embedded devices.'"

Comments (11 posted)

Networking Tools

libnetfilter_log 0.0.16 released

Version 0.0.16 of libnetfilter_log has been announced. "libnetfilter_log is a userspace library providing interface to packets that have been logged by the kernel packet filter. It is part of a system that deprecates the old syslog/dmesg based packet logging."

Full Story (comments: none)

libnetfilter_queue 0.0.17 released

Version 0.0.17 of libnetfilter_queue has been announced. "The netfilter project proudly presents: libnetfilter_queue-0.0.17 is a userspace library providing an API to packets that have been queued by the kernel packet filter. It is part of a system that deprecates the old ip_queue / libipq mechanism."

Full Story (comments: none)

libnfnetlink 0.0.41 released

Version 0.0.41 of libnfnetlink has been announced. "libnfnetlink is the low-level library for netfilter related kernel/userspace communication. It provides a generic messaging infrastructure for in-kernel netfilter subsystems (such as nfnetlink_log, nfnetlink_queue, nfnetlink_conntrack) and their respective users and/or management tools in userspace."

Full Story (comments: none)

ulogd 2.0.0 beta 3 released

Version 2.0.0 beta 3 of ulogd has been announced. "ulogd is a userspace logging daemon for netfilter/iptables related logging. This includes per-packet logging of security violations, per-packet logging for accounting purpose as well as per-flow logging."

Full Story (comments: none)

Virtualization Software

ConVirt: goes 1.0 (SourceForge)

Version 1.0 of ConVirt has been announced. "ConVirt is an intuitive, graphical management tool providing comprehensive life cycle management for Virtual Machines. We are extremely pleased to announce the immediate availability of ConVirt v1.0. This critical milestone comes after many months of development, bug-fixing and hard-earned validation in data centers, all of which was made possible by the invaluable feedback, encouragement and contributions from the ConVirt Community."

Comments (none posted)

Web Site Development

lighttpd 1.4.22 released

Version 1.4.22 of the lighttpd web server has been announced. "And here we are again… we had some bad regressions, so 1.4.22 was needed earlier than we expected and spawn-fcgi is still included in this release."

Comments (none posted)

LimeSurvey: 1.80 full release (SourceForge)

Version 1.80 of LimeSurvey has been announced. "LimeSurvey (formerly PHPSurveyor) is a PHP survey software to create online surveys. Features open/closed surveys, branching, participant administration, quotas, WYSIWYG HTML editor, email invitations & reminders, assessments, basic statistics and more. The LimeSurvey 1.80 release marks the end of four release candidates and five month of work on this new release."

Comments (none posted)

Miscellaneous

Python process utility 0.1.1 released

Version 0.1.1 of psutil has been announced. "psutil is a module providing an interface for retrieving information on running processes in a portable way by using Python. It currently supports Linux, OS X, FreeBSD and Windows. Aside from fixing some bugs psutil 0.1.1 includes the following major enhancements: * FreeBSD support has been added * Support for determining process's UID and GID has been added * Support for determining parent PID of a process * A process_iter() function to iterate over processes as Process objects with a generator has been added * Process objects can now also be compared with == operator for equality (PID, name, command line are compared). As of now psutil is released to the general public, and should be considered a beta release implementing basic functionality."

Full Story (comments: none)

Desktop Applications

Audio Applications

First version of jackpanel (0.0.1) released

The initial release of jackpanel has been announced. "jackpanel is a graphical frontend for the JACK audio server, emphasizing simplicity, good look and feel and GNOME integration. Realtime switch, latency and samplerate can be changed with one or two mouse clicks. It comes in two flavors: A GNOME panel applet and standalone. X-runs are displayed and can be reset with a mouse click."

Full Story (comments: none)

Jajuk: 1.7 'Firestarter' made available (SourceForge)

Version 1.7 of Jajuk has been announced. "Jajuk 1.7 comes with major performance enhancements and a brand new rating system. Jajuk is a Java music organizer for all platforms. The main goal of this project is to provide a fully-featured application to advanced users with large or scattered music collections."

Comments (none posted)

Taglib extension library and tools: First release (SourceForge)

The first release of Taglib extension library and tools has been announced. "The libtagext0 library is a short hack to provide extended reading and writing meta tags for several audio files as an extension to Scott Wheeler's "TagLib" library."

Comments (none posted)

Desktop Environments

GNOME 2.26.0 release candidate (2.25.92) released

GNOME 2.26.0, release candidate 2.25.92 has been announced. "My friends, we're nearly there! 2.26.0 will be out in two weeks. Yes, it will! I tell you so. And it will be a milestone in our history. Sure, it will! You don't doubt it. Because it's looking quite good. It definitely does! Ask around you to check. And people will love it. That's for sure! Make people try it. But we can still work a bit more on polishing GNOME for the prime time. In the next ten days, we should all try to focus on the list of showstoppers [1] and try to close as many of them as possible."

Full Story (comments: none)

GNOME Software Announcements

The following new GNOME software has been announced this week: You can find more new GNOME software releases at gnomefiles.org.

Comments (none posted)

KDE 4.2.1 released

Version 4.2.1 of KDE has been announced. "KDE Community Ships First Translation and Service Release of the 4.2 Free Desktop, Containing Numerous Bugfixes, Performance Improvements and Translation Updates".

Full Story (comments: none)

KDE Software Announcements

The following new KDE software has been announced this week: You can find more new KDE software releases at kde-apps.org.

Comments (none posted)

Xorg Software Announcements

The following new Xorg software has been announced this week: More information can be found on the X.Org Foundation wiki.

Comments (none posted)

Desktop Publishing

Inforama: Community Edition 1.2 beta 2 Available (SourceForge)

Version 1.2 beta 2 of Inforama Community Edition has been announced. "Inforama - Document Automation. Document templates, generation and distribution. Create letter templates using OpenOffice and import existing Acrobat forms. Merge data to produce high quality PDF documents and automatically email, print and view. Inforama version 1.2 beta 2 has been released. We didn't announce the beta 1 release as we found some significant bugs which we wanted to fix - hence the jump straight to beta 2."

Comments (none posted)

Mail Clients

Claws Mail 3.7.1 unleashed

Version 3.7.1 of Claws Mail has been announced. "New in this release: * Spell Checking has been added to the Subject in the Compose window. * The 'Quotation characters' option has been moved from the Compose/Writing page of the preferences to the /Message View/Text Options page, where it should be. * When replying to signed and/or encrypted mail and the preference to sign and/or encrypt is set, the original mail's privacy system is automatically used, if available. * If a text/calendar attachment is present in a message it is automatically selected if a suitable plugin (i.e. vCalendar) is available. * /Tools/List URLs now shows both the link title and URI if possible. * A URI appearing in the statusbar is now only trimmed if necessary. * When using /Tools/Create filter|processing rule/Automatically the List-Id header is preferred to X-* headers..."

Full Story (comments: none)

Multimedia

Elisa Media Center 0.5.31 released

Version 0.5.31 of Elisa Media Center has been announced. "Elisa is a cross-platform and open-source Media Center written in Python. It uses GStreamer for media playback and pigment to create an appealing and intuitive user interface. This release is a "light weight" release, meaning it is pushed through our automatic plugin update system."

Full Story (comments: none)

Music Applications

Strasheela 0.9.9 released

Version 0.9.9 of Strasheela has been announced. "Strasheela is a highly expressive constraint-based music composition system. Users declaratively state a music theory and the computer generates music which complies with this theory. A theory is formulated as a constraint satisfaction problem (CSP) by a set of rules (constraints) applied to a music representation in which some aspects are expressed by variables (unknowns). Music constraint programming is style-independent and is well-suited for highly complex theories (e.g. a fully-fledged theory of harmony). Results can be output into various formats including MIDI, Lilypond, and Csound. This release brings many small-scale improvements and extensions to Strasheela."

Full Story (comments: none)

Office Suites

KOffice 2.0 Beta 7 Released (KDEDot)

Version 2.0 Beta 7 of KOffice has been announced. "The KOffice developers have released their seventh beta for KOffice 2.0. This release may be the last of the many betas. A decision on whether there will be another beta or if the next version will be the first Release Candidates will be made next week. The list of changes is longer than ever. For this release we have concentrated on crashes, data loss bugs and ODF saving and loading."

Comments (none posted)

OpenOffice.org Newsletter

The February, 2009 edition of the OpenOffice.org Newsletter is out with the latest OO.o office suite articles and events.

Full Story (comments: none)

Video Applications

FFmpeg 0.5 released

It has been a long time since we have seen an FFmpeg release, but version 0.5 is now out. As one might expect, the changes are extensive and are mostly in the form of new codecs. More information can be found on the web site and in the version 0.5 changelog. "It is codenamed 'half-way to world domination A.K.A. the belligerent blue bike shed' to give an idea where we stand in the grand scheme of things and to commemorate the many fruitful discussions we had during its development."

Comments (6 posted)

Web Browsers

Firefox 3.0.7 is available

Version 3.0.7 of the Firefox web browser has been announced. "As part of Mozilla Corporation's ongoing security and stability update process, Firefox 3.0.7 is now available for Windows, Mac, and Linux for free download from http://getfirefox.com/. We strongly recommend that all Firefox users upgrade to this latest release." Several security fixes are included; see the release notes for more information.

Full Story (comments: 8)

Firefox 3.1 becoming Firefox 3.5

Firefox version 3.1 will be renumbered to version 3.5. "As was discussed at the delivery meeting yesterday, we're proposing to change the version number of Shiretoko from 3.1 to 3.5. The increase in scope represented by TraceMonkey and Private Browsing, plus the sheer volume of work that's gone into everything from video and layout to places and the plugin service make it a larger increment than we believe is reasonable to label ".1"."

Full Story (comments: 2)

Languages and Tools

Caml

Caml Weekly News

The March 10, 2009 edition of the Caml Weekly News is out with new articles about the Caml language.

Full Story (comments: none)

Perl

POE::Component::IRC 6.00 is here

Version 6.00 of POE::Component::IRC has been announced. "For the uninitiated, POE::Component::IRC is an event-driven IRC client library built on top of POE. People mostly use it to write bots. Some have made that even easier by creating a simpler interface suited to that task (see Bot::BasicBot). I became involved in the project about 14 months ago, fixing bugs and adding features. There've been about 50 releases during that time, so there's something for everybody."

Comments (none posted)

Python

Jython 2.5 beta 2 released

Version 2.5 beta 2 of Jython, an implementation of Python in Java, has been announced. "Unless a severe bug is found, this will be the last beta before we start putting out release candidates. The modjy project has been pushed into the core, there have been many bugfixes. I attempted to get all of the bugfixes out of the tracker and into the NEWS file. Hopefully we can get more disciplined about change logs in the future."

Full Story (comments: none)

Jython 2.5 beta 3 released

Version 2.5 beta 3 of Jython has been released. "When I released Beta 2 this Saturday, I said it would be the last beta unless a severe bug was found. Well, a severe bug was found. Under certain circumstances Jython Beta 2 would not start on Windows."

Full Story (comments: none)

Python 3.1 alpha 1 released

Version 3.1 alpha 1 of Python has been announced. "Python 3.1 focuses on the stabilization and optimization of features and changes Python 3.0 introduced. The new I/O system has been rewritten in C for speed. Other features include a ordered dictionary implementation and support for ttk Tile in Tkinter. Please note that these are alpha releases, and as such are not suitable for production environments."

Full Story (comments: none)

Python-URL! - weekly Python news and links

The March 11, 2009 edition of the Python-URL! is online with a new collection of Python article links.

Full Story (comments: none)

Tcl/Tk

Tcl-URL! - weekly Tcl news and links

The March 5, 2009 edition of the Tcl-URL! is online with new Tcl/Tk articles and resources.

Full Story (comments: none)

XML

cssutils 0.9.6a2 released

Version 0.9.6a2 of cssutils has been announced. The software is: "A Python package to parse and build CSS Cascading Style Sheets." Changes include bug fixes, an API change and some new capabilities.

Full Story (comments: none)

Cross Compilers

PyMite release 07 announced

Release 07 of PyMite has been announced; it includes new features and bug fixes. "PyMite is a flyweight Python interpreter written from scratch to execute on 8-bit and larger microcontrollers with resources as limited as 64 KB of program memory (flash) and 4 KB of RAM. PyMite supports a subset of the Python 2.5 syntax and can execute a subset of the Python 2.5 bytecodes. PyMite can also be compiled, tested and executed on a desktop computer."

Full Story (comments: none)

IDEs

eric 4.3.1 released

Version 4.3.1 of eric, an IDE for Python and Ruby, has been announced. "I just uploaded eric 4.3.1. It is a maintenance release fixing some bugs."

Full Story (comments: none)

Test Suites

Linux Desktop Testing Project 1.5.1 released

Version 1.5.1 of the Linux Desktop Testing Project, a test automation framework for desktop applications, has been announced. "This release features number of important breakthroughs in LDTP as well as in the field of Test Automation. This release note covers a brief introduction on LDTP followed by the list of new features and major bug fixes which makes this new version of LDTP the best of the breed."

Full Story (comments: none)

Version Control

Mercurial 1.2 released

Version 1.2 of the Mercurial source code management system has been announced. "This is a larger feature release."

Full Story (comments: none)

TopGit 0.7 announced

Version 0.7 of TopGit has been announced; it adds new features and bug fixes. "The most useful new feature (in my opinion) is a new export method that provides your patches as a linear history in a regular git branch for pulling by your upstream. TopGit aims to make handling of large amount of interdependent topic branches easier. In fact, it is designed especially for the case when you maintain a queue of third-party patches on top of another (perhaps Git-controlled) project and want to easily organize, maintain and submit them - TopGit achieves that by keeping a separate topic branch for each patch and providing few tools to maintain the branches".

Full Story (comments: none)

Page editor: Forrest Cook


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds