March 11, 2009
This article was contributed by Ben Martin
The Lucene project lets you index the documents on your filesystem or
web server so you can run combined full text and metadata searches. A full
text search takes one or more
words of a human language as a query and should return documents which
are the "most relevant" for those words. Web searches are a classic
example of full text searches. Metadata searches should be familiar
to anyone who has used the find command; for example,
looking for all files that have been modified in the last week.
The primary goal of Lucene is to provide a fast index and query
implementation and to specify an interface to the index implementation
-- how to send queries to it and get your results back as fast as
possible. Lucene is not, by itself, designed to be a complete user-facing
index solution but rather to provide the heart of such a
system. There are also higher level projects which use one of the
Lucene implementations to provide search capabilities, for example,
KDE4's strigi desktop search. If you just want to add a search
capability to something then you might like to explore these higher level
tools to see if you can save the time of writing a program that uses
the Lucene API directly.
It is tempting to think of adding full text to an index as just a
filesystem traversal where you read each file and shove the byte
contents into the index. Normally you want to extend this to allow
conversions too, such as extracting the plain text of PDF files
and indexing the extracted human readable text instead of the bytes
that comprise the PDF file. The metadata associated with a document is
entirely up to you; for example, you might extract the Vorbis artist, album
and track comments from FLAC audio files and add them as metadata.
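As a rough illustration of that traversal-plus-conversion idea, the sketch below walks a directory tree and hands each file to a converter chosen by file extension. It does not use Lucene at all; the CONVERTERS table, the function names, and the choice of metadata keys are all hypothetical, and a real system would plug PDF or FLAC extractors in where the placeholder sits.

```python
import os


def extract_plain(path):
    """Read a file as text, replacing bytes that are not valid UTF-8."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()


# Hypothetical converter table: file extension -> function returning text.
CONVERTERS = {
    ".txt": extract_plain,
    ".html": extract_plain,  # placeholder: a real converter would strip markup
}


def collect_documents(root, converters=CONVERTERS):
    """Walk root and yield (path, text, metadata) for files we can convert."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            convert = converters.get(ext)
            if convert is None:
                continue  # no converter known: skip rather than index raw bytes
            info = os.stat(path)
            metadata = {"path": path, "size": info.st_size, "mtime": info.st_mtime}
            yield path, convert(path), metadata
```

Each yielded tuple is what an indexer would then turn into one document: the extracted text for full text search, and the metadata dictionary for metadata search.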
Using Lucene to index your Web site lets you offer a text search feature - like
a Google search box - for servicing searches like "Wakelocks embedded".
This is only the beginning though, because you can also offer advanced
searches by combining metadata into the search. If you build a Lucene
index for each registered user, the personalized search you can offer
is hard to beat. For example, you could find pages about "locking" that
contain a link to a specific web site in the article comments, or any
article on "locking" that contains a comment by any one of your
friends.
Lucene is actually an umbrella project which has many implementations
in Java, C++, Ruby and PHP among others.
Probably the most widely known implementation of Lucene is the original one,
written in Java. In recent times, implementations in C++ (CLucene) and PHP
(Zend_Search_Lucene) have become available. There are also implementations
in Perl and Ruby; see the full list for details.
The CLucene page states that its primary goal is to be
faster than the Java version. It would appear that the PHP
implementation was
primarily driven
by the desire to be homogeneous with the PHP environment.
Implementing these full text and metadata search types
normally calls for different queries and thus different
implementations to best resolve the queries. For example, it might be
quite common to want to search for a range in a metadata query, like
all the documents added to the index in December, whereas a full text
query might demand ranking of documents that contain the strings
"DDR3" and "latency".
You don't really need to know what Lucene does on its side of the API
to build and search indexes with it, though a high level knowledge of
what happens in the implementation can help you understand how to make
efficient use of the API.
Abstractly, a Lucene index consists of many Document objects, each of
which contains one or more fields. A
field is a key-value pair, for example, the key of "indexed-on" and a
value "Wed Dec 17, 2008 @ 3:58 PM". The full text content of a
document is also added to the Lucene index as a field of that
document.
Fields can be stored verbatim in the index, or have an index created
for them, or both. You might want to index and store the URL that a
document was retrieved from, but might want to only index the document
text because storing it verbatim might make the index too large for
your application. An index on the contents of a file is likely to be
much smaller than the file itself. If you have access to the original
file you don't really want to store it in the Lucene index verbatim
too. A field can also be tokenized or stored atomically (a so-called
keyword). You would want to tokenize the text content of a file but
probably want the date it was indexed to remain an atomic value.
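The stored, indexed and tokenized choices can be pictured with a small data model. This is only a sketch of the concept in Python, not the Lucene API: the Field class, its flag names, and the example URL are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    stored: bool = True      # keep the value verbatim so it can be returned
    indexed: bool = True     # make the value searchable
    tokenized: bool = False  # split into words; False keeps one atomic keyword

    def terms(self):
        """Return the terms this field contributes to the index."""
        if not self.indexed:
            return []
        if self.tokenized:
            return self.value.lower().split()
        return [self.value]


# One document: a list of fields with different storage choices.
document = [
    Field("url", "http://example.com/page", stored=True, indexed=True),
    Field("indexed-on", "Wed Dec 17, 2008 @ 3:58 PM"),
    Field("contents", "Lucene builds an inverted index",
          stored=False, tokenized=True),
]
```

Here the "contents" field contributes five separate terms but is not kept verbatim, while "indexed-on" stays a single atomic term.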
Normally you would have Lucene tokenize the text of a file and build
an inverted file arrangement for the tokens. For example, the word
"token" would have a list of the document numbers that contain it,
along with other metadata
relating to how often that term appears in each document relative to
the length of the document. This way, queries looking for "token" and
"lucene" can be resolved by merging the lists for the two tokens.
A great deal of attention has been paid to not
locking data in the index with Lucene. This way,
the index can undergo updating in the background while it is
actively being used to service searches.
This eliminates the need to wait on the background process.
You can only have a single update running for an index at
any time, but many clients can be reading the index while that update
is occurring.
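To make the inverted file arrangement described earlier more concrete, here is a minimal sketch of one in Python. It is a toy model rather than Lucene's actual on-disk format: the function names are hypothetical, tokenizing is just lowercasing and splitting on whitespace, and the only per-document metadata kept is the term count divided by the document length.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a list of (doc_id, relative frequency), by doc_id."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        counts = {}
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
        for token, count in counts.items():
            # Frequency of the term relative to the document length.
            postings[token].append((doc_id, count / len(tokens)))
    return postings


def intersect(list_a, list_b):
    """Merge two postings lists sorted by doc_id, keeping shared doc_ids."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        a, b = list_a[i][0], list_b[j][0]
        if a == b:
            result.append(a)
            i += 1
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return result


docs = [
    "lucene stores each token in an index",
    "a token is a word",
    "lucene is a search library",
]
index = build_inverted_index(docs)
both = intersect(index["token"], index["lucene"])  # documents with both words
```

Because each postings list is already ordered by document number, the merge needs only a single pass over the two lists.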
A Lucene index is made up of one or more segments. Each segment is
fully independent of any other segment and is stored in one or more
files. Concurrency without locking is achieved by writing any new or
changed data to a new segment. One way to speed up indexing documents
and create fewer segments is to have Lucene cache as many of the added
documents in RAM as possible and flush out a single, large segment on a less
frequent basis.
For Java Lucene, the setRAMBufferSizeMB method
sets how much RAM can be used before a new segment is
written; its default is only 16MB. Creating larger segments during
indexing means it will take slightly longer before clients can see new
documents (because the new segment has not yet been written and is thus not
accessible) but will make for fewer, larger segments and thus less
need to merge segments later.
Instead of flushing a new segment when enough RAM has been used, you
can force a segment to be flushed every X documents with setMaxBufferedDocs. By
default, flushing is done when the buffered RAM size is reached and
there is no default maximum number of documents before a flush.
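The trade-off between the two flush triggers can be seen in a toy writer. This is a Python sketch of the idea only; ToyIndexWriter and its parameters are hypothetical stand-ins and do not mirror the Java class or its exact accounting of RAM use.

```python
class ToyIndexWriter:
    """Buffers documents in memory and flushes them as immutable segments."""

    def __init__(self, ram_buffer_bytes=None, max_buffered_docs=None):
        self.ram_buffer_bytes = ram_buffer_bytes
        self.max_buffered_docs = max_buffered_docs
        self.buffer = []
        self.buffered_bytes = 0
        self.segments = []  # each flushed segment is a tuple of documents

    def add_document(self, text):
        self.buffer.append(text)
        self.buffered_bytes += len(text.encode("utf-8"))
        over_ram = (self.ram_buffer_bytes is not None
                    and self.buffered_bytes >= self.ram_buffer_bytes)
        over_docs = (self.max_buffered_docs is not None
                     and len(self.buffer) >= self.max_buffered_docs)
        if over_ram or over_docs:
            self.flush()

    def flush(self):
        """Write the buffered documents out as one new segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0

    def visible_documents(self):
        """Readers only see documents that are in flushed segments."""
        return [doc for segment in self.segments for doc in segment]
```

With a larger threshold, documents sit in the buffer longer and are invisible to readers until the flush, but fewer segments are produced.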
Segments are merged either periodically during the adding of documents or by
calling one of many optimize
methods. If an index is to remain constant for a period of time it is
a good idea to optimize it so that multiple segments are converted
into a single segment. Optimization has the additional side benefit
that if your filesystem is not full, writing a new single-segment
Lucene index should also mean that the index is stored in a single
filesystem extent.
Adding segments and merging segments are very similar operations.
To merge segments, all of the data is copied from the old segments
into a new segment and the old segments are then discarded.
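A merge can be sketched in the same toy style. The code below assumes each segment is simply a dictionary from term to a sorted list of document numbers, and that document numbers are already unique across segments; both are simplifications made for this example, and the function names are hypothetical.

```python
def merge_segments(segments):
    """Copy every posting from the old segments into one new segment."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            # extend() copies the ids, so the old segments are left untouched
            merged.setdefault(term, []).extend(doc_ids)
    for doc_ids in merged.values():
        doc_ids.sort()
    return merged


def optimize(active_segments):
    """Return a new list of active segments holding a single merged segment."""
    if len(active_segments) <= 1:
        return list(active_segments)
    return [merge_segments(active_segments)]


old_segments = [
    {"lucene": [0, 2], "token": [0]},
    {"token": [3], "index": [3, 4]},
]
new_segments = optimize(old_segments)
```

The old segments are never modified, which is what lets readers keep using them until the list of active segments is switched over to the new one.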
The currently active segments are listed in the "segments" file.
Depending on how the
implementation of Lucene you are using operates, the segments file
might use a commit lock to protect it while it is being updated.
At any rate, as the segments file just lists the file names and other
metadata about segments, it can be updated very quickly.
I mentioned at the outset that Lucene specializes in full text
indexing. There are some issues when using Lucene for numerical
and date
metadata which make using those datatypes a more complex task than
just shoving full text into the index.
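The article does not spell those issues out. One that is commonly encountered, and which this sketch assumes is among them, is that index terms are compared as strings, so numbers and dates have to be encoded in a form whose string order matches their natural order. The helper names below are hypothetical and the code is plain Python, not a Lucene call.

```python
from datetime import datetime


def pad_number(value, width=10):
    """Encode a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(value).zfill(width)


def sortable_date(moment):
    """Encode a datetime as YYYYMMDDHHMMSS, which sorts chronologically."""
    return moment.strftime("%Y%m%d%H%M%S")


def in_range(term, low, high):
    """Inclusive range test using plain string comparison."""
    return low <= term <= high


sizes = [9, 10, 200]
naive = sorted(str(n) for n in sizes)          # string order, not numeric order
padded = sorted(pad_number(n) for n in sizes)  # string order equals numeric order

added = sortable_date(datetime(2008, 12, 17, 15, 58))
in_december = in_range(added, "20081201000000", "20081231235959")
```

With the naive encoding, a range test for 10 to 200 misses the value 50 because "50" sorts after "200" as a string; the padded form avoids that.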
Knowing the Lucene API and how to include and search for information
in a Lucene index can allow you to develop many applications.
Hopefully the glimpse behind
the API that I've included can help you get started writing
applications that use Lucene efficiently. Because there are
implementations of Lucene in PHP, C++, C#, Java and other languages
you can apply general knowledge of Lucene to applications ranging from
Web development to embedded coding.
Comments (none posted)
System Applications
Database Software
Version 2.1.2 release candidate 2 of the Firebird DBMS has been
announced.
"
This is the second release candidate of the Firebird version 2.1.2 patch release. It is a BETA whose purpose is for FIELD TESTING. It is recommended that you test it before deploying it into production."
Comments (none posted)
Version 5.1.32 of MySQL Community Server has been announced.
"
MySQL Community Server 5.1.32, a new version of the popular Open
Source Database Management System, has been released. MySQL 5.1.32 is
recommended for use on production systems."
Full Story (comments: none)
The March 8, 2009 edition of the PostgreSQL Weekly News
is online with the latest PostgreSQL DBMS articles and resources.
Full Story (comments: none)
Embedded Systems
Version 1.13.3 of
BusyBox, a collection of command line utilities for embedded systems, has been announced.
"
1.13.3 is a bug fix release. It has fixes for awk, depmod, init, killall, mdev, modprobe, printf, syslogd, tar, top, unzip, wget."
Comments (none posted)
LinuxElectrons
looks at Meld, a new on-line community for embedded Linux.
Meld, which is sponsored by MontaVista, takes some of the ideas of social networks and applies them to help embedded Linux developers collaborate. "
'Linux is based on the idea of sharing knowledge, and there are strong underpinnings of this throughout the Linux community, yet there isn't a place for embedded Linux developers to go to collaborate and experience that sense of community,' said Joerg Bertholdt, Vice President of Marketing at MontaVista Software. 'Now, through Meld we want all embedded Linux device developers to come together to share their knowledge, collaborate with one another, and speed the design of innovative, commercial solutions running on embedded Linux. A strong community benefits all of its members and we believe this forum will allow Linux to grow and prosper in embedded devices.'"
Comments (11 posted)
Networking Tools
Version 0.0.16 of libnetfilter_log has been announced.
"
libnetfilter_log is a userspace library providing interface to packets
that have been logged by the kernel packet filter. It is is part of a
system that deprecates the old syslog/dmesg based packet logging."
Full Story (comments: none)
Version 0.0.17 of libnetfilter_queue has been announced.
"
The netfilter project proudly presents:
libnetfilter_queue-0.0.17
is a userspace library providing an API to packets
that have been queued by the kernel packet filter. It is is part of a
system that deprecates the old ip_queue / libipq mechanism."
Full Story (comments: none)
Version 0.0.41 of libnfnetlink has been announced.
"
libnfnetlink is the low-level library for netfilter related
kernel/userspace communication. It provides a generic messaging
infrastructure for in-kernel netfilter subsystems (such as
nfnetlink_log, nfnetlink_queue, nfnetlink_conntrack) and their
respective users and/or management tools in userspace."
Full Story (comments: none)
Version 2.0.0 beta 3 of ulogd has been announced.
"
ulogd is a userspace logging daemon for netfilter/iptables related
logging. This includes per-packet logging of security violations,
per-packet logging for accounting purpose as well as per-flow logging.
"
Full Story (comments: none)
Virtualization Software
Version 1.0 of ConVirt has been
announced.
"
ConVirt is an intuitive, graphical management tool providing comprehensive life cycle management for Virtual Machines.
We are extremely pleased to announce the immediate availability of ConVirt v1.0. This critical milestone comes after many months of development, bug-fixing and hard-earned validation in data centers, all of which was made possible by the invaluable feedback, encouragement and contributions from the ConVirt Community."
Comments (none posted)
Web Site Development
Version 1.4.22 of the lighttpd web server has been
announced.
"
And here we are again... we had some bad regressions, so 1.4.22 was needed earlier than we expected and spawn-fcgi is still included in this release."
Comments (none posted)
Version 1.80 of LimeSurvey has been
announced.
"
LimeSurvey (formerly PHPSurveyor) is a PHP survey software to create online surveys. Features open/closed surveys, branching, participant administration, quotas, WYSIWYG HTML editor, email invitations & reminders, assessments, basic statistics and more.
The LimeSurvey 1.80 release marks the end of four release candidates and five month of work on this new release."
Comments (none posted)
Miscellaneous
Version 0.1.1 of psutil has been announced.
"
psutil is a module providing an interface for retrieving information
on running processes in a portable way by using Python.
It currently supports Linux, OS X, FreeBSD and Windows.
Aside from fixing some bugs psutil 0.1.1 includes the following major
enhancements:
* FreeBSD support has been added
* Support for determining process's UID and GID has been added
* Support for determining parent PID of a process
* A process_iter() function to iterate over processes as Process
objects with a generator has been added
* Process objects can now also be compared with == operator for
equality (PID, name, command line are compared).
As of now psutil is released to the general public, and should be
considered a beta release implementing basic functionality."
Full Story (comments: none)
Desktop Applications
Audio Applications
The initial release of jackpanel has been announced.
"
jackpanel is a graphical frontend for the JACK audio server,
emphasizing simplicity, good look and feel and GNOME integration.
Realtime switch, latency and samplerate can be changed with
one or two mouse clicks.
It comes in two flavors: A GNOME panel applet and standalone.
X-runs are displayed and can be reset with a mouse click."
Full Story (comments: none)
Version 1.7 of Jajuk has been
announced.
"
Jajuk 1.7 comes with major performance enhancements and a brand new rating system.
Jajuk is a Java music organizer for all platforms. The main goal of this project is to provide a fully-featured application to advanced users with large or scattered music collections."
Comments (none posted)
The first release of Taglib extension library and tools has been
announced.
"
The libtagext0 library is a short hack to provide extended
reading and writing meta tags for several audio files
as an extension to Scott Wheeler's "TagLib" library."
Comments (none posted)
Desktop Environments
GNOME 2.26.0, release candidate 2.25.92 has been announced.
"
My friends, we're nearly there! 2.26.0 will be out in two weeks. Yes, it
will! I tell you so. And it will be a milestone in our history. Sure, it
will! You don't doubt it. Because it's looking quite good. It definitely
does! Ask around you to check. And people will love it. That's for sure!
Make people try it. But we can still work a bit more on polishing GNOME
for the prime time.
In the next ten days, we should all try to focus on the list of
showstoppers [1] and try to close as many of them as possible."
Full Story (comments: none)
The following new GNOME software has been announced this week:
You can find more new GNOME software releases at
gnomefiles.org.
Comments (none posted)
Version 4.2.1 of KDE has been announced.
"
KDE Community Ships First Translation and Service Release of the 4.2 Free
Desktop, Containing Numerous Bugfixes, Performance Improvements and
Translation Updates".
Full Story (comments: none)
The following new KDE software has been announced this week:
You can find more new KDE software releases at
kde-apps.org.
Comments (none posted)
The following new Xorg software has been announced this week:
More information can be found on the
X.Org Foundation wiki.
Comments (none posted)
Desktop Publishing
Version 1.2 beta 2 of Inforama Community Edition has been
announced.
"
Inforama - Document Automation. Document templates, generation and distribution. Create letter templates using OpenOffice and import existing Acrobat forms. Merge data to produce high quality PDF documents and automatically email, print and view.
Inforama version 1.2 beta 2 has been released. We didn't announce the beta 1 release as we found some significant bugs which we wanted to fix - hence the jump straight to beta 2."
Comments (none posted)
Mail Clients
Version 3.7.1 of Claws Mail has been announced.
"
New in this release:
* Spell Checking has been added to the Subject in the Compose window.
* The 'Quotation characters' option has been moved from the Compose/
Writing page of the preferences to the /Message View/Text Options
page, where it should be.
* When replying to signed and/or encrypted mail and the preference to
sign and/or encrypt is set, the original mail's privacy system is
automatically used, if available.
* If a text/calendar attachment is present in a message it is
automatcally selected if a suitable plugin (i.e. vCalendar) is
available.
* /Tools/List URLs now shows both the link title and URI if possible.
* A URI appearing in the statusbar is now only trimmed if necessary.
* When using /Tools/Create filter|procesing rule/Automatically
the List-Id header is preferred to X-* headers..."
Full Story (comments: none)
Multimedia
Version 0.5.31 of Elisa Media Center has been announced.
"
Elisa is a cross-platform and open-source Media Center written in Python.
It uses GStreamer for media playback and pigment to create an
appealing and intuitive user interface.
This release is a "light weight" release, meaning it is pushed through
our automatic plugin update system."
Full Story (comments: none)
Music Applications
Version 0.9.9 of Strasheela has been announced.
"
Strasheela is a highly expressive constraint-based music composition
system. Users declaratively state a music theory and the computer
generates music which complies with this theory. A theory is
formulated as a constraint satisfaction problem (CSP) by a set of
rules (constraints) applied to a music representation in which some
aspects are expressed by variables (unknowns). Music constraint
programming is style-independent and is well-suited for highly
complex theories (e.g. a fully-fledged theory of harmony). Results
can be output into various formats including MIDI, Lilypond, and Csound.
This release brings many small-scale improvements and extensions to
Strasheela."
Full Story (comments: none)
Office Suites
Version 2.0 Beta 7 of KOffice has been
announced.
"
The KOffice developers have released their seventh beta for KOffice 2.0. This release may be the last of the many betas. A decision on whether there will be another beta or if the next version will be the first Release Candidates will be made next week.
The list of changes is longer than ever. For this release we have concentrated on crashes, data loss bugs and ODF saving and loading."
Comments (none posted)
The February, 2009 edition of the OpenOffice.org Newsletter
is out with the latest OO.o office suite articles and events.
Full Story (comments: none)
Video Applications
It has been a long time since we have seen an FFmpeg release, but version
0.5 is now out. As one might expect, the changes are extensive and are
mostly in the form of new codecs. More information can be found on
the web site and in the version 0.5 changelog. "
It is codenamed 'half-way to world
domination A.K.A. the belligerent blue bike shed' to give an idea where we
stand in the grand scheme of things and to commemorate the many fruitful
discussions we had during its development."
Comments (6 posted)
Web Browsers
Version 3.0.7 of the Firefox web browser has been announced.
"
As part of Mozilla Corporation's ongoing security and stability update
process, Firefox 3.0.7 is now available for Windows, Mac, and Linux
for free download from
http://getfirefox.com/.
We strongly recommend that all Firefox users upgrade to this latest
release." Several security fixes are included; see the
release notes for more information.
Full Story (comments: 8)
Firefox version 3.1 will be renumbered to version 3.5.
"
As was discussed at the delivery meeting yesterday, we're proposing to
change the version number of Shiretoko from 3.1 to 3.5. The increase in
scope represented by TraceMonkey and Private Browsing, plus the sheer
volume of work that's gone into everything from video and layout to
places and the plugin service make it a larger increment than we
believe is reasonable to
label ".1"."
Full Story (comments: 2)
Languages and Tools
Caml
The March 10, 2009 edition of the Caml Weekly News
is out with new articles about the Caml language.
Full Story (comments: none)
Perl
Version 6.00 of POE::Component::IRC has been
announced.
"
For the uninitiated, POE::Component::IRC is an event-driven IRC client library built on top of POE. People mostly use it to write bots. Some have made that even easier by creating a simpler interface suited to that task (see Bot::BasicBot).
I became involved in the project about 14 months ago, fixing bugs and adding features. There've been about 50 releases during that time, so there's something for everybody."
Comments (none posted)
Python
Version 2.5 beta 2 of Jython, an implementation of Python in Java,
has been announced.
"
Unless a severe bug is found, this will be the last beta before we
start putting out release candidates. The modjy project has been
pushed into the core, there have been many bugfixes. I attempted to
get all of the bugfixes out of the tracker and into the NEWS file.
Hopefully we can get more disciplined about change logs in the future."
Full Story (comments: none)
Version 2.5 beta 3 of Jython has been released.
"
When I released Beta 2 this Saturday, I said it would be the last beta
unless a severe bug was found. Well, a severe bug was found. Under
certain circumstances Jython Beta 2 would not start on Windows."
Full Story (comments: none)
Version 3.1 alpha 1 of Python has been announced.
"
Python 3.1 focuses on the stabilization and optimization of features and changes
Python 3.0 introduced. The new I/O system has been rewritten in C for speed.
Other features include a ordered dictionary implementation and support for ttk
Tile in Tkinter.
Please note that these are alpha releases, and as such are not suitable for
production environments."
Full Story (comments: none)
The March 11, 2009 edition of the Python-URL! is online with
a new collection of Python article links.
Full Story (comments: none)
Tcl/Tk
The March 5, 2009 edition of the Tcl-URL! is online with new
Tcl/Tk articles and resources.
Full Story (comments: none)
XML
Version 0.9.6a2 of cssutils has been announced.
The software is:
"
A Python package to parse and build CSS Cascading Style Sheets." Changes include bug fixes, an API change and some new capabilities.
Full Story (comments: none)
Cross Compilers
Release 07 of PyMite has been announced; it includes new features and
bug fixes.
"
PyMite is a flyweight Python interpreter written from scratch to execute
on 8-bit and larger microcontrollers with resources as limited as 64 KB of
program memory (flash) and 4 KB of RAM. PyMite supports a subset
of the Python 2.5 syntax and can execute a subset of the Python 2.5
bytecodes. PyMite can also be compiled, tested and executed on a desktop
computer."
Full Story (comments: none)
IDEs
Version 4.3.1 of eric, an IDE for Python and Ruby, has been announced.
"
I just uploaded eric 4.3.1. It is a maintenance release fixing some bugs."
Full Story (comments: none)
Test Suites
Version 1.5.1 of the Linux Desktop Testing Project, a test
automation framework for desktop applications, has been announced.
"
This release features
number of important breakthroughs in LDTP as well as in the field of Test
Automation. This release note covers a brief introduction on LDTP followed
by the list of new features and major bug fixes which makes this new version
of LDTP the best of the breed."
Full Story (comments: none)
Version Control
Version 1.2 of the Mercurial source code management system has been announced.
"
This is a larger feature release."
Full Story (comments: none)
Version 0.7 of TopGit has been announced; it adds new features and bug fixes.
"
The most useful new feature (in my opinion) is a new export method that
provides your patches as a linear history in a regular git branch for
pulling by your upstream.
TopGit aims to make handling of large amount of interdependent topic
branches easier. In fact, it is designed especially for the case when
you maintain a queue of third-party patches on top of another (perhaps
Git-controlled) project and want to easily organize, maintain and submit
them - TopGit achieves that by keeping a separate topic branch for each
patch and providing few tools to maintain the branches".
Full Story (comments: none)
Page editor: Forrest Cook