The OCRopus Project
is a new open-source optical character recognition (OCR) effort
that was launched
this week by Google:
OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.
The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.
OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.
According to the
FAQ document, OCRopus is mainly intended to be used for character
recognition of scanned and digitally photographed text.
Output will be in HTML+CSS format.
The OCRopus plug-in architecture will support
multiple character recognition plug-ins.
Scanning of non-English text will be provided by language-specific
Processing Steps diagram gives a graphical overview of the
The software is being released under the Apache license and is written
in C++ and Python. One of the main components of OCRopus is
Tesseract, which was released as open-source code by HP and UNLV in 2005.
The lead OCRopus developer is Professor Thomas Breuel
from the German Research Center for Artificial Intelligence in
Kaiserslautern. Funding has been set aside to support a
number of graduate students.
is available for an early release of the project:
"The technology preview release is basically the first check-in of the source code into the subversion repository. What you can expect is that this code performs about as well as Tesseract in terms of character-level performance, but that is able to cope better with non-trivial layouts. There is no packaging, binary distribution, or full autoconf yet."
The getting started document explains the dependencies and shows
how to build the software.
The project roadmap calls for an alpha release in the third quarter of
2007, a beta release in the first quarter of 2008 and a 1.0 release
in the third quarter of 2008.
Open-source contributions are being requested:
"We are hoping for contributions by the open source community in areas such
as adapting the system to additional languages, creating a Gnome desktop
application, integration with Gnome desktop search, web-based tools for
proofing and training, language modeling, additional character recognition
engines, and other useful tools and add-ons."
Help is being requested for porting to non-Linux platforms.
Support for KDE is not yet mentioned, but should be possible with
a bit of developer effort.