Introducing the OCRopus Project

The OCRopus Project is a new open-source optical character recognition (OCR) effort that was launched this week by Google:

OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census Bureau, and novel high-performance layout analysis methods. OCRopus development is sponsored by Google and is initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications.

According to the FAQ document, OCRopus is mainly intended to be used for character recognition of scanned and digitally photographed text. Output will be in HTML+CSS format. The OCRopus plug-in architecture will support multiple character recognition plug-ins. Scanning of non-English text will be provided by language-specific plug-in modules. The Processing Steps diagram gives a graphical overview of the code flow.
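To make the plug-in idea concrete, here is a minimal Python sketch of how a pipeline with swappable layout-analysis and character-recognition stages might be wired together. It is an illustration only and does not reflect the actual OCRopus API: every name in it (`LAYOUT_PLUGINS`, `RECOGNIZER_PLUGINS`, `register`, `convert`, and the plug-in names) is hypothetical, and plain strings stand in for scanned line images so that the example runs without any OCR engine.

```python
from html import escape
from typing import Callable, Dict, List

# Hypothetical stand-ins: a "page" is a string, and each "line image"
# is one line of that string. No real image processing happens here.
LineImage = str
LayoutAnalyzer = Callable[[str], List[LineImage]]
Recognizer = Callable[[LineImage], str]

# Registries mapping a plug-in name to its implementation (assumed design).
LAYOUT_PLUGINS: Dict[str, LayoutAnalyzer] = {}
RECOGNIZER_PLUGINS: Dict[str, Recognizer] = {}


def register(registry: dict, name: str):
    """Return a decorator that stores a function in a registry under name."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register(LAYOUT_PLUGINS, "single-column")
def single_column_layout(page: str) -> List[LineImage]:
    # Treat every non-blank line as one text line of the page.
    return [line for line in page.splitlines() if line.strip()]


@register(RECOGNIZER_PLUGINS, "plain")
def plain_recognizer(line_image: LineImage) -> str:
    # Placeholder "recognition": the line already is its own text.
    return line_image.strip()


@register(RECOGNIZER_PLUGINS, "uppercase")
def uppercase_recognizer(line_image: LineImage) -> str:
    # A second placeholder, present only to show that engines can be swapped.
    return line_image.strip().upper()


def convert(page: str, layout: str = "single-column",
            recognizer: str = "plain") -> str:
    """Run layout analysis, then recognition, and emit simple HTML."""
    lines = LAYOUT_PLUGINS[layout](page)
    texts = [RECOGNIZER_PLUGINS[recognizer](line) for line in lines]
    body = "\n".join(
        f'<p class="ocr-line">{escape(text)}</p>' for text in texts
    )
    return f"<html><body>\n{body}\n</body></html>"


if __name__ == "__main__":
    print(convert("Hello <world>\n\n  second line ", recognizer="uppercase"))
```

In a design along these lines, supporting another language or recognition engine would mean registering one more function under a new name; the rest of the pipeline would stay unchanged.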

The software is being released under the Apache license and is written in C++ and Python. One of the main components of OCRopus is Tesseract OCR, which was released as open-source code by HP and UNLV in 2005.

The lead OCRopus developer is Professor Thomas Breuel from the German Research Center for Artificial Intelligence in Kaiserslautern. Funding has been set aside to support a number of graduate students.

The source code is available for an early release of the project: "The technology preview release is basically the first check-in of the source code into the subversion repository. What you can expect is that this code performs about as well as Tesseract in terms of character-level performance, but that is able to cope better with non-trivial layouts. There is no packaging, binary distribution, or full autoconf yet." The getting started document explains the dependencies and shows how to build the software.

The project roadmap calls for an alpha release in the third quarter of 2007, a beta release in the first quarter of 2008 and a 1.0 release in the third quarter of 2008.

Open-source contributions are being requested: "We are hoping for contributions by the open source community in areas such as adapting the system to additional languages, creating a Gnome desktop application, integration with Gnome desktop search, web-based tools for proofing and training, language modeling, additional character recognition engines, and other useful tools and add-ons."

Help is being requested for porting to non-Linux platforms. Support for KDE is not yet mentioned, but should be possible with a bit of developer effort.


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds