The OCRopus project launches

From:  "mark cox" <markcox-AT-email.com>
To:  lwn-AT-lwn.net
Subject:  Google Announces the OCRopus Open Source OCR System
Date:  Wed, 11 Apr 2007 09:59:42 +1000

http://google-code-updates.blogspot.com/2007/04/announcin...
Announcing the OCRopus Open Source OCR System
<http://google-code-updates.blogspot.com/2007/04/announcin...>
Monday, April 09, 2007

Posted by Thomas Breuel, OCRopus Project Leader

We're happy to announce the OCRopus OCR Project <http://www.ocropus.org/>, a
Google-sponsored project to develop advanced OCR
<http://en.wikipedia.org/wiki/Optical_character_recognition> technologies in
the IUPR research group <http://www.iupr.org/doku.php>, headed by Prof. Thomas
Breuel at the DFKI (German Research Center for Artificial Intelligence,
Kaiserslautern, Germany).

The goal of the project is to advance the state of the art in optical
character recognition and related technologies, and to deliver a high
quality OCR system suitable for document conversions, electronic libraries,
vision-impaired users, historical document analysis, and general desktop
use. In addition, we are structuring the system so that it will be easy for
other researchers in the field to reuse.

The OCRopus <http://www.ocropus.org/> engine is based on two research
projects: a high-performance handwriting recognizer developed in the
mid-1990s and deployed by the US Census Bureau, and novel high-performance
layout analysis methods.

The project is expected to run for three years and support three Ph.D.
students or postdocs. We are announcing a technology preview release of the
software under the Apache license (English-only, combining the Tesseract
character recognizer with IUPR layout analysis and language modeling tools),
with additional recognizers and functionality to follow in future releases.

The IUPR research group has extensive experience in OCR and related
technologies, and will be basing the work on previous research and existing
software in the area. Existing software components include high-performance
handwriting recognition software that has received top evaluations by NIST
and was deployed by the US Census Bureau, the recently open-sourced Tesseract
OCR system <http://sourceforge.net/projects/tesseract-ocr>, a separate
Google project for probabilistic natural language modeling, and software for
layout analysis and character recognition. The IUPR research group
gratefully acknowledges funding by the German BMBF, the state of
Rhineland-Palatinate, and other public and private partners (please see
www.iupr.org <http://www.iupr.org/doku.php> for more details).

We are hoping for contributions by the open source community in areas such
as adapting the system to additional languages, creating a GNOME desktop
application, integrating with GNOME desktop search, web-based tools for
proofing and training, language modeling, additional character recognition
engines, and other useful tools and add-ons.

The project web page can be found at ocropus.org <http://www.ocropus.org/>.




The OCRopus project launches

Posted Apr 15, 2007 21:49 UTC (Sun) by alexvoda (guest, #44678) [Link]

I would prefer a Java-based UI.
That way it would work on Mac OS X, Windows, Linux, and many other platforms.

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds