Two OCR packages for Linux compared (LinuxWorld)

[Posted March 4, 2003 by ris]

In this article on LinuxWorld.com Joe Barr compares two Optical Character Recognition packages -- one is free software, the other proprietary. "In the legal and medical fields, document management is a very big deal. In modern office environments, OCR often plays a key role in solving that problem. Because OCR for Linux is one area I don't hear or read a lot about, I decided to do some digging and see what I could find. This week, I'll tell you about two solutions I found: one from the free-software camp and one proprietary application."

Two OCR packages for Linux compared (LinuxWorld)

Posted Mar 5, 2003 1:09 UTC (Wed) by rjamestaylor (guest, #339) [Link] (1 responses)

Interesting article. I used GOCR in the summer of 2001 to process an image that I received via a FAX service, which emailed a TIFF to me. I used fetchmail and a Perl module to process the MIME attachments, a command-line image converter, and the Perl module Imager to grab, convert and slice the fax into preset pieces, all to obtain a routing code for web delivery. This was for routing prescriptions to the correct browsers in sundry locations. Not a simple task, and a wrong delivery would have had HIPAA implications. GOCR, along with the rest of the toolset, performed marvellously for my needs.

Of course now the data is sent by the originator as an XML stream via SOAP. But the proof of concept required the fax delivery in order to obtain the green light for the remainder of the project. It was a lot of fun!

That's when I realized that almost(*) everything I need I can find as Open Source.

(*) except an ERP solution. Compiere, while I wish the project best of luck, didn't seem viable for my needs when I last looked.

ERP

Posted Mar 6, 2003 8:24 UTC (Thu) by yodermk (subscriber, #3803) [Link]

Have you seen GNU Enterprise? http://www.gnue.org

It's still in development, but getting cooler by the day. Some people use it in production. I know it fully intends to be a complete ERP solution. And it doesn't depend on any non-Free software -- IIRC Compiere does.

Two OCR packages for Linux compared (LinuxWorld)

Posted Mar 5, 2003 2:15 UTC (Wed) by proski (subscriber, #104) [Link]

From the article:

"Don't get me wrong; I'm a big fan of open source and believe that, in the end, it will become the dominant genre in many areas (read: operating systems, for sure). I am not bad-mouthing Gocr in the least when I point out that OCR Shop is clearly superior in terms of speed and accuracy."

I'm tired of such "disclaimers". Writing an honest review doesn't make one an advocate of free software or commercial software. That should be obvious, and such comments should not be required.

I think that if the review had favored the free application, such a disclaimer ("I'm not bad-mouthing OCR Shop" in this case) would not have appeared, regardless of which site published the review.

I'm worried whether the reviewer was expecting angry e-mails from open source zealots and tried to avoid them. Or maybe that paragraph was addressed to the editors of linuxworld.com to make them look at the review more favorably?

Either way, I'm worried about "open source, closed minds" syndrome. It is damaging to the perception of free software by those who don't use it yet but read such reviews occasionally.

Two OCR packages for Linux compared (LinuxWorld)

Posted Mar 5, 2003 12:05 UTC (Wed) by csigler (subscriber, #1224) [Link] (2 responses)

Hi,

This is the most comprehensive list of OCR resources I've found:

http://www.humboldt1.com/~jiva/ocr/ocr_resource.htm

I've tried GOCR/JOCR and Clara and been fairly disappointed with both. From my reading of the TODO list (out of date) on the jocr.sourceforge.net website, GOCR/JOCR is very difficult to "train." A direct quote:

"You have to take such a example and look at the engine, why the error occurs, and find a good way to fix the problem."

The times I've tried GOCR/JOCR it gets maybe 50-90% recognition. Perhaps I'm doing something wrong? I'll try some better, higher DPI scans again soon.

Clara is the exact opposite. It comes without any training at all. You must tune it to your book or document, and the user interface seems _extremely_ sophisticated. However, from what I've read in the Clara FAQs the training must be done manually. If an automatic training interface were written for Clara, IMHO it could be quite powerful.

That being said, I've been disappointed with Clara not recognizing the same letter written multiple times in the same font style and size. Some letters it gets right -- it identifies all identical letters after training only one. On others it never recognizes the same letter twice and you have to train all the letters separately. This somewhat defeats the purpose of OCR, I'm thinking :^P

I should probably say I've spent only a little time looking at GOCR/JOCR and Clara because I'm afraid my initial impressions have been rather negative. Neither seems to me to be close to beta quality. I wish I could report better experiences, but I'm afraid I can't. I wish I had time to hack on these projects! But OCR is just something to experiment with in my spare time.

Clemmitt Sigler

Two OCR packages for Linux compared (LinuxWorld)

Posted Mar 6, 2003 15:32 UTC (Thu) by csigler (subscriber, #1224) [Link] (1 responses)

(Bad form, following up myself, eh? ;^)

I've retried GOCR/JOCR and Clara using high resolution scans, and the results from GOCR/JOCR are pretty good :^) I had been using low resolution screen captures, not realizing how important high definition was in getting OCR working (duh!).

I used 600 dpi scans of three pages of plain ASCII text, in letter-sized portrait format using a Times font. GOCR/JOCR did really well, with >98% correct recognition. (I think one problem with OCR is that more than one error per page makes the OCR package very time-consuming. GOCR/JOCR did have more than one error per page, but I was impressed with its performance :^)
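A recognition rate like the one above can be made repeatable by comparing the OCR output against the known source text. The sketch below is only one possible way to do that, not how the figure in this comment was obtained: it assumes accuracy is defined as one minus the character-level edit distance divided by the length of the reference text. The function names `edit_distance` and `character_accuracy` are made up for illustration.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: fewest single-character edits between two strings."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the recognized text.
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Fraction of the reference text recovered correctly (assumed definition)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))
```

As a purely hypothetical illustration, a page of 3000 characters with 60 character errors scores 98% by this measure, which shows why a high percentage can still mean a lot of manual correction per page.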

I tested Clara by scanning the same three pages, but I also created a very simple "training" page using the exact same size/style font. I typed in the upper and lower case alphabets and the numbers 0-9 four times each, split across multiple lines, twice with the letters separated and twice not separated by spaces. I ran Clara with the `-y 600' option and trained it with the "training" page, then I ran it on each of the three pages of plain text. Unfortunately, Clara only identified eight letters and no numbers after this training procedure :^(
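For anyone who wants to repeat this experiment, the text of a training page like the one described can be generated rather than typed by hand. This is a rough sketch under my own assumptions about the layout (one line per character class, spaced copies first, then unspaced copies); the function name `build_training_page` and its parameters are hypothetical, and the result still has to be printed in the target font and scanned.

```python
import string


def build_training_page(copies_spaced: int = 2, copies_packed: int = 2) -> str:
    """Return training text: upper case, lower case and digits, repeated."""
    groups = [string.ascii_uppercase, string.ascii_lowercase, string.digits]
    lines = []
    # Copies with a space between every character.
    for _ in range(copies_spaced):
        lines.extend(" ".join(group) for group in groups)
    # Copies with the characters run together.
    for _ in range(copies_packed):
        lines.extend(groups)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_training_page(), end="")
```

With the defaults, each letter and digit appears four times, twice separated by spaces and twice not.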

I must be doing something wrong with Clara, but I'm not sure what.... I didn't take the time to tune Clara's options, however, and this may be what's needed to optimize its performance.

Clemmitt Sigler

Two OCR packages for Linux compared (LinuxWorld)

Posted Apr 21, 2003 23:29 UTC (Mon) by tony_the_tech (guest, #10804) [Link]

I have also tinkered with both G/JOCR and Clara, and I find similar issues. The more you "train" GOCR, the more confused it gets. Clara needs to be trained, and sometimes it just doesn't seem to get it. I must have typed in about a hundred zeros for the numbers I'm digitizing, and the software won't recognize any of them, even if the symbol is the *exact* same pixel pattern...