
Two OCR packages for Linux compared (LinuxWorld)

Posted Mar 5, 2003 12:05 UTC (Wed) by csigler (subscriber, #1224)
Parent article: Two OCR packages for Linux compared (LinuxWorld)

Hi,

This is the most comprehensive list of OCR resources I've found:

http://www.humboldt1.com/~jiva/ocr/ocr_resource.htm

I've tried GOCR/JOCR and Clara and been fairly disappointed with both. From my reading of the (out-of-date) TODO list on the jocr.sourceforge.net website, GOCR/JOCR is very difficult to "train." A direct quote:

"You have to take such a example and look at the engine, why the error occurs, and find a good way to fix the problem."

The times I've tried GOCR/JOCR it gets maybe 50-90% recognition. Perhaps I'm doing something wrong? I'll try some better, higher DPI scans again soon.

Clara is the exact opposite. It comes without any training at all. You must tune it to your book or document, and the user interface seems _extremely_ sophisticated. However, from what I've read in the Clara FAQs the training must be done manually. If an automatic training interface were written for Clara, IMHO it could be quite powerful.

That being said, I've been disappointed with Clara not recognizing the same letter written multiple times in the same font style and size. Some letters it gets right -- it identifies all identical letters after training only one. On others it never recognizes the same letter twice and you have to train all the letters separately. This somewhat defeats the purpose of OCR, I'm thinking :^P

I should probably say I've spent only a little time looking at GOCR/JOCR and Clara because I'm afraid my initial impressions have been rather negative. Neither seems to me to be close to beta quality. I wish I could report better experiences, but I'm afraid I can't. I wish I had time to hack on these projects! But OCR is just something to experiment with in my spare time.

Clemmitt Sigler



Two OCR packages for Linux compared (LinuxWorld)

Posted Mar 6, 2003 15:32 UTC (Thu) by csigler (subscriber, #1224)

(Bad form, following up myself, eh? ;^)

I've retried GOCR/JOCR and Clara using high resolution scans, and the results from GOCR/JOCR are pretty good :^) I had been using low resolution screen captures, not realizing how important high resolution is in getting OCR to work (duh!).
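To put rough numbers on why resolution matters, here is a minimal sketch of how many pixels tall a glyph is at a given resolution. The 12 pt font size and the 96 dpi figure for a screen capture are illustrative assumptions, not values from my tests; the only fact used is that one point is 1/72 inch.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_height_px(font_size_pt: float, dpi: float) -> float:
    """Approximate height in pixels of a font's em box at the given resolution."""
    return font_size_pt * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    # Assumed values: a 12 pt font, a ~96 dpi screen capture, a 600 dpi scan.
    for dpi in (96, 600):
        print(f"{dpi} dpi: about {glyph_height_px(12, dpi):.0f} px per em")
```

Under those assumptions a character gets only about 16 pixels of height in a screen capture but about 100 in a high resolution scan, so the recognizer has far more detail to work with.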

I used 600 dpi scans of three pages of plain ASCII text, in letter-sized portrait format using a Times font. GOCR/JOCR did really well, with >98% correct recognition. (I think one problem with OCR is that more than one error per page makes using an OCR package very time-consuming. GOCR/JOCR did have more than one error per page, but I was impressed with its performance :^)
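For anyone who wants to measure recognition rates rather than eyeball them, here is a minimal sketch of one common way to do it: character accuracy based on edit distance between the known text and the OCR output. This is not necessarily how the percentages above were arrived at; the function names and the 3,000 characters per page figure are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Fraction of ground-truth characters recognized correctly (0.0 to 1.0)."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def expected_errors(chars_per_page: int, accuracy: float) -> int:
    """Rough number of character errors per page at a given accuracy."""
    return round(chars_per_page * (1.0 - accuracy))


if __name__ == "__main__":
    print(char_accuracy("hello world", "hel1o world"))
    # Assumed page size of 3,000 characters, purely for illustration.
    print(expected_errors(3000, 0.98))
```

The second function shows why even a high percentage still means proofreading: with the assumed page size, 98% accuracy leaves dozens of characters to fix on every page.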

I tested Clara by scanning the same three pages, but I also created a very simple "training" page using the exact same size/style font. I typed in the upper and lower case alphabets and the numbers 0-9 four times each, split across multiple lines, twice with the letters separated and twice not separated by spaces. I ran Clara with the `-y 600' option and trained it with the "training" page, then I ran it on each of the three pages of plain text. Unfortunately, Clara only identified eight letters and no numbers after this training procedure :^(
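The text of a "training" page like the one described above can be generated with a short script instead of typed by hand. This is only a sketch of the layout (each character four times, twice space-separated and twice run together); the function name and the exact line ordering are my own choices, and printing it in the matching font is still a manual step.

```python
import string


def training_page_lines(repeats_each: int = 2) -> list[str]:
    """Build lines containing A-Z, a-z and 0-9, first spaced, then unspaced."""
    groups = [string.ascii_uppercase, string.ascii_lowercase, string.digits]
    lines = []
    # Characters separated by spaces, so each glyph stands alone.
    for _ in range(repeats_each):
        for group in groups:
            lines.append(" ".join(group))
    # Characters run together, as they appear in normal text.
    for _ in range(repeats_each):
        for group in groups:
            lines.append(group)
    return lines


if __name__ == "__main__":
    print("\n".join(training_page_lines()))
```

With the default of two repeats for each layout, every letter and digit appears four times in total.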

I must be doing something wrong with Clara, but I'm not sure what.... I didn't take the time to tune Clara's options, however, and this may be what's needed to optimize its performance.

Clemmitt Sigler

Two OCR packages for Linux compared (LinuxWorld)

Posted Apr 21, 2003 23:29 UTC (Mon) by tony_the_tech (guest, #10804)

I have also tinkered with both GOCR/JOCR and Clara, and I find similar issues. The more you "train" GOCR, the more confused it gets. Clara needs to be trained, and sometimes it just doesn't seem to get it. I must have typed in about a hundred zeros for the numbers I'm digitizing, and the software won't recognize any of them, even if the symbol is the *exact* same pixel pattern...


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds