There is no need to actually OCR the image

Posted Aug 26, 2006 4:43 UTC (Sat) by spitzak (subscriber, #4593)
Parent article: Fighting image spam

There is no need to actually read the text in the image. All that is needed is a way to detect that the image *is* text. Since there is no legitimate reason to send text as an image, this will indicate spam. I'm pretty sure it is far easier to determine with 99% certainty that an image contains a lot of text than it is to determine that the image's text contains a certain word.
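
To make the idea concrete, here is a minimal sketch of one way to guess that an image is mostly text without reading it. It is not a method taken from the comment or the parent article: the function name `looks_like_text`, the row-banding heuristic, and every threshold are assumptions chosen for illustration. It takes a grayscale image as a list of rows of 0..255 values, and checks for several horizontal bands of "ink" separated by blank rows, where most inked rows flip between dark and light many times, as lines of glyphs tend to.

```python
def looks_like_text(pixels, ink_threshold=128, min_lines=3,
                    min_transitions=4, min_busy_fraction=0.5):
    """Guess whether a grayscale image (list of rows, 0 = black) is mostly text.

    All thresholds are illustrative assumptions, not tuned values.
    """
    row_has_ink = []
    busy_rows = 0
    for row in pixels:
        # Binarize the row: True where the pixel is dark enough to be ink.
        bits = [value < ink_threshold for value in row]
        # Count dark/light flips along the row; glyph strokes cause many.
        transitions = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
        row_has_ink.append(any(bits))
        if transitions >= min_transitions:
            busy_rows += 1

    ink_rows = sum(row_has_ink)
    if ink_rows == 0:
        return False  # blank image

    # Count bands: runs of inked rows separated by at least one blank row.
    bands = 0
    previous = False
    for has_ink in row_has_ink:
        if has_ink and not previous:
            bands += 1
        previous = has_ink

    return bands >= min_lines and busy_rows / ink_rows >= min_busy_fraction
```

A photograph or a solid block of color would normally fail the band test, while a page of rendered text would pass it. Spammers who add noise or tilt the text would defeat something this simple, so treat it as a starting point only.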


There is no need to actually OCR the image

Posted Aug 28, 2006 17:43 UTC (Mon) by tack (subscriber, #12542)

I occasionally receive scanned newspaper articles that may be of interest to me.

I prefer the approach of OCRing the image and filtering the result through a Bayesian classifier. One might be able to use some of the techniques described here as an optional first step to determine whether the image likely contains a lot of text, and only then OCR it, which would help with the CPU overhead.
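
The pipeline described above could be sketched roughly as follows. Everything named here is hypothetical: `is_text_heavy` and `ocr` are callables the caller must supply (no particular OCR library is assumed), `NaiveBayes` is a toy word-count classifier with add-one smoothing rather than any real spam filter, and the 0.9 threshold is arbitrary. The point is only the ordering: the cheap check runs first, and the expensive OCR step is skipped when it fails.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy two-class naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior, then smoothed per-word likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Normalize in log space to avoid underflow on long texts.
        top = max(log_score.values())
        weights = {k: math.exp(v - top) for k, v in log_score.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


def image_is_spam(image, is_text_heavy, ocr, classifier, threshold=0.9):
    """Run OCR only if the cheap text check passes, then classify the text."""
    if not is_text_heavy(image):
        return False  # assumption: non-text images are left to other filters
    text = ocr(image)
    return classifier.spam_probability(text) >= threshold
```

With this arrangement a scanned newspaper article would still be OCRed, but it would be judged on its words rather than rejected merely for being a picture of text.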

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds