Fighting image spam
Posted Aug 24, 2006 1:17 UTC (Thu) by dskoll (subscriber, #1630)
Parent article: Fighting image spam
gocr is the right way to go, but "fuzzy" matching rules are the wrong way to go. It's better just to feed the output of gocr into your Bayes database and let the statistical algorithm lock on to the words.
You'll quickly find that "INVESToR", "Indudry", "accomplishmems" and
so on become incredibly strong indicators of spamminess. Such misreadings
almost never show up in legitimate mail, so gocr's common mistakes
actually make it *easier* for Bayes to lock on.
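
To make the idea concrete, here is a minimal sketch of what "feed the OCR output into Bayes" could look like. It is not any particular filter's implementation: the `TinyBayes` class, its method names, and the sample strings are all made up for illustration, and the text passed in is assumed to be whatever gocr printed for the image. The one point that matters is that the tokenizer leaves case and misspellings alone, so the OCR mistakes survive as tokens of their own.

```python
import math
import re
from collections import Counter


def tokenize(ocr_text):
    # Deliberately no lowercasing and no spell-fixing: the OCR
    # mistakes themselves are the signal we want Bayes to learn.
    return re.findall(r"[A-Za-z0-9$]+", ocr_text)


class TinyBayes:
    """Hypothetical toy classifier, assuming equal spam/ham priors."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, ocr_text, label):
        # Count each token once per message (presence, not frequency).
        self.counts[label].update(set(tokenize(ocr_text)))
        self.docs[label] += 1

    def token_spam_prob(self, token):
        # Laplace-smoothed fraction of messages containing the token.
        s = (self.counts["spam"][token] + 1) / (self.docs["spam"] + 2)
        h = (self.counts["ham"][token] + 1) / (self.docs["ham"] + 2)
        return s / (s + h)

    def spam_score(self, ocr_text):
        # Naive Bayes: add up per-token log-odds, then map to 0..1.
        log_odds = 0.0
        for tok in set(tokenize(ocr_text)):
            p = self.token_spam_prob(tok)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # avoid exp overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative training data only; real input would be OCR output.
clf = TinyBayes()
clf.train("INVESToR ALERT Indudry news", "spam")
clf.train("accomplishmems of the Indudry INVESToR", "spam")
clf.train("investor meeting notes for the industry", "ham")
clf.train("lunch on Thursday", "ham")
```

With this toy data, `clf.token_spam_prob("INVESToR")` comes out well above 0.5 while `clf.token_spam_prob("investor")` comes out below it, which is the effect described above: the garbled spelling is a cleaner spam marker than the correct word would be.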