Fighting image spam

Posted Aug 24, 2006 1:17 UTC (Thu) by dskoll (subscriber, #1630)
Parent article: Fighting image spam

gocr is the right way to go, but "fuzzy" matching rules are not. It's better just to feed gocr's output into your Bayes database and let the statistical algorithm lock on to the words.

You'll quickly find that "INVESToR", "Indudry", "accomplishmems" and
so on become incredibly strong indicators of spamminess. gocr's common
mistakes actually make it *easier* for Bayes to lock on.
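
A rough sketch of why this works, assuming a simple Graham-style per-token score with Laplace smoothing. The sample messages, the function names and the smoothing constants are all made up for illustration; this is not how any particular filter stores its Bayes database, and no gocr call is shown (the spam strings just stand in for its output).

```python
from collections import Counter

def tokenize(text):
    # Case-sensitive split, so OCR oddities like "INVESToR" stay distinct tokens.
    return text.split()

def train(messages):
    """messages: iterable of (text, is_spam) pairs.

    Returns per-class token counts (one count per message containing the
    token) and per-class message totals.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        totals[is_spam] += 1
        counts[is_spam].update(set(tokenize(text)))
    return counts, totals

def token_spam_probability(token, counts, totals):
    # Laplace-smoothed P(token | class), combined assuming equal priors.
    p_spam = (counts[True][token] + 1) / (totals[True] + 2)
    p_ham = (counts[False][token] + 1) / (totals[False] + 2)
    return p_spam / (p_spam + p_ham)

# Hypothetical training data: the spam entries imitate OCR output from image spam.
SAMPLE = [
    ("INVESToR alert Indudry news", True),
    ("INVESToR accomplishmems report", True),
    ("Indudry INVESToR update", True),
    ("investor meeting notes", False),
    ("industry report attached", False),
    ("meeting update tomorrow", False),
]

counts, totals = train(SAMPLE)

if __name__ == "__main__":
    for tok in ("INVESToR", "Indudry", "report", "investor"):
        print(tok, round(token_spam_probability(tok, counts, totals), 3))
```

The mangled tokens never show up in ordinary mail, so their score is pushed toward spam, while a word like "report" that appears on both sides stays neutral.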


Fighting image spam

Posted Aug 24, 2006 1:25 UTC (Thu) by flewellyn (subscriber, #5047)

I agree. Especially since there are legitimate uses for image attachments in email.

Fighting image spam

Posted Aug 24, 2006 2:18 UTC (Thu) by Ross (subscriber, #4065)

Do spammers use the same misspellings over and over? If not (and they aren't that stupid), there is an incredibly large number of combinations for each word, especially once you throw in "looks like" numbers, symbols, and alternate letters. The spammy words would never already be in the database.
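
For a back-of-the-envelope feel of how fast the combinations grow, here is a small sketch. The lookalike table is invented for illustration and is far smaller than what a determined spammer could use; the count is simply the product of the number of alternatives at each letter position.

```python
from itertools import product
from math import prod

# Hypothetical lookalike substitutions; each list includes the letter itself.
LOOKALIKES = {
    "i": ["i", "1", "l", "!"],
    "o": ["o", "0"],
    "e": ["e", "3"],
    "s": ["s", "5", "$"],
    "t": ["t", "7"],
}

def alternatives(ch):
    # Letters without an entry in the table have only themselves as an option.
    return LOOKALIKES.get(ch, [ch])

def count_variants(word):
    # Number of spellings = product of the choices available at each position.
    return prod(len(alternatives(ch)) for ch in word)

def variants(word):
    # Enumerate every spelling; only sensible for short words.
    for combo in product(*(alternatives(ch) for ch in word)):
        yield "".join(combo)

if __name__ == "__main__":
    print(count_variants("investor"))
    print(list(variants("investor"))[:5])
```

With this tiny table "investor" already has 4 * 2 * 3 * 2 * 2 = 96 spellings, and that is before mixed case or extra punctuation is considered.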

Fighting image spam

Posted Aug 24, 2006 9:28 UTC (Thu) by nix (subscriber, #2304)

The point is that *gocr* makes the same mistakes over and over, acting as, in effect, a 'this word came from an image' tag. :)
