Fighting image spam

Posted Sep 1, 2006 23:14 UTC (Fri) by decoder (guest, #40285)
Parent article: Fighting image spam

Hello all, I am the developer who wrote the FuzzyOcr plugin, and I just wanted to respond to some of the points I've read here:

- Using the text recognized by OCR for bayes/other SA rules/etc:

Most gocr output can be really bad and completely unsuitable for the standard SA rules. You would also feed more junk than useful words to bayes (the output often contains a lot of junk), and no, gocr does not always make the same mistakes on a given word. The result depends on colors, fonts, size, etc., and that will make bayes ineffective.

- Can we keep up with spammers obfuscating their images?

Yes and no. On the one hand, spammers will always find a way around our detection methods, which is why we try to keep the internals secret. That won't work forever, so spammers will evolve. On the other hand, we have to keep one thing in mind: these pictures are still ads. Ads are supposed to be appealing, and more obfuscation means less appeal. Comparing this to captchas does not fit, because a captcha has no such purpose, and a captcha can hardly be as appealing as a normal ad.

- It takes up many resources:

Yes, it does. OCR will always take up a lot of resources. But FuzzyOcr already has many improvements that save actual OCR passes. For example, the new image hash system filters out up to 40% of the image spam (depending on what spam you get) once an image has been recognized, hence saving OCR runtime.
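To illustrate the idea of skipping OCR for images that were already recognized once, here is a minimal Python sketch. It is not FuzzyOcr's actual hash: the coarse_signature function, the quantization into 4 levels per channel, and the check_image helper are all assumptions made up for this example. Pixels are passed in as a plain list of (r, g, b) tuples so the sketch needs no image library.

```python
from collections import Counter


def coarse_signature(width, height, pixels, levels=4, top_n=3):
    """Hypothetical noise-tolerant signature: size plus dominant colours.

    pixels is a flat list of (r, g, b) tuples with values 0-255.
    """
    step = 256 // levels
    # Quantize each channel so small per-pixel noise lands in the same bucket.
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = max(len(pixels), 1)
    # Keep the most common colour buckets and their share in tenths.
    dominant = sorted(buckets.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    shares = tuple((color, round(10 * n / total)) for color, n in dominant)
    return (width, height, shares)


def check_image(width, height, pixels, known_spam, run_ocr):
    """Return True if the image is spam, running OCR only when needed.

    known_spam is a set of signatures; run_ocr is any callable that
    takes the pixels and returns True for spam (the expensive step).
    """
    sig = coarse_signature(width, height, pixels)
    if sig in known_spam:
        return True  # seen before: no OCR pass needed
    is_spam = run_ocr(pixels)
    if is_spam:
        known_spam.add(sig)  # remember it for the next copy
    return is_spam
```

The first copy of a spam image still costs one OCR pass; later copies with the same coarse signature are caught by the set lookup alone. A signature this crude would also collide on unrelated images, so a real system needs something more careful.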

- Legitimate uses of image attachments:

Yes, they exist. And FuzzyOcr is actually the friendliest plugin towards people who send images in their mails. Most images sent in ham are nowhere near being detected by FuzzyOcr. The only class of images that can cause false positives is screenshots. But with the new word threshold tweaking, you can eliminate almost all "true" false positives (those caused by too fuzzy matching or too common words). However, if the screenshot really contains spam words, it will most likely still produce a false positive.
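As a rough illustration of what a per-word threshold changes, here is a small Python sketch of fuzzy word matching. It is an assumption-laden example, not FuzzyOcr's code: the edit_distance and count_spam_hits names, the use of plain Levenshtein distance, and the idea of expressing the threshold as a fraction of the word length are all made up for this sketch.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def count_spam_hits(ocr_text, spam_words):
    """Count spam words that fuzzily match some token of the OCR output.

    spam_words maps each word to its own threshold: the allowed number
    of errors is that fraction of the word's length, rounded down.
    """
    tokens = ocr_text.lower().split()
    hits = 0
    for word, threshold in spam_words.items():
        allowed = int(len(word) * threshold)
        if any(edit_distance(word, tok) <= allowed for tok in tokens):
            hits += 1
    return hits
```

With a threshold of 0.25, "viagra" still matches an OCR misread such as "v1agra", but a short common word like "stock" then also matches "stack" in a screenshot. Setting the threshold for "stock" to 0.0 requires an exact match and removes that hit, which is the kind of false positive the tweaking is aimed at.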

- Greylisting > *.

I agree that greylisting stops a lot of spam. But what era are you living in? Our university has greylisting in place, and we still get TONS of spam. Greylisting worked fine for quite some time, but the latest stats show that its effectiveness has decreased heavily.

- Does usable free OCR software exist?

Yes, it does... If you aren't satisfied with gocr, you should try tesseract. This OCR engine was once commercial and was released just today. It will be included in FuzzyOcr in approximately 8-9 weeks as an experimental feature.

- No need to actually read the text, text means spam:

See the screenshot false positive situation. For most situations though, you are right.

- Spammers don't care about OCR:

Wrong... see http://users.own-hero.net/~decoder/forgiving26.gif just as an example. They actually do care and they are not dumb at all.

So, what do I recommend as a general anti-spam recipe for the future, you ask?

In my opinion, it would already help A LOT if something like DKIM were used by everyone. We are far away from that, but if everyone used it, it would be the first step towards a more controlled spam flow. It would surely not stop spam, but it would change how spam is sent: it would have to come from valid domains, which makes spammers easier to track and block and costs them money, with everything that follows from that.
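To make the "valid domains" point a bit more concrete: a DKIM signature header is a list of tag=value pairs, where d= names the signing domain and s= names a selector, and the public key is looked up in DNS under selector._domainkey.domain. The Python sketch below only parses those tags; it does not verify any signature, and the function names and the example header values are made up for illustration.

```python
def parse_dkim_tags(header_value):
    """Split a DKIM-Signature header value into a dict of its tags."""
    tags = {}
    for part in header_value.split(";"):
        if "=" not in part:
            continue  # skip empty trailing pieces
        name, _, value = part.partition("=")
        # Drop folding whitespace that may appear inside values.
        tags[name.strip()] = "".join(value.split())
    return tags


def key_record_name(tags):
    """DNS name where the signer's public key is published."""
    return f"{tags['s']}._domainkey.{tags['d']}"


# Hypothetical header value, shortened for the example.
example = "v=1; a=rsa-sha256; d=example.org; s=mail2006;\n bh=abc=; b=xyz"
tags = parse_dkim_tags(example)
print(tags["d"], key_record_name(tags))
```

The point is that every signed message names a domain that has to exist and publish a key, so that domain can be tracked and blocked.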

My 2 cents :)

Chris