Spam avoidance techniques
Most spam filtering work has involved two techniques: testing mail against patterns indicative of spam and blocking mail from known sources of spam (and other likely sources, such as ISP dialup lines). Source-based blocking can be effective, but it also tends to block a fair amount of legitimate mail along with the spam. For example, some blacklists cause the blocking of mail from kernel.org, despite the fact that no spam originates there. Source-based blocking is unreliable enough that quite a few sites are unwilling to use it, despite a strong desire to be rid of spam.
Pattern matching has shown more promise. Early spam filtering was done with complex procmail scripts, but the current champion of pattern-based spam filtering can only be SpamAssassin. Using a detailed set of rules, SpamAssassin cleans out the trash to great effect. LWN has been using it for some months, and it has made life much easier - lwn@lwn.net gets a lot of spam. SpamAssassin has given us back much of our time to work on LWN, and has kept us from accidentally deleting mail from readers that tended to get buried in the spam.
One thing that SpamAssassin users tend to notice, however, is that its effectiveness decreases over time. Each new update blocks more spam - a recent upgrade freed us from a whole unpleasant class of Nigeria spam, for example. But pattern-based matching only works as well as its patterns, and they tend to go stale as spammers move on to new tricks. Keeping SpamAssassin effective requires a number of highly dedicated people to actually read all that spam and come up with new rules. Most SpamAssassin users are unlikely to be able (or willing) to write new rules themselves.
Recently, a new approach to spam filtering has attracted a lot of attention, thanks mostly to Paul Graham's paper A Plan for Spam. Rather than try to come up with an endless stream of clever patterns to detect spam, why not just look at the words spammers use? Each word can be assigned a probability that any message that contains it is spam; the probabilities for the words in any specific message are then combined using a Bayesian filter, yielding an overall probability estimate. If that estimate is high enough, the message is classified as spam.
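As a rough sketch of that combining step - with made-up word probabilities, not the exact formula any particular filter uses - the idea looks like this:

```python
# Sketch of combining per-word spam probabilities into one estimate,
# in the spirit of Graham's paper. The word probabilities below are
# invented for illustration, not taken from any real corpus.
from functools import reduce

def combine(probs):
    """Bayesian combination of independent per-word spam probabilities."""
    prod = reduce(lambda a, b: a * b, probs)
    inv = reduce(lambda a, b: a * b, (1 - p for p in probs))
    return prod / (prod + inv)

def classify(words, word_prob, threshold=0.9, default=0.4):
    # Words never seen in training get a neutral-ish default probability.
    probs = [word_prob.get(w, default) for w in words]
    score = combine(probs)
    return score, score > threshold

# Hypothetical probabilities a trained filter might have assigned:
word_prob = {"viagra": 0.99, "mortgage": 0.95, "kernel": 0.01, "patch": 0.02}
score, is_spam = classify(["viagra", "mortgage"], word_prob)
```

A message full of high-probability words pushes the combined estimate toward 1; a message of low-probability words pushes it toward 0.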
At first glance, going up against a tool as good as SpamAssassin with such a simple technique seems like a losing battle, but this approach has a number of advantages:
- Development of the word-based rules can be automated - it is
just a matter of feeding the filter enough spam and "ham" (legitimate
mail) and letting it work out a probability factor for each word.
- The filter can be made to follow shifting patterns in spam by
passing it each message that it misclassifies. Users cannot be
expected to master regular expressions and write patterns, but they
can be asked to hit a "this is spam" key in their mailer.
- Each user's spam filter comes to reflect the mail that the user
receives. Spam seems like the ultimate in indiscriminate marketing,
but the fact is that different people can receive very different spam.
An individually derived rule base should prove more effective than a
"one size fits all" set of patterns.
- Classification of mail with a Bayesian filter can be done relatively quickly.
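The automated training in the first point can be sketched simply: count how often each word appears in sorted spam and ham, and turn the counts into a probability. This is a minimal illustration of the idea only - real filters such as bogofilter add smoothing, minimum-count thresholds, and ham weighting:

```python
# Minimal sketch of automated training: derive a per-word spam
# probability from word frequencies in pre-sorted spam and ham corpora.
# (Illustrative only; real filters refine this considerably.)
from collections import Counter

def train(spam_msgs, ham_msgs):
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts[word] / n_spam   # frequency in spam
        h = ham_counts[word] / n_ham     # frequency in ham
        probs[word] = s / (s + h)
    return probs

# Tiny made-up corpora:
probs = train(["buy cheap pills now", "cheap mortgage rates"],
              ["kernel patch attached", "cheap laptop for hacking"])
```

Words seen only in spam come out near 1, words seen only in ham near 0, and words seen in both land in between - which is exactly what lets the filter adapt as misclassified messages are fed back in.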
All of the above is irrelevant, however, if the Bayesian approach does not succeed in actually filtering spam. To get a sense for the state of the art, we took 3000 messages received at lwn@lwn.net - a little under two weeks' worth. 295 of those messages were real mail, and 2705 were spam. If one were to believe the bulk of our mail, one would conclude that nearly every part of our anatomy (even those we don't possess) is the wrong size, that we are so honest that people want to extract money from Africa via our bank account, that we're missing out on numerous hot stocks, that we have a strange attraction to domesticated animals, and that the purchase of something called the "TushyClean" would greatly improve our lives. Trust us, this exercise has not been fun, but no sacrifice is too great for our readers.
Once the messages were sorted, we fed them all to SpamAssassin and to bogofilter, a new Bayesian filter written by Eric Raymond. Bogofilter was tested twice: once after training with 15% of the 3000 messages, and once after being trained with the whole set. Then we ran both filters on 5000 recent postings from the linux-kernel list, twelve of which were spam (devfs flames were not counted). The results were:
| Filter | False positives | False negatives | Run time (seconds) |
|---|---|---|---|
| -- 3000 lwn@lwn.net messages -- | | | |
| SpamAssassin | 2 | 250 | 11,900 |
| Bogofilter (15%) | 0 | 517 | 108 |
| Bogofilter (100%) | 0 | 94 | 134 |
| -- 5000 linux-kernel messages -- | | | |
| SpamAssassin | 0 | 6 | 19,600 |
| Bogofilter | 0 | 4 | 251 |
False positives are legitimate mail classified as spam. These, of course, are bad news, since they can cause the loss of real mail. False negatives are spam that slip through - an annoyance. It is appropriate that spam filters tend to err toward false negatives, and both filters shown here do exactly that.
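Expressed as rates, the lwn@lwn.net figures in the table work out as follows (a quick back-of-the-envelope check on the numbers above):

```python
# Error rates from the lwn@lwn.net test set described above:
# 2705 spam messages and 295 real ("ham") messages.
spam_total, ham_total = 2705, 295

fn_rate_sa = 250 / spam_total    # SpamAssassin false-negative rate
fn_rate_bogo = 94 / spam_total   # bogofilter (100% training)
fp_rate_sa = 2 / ham_total       # SpamAssassin false-positive rate

print(f"SpamAssassin misses {fn_rate_sa:.1%} of spam")
print(f"bogofilter (100%) misses {fn_rate_bogo:.1%} of spam")
print(f"SpamAssassin flags {fp_rate_sa:.1%} of real mail")
```

In other words, the fully trained bogofilter let through about 3.5% of the spam to SpamAssassin's 9.2%, while neither filter lost more than a fraction of a percent of real mail.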
The results indicate that bogofilter requires a substantial amount of training before it reaches the level of effectiveness achieved by SpamAssassin. This training is best done with each individual user's mail, but most users are unlikely to have a few thousand nicely sorted messages sitting around to train their filters with. So bogofilter is likely to be frustrating for many users to adopt - it won't work well until the user has run "about one thousand" (according to Eric Raymond) messages through it.
That said, bogofilter is surprisingly effective for a tool that is so new and very much still in development. And the run time relative to SpamAssassin speaks for itself. Much of that difference is explained by the fact that bogofilter is coded in C, while SpamAssassin is in Perl. But bogofilter also owes its speed to a much faster algorithm.
The Bayesian filter idea is not new - see this 1998
paper on the Microsoft site, for example. But recently a great deal of
effort has gone into expressing this approach in free software. Bogofilter
is one example; another is the spambayes
project, which has been set up as a testbed for variants on the Bayesian
filter idea. It will be interesting to see where these projects go; they
seem to be off to a promising start. Taking on a tool as effective as
SpamAssassin is a difficult challenge, but the free software world likes
challenges.
Idea for increasing effectiveness
Posted Sep 12, 2002 3:25 UTC (Thu) by Strike (guest, #861)

Maybe I'm crazy, but I don't see why you can't simply daisy-chain the two together to provide even better results. This way you can tweak SpamAssassin to a good enough target score that won't produce false positives (I've found that an aggregate score of 8 or so without changing any of the test scores does a fine job, though it does miss a few), and then the mails that have gone to great enough lengths to assure that all the header tests, MX tests, subject tests, and content (such as MIME type) tests that SpamAssassin does don't pump up the score very high will be subject to the Bayesian approach as well. This way, spam mails that are clever enough to pass one filter but not the other will be tossed aside.

Idea for increasing effectiveness
Posted Sep 12, 2002 22:54 UTC (Thu) by gswoods (subscriber, #37)

I am curious about the legal issues. I personally am not a lawyer, but when I have taken tutorials at conferences on Internet legal issues, I have been warned repeatedly about content filtering. SpamAssassin and the Bayesian filters are content filtering, because they examine the content of the message itself and filter based on that. This is fine for the end user to do, but if you do it as an organization, you are potentially opening yourself up to a big liability. Remember the Prodigy case? The ruling there was essentially that, since they were doing content filtering, they were liable for whatever *did* get through. So if you use SpamAssassin on the organization's mail server, and one of your employees gets a kiddie porn spam in spite of that and is offended by it, you could be sued.

We have started using IP blacklist filters here. This is safer from the legal point of view, because the content of the message itself is never examined. The message is rejected before it is ever sent. Our blacklist filters have a lot of false negatives, but the problem with false positives has been nearly nonexistent. Also, I used to get hundreds of bounced spams every day, and the number has dropped to nearly zero since we started filtering. I think IP blacklists still have their place.

Sued ?!?
Posted Sep 13, 2002 10:16 UTC (Fri) by job (guest, #670)

It sounds like the problem is that you don't live in a free country, rather than this being a problem with content filtering.

Sued ?!?
Posted Sep 13, 2002 16:36 UTC (Fri) by gswoods (subscriber, #37)

Yeah, right. And just where is this 'free' country? US law has lots of problems, but so does anyplace else. Besides, this is a gratuitous anti-American comment. There's no need for that here. We all have valid reasons for living where we do. I think it would be stupid to move to another country just so that I could be free to filter content for an entire organization. And it's just as easy to argue that content filtering restricts the freedom of employees to use the Internet. I'm not saying I agree with that argument, but you have to be very careful when glibly tossing around the word 'free'.

Lastly, content filtering is not illegal even here. It's just that if you filter, then you're responsible for what gets through your filters. Personally, I would agree that's silly, but that's what the courts have ruled.

Sued ?!?
Posted Sep 19, 2002 11:08 UTC (Thu) by job (guest, #670)

Oh, I had no idea you lived in the US. It was not an anti-American comment at all, just pro-free society, no matter where on Earth it may be. Don't take it personally.

Spam avoidance techniques
Posted Sep 12, 2002 4:12 UTC (Thu) by dwheeler (guest, #1216)

Obviously, a lot of people only learned about this technique from Paul Graham's plan for spam (a well-written piece!). The LWN study shown here is wonderful confirmation that it has value. It's worth noting that there have been other studies on the topic, including "An evaluation of Naive Bayesian anti-spam filtering", "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages", and "Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach", along with information from lsi.upc.es and monmouth.edu; Slashdot has carried a discussion about it.

Ifile implemented the idea many years ago - it claims a first release date of Aug 3 20:49:01 EDT 1996 - and the author doesn't claim that this program is the first implementation of the idea, either.

A selected set from the newsgroup news.admin.net-abuse.sightings might be useful for initial training of a spam filter. That would eliminate the problem you mention.

I think every email reader should have a "big SPAM button" that adds an email to the "spam" folder (so it can be used for future analysis), as well as other configurable actions. See http://www.dwheeler.com/essays/stopspam.html for more information about this.

It's certainly reasonable to combine multiple anti-spam techniques; in fact, a lot of people do exactly this.

Spam avoidance techniques
Posted Sep 12, 2002 5:35 UTC (Thu) by fcrozat (subscriber, #175)

Since SpamAssassin is written in Perl, when you use it through procmail, a new Perl interpreter is started for each message. This is the main cause of the high "run time" figure.

To prevent this from happening, you should use the spamc/spamd tools which are shipped with SpamAssassin: spamd is a daemon which keeps a SpamAssassin process in memory, fixing the startup latency problem. Spamc is a client which connects to spamd and can be used instead of spamassassin in procmail rules (replace "spamassassin -P" with "spamc"). You should try it; the run time will probably be more reasonable.

Spam avoidance techniques
Posted Sep 12, 2002 7:36 UTC (Thu) by Dom2 (guest, #458)

I do this and it makes SpamAssassin usable over my dialup. At this point, most of my runtime costs are DNS lookups from spamd.

-Dom

spamc/spamd
Posted Sep 12, 2002 14:28 UTC (Thu) by corbet (editor, #1)

You know, mentioning (if not trying) the spamc/spamd pair was on my list as I put the article together, but somehow in the excitement of ordering all those bulk email lists I dropped it. It's really true that reading your spam rots the brain.

I just ran the 5000-message linux-kernel test using spamd. The filtering results were the same, of course, and the run time dropped to 2400 seconds. That's a big speed improvement, but still an order of magnitude slower than bogofilter.

Another paper
Posted Sep 12, 2002 12:21 UTC (Thu) by armijn (subscriber, #3653)

At the SANE 2002 conference in the Netherlands a paper was presented about a self-learning, content-based spam filter. According to it you can get quite good results: http://www.nluug.nl/events/sane2002/papers.html. Instead of Bayesian learning it uses the k-nearest neighbours algorithm.

Author of ifile - the original intelligent e-mail filter
Posted Sep 12, 2002 13:17 UTC (Thu) by jrennie (guest, #3655)

FYI, k-nearest neighbors (kNN) is very slow compared to filtering by rules or Bayesian approaches (like Graham describes, bogofilter and ifile). For each message you want filtered, kNN compares that message to all messages in the training database. So, filtering n messages is O(nm) (where m is the number of training messages). Bayesian approaches scale as O(n).

There are lots of machine learning techniques which do have reasonable test run-times (the training run-times can be quite high for some, though). I'm sure we'll hear about spam filters based on other techniques over the coming years, and hopefully not based on only words or even combinations of words. (Razor may be going in this direction with its fuzzy matching techniques, sucking up swaths of text rather than individual words.)

Jason Rennie

machine learning techniques
Posted Sep 13, 2002 13:05 UTC (Fri) by robertb (guest, #3673)

On a different subject, it's surprising that there's been no mention of "white list keywords". I think this can be an effective technique, particularly on the individual level. (Eventually, using Bayesian techniques such as ifile, these may be able to be generated automatically and then pruned by the individual as necessary.)

<begin plug>
See this page for lots of spam fighting techniques/ideas.
<end plug>

Here we go re-inventing the ego-wheel again
Posted Sep 12, 2002 13:55 UTC (Thu) by garym (guest, #251)

Remember the Freshmeat editorial on how to pick an open source project? The basic sarcastic rule was "pick something already done, and do it the same way." Bogofilter may not be mature (and the less said about buffer-overrun and similar failures in its sibling software the better), but ifile is. It's probably unwise to chastise a net god, and even if Eric has concerns that ifile is not proper, I can't command him - but I'd rather he just contributed to the pre-existing software, standing on shoulders instead of standing on the toes of others. Ifile is all ready, coded and deployed; contributor packages include scripts for folding it in with procmail or any of about half a dozen email readers; and, of course, it's free software.

(statistically) biased tests?
Posted Sep 12, 2002 17:49 UTC (Thu) by bockman (guest, #3650)

For the little I know of statistical filters, if you train a filter with a set of data, then you should not use the same set of data to evaluate how good the trained filter is (since you are testing on the training data, the filter obviously shows good results). A better test would maybe be to train the filter with half of the data set and then test it with the other half.

(statistically) biased tests?
Posted Sep 12, 2002 18:26 UTC (Thu) by corbet (editor, #1)

"A better test would maybe be to train the filter with half of the data set and then test it with the other half."

That was the first (15%) test, essentially. And the linux-kernel test too.

(statistically) biased tests?
Posted Sep 13, 2002 21:01 UTC (Fri) by ElMiguel (guest, #741)

But the numbers most people will remember from this article are the ones with 100% of the lwn@lwn.net messages, since they are the ones showing the most striking advantage in favour of bogofilter. And, as bockman says, that is the least realistic test case of all, since you previously optimized the filter for precisely that set of messages. Perhaps you should make a note in the article itself to warn people who don't read the comments of that circumstance?

(Other than that, and overlooking spamc/spamd, great articles, as always :-)).

Spam avoidance techniques
Posted Sep 12, 2002 19:05 UTC (Thu) by bitbytebit (guest, #3664)

A program I have developed in C which uses SpamAssassin and many other spam blocking techniques combined (it can even run SpamAssassin or bogofilter) is BlackHole, available at BlackHole on Freshmeat. It also combines free virus checking or Sophos/McAfee/Trend Micro checking, and many more techniques of filtering email; hopefully it can help too.

Thanks,
Chris
getdown@groovy.org

Spam avoidance techniques
Posted Sep 12, 2002 19:07 UTC (Thu) by bitbytebit (guest, #3664)

That's actually BlackHole on Freshmeat.

Bayesian Spam avoidance possible hole ?
Posted Sep 18, 2002 6:29 UTC (Wed) by guybar (guest, #798)

It seems a bit stupid to ask, but what's stopping the spammers from attaching a random scientific/financial/other serious article after the actual spam? Wouldn't this defeat the Bayesian techniques described?

How does online shopping work
Posted Nov 4, 2002 1:31 UTC (Mon) by mcisaac (guest, #7442)

Great article! I'm worried that routine activities such as online shopping might be difficult with this approach. In the "definition of spam" section, the paper touches on what is and is not spam, referencing a merchant receipt as an example of something commercial that isn't spam. My question is, does the receipt pass the Bayesian filter or get flagged as spam?

Bayesian Spam avoidance possible hole ?
Posted Apr 15, 2003 4:27 UTC (Tue) by mattknox (guest, #10640)

Actually, this would not work all that well, unless the spammers chose a document that is in your field. If you get a lot of mail about scripting languages or kernel development, then an article about one of these topics might help spam get through (at least a few times). However, if a random article that you would not normally receive in the mail was attached, it would do nothing, because the terms would look neither like spam nor like ham. So the only way for spammers to win with this strategy is to find words that are found in mail that goes to a lot of people and have not appeared in spam yet. This will be, at best, an uphill battle for them.

Bayesian Spam avoidance possible hole ?
Posted Oct 2, 2004 12:58 UTC (Sat) by jerry (guest, #25162)

Theoretically, it will not defeat the Bayesian filter, but in reality it does affect the filter, especially considering the impact on speed and memory usage. Personally, I think designing an algorithm that "forgets" rarely used tokens may be tedious or costly/impractical.

The approach we used in a real-world implementation, SpamWeed, is to mix the Bayesian filter with other technologies, especially those that can extract useful information from among spammers' decoys, which significantly increases the Bayesian filter's efficiency and stability.

Jerry: Engineer, SpamWeed.com

Spam avoidance techniques
Posted Dec 2, 2002 23:17 UTC (Mon) by Waldo (guest, #8343)

Hi,

filter training must be done with two sets of mails: archives of spam and of good mail. Someone has to classify all mail into one of these. This might be done automatically with SpamAssassin or by each user. If the sysadmin does this job he/she has to read someone's mail, and this is against the law, at least in my country. The next problem is the archive itself. This is stored private data from different individuals, which is the next criminal act. It is not allowed to store private data - not the spam mail, but the love letter for your colleague. So the user has to classify, and this does not guarantee a better result.

Greetings from Europe

Spam avoidance techniques
Posted May 4, 2004 9:47 UTC (Tue) by RobbyRDG (guest, #21287)

The spam filter that I use, Spam Bully, has multiple filtering techniques like a friends/spammers list, blocking email by country/language, blocking certain words or phrases, and RBL integration. It also allows you to see detailed information about each email you receive - IP address, country, character set, and how Spam Bully ranked it - and tells you why a message was or was not blocked and how to correct this in the future.

But it always blows me away how incredibly accurate a pure Bayesian approach is. All those other methods add complexity to the process and no measurable improvement in accuracy, except when you first start training your filter.

Spam avoidance techniques
Posted Feb 8, 2005 2:06 UTC (Tue) by patrickcurrier (guest, #27757)

By the way, there is great info about spam filter stuff at http://www.spamfilternews.com

Spam avoidance techniques
Posted Apr 19, 2005 23:44 UTC (Tue) by patrickcurrier (guest, #27757)

Sorry, I meant to add the actual link --- Spam Filter - http://www.spamfilternews.com

Spam avoidance techniques
Posted Aug 15, 2006 12:36 UTC (Tue) by andycreats (guest, #39905)

I can recommend a good Outlook spam filter, Spam Reader. I've used it for about 6 months and am very satisfied. I also tried Spam Bully before, and it worked fine, but there was one thing I didn't like - it's much too complicated in its options. I needed a very simple but effective spam filter, and Spam Reader is the best in this category. If you need a fully customizable spam filter then of course you should use Spam Bully, but be ready to spend about 3-4 hours reading its help file :)

Spam avoidance techniques
Posted Jul 15, 2009 19:46 UTC (Wed) by patrickcurrier (guest, #27757)

Wow, time flies. Here's a more updated list of spam filters: http://www.theyellowlists.com/TYL/Spam_Filters.html