Spam avoidance techniques
Most spam filtering work has involved two techniques: testing mail against patterns indicative of spam and blocking mail from known sources of spam (and other likely sources, such as ISP dialup lines). Source-based blocking can be effective, but it also tends to block a fair amount of legitimate mail along with the spam. For example, some blacklists cause the blocking of mail from kernel.org, despite the fact that no spam originates there. Source-based blocking is unreliable enough that quite a few sites are unwilling to use it, despite a strong desire to be rid of spam.
Pattern matching has shown more promise. Early spam filtering was done with complex procmail scripts, but the current champion of pattern-based spam filtering can only be SpamAssassin. Using a detailed set of rules, SpamAssassin cleans out the trash to great effect. LWN has been using it for some months, and it has made life much easier - lwn@lwn.net gets a lot of spam. SpamAssassin has given us back much of our time to work on LWN, and has kept us from accidentally deleting mail from readers that tended to get buried in the spam.
One thing that SpamAssassin users tend to notice, however, is that its effectiveness decreases over time. Each new update blocks more spam - a recent upgrade freed us from a whole unpleasant class of Nigeria spam, for example. But pattern-based matching only works as well as its patterns, and they tend to go stale as spammers move on to new tricks. Keeping SpamAssassin effective requires a number of highly dedicated people to actually read all that spam and come up with new rules. Most SpamAssassin users are unlikely to be able (or willing) to write new rules themselves.
Recently, a new approach to spam filtering has attracted a lot of attention, thanks mostly to Paul Graham's paper A Plan for Spam. Rather than try to come up with an endless stream of clever patterns to detect spam, why not just look at the words spammers use? Each word can be assigned a probability that any message that contains it is spam; the probabilities for the words in any specific message are then combined using a Bayesian filter, yielding an overall probability estimate. If that estimate is high enough, the message is classified as spam.
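As a rough sketch of that combining step - with made-up word probabilities, not the exact formula any particular filter uses - the idea looks like this:

```python
# Sketch of combining per-word spam probabilities into one estimate,
# in the spirit of Graham's paper. The word probabilities below are
# invented for illustration, not taken from any real corpus.
from functools import reduce

def combine(probs):
    """Bayesian combination of independent per-word spam probabilities."""
    prod = reduce(lambda a, b: a * b, probs)
    inv = reduce(lambda a, b: a * b, (1 - p for p in probs))
    return prod / (prod + inv)

def classify(words, word_prob, threshold=0.9, default=0.4):
    # Words never seen in training get a neutral-ish default probability.
    probs = [word_prob.get(w, default) for w in words]
    score = combine(probs)
    return score, score > threshold

# Hypothetical probabilities a trained filter might have assigned:
word_prob = {"viagra": 0.99, "mortgage": 0.95, "kernel": 0.01, "patch": 0.02}
score, is_spam = classify(["viagra", "mortgage"], word_prob)
```

A message full of high-probability words pushes the combined estimate toward 1; a message of low-probability words pushes it toward 0.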
At first glance, going up against a tool as good as SpamAssassin with such a simple technique seems like a losing battle, but this approach has a number of advantages:
- Development of the word-based rules can be automated - it is
just a matter of feeding the filter enough spam and "ham" (legitimate
mail) and letting it work out a probability factor for each word.
- The filter can be made to follow shifting patterns in spam by
passing it each message that it misclassifies. Users cannot be
expected to master regular expressions and write patterns, but they
can be asked to hit a "this is spam" key in their mailer.
- Each user's spam filter comes to reflect the mail that the user
receives. Spam seems like the ultimate in indiscriminate marketing,
but the fact is that different people can receive very different spam.
An individually derived rule base should prove more effective than a
"one size fits all" set of patterns.
- Classification of mail with a Bayesian filter can be done relatively quickly.
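The automated training in the first point can be sketched simply: count how often each word appears in sorted spam and ham, and turn the counts into a probability. This is a minimal illustration of the idea only - real filters such as bogofilter add smoothing, minimum-count thresholds, and ham weighting:

```python
# Minimal sketch of automated training: derive a per-word spam
# probability from word frequencies in pre-sorted spam and ham corpora.
# (Illustrative only; real filters refine this considerably.)
from collections import Counter

def train(spam_msgs, ham_msgs):
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts[word] / n_spam   # frequency in spam
        h = ham_counts[word] / n_ham     # frequency in ham
        probs[word] = s / (s + h)
    return probs

# Tiny made-up corpora:
probs = train(["buy cheap pills now", "cheap mortgage rates"],
              ["kernel patch attached", "cheap laptop for hacking"])
```

Words seen only in spam come out near 1, words seen only in ham near 0, and words seen in both land in between - which is exactly what lets the filter adapt as misclassified messages are fed back in.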
All of the above is irrelevant, however, if the Bayesian approach does not succeed in actually filtering spam. To get a sense for the state of the art, we took 3000 messages received at lwn@lwn.net - a little under two weeks' worth. 295 of those messages were real mail, and 2705 were spam. If one were to believe the bulk of our mail, one would conclude that nearly every part of our anatomy (even those we don't possess) is the wrong size, that we are so honest that people want to extract money from Africa via our bank account, that we're missing out on numerous hot stocks, that we have a strange attraction to domesticated animals, and that the purchase of something called the "TushyClean" would greatly improve our lives. Trust us, this exercise has not been fun, but no sacrifice is too great for our readers.
Once the messages were sorted, we fed them all to SpamAssassin and to bogofilter, a new Bayesian filter written by Eric Raymond. Bogofilter was tested twice: once after training with 15% of the 3000 messages, and once after being trained with the whole set. Then we ran both filters on 5000 recent postings from the linux-kernel list, twelve of which were spam (devfs flames were not counted). The results were:
| Filter | False positives | False negatives | Run time (seconds) |
|---|---|---|---|
| -- 3000 lwn@lwn.net messages -- | | | |
| SpamAssassin | 2 | 250 | 11,900 |
| Bogofilter (15%) | 0 | 517 | 108 |
| Bogofilter (100%) | 0 | 94 | 134 |
| -- 5000 linux-kernel messages -- | | | |
| SpamAssassin | 0 | 6 | 19,600 |
| Bogofilter | 0 | 4 | 251 |
False positives are legitimate mail classified as spam. These, of course, are bad news, since they can cause the loss of real mail. False negatives are spam that slip through - an annoyance. It is appropriate that spam filters tend to err toward false negatives, and both filters shown here do exactly that.
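Expressed as rates, the lwn@lwn.net figures in the table work out as follows (a quick back-of-the-envelope check on the numbers above):

```python
# Error rates from the lwn@lwn.net test set described above:
# 2705 spam messages and 295 real ("ham") messages.
spam_total, ham_total = 2705, 295

fn_rate_sa = 250 / spam_total    # SpamAssassin false-negative rate
fn_rate_bogo = 94 / spam_total   # bogofilter (100% training)
fp_rate_sa = 2 / ham_total       # SpamAssassin false-positive rate

print(f"SpamAssassin misses {fn_rate_sa:.1%} of spam")
print(f"bogofilter (100%) misses {fn_rate_bogo:.1%} of spam")
print(f"SpamAssassin flags {fp_rate_sa:.1%} of real mail")
```

In other words, the fully trained bogofilter let through about 3.5% of the spam to SpamAssassin's 9.2%, while neither filter lost more than a fraction of a percent of real mail.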
The results indicate that bogofilter requires a substantial amount of training before it reaches the level of effectiveness achieved by SpamAssassin. This training is best done with each individual user's mail, but most users are unlikely to have a few thousand nicely sorted messages sitting around to train their filters with. So bogofilter is likely to be frustrating for many users to adopt - it won't work well until the user has run "about one thousand" (according to Eric Raymond) messages through it.
That said, bogofilter is surprisingly effective for a tool that is so new and very much still in development. And the run time relative to SpamAssassin speaks for itself. Much of that difference is explained by the fact that bogofilter is coded in C, while SpamAssassin is in Perl. But bogofilter also owes its speed to a much faster algorithm.
The Bayesian filter idea is not new - see this 1998
paper on the Microsoft site, for example. But recently a great deal of
effort has gone into expressing this approach in free software. Bogofilter
is one example; another is the spambayes
project, which has been set up as a testbed for variants on the Bayesian
filter idea. It will be interesting to see where these projects go; they
seem to be off to a promising start. Taking on a tool as effective as
SpamAssassin is a difficult challenge, but the free software world likes
challenges.
Idea for increasing effectiveness
Posted Sep 12, 2002 3:25 UTC (Thu) by Strike (guest, #861)

Maybe I'm crazy, but I don't see why you can't simply daisy-chain the two together to provide even better results. This way you can tweak SpamAssassin to a good enough target score that won't produce false positives (I've found that an aggregate score of 8 or so without changing any of the test scores does a fine job, though it does miss a few), and then the mails that have gone to great enough lengths to assure that all the header tests, MX tests, subject tests, and content (such as MIME type) tests that SpamAssassin does don't pump up the score very high will be subject to the Bayesian approach as well. This way, spam mails that are clever enough to pass one filter but not the other will be tossed aside.

Idea for increasing effectiveness
Posted Sep 12, 2002 22:54 UTC (Thu) by gswoods (subscriber, #37)

I am curious about the legal issues. I personally am not a lawyer, but when I have taken tutorials at conferences on Internet legal issues, I have been warned repeatedly about content filtering. SpamAssassin and the Bayesian filters are content filtering, because they examine the content of the message itself and filter based on that. This is fine for the end user to do, but if you do it as an organization, you are potentially opening yourself up to a big liability. Remember the Prodigy case? The ruling there was essentially that, since they were doing content filtering, they were liable for whatever *did* get through. So if you use SpamAssassin on the organization's mail server, and one of your employees gets a kiddie porn spam in spite of that and is offended by it, you could be sued.

We have started using IP blacklist filters here. This is safer from the legal point of view, because the content of the message itself is never examined. The message is rejected before it is ever sent. Our blacklist filters have a lot of false negatives, but the problem with false positives has been nearly nonexistent. Also, I used to get hundreds of bounced spams every day, and the number has dropped to nearly zero since we started filtering. I think IP blacklists still have their place.

Sued ?!?
Posted Sep 13, 2002 10:16 UTC (Fri) by job (guest, #670)

It sounds like the problem is that you don't live in a free country, rather than this being a problem with content filtering.

Sued ?!?
Posted Sep 13, 2002 16:36 UTC (Fri) by gswoods (subscriber, #37)

Yeah, right. And just where is this 'free' country? US law has lots of problems, but so does anyplace else. Besides, this is a gratuitous anti-American comment. There's no need for that here. We all have valid reasons for living where we do. I think it would be stupid to move to another country just so that I could be free to filter content for an entire organization. And it's just as easy to argue that content filtering restricts the freedom of employees to use the Internet. I'm not saying I agree with that argument, but you have to be very careful when glibly tossing around the word 'free'.

Lastly, content filtering is not illegal even here. It's just that if you filter, then you're responsible for what gets through your filters. Personally, I would agree that's silly, but that's what the courts have ruled.

Sued ?!?
Posted Sep 19, 2002 11:08 UTC (Thu) by job (guest, #670)

Oh, I had no idea you lived in the US. It was not an anti-American comment at all, just pro-free society, no matter where on Earth it may be. Don't take it personally.

Spam avoidance techniques
Posted Sep 12, 2002 4:12 UTC (Thu) by dwheeler (guest, #1216)

Obviously, a lot of people only learned about this technique from Paul Graham's plan for spam (a well-written piece!). The LWN study shown here is wonderful confirmation that it has value. It's worth noting that there have been other studies on the topic, including "An evaluation of Naive Bayesian anti-spam filtering", "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages", and "Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach", along with information from lsi.upc.es and monmouth.edu; Slashdot has carried a discussion about it.

Ifile implemented the idea many years ago - it claims a first release date of Aug 3 20:49:01 EDT 1996 - and the author doesn't claim that this program is the first implementation of the idea, either.

A selected set from the newsgroup news.admin.net-abuse.sightings might be useful for initial training of a spam filter. That would eliminate the problem you mention.

I think every email reader should have a "big SPAM button" that adds an email to the "spam" folder (so it can be used for future analysis), as well as other configurable actions. See http://www.dwheeler.com/essays/stopspam.html for more information about this.

It's certainly reasonable to combine multiple anti-spam techniques; in fact, a lot of people do exactly this.

Spam avoidance techniques
Posted Sep 12, 2002 5:35 UTC (Thu) by fcrozat (subscriber, #175)

Since SpamAssassin is written in Perl, when you use it through procmail, a new Perl interpreter is started for each message. This is the main cause of the high "run time" figure.

To prevent this from happening, you should use the spamc/spamd tools which are shipped with SpamAssassin: spamd is a daemon which keeps a SpamAssassin process in memory, fixing the startup latency problem. Spamc is a client which connects to spamd and can be used instead of spamassassin in procmail rules (replace "spamassassin -P" with "spamc"). You should try it; the run time will probably be more reasonable.

Spam avoidance techniques
Posted Sep 12, 2002 7:36 UTC (Thu) by Dom2 (guest, #458)

I do this and it makes SpamAssassin usable over my dialup. At this point, most of my runtime costs are DNS lookups from spamd.

-Dom

spamc/spamd
Posted Sep 12, 2002 14:28 UTC (Thu) by corbet (editor, #1)

You know, mentioning (if not trying) the spamc/spamd pair was on my list as I put the article together, but somehow in the excitement of ordering all those bulk email lists I dropped it. It's really true that reading your spam rots the brain.

I just ran the 5000-message linux-kernel test using spamd. The filtering results were the same, of course, and the run time dropped to 2400 seconds. That's a big speed improvement, but still an order of magnitude slower than bogofilter.

Another paper
Posted Sep 12, 2002 12:21 UTC (Thu) by armijn (subscriber, #3653)

At the SANE 2002 conference in the Netherlands a paper was presented about a self-learning, content-based spam filter. According to it you can get quite good results: http://www.nluug.nl/events/sane2002/papers.html. Instead of Bayesian learning it uses the k-nearest neighbours algorithm.

Author of ifile - the original intelligent e-mail filter
Posted Sep 12, 2002 13:17 UTC (Thu) by jrennie (guest, #3655)

FYI, k-nearest neighbors (kNN) is very slow compared to filtering by rules or Bayesian approaches (like Graham describes, bogofilter and ifile). For each message you want filtered, kNN compares that message to all messages in the training database. So, filtering n messages is O(nm) (where m is the number of training messages). Bayesian approaches scale as O(n).

There are lots of machine learning techniques which do have reasonable test run-times (the training run-times can be quite high for some, though). I'm sure we'll hear about spam filters based on other techniques over the coming years, and hopefully not based on only words or even combinations of words. (Razor may be going in this direction with its fuzzy matching techniques, sucking up swaths of text rather than individual words.)

Jason Rennie

machine learning techniques
Posted Sep 13, 2002 13:05 UTC (Fri) by robertb (guest, #3673)

On a different subject, it's surprising that there's been no mention of "white list keywords". I think this can be an effective technique, particularly on the individual level. (Eventually, using Bayesian techniques such as ifile, these may be able to be generated automatically and then pruned by the individual as necessary.)

<begin plug>
See this page for lots of spam fighting techniques/ideas.
<end plug>

Here we go re-inventing the ego-wheel again
Posted Sep 12, 2002 13:55 UTC (Thu) by garym (guest, #251)

Remember the Freshmeat editorial on how to pick an open source project? The basic sarcastic rule was "pick something already done, and do it the same way." Bogofilter may not be mature (and the less said about buffer-overrun and similar failures in its sibling software the better), but ifile is. It's probably unwise to chastise a net god, and even if Eric has concerns that ifile is not proper, I can't command him - but I'd rather he just contributed to the pre-existing software, standing on shoulders instead of standing on the toes of others. Ifile is all ready, coded and deployed; contributor packages include scripts for folding it in with procmail or any of about half a dozen email readers; and, of course, it's free software.

(statistically) biased tests?
Posted Sep 12, 2002 17:49 UTC (Thu) by bockman (guest, #3650)

For the little I know of statistical filters, if you train a filter with a set of data, then you should not use the same set of data to evaluate how good the trained filter is (since you are testing on the training data, the filter obviously shows good results). A better test would maybe be to train the filter with half of the data set and then test it with the other half.

(statistically) biased tests?
Posted Sep 12, 2002 18:26 UTC (Thu) by corbet (editor, #1)

"A better test would maybe be to train the filter with half of the data set and then test it with the other half."

That was the first (15%) test, essentially. And the linux-kernel test too.

(statistically) biased tests?
Posted Sep 13, 2002 21:01 UTC (Fri) by ElMiguel (guest, #741)

But the numbers most people will remember from this article are the ones with 100% of the lwn@lwn.net messages, since they are the ones showing the most striking advantage in favour of bogofilter. And, as bockman says, that is the least realistic test case of all, since you previously optimized the filter for precisely that set of messages. Perhaps you should make a note in the article itself to warn people who don't read the comments of that circumstance?

(Other than that, and overlooking spamc/spamd, great articles, as always :-)).

Spam avoidance techniques
Posted Sep 12, 2002 19:05 UTC (Thu) by bitbytebit (guest, #3664)

A program I have developed in C which uses SpamAssassin and many other spam blocking techniques combined (it can even run SpamAssassin or bogofilter) is BlackHole, available at BlackHole on Freshmeat. It also combines free virus checking or Sophos/McAfee/Trend Micro checking, and many more techniques of filtering email; hopefully it can help too.

Thanks,
Chris
getdown@groovy.org

Spam avoidance techniques
Posted Sep 12, 2002 19:07 UTC (Thu) by bitbytebit (guest, #3664)

That's actually BlackHole on Freshmeat.

Bayesian Spam avoidance possible hole ?
Posted Sep 18, 2002 6:29 UTC (Wed) by guybar (guest, #798)

It seems a bit stupid to ask, but what's stopping the spammers from attaching a random scientific/financial/other serious article after the actual spam? Wouldn't this defeat the Bayesian techniques described?

How does online shopping work
Posted Nov 4, 2002 1:31 UTC (Mon) by mcisaac (guest, #7442)

Great article! I'm worried that routine activities such as online shopping might be difficult with this approach. In the "definition of spam" section, the paper touches on what is and is not spam, referencing a merchant receipt as an example of something commercial that isn't spam. My question is, does the receipt pass the Bayesian filter or get flagged as spam?

Bayesian Spam avoidance possible hole ?
Posted Apr 15, 2003 4:27 UTC (Tue) by mattknox (guest, #10640)

Actually, this would not work all that well, unless the spammers chose a document that is in your field. If you get a lot of mail about scripting languages or kernel development, then an article about one of these topics might help spam get through (at least a few times). However, if a random article that you would not normally receive in the mail was attached, it would do nothing, because the terms would look neither like spam nor like ham. So the only way for spammers to win with this strategy is to find words that are found in mail that goes to a lot of people and have not appeared in spam yet. This will be, at best, an uphill battle for them.

Bayesian Spam avoidance possible hole ?
Posted Oct 2, 2004 12:58 UTC (Sat) by jerry (guest, #25162)

Theoretically, it will not defeat the Bayesian filter, but in reality it does affect the filter, especially considering the impact on speed and memory usage. Personally, I think designing an algorithm that "forgets" rarely used tokens may be tedious or costly/impractical.

The approach we used in a real-world implementation, SpamWeed, is to mix the Bayesian filter with other technologies, especially those that can extract useful information from among spammers' decoys, which significantly increases the Bayesian filter's efficiency and stability.

Jerry: Engineer, SpamWeed.com

Spam avoidance techniques
Posted Dec 2, 2002 23:17 UTC (Mon) by Waldo (guest, #8343)

Hi,

filter training must be done with two sets of mails: archives of spam and of good mail. Someone has to classify all mail into one of these. This might be done automatically with SpamAssassin or by each user. If the sysadmin does this job he/she has to read someone's mail, and this is against the law, at least in my country. The next problem is the archive itself. This is stored private data from different individuals, which is the next criminal act. It is not allowed to store private data - not the spam mail, but the love letter for your colleague. So the user has to classify, and this does not guarantee a better result.

Greetings from Europe

Spam avoidance techniques
Posted May 4, 2004 9:47 UTC (Tue) by RobbyRDG (guest, #21287)

The spam filter that I use, Spam Bully, has multiple filtering techniques like a friends/spammers list, blocking email by country/language, blocking certain words or phrases, and RBL integration. It also allows you to see detailed information about each email you receive - IP address, country, character set, and how Spam Bully ranked it - and tells you why a message was or was not blocked and how to correct this in the future.

But it always blows me away how incredibly accurate a pure Bayesian approach is. All those other methods add complexity to the process and no measurable improvement in accuracy, except when you first start training your filter.

Spam avoidance techniques
Posted Feb 8, 2005 2:06 UTC (Tue) by patrickcurrier (guest, #27757)

By the way, there is great info about spam filter stuff at http://www.spamfilternews.com

Spam avoidance techniques
Posted Apr 19, 2005 23:44 UTC (Tue) by patrickcurrier (guest, #27757)

Sorry, I meant to add the actual link --- Spam Filter - http://www.spamfilternews.com

Spam avoidance techniques
Posted Aug 15, 2006 12:36 UTC (Tue) by andycreats (guest, #39905)

I can recommend a good Outlook spam filter, Spam Reader. I've used it for about 6 months and am very satisfied. I also tried Spam Bully before, and it worked fine, but there was one thing I didn't like - it's much too complicated in its options. I needed a very simple but effective spam filter, and Spam Reader is the best in this category. If you need a fully customizable spam filter then of course you should use Spam Bully, but be ready to spend about 3-4 hours reading its help file :)

Spam avoidance techniques
Posted Jul 15, 2009 19:46 UTC (Wed) by patrickcurrier (guest, #27757)

Wow, time flies. Here's a more updated list of spam filters: http://www.theyellowlists.com/TYL/Spam_Filters.html