The Grumpy Editor's guide to bayesian spam filters
This article is part of the LWN Grumpy Editor series.
In recent years, much of the development energy in anti-spam circles has gone into bayesian filters. The bayesian approach was kicked off in 2002 with the publication of Paul Graham's A plan for spam. In its simplest form, a bayesian filter keeps track of words found in email messages and, for each word, a count of how many times that word appeared in spam and in legitimate mail ("ham"). Over time, these statistics can be used to examine incoming mail and come up with a probability that each is spam.
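The bookkeeping described above can be sketched in a few lines of Python. This is only a minimal illustration using add-one smoothing and a naive independence assumption; it is not the algorithm from Graham's paper or from any of the filters reviewed here, and the class and method names are invented for the example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy word-count filter (hypothetical; not any reviewed filter's algorithm)."""

    def __init__(self):
        self.spam_words = Counter()  # word -> number of spam messages containing it
        self.ham_words = Counter()   # word -> number of ham messages containing it
        self.spam_total = 0          # spam messages seen in training
        self.ham_total = 0           # ham messages seen in training

    @staticmethod
    def tokens(text):
        # Unique lowercase words; real filters tokenize far more carefully.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, is_spam):
        words = self.tokens(text)
        if is_spam:
            self.spam_words.update(words)
            self.spam_total += 1
        else:
            self.ham_words.update(words)
            self.ham_total += 1

    def spam_probability(self, text):
        # Start from the (smoothed) prior odds, then add one log-likelihood
        # ratio per word, treating the words as independent.
        log_odds = math.log((self.spam_total + 1) / (self.ham_total + 1))
        for word in self.tokens(text):
            p_spam = (self.spam_words[word] + 1) / (self.spam_total + 2)
            p_ham = (self.ham_words[word] + 1) / (self.ham_total + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


f = TinyBayesFilter()
f.train("cheap pills buy now", True)
f.train("buy cheap watches now", True)
f.train("meeting agenda for monday", False)
f.train("kernel patch review monday", False)
print(f.spam_probability("buy cheap pills"))
print(f.spam_probability("patch review agenda"))
```

With only four training messages the numbers mean little, but the first probability comes out high and the second low, which is the whole idea.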
Bayesian filters have proved to be surprisingly effective. A well-trained filter can catch a high proportion of incoming spam with a very low false positive rate. The filter will adapt to each user's particular email stream, which is an important feature. It should not be surprising that different people have wildly different legitimate email. It turns out, though, that the spam stream can also vary quite a bit. An account which looks like it could belong to a woman, for example, will tend to get messages offering to alter the sizes of different parts of the recipient's anatomy than a man's account. So the ability to tune a filter to a specific mail stream - ham and spam - will increase its accuracy.
There is quite a large selection of free bayesian filters out there. Your editor decided to have a look around to see if there is any reason to prefer one over the others. To that end, a number of characteristics were examined:
- Accuracy. A filter which does not accurately classify mail will not
be of much use, so good results in this area are required. In
particular, false positives (legitimate mail classified as spam) must
be avoided.
- Training. Bayesian filters must be trained before they become
effective. Some filters, it turns out, are easier to train than
others. In general, the training process is somewhat like
house-training a puppy: it's a painful process, involving contact with
unpleasant materials, and with a messy failure mode. And, somewhere
in the process, something you care about is likely to get chewed up.
So, in general, this is a process which one would like to be done
with quickly and not have to do again later on.
There are people who lovingly tweak and tune their spam filters the way an automobile enthusiast tweaks his car. Your editor is not one of those people. Life is too short - and too busy - to spend a lot of time screwing around with spam filters.
- Speed. The difference in performance between the fastest and slowest
filters covers two orders of magnitude. Since filtering tends to be
done in the background, speed will not be crucially important to all
users. When filtering is being done on a busy mail server, however,
processing speed can matter a lot.
- Ease of integration. How much work is it to hook a filter into the mail stream?
To carry out the tests, your editor collected two piles of mail from his personal stream; one was purely spam, and the other ham. Just over 1,000 messages from each pile were set aside to be used to train the filters. Then, 6,000 hams and 9,000 spams were fed to each filter, with the filter's verdict and processing time being recorded. Each mis-classified message was immediately fed back to the filter to further train it. Multiple runs were made with different parameters, but, in general, your editor resisted the urge to go tweaking knobs. Some of these filters offer a vast number of obscure parameters to set; one can only hope they come with reasonable defaults.
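The test procedure just described (classify each message, record the verdict, and immediately retrain on every mistake) might look something like the following. This is a sketch of the general shape only, not your editor's actual scripts; `run_batches` and the stand-in `MemoFilter` are hypothetical names, and the timing measurements are left out for brevity.

```python
def run_batches(classify, train, messages, batch_size):
    """Feed labelled messages through a filter, retraining on each mistake.

    classify(text) returns True when the filter calls the message spam.
    train(text, is_spam) updates the filter.
    messages is a list of (text, is_spam) pairs.
    Returns one (false_negatives, false_positives) tuple per batch.
    """
    results = []
    for start in range(0, len(messages), batch_size):
        fn = fp = 0
        for text, is_spam in messages[start:start + batch_size]:
            if classify(text) != is_spam:
                if is_spam:
                    fn += 1          # spam that got through
                else:
                    fp += 1          # ham condemned as spam
                train(text, is_spam)  # feed the mistake straight back
        results.append((fn, fp))
    return results


class MemoFilter:
    """Stand-in filter that only remembers the exact messages it was shown."""

    def __init__(self):
        self.known = {}

    def classify(self, text):
        return self.known.get(text, False)  # unknown mail is assumed to be ham

    def train(self, text, is_spam):
        self.known[text] = is_spam


stub = MemoFilter()
mail = [("a", True), ("b", False), ("a", True), ("c", True)]
batch_results = run_batches(stub.classify, stub.train, mail, batch_size=2)
print(batch_results)
```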
As a side note, your editor was surprised and dismayed at how difficult the task of producing pure sets of spam and ham was. The process started with mail sorted by SpamAssassin at delivery time. Your editor then passed over the entire set, twice, reclassifying each message which was in the wrong place. It was only after some early tests started reporting "false positives" that it became clear how much spam still lurked in the "ham" pile. It took more manual passes, and many passes with multiple filters, to clean them all out. The developers who claim that their filters do a better job than a human does are right - when that human is your editor, at least. It also turns out that a few incorrectly classified messages can badly skew the results; bayesian filters are easily confused if you train them badly.
Anyway, the results will be presented in five batches of 1200 hams and 1800 spams. Nothing special was done between these batches; this presentation is intended to show how the filter's behavior evolves as it is trained on more messages. All of the results are also pulled together in a summary table at the end of the article.
Bogofilter
Bogofilter was originally written by Eric Raymond shortly after Paul Graham's article was posted. It has evolved over time, and has picked up a wider community of contributors and maintainers. Bogofilter uses a modified version of the bayesian technique, with a number of knobs to tweak. It is written in C and is quite fast.
Training for bogofilter is somewhat complex; your editor was unable to train it into a stable configuration by feeding it hams and spams directly. The presence of several different training scripts in the source tree's "contrib" directory suggests that others have had to put some work into training as well. In the end, the contributed "trainbogo.sh" script appeared to do a reasonable job, but it required about three runs to get bogofilter into a stable state.
Bogofilter offers two approaches to ongoing training. By default, the filter is not trained by new messages as it classifies them. People who use bogofilter in this mode will set aside bogofilter's mistakes for later training. When the -u option is provided, however, bogofilter will train itself on all messages it feels sufficiently strongly about. Use of -u makes retraining on mistakes even more important, or the filter will become increasingly likely to misclassify mail. In general, training a bayesian filter on its own output must be done with care. It can help the filter to keep up with the spam stream as it evolves, but it also is a positive feedback loop which can go badly wrong if not carefully watched.
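The feedback loop can be illustrated with a small sketch. It assumes a filter which produces a spam score between 0 and 1 and which only learns from messages scoring near either extreme; the cutoff values, the class, and its method names are all hypothetical and are not taken from bogofilter. The `correct` method shows why mistakes must be fed back: a confidently wrong verdict has already been learned, so it has to be unlearned before the right label is registered.

```python
from collections import Counter


class SelfTrainingStore:
    """Word counts with confidence-gated self-training (illustrative only)."""

    def __init__(self, spam_cutoff=0.99, ham_cutoff=0.01):
        self.spam_cutoff = spam_cutoff  # assumed value, not bogofilter's
        self.ham_cutoff = ham_cutoff    # assumed value, not bogofilter's
        self.counts = {True: Counter(), False: Counter()}

    def learn(self, words, is_spam):
        self.counts[is_spam].update(words)

    def unlearn(self, words, is_spam):
        self.counts[is_spam].subtract(words)

    def auto_learn(self, words, score):
        # Train only on messages the filter feels strongly about.
        if score >= self.spam_cutoff:
            self.learn(words, True)
            return True
        if score <= self.ham_cutoff:
            self.learn(words, False)
            return False
        return None  # not confident enough; nothing is learned

    def correct(self, words, wrong_label):
        # Undo the mistaken self-training, then register the right label.
        self.unlearn(words, wrong_label)
        self.learn(words, not wrong_label)


store = SelfTrainingStore()
learned = store.auto_learn(["patch", "kernel"], 0.999)  # confidently wrong
skipped = store.auto_learn(["hello"], 0.5)              # unsure, ignored
store.correct(["patch", "kernel"], wrong_label=True)
```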
Your editor ran bogofilter (v1.01) in both modes (starting with a freshly trained database in each case). Here are the results:
Filter | Batch 1 Fn | Batch 1 Fp | Batch 1 T | Batch 2 Fn | Batch 2 Fp | Batch 2 T | Batch 3 Fn | Batch 3 Fp | Batch 3 T | Batch 4 Fn | Batch 4 Fp | Batch 4 T | Batch 5 Fn | Batch 5 Fp | Batch 5 T | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bogofilter | 141 | 0 | 0.02 | 69 | 0 | 0.01 | 96 | 0 | 0.02 | 48 | 0 | 0.02 | 52 | 0 | 0.02 | 5 |
bogofilter -u | 87 | 0 | 0.05 | 54 | 0 | 0.05 | 41 | 0 | 0.05 | 45 | 0 | 0.06 | 41 | 0 | 0.09 | 32 |
Legend: Fn is the number of false negatives (spam which makes it through the filter); Fp is false positives (legitimate mail misclassified as spam), and T is the processing time (clock, not CPU), in seconds per message. The Size value at the end is the final size of the word database, in MB.
Here we see that bogofilter without the -u option tends toward around 50 missed spams out of a set of 1800 (a 2.8% error rate), but with no false positives at all. If bogofilter trains itself, the false negative rate drops to closer to 2.2%. As we will see, these results are not as good as those of some of the other filters reviewed.
Without self-training, bogofilter requires a roughly constant 0.02 seconds
of time to classify a message; with self-training, that time increases as
the word database grows.
Bogofilter is clearly fast - the fastest of all the filters reviewed here.
One of the ways in which it gains that speed
is to not bother with attachments in the mail. The web page says
"Experience from watching the token streams suggests that spam with
enclosures invariably gives itself away through cues in the headers and
non-enclosure parts."
Bogofilter is intended to be integrated as a simple mail filter, optimally invoked via procmail. It can place a header in processed messages (making life easy for procmail) and also returns the spam status in its exit code (making life easy for grumpy editor testing scripts). Bogofilter has options for dumping out the word database and, for a given message, listing the words which most influenced how that message was classified. Nothing special has been done to make retraining easy; most users will probably create folders of mistakes and occasionally feed them to the filter.
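A testing script can use that exit code along the following lines. The sketch assumes the exit statuses listed in the bogofilter manual page (0 for spam, 1 for non-spam, 2 for unsure, anything else an error) and assumes that bogofilter reads the message from standard input; the helper names are hypothetical, and this is not the script your editor actually used.

```python
import subprocess

# Exit statuses as documented in the bogofilter manual page (assumption:
# the installed version follows the same convention).
VERDICTS = {0: "spam", 1: "ham", 2: "unsure"}


def verdict_from_exit_code(code):
    """Translate a bogofilter exit status into a verdict string."""
    try:
        return VERDICTS[code]
    except KeyError:
        raise RuntimeError(f"bogofilter failed with exit status {code}") from None


def classify_file(path, extra_args=()):
    """Run bogofilter on one message file and return its verdict.

    Requires bogofilter to be installed; extra_args could carry something
    like ["-u"] to enable self-training.
    """
    with open(path, "rb") as message:
        proc = subprocess.run(["bogofilter", *extra_args], stdin=message)
    return verdict_from_exit_code(proc.returncode)
```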
CRM114
An interesting - if intimidating - offering is CRM114, subtitled "the controllable regex mutilator." While the main use of CRM114 appears to be spam filtering, it has a wider scope; it can be trained, for example, to filter interesting lines out of logfiles. According to the project page:
This tool comes with a 275-page book in PDF format, and needs every one of those pages. Setting up CRM114 is not for the faint of heart; it involves the manual creation of database files and the editing of a long configuration file which, while not quite up to sendmail.cf standards, is still one of the more challenging files your editor has encountered in some time. Once all of that is done, however, the "crm" executable can be hooked into a procmail recipe in the usual way without too much trouble.
The CRM114 documentation recommends against any sort of initial training of the filter. The developers are strong believers in the "train on errors" approach, saying that there are "mathematically complicated" reasons why pre-training leads to worse results. For users who don't get the hint, they do provide a way to perform pre-training:
The prospect of hand-decoding binary spam attachments is likely to put off most people who were pondering pre-training their filters - and, one assumes, that is the desired result. Of course, one can also use the normal training commands to feed messages into the system in a pre-training mode, but the documentation doesn't say that.
While filter training can be done on the command line, users can also retrain the filter by forwarding errors back to themselves. The message must be edited to include a training command and password; the developers also recommend removing anything which shouldn't be part of the training. Strangely, users are also told to remove the markup added by CRM114 itself - something which, one would think, could be handled automatically.
Your editor tested the 20060118 "BlameTheReavers" release of CRM114. The first test was done without training, as recommended; then, just to be stubborn, a test was run with a pre-trained filter.
Filter | Batch 1 Fn | Batch 1 Fp | Batch 1 T | Batch 2 Fn | Batch 2 Fp | Batch 2 T | Batch 3 Fn | Batch 3 Fp | Batch 3 T | Batch 4 Fn | Batch 4 Fp | Batch 4 T | Batch 5 Fn | Batch 5 Fp | Batch 5 T | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CRM114 | 1 | 1 | 0.06 | 1 | 1 | 0.06 | 3 | 2 | 0.06 | 4 | 6 | 0.06 | 5 | 6 | 0.07 | 24 |
CRM pretrain | 6 | 2 | 0.07 | 1 | 2 | 0.07 | 5 | 2 | 0.07 | 1 | 2 | 0.07 | 1 | 6 | 0.07 | 24 |
Some things jump out immediately from those numbers. CRM114 is quite fast. It is also quite effective very early on; the first 3000 messages were processed with exactly one false positive and one false negative - starting with an untrained filter. On the other hand, its performance appears to worsen over time, and, in particular, the false positive rate grows in a discouraging way. The false positives varied from vaguely spam-like messages (Netflix updates, for example) to things like kernel patches. Your editor concludes that CRM114 operates as a very aggressive and quick-learning filter, but that it is also relatively unstable.
DSPAM
DSPAM is a GPL-licensed filter written in C. It is clearly aimed at large installations - places with dedicated administrators and, possibly, relatively unsophisticated users. As a result, it has a few features not found in other systems. For example, it has a web interface with statistics, facilities to allow users to manage filter training, and "pretty graphs to dazzle CEOs." Users who don't want to train the filter through a web page can forward mistakes to a special address instead.
There are several ways to hook DSPAM into the mail system, including a command-line filter, a POP proxy, and an SMTP front-end which can be put between the net and the mail delivery agent. There are several choices of backend storage (SQLite, Berkeley DB, PostgreSQL, MySQL, Oracle, and more), and a number of different filtering techniques. The filter can also run in a client-server mode, much like SpamAssassin.
DSPAM is also a package with a dual-license option; companies interested in shipping the software without providing source can purchase a separate license from the developer.
The system is intended to require relatively little maintenance. It has a set of tools, meant to be run from cron, which handle much of the routine housekeeping. Among other things, DSPAM will automatically trim its word list - getting rid of terms which have not been seen for a while and which have little influence on message scoring.
Initial training can be performed using the dspam_train utility; it uses a train-on-errors approach. Thereafter, DSPAM offers several training modes. The recommended "teft" mode trains on every message passing through the system. There is a train-on-errors mode, and a "tum" mode ("train until mature") which emphasizes relatively new and uncommon words. Your editor ran DSPAM (in the standalone, command-line mode) using all three training schemes, with the following results:
Filter | Batch 1 Fn | Batch 1 Fp | Batch 1 T | Batch 2 Fn | Batch 2 Fp | Batch 2 T | Batch 3 Fn | Batch 3 Fp | Batch 3 T | Batch 4 Fn | Batch 4 Fp | Batch 4 T | Batch 5 Fn | Batch 5 Fp | Batch 5 T | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DSPAM teft | 17 | 0 | 0.1 | 17 | 0 | 0.1 | 11 | 0 | 0.1 | 3 | 0 | 0.1 | 7 | 0 | 0.1 | 305 |
DSPAM toe | 23 | 0 | 0.1 | 21 | 3 | 0.1 | 12 | 0 | 0.1 | 3 | 1 | 0.1 | 8 | 4 | 0.1 | 276 |
DSPAM tum | 26 | 0 | 0.1 | 23 | 0 | 0.1 | 12 | 0 | 0.1 | 7 | 0 | 0.1 | 15 | 0 | 0.1 | 305 |
So DSPAM shows strong spam detection in all three modes with a mid-range execution time; it is much slower than bogofilter, but much faster than some of the alternatives. The comprehensive training mode would appear to be the most effective; the TUM mode increases the false negative rate slightly, and the TOE mode introduces false positives. Note that the DSPAM database is quite large; to a great extent, this volume is taken up by a directory full of message hashes used to keep track of which messages have been used to train the filter.
SpamAssassin
SpamAssassin, which is written in Perl, is unique among the filters tested in that it combines a bayesian filter with a large set of heuristic scoring rules. The filter, in essence, is just another rule which gets mixed in with the rest. The rule database takes a great deal of effort (on the part of the SpamAssassin developers) to maintain, and testing messages against all of those rules makes SpamAssassin relatively slow. There is a huge advantage to this approach, however: SpamAssassin works well starting with the first message it sees, and it is able to train its own bayesian filter using the results from the rules.
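The way a bayesian verdict gets "mixed in with the rest" can be pictured as a simple sum of rule scores compared against a threshold. The sketch below assumes SpamAssassin's default required score of 5.0 and a greater-than-or-equal comparison; apart from BAYES_99, the rule names and all of the score values are made up for illustration.

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default threshold

# Hypothetical scores; only the BAYES_99 name comes from SpamAssassin itself.
SCORES = {
    "BAYES_99": 3.5,
    "EXAMPLE_HTML_ONLY": 1.1,
    "EXAMPLE_SUBJECT_ALL_CAPS": 0.8,
}


def total_score(hits, scores=SCORES):
    """Add up the scores of every rule which matched the message."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores=SCORES, required=REQUIRED_SCORE):
    return total_score(hits, scores) >= required


# The bayesian rule alone is not enough with these scores...
print(is_spam(["BAYES_99"]))
# ...but it is when a couple of heuristic rules agree with it.
print(is_spam(["BAYES_99", "EXAMPLE_HTML_ONLY", "EXAMPLE_SUBJECT_ALL_CAPS"]))
```

Raising the bayesian rule's score to the threshold itself would let that one rule decide the matter without help from the heuristics.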
Another nice feature in SpamAssassin is its word list maintenance. Most bayesian filters seem to grow their word lists without bound. Since spam can contain a great deal of random nonsense (actually, much of your editor's ham does as well), the word list can quickly fill up with tokens which are highly unlikely to ever help in classifying messages. Documentation for some other filters suggests occasionally dumping the word list and starting over. SpamAssassin, instead, will occasionally (and automatically) delete tokens which have not been seen for some time. So the word list stays within bounds. In general, SpamAssassin is relatively good at minimizing the need for the user to perform maintenance tasks.
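The general idea of expiring stale tokens can be sketched as follows, assuming the word list records when each token was last seen. SpamAssassin's real expiry logic is more involved than this; the function and its parameters are hypothetical.

```python
def expire_tokens(last_seen, now, max_age):
    """Return a copy of the word list without tokens unseen for max_age seconds.

    last_seen maps each token to the timestamp (in seconds) at which it
    last appeared in a message.
    """
    return {tok: ts for tok, ts in last_seen.items() if now - ts <= max_age}


DAY = 24 * 60 * 60
word_list = {"viagra": 99 * DAY, "xqzzt17": 10 * DAY, "patch": 95 * DAY}
kept = expire_tokens(word_list, now=100 * DAY, max_age=30 * DAY)
print(sorted(kept))
```

Random nonsense tokens, which tend to appear once and never again, are exactly the ones which age out under such a scheme.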
The sa-learn utility is used for most bayesian filter tasks. It can retrain the filter on mistakes, dump out word information, and force cleanup operations. SpamAssassin can be run in a client/server mode, which improves performance on busy systems. The client/server mode can also help to put a bound on SpamAssassin's memory use, which can be a little frightening. Standalone SpamAssassin on a small-memory system can create severe thrashing.
Your editor ran two sets of tests with SpamAssassin 3.1.0, running in the client/server mode, with network blacklist tests enabled. (Before somebody asks: the test was run on a standalone system to avoid any possible contamination by your editor's regular mail stream). Exactly one scoring tweak was made: the score for BAYES_99 (invoked when the bayesian filter is absolutely sure that the message is spam) was set to 5.0, enabling the filter to condemn messages on its own. That change helps to emphasize the bayesian side of SpamAssassin, and, in your editor's experience, makes it more effective. The first test involved a pre-trained database, as was done with the other filters. The second test, instead, started with an empty bayesian database in an effort to see how well the tool trains itself. Here are the results:
Filter | Batch 1 Fn | Batch 1 Fp | Batch 1 T | Batch 2 Fn | Batch 2 Fp | Batch 2 T | Batch 3 Fn | Batch 3 Fp | Batch 3 T | Batch 4 Fn | Batch 4 Fp | Batch 4 T | Batch 5 Fn | Batch 5 Fp | Batch 5 T | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SpamAssassin | 8 | 0 | 1.1 | 3 | 0 | 1.1 | 5 | 0 | 1.1 | 3 | 0 | 1.0 | 2 | 0 | 1.0 | 10 |
SA untrained | 32 | 0 | 0.6 | 9 | 0 | 1.0 | 18 | 0 | 1.0 | 15 | 0 | 1.0 | 7 | 0 | 1.0 | 10 |
The results here show that SpamAssassin filters up to 99.9% of incoming spam, at the cost of significant amounts of CPU time. The untrained run shows higher error rates, but does eventually converge on something similar to the pre-trained version. But, at over one second per message, each testing run (comprising 15,000 messages) took a rather long time.
SpamAssassin operates as a filter, adding a header to messages as they pass through. That header can be used in procmail recipes; the Thunderbird mail agent is also set up to optionally use the SpamAssassin header.
SpamBayes
SpamBayes is a filter written in Python. The SpamBayes hackers have, perhaps more than some of the other filter developers, made tweaks to the bayesian algorithm in an attempt to improve performance. Those hackers have also put more effort into mail system integration than some; as a result, SpamBayes comes with an Outlook plugin, POP and IMAP proxy servers, and a filter for Lotus Notes. It is still possible to use SpamBayes as a command-line filter with procmail, however.
There is a separate script (sb_mboxtrain.py) which is used to train the filter. Your editor followed the instructions and found it seemingly easy to use - it nicely understands things like MH and Maildir folders. However, when used as documented, sb_mboxtrain.py happily (and silently) puts the resulting word database in an undisclosed location, and filtering works poorly. Adding a few options to make the database location explicit took care of the problem.
SpamBayes 1.0.4 was tested in two modes: retraining just on errors, and training on all messages.
Filter | Batch 1 Fn | Batch 1 Fp | Batch 1 T | Batch 2 Fn | Batch 2 Fp | Batch 2 T | Batch 3 Fn | Batch 3 Fp | Batch 3 T | Batch 4 Fn | Batch 4 Fp | Batch 4 T | Batch 5 Fn | Batch 5 Fp | Batch 5 T | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SpamBayes | 71 | 0 | 0.4 | 44 | 0 | 0.4 | 29 | 0 | 0.4 | 21 | 0 | 0.4 | 20 | 1 | 0.4 | 4 |
SB train all | 90 | 0 | 0.8 | 58 | 0 | 0.8 | 54 | 0 | 0.8 | 46 | 0 | 0.9 | 46 | 0 | 0.9 | 16 |
SpamBayes takes a while to truly train itself, but it does eventually get to a 98.9% filtering rate - better than some, but not truly amazing. The word database remains relatively small, but processing time is significant - especially if comprehensive training is used. Comprehensive training makes things worse on both counts: the spam detection rate drops while processing time increases. SpamBayes does almost entirely avoid false positives, though; the only one in either mode turned up in the final batch of the train-on-errors run.
SpamProbe
SpamProbe is a filter written in C++ and released under the Q Public License. Unlike most filters, which record statistics on individual words, SpamProbe is also able to track pairs of words (DSPAM can do that too). SpamProbe looks at text attachments, discarding other types of attachments with one exception: there is a simple parser for GIF images. This parser creates various words describing images in a message (based on sizes, color tables, GIF extensions, etc.) and uses them in evaluating each message.
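Tracking word pairs just means that the tokenizer emits adjacent pairs alongside the single words, so that a phrase can carry evidence which its individual words do not. A minimal sketch, using a whitespace split in place of SpamProbe's real tokenizer (the function name is hypothetical):

```python
def words_and_pairs(text):
    """Return the single words of a message followed by its adjacent pairs."""
    words = text.lower().split()
    pairs = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + pairs


tokens = words_and_pairs("Buy cheap pills")
print(tokens)
```

The price is a much larger word database, since the number of distinct pairs grows far faster than the number of distinct words.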
SpamProbe is packaged as a single command with a vast number of options. There is an "auto-train" mode for getting the filter trained in the first place. There are two filtering modes which the author calls "train" and "receive." Both will filter the message; the "train" mode only updates the word database "if there was insufficient confidence in the message's score," while "receive" always updates the database. The author recommends "train" mode; your editor tested SpamProbe 1.4a in both modes:
Filter | Batch 1 Fn | Batch 1 Fp | Batch 1 T | Batch 2 Fn | Batch 2 Fp | Batch 2 T | Batch 3 Fn | Batch 3 Fp | Batch 3 T | Batch 4 Fn | Batch 4 Fp | Batch 4 T | Batch 5 Fn | Batch 5 Fp | Batch 5 T | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SpamProbe train | 80 | 0 | 0.2 | 39 | 1 | 0.1 | 37 | 1 | 0.1 | 39 | 1 | 0.1 | 27 | 0 | 0.1 | 81 |
SP receive | 90 | 0 | 0.6 | 51 | 2 | 0.6 | 39 | 1 | 0.6 | 42 | 0 | 0.7 | 35 | 1 | 0.9 | 201 |
SpamProbe's "receive" mode demonstrates that, with bayesian filters, more training is not always better. The added training slows down processing significantly, to the point that SpamProbe is almost as slow as SpamAssassin, but the end results are worse than those obtained without comprehensive training. SpamProbe has a significant false positive rate in either mode, but the "receive" mode makes it worse. In either mode, SpamProbe generates vast amounts of disk traffic, rather more than was observed with the other filters.
Unlike most other filters, SpamProbe does not insert a header in filtered mail. Instead, it emits a single line giving its verdict; the author then suggests using a tool like formail to create a header using that score. So integration of SpamProbe is a little harder than with some other tools.
Summary
Here is a summary table combining all of the filter runs described above:
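Test | False neg. | False neg. % | False pos. | False pos. % | Time (s/msg) | Size (MB) |
---|---|---|---|---|---|---|
bogofilter | 406 | 4.5% | | | 0.02 | 5 |
bogofilter -u | 268 | 3.0% | | | 0.06 | 32 |
CRM114 | 14 | 0.2% | 16 | 0.3% | 0.06 | 24 |
CRM114 pretrain | 14 | 0.2% | 15 | 0.3% | 0.06 | 24 |
DSPAM teft | 50 | 0.6% | | | 0.1 | 305 |
DSPAM toe | 67 | 0.7% | 15 | 0.3% | 0.1 | 276 |
DSPAM tum | 83 | 0.9% | | | 0.1 | 305 |
SpamAssassin | 21 | 0.2% | | | 1.1 | 10 |
SpamAssassin untrained | 81 | 0.9% | | | 0.9 | 10 |
SpamBayes | 185 | 2.1% | 1 | 0.02% | 0.4 | 4 |
SpamBayes train all | 294 | 3.3% | | | 0.8 | 16 |
SpamProbe train | 222 | 2.5% | 3 | 0.05% | 0.1 | 81 |
SpamProbe receive | 257 | 2.9% | 4 | 0.07% | 0.7 | 201 |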
In the above table, the "false positives" columns were left blank for tests in which there were none. Since false positives will be the bane of any spam filter, it is good if they stand out.
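For those wanting to check the arithmetic: the percentages appear to be the per-batch counts summed over all five batches, divided by the 9,000 spams (for false negatives) or the 6,000 hams (for false positives) in the test set. A short sketch, using the SpamProbe "train" figures from the batch table above; the function name is invented for the example.

```python
SPAMS = 9000  # spams fed to each filter after initial training
HAMS = 6000   # hams fed to each filter after initial training


def summarize(fn_per_batch, fp_per_batch):
    """Return (fn count, fn percent, fp count, fp percent) over all batches."""
    fn = sum(fn_per_batch)
    fp = sum(fp_per_batch)
    return fn, 100 * fn / SPAMS, fp, 100 * fp / HAMS


fn, fn_pct, fp, fp_pct = summarize([80, 39, 37, 39, 27], [0, 1, 1, 1, 0])
print(f"{fn} {fn_pct:.1f}% {fp} {fp_pct:.2f}%")
```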
One should, of course, take all of the above figures with a substantial grain of salt. They reflect performance on your editor's particular mail stream; things could be very different with somebody else's mail. Still, your editor's mail stream is varied enough that, perhaps, a few conclusions can be drawn.
One of those would be that SpamAssassin is still hard to beat. It is, by far, the slowest of the filters, but it is highly effective with a minimum amount of required setup and maintenance on the user's part. For the most part, it Just Works, and it works quite well. In situations where an administrator is setting things up for a large group of users, DSPAM may well be indicated. The broad flexibility of that tool makes it easy to integrate into just about any mail system, and the web interface makes its operation relatively transparent to users. Just be sure you have a big disk for its databases.
CRM114 is an interesting project; its combination of technologies has the potential to make it the most accurate of all the filters. It has the look of a hardcore power tool. This tool, however, is not ready for prime time at this point. It is a major hassle to set up, and, for your editor at least, keeping the filter stable was a challenge. The other three filters all have their strong points, but none of them had the level of spam detection that your editor would like to see.
There are, of course, other filters out there as well. Some of the graphical mail clients have started to integrate their own filters. There is a great convenience in having a "junk" button handy, but the integrated filters sacrifice transparency and, in your editor's (admittedly limited) experience, they do not seem to develop the same level of accuracy. There is also ifile, which is intended to be a more general mail classifier. That tool is no longer under development, however.
In the end, none of the filters reviewed is perfect - it would be nice to
see no spam at all. But some of them are surprisingly close. Think back,
for a minute, to the days when we were complaining about getting a dozen spams
per day - or per week. Who would have thought that we would be able to
cope with thousands of spams per day and still deal with our mail? The
developers of these filters have, in a significant way, saved the net, and
your editor thanks them.
Posted Feb 22, 2006 23:58 UTC (Wed)
by njhurst (guest, #6022)
(8 responses)
http://research.microsoft.com/~horvitz/junkfilter.htm
Not sure why Graham paper gets such acclaim.
Currently I'm just using greylisting and getting good results. Spam Assassin is become unworkable. I'll probably link to SPEWS or similar when spammers cotton on to greylisting.
Posted Feb 23, 2006 0:31 UTC (Thu)
by dwheeler (guest, #1216)
But Graham rightly gets credit for promulgating the idea. Having an idea isn't enough... other people have to know about it. Graham did a good job of connecting problem and solution, in a way that got others interested.
Posted Feb 23, 2006 4:10 UTC (Thu)
by FraserCampbell (guest, #33142)
(5 responses)
So I get a 98% reduction in my spam and a pittance coming through, I don't have to hunt through a spam folder to see if things were missed and all my normal mails and mailing list traffic is totally unimpacted (I'm using postgrey's default of 30 day storage for ip/from/rcpt triplet).
I never liked the solutions that filter mail to folders (or /dev/null) because there are inevitably mistakes ... at least I always found mistakes (false positives). I don't recall all the different filters I've tried over the years but bogofilter was one of them. Perhaps things have improved but with greylisting so incredibly effective at this point I'll leave it to others to worry about the spam problem.
Posted Feb 23, 2006 8:46 UTC (Thu)
by tzafrir (subscriber, #11501)
(4 responses)
Posted Feb 23, 2006 12:47 UTC (Thu)
by shane (subscriber, #3335)
(3 responses)
Sure, but until then you drop 90% of spam without having to filter.
Even if spammers do start to retry 3 times, they can be quickly identified
unless the sender keeps state about the sessions, and retries only after a
retry message from the server. For a spammer this means increased
memory/disk usage, and more complicated spambots.
Plus compromised hosts (which send a lot of spam) tend to get discovered,
and taken off-line. If this happens before the re-send the mail never gets
delivered.
- Jan L. A. van de Snepscheut
Posted Feb 24, 2006 1:06 UTC (Fri)
by njhurst (guest, #6022)
Posted Feb 24, 2006 2:32 UTC (Fri)
by AnswerGuy (guest, #1256)
(1 response)
It's only an incremental extra effort to deliver multiple times regardless of whether the earlier copies reached you or not.
I get over a 1000 slices of spam per day through my personal inbox ... even after greylisting and some blacklisting (very limited blacklisting). SpamAssassin gets over 99% of that, but I still end up with 20 or so, per day, that make their way through SA. Often I get about 5 copies of any that do get through.
Periodically I go through the spam folder (and one "longlines" folder which contains spam with very long lines --- basically no standards compliant linefeeds at all, which have broken my procmail recipes for SA and YAVR in the past). (YAVR is an anti-virus recipe since SA normally doesn't catch those as "spam" per se).
I've only noticed a tiny handful of false positives dumped into my spam folder by SA. (Less than a half dozen in almost two years). This is a very unscientific measure since I am pretty heavy handed with the delete key when I do spot check the spam folder; and I'm only human with a limited amount of time and energy to spend on rescuing mail sent by strangers whose content looks too "spammy."
I don't keep metrics on rejections ... nor on greylisting delivery deferrals that never get delivered. I only have one confirmed case of a bit of e-mail that was not spam but which got greylisted for 49000 seconds (wife was attempting a PayPal password change and their mail server didn't respect the conventional retry delays --- the Postfix greylisting daemon we're using punishes apparent attackers with an exponential back off). (She resolved that by simply whitelisting them and forcing another PW change; while also teaching the silly tech support person there all about proper MTA behavior, and greylisting over the phone).
So, greylisting helps a little ... but too many spammers have adapted and now simply, blindly try everything multiple times. (Also anyone that does successfully dump their spam on an open relay gets the delivery retries for free ... all standards compliant MTAs do that, and most open relays are just old, unpatched, poorly configured copies of sendmail).
(I do NOT use ORBS type blocking ... I refuse to configure my MTAs to implement a set of dynamically changing policies that are set by strangers ... so I only add my own connection blocking sparingly ... so far).
The main thing that seems to limit my spam load is my own paltry bandwidth. My mailservers share bandwidth with DNS and web traffic (from my clients and servers) over a little old 144Kbps IDSL line. I have about the equivalent of two bonded 56K leased lines. Apparently a significant number of spam cannons time out and drop connections on such a slow link (they've got millions of other targets to get to).
JimD
Posted Feb 24, 2006 13:53 UTC (Fri)
by copsewood (subscriber, #199)
Also occasional genuine messages getting rejected by the MTA that accepts mail from across admin boundaries will result in a bounce to the sender, while not sending bounces to innocent victims, e.g. which happens if you reject at an internal incoming MTA. I set spamassassin score > 10 to MTA reject and > 7.5 to go to my spam folder via procmail. I also use the spamhaus DNSBL which currently rejects about 900 spams a week on my server and with which I have never seen a single false positive in about 2 years use, and use less reliable DNSBLs e.g. spews for spam folder filtering.
Posted Feb 26, 2006 15:09 UTC (Sun)
by job (guest, #670)
So while not being in any way groundbreaking, he pointed out some interesting points that really re-ignited the idea. That, and that it was probably a good time when many people were unable to manually cope with all spam, was what makes people refer to the paper so often.
Posted Feb 23, 2006 0:11 UTC (Thu)
by adamgundy (subscriber, #5418)
I haven't used it myself in quite a while, but I know others who do and say it's good..
Posted Feb 23, 2006 0:20 UTC (Thu)
by dlang (guest, #313)
(1 response)
it is more than just a spam/ham filter, it can filter into many different categories (there are people who use it to filter into >50 categories)
it started out as a pop3 filter, but now will also do SMTP and NNTP as well as providing XMLRPC and IMAP interfaces.
its IMAP interface is fairly unique in that it doesn't act as a proxy between your mail client and the mail server, instead it acts as a client itself and watches your inbox, automatically moving messages into subfolders for you (and you train it by moving messages to the proper subfolder from wherever they get put by popfile).
Popfile provides a web interface to reclassify messages and do other configuration tasks (IMAP users won't bother with this much once they set it up)
David Lang
Posted Feb 23, 2006 1:54 UTC (Thu)
by mbcook (subscriber, #5517)
[Link]
I recently tried the IMAP support, which was cool. I'm not an IMAP person so I went back to using
It also supports SSL if you have the needed perl modules. SSL support works great and
Check it out.
Posted Feb 23, 2006 1:17 UTC (Thu)
by karim (subscriber, #114)
[Link] (1 responses)
Any chance you could tell us what machine you've been running these tests
Thanks,
Karim
Posted Feb 23, 2006 1:57 UTC (Thu)
by corbet (editor, #1)
[Link]
Posted Feb 23, 2006 1:57 UTC (Thu)
by wstearns (subscriber, #4102)
[Link] (6 responses)
If you'll allow me to paraphrase, SpamAssassin (bias alert - I
Posted Feb 23, 2006 2:07 UTC (Thu)
by dlang (guest, #313)
[Link]
the problem with putting filters in series is that they can start paying more attention to the results of the earlier filters than they should (rather than evaluating the message itself)
David Lang
Posted Feb 23, 2006 2:08 UTC (Thu)
by corbet (editor, #1)
[Link] (4 responses)
Posted Feb 23, 2006 2:23 UTC (Thu)
by wstearns (subscriber, #4102)
[Link] (1 responses)
Don't get me wrong - I still love, and use, and actively support SA, but the all-or-nothing approach means it needs beefy CPUs, lots of ram, and a relatively large amount of time to use all tests.
Posted Feb 23, 2006 8:47 UTC (Thu)
by zmi (guest, #4829)
[Link]
And with the additional RulesDuJour script found on
have fun without spam,
Posted Feb 23, 2006 10:06 UTC (Thu)
by nix (subscriber, #2304)
[Link]
The perceptron was persistently choosing overly low scores for the Bayesian filters, because *when SA's static regex rules work well*, choosing low scores for the high-probability Bayesian learner hits does indeed minimize FPs, as genuine spams tend to hit large numbers of static regex rules as well --- but those rules work less and less well after SA's release, and the perceptron cannot take that into account.
Hence the hardwiring of the Bayesian scores henceforward.
Posted Feb 27, 2006 9:42 UTC (Mon)
by Ross (guest, #4065)
[Link]
Posted Feb 23, 2006 5:51 UTC (Thu)
by besonen (guest, #22686)
[Link]
> you could also try PopFile -
yes, please add popfile to your review.
popfile is the best bayesian-type spam filter, imho.
peace,
Posted Feb 23, 2006 6:09 UTC (Thu)
by gdt (subscriber, #6284)
[Link] (14 responses)
What these packages really need is a more convenient user interface for the corporate user. Train the filter with a command line interface? I think not. Most corporate Linux environments don't offer xterm to the average user. It would be nice if a simple action, such as pressing the Junk button in the mail reader or moving a message to a Spam folder, could be communicated back to the filters by the IMAP server.
Posted Feb 23, 2006 6:57 UTC (Thu)
by dlang (guest, #313)
[Link]
unfortunately it has two problems (one major, one minor)
the major one is that it is still single-user (multi-user operation is under development, but stalled as the core developer moved to France)
the minor one is that popfile maintains one IMAP session per folder, this is actually much less resource intensive for the server than opening and closing folders to cycle through them, but it 'looks suspicious' when a sysadmin does a netstat and sees lots of connections from one client. when I get a chance I intend to add a 'max connections' parameter that will limit how many connections are in use at a time.
Posted Feb 23, 2006 8:21 UTC (Thu)
by davidw (guest, #947)
[Link] (2 responses)
Indeed, I would have liked to have seen Thunderbird included in this test - it too claims to have a Bayesian spam filter.
Posted Feb 23, 2006 13:33 UTC (Thu)
by bk (guest, #25617)
[Link] (1 responses)
Posted Feb 23, 2006 13:55 UTC (Thu)
by bronson (subscriber, #4806)
[Link]
I think that ALL spam filtering should occur on the server, before the message queue. This just solves so many problems. However, training necessarily requires user interaction, and therefore should be supported in the MUA. True, there are web-based hacks, like in dspam, to try to work around this but just try explaining them to a clueless user!
Posted Feb 23, 2006 8:41 UTC (Thu)
by lolando (guest, #7139)
[Link]
Posted Feb 23, 2006 17:30 UTC (Thu)
by vmole (guest, #111)
[Link] (6 responses)
The problem with feeding back the message the user sees is that by the time the user sees it, Outlook has already destroyed the headers - there is no standard way to extract an undamaged, as-delivered message from Outlook and send it back to the anti-spam system.
DSPAM works around this by inserting a hash into the message as it passes through. When the message is sent back for "rescoring", DSPAM extracts the hash code and then just reassigns the tokens associated with that hash. The downside is that the users wonder about that weird code in all their e-mails.
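The signature idea described above can be sketched roughly like this. It is not DSPAM's code or storage format: the class name `SignatureStore`, the whitespace tokenizer, and the truncated SHA-1 signature are all assumptions made for the illustration. The point is only that retraining needs the signature, not the original message.

```python
import hashlib
from collections import Counter


class SignatureStore:
    """Hypothetical token store keyed by a per-message signature."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.by_signature = {}  # signature -> (tokens, label it was filed under)

    def record(self, message: str, label: str) -> str:
        """Count the message's tokens under `label` and return its signature."""
        tokens = message.lower().split()
        signature = hashlib.sha1(message.encode("utf-8")).hexdigest()[:16]
        self.counts[label].update(tokens)
        self.by_signature[signature] = (tokens, label)
        return signature

    def retrain(self, signature: str, correct_label: str) -> None:
        """Move the remembered tokens to `correct_label` using only the signature."""
        tokens, old_label = self.by_signature[signature]
        if old_label == correct_label:
            return
        self.counts[old_label].subtract(tokens)
        self.counts[correct_label].update(tokens)
        self.by_signature[signature] = (tokens, correct_label)
```

A mail client that mangles headers can still return the signature string, which is all `retrain` needs.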
Another method is used by Maia Mailguard, which just keeps a cache of all the mail. The user needs to go to a web interface to mark false negatives, but it also lets them look through the spam cache to find and release false positives. (Maia is "just" a front-end to amavisd/spamassassin).
Posted Feb 23, 2006 22:09 UTC (Thu)
by NightMonkey (subscriber, #23051)
[Link] (1 responses)
Posted Feb 28, 2006 13:46 UTC (Tue)
by vmole (guest, #111)
[Link]
Posted Feb 26, 2006 15:21 UTC (Sun)
by job (guest, #670)
[Link] (2 responses)
Are other headers destroyed too? Do different MUAs destroy different headers? I would be very interested to know!
Posted Feb 28, 2006 13:44 UTC (Tue)
by vmole (guest, #111)
[Link] (1 responses)
There's no way (that I've found[1]) to bounce a message from Outlook. By "bounce" (aka "redirect"), I mean to send the message to a different address w/o mucking with headers, so that it goes to the second address looking just as if it had been sent there originally (modulo any "Received" headers, but those aren't particularly relevant at this point). Forwarding is not equivalent, as that adds all sorts of crap to the body of the message *and* loses the headers.
Why is this a big deal? Because to retrain Bayesian-type spam filters, you need to provide something relatively close to the original message. Outlook just can't do it.
Steve
[1] There are add-on modules that will do this for Outlook. The chance of getting one installed and working on an existing collection of ~20,000 pcs of various vintages is 0.
Posted Mar 25, 2006 11:27 UTC (Sat)
by job (guest, #670)
[Link]
Posted Sep 22, 2006 6:47 UTC (Fri)
by bobpriston (guest, #40434)
[Link]
Posted Feb 23, 2006 17:56 UTC (Thu)
by johill (subscriber, #25196)
[Link]
I do exactly that:
Posted Mar 2, 2006 10:08 UTC (Thu)
by jpetso (subscriber, #36230)
[Link]
Posted Feb 23, 2006 9:22 UTC (Thu)
by Segora (subscriber, #8209)
[Link]
Posted Feb 23, 2006 9:42 UTC (Thu)
by walterh (guest, #19113)
[Link] (2 responses)
Posted Feb 23, 2006 10:40 UTC (Thu)
by robdinn (guest, #30753)
[Link]
But maybe that is an argument for spamassassin + greylisting :)
Posted Feb 23, 2006 13:36 UTC (Thu)
by bk (guest, #25617)
[Link]
Posted Feb 23, 2006 12:09 UTC (Thu)
by pink18 (guest, #32445)
[Link]
I found this when I needed some filtering that was easy to integrate into a system for a single user. It is simple to build, install and use.
Posted Feb 23, 2006 12:57 UTC (Thu)
by jpmcc (guest, #2452)
[Link] (1 responses)
John
Posted Feb 24, 2006 5:23 UTC (Fri)
by thedevil (guest, #32913)
[Link]
Gmail antispam is to spamassassin and friends what "web forums"
(The reason I do use gmail is they offer SSL and authentication
Posted Feb 23, 2006 13:54 UTC (Thu)
by glouis (guest, #526)
[Link] (1 responses)
Posted Feb 24, 2006 19:39 UTC (Fri)
by Dom2 (guest, #458)
[Link]
-Dom
Posted Feb 23, 2006 15:21 UTC (Thu)
by tjw.org (guest, #20716)
[Link] (1 responses)
I access email through IMAP and I use mutt to read it. I have done this for a long time and generally resist change. The Bayesian filters available didn't integrate with this configuration at all. I would have had to use fetchmail or something similar and configuring mutt to train seemed complex to me.
I settled on using Thunderbird solely as a Bayesian filter/trainer. I just run it minimized and it takes care of almost all of my spam. In mutt, I move any undetected spam messages to a special IMAP mailbox and occasionally train Thunderbird on the contents of that mailbox.
The nicest part is that when Thunderbird marks messages as "Junk" and moves them to its designated IMAP mailbox for spam, the messages still show up in mutt in my INBOX as deleted (until I hit x to sync the mailbox). This gives me a chance to spot any false positives (of which there are very few). Also, Thunderbird automatically deletes the spam it identifies after 2 weeks, saving me the cleanup task.
Posted Feb 23, 2006 21:26 UTC (Thu)
by dlang (guest, #313)
[Link]
Posted Feb 24, 2006 1:18 UTC (Fri)
by barbara (guest, #3014)
[Link]
Posted Feb 24, 2006 4:55 UTC (Fri)
by csawtell (guest, #986)
[Link] (1 responses)
Posted Mar 7, 2006 11:43 UTC (Tue)
by gvy (guest, #11981)
[Link]
Posted Mar 2, 2006 19:35 UTC (Thu)
by sholdowa (guest, #34811)
[Link] (2 responses)
Could you post your tests so we can do this?
Posted Mar 2, 2006 20:33 UTC (Thu)
by corbet (editor, #1)
[Link] (1 responses)
Posted Mar 3, 2006 0:15 UTC (Fri)
by sholdowa (guest, #34811)
[Link]
Posted Mar 4, 2006 8:31 UTC (Sat)
by Ausdarren (guest, #36280)
[Link] (1 responses)
If sendmail on the receiving system used DNS to verify that the IP address of the mail server trying to connect and deposit a mail item really was coming from where it said it was coming from, and then rejected all mail that appeared false, most of the problem would be resolved. I'm not a programmer, and I expect that legitimate operators would need to ensure that the domain they send from was correctly registered in DNS; however, the benefits to the entire internet community would be substantial, and well worth the teething trouble to implement.
Can this be done? Is it feasible?
many thanks,
Darren Latta
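One possible reading of the check proposed above is a forward-confirmed reverse DNS test: look up the host name the connecting IP address claims, then confirm that name resolves back to the same address. The sketch below is an assumption about what was meant, not something sendmail is documented here to do, and it says nothing about how effective the check is against spam. The function names are made up, and the lookups are passed in as parameters so the logic can be tried without touching the network.

```python
import socket


def _reverse(ip):
    # PTR lookup: IP address -> host name the address claims to be
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    # A lookup: host name -> list of IPv4 addresses
    return socket.gethostbyname_ex(hostname)[2]


def forward_confirmed(ip, reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True if ip -> name -> addresses leads back to the same ip."""
    try:
        hostname = reverse_lookup(ip)
        addresses = forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        return False
    return ip in addresses
```

Note that this only checks the connecting server against its own DNS records; it does not tie the server to the sender address in the message.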
Posted Mar 7, 2006 11:45 UTC (Tue)
by gvy (guest, #11981)
[Link]
Posted Mar 7, 2006 22:08 UTC (Tue)
by jaalto (guest, #36348)
[Link] (1 responses)
Another article with tests titled: *Spam* *Filters* by Sam Holden
http://freshmeat.net/articles/view/964/
Jari
[1]
Posted Oct 5, 2007 16:55 UTC (Fri)
by johill (subscriber, #25196)
[Link]
To be fair, people other than Graham built bayesian spam classifiers before his paper. I recall that Microsoft published a paper a good 6 months earlier. Here's one from 1998 (last millennium :):
Ifile implemented the idea of automatically categorizing years earlier than that. It claims a first release date of Aug 3 20:49:01 EDT 1996. The author doesn't claim that is the first time this has been implemented, either.
Filtering even earlier - 1996
Greylisting is fantastic. I enabled greylisting on February 6th; since then I have averaged 2.88 spams per day. Before that I averaged 121 pieces of spam per day (looking just at the month of January).
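For readers who have not met greylisting, the basic mechanism under discussion can be sketched as below: the first delivery attempt from an unknown (client IP, sender, recipient) triplet gets a temporary failure, and a retry after some minimum delay is accepted. The class name, the 300-second delay, and the return labels are assumptions for the example; real implementations also expire old entries and whitelist known senders.

```python
class Greylist:
    """Hypothetical in-memory greylist keyed on (ip, sender, recipient)."""

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay   # seconds a new triplet must wait (assumed value)
        self.first_seen = {}         # triplet -> time of first delivery attempt

    def check(self, ip: str, sender: str, recipient: str, now: float) -> str:
        key = (ip, sender, recipient)
        # Remember when this triplet first showed up; later attempts reuse that time.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "accept"
        return "tempfail"            # ask the sending server to try again later
```

A sender that never retries, as many spam cannons of the time did not, never gets past `tempfail`; a sender that does retry after the delay gets through, which is the weakness raised in the replies.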
It's only a matter of time until spammers learn to retry three times. And then we get even more useless spam traffic.
It's only a matter of time until spammers learn to retry three
times.
In theory there is no difference between theory and practice. But in
practice, there is.
The bottom line is that today, it works great.
And when they are resending 3 times there is a high probability that they hit one of your honeypots. (SPEWS)
... they just retry everything three or four times at five minute intervals.
Actually they don't bother with state
(The Linux Gazette "Answer Guy" --- no I didn't pick the name; yes, my e-mail address is still the same: jimd@starshine.org --- and published monthly in several languages and countries around the world for several years).
While greylisting makes sense for the reasons other contributors suggest, you might spend less time and have fewer manual false discards going through your spam folder if you reject more of the very high probability spam at the MTA level. Manually checking what went into the spam folder is quicker and more accurate when it holds only the doubtful messages; otherwise there will be too many in the folder to do the job accurately.
reject on black, filter on grey
If you'd actually read Graham's paper you'd know that Bayesian auto classification of mail is a lot older than that. What Graham did was point out that many of the existing attempts had been "too smart" and that results could actually be more accurate when not converting character sets, cleaning up headers etc. since their very existence adds to the amount of classification information.
you could also try PopFile - http://popfile.sourceforge.net/
unfortunately you missed popfile (which just released 0.22.4 today)
more info at
http://popfile.sourceforge.net/ or http://sourceforge.net/projects/popfile/
I have to agree. I've been using POPFile for years, both on Windows and OS X. It is a fantastic
little program. It doesn't detect "ham/spam", it puts things into buckets. Now I use Ham and
Spam as my buckets, but you can add more and it will learn where things go. So you could make
a bucket for kernel patches and it would learn when things go in there. That would probably
increase the accuracy since it doesn't have to lump kernel patches (which would be largely C
code) in with Ham (which would contain all sorts of stuff).
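The "buckets" idea above is ordinary multi-class naive Bayes: instead of two word-count tables there is one per bucket, and a message goes to whichever bucket makes its words most likely. The sketch below is not POPFile's implementation; the class name, the whitespace tokenizer, and the add-one smoothing are assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


class BucketClassifier:
    """Hypothetical multi-bucket naive Bayes word classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.doc_counts = Counter()              # bucket -> messages trained

    def train(self, bucket: str, text: str) -> None:
        self.word_counts[bucket].update(text.lower().split())
        self.doc_counts[bucket] += 1

    def classify(self, text: str) -> str:
        words = text.lower().split()
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior for the bucket plus log likelihood of each word,
            # with add-one smoothing so unseen words do not zero the score.
            score = math.log(self.doc_counts[bucket] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

With a separate bucket for patches, C-heavy messages no longer dilute the word statistics for ordinary ham, which is the accuracy argument made in the comment.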
POP3 but it did have one really cool feature: classification by folders. Since it monitors what is in
each folder, it knows when you move things. Thus, when you move a message out of your Spam
folder into your Ham folder, it learns that it made a mistake (and vice versa). This is such a cool
idea. Because of the way it works you could set a little home server to run POPFile in IMAP mode
and then whenever and where ever you check your e-mail you can reclassify things and it will
learn from that.
provides security that is nice to have when you often work on open networks where someone
could get your e-mail password.
Jonathan,
on? I'm sure you're not ill-equipped ;), but it'd be great if we could
relate the execution times to some clock speed ...
You know, I meant to do that. Either that, or normalize all the numbers, because I think it's the differences between them that matter. Anyway, for what it's worth, the test system is a dual 2.2 GHz Xeon box.
Test system
Good comparison, Jon - thanks.
The best of both worlds - a hybrid approach?
contribute to SpamAssassin) is accurate but slow. Other tools are faster,
but not quite as accurate. How about getting the speed of other tools and
the accuracy of SpamAssassin?
Picture procmail first handing the message off to a fast filter
such as bogofilter, CRM114, or DSPAM. Those are told to only score if
they're certain a message is ham or spam, which would probably mean
adjusting the thresholds for ham and spam and leaving a larger gray area
between those thresholds. When the above bayesian filter is not sure,
procmail then hands the message off to SpamAssassin (with bayes filtering
turned off but all the network checks active) for more in depth checks.
For the vast majority of the messages you get quick filtering.
When that bayes score is borderline, we check again with a slower but more
accurate tool. Wouldn't that be the best of both worlds - accurate and
almost as fast as the initial filter by itself?
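The cascade described above might look like the sketch below. The two filters are stand-in callables, not real bindings to bogofilter, CRM114, DSPAM or SpamAssassin, and the 0.1 / 0.9 cutoffs are invented for the example; the only thing shown is the control flow of "answer quickly when sure, escalate when not".

```python
from typing import Callable


def hybrid_classify(
    message: str,
    fast_score: Callable[[str], float],    # hypothetical: returns P(spam) in [0, 1]
    slow_is_spam: Callable[[str], bool],   # hypothetical: slower, more thorough check
    ham_below: float = 0.1,                # assumed cutoff: at or below this is ham
    spam_above: float = 0.9,               # assumed cutoff: at or above this is spam
) -> str:
    p = fast_score(message)
    if p >= spam_above:
        return "spam"                      # fast filter is confident
    if p <= ham_below:
        return "ham"                       # fast filter is confident
    # Gray area: only now pay for the expensive filter.
    return "spam" if slow_is_spam(message) else "ham"
```

Widening the gap between `ham_below` and `spam_above` sends more mail to the slow filter, trading speed for a second opinion.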
I thought that's exactly what SpamAssassin claims to do (it's the justification for leaving in all the other checks)
The best of both worlds - a hybrid approach?
The thing is...SpamAssassin is the hybrid approach. I just think it needs to learn to trust its bayesian filter a bit more. Certainly I've never had cause to regret raising the score on BAYES_99. Maybe all SA really needs is a rule that shorts out all the other tests if the bayesian filter is convinced that it has figured out the message. That, alone, might make SA a lot faster, at least in the client/server mode.
The best of both worlds - a hybrid approach?
There's the catch - SpamAssassin is all or nothing. I've watched the mailing list for many years now, and every once in a while someone will suggest bypassing expensive checks if the score is high enough with the quick and easy ones. The answer always comes back no, when you use SA all the checks are used, even if the sender is in a whitelisted domain (that just subtracts 100 from the still-expensively-calculated score). The logic is that the expensive tests might reduce a high score, so we need to do all the checks.
The best of both worlds - a hybrid approach?
One thing to mention when using SA is that there is
http://rulesemporium.com/ with lots of additional rules to filter certain
types of SPAM. With those, classification can be made almost perfect.
Making SA even better
http://www.exit0.us/index.php?pagename=RulesDuJour you can even update
those rules automatically each night, adapting to new spam almost
immediately.
zmi
The developers agree. As of SA 3.2, the Bayesian scores are fixed and not trained by the perceptron.
The best of both worlds - a hybrid approach?
One thing I've always wondered is why they do it "backwards". Presumably they could do away with a lot of manual tuning if they fed the individual test results into the stream as words. So you could have "SATEST=TEST_RESULT" incorporated into the Bayesian decision making. Two complicating factors would be how to protect it from spammers incorporating those tokens (maybe use an installation-specific "password" prepended to the token), and chunking the numeric results so that they match for similar inputs (obviously floating point numbers won't generally work well). Of course it wouldn't do anything to speed up processing.
The best of both worlds - a hybrid approach?
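A rough sketch of the suggestion above: turn each rule result into a word-like token, prefix it with an installation-specific secret so a spammer cannot plant the same token in a message, and round the numeric score into coarse buckets so similar results map to the same token. Nothing here is SpamAssassin's API; the function name, the token layout, the rule names in the usage example, and the 0.5 bucket width are all invented for illustration.

```python
def rule_tokens(rule_hits: dict, secret: str, bucket_width: float = 0.5) -> list:
    """Turn {rule name: score} into tokens a Bayesian filter could learn from."""
    tokens = []
    for name, score in sorted(rule_hits.items()):
        # Chunk the score so that 0.9 and 1.1 both land in the 1.0 bucket.
        bucket = round(score / bucket_width) * bucket_width
        # Prepend the local secret so the token cannot be guessed and injected.
        tokens.append(f"{secret}:SATEST:{name}={bucket:.1f}")
    return tokens


if __name__ == "__main__":
    hits = {"HYPOTHETICAL_RULE_A": 1.1, "HYPOTHETICAL_RULE_B": 2.4}
    print(rule_tokens(hits, secret="local-secret"))
```

The tokens would then be appended to the message's ordinary word list before training or scoring; as the comment says, this changes who does the weighting, not how long the rules take to run.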
on Feb 23, 2006 0:11 UTC (Thu) adamgundy wrote:
> http://popfile.sourceforge.net/
david
User interfaces
popfile will do exactly this (classify based on moving the message between IMAP folders), and has a web GUI to use as well.
User interfaces
This is a very important point. As Linux heads towards the desktop, we need spam filters that can be trained by clicking on 'spam' 'not spam' buttons, ala Thunderbird.
User interfaces - yes!
I mostly disagree. Spam filtering is an upstream, mail-provider problem. Failing that, however (let's say you're stuck with a lax sysadmin who doesn't care about the spam flooding his/her users), it is good to have a fallback like MUA filtering.
User interfaces - yes!
Except that training, an essential component of spam filtering these days, is a downstream, user problem.
User interfaces - yes!
Doesn't Evolution do just that? That's not a rhetorical question; I'm a Gnus user myself, but the Debian Evolution packages recommend spamassassin, so I'd guess they offer at least some kind of GUI for it.
User interfaces
User interfaces
DSPAM can put the Signature in the headers. Then, your users won't wonder about anything. ;)
User interfaces
If it's in the headers, it doesn't do any good: Outlook won't send it back to the filter when the message is forwarded.
User interfaces
What headers does Outlook destroy? I understand that the Stats-headers is modified, but that's the same with all MUAs and could easily be fixed by excluding that specific line.
User interfaces
User interfaces
Moving a message doesn't destroy the headers.
User interfaces
You can use the field PR_TRANSPORT_MESSAGE_HEADERS from IMessage for programming or Options.. in right click menu to get "original" internet headers for email. Outlook spam filter http://www.spam-reader.com Spam Reader and others get all necessary information in this way, and work with good efficiency.
User interfaces
> It would be nice if a simple action, such as pressing the Junk button in
> the mail reader or moving a message to a Spam folder, could be communicated
> back to the filters by the IMAP server.
http://johannes.sipsolutions.net/Projects/dovecot-dspam-i...
User interfaces
I think Kontact/KMail offers the optimal solution here by integrating the
command-line filters with its graphical interface. You just have to choose
one of bogofilter, spamassassin or whatever in the Anti-Spam Wizard and
Kontact sets up the appropriate filter rules and toolbar buttons.
I think it's cool that they don't try to reinvent the wheel (like
Thunderbird does) but let the users select their favorite, already mature
tools. (They also do it the same way for anti-virus filtering, which may
become relevant in a few years too...)
User interfaces
.. and then there's spam.el for Gnus users.
I think that running SpamAssassin with the network tests enabled is unfair and invalidates the statistic. After all, your mail was collected over some time and only then fed to SpamAssassin. So in the meantime, all the network databases that SpamAssassin queries already had the spam from your set marked by other users. This is very different to the situation where you pipe your incoming spam through SpamAssassin in real-time, because then most spam isn't marked yet.
That's a good point.
(of course that wouldn't help if everyone followed that approach)
It's also unfair in terms of performance. SA with only local tests is *much* faster (although, arguably, still not 'fast' in the bogofilter sense of the word).
dbacl - a digramic Bayesian classifier
There is also dbacl
http://dbacl.sourceforge.net/
Or you can always route all your emails through gmail. If you measure effectiveness by (quality of filtering)/(amount of effort), they win hands down. Of course, you may not be happy with the folks at Google seeing a copy of all your emails...
That's what I do, but not because of antispam. You cannot control
the policies, except by manually flagging single messages in your
browser, which is not the most efficient of interfaces.
are to Usenet.
in both directions. I couldn't find an ISP who does this in my area.)
Disclaimer: I'm a bogofilter developer and the original author of the bogotune utility.
Bogofilter works fairly well out of the box, with minimal training, as you saw. With careful tuning, particularly of the spam and nonspam cutoff values, and a rather larger amount of training, results like 0.5% false negatives and <1 in 150,000 false positives are attainable. See the papers at http://www.bgl.nu/bogofilter/ for details. The work involved is nontrivial but rewarding.
My main problem with bogofilter is its use of berkeley DB. Today I've finally switched to sqlite
instead and I hope that this will stop the seemingly neverending database corruptions that I kept
experiencing. Apart from that, bogofilter's a pretty useful little tool.
I've dabbled with a few of the programs in the article, but settled on an easier to use (for me) solution.
a less conventional method
at the risk of repeating myself, check the popfile IMAP module, it should integrate seamlessly with your standard process, no need for thunderbird.
a less conventional method
I use SpamBayes on my home machine so I was particularly interested in how
this software stacked up against the others in the review. Jon, I didn't
see any mention of emails that fell into the SpamBayes "unsure"
classification. This feature is a life saver. Those are the only ones you
have to review and there aren't that many of them.
You might care to add SpamOracle to the list.
http://pauillac.inria.fr/~xleroy/software.html
http://packages.debian.org/unstable/net/spamoracle
My experience with it is most pleasurable, in a couple of words, it's
"just magical". Used in conjunction with procmail.
I've only got a feeling that its fp/fn ratio and values wouldn't be olympic, having used it for some 2 or 3 years...
spamoracle
We run an open source project that offers Bayesian filtering ( http://sourceforge.net/projects/mailwasher ), and it would be interesting to see how we fare compared to your selections.
Can we test ours too?
I could post the script, but I would be reluctant to post the test messages - that is a fair amount of my personal mail, after all. If I get a chance, I'll run the test and post the results.
Can we test ours too?
That would also be good. We test by sending all mail in a directory to the desired server - we do tend to get sidetracked onto the performance side of things, and it would be handy to see how it stacks up against the opposition when run by a far more independent source! Cheers, Steve
Can we test ours too?
I think the filter debate has missed one crucial fact. Most Spam is trying to pretend it is someone it is not. Almost all genuine mail actually comes from the server that it appears to.
Spam filters by DNS sender verification
daslynne@swiftdsl.com.au
Oh no. Doesn't Work.
Spam filters by DNS sender verification -- nooo
Here is one study [1] I wrote on the subject that may interest readers. See under topic "3.2 Advice for the normal account" where several bayesian tools are discussed.
http://pm-lib.sf.net/README.html
Links to more filters
1.0 Thoughts about increasing spam annoyance
1.1 Bouncing messages do no good
1.2 Rule based systems are not the solution
1.3 Challenge-Response systems make matters worse
1.3.1 Challenge-Response is not a doorbell but a gun shooting decoys
1.3.2 Modern C-R systems are no better
1.3.3 Questioning Challenge-Response system implementations
1.3.4 Summary - What are the effects of Challenge-Response systems
1.3.5 Further reading - Mail, C-R systems, Spam
1.3.6 Further reading - Some known C-R systems
1.4 Spam appearing in your yard - a story
2.0 A lightweight UBE block system with pure procmail
2.1 Suitable for accounts which ...
2.2 Where to put "pure procmail" UBE checks?
2.3 Using Procmail Module Library to fight spam
3.0 A heavyweight UBE blocking system
3.1 Advice for Debian Exim 4 mail system administrator
3.2 Advice for the normal account
3.3 Configuring Bayesian programs
3.4 A heavyweight spam catch setup using procmail
Not sure I want to read an article that classifies crm114 as a "python based tool"...
Python based?!