A grumpy editor's bayesian followup
Posted Mar 2, 2006 16:16 UTC (Thu) by glouis (subscriber, #526)
Parent article: A grumpy editor's bayesian followup
As my remark last week implied, comparison between Bayesian filters as they operate "out of the box" can be misleading. Bogofilter's performance can be improved by an order of magnitude if appropriate training and tuning precede the performance test. The fact that it works moderately well "out of the box" in a wide variety of configurations is convenient for the newbie, but should not encourage the user to remain a newbie.
A comparison among carefully trained (and, if need be, tuned) Bayesian spam filters would obviously be somewhat difficult to organize, but would do your readers a much better service, IMHO. (I'd love to see the results of such a comparison myself!)
Training and tweaking
Posted Mar 2, 2006 16:22 UTC (Thu) by corbet (editor, #1)
FWIW, the filters *were* "carefully trained." Just over 2000 messages were pulled out of the stream and used only for that purpose. They were well inspected to avoid mistraining the filter. How much more careful does one need to be?
I did avoid tweaking the various knobs exported by some of the filters, with the well-documented SpamAssassin exception. I believe that was the right choice: most users (even those who are not "newbies") are unlikely to mess with them, and the defaults should be reasonable.
Training and tweaking
Posted Mar 2, 2006 17:07 UTC (Thu) by glouis (subscriber, #526)
We could discuss this further here, but I'd like to refer you to http://www.bgl.nu/bogofilter/tuning.html, which details a process by which bogofilter can be optimized. The process is onerous and requires a considerably larger number of messages than the 2,000 each that you mention. It can also be well worth the effort for a prospective production user. I don't mean in any way to imply that you should have done all that, plus the equivalent for your other tested packages, by yourself; it would have taken you months. I just want to avoid leaving people with the impression that your table reflects the performance of which bogofilter is capable.
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.