Proposed addition to Bayes

Posted Oct 16, 2003 6:32 UTC (Thu) by eru (subscriber, #2753)
In reply to: Open spam filtering rules considered harmful? by Ross
Parent article: Open spam filtering rules considered harmful?

I wish there was some way to include things like "the subject line contained 20 consecutive spaces" in the Bayesian filtering.

Why not assign spamminess weights to general features like this, and inject them into the Bayesian analysis exactly like the presence of certain words? I.e. if a feature-detecting rule fires, it is like the presence of a word.
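To make the idea concrete, here is a minimal sketch of what I mean. It is not taken from any existing filter: the `TinyBayes` class, the `tokenize` helper, the `RULE:` token prefix and the add-one smoothing are all assumptions made up for illustration. The only point is that a rule hit becomes one more token, so its weight is learned from the training mail like any word's.

```python
import math
import re
from collections import Counter

def tokenize(subject, body):
    """Return the set of word tokens plus pseudo-tokens for rules that fired."""
    tokens = set(re.findall(r"[a-z0-9$'-]+", body.lower()))
    # Hypothetical rule: 20 or more consecutive spaces in the subject line.
    if re.search(r" {20,}", subject):
        tokens.add("RULE:subject_long_space_run")
    return tokens

class TinyBayes:
    """Toy naive Bayes over token presence; not a production design."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, tokens):
        self.counts[label].update(tokens)
        self.totals[label] += 1

    def spam_probability(self, tokens):
        # Start from the smoothed prior odds, then add each token's log-ratio.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for t in tokens:
            p_spam = (self.counts["spam"][t] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][t] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))

clf = TinyBayes()
clf.train("spam", tokenize("Hi" + " " * 25 + "there", "buy pills now"))
clf.train("ham", tokenize("lunch", "see you at noon"))

with_rule = clf.spam_probability(tokenize("x" + " " * 20 + "y", "hello"))
without_rule = clf.spam_probability(tokenize("x y", "hello"))
print(with_rule, without_rule)
```

With this toy training set, the message whose subject trips the rule scores higher than the same message without it, even though the body word was never seen before.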

The actual features to be weighted need to be invented by humans, after looking at actual spam and imagining other plausible spam techniques. Some probable spam indicators I have seen:
- Is the message in HTML with lots of short HTML comments and/or invalid tags? (I would say a message with this feature is spam with 100% certainty)
- Is the message content just a single image?
- Is the message in HTML with invisible text? (like white text on white background)
- Is there lots of consecutive white space in the subject? (the feature you mentioned)
- Is there a space-separated sequence of several individual characters in the subject? (like "V I A G R A").
...
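As a rough sketch of how detectors for the indicators above might look, assuming regular expressions are good enough for a first try: the function name `detect_features`, the `RULE:` names and all thresholds (5 comments, 20 whitespace characters, 5 spaced-out characters) are made up for illustration. The invalid-tag check is left out, and the white-text check only looks for a white foreground colour without verifying the background.

```python
import re

def detect_features(subject, html_body):
    """Return the set of hypothetical rule names that fire for one message."""
    features = set()

    # Lots of short HTML comments, e.g. "pi<!--x-->lls" used to split words.
    short_comments = re.findall(r"<!--.{0,10}?-->", html_body, flags=re.S)
    if len(short_comments) >= 5:
        features.add("RULE:many_short_html_comments")

    # Body is just a single image: exactly one <img> and no visible text.
    stripped = re.sub(r"<!--.*?-->", "", html_body, flags=re.S)
    images = re.findall(r"<img\b[^>]*>", stripped, flags=re.I)
    visible_text = re.sub(r"<[^>]+>", "", stripped).strip()
    if len(images) == 1 and not visible_text:
        features.add("RULE:single_image_only")

    # Crude invisible-text check: a white foreground colour is declared.
    # The lookbehind skips "bgcolor" and "background-color".
    white = r"(?<![-\w])color\s*[=:]\s*[\"']?\s*(?:#fff(?:fff)?\b|white\b)"
    if re.search(white, html_body, flags=re.I):
        features.add("RULE:white_text")

    # Lots of consecutive white space in the subject.
    if re.search(r"\s{20,}", subject):
        features.add("RULE:subject_long_whitespace")

    # Space-separated single characters in the subject, like "V I A G R A".
    if re.search(r"(?:\b\w ){4,}\w\b", subject):
        features.add("RULE:spaced_out_letters")

    return features

print(detect_features("V I A G R A", "<html><body><img src='x.gif'></body></html>"))
```

Each name in the returned set can then be handed to the Bayesian filter as if it were a word found in the message.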

This is probably an obvious idea implemented somewhere, but here it is, if anyone is interested in trying it.

Proposed addition to Bayes

Posted Oct 16, 2003 14:49 UTC (Thu) by climent (guest, #7232)

This was discussed on the mailing list at some point, and it was shown to cause more problems than benefits, since rules are a deterministic way of providing information, while words are random tokens (from a Bayesian point of view).

Anyway, check the archives to read the real reasons. I may have made mine up in this post, since I am going from memory... ;)