October 29, 2003
This article was contributed by Jake Edge.
The war over spam has
erupted recently in a new arena: weblog comments.
The parallels to the battles that have been fought on the email spam
front are considerable, but unlike email spam, weblog spam is targeted
at Google (and other search engines that use number of links to derive
page rankings) to increase the visibility of the sites that are being
advertised via spam. Comment spam seems to be on the rise, with weblog
owners noticing a large increase in the number of incidents over the last
month or two.
Weblogs are sites that allow the owner to post articles and essays on
whatever happens to strike their fancy that day, and most weblog software
enables readers to post comments on the stories. LWN's comment system provides
the same feature for this site but, unlike LWN comments, many weblogs allow
(and even encourage) anonymous comments. That openness, like the lack
of sender authentication for email, provides an avenue for abuse. Requiring
registration before allowing comments does not eliminate the problem
entirely (LWN has had a small amount of comment spam), but it does increase
the amount of work the spammer must do.
The basic mode of attack uses a program to automatically post comments
on multiple articles throughout the weblog. These unwanted messages include
the URL of a website that
will give you the opportunity to buy one or more of the usual items:
diplomas, prescription drugs, porn,
etc. The program then moves on to other sites using the same software,
aided, no doubt, by the various directories of weblogs using a particular
software package that are available. Eventually, Google and other search
engines visit the weblog sites; thereafter, the
spammer's site gains a high ranking due to all of the links to it that are
found.
One of the more popular (though not entirely free) packages for running a
weblog is
Movable Type; its user
community has been the most active so far in combating comment spam.
For example,
one set of tips
(described by Yoz Grahame)
attempts to thwart the way the current spam programs work by changing
the default behavior of the software. Something as simple as changing the
"post a comment" link can be sufficient to confuse most automated comment
posting scripts. These techniques will only help until
enough people implement them that it becomes worth a spammer's effort
to write more adaptable code to circumvent them.
Many of the other comment spam handling techniques will seem very familiar
to anyone who has been dealing with the deluge of email spam:
Bayesian filtering
and
blacklisting
based on the URLs in the comment and/or user profile are two of the more
popular techniques.
Bayesian filtering uses the frequency of words in
a message and a database of word counts
from previous messages that have been categorized as spam or non-spam
(often called "ham") to determine a probability that the new message is
spam. If the probability is too high, the message is rejected.
The blacklisting patch collects the URLs that are advertised in the offending
messages and rejects any comments that refer to any of those URLs.
Both of these techniques can be worked around by a spammer with enough
incentive, but they do make the spammer's job much more difficult.
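
As a rough illustration of the Bayesian approach described above, here is a
minimal Python sketch of a naive Bayes comment classifier. It is not the code
used by any Movable Type plug-in; the class name, the tokenizer, the add-one
smoothing, and the 0.9 threshold mentioned afterward are all assumptions made
for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class CommentFilter:
    """Hypothetical naive Bayes filter built on per-category word counts."""

    def __init__(self):
        # Word counts from previously categorized messages.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record a message already categorized as 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that a new message is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        words = tokenize(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common this category has been so far.
            log_p = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing so unseen words never yield log(0).
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

A weblog could train such a filter on comments the owner has already marked,
then reject any new comment whose `spam_probability()` exceeds a chosen
threshold (0.9, say). Where to set that threshold is a trade-off between
letting spam through and rejecting legitimate comments.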
Another technique that is becoming more popular is the email and web-based
challenge-response system, which generates a blurry graphic that is (presumably)
only readable by humans. Such systems require that the text in the graphic be typed
into a form to ensure that a human, and not a program, is initiating the
action. This technique, too, has made its way into the arsenal of webloggers
via
this plug-in
for Movable Type.
This scheme does have downsides: it requires a graphical
browser to post messages and may be unusable by the visually impaired.
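
The server-side half of such a challenge-response check might look like the
following Python sketch. It only covers issuing and verifying the code;
rendering the code as a distorted image is left out, and this is not how the
Movable Type plug-in is implemented. The function names, the secret key
handling, and the character set are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-site secret; a real site would keep this stable across
# restarts rather than regenerate it each time the program starts.
SECRET_KEY = secrets.token_bytes(32)

# Characters that are easy to confuse in a blurry image (0/O, 1/I) are omitted.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def _sign(code):
    """Return an HMAC of the code so the form never carries the plain text."""
    normalized = code.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


def new_challenge(length=6):
    """Create a random code (to be drawn into the graphic) and its token.

    The token goes into a hidden form field; the code itself is shown only
    inside the image.
    """
    code = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return code, _sign(code)


def verify_response(typed_text, token):
    """Check that what the visitor typed matches the signed token."""
    return hmac.compare_digest(_sign(typed_text), token)
```

As written, a token could be replayed once a human has solved it, so a real
deployment would also want to expire each token or record that it has been used.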
Other weblogging software developers may have run into this problem and come up
with their own sets of fixes, but the Movable Type community appears to be
at the forefront of this particular battle. Perhaps the spammers have
yet to target other systems
in an automated way. If (or more likely when) they do, newly targeted weblogging software can
use one or more of the techniques above to combat the spam.
Both weblog comment and email spam fighters are running into the same issues
and, in many cases, producing similar solutions; cooperation between the
two groups will lead to better spam fighting.
One of the future plans for Jay Allen's blacklist
(above) is to create a distributed list of URLs that are being advertised
via spam; with proper controls, one can imagine that list being useful
to the email spam fighting crowd. A filter using the rules for email
message bodies in
SpamAssassin might be useful
for folks confronting spam in their weblog comments as well.
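
Whether the list of advertised URLs is kept locally or shared, checking a
comment against it comes down to extracting the sites the comment links to and
comparing them with the list. The Python sketch below shows one way that might
work. It is not Jay Allen's code, and the function names, the URL pattern, and
the choice to match on host names rather than full URLs are assumptions.

```python
import re
from urllib.parse import urlsplit

# Simple pattern for http/https links embedded in comment text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(text):
    """Return the set of host names linked to from the given text."""
    hosts = set()
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL; ignore it
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        hosts.add(host)
    return hosts


def is_blacklisted(comment_text, author_url, blacklist):
    """Reject a comment if its body or the author's URL names a listed site.

    `blacklist` is a set of lowercase host names; subdomains of a listed
    host are treated as listed too.
    """
    hosts = extract_hosts(comment_text + " " + (author_url or ""))
    return any(
        host == bad or host.endswith("." + bad)
        for host in hosts
        for bad in blacklist
    )
```

Matching on host names rather than exact URLs means a spammer cannot evade the
list simply by varying the path, though registering a new domain still works.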