Weblog Comments - A New Frontier for Spam
Weblogs are sites that allow the owner to post articles and essays on whatever happens to strike their fancy that day, and most weblog software enables readers to post comments on those stories. LWN's comment system provides the same feature for this site but, unlike LWN comments, many weblogs allow (and even encourage) anonymous comments. That openness, like the lack of sender authentication for email, provides an avenue for abuse. Requiring registration before allowing comments does not eliminate the problem entirely (LWN has had a small amount of comment spam), but it does increase the amount of work the spammer must do.
The basic mode of attack uses a program to automatically post comments on multiple articles throughout the weblog. These unwanted messages include the URL of a website that will give you the opportunity to buy one or more of the usual items: diplomas, prescription drugs, porn, etc. The program then moves on to other sites running the same software, aided, no doubt, by the various publicly available directories of weblogs that use a particular software package. Eventually, Google and other search engines visit the weblog sites; thereafter, the spammer's site gains a high ranking because of all the links to it that are found.
One of the more popular (though not entirely free) packages for running a weblog is Movable Type; its user community has been the most active so far in combating comment spam. For example, one set of tips (described by Yoz Grahame) attempts to thwart the way the current spam programs work by changing the default behavior of the software. Something as simple as changing the "post a comment" link can be sufficient to confuse most automated comment-posting scripts. These techniques will only help until enough people implement them that it becomes worth the effort for a spammer to write more adaptable code to circumvent them.
Many of the other comment spam handling techniques will seem very familiar to anyone who has been dealing with the deluge of email spam: Bayesian filtering and blacklisting based on the URLs in the comment and/or user profile are two of the more popular techniques. Bayesian filtering uses the frequency of words in a message and a database of word counts from previous messages that have been categorized as spam or non-spam (often called "ham") to determine the probability that the new message is spam. If the probability is too high, the message is rejected. The blacklisting patch collects the URLs that are advertised in the offending messages and rejects any comments that refer to any of those URLs. Both of these techniques can be worked around by a spammer with enough incentive, but they do make the job much more difficult.
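To make the mechanics concrete, here is a minimal sketch, in Python, of how a Bayesian score and a URL blacklist could be combined when a new comment arrives. This is not code from Movable Type or from any of the patches mentioned; the word probabilities, the blacklist entry, the threshold, and the function names are all illustrative assumptions.

    import math
    import re

    # Hypothetical per-word spam probabilities, derived from a database of
    # word counts in previously categorized spam and "ham" comments.
    WORD_SPAM_PROB = {"diploma": 0.97, "prescription": 0.95, "kernel": 0.05}
    URL_BLACKLIST = {"spammer-site.example.com"}   # hosts seen in earlier spam
    SPAM_THRESHOLD = 0.9                           # tunable cutoff

    def spam_probability(text):
        """Combine per-word probabilities in the usual naive-Bayes fashion."""
        words = re.findall(r"[a-z']+", text.lower())
        probs = [WORD_SPAM_PROB[w] for w in words if w in WORD_SPAM_PROB]
        if not probs:
            return 0.5                             # no evidence either way
        # Work in log space so many small probabilities don't underflow.
        log_spam = sum(math.log(p) for p in probs)
        log_ham = sum(math.log(1.0 - p) for p in probs)
        return 1.0 / (1.0 + math.exp(log_ham - log_spam))

    def reject_comment(text):
        """Reject if a blacklisted URL appears or the Bayesian score is too high."""
        hosts = re.findall(r"https?://([^/\s]+)", text)
        if any(host.lower() in URL_BLACKLIST for host in hosts):
            return True
        return spam_probability(text) > SPAM_THRESHOLD

In a real filter the word counts would be updated as the weblog owner classifies new comments, which is what lets the filter adapt over time.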
Another technique that is becoming more popular is the email- and web-based challenge-response system, which generates a blurry graphic that is (presumably) only readable by humans. Such systems require that the text in the graphic be typed into a form to ensure that a human, and not a program, is initiating the action. This technique, too, has made its way into the arsenal of webloggers via this plug-in for Movable Type. The scheme does have downsides, though: it requires a graphical browser to post messages and may be unusable by the visually impaired.
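As an illustration of the general approach (this is not the Movable Type plug-in itself), the sketch below uses the Python Imaging Library to render random text with jittered positions and speckle noise. The site would store the returned answer in the commenter's session, serve the image, and accept the comment only if the typed text matches.

    import random
    import string
    from PIL import Image, ImageDraw, ImageFont

    def make_challenge(length=5):
        """Return (answer, image) for a simple text-in-a-noisy-image challenge."""
        answer = "".join(random.choice(string.ascii_uppercase) for _ in range(length))
        image = Image.new("RGB", (30 * length + 20, 50), "white")
        draw = ImageDraw.Draw(image)
        font = ImageFont.load_default()
        for i, ch in enumerate(answer):
            # jitter each character's position a little to frustrate simple OCR
            draw.text((15 + 30 * i + random.randint(-3, 3),
                       18 + random.randint(-6, 6)), ch, fill="black", font=font)
        for _ in range(300):
            # sprinkle speckle noise over the whole image
            draw.point((random.randrange(image.width),
                        random.randrange(image.height)), fill="gray")
        return answer, image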
Other weblogging software developers may have run into this problem and come up with their own sets of fixes, but the Movable Type community appears to be at the forefront of this particular battle. Perhaps the spammers have yet to target other systems in an automated way. If (or, more likely, when) they do, the newly targeted weblogging software can use one or more of the techniques above to combat the spam.
Both weblog comment and email spam fighters are running into the same issues and, in many cases, producing similar solutions; cooperation between the two groups should lead to better spam fighting. One of the future plans for Jay Allen's blacklist (above) is to create a distributed list of URLs that are being advertised via spam; with proper controls, one can imagine that list being useful to the email spam fighting crowd as well. A filter using SpamAssassin's rules for email message bodies might likewise be useful for folks confronting spam in their weblog comments.
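As a rough sketch of that last idea, a weblog could hand each incoming comment to a running SpamAssassin daemon through the spamc client's check-only mode; the fake mail header and the helper name here are illustrative, and spamc's exit code is the only SpamAssassin behavior being relied upon.

    import subprocess

    def looks_like_spam(comment_text, sender="commenter@example.com"):
        """Score a weblog comment with SpamAssassin as if it were a mail body.

        With -c, spamc prints "score/threshold" and exits with a non-zero
        status when the message scores as spam.  A minimal header is
        prepended because the rules expect a mail-like message.
        """
        message = "From: %s\nSubject: weblog comment\n\n%s\n" % (sender, comment_text)
        result = subprocess.run(["spamc", "-c"], input=message,
                                capture_output=True, text=True)
        return result.returncode != 0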
Index entries for this article
GuestArticles: Edge, Jake
Posted Oct 30, 2003 3:22 UTC (Thu)
by arcticwolf (guest, #8341)
[Link] (2 responses)
There's an easy solution for this problem: don't allow anonymous comments.
Did you read the whole article?
Posted Oct 30, 2003 4:44 UTC (Thu)
by Ross (guest, #4065)
[Link] (1 responses)
The article suggested exactly what you do. It also said that it won't completely fix the problem but it at least increases the amount of work that a spammer has to do. Furthermore, sometimes that is also considered too much of a hassle for the people that would like to submit comments.
Posted Oct 30, 2003 9:10 UTC (Thu)
by pointwood (guest, #2814)
[Link]
I like to play with an idea of having to solve a puzzle before being allowed to post or log in - that would solve the problem of stupid comments as well :-)
Posted Oct 30, 2003 9:16 UTC (Thu)
by eru (subscriber, #2753)
[Link] (1 responses)
> Another technique that is becoming more popular is email and web-based challenge-response systems which generate a blurry graphic that is (presumably) only readable by humans. [....] This scheme does have a number of downsides because it requires a graphical browser to post messages and may be unusable by the visually impaired.
The second accessibility problem might be fixable by adding a similar method that replaces the image with a link to a sound file reciting a short sequence of random numbers (composed from snippets each saying a single numeral). Visually impaired people are likely to use an audio-capable browser. To make it harder to break this by computer analysis, each site should use files composed of numerals spoken by different people, and maybe also add some intentional background noise.
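A rough sketch of that idea, assuming a set of pre-recorded WAV snippets (ideally spoken by several different people) whose filenames are supplied by the caller; adding background noise is left out for brevity, and all the names here are illustrative.

    import random
    import wave

    def make_audio_challenge(snippet_paths, out_path="challenge.wav", digits=5):
        """Build a spoken challenge by concatenating per-numeral recordings.

        snippet_paths maps each digit character ("0".."9") to a WAV file of
        someone saying that numeral; the snippets are assumed to share the
        same sample rate, width, and channel count.  Returns the code the
        poster must type back into the comment form.
        """
        code = "".join(random.choice("0123456789") for _ in range(digits))
        out = wave.open(out_path, "wb")
        try:
            for i, digit in enumerate(code):
                snip = wave.open(snippet_paths[digit], "rb")
                if i == 0:
                    out.setparams(snip.getparams())   # copy format from first snippet
                out.writeframes(snip.readframes(snip.getnframes()))
                snip.close()
        finally:
            out.close()
        return code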
Visually impaired users and weblog Comments
Posted Nov 2, 2003 13:20 UTC (Sun)
by kevinbsmith (guest, #4778)
[Link]
> The second accessibility problem might be fixable by adding
> a similar method that replaces the image with a link to a
> sound file reciting a short sequence of random numbers
> (composed from snippets each saying a single numeral).
> Visually impaired people are likely to use an audio-capable browser.
Actually, a large number of visually impaired people use Braille output devices instead of spoken text. And of course audio "images" would still lock out lynx users.
Posted Oct 30, 2003 10:04 UTC (Thu)
by nchip (guest, #13292)
[Link]
In real life webmasters seem to resort to blocking unwanted IPs from accessing the HTTP server, via handmade lists. What people don't realize is that there are already publicly accessible lists you can use to protect your webserver too: http://www.blars.org/mod_access_rbl.html
Since spammers use open proxies or spam directly from their own IP space, you can use existing RBLs. It has the nice side-effect of protecting your site from email address crawling.
Of course, the tricky question is to choose the right RBL(s) to use; there are many RBLs with varying quality. The following ones are a good starting point, with a very low false positive rate:
Open proxy RBLs:
http://opm.blitzed.org/
http://cbl.abuseat.org/
Spammer IP space RBLs:
http://www.spamhaus.org/
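A minimal sketch of the kind of lookup being described, assuming the comment-posting code knows the client's IPv4 address; the zone names are examples, and a module such as mod_access_rbl would normally do this inside the web server instead.

    import socket

    # Example DNSBL zones; pick lists appropriate for your site.
    DNSBL_ZONES = ["opm.blitzed.org", "cbl.abuseat.org", "sbl.spamhaus.org"]

    def ip_is_listed(client_ip):
        """Return True if the client's IPv4 address appears in any DNSBL.

        DNSBLs are queried by reversing the octets of the address and
        appending the zone; a successful A-record lookup means "listed".
        """
        reversed_ip = ".".join(reversed(client_ip.split(".")))
        for zone in DNSBL_ZONES:
            try:
                socket.gethostbyname("%s.%s" % (reversed_ip, zone))
                return True      # listed in this zone
            except socket.gaierror:
                continue         # not listed here, try the next zone
        return False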
Posted Oct 30, 2003 10:06 UTC (Thu)
by rdowner (guest, #3960)
[Link]
Bill Thompson of BBC News Online covered this issue recently (http://news.bbc.co.uk/1/hi/technology/3210623.stm). His article contains the very depressing but accurate statement "any open medium can be so easily undermined by people with no scruples, no sense of responsibility and no idea of the damage they are doing." That, sadly, just about sums up all spammers.
Richard Downer
Posted Oct 30, 2003 10:32 UTC (Thu)
by james (subscriber, #1325)
[Link] (2 responses)
So where are Google and the W3C in all this?
Imagine a "type=untrusted" tag added to the <a href... element (by the comment software, after the commenter has submitted the comment). It would mean that the URI quoted did not come from the principal author(s) of a page, and so should be treated with suspicion.
Google could then use it to severely downgrade the importance of the link, which would then make it that much less useful to spammers. ("How much" could be a delicate balancing act: possibly a Google that plain ignored all such links would be more useful than one that could be spammed and subverted this way.)
Browsers could use this as a hint that the link was possibly dodgy. They could turn off javascript links, and maybe show an alert if there was something "odd" about the link (say www.microsoft.com@167772161, where the number is the IP address and the www.microsoft.com is the username).
Actually, they probably should show alerts on that last one anyway...
It shouldn't be so difficult for the site engine to add that automatically: it's one line of sed. There aren't that many popular comment-based packages, and the option to make the site much less useful for spam should be popular with users.
James
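A sketch of the rewrite James describes, done in Python rather than sed; the type="untrusted" attribute is the hypothetical marker proposed in the comment, not an existing HTML or search-engine feature.

    import re

    def mark_untrusted_links(comment_html):
        """Tag every link in submitted comment HTML as untrusted.

        The comment software would run this over the comment body before
        publishing it, so that search engines and browsers could discount
        or flag the links; it is the moral equivalent of one line of sed.
        """
        return re.sub(r"<a\s", '<a type="untrusted" ', comment_html,
                      flags=re.IGNORECASE)

    # mark_untrusted_links('<a href="http://example.com/">buy now</a>')
    # -> '<a type="untrusted" href="http://example.com/">buy now</a>'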
Posted Oct 30, 2003 10:51 UTC (Thu)
by hingo (guest, #14792)
[Link]
I suspect that google is already employing techniques to de-emphasize weblog-links. I remember reading that they had problems with some sites suddenly popping high up in their results, because a lot of weblogs were linking to the site because of some recent and interesting event. (Consider it a google-variant of the slashdot effect if you will.)
How they are doing it I don't know, but what you are describing is exactly what google would want to do anyway: A "real" link is more valuable than one found in a weblog, especially one in a weblog-comment.
henrik
Posted Oct 31, 2003 23:29 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
> the URI quoted did not come from the principal author(s) of a page, and so should be treated with suspicion.
I don't really see that that's reason for much suspicion. If lots of people are referring to web sites in weblog comments, that's generally as good a reason for me to be interested in that web site as if the keepers of weblogs were referring to it.
Bayesian Filters.
Posted Oct 30, 2003 13:15 UTC (Thu)
by petebull (subscriber, #7857)
[Link]
Mmmmmh, Bayesian-filtered slashdot. I kinda like that idea.
Posted Oct 31, 2003 23:40 UTC (Fri)
by lacostej (guest, #2760)
[Link]
If the main intent of the spam poster is to be recognized by search engines, we could define a specification 'a-la robots.txt' that only returns sub-parts of web pages to search engine robots. Thus, for example, anonymous comments would automatically not be returned to the search engine, and that would defeat most of the spammer's intent.
Of course, then we don't have legitimate anonymous content recorded by the search engine. But that could be a feature, not a limitation :)