It has been known for years that spammers harvest web sites for email
addresses to add to their lists. Various sites have responded by hiding or
obfuscating email addresses found on their pages; some people go to extreme
lengths to keep their addresses from ever appearing on a page at all. One
wonders what they are worried about; your editor receives a mere 3,000 to
4,000 spams per day to his highly-public email address, after all.
Suffice it to say that, without SpamAssassin, LWN would likely have
collapsed under the flood years ago.
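The obfuscation mentioned above is often something simple; one common trick
is to emit the address as HTML character references, so that harvesters
grepping pages for user@host patterns come up empty. A minimal sketch (in
Python, using a made-up address) might look like:

```python
# Minimal sketch of one common obfuscation trick: emit each character
# of the address as an HTML character reference. A browser renders it
# normally, but a harvester grepping for "user@host" patterns sees
# nothing useful. The address below is, of course, invented.
def obfuscate(address):
    return "".join("&#%d;" % ord(c) for c in address)

print(obfuscate("user@example.com"))
# -> &#117;&#115;&#101;&#114;&#64;... (renders as user@example.com)
```

Of course, a determined harvester can decode character references without
much trouble, which is presumably why some people go further.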
Some folks have decided that it is time to take a more active stance
against the harvesting of email addresses from web pages. The result is an
Apache module called mod_spambot;
version 0.47 was recently released. The
idea behind this module is to detect accesses by address harvesters and
shut them down. Unfortunately, the approach this module takes is too
simplistic to work in many situations.
mod_spambot is essentially a traffic-throttling module. If a given client
pulls down too many pages in a given time period (the default is 100 pages
in one hour), its access is cut off. There is also a "honeypot" option
which will, instead, feed the (presumed) harvester a set of pseudo-random
pages with bogus email addresses in them (a sketch of this mechanism
appears after the list below). This approach may well cut off some
spammers, but anybody who has maintained a busy web site can see a few
problems fairly quickly:
- This approach will also cut off others who may be grabbing large
numbers of pages from the site. Search engines come to mind, as do
archive sites or anybody wanting to mirror a portion of a site.
Cutting off people who thoughtlessly run a recursive wget to
grab an entire site has some appeal; "download the site" operations
account for a substantial part of LWN's bandwidth usage. But most
site operators do not want to pull the plug on search engines and the
like. mod_spambot allows the administrator to construct a whitelist,
but who wants to figure out how to whitelist every possible search
engine of interest?
- There are some very large networks out there hiding behind a NAT
gateway and a single IP address. Traffic which looks like it
originates from a single host may, in fact, be generated by hundreds
of individual readers.
- Increasingly large amounts of traffic are generated by robots whose
sole purpose is to get a referrer URL onto a "top referrers" page
somewhere on the site. Purveyors of Internet gambling experiences and
particular types of imagery appear to like this approach to
marketing. The interesting thing is that these accesses come
simultaneously from a large number of IP addresses. These people,
clearly, are using a network of zombie machines for their attacks.
Spammers already use zombies to deliver their mail; it is hard to
believe that they would not use those machines for address harvesting as
well.
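To make those problems concrete, here is a minimal sketch of the kind of
per-address throttling described above. All names in it are invented for
illustration; the real mod_spambot is an Apache module written in C and
configured through Apache directives, not code like this:

```python
# A minimal sketch of per-address throttling in the style of
# mod_spambot. All names here are invented for illustration.
import time
from collections import defaultdict, deque

LIMIT = 100        # pages allowed per window (the module's default)
WINDOW = 3600      # window length in seconds (one hour)
WHITELIST = set()  # addresses exempted by the administrator

hits = defaultdict(deque)  # client address -> recent request times

def allow_request(addr, now=None):
    """Return True to serve the page normally; False to cut the
    client off (or, in honeypot mode, to serve a pseudo-random
    page full of bogus email addresses instead)."""
    if addr in WHITELIST:
        return True
    now = time.time() if now is None else now
    window = hits[addr]
    # Discard request times that have aged out of the window.
    while window and now - window[0] > WINDOW:
        window.popleft()
    if len(window) >= LIMIT:
        return False
    window.append(now)
    return True
```

Note how every reader behind a shared NAT address feeds the same counter,
while a harvester spreading its requests across a zombie network never
trips the limit on any single address.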
So throttling robots based on IP address will miss some attackers while
blocking legitimate users of the site. It would be nice to prevent one's
web site from being used as a resource by spammers, but this approach is
not yet the way to that end.