
Honeypots

Posted Feb 14, 2025 21:15 UTC (Fri) by malmedal (subscriber, #56172)
In reply to: Honeypots by mb
Parent article: Fighting the AI scraperbot scourge

Yes, blocking will not help much, but once you have identified a bot you can redirect it to a sacrificial server and keep the entire botnet busy by serving lots of fake links. You can also throttle the sacrificial server's network so it does not impact real traffic.
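
A minimal sketch of that hand-off, purely as an illustration (nothing here is from the article or LWN's actual setup): a front end could answer flagged clients with a redirect to the sacrificial box. The is_known_bot() check, the /maze/ prefix, and tarpit.example.com are all invented names; a sketch of the maze pages themselves follows further down the thread.

    # Sketch only: redirect clients flagged as bots to a sacrificial host.
    # is_known_bot(), the /maze/ prefix and tarpit.example.com are invented
    # for illustration; a real deployment would likely do this in the
    # front-end proxy rather than in Python.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SACRIFICIAL_HOST = "http://tarpit.example.com"  # assumed throttled box

    def is_known_bot(handler):
        # Placeholder check: here, "the request followed a honeypot link".
        return handler.path.startswith("/maze/")

    class FrontEnd(BaseHTTPRequestHandler):
        def do_GET(self):
            if is_known_bot(self):
                # Hand the bot off; its follow-up requests hit the tarpit.
                self.send_response(302)
                self.send_header("Location", SACRIFICIAL_HOST + self.path)
                self.end_headers()
                return
            body = b"real content\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), FrontEnd).serve_forever()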



Honeypots

Posted Feb 14, 2025 21:20 UTC (Fri) by mb (subscriber, #50428) (13 responses)

How do you identify an actual bot that only hits you once per week?
It could be a human user.

Honeypots

Posted Feb 14, 2025 21:59 UTC (Fri) by malmedal (subscriber, #56172) (11 responses)

> How do you identify a bot as an actual bot that only hits you once per week?

Typically because it followed a honeypot link; at that point you give it a web page consisting of only such links.

The idea is that the bot will spread these links to other members of the botnet, so subsequent bots from other IPs will be immediately recognised and get the same treatment. Hopefully, over time this should direct most of the botnet to the sacrificial server and leave the real one alone.
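
To make that concrete, here is a minimal sketch (my own, under the same invented /maze/ prefix as above) of a page consisting of nothing but fresh honeypot links:

    # Sketch of a link-maze page: every request under the invented /maze/
    # prefix gets a page of random links back into /maze/, so a crawler
    # that follows one honeypot link is fed nothing but more of them.
    import random
    import string
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def maze_page(n_links=50):
        links = []
        for _ in range(n_links):
            slug = "".join(random.choices(string.ascii_lowercase, k=12))
            links.append('<a href="/maze/%s">%s</a>' % (slug, slug))
        return ("<html><body>" + "<br>".join(links) + "</body></html>").encode()

    class Maze(BaseHTTPRequestHandler):
        def do_GET(self):
            body = maze_page()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8081), Maze).serve_forever()

Since the slugs are random, the maze never repeats, and serving it needs no database or content lookups.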

Honeypots

Posted Feb 14, 2025 22:05 UTC (Fri) by mb (subscriber, #50428) (10 responses)

>at that point you give it

But it's already over. You served the request and you spent the resources.
That is the problem.
The CPU/traffic load has already happened by the time you identify the bot. And then it will basically never hit you again, unless you keep a multi-terabyte, timeout-less ban database, with the risk of putting your real users into it.

Honeypots

Posted Feb 14, 2025 22:27 UTC (Fri) by malmedal (subscriber, #56172) (9 responses)

> But it's already over.

It's not. The bot will report the links it found back to the rest of the botnet and then other bots will come for those links.

> multi terabyte timeout-less database

No database is needed.

Honeypots

Posted Feb 14, 2025 22:28 UTC (Fri) by mb (subscriber, #50428) (8 responses)

>and then other bots will come for those links.

And consume traffic and CPU.
Lost.

Honeypots

Posted Feb 14, 2025 23:01 UTC (Fri) by malmedal (subscriber, #56172) (7 responses)

> And consume traffic and CPU.

From the sacrificial server, yes. So the real one gets less load.

Honeypots

Posted Feb 14, 2025 23:15 UTC (Fri) by mb (subscriber, #50428) (6 responses)

>From the sacrificial server, yes

Which costs real, non-sacrificial money. Why would it cost less money than the "real" server?

This is a real problem.
It is a real problem for my machines, too.
And I really don't see a solution other than
a) buying more resources or
b) potentially punishing real users.

This is a real threat to users. I am currently selecting b), because I don't think I can win at a).

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 0:32 UTC (Sat) by dskoll (subscriber, #1630) (5 responses)

The sacrificial server can be less beefy than the real server because it doesn't have to generate real content that might involve DB lookups and such. And it can dribble out responses very slowly (like 10 bytes per second) to keep the bots connected but not tie up a whole lot of bandwidth, using something like this.
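
For what it's worth, a dribbling tarpit along those lines can be sketched in a few lines of Python (single-threaded here for brevity; a real one would multiplex many connections at once, and the rate and payload size are invented):

    # Sketch of the dribble idea: send roughly 10 bytes per second so each
    # bot holds its connection open for minutes while costing almost no
    # bandwidth. Handles one connection at a time for brevity only.
    import socket
    import time

    HEADER = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"

    def tarpit(port=8082, bytes_per_sec=10):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))
        srv.listen(16)
        while True:
            conn, _addr = srv.accept()
            try:
                conn.recv(4096)           # read (and ignore) the request
                payload = HEADER + b"<html>" + b"x" * 10000
                for i in range(0, len(payload), bytes_per_sec):
                    conn.sendall(payload[i:i + bytes_per_sec])
                    time.sleep(1)         # ~10 bytes per second
            except OSError:
                pass                      # bot gave up; move on
            finally:
                conn.close()

    if __name__ == "__main__":
        tarpit()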

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 0:49 UTC (Sat) by mb (subscriber, #50428) (4 responses)

If the "sacrificial servers" don't exhaust the bots, then the bots will just go back to the real servers.
Bot administrators are not stupid. Bots are optimized for maximal throughput, no matter what.

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 1:58 UTC (Sat) by dskoll (subscriber, #1630)

Yes, sure, but you might be able to tie some of them up in the tarpit for a while. Ultimately, a site cannot defend against a DDoS on its own; it has to rely on its upstream provider(s) to do their part.

My reply was for the OP who asked how the sacrificial server could be run more cheaply than the real server.

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 10:28 UTC (Sat) by malmedal (subscriber, #56172) (2 responses)

> If the "sacrificial servers" don't exhaust the bots, then the bots will just go back to the real servers.

Yes, obviously. That's why I called this a "mitigation", not a "cure".

LWN does not want to do things like CAPTCHAs, JS challenges, or putting everything behind a login. Can you think of a better approach while adhering to the stated constraints?

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 10:35 UTC (Sat) by mb (subscriber, #50428) (1 response)

>can you think of a better approach while adhering to the stated constraints?

No. That was my original point.

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 10:53 UTC (Sat) by malmedal (subscriber, #56172)

Then I don't understand what we are quarreling about. I think a sacrificial server is going to be a cheaper solution than expanding real capacity, and I don't see a third option.

Honeypots

Posted Feb 20, 2025 17:22 UTC (Thu) by hubcapsc (subscriber, #98078)


The article mentions: "Watching the traffic on the site, one can easily see scraping efforts that are fetching a sorted list of URLs in an obvious sequence, but the same IP address will not appear twice in that sequence."

It should be possible, then, to write a program that can see the scraping efforts in the traffic and respond by sending back 404s on the predicted sorted list for a minute or so. There are probably other, better ideas for responses :-) ...
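
A rough sketch of such a detector (all thresholds invented, and this is only my reading of the idea): track whether incoming URLs keep arriving in sorted order, ignoring the source IP, and once a run is long enough, 404 anything that continues the sequence for the next minute:

    # Sketch: detect requests walking a sorted URL list (the IPs differ,
    # so we key on order alone) and 404 the predicted continuation for a
    # while. RUN_LENGTH and BLOCK_SECONDS are invented values.
    import time

    RUN_LENGTH = 5        # in-order requests before we call it a scrape
    BLOCK_SECONDS = 60    # how long to 404 the predicted sequence

    class ScrapeDetector:
        def __init__(self):
            self.last_url = ""
            self.run = 0
            self.block_until = 0.0

        def should_404(self, url):
            now = time.time()
            in_order = url > self.last_url
            self.run = self.run + 1 if in_order else 0
            self.last_url = url
            if self.run >= RUN_LENGTH:
                self.block_until = now + BLOCK_SECONDS
            return in_order and now < self.block_until

A request handler would consult should_404() before doing any real work and short-circuit with a 404 whenever it returns True.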

Back in the good old days when spam was simpler, it was easy to see that you'd opened a spam without reading it, because it came from xyz789@someweirdplace, because of how it was formatted, and because of several other tells... milters had just come out, and I wrote a milter that caught a ton of spam with almost no false positives.

