
Fighting the AI scraperbot scourge

By Jonathan Corbet
February 14, 2025
There are many challenges involved with running a web site like LWN. Some of them, such as finding the courage to write for people who know more about the subject matter than we do, simply come with the territory we have chosen. But others show up as an unwelcome surprise; the ongoing task of fending off bots determined to scrape the entire Internet to (seemingly) feed into the insatiable meat grinder of AI training is certainly one of those. Readers have, at times, expressed curiosity about that fight and how we are handling it; read on for a description of a modern-day plague.

Training the models for the generative AI systems that, we are authoritatively informed, are going to transform our lives for the better requires vast amounts of data. The most prominent companies working in this area have made it clear that they feel an unalienable entitlement to whatever data they can get their virtual hands on. But that is just the companies that are being at least slightly public about what they are doing. With no specific examples to point to, I nonetheless feel quite certain that, for every company working in the spotlight, there are many others with model-building programs that they are telling nobody about. Strangely enough, these operations do not seem to talk to each other or share the data they pillage from sites across the net.

The LWN content-management system contains over 750,000 items (articles, comments, security alerts, etc) dating back to the adoption of the "new" site code in 2002. We still have, in our archives, everything we did in the over four years we operated prior to the change as well. In addition, the mailing-list archives contain many hundreds of thousands of emails. All told, if you are overcome by an irresistible urge to download everything on the site, you are going to have to generate a vast amount of traffic to obtain it all. If you somehow feel the need to do this download repeatedly, just in case something changed since yesterday, your traffic will be multiplied accordingly. Factor in some unknown number of others doing the same thing, and it can add up to an overwhelming amount of traffic.

LWN is not served by some massive set of machines just waiting to keep the scraperbots happy. The site is, we think, reasonably efficiently written, and is generally responsive. But when traffic spikes get large enough, the effects will be felt by our readers; that is when we start to get rather grumpier than usual. And it is not just us; this problem has been felt by maintainers of resources all across our community and beyond.

In discussions with others and through our own efforts, we have looked at a number of ways of dealing with this problem. Some of them are more effective than others.

For example, the first suggestion from many is to put the offending scrapers into robots.txt, telling them politely to go away. This approach offers little help, though. While the scraperbots will hungrily pull down any content on the site they can find, most of them religiously avoid ever looking at robots.txt. The people who run these systems are absolutely uninterested in our opinion about how they should be accessing our site. To make this point even more clear, most of these robots go out of their way to avoid identifying themselves as such; they try as hard as possible to look like just another reader with a web browser.
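For readers unfamiliar with the mechanism: robots.txt is a plain-text file at the site root listing which user agents may crawl what. A typical refusal looks like the following (the bot names are examples of self-identifying crawlers; the scrapers described above never fetch the file at all):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /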

Throttling is another frequently suggested solution. The LWN site has implemented basic IP-based throttling for years; even in the pre-AI days, it would often happen that somebody tried to act on a desire to download the entire site, preferably in less than five minutes. There are also systems like commix that will attempt to exploit every command-injection vulnerability its developers can think of, at a rate of thousands per second. Throttling is necessary to deal with such actors but, for reasons that we will get into momentarily, throttling is relatively ineffective against the current crop of bots.

Others suggest tarpits, such as Nepenthes, that will lead AI bots into a twisty little maze of garbage pages, all alike. Solutions like this bring an additional risk of entrapping legitimate search-engine scrapers that (normally) follow the rules. While LWN has not tried such a solution, we believe that this, too, would be ineffective. Among other things, these bots do not seem to care whether they are getting garbage or not, and serving garbage to bots still consumes server resources. If we are going to burn kilowatts and warm the planet, we would like the effort to be serving a better goal than that.

But there is a deeper reason why both throttling and tarpits do not help: the scraperbots have been written with these defenses in mind. They spread their HTTP activity across a set of IP addresses so that none reach the throttling threshold. In some cases, those addresses are all clearly coming from the same subnet; a certain amount of peace has been obtained by treating the entire Internet as a set of class-C subnetworks and applying a throttling threshold to each. Some operators can be slowed to a reasonable pace in this way. (Interestingly, scrapers almost never use IPv6).
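For illustration only (this is not LWN's code), a per-/24 throttle of the kind described above might be sketched in Python as follows; the window and threshold numbers are made up:

    import ipaddress
    import time
    from collections import defaultdict, deque

    WINDOW = 60.0      # seconds of history to keep (illustrative)
    THRESHOLD = 100    # requests allowed per /24 per window (illustrative)

    hits = defaultdict(deque)   # "/24 prefix" -> timestamps of recent requests

    def allow_request(client_ip, now=None):
        """Return True if the request should be served, False if throttled."""
        now = time.monotonic() if now is None else now
        addr = ipaddress.ip_address(client_ip)
        # Group IPv4 clients by /24; the rare IPv6 client gets grouped by /64.
        prefix = 24 if addr.version == 4 else 64
        key = str(ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False))
        window = hits[key]
        while window and now - window[0] > WINDOW:
            window.popleft()          # drop hits that fell out of the window
        if len(window) >= THRESHOLD:
            return False              # over the per-subnet threshold: send a 429
        window.append(now)
        return True

    if __name__ == "__main__":
        served = sum(allow_request("203.0.113.7") for _ in range(200))
        print(f"served {served} of 200 requests from 203.0.113.0/24")

As the following paragraph explains, though, this only slows down operators that keep their traffic within a subnet; it does nothing against requests spread across millions of unrelated addresses.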

But, increasingly, the scraperbot traffic does not fit that pattern. Instead, traffic will come from literally millions of IP addresses, where no specific address is responsible for more than two or three hits over the course of a week. Watching the traffic on the site, one can easily see scraping efforts that are fetching a sorted list of URLs in an obvious sequence, but the same IP address will not appear twice in that sequence. The specific addresses involved come from all over the globe, with no evident pattern.

In other words, this scraping is being done by botnets, quite likely bought in underground markets and consisting of compromised machines. There really is not any other explanation that fits the observed patterns. Once upon a time, compromised systems were put to work mining cryptocurrency; now, it seems, there is more money to be had in repeatedly scraping the same web pages. When one of these botnets goes nuts, the result is indistinguishable from a distributed denial-of-service (DDOS) attack — it is a distributed denial-of-service attack. Should anybody be in doubt about the moral integrity of the people running these systems, a look at the techniques they use should make the situation abundantly clear.

That leads to the last suggestion that often is heard: use a commercial content-delivery network (CDN). These networks are working to add scraperbot protections to the DDOS protections they already have. It may come to that, but it is not a solution we favor. Exposing our traffic (and readers) to another middleman seems undesirable. Many of the techniques that they use to fend off scraperbots — such as requiring the user and/or browser to answer a JavaScript-based challenge — run counter to how we want the site to work.

So, for the time being, we are relying on a combination of throttling and some server-configuration work to clear out a couple of performance bottlenecks. Those efforts have had the effect of stabilizing the load and, for now, eliminating the site delays that we had been experiencing. None of this stops the activity in question, which is frustrating for multiple reasons, but it does prevent it from interfering with the legitimate operation of the site. It seems certain, though, that this situation will only get worse over time. Everybody wants their own special model, and governments show no interest in impeding them in any way. It is a net-wide problem, and it is increasingly unsustainable.

LWN was born in the era when the freedom to put a site onto the Internet was a joy to experience. That freedom has since been beaten back in many ways, but still exists for the most part. If, though, we reach a point where the only way to operate a site of any complexity is to hide it behind one of a tiny number of large CDN providers (each of which probably has AI initiatives of its own), the net will be a sad place indeed. The humans will have been driven off (admittedly, some may see that as a good thing) and all that will be left is AI systems incestuously scraping pages from each other.

Index entries for this article
Security/Web



Proof of work challenges?

Posted Feb 14, 2025 16:42 UTC (Fri) by chexo4 (subscriber, #169500) [Link] (36 responses)

There are some simple, client side JavaScript “captcha” libraries that use proof of work. Might help slow down the scrapers with legitimately obtained IPs. But I doubt it will affect the botnets.
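For what it's worth, the idea is easy to sketch. The server hands out a random challenge and only serves the page once the client returns a nonce whose hash has enough leading zero bits; the real work would be done in client-side JavaScript, but the following Python sketch (difficulty and encoding chosen arbitrarily) shows both sides of the exchange:

    import hashlib
    import os

    DIFFICULTY_BITS = 18   # ~2**18 hash attempts on average (illustrative)

    def make_challenge():
        return os.urandom(16).hex()

    def leading_zero_bits(digest):
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                bits += 8 - byte.bit_length()
                break
        return bits

    def verify(challenge, nonce):
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY_BITS

    def solve(challenge):
        """What the client-side JavaScript would do: grind nonces until one verifies."""
        nonce = 0
        while not verify(challenge, nonce):
            nonce += 1
        return nonce

    challenge = make_challenge()
    nonce = solve(challenge)          # costs the requester noticeable CPU time
    assert verify(challenge, nonce)   # costs the server a single hash

The asymmetry is the point: one hash to check, many to produce. As the replies below note, though, it requires JavaScript, and it does little against a botnet with stolen CPU time to burn.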

Proof of work challenges?

Posted Feb 14, 2025 17:07 UTC (Fri) by corbet (editor, #1) [Link] (35 responses)

That is a subset of the JavaScript-based challenges mentioned in the article. The problem is that we have long gone out of our way to avoid forcing readers to enable JavaScript, and we hear regularly that this attention is appreciated. Being forced to change that approach by these people would be ... sad.

Proof of work challenges?

Posted Feb 14, 2025 17:32 UTC (Fri) by malmedal (subscriber, #56172) [Link] (18 responses)

How about a honey-pot link? Something that would be invisible to humans, so they shouldn't normally click on it, but you could do a captcha to allow humans to continue if they did so by mistake or curiosity?

Honeypots

Posted Feb 14, 2025 17:37 UTC (Fri) by corbet (editor, #1) [Link] (17 responses)

We have considered honeypots — indeed, some code was even written. But when you have thousands of systems each hitting two or three pages, tripping up the one that gets assigned to the honeypot link is not going to change the situation much.

Honeypots

Posted Feb 14, 2025 18:07 UTC (Fri) by malmedal (subscriber, #56172) [Link] (16 responses)

many honey-pots per page, like more than real links?

making each URL unique per user?

Hmm. Might be problematic because it's often used for tracking.

Maybe if you made it really obvious for any humans:
https://lwn.net/current/removethiswhenlinkingorbookmarking/<long unique string>

Honeypots

Posted Feb 14, 2025 19:06 UTC (Fri) by mb (subscriber, #50428) [Link] (15 responses)

How do you react, if a bot hits a honey pot?
The bot will only send you one request per week. So blocking the IP address is basically useless.

Honeypots

Posted Feb 14, 2025 21:15 UTC (Fri) by malmedal (subscriber, #56172) [Link] (14 responses)

Yes, blocking will not help much, but once you have a bot you can redirect it over to a sacrificial server, and keep the entire botnet busy by giving lots of fake links. You can also throttle the network of the sacrificial server so it does not impact real traffic.

Honeypots

Posted Feb 14, 2025 21:20 UTC (Fri) by mb (subscriber, #50428) [Link] (13 responses)

How do you identify a bot as an actual bot that only hits you once per week?
It could be a human user.

Honeypots

Posted Feb 14, 2025 21:59 UTC (Fri) by malmedal (subscriber, #56172) [Link] (11 responses)

> How do you identify a bot as an actual bot that only hits you once per week?

Typically because it followed a honeypot link; at that point you give it a web page consisting of only such links.

The idea is that the bot will spread these links to other members of the botnet, so subsequent bots from other IPs will be immediately recognised and get the same treatment. Hopefully, over time, this should direct most of the botnet over to the sacrificial server and leave the real one alone.
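For illustration, one way to build such a trap without keeping any per-URL state (a hypothetical /hp/ URL prefix and HMAC-tagged tokens, not anything LWN actually runs) is to make the honeypot links self-identifying:

    import hashlib
    import hmac
    import os
    import secrets

    SECRET = os.urandom(32)   # would be a persistent server-side secret in practice

    def honeypot_url():
        """Build a self-identifying trap URL; nothing needs to be stored server-side."""
        token = secrets.token_hex(8)
        tag = hmac.new(SECRET, token.encode(), hashlib.sha256).hexdigest()[:16]
        return f"/hp/{token}-{tag}/"

    def is_honeypot(path):
        """Recognize a trap URL by recomputing its tag, even weeks later."""
        if not path.startswith("/hp/"):
            return False
        try:
            token, tag = path.strip("/").split("/", 1)[1].rsplit("-", 1)
        except ValueError:
            return False
        expected = hmac.new(SECRET, token.encode(), hashlib.sha256).hexdigest()[:16]
        return hmac.compare_digest(tag, expected)

    def trap_page(n_links=50):
        """A cheap page consisting of nothing but further trap links."""
        links = "\n".join(f'<a href="{honeypot_url()}">more</a>' for _ in range(n_links))
        return f"<html><body>{links}</body></html>"

Any later request whose path passes is_honeypot() came, directly or indirectly, from a crawler that followed a trap link, which also addresses the "no database is needed" point made further down the thread.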

Honeypots

Posted Feb 14, 2025 22:05 UTC (Fri) by mb (subscriber, #50428) [Link] (10 responses)

>at that point you give it

But it's already over. You served the request and you spent the resources.
That is the problem.
The CPU/traffic load has already happened by the time you identify the bot. And then it will basically never hit again, unless you keep a multi-terabyte, timeout-less database, with the risk of putting your real users into that terabyte ban database.

Honeypots

Posted Feb 14, 2025 22:27 UTC (Fri) by malmedal (subscriber, #56172) [Link] (9 responses)

> But it's already over.

It's not. The bot will report the links it found back to the rest of the botnet and then other bots will come for those links.

> multi terabyte timeout-less database

No database is needed.

Honeypots

Posted Feb 14, 2025 22:28 UTC (Fri) by mb (subscriber, #50428) [Link] (8 responses)

>and then other bots will come for those links.

And consume traffic and CPU.
Lost.

Honeypots

Posted Feb 14, 2025 23:01 UTC (Fri) by malmedal (subscriber, #56172) [Link] (7 responses)

> And consume traffic and CPU.

From the sacrificial server, yes. So the real one gets less load.

Honeypots

Posted Feb 14, 2025 23:15 UTC (Fri) by mb (subscriber, #50428) [Link] (6 responses)

>From the sacrificial server, yes

Which costs real non sacrificial money. Why would it cost less money than the "real" server?

This is a real problem.
It is a real problem for my machines, too.
And I really don't see a solution that is not
a) buy more resources or
b) potentially punish real users

This is a real threat to users. I am currently selecting b), because I think I can't win a).

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 0:32 UTC (Sat) by dskoll (subscriber, #1630) [Link] (5 responses)

The sacrificial server can be less beefy than the real server because it doesn't have to generate real content that might involve DB lookups and such. And it can dribble out responses very slowly (like 10 bytes per second) to keep the bots connected but not tie up a whole lot of bandwidth, using something like this.
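The linked tool is not named in the comment text here, but the dribbling idea itself is simple; a minimal asyncio sketch (port and page content invented) that keeps each bot connection alive at roughly ten bytes per second looks like this:

    import asyncio

    GARBAGE = (b"<html><body>"
               + b"<p>nothing to see here</p>" * 200
               + b"</body></html>")

    async def tarpit(reader, writer):
        """Answer any request, then dribble the body out ten bytes per second."""
        await reader.read(4096)     # read and discard whatever the bot sent
        writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
        await writer.drain()
        for i in range(0, len(GARBAGE), 10):
            writer.write(GARBAGE[i:i + 10])
            await writer.drain()
            await asyncio.sleep(1)  # ten bytes, once per second
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(tarpit, "0.0.0.0", 8081)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())

Because each connection spends almost all of its time sleeping, thousands of them can be held open by one small process.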

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 0:49 UTC (Sat) by mb (subscriber, #50428) [Link] (4 responses)

If the "sacrificial servers" don't exhaust the bots, then the bots will just go back to the real servers.
Bot administrators are not stupid. Bots are optimized for maximal throughput, no matter what.

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 1:58 UTC (Sat) by dskoll (subscriber, #1630) [Link]

Yes, sure, but you might be able to tie some of them up in the tar pit for a while. Ultimately, a site cannot defend against a DDOS on its own; it has to rely on its upstream provider(s) to do their part.

My reply was for the OP who asked how the sacrificial server could be run more cheaply than the real server.

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 10:28 UTC (Sat) by malmedal (subscriber, #56172) [Link] (2 responses)

> If the "sacrificial servers" don't exhaust the bots, then the bots will just go back to the real servers.

Yes, obviously. That's why I called this a "mitigation", not a "cure".

LWN does not want to do things like captcha, js-challenges or putting everything behind a login, can you think of a better approach while adhering to the stated constraints?

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 10:35 UTC (Sat) by mb (subscriber, #50428) [Link] (1 responses)

>can you think of a better approach while adhering to the stated constraints?

No. That was my original point.

Honeypots (and tarpits, oh my!)

Posted Feb 15, 2025 10:53 UTC (Sat) by malmedal (subscriber, #56172) [Link]

Then I don't understand what we are quarreling about? I think a sacrificial server is going to be a cheaper solution than expanding real capacity, I don't see a third option.

Honeypots

Posted Feb 20, 2025 17:22 UTC (Thu) by hubcapsc (subscriber, #98078) [Link]


The article mentions: "Watching the traffic on the site, one can easily see scraping efforts
that are fetching a sorted list of URLs in an obvious sequence, but the same IP address will
not appear twice in that sequence. "

It should be possible, then, to write a program that can see the scraping efforts in the traffic.
Respond by sending back 404s on the predicted sorted list for a minute or so. There are
probably other, better ideas for responses :-) ...

Back in the good old days when spam was simpler, it was easy to see that you'd opened a
spam without reading it because it came from xyz789@someweirdplace and because of
how it was formatted and several other tells... milters had just come out, and I wrote a
milter that caught a ton of spam with almost no false positives.

Proof of work challenges?

Posted Feb 14, 2025 17:46 UTC (Fri) by karkhaz (subscriber, #99844) [Link] (12 responses)

Sorry to hear about this, it doesn't sound like fun to deal with. I definitely appreciate the lack of JavaScript and CDNs, but not to the point where it's actively impacting your business. You're using JavaScript to get paid for subscriptions, after all.

> avoid forcing readers to enable JavaScript

You already generate a different page for logged-in readers, right? Would it be easy to simply omit the JavaScript countermeasures for subscribers?

Proof of work challenges?

Posted Feb 14, 2025 17:50 UTC (Fri) by corbet (editor, #1) [Link] (11 responses)

You would be surprised how many people struggle with the JavaScript requirement for the credit-card form; I wish we could do away with that. (OTOH, doing it that way means that credit-card numbers never pass through our systems at all, and we like that).

We could definitely disable it for logged-in users (we already do that with some countermeasures). But you still have to get past the login form first.

Proof of work challenges?

Posted Feb 16, 2025 6:56 UTC (Sun) by wtarreau (subscriber, #51152) [Link] (4 responses)

We're doing that routinely with a number of our haproxy users whose job is to deal with attacks and content scraping all day (it really started with StackExchange 15 years ago: https://blog.serverfault.com/2010/08/26/1016491873/). There are indeed some tradeoffs to accept, but you don't necessarily have to accept them all the time. Among the possible approaches are:
- sending a JS challenge only during periods of intense activity: the site needs to defend itself and triage good vs bad, so only during those periods will it need users to support JS. I understand that it's not very welcome here.
- session cookies to recognize users (usually a complement to the approach above): they allow you to check the activity per cookie and continue to let good actors read without problems while bad actors are rejected. Since some users might reject cookies, the option here is to slow down the first cookie-less request so that only they (and bots) are slowed down during periods of high activity. Also, you can be certain that scrapers will learn cookies, so cookies can also be used to negatively flag them once identified as certainly malicious.
- redirect with a URL parameter: that's an alternative to a cookie, but it usually requires adapting the site to pass the same argument to all internal links (remember sites with "/?...;JSESSIONID=xxx"?).
- you can count the number of URLs per IP (or IP range) and per period. A normal user will not read 1000 URLs in 10 minutes, even when actively searching. And since I guess the site doesn't have 100 million daily users, you can enlarge the source range on which you're applying the checks. Even if that covers an enterprise's proxies, you'll hardly have enough readers in one enterprise to visit 1000 URLs in 10 minutes. A scraper will often do way more than that, distributed over multiple addresses.

There are also approaches that consist of differentiating possibly good from possibly bad actors. You'll notice that bots that scrape the site's contents, for example, do not retrieve /favicon.ico, because it's not linked to from the pages; only browsers do. This can be used to tag incoming requests as "most likely suspicious" until /favicon.ico is seen. The same goes for some of the images, like the logo on the top left, etc.

When downloads are spread over many clients, it's common for those clients to be dumb and to only download contents. The contents are then processed, links are extracted and sent to a central place where they're deduplicated, then distributed again to a bunch of clients. So there's usually no continuity in the URLs visited by a given client. Some sites, such as blogs, often have a relation in their URLs that makes it possible to figure out whether requests for a page's objects mostly come from the same address (and session) as the main page. If objects from the same page come from 10 different addresses in a short time, there's definitely an abuse, and you can flag both the session and those IPs as malicious and slow them down.

Overall, and most importantly, you must not block, only slow down. That allows you to deal with false positives (because there are still a lot of them in these situations) without hurting the site's accessibility significantly. Bots that retrieve contents usually have short timeouts because they're used to hurting web sites, and their goal is not to stay hooked waiting for a site, but to download; with short timeouts they can reuse their resources to visit another place. It means that, often, causing a few seconds of pause from time to time can be sufficient to significantly slow them down and prevent them from hurting your site.

Also, it's important to know what costs you more during such scraping sessions: bandwidth? (then it's possible to rate-limit the data), CPU? (then it's possible to rate-limit the requests), memory? (then it's possible to limit the concurrent requests per block), etc. Let's just not act on things that could possibly degrade the site's experience if not needed. E.g. if the bandwidth is virtually unlimited, better to act on other aspects, etc.

However, one important piece of advice is to *not* publicly describe the methods you've chosen. StackExchange figured this out by themselves after publishing the article above; they later had to tweak their setup because scrapers adapted. You'd be amazed to see how many scrapers are started by developers who visit the site before and during scraping to check what they're missing. If you describe your protections too much, they have all the info they need to bypass the limits. That's another benefit of the slow-down, by the way: from the scraper's perspective there's no difference between slowing down by policy and slowing down due to intense activity, and the lack of feedback to the attacker is crucial here.
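As an aside, the favicon heuristic mentioned above is easy to sketch (purely for illustration; the data structure and timing here are arbitrary), and it fits the "slow down, don't block" advice:

    import time

    seen_favicon = {}   # client key (IP, subnet, or session cookie) -> last favicon fetch

    def note_request(client, path):
        """Call for every request; remember clients that behave like real browsers."""
        if path == "/favicon.ico":
            seen_favicon[client] = time.monotonic()

    def probably_browser(client, max_age=3600.0):
        ts = seen_favicon.get(client)
        return ts is not None and time.monotonic() - ts < max_age

    def extra_delay(client, under_attack):
        """During busy periods, delay (never block) clients that look suspicious."""
        if under_attack and not probably_browser(client):
            return 3.0   # seconds of added latency for unconfirmed clients
        return 0.0

As a reply further down notes, some readers disable favicon fetching entirely, which is exactly why a signal like this can only adjust probabilities rather than gate access.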

Proof of work challenges?

Posted Feb 17, 2025 13:36 UTC (Mon) by paavaanan (subscriber, #153846) [Link] (1 responses)

Excellent sum-up..!

Proof of work challenges?

Posted Feb 20, 2025 10:32 UTC (Thu) by jtepe (subscriber, #145026) [Link]

I agree. Very valuable advice!

Proof of work challenges?

Posted Mar 25, 2025 0:13 UTC (Tue) by dagobayard (subscriber, #174025) [Link] (1 responses)

I disable favicons in my browser for performance reasons. (My net link is LTE fixed wireless.)

--
Ian

Proof of work challenges?

Posted Aug 30, 2025 14:46 UTC (Sat) by wtarreau (subscriber, #51152) [Link]

This is a perfect example of such heuristics being only usable to adjust probabilities and not to block.

Proof of work challenges?

Posted Feb 19, 2025 7:21 UTC (Wed) by marcH (subscriber, #57642) [Link] (5 responses)

> We could definitely disable it for logged-in users (we already do that with some countermeasures). But you still have to get past the login form first.

Not 100% sure what you meant here, but I guess even the dumbest scrapers will not download the (javascript-free) login form a million times per minute when they fail to find anything else on the unauthenticated, javascript-enabled site.

Login forms and scrapers

Posted Feb 19, 2025 10:39 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

There's a separate issue for the scrapers to consider if they start engaging with login forms and the like; they end up on shakier legal ground.

If they stick to HTTP GET and downloading content that's "freely available upon request", their only issue is copyright law, and it looks like training a model on content may not be (in and of itself) a breach of copyright. This is a nice safe space to be; it's really hard to come up with language that will make use of your scraper criminal without also making it illegal to (for example) use an outdated browser to view content, since GET is supposed to be a mere read of the site, logically speaking, so you get implicit authorization for the purposes of things like the US's CFAA, or the UK's Computer Misuse Act in as far as the site doesn't block you.

If they issue a POST request, which is explicitly intended to alter the site's behaviour in some fashion, they get into a riskier place; a login form is gating access behind an agreement you made with the site operator, for example, and you've now written code that is designed to ignore that agreement and get access anyway. That, in turn, runs the risk of bringing the CFAA and similar laws in other countries into play, elevating use of your scraper from at most a civil matter to a criminal matter. And while you might feel well-funded enough to survive a civil challenge, a criminal case is a very different beast.

Login forms and scrapers

Posted Feb 19, 2025 14:02 UTC (Wed) by daroc (editor, #160859) [Link]

That's an interesting observation. Right now, our countermeasures seem to be doing just fine — we haven't seen any more big lag spikes since Jon rolled out the changes he discusses in the article — but maybe in the future we could consider having an interstitial page that requires a POST to get past. It could explain the issue with the scrapers, ask you to click a button confirming that you are a human, and then set a cookie so you don't see it on subsequent visits.

I'll keep that in my ideas folder in case the bot traffic steps up another notch.
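A minimal sketch of that interstitial, using Flask purely for brevity (the route, cookie name, and lifetime are invented here, not anything LWN has deployed):

    from flask import Flask, make_response, redirect, request

    app = Flask(__name__)

    INTERSTITIAL = """
    <html><body>
      <p>Automated scraping is overloading this site.
         Please confirm that you are a human reader.</p>
      <form method="POST" action="/confirm">
        <button type="submit">I am human</button>
      </form>
    </body></html>
    """

    @app.before_request
    def require_confirmation():
        # Anyone who has already confirmed (or is submitting the form) passes through;
        # a real deployment would also let logged-in subscribers straight in.
        if request.cookies.get("human-ok") or request.path == "/confirm":
            return None
        return INTERSTITIAL, 200

    @app.post("/confirm")
    def confirm():
        resp = make_response(redirect("/"))
        resp.set_cookie("human-ok", "1", max_age=90 * 24 * 3600)
        return resp

Nothing here requires JavaScript and, per the comment above, a crawler that POSTs its way past the form is in a rather different legal position than one merely issuing GETs.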

Proof of work challenges?

Posted Feb 19, 2025 14:44 UTC (Wed) by corbet (editor, #1) [Link] (2 responses)

The dumbest scrapers will demonstrably bash their virtual heads against the 429 "slow down" failure code indefinitely. They will repeatedly hit HTTP URLs that have been returning permanent redirects for over a decade. I would not expect smarter responses to a login form.

Proof of work challenges?

Posted Feb 19, 2025 15:42 UTC (Wed) by marcH (subscriber, #57642) [Link] (1 responses)

> I would not expect smarter responses to a login form.

Right, so they will not understand the javascript-free login form much, not feel "frustrated" by it and will not scrape it more frequently than any other page. They will not scrape it more frequently in order to reach some non-existent, per-site quota either.

So, if all other pages are made inaccessible unless you're either logged in OR answer some javascript challenge, then the javascript-free login form is the only page they scrape, the entire site is just one page for them and that's a huge win from a load perspective. No?

And the site is still accessible without Javascript! You just need to be logged in.

What did I miss? (Besides an even better solution)

Proof of work challenges?

Posted Feb 19, 2025 15:56 UTC (Wed) by corbet (editor, #1) [Link]

The initial comment suggested using JavaScript to filter out bots, but omitting it for subscribers. I pointed out that even subscribers have to get past the login form to be recognized as such, and will thus run into JavaScript there.

It may come to that, but I do not want to put impediments in the way of people reading our stuff. The paywall is already a big one, but many people come into LWN via our archives, and putting barriers there would surely turn some of them away. We will put a lot of energy into exploring alternatives before we do that.

Proof of work challenges?

Posted Feb 15, 2025 4:37 UTC (Sat) by DimeCadmium (subscriber, #157243) [Link]

A suggestion, if ever you do find the need to implement JS challenges: Continue allowing people to sign up/log in without JS, and to use the site without JS once logged in. :)

(And yes, that attention is appreciated)

Proof of work challenges?

Posted Feb 16, 2025 4:23 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

Yes, lynx user here, and I know I’m not the only one.

Thank you.

Proof of work challenges?

Posted Feb 28, 2025 14:17 UTC (Fri) by jqpg (guest, #176259) [Link]

hi. my wild thoughts :)

1- A fake index.html asking for a simple math challenge (not a JS challenge), then go to the real index.html.
Don't forget w3m-like surfers: viewing images is not available to them, so the challenge can't contain images.
A math challenge per IP, a math challenge per epoch, limited retries, and so on... combine them if you need to try hard.

2- Put *only* old content, or certain types of content, behind such a challenge.

3- Reject all clients that don't have some custom "http header" or some special "user agent" value.
Maybe you put those "specials" in your fake index.html.

4- Every link on every page must be suffixed (concatenated) with some "special keyword" only visible to human *eyes* to access.

5- And so on, and so on...

forgive me for being *too* clever, and my English :)

Legal remedies

Posted Feb 14, 2025 17:22 UTC (Fri) by rgmoore (✭ supporter ✭, #75) [Link] (3 responses)

> Everybody wants their own special model, and governments show no interest in impeding them in any way.

If the scrapers are using botnets, I sincerely doubt the government will be able to do much to stop them. Governments don't seem to be able to do much to stop botnets more generally, so there's no reason to think they'll do any better for this specific case.

Legal remedies

Posted Mar 27, 2025 13:29 UTC (Thu) by Velmont (guest, #46433) [Link] (2 responses)

I used to work at Hola for a while.

It made tons of money by selling proxying through residential IPs. Users got Hola VPN for free in exchange for letting their IP be used as a proxy by companies that wanted to scrape or check sites from 'real' IPs. That service was called Luminati back then, but it is now Bright Data.

I believe the scrapers are using that company, or some competitor, though I guess Hola + Bright Data is probably the biggest provider around still. So that's why the requests seem to come from all around the world with no real system to the IPs, it is because they are *actually* coming from real residential homes and computers.

> 299 million free users
> Because we believe in democratizing access to online content, our free desktop and Android versions accomplish exactly this, while you contribute a small amount of resources to our peer-to-peer community network.

So legally obtained even, since it's in the TOS and even marketing material for Hola.

All very sad.

Legal remedies

Posted Mar 27, 2025 15:21 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

FWIW, as a normal, residential user, Hola is incredibly useful. Love it.

Legal remedies

Posted Mar 27, 2025 16:52 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

Useful for what? Being a TOR exit node without an actual TOR?

Licencing pipe-dream

Posted Feb 14, 2025 17:59 UTC (Fri) by smoogen (subscriber, #97) [Link] (12 responses)

There are times where I wish that there was a way to GPL poison pill the data feeding AI. [This content may only be used in a generative AI if any user of said database makes directly available all content used for that database.] However this is so unworkable in practice, I just move back to thinking about easier pipe-dreams like FTL travel.

Licencing pipe-dream

Posted Feb 14, 2025 18:28 UTC (Fri) by notriddle (subscriber, #130608) [Link] (7 responses)

The GPL comes up surprisingly often for something that probably isn't relevant. Most of these bots put zero effort into checking the license of the data they download, and much of it is plain old All Rights Reserved copyright (including LWN's news articles, if I'm reading the fine print correctly, and a lot of the code on GitHub).

It's impossible to violate the GPL without first violating plain copyright ("You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to [four freedoms]."). If something else *does* grant you that permission, like how search engines and critical reviews are allowed to excerpt primary sources under fair use, then the GPL doesn't matter.

And if it turns out that training an LLM isn't fair use, then the GPL probably isn't permissive enough, because LLMs don't seem to be able to accurately fulfill the attribution requirement.

Licencing pipe-dream

Posted Feb 14, 2025 19:17 UTC (Fri) by daroc (editor, #160859) [Link]

Our news articles are not all rights reserved — they're available (after the subscriber period) under a Creative Commons Attribution-ShareAlike 4.0 license.

... which, admittedly, it should probably actually say somewhere in the site footer and in the FAQ, not just on some old 'about the site' pages.

During the subscriber period (one week after the first weekly edition in which the article appears), they are plain old all-rights-reserved, though.

Licencing pipe-dream

Posted Feb 14, 2025 23:01 UTC (Fri) by khim (subscriber, #9252) [Link] (3 responses)

> If something else *does* grant you that permission, like how search engines and critical reviews are allowed to excerpt primary sources under fair use, then the GPL doesn't matter.

Note that one of the reasons why courts declared that search engines have an implied license was the fact that they followed the robots.txt rules. The exceptio probat regulam in casibus non exceptis principle gave them an implied license.

But I wonder what these guys who ignore all these rules and use botnets say about all that.

Most likely something like "if we don't do that then we are screwed, but if we do then a small fine is the most they would push on us"… I suspect.

Licencing pipe-dream

Posted Feb 15, 2025 16:21 UTC (Sat) by kleptog (subscriber, #1183) [Link] (2 responses)

The EU AI regulation effectively codifies the use of robots.txt into law. But that doesn't help you if you can't determine who is doing it.

It makes you wonder: are all these people writing their own scrapers, or are they using some common scraper library that doesn't have robots.txt support. In the latter case you could fix the library to default to honouring the robots.txt and at least solve it for the people who just fire and forget.

Basic cra

Posted Feb 15, 2025 17:15 UTC (Sat) by Tobu (subscriber, #24111) [Link] (1 responses)

Here's an example of an image download tool that has resisted implementing robots.txt. Their README asks for nonstandard headers to opt out. PRs as simple as defining the User-Agent were not merged either.

Basic cra

Posted Feb 15, 2025 18:18 UTC (Sat) by dskoll (subscriber, #1630) [Link]

Wow, the maintainer of img2dataset is a piece of work...

I searched my logs for the default user-agent he uses to pretend to be something else, and it only ever hits images... never any real pages. So I've blocked that user-agent. It now gets 403 Forbidden.

Licencing pipe-dream

Posted Feb 16, 2025 0:25 UTC (Sun) by edeloget (subscriber, #88392) [Link] (1 responses)

> The GPL comes up surprisingly often for something that probably isn't relevant.

The GPL itself might not be relevant, but surely we can craft a specific license for that usage, one that 1) plays nice with all other licenses, including non-free ones 2) is highly viral by nature.

The goal would not be to make all published text freely available, but to either force the distribution of the dataset in which the text is included or, if the dataset contains copyrighted materials that cannot be redistributed, forbid the integration of the work. The license clauses would only trigger when the work is integrated into a dataset, and the license text could even be written so that it also contaminates adjacent datasets that are used in conjunction with a dataset already covered by this license.

Of course, once the dataset is contaminated, you'd also better make sure that the models created using this dataset shall be made free (as in speech). A license that creates new freedoms for users; that does not seem impossible to do, does it? A "Dataset Freedom License".

(Also, make sure that the type of work is not enforced by the license, so that you could protect anything with it, ranging from code to images to songs...)

With such a license, the scraping bots are still an annoyance, but any indication that a covered work ended up in a specific dataset could be the cause of some spectacular court actions :) And I think that could make dataset creators think twice.

Licencing pipe-dream

Posted Feb 16, 2025 7:55 UTC (Sun) by kleptog (subscriber, #1183) [Link]

> Of couse, once the dataset is contaminated, you'd also better make sure that the models created using this dataset shall be made free

That would require that copyright law accepts the idea that an AI model is a derivative of its input data in the copyright law sense. I really don't see that happening.

The most you can hope for is that there is a requirement that the input data was lawfully obtained, but I suspect the licence isn't going to be relevant.

Caveat: neither the legislatures in Europe nor the Supreme Court in the US have addressed the issue directly yet, so who knows. The EU's AI Act so far supports this.

Licencing pipe-dream

Posted Feb 17, 2025 19:19 UTC (Mon) by ringerc (subscriber, #3071) [Link] (3 responses)

GenAI scrapers have decided for themselves, absent any evidence or reasonable legal theory, that feeding data into a model "washes" it of its license and conditions of use. If you can't get the exact same document out in its entirety, apparently it's just fine to appropriate.

I'm not free from sin when it comes to self serving arguments about things I want to download and use under terms other than those the legal owners (if not creators) intended. But I'm also not running a multi billion dollar business on it.

Imagine for one glorious moment that the Napster theory of copyright infringement damages was applied to GenAI outfits.

Licencing pipe-dream

Posted Feb 18, 2025 9:42 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

> If you can't get the exact same document out in its entirety, apparently it's just fine to appropriate.

For better or for worse, this is not *too* far removed from how courts would actually analyze the situation (except that courts in most jurisdictions would remove the "in its entirety" part and replace it with a big complicated gray area). See e.g. https://en.wikipedia.org/wiki/Substantial_similarity

Licencing pipe-dream

Posted Feb 18, 2025 11:14 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> GenAI scrapers have decided for themselves, absent any evidence or reasonable legal theory, that feeding data into a model "washes" it of its license and conditions of use. If you can't get the exact same document out in its entirety, apparently it's just fine to appropriate.

Except that that is - sort of - codified in law! After all, what GenAI scrapers are doing is *exactly* *the* *same* as you reading a book. The book is copyright, and you've just copied that material from the pages of the book into your brain! And if you can't regurgitate that book word for word, then that MAY be fine.

Which is - in Europe at least - EXACTLY how AI scrapers are treated. If the *output* bears little resemblance to the training data, then copyright is not violated. If it's regurgitated exactly, then copyright is violated. European law explicitly says slurping the data IN is not a copyright violation. It says nothing about regurgitating it ... and it says nothing about other laws, like trespassing to gain access to it.

The problem of course, is the people who don't give a damn about wasting OTHER PEOPLE's money, like LWN's. In other words, people who ignore restrictions like robots.txt. I think European law treats that like a door-lock - "don't come in without permission", but the trouble is enforcing it :-(

Cheers,
Wol

Licencing pipe-dream

Posted Feb 18, 2025 13:05 UTC (Tue) by pizza (subscriber, #46) [Link]

> The problem of course, is the people who don't give a damn about wasting OTHER PEOPLE's money, like LWN's. In other words, people who ignore restrictions like robots.txt. I think European law treats that like a door-lock - "don't come in without permission", but the trouble is enforcing it :-(

I've increasingly blocked more and more swaths of the internet from servers I control. Even after that, bandwidth usage has gone up by approximately two orders of magnitude versus 2020, and when you're paying for server resources entirely out of your own pocket, throwing more resources at the problem gets really hard to justify.

The problem is that the "legit" players are also playing this same "gorge yourself as quickly as possible" game. For example, I've had to twice block Google's AI scraper for effectively performing DoS attacks. This behavior is distinct from their pre-AI scraper, which remains well-behaved.

[/grumble]

Treat it as an optimization challenge!

Posted Feb 14, 2025 18:02 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Another option is to treat it as a free stress test to check the optimizations. Hey, if life gives you lemons, then it's a good time to make lemonade!

Treat it as an optimization challenge!

Posted Feb 14, 2025 21:03 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

There is a difference between a stress test and a DDoS. Every service, no matter how "simple" and "optimized" it may claim to be, has a limit to how many queries per second it can serve before you have to start throwing hardware at the problem. In the case of LWN, I would expect that the service is pretty scalable, if optimized appropriately, but at some point you will need to open your wallet, and that is (roughly) the point where it stops being fun and games and starts being an "attack."

Treat it as an optimization challenge!

Posted Feb 14, 2025 22:49 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Oh, I know. I was helping my neighbor to run their small specialty store as a custom Wordpress site, and they had to migrate it to Etsy because the bot situation got untenable.

Not just hand out the old articles

Posted Feb 14, 2025 19:01 UTC (Fri) by HenrikH (subscriber, #31152) [Link] (7 responses)

Perhaps put content that is older than a few months back behind the paywall while allowing access if it is a direct link coming from a whitelisted search engine?

Not just hand out the old articles

Posted Feb 14, 2025 20:55 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

> while allowing access if it is a direct link coming from a whitelisted search engine?

They can just forge the Referer header. I would be surprised if they did not already do that anyway.

Not just hand out the old articles

Posted Feb 14, 2025 21:06 UTC (Fri) by corbet (editor, #1) [Link] (3 responses)

Putting older articles behind the paywall would defeat one of the primary purposes of LWN, as a long-term archive of community discussions, events, and decisions. There are <checks> 251 links to LWN in the 6.14-rc kernel source; we don't want to break them. A lot of our (legitimate) traffic comes from places other than search engines.

We also would not want to get into the business of deciding which search engines are legitimate. And even if we could do that properly, the referrer information is entirely under the control of the bots, there is no way to know that a hit actually came from a given search engine. So it's an interesting thought, but I don't see it as being workable.

Not just hand out the old articles

Posted Feb 15, 2025 20:02 UTC (Sat) by HenrikH (subscriber, #31152) [Link]

Yeah, I get it; LWN articles come up extremely often when you search for technical details, so this is an extremely valuable resource. I guess the problem with direct links vs. site traversal is that you cannot determine which traversal engine is a real search engine and which one is some AI bot.

Well, if not paywalled, then very old articles could perhaps be rate limited, not per IP but across all articles as such; of course the problem then is that the bots will fill up the limit, preventing the real readers from accessing the old articles... Yeah, this stuff is hard.

Not just hand out the old articles

Posted Feb 19, 2025 14:12 UTC (Wed) by draco (subscriber, #1792) [Link] (1 responses)

Perhaps you make it so that when you're logged out, you get any page you have a direct link for, but all the links on that page are stripped out, except a link to the login page. You could also put in a banner or link to a page explaining the situation. (You could also do the whole paywall "above the fold" bit if that helps manage server load.)

That way, direct links work for archival purposes, but if you want to navigate the site, you need to log in.

Heck, based on https://lwn.net/Articles/1010764/ (farnz's CFAA comment), it wouldn't even need to be a real login, just a page where you agree to terms of service and get a cookie recording that after POSTing your agreement.

Not just hand out the old articles

Posted Apr 5, 2025 4:15 UTC (Sat) by yodermk (subscriber, #3803) [Link]

I like that idea a lot.

One issue might be search engines, but I suppose the main engines publish a list of their indexer IP ranges, and if a request is from one of them, you just include the links.

Not just hand out the old articles

Posted Feb 15, 2025 8:17 UTC (Sat) by nhippi (subscriber, #34640) [Link] (1 responses)

Paywall is overkill, loginwall is enough.

Not just hand out the old articles

Posted Feb 15, 2025 12:02 UTC (Sat) by Wol (subscriber, #4433) [Link]

Create a guest login, put the password on the front page, change it every month or so.

People don't have to create a login, remember a password, any of that crap. Yes it's a hassle, but if you change it every six weeks (maybe less) with an eight week "remember me" cookie, it's not much grief to real people.

Cheers,
Wol

Nepenthes should be able to work

Posted Feb 14, 2025 19:41 UTC (Fri) by Curan (subscriber, #66186) [Link]

With some extra work stuff like Nepenthes should work. Normal users will not click more than a few links out of curiosity. Which means if you start remembering the last generated links and another IP picks up there, you have another immediate candidate for throttling or the tarpit.

(If you want to be really nasty: redirect them to pages with code, that contains subtle errors, leading to eg. memory leaks. Though, going by the quality of all the code suggestions I've seen so far: they're on that way already.)

All that being said: defending against an actual DDoS is hard, but very often your hosting provider can help. Mine does that automatically once they detect an attack. And they run their own on-premise system; no external CDNs, especially the likes of Cloudflare, are used.

Sympathy

Posted Feb 14, 2025 21:59 UTC (Fri) by tux3 (subscriber, #101245) [Link]

And a thank you for the hard work containing this plague behind the scenes.

I naively like the nepenthes idea from an optimization perspective: assume the bots cannot be blocked or throttled; serve them pages that are very cheap on the server.

They will continue making requests, hopefully not much faster than before, but ideally a large fraction of them is steered away from the expensive path.

Well... even that idea feels like a loss.
If we lived in a different world, this is where a higher structure might be equipped to cut down on this abuse. But spam and robocalls suggest we may be stuck living with that plague from now on.

these AI scrapers suck

Posted Feb 14, 2025 23:54 UTC (Fri) by NUXI (subscriber, #70138) [Link]

I was dealing with this recently on a much smaller site. The site has a webapp that tends to trap web crawlers with a nearly endless supply of "unique" pages, so I put it in robots.txt a long time ago. But now these AI scrapers are ignoring that and getting stuck. The bots aren't really putting much load on the server, but having them downloading stuff 24/7 for an extended period of time was consuming a lot of bandwidth.

I noticed they all had rather outdated Chrome user agents though. So now anything that ignores robots.txt and has an old enough Chrome user agent gets served up the EICAR test file.

Consider Adding a More Comprehensive robots.txt

Posted Feb 15, 2025 1:13 UTC (Sat) by pablotron (subscriber, #105494) [Link] (8 responses)

I took a look at the LWN robots.txt and it looks fairly sparse. You may want to consider using a more comprehensive robots.txt based on the following URL in addition to your other countermeasures:

https://robotstxt.com/ai

I modified my robots.txt several weeks ago based on the information in the URL above and the amount of LLM scraper traffic seems to have dropped since then.

Consider Adding a More Comprehensive robots.txt

Posted Feb 15, 2025 1:15 UTC (Sat) by pablotron (subscriber, #105494) [Link]

Re-reading the article, I see there is a paragraph about robots.txt that I missed the first time through.

Sorry about that...

Consider Adding a More Comprehensive robots.txt

Posted Feb 15, 2025 1:25 UTC (Sat) by mb (subscriber, #50428) [Link] (6 responses)

>To block in robots.txt:
>User-Agent: GPTBot

I blocked GPTBot via User-Agent as 403-Forbidden, because it ignores robots.txt. It's "good" that GPTBot still sends a User-Agent. But most "AI" idiots don't send a User-Agent.

Sad.

In over two decades I have never blocked one of the traditional search engines.
But the "AI" idiots have gone crazy over the past couple of months.

"AI" idiots stop it. You are destroying the Internet/WWW.

No user-agent = 403

Posted Feb 15, 2025 2:50 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (5 responses)

If there is no User-Agent I would just send a 403 Forbidden back.

No user-agent = 403

Posted Feb 15, 2025 3:08 UTC (Sat) by corbet (editor, #1) [Link] (3 responses)

It's not that there is no user agent... These things just pretend to be a desktop browser.

User-Agent strings

Posted Feb 15, 2025 17:58 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (2 responses)

Modern or outdated one?

User-Agent strings

Posted Feb 15, 2025 18:02 UTC (Sat) by corbet (editor, #1) [Link] (1 responses)

Take your pick - they pick a user agent that is meant to disguise the bot and blend into the rest of the site's traffic. It is a fully attacker-controlled string that really has nothing of value in it.

User-Agent strings

Posted Feb 16, 2025 7:05 UTC (Sun) by wtarreau (subscriber, #51152) [Link]

Usually they just capture a request from their browser during the test and copy-paste all of that into the program. This is also what makes the device detection engines quite popular: if the advertised browser doesn't match the one you detect, then it's highly suspicious, since something was changed in a way that doesn't match what regular browsers do.

No user-agent = 403

Posted Feb 15, 2025 17:02 UTC (Sat) by farnz (subscriber, #17727) [Link]

The last scraperbot author I spoke with carefully chose user agents that matched popular browsers; the intent is that if you blocked them by user agent, you'd also be blocking all Apple Safari, Google Chrome, Mozilla Firefox, and Microsoft Edge users on all platforms.

They go to a lot of effort to make it very hard to block scraperbots without catching a significant fraction of your user base in the cross-fire :-(

Simple no-JS CAPTCHA?

Posted Feb 15, 2025 4:34 UTC (Sat) by DemiMarie (subscriber, #164188) [Link] (1 responses)

The Fossil version control system (https://fossil-scm.org) uses a very simple CAPTCHA that requires no JS. It won't stop targeted attacks, but it will stop non-targeted bots.

Simple no-JS CAPTCHA?

Posted Mar 12, 2025 13:08 UTC (Wed) by sammythesnake (guest, #17693) [Link]

Nice idea, but they definitely need to work on the implementation; I couldn't make head nor tail of the alleged code. Looks like they tried ASCII art with a variable-width font...?

Captchas for new IP addresses

Posted Feb 15, 2025 17:29 UTC (Sat) by Tobu (subscriber, #24111) [Link]

A captcha, and a cookie that signs the IP and datetime of successful captcha completion, might help with the spread over residential IP addresses? I don't think it has to require JS, just to make it uneconomical to disguise the crawl in this way. The signed cookie is there so that no storage is required on the server.

(Sorry for giving “free advice”, your work on the site is appreciated, and thank you for giving us an instructive update as well)
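A sketch of what such a stateless, signed cookie could look like (the encoding and lifetime here are invented for illustration): the server keeps only one secret, and verification is a recomputation rather than a lookup.

    import hashlib
    import hmac
    import time

    SECRET = b"a-long-random-secret-kept-only-on-the-server"

    def issue_cookie(client_ip):
        """Called once after a successful captcha; nothing is stored server-side."""
        payload = f"{client_ip}|{int(time.time())}"
        tag = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        return f"{payload}|{tag}"

    def cookie_valid(cookie, client_ip, max_age=7 * 24 * 3600):
        """Check the signature, the bound IP address, and the age of the cookie."""
        try:
            ip, ts, tag = cookie.rsplit("|", 2)
        except ValueError:
            return False
        expected = hmac.new(SECRET, f"{ip}|{ts}".encode(), hashlib.sha256).hexdigest()
        return (hmac.compare_digest(tag, expected)
                and ip == client_ip
                and time.time() - int(ts) < max_age)

Binding the cookie to the IP address is what would make it uneconomical to share one solved captcha across a residential botnet, at the cost of re-challenging readers whose addresses change.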

Freewall

Posted Feb 15, 2025 19:19 UTC (Sat) by sroracle (guest, #124960) [Link] (5 responses)

Drastic measure: require a login to view (older?) content. This doesn't mean a subscription is required, but it would tie access to a credential that could be revoked if abused.

Freewall

Posted Feb 16, 2025 2:03 UTC (Sun) by GoodMirek (guest, #101902) [Link]

This, and whitelist known good bots (search engines) by their IP addresses. E.g. Google has it here: https://developers.google.com/static/search/apis/ipranges...
We are dealing with the same issue.

Freewall

Posted Feb 20, 2025 15:40 UTC (Thu) by PastyAndroid (subscriber, #168019) [Link] (3 responses)

I'd generally accept something like this as an end-user. If I have to login to verify I'm human[1], that's fine.

A solution other than CDN captchas would be ideal, from my perspective. My biggest gripe is captchas on websites that use Cloudflare and that do not let me past no matter what I do because of my browser's privacy settings. It gives me the tick, then refreshes the page and asks me to endlessly re-do the captcha. I stop visiting websites that do this; I generally see that page now and just close it without even attempting the captcha.

[1] I think I'm human anyway? I do question it sometimes ;-)

Freewall

Posted Feb 20, 2025 23:32 UTC (Thu) by Klaasjan (subscriber, #4951) [Link] (2 responses)

Dear PastyAndroid,
Considering your alias, your footnote [1] and the well-known Turing test, could you please reply to this message of mine to help in determining the answer? 😀

Freewall

Posted Feb 21, 2025 14:02 UTC (Fri) by PastyAndroid (subscriber, #168019) [Link] (1 responses)

Certainly!

If you wish to cease the test and dump the data at any given time please respond with "2". Before we begin our test, let's look at some possible outcomes.

Possible outcomes of a successful test:
- Participants can be fairly certain that each other are human.
- It's possible that I could be trained to be able to pass the test using advanced train[ERROR OCCURRED]

Warning: An error occurred at line 15,663 in safenet.py. Response 1 of 443 has stopped unexpectedly, please see the logs or further information. To disable this warning please toggle "CONFIG_IGNORE_AI_ERRORS" to false.

Freewall

Posted Feb 21, 2025 15:05 UTC (Fri) by Wol (subscriber, #4433) [Link]

I hope I'm watching a human pretending to be an AI.

Certainly much more pleasurable than thinking I'm watching an AI pretending to be human.

Time for this stuffed bird to fly :-)

Cheers,
Wol

Classful networking

Posted Feb 16, 2025 1:14 UTC (Sun) by aaronmdjones (subscriber, #119973) [Link] (7 responses)

> a certain amount of peace has been obtained by treating the entire Internet as a set of class-C subnetworks and applying a throttling threshold to each

A Class C network is one whose IP addresses begin with the bits "1 1 0". Thus, only IP addresses 192.0.0.0 - 223.255.255.255 are Class C. It is a common misconception to say that all /24 networks are Class C networks, only because the Class C network space was originally divided up into individual allocations whose size was /24.

Class A, B and C networks have also not existed for decades. CIDR killed classful networking when it was introduced by RFC 1519 in September 1993.

Classful networking

Posted Feb 16, 2025 1:47 UTC (Sun) by corbet (editor, #1) [Link] (6 responses)

The discerning reader might have noticed that I wrote "treating the Internet like...". The fact that you understood what I meant suggests that the analogy worked. So perhaps this lecture was not strictly necessary...?

Classful networking

Posted Feb 16, 2025 3:35 UTC (Sun) by aaronmdjones (subscriber, #119973) [Link]

I quoted it as-written, and it wasn't meant as a lecture :)

Classful networking

Posted Feb 16, 2025 3:40 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> So perhaps this lecture was not strictly necessary...?

This is a site full of pedantic nerds. So yes, it was necessary.

Classful networking

Posted Feb 16, 2025 18:20 UTC (Sun) by draco (subscriber, #1792) [Link] (3 responses)

I'm sympathetic to both sides here. You're right that it was perfectly clear.

On the other hand, the idea of class A/B/C networks is still taught in networking courses, even though it has been irrelevant and unhelpful for decades now. The sooner we stop referencing it, the better.

Classful networking

Posted Feb 17, 2025 12:05 UTC (Mon) by khim (subscriber, #9252) [Link] (2 responses)

What would that change? They are still teaching networking using the OSI model: a stillborn model for a stillborn set of protocols (are any of them still in use? I think Microsoft used X.500, which later turned into LDAP).

Compared to that attempt to “learn how an airplane works using blueprints from a steam train”, the fact that they also teach A/B/C networks is minor.

You have to forget most of what you were taught and relearn how things actually work when you get a networking-related job, anyway.

Classful networking

Posted Feb 19, 2025 19:58 UTC (Wed) by antiphase (subscriber, #111993) [Link] (1 responses)

The problem is that people continue to use and teach what they've learned unless they are corrected. It would be great if people forgot more.

I've worked with, and continue to work with, otherwise competent network admins who keep making references to classful networking despite not being old enough to remember it, and without understanding what it really means beyond being an inadequate proxy for subnet size, something CIDR expresses far more usefully.

It's long past time for it to be put to bed.

Classful networking

Posted Feb 19, 2025 20:13 UTC (Wed) by khim (subscriber, #9252) [Link]

> I've worked with and continue to work with otherwise competent network admins who continue to make references to classful networking

Are you really sure they refer to “classful networking”? In my experience, for them “class C” is “a network with a /24 mask, which can be used with 192.168.x.0 or with 172.16.x.0”, “class B” is “a network with a /16 mask that needs 172.16.0.0 or 172.20.0.0, or maybe something in 10.x.0.0”, and “class A” is “a gigantic network, most likely 10.0.0.0” (the only one that may exist in such a twisted world).

It's not related to “classful networking” as it existed years ago… but it's no worse than the craziness we get when trying to invent seven layers in a network that never had them.

> It would be great if people forgot more.

But they did! The problem is with you, not with them! They don't have any trouble understanding each other! Instead, you are causing trouble when you bring up things from an era long gone!

> It's long past time for it to be put to bed.

But why? Why remove a useful term that helps in communication just because its original meaning is no longer relevant?

One may as well attempt to ban the use of “byte” for “an 8-bit quantity”: equally pointless and useless in a world where other kinds of “bytes” no longer exist. The result would be the same: people would laugh at you and continue to use whatever is convenient for them.

Naïve question

Posted Feb 16, 2025 11:34 UTC (Sun) by Sesse (subscriber, #53779) [Link] (5 responses)

Hi LWN,

Thank you for being the tome of both old and new Linux knowledge. I've been on the receiving end of these AI bots myself, and one of the most relevant solutions was just… to make it fast enough that it doesn't matter? If you've got ~750k pages, is there a reason why you can't just generate them all and cache them in Varnish or similar? (I mean, this is about CPU time and not bandwidth, right?)

For my part, I moved the only problematic vhost (a Gitweb instance with just way too many potential pages to precompute) to IPv6-only, and traffic dropped dramatically overnight :-) But I do believe this would be too painful for LWN, of course.

(As for JavaScript, I'm sure a lot of people are very vocal about this. As an ever-so-slight counterbalance, here's an “I don't really care” voice.)

Naïve question

Posted Feb 17, 2025 13:55 UTC (Mon) by leigh (subscriber, #175596) [Link] (1 responses)

Personally, I really don't like the approach of 'just let it happen.' Feels gross, man.

Naïve question

Posted Feb 17, 2025 14:14 UTC (Mon) by Wol (subscriber, #4433) [Link]

Okay, it's going back a good few years, but "just let it happen" is a bad idea.

Firstly, LWN is paying for that bandwidth. And secondly, what happens if the scraping overwhelms the bandwidth? Ancient case in point: a company on an ISDN link was filtering its mail based on message headers. Even downloading JUST the headers, spam was arriving at their ISP faster than they could bounce it.

Cheers,
Wol

Naïve question

Posted Feb 17, 2025 14:09 UTC (Mon) by daroc (editor, #160859) [Link] (2 responses)

We actually do have caching in the site code — and I've been doing a little bit of work to move the caching to Apache, which should remove some overhead.

But making sure to do that correctly takes time, especially since subtle problems with the correctness of the cache are only likely to make themselves known when a user notices a discrepancy.

Naïve question

Posted Feb 17, 2025 14:11 UTC (Mon) by Sesse (subscriber, #53779) [Link] (1 responses)

Caching is famously a hard problem, indeed. (I don't know exactly what your site mix looks like, but at least the mailing list archives must be rather static?)

Naïve question

Posted Feb 17, 2025 14:31 UTC (Mon) by daroc (editor, #160859) [Link]

Yes — both old articles and old mailing list threads are _mostly_ static. They change pretty rarely, but they do occasionally change, when someone makes a comment or replies to an old thread. We do get the occasional comment on an article from decades ago that sparks a new discussion. Generally, stuff has a very rough exponential decay in changes over time.

So it's definitely not an insurmountable problem, just something to do with care.
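
Given that decay, one common pattern is to let the cache lifetime grow with the age of the item, so that archives are re-rendered rarely while recent pages stay fresh. A hypothetical sketch (these numbers are invented, not LWN's):

    # Hypothetical sketch: pick a cache TTL that grows with content age,
    # reflecting the "rough exponential decay" in how often old items change.
    # None of these numbers come from LWN; they are illustrative only.
    from datetime import datetime, timezone

    def cache_ttl_seconds(published, now=None):
        """Older content gets a longer time-to-live before re-rendering."""
        now = now or datetime.now(timezone.utc)
        age_days = max((now - published).days, 0)
        if age_days < 7:
            return 60              # this week's content: re-render every minute
        if age_days < 365:
            return 60 * 60         # this year's content: hourly
        return 24 * 60 * 60        # archives: daily is plenty

    # Example: an article from 2002 would get a one-day TTL.
    print(cache_ttl_seconds(datetime(2002, 7, 1, tzinfo=timezone.utc)))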

Fighting the AI scraperbot scourge

Posted Feb 16, 2025 11:47 UTC (Sun) by arnout (subscriber, #94240) [Link]

In Buildroot, we have a site sources.buildroot.org that mirrors all of the sources of the packages we support. It is normally only used as a backup, i.e. if upstream is not available. At some point we got a message that the site had consumed more than half of the 40TB monthly bandwidth, which raised some alarm bells...

We could easily see from the User-Agent strings that it was a lot of AI-bots, because many of them _are_ honest in their User-Agent (though they still do seem to ignore robots.txt). We first blacklisted a lot of those User-Agent strings (returning a 403), but it didn't do much more than make a dent in the traffic.

Fortunately, there is a very simple solution for us. The full download URLs are normally constructed from the Buildroot metadata; they don't appear as URLs anywhere on the web. So we simply turned off directory listing, and the scrapers no longer have anything to scrape! It's still a step back, because people do sometimes use the directory listing (e.g. to check which versions are available and which ones are missing), but it's something we can live with.
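
For illustration only, the User-Agent blacklisting step mentioned above is tiny in code terms; a hypothetical WSGI middleware doing the equivalent of the 403 rule might look like this (the blocked strings are placeholders, not the actual Buildroot list):

    # Hypothetical WSGI middleware returning 403 for known scraper User-Agents.
    # The substrings below are placeholders, not a real blocklist.
    BLOCKED_UA_SUBSTRINGS = ("GPTBot", "CCBot", "ExampleScraper")

    class BlockScrapers:
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(s.lower() in ua.lower() for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden\n"]
            return self.app(environ, start_response)

As noted above, this only catches bots that are honest about their User-Agent.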

Have you seen maptcha?

Posted Feb 16, 2025 14:20 UTC (Sun) by hunger (subscriber, #36242) [Link] (2 responses)

Just an idea: https://maptcha.crown-shy.com/ is a captcha that improves OpenStreetMap. I would not mind doing that a couple of times when interacting with LWN :-) Unfortunately, it does not seem to be production-ready yet...

They did a presentation at FOSDEM this year: https://fosdem.org/2025/schedule/event/fosdem-2025-5879-m...

I found the idea really interesting and it would IMHO be a good cultural fit for LWN.

Have you seen maptcha?

Posted Feb 16, 2025 19:30 UTC (Sun) by Henning (subscriber, #37195) [Link] (1 responses)

Unfortunately, it requires JavaScript; I get a blank page visiting the linked site with JavaScript disabled.
LWN works wonderfully without JavaScript, and the only time I enable it is when I renew my subscription. I have been bitten too many times by payment solutions breaking if you run a non-standard setup of your browser.

Have you seen maptcha?

Posted Feb 17, 2025 14:15 UTC (Mon) by daroc (editor, #160859) [Link]

Yes, our credit card payment system uses JavaScript. Which is obviously inconvenient, but it has the huge advantage that we don't need to process or store credit card information ourselves.

For those who wish to avoid JavaScript at any cost — we do also take payment for subscriptions by check, although only on US banks.

what is wrong with javascript?

Posted Feb 16, 2025 15:39 UTC (Sun) by PlaguedByPenguins (subscriber, #3577) [Link] (11 responses)

sorry for being naive, but is this just a philosophy thing, or is there a technical reason to avoid javascript?

it seems crazy for LWN to spend 100's of hours of valuable time and resources and to eventually still capitulate to the "ignore robots.txt" crowd if just using javascript could solve the problem. or am I missing something?

FWIW I have no problems with it...

if people really care, then I'd suggest an LWN accounts setting that permits javascript to be disabled.

what is wrong with javascript?

Posted Feb 16, 2025 17:26 UTC (Sun) by Paf (subscriber, #91811) [Link] (1 responses)

It's philosophy, about running unfree code. I don't find it compelling but some people really do.

what is wrong with javascript?

Posted Feb 17, 2025 14:17 UTC (Mon) by daroc (editor, #160859) [Link]

And even if we did write our own JavaScript under a permissive license (which is certainly an option), it would still be a hassle for some readers. LWN does actually have a little bit of JavaScript in the normal interface — on the search page, it's used for progressive enhancement to let you check or uncheck all the checkboxes at once. If you don't have JavaScript enabled, you won't see the option at all.

But you're right that if things get bad enough, we may be forced to come up with some kind of CAPTCHA system. Let's hope things don't get to that point.

what is wrong with javascript?

Posted Feb 18, 2025 0:14 UTC (Tue) by nickodell (subscriber, #125165) [Link] (3 responses)

> it seems crazy for LWN to spend 100's of hours of valuable time and resources and to eventually still capitulate to the "ignore robots.txt" crowd if just using javascript could solve the problem.
Does JavaScript solve this problem? It's reasonably straightforward to write a Python program that controls a headless browser and loads a web page. There are of course ways to fingerprint this setup, but none of them are foolproof.
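
To illustrate how low that bar is, here is a minimal headless-browser fetch using Playwright (one of several libraries that can do this; the URL is a placeholder):

    # Minimal headless-browser page fetch using Playwright (illustrative only).
    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    def fetch_rendered(url):
        """Load a page in headless Chromium, running its JavaScript,
        and return the resulting HTML."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url)
            html = page.content()
            browser.close()
            return html

    print(len(fetch_rendered("https://example.org/")))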

what is wrong with javascript?

Posted Feb 18, 2025 9:38 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

Headless browsers consume more resources than (e.g.) curl.

You might argue that the attackers do not care, because they are (presumably) using a botnet anyway. But even if they are stealing the resources, they still have to allocate them, and they will likely prefer to target sites that do not require a headless browser over sites that do.

what is wrong with javascript?

Posted Feb 19, 2025 14:12 UTC (Wed) by Sesse (subscriber, #53779) [Link] (1 responses)

Given how impossible it is to read and navigate the average page today without JavaScript, it seems untenable to have a crawler now that isn't a browser in some form.

what is wrong with javascript?

Posted Feb 19, 2025 14:55 UTC (Wed) by excors (subscriber, #95769) [Link]

A long time ago I did some experiments with Googlebot and found that it didn't execute scripts - but it did parse scripts to some extent (not with a real JS parser, it was something much simpler), looked for string literals that might be URLs (relative or absolute), and speculatively crawled them. And it wasn't just indiscriminately looking for URL-like strings anywhere in the file; it was specifically looking in <a href>, <embed src>, and JS string literals, and nowhere else that I found.

That was probably over 15 years ago, and nowadays Googlebot uses a headless Chromium to crawl some pages (https://developers.google.com/search/docs/crawling-indexi...). I don't know if there are some cheaper crawlers that still just look for JS strings, to get some of the benefit for much less cost.

what is wrong with javascript?

Posted Feb 22, 2025 17:31 UTC (Sat) by anton (subscriber, #25547) [Link]

I have disabled JavaScript in the browser I usually use in order to reduce the attack surface.

A side benefit is that many web pages that are a waste of time display very little without JavaScript. E.g., Google search has become more and more enshittified in recent years, and a month or two ago they crowned this work by requiring JavaScript. So I looked at duckduckgo again, and I find that it now produces better search results than google did (in addition to working without JavaScript).

what is wrong with javascript?

Posted Feb 27, 2025 6:55 UTC (Thu) by brunowolff (guest, #71160) [Link] (3 responses)

I'd rather see the opposite. Except for the renewal page, break pages for browsers/bots that use javascript. This will encourage people to turn off javascript. Unfortunately the bots would adapt, but if the method isn't used by other sites it could work for a long time. There might be tricky things that you could do in the javascript to try to distinguish between humans and bots, so that people who leave javascript on by default don't have problems.

what is wrong with javascript?

Posted Feb 27, 2025 12:12 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

The bots have never used JavaScript; the reason JavaScript is useful as a distinguishing factor between bots and humans is that some humans have working JavaScript, bots don't. If you break pages only for people who use JavaScript, you break lots of humans, and no bots.

The converse of this is that if you break pages only for people who disable JavaScript, you break lots of bots, and very few humans.

what is wrong with javascript?

Posted Feb 27, 2025 14:02 UTC (Thu) by brunowolff (guest, #71160) [Link] (1 responses)

The context here is bots trying to extract information from a site, not attack it. To do that on many sites, they have to support and, to some extent, run JavaScript, unless the sites behave differently for them. Site owners might do that, for example, to make it easier for search engines to index them.
If the bots are running JavaScript code from your site, you have a way to attack them.

what is wrong with javascript?

Posted Feb 27, 2025 14:07 UTC (Thu) by farnz (subscriber, #17727) [Link]

Bots don't run JavaScript; to the extent they support it, they do so by having pre-written (by their owners) pattern matching code that says "this looks like Google JavaScript on a Google owned domain - extract this bit and crawl the resulting URL".

It's humans who run JavaScript - so you can attack humans, but not bots, via JavaScript.

Just some brainfarts

Posted Feb 16, 2025 20:01 UTC (Sun) by pomac (subscriber, #94901) [Link]

I have been thinking about things like this before... Given a decent amount of memory, you could render and compress web pages and then serve them directly using, for example, nginx's built-in memcached support.

If you set a tick for how often pages are re-rendered, that interval could be used as the lifetime in memcached and thus limit CPU usage...

Also, in general, all the publicly available articles could be rendered to Markdown and put into Git, served on kernel.org, and mirrored to GitHub, letting GitHub's CDN handle them :)
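
A rough sketch of that idea, assuming the pymemcache client and placeholder render_page()/all_page_keys() functions; nginx's memcached module would then serve the stored entries directly:

    # Hypothetical pre-render loop: render pages, gzip them, and store them in
    # memcached with a TTL tied to the re-render tick, so nginx's memcached
    # module can serve them without touching the application at all.
    # Assumes: pip install pymemcache; render_page() and all_page_keys() are
    # placeholders for the site's own rendering code.
    import gzip
    from pymemcache.client.base import Client

    RERENDER_TICK = 300  # seconds between re-render passes (illustrative value)

    def prerender_all(render_page, all_page_keys):
        mc = Client(("127.0.0.1", 11211))
        for key in all_page_keys():
            html = render_page(key)                    # returns str
            body = gzip.compress(html.encode("utf-8"))
            # Expire just after the next render pass, so stale entries vanish
            # on their own if the renderer stops running.
            mc.set(key, body, expire=RERENDER_TICK + 60)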

Hands-off hosters?

Posted Feb 17, 2025 8:48 UTC (Mon) by taladar (subscriber, #68407) [Link] (1 responses)

> Interestingly, scrapers almost never use IPv6

Presumably that is because the kind of hands-off hosters used by scrapers like that invest the minimum amount of effort into their hosting offerings. IPv6 is still seen as optional by many, for some unfathomable reason, so they won't invest time in learning how to offer it.

Hands-off hosters?

Posted Feb 17, 2025 13:00 UTC (Mon) by ccr (guest, #142664) [Link]

I wouldn't be so sure about this. I've been experiencing similar scraping issues on my own site (which hosts, among other things, source code repositories) during the last year or so. It has sometimes been enough to cause congestion due to hundreds of simultaneous connection attempts.

In my case, I've also noted that the scraping bots do not seem to use IPv6, but also that the vast majority of the scraping originated from cloud services of a certain nationality. All three of those cloud providers at least document IPv6 capabilities, so I'm not sure the "blame" can be placed on the hosting alone... of course, the scrapers hitting LWN may be different from the ones that bombard my site, so... shrug. :)

Eventually, after adjusting the throttling settings a few times, I decided to simply drop all connections originating from those networks completely in netfilter. By networks, I mean the complete ranges assigned to those cloud providers' ASes. It's been quieter after that.
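
The same check is easy to express at the application layer as well; in this sketch the prefixes are RFC 5737 documentation ranges standing in for real provider allocations (the kernel-level equivalent would be an ipset or nftables set of the same prefixes):

    # Sketch of dropping requests whose source address falls inside a set of
    # blocked provider ranges. The prefixes are RFC 5737 documentation ranges,
    # standing in for real cloud-provider AS allocations.
    import ipaddress

    BLOCKED_NETWORKS = [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
        ipaddress.ip_network("203.0.113.0/24"),
    ]

    def is_blocked(client_ip):
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_NETWORKS)

    print(is_blocked("198.51.100.42"))   # True
    print(is_blocked("8.8.8.8"))         # False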

Much sympathy. Keep up the great work.

Posted Feb 17, 2025 20:09 UTC (Mon) by babywipes (subscriber, #104169) [Link]

Lots of folks have ideas in the comments above; I have none since this is out of my ken of technical ability.

I just wanted to post a sympathetic note and to thank you all for being one of the great websites on the internet that represent the, sadly, ever-shrinking population of folks who embody the original spirit of the internet. LWN is the best of the 'net and reminds me of that sense of discovery I had when I first got online in the 90s.

Have two sites: free access with DDoS protection/JS, authenticated subscriber access without

Posted Feb 18, 2025 9:28 UTC (Tue) by lambert.boskamp (subscriber, #131534) [Link]

Could you solve the problem by offering two different versions of the site?

The first, free version of LWN would provide anonymous access to some extent (as is the case today), but is protected by Cloudflare against DDoS and hence requires JavaScript. This is less convenient for drive-by users who e.g. click on historic LWN links found in kernel source, but it works. If any of these users is significantly annoyed by the captchas, they're free to think about becoming a subscriber.

The second version of the site would be for subscribers exclusively and would sit completely behind an authentication wall, without any anonymous content at all. This version could stay JS-free, as it is today. Subscribers can hence continue to use Lynx and be happy.

Have you tried ip banning the botnet?

Posted Feb 18, 2025 17:55 UTC (Tue) by IanKelling (subscriber, #89418) [Link]

We've recently had DDoS scrapers on FSF-run domains. We've spent a lot of time identifying patterns in the logs, then temporarily banning all the IPs matching a given pattern, creating fail2ban rules, and using ipset when there are too many IPs.

We wrote a bit about it here:
https://www.fsf.org/bulletin/2024/fall/fsf-sysops-cleanin...
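
The shape of that workflow, reduced to a toy sketch (the log path, pattern, threshold, and ipset name are all invented; fail2ban and ipset do the real work far more robustly):

    # Toy version of the "find a pattern in the logs, ban the matching IPs"
    # workflow. The log path, regex, and threshold are all made up; in
    # practice fail2ban filters plus an ipset-backed action do this job.
    import re
    from collections import Counter

    LOG_PATH = "/var/log/apache2/access.log"                       # hypothetical
    SUSPICIOUS = re.compile(r'"GET /some-expensive-path\S* HTTP')  # hypothetical
    THRESHOLD = 50

    def candidate_bans():
        hits = Counter()
        with open(LOG_PATH) as log:
            for line in log:
                if SUSPICIOUS.search(line):
                    ip = line.split()[0]            # first field: client address
                    hits[ip] += 1
        # Emit the commands an operator (or a fail2ban action) would run to add
        # the offenders to an ipset that a netfilter rule drops.
        for ip, count in hits.items():
            if count >= THRESHOLD:
                print(f"ipset add scraper-bans {ip}")

    candidate_bans()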

Simple suggestion

Posted Feb 18, 2025 22:00 UTC (Tue) by sageofredondo (subscriber, #157944) [Link] (1 responses)

Hate to suggest this, but as a stopgap, what about changing the site to require an account to view full older articles and to view shared stories?

Simple suggestion

Posted Feb 19, 2025 17:00 UTC (Wed) by smoogen (subscriber, #97) [Link]

The 'botnet' farms have people who set up accounts on websites day in and day out. They have been doing this for at least a decade. When a scraper or other malware infector rents a botnet, they get both a bunch of machines connected together via a peer-to-peer VPN and a set of logins and passwords for various sites they may be using. The tools seem to have all of this built in, so a scraper just has to set up the equivalent of `wget --mirror` on one side, and the P2P VPN acts as a caching onion, sending out requests from multiple IP addresses and logins. The same request may come from multiple clients, so that it acts like a CDN to the botnet customer: N hundred cloud systems and N thousand infected laptops all have the same data in case some get cleaned out.

To the affected sites, it will look like hundreds of different users got interested in parts of the site over the course of a week.

I think a lot of the client code was written for advertising scams. You say your site has X amount of traffic with lots of users in a given demographic mix; you then pay the botnet for those 'users', serve tons of ads to bots, and take a check from the advertisers. Because you want to show that 'real' content is being consumed, the software was written to 'interact' in ways that fool the various captchas and other checks that advertisers may require before serving expensive ads. It just turns out to be useful if you want to slurp up the internet for 'free', too.

Micro-payments may be the only answer

Posted Feb 23, 2025 5:33 UTC (Sun) by jrw (subscriber, #69959) [Link] (8 responses)

As long as it is "free" to download as much data as possible, this will continue to be a problem. It seems to me that some form of micro-payments is the only solution. A real human user will not be bothered by tiny payments based on their actual usage, but an enterprise downloading a substantial portion of the entire Internet will have to think twice about their "spend". This is a problem created by greed and capitalism, so maybe we can mitigate it by appealing to the only thing that those who would misuse the commons can understand: the almighty dollar.

Micro-payments may be the only answer

Posted Feb 23, 2025 13:11 UTC (Sun) by farnz (subscriber, #17727) [Link] (7 responses)

There's already a tiny cost to downloading more data - your Internet link is charged based on capacity, and the more you want to download, the bigger the link has to be. It's just a really tiny cost right now, because Internet links are cheap.

The problem micropayments have repeatedly hit against is the cost of addressing fraud. If I can sneak an ad onto millions of pages at a cost of $0.001 per page, and have the ad collect a micropayment of $0.002 per viewer, I can make thousands of dollars of profit at the expense of a large number of defrauded consumers. Criminals are already quite happy to do similar tricks to serve malware from ads; they'd be even more happy to do it for a direct cash reward.

So you need a system (funded somehow) that will take me complaining that I don't recall visiting LWN.net or Ars Technica, and refuse to refund me, while still refunding the transactions that come from fraud. Thus far, no-one has successfully built such a system for the offline world, let alone the online world; transactions have a minimum amount not because there's no use for small transactions, but because the cost of fraud is proportional to the number of transactions. Trying to collect $0.002 per transaction to cover the cost of fraud is acceptable when transactions are $1 or more, but unacceptable when the transaction is $0.001.

Micro-payments may be the only answer

Posted Feb 23, 2025 14:02 UTC (Sun) by excors (subscriber, #95769) [Link]

> There's already a tiny cost to downloading more data - your Internet link is charged based on capacity, and the more you want to download, the bigger the link has to be.

...unless you use a botnet, as reported by this article, in which case someone else is paying for the internet link (and potentially for the storage and processing power to deduplicate and filter and compress the data before sending the valuable parts back to you). Then it becomes another instance of fraud, where the person benefiting from the activity is not the person paying for the activity, and increasing the cost of traffic will have no impact.

Micro-payments may be the only answer

Posted Feb 23, 2025 16:31 UTC (Sun) by raven667 (subscriber, #5198) [Link] (5 responses)

tldr; I don't know if I have a thought here, I'm just free associating this morning while the coffee kicks in and I procrastinate on cleaning and laundry :-)

I wonder how many executives at telcos even remember toll calling, but I'm sure they'd like to bring back billing based on netflow data or DNS if they can take their cut, in the same way that cable TV based ISPs have reintroduced bundled billing after the migration to streaming using SAML SSO AFAICT. So you could either use a sample based guesstimate that has more room for fraud, or have a strong identity federated with your ISP as a prerequisite for access, requiring SAML/OAuth2/OIDC to handle the accounting for billing. You might end up with a system bifurcated by wealth where "free" data is predatory nonsense trying to defraud / manipulate and paid data is higher quality, although catered to the biases of the more wealthy.

In some ways it represents a regression from the original ideology of the Internet, which was based on a flat fee to cover the infrastructure costs; you could use it as much or as little as you liked, for whatever purpose. But that assumed a more peer-like relationship between collections of endpoints, rather than the more strictly client/server networks we have today. The cost of being a publisher was fairly low and mostly covered by excess capacity in already-paid-for infrastructure (networks and personal computers/workstations), and there were mechanisms of accountability for abuse: if you got out of line, they could call your dean and hold your degree hostage, threatening your potential future lifestyle, until you learned to behave. In a better world, instead of pretending that the Internet is some cyberspace separate from governments, a land of FREEDOM (*eagles screaming*), we'd have sensible accountability measures as part of peering, so that networks are only connected when there is an agreed-upon ToS. That might split the network into several zones of influence with small, filtered pipes in between, but it could more closely align with human behavior and human-centered governance systems; e.g., if you don't want this kind of scraping and there isn't already a rule against it, you'd have a clear process to talk to your representative and negotiate a change to the ToS (laws) that could be enforced against those violating community norms.

Many people are laughing now, because the mechanisms for feedback and adjustment to the rules by which society operates are so broken right now that what I've described seems like a utopian pipe dream; the trend has been toward less accountability the more power and wealth you have, when we need stronger accountability for the effects of decisions that, ultimately, a person somewhere is making. There is a person somewhere setting up a scraping botnet, and there is a person somewhere using that tainted data; those people have names and faces, they live in places with rules of some kind, and they should be identified or cut off from the network.

Is that idea the end of anonymity? Yeah, kind of. It should only be attempted when there are working feedback mechanisms between the people making and enforcing the rules and the people subject to the rules, so that there is mutual consent and a way to adjust to keep everyone in harmony and alignment. Democracy is one of those mechanisms, but there are certain to be many stable and effective ways for human societies to maintain healthy feedback loops.

There is just a limit as to what an individual website can do when the problem is the fundamental rules of the system that don't provide negative feedback on abusive behavior, like a broken TCP or bufferbloat which allows one jumbo flow to starve the rest of the link.

Ok, time to get a second cup of coffee and get off the Internet ;-)

Telco and cableco protections against fraud

Posted Feb 23, 2025 16:52 UTC (Sun) by farnz (subscriber, #17727) [Link] (3 responses)

Note that both telcos and cablecos protect against fraud in the same two ways: first, anyone they pay money to gets it on a long delay, and it's at the company's discretion whether they continue to let you take money from their customers via them. If you're a fraudulent content provider (my "criminal running ads that take money" example), they cut you off permanently, and they have complicated rules and procedures you have to follow to even get to a point where you can take money from them.

The second part of it is that they've got strong defences against consumer-side fraud; they can, to the extent that local law permits, refuse to serve you until you've repaid the cost of your fraud, including refusing to supply service to any address you live at.

The reason this basically doesn't work for the Internet is that I can fairly trivially obtain a service (e.g. like Tor, or a commercial "VPN" provider) that takes in encrypted data in one location, and releases it onto the Internet as-if I was based in a totally different location. Unless you enforce rules requiring that your telco can see inside all encrypted packets, there's no way for my telco to distinguish "good traffic" from "bad traffic" - it all looks the same on the wire, and that makes it impossible to identify the root sources of "bad traffic" and cut them off.

Telco and cableco protections against fraud

Posted Feb 25, 2025 4:14 UTC (Tue) by raven667 (subscriber, #5198) [Link] (2 responses)

Sure, VPN providers exist, but they are effectively a second ISP: they can see your traffic just fine, and selling data about users is surely one way they make money. Tor is an exception, but Tor traffic is identifiable; it's different enough from ordinary web traffic, and the entry and exit IPs are generally discoverable even if the contents are opaque, so an observer can know that you are using the service, how much, and when. If you manage to piss off the right (wrong) people, they will try to deanonymize your traffic or pop your Tor client to figure out how to hold you accountable.


Telco and cableco protections against fraud

Posted Feb 25, 2025 10:15 UTC (Tue) by paulj (subscriber, #341) [Link]

In the one known case of "deanonymisation of Tor traffic" (darkweb market site operators, IIRC), the FBI (or... a partner, ahem) did it using a previously unknown 0-day exploit of the Firefox browser in the Tor bundle. To my knowledge, there are no documented cases of Tor itself having its traffic deanonymised.

Telco and cableco protections against fraud

Posted Feb 25, 2025 10:33 UTC (Tue) by farnz (subscriber, #17727) [Link]

Indeed - but now you need a Great Firewall of China level of enforcement of anti-VPN rules to stop me "invisibly" leaving one jurisdiction and becoming apparently "present" in another.

This has been the Internet's biggest strength and weakness from day one; it's very hard to stop a motivated individual getting access to another country's Internet-accessible resources as-if they were in the other country. As a consequence, you have to assume that you're facing the criminals of the entire world, and you have no recourse via the law since there's no guarantee that the justice system in (say) an area undergoing a civil war has any interest in dealing with your complaint that someone in their area is doing something that affects you in a rich and peaceful area of the world - especially since the person you're complaining about may "simply" be a combatant in the war offering a relay service in exchange for military gear to support their war effort.

Micro-payments may be the only answer

Posted Feb 23, 2025 17:08 UTC (Sun) by excors (subscriber, #95769) [Link]

> I wonder how many executives at telcos even remember toll calling, but I'm sure they'd like to bring back billing based on netflow data or DNS if they can take their cut

That sounds rather like the net neutrality debate from the past couple of decades, where ISPs want to be able to throttle bandwidth and/or charge customers more depending on which web sites they access (and also charge content providers to get themselves onto the free-to-customers list), while customers and content providers don't want that because they'll be paying more for a worse service. The executives don't merely remember it, they've been continually fighting for it, and in the US they're now winning.

Understandable

Posted Feb 25, 2025 22:54 UTC (Tue) by ju3Ceemi (subscriber, #102464) [Link]

This AI behavior is understandable: LWN is awesome.
If I had a vastly greater amount of brain-time, I'd read the whole content over and over again!

How about defence in depth?

Posted Feb 27, 2025 19:53 UTC (Thu) by davecb (subscriber, #1574) [Link] (1 responses)

IFF no real customer comes from a cloud service, exclude every request from cloud services.
Similarly, IFF no browser used by humans fails to provide a User-Agent header, exclude requests without one, too.
Then, IFF the User-Agent doesn't match one of a list of real browsers, exclude it as well
(the latter requires a semi-automated mechanism; see below).

This is the classic "swiss cheese" model for avoiding accidents: for an accident to happen, all the holes in all the slices must line up. Each individual slice is simple and, unless the slices are derived from one another, the holes tend to be in different places.

A customer of mine used to use that technique, and reduced bad things substantially. One of the slices was a check on user-agent-equivalents. Every time a new user-agent-equivalent showed up, it was accepted until more than N were seen in a day, at which time the program emailed a human to get a decision. A more cautious customer required manual inspection of all user-agent-equivalents. I thought that was cooler, but it used a lot of humans (:-))
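
That "accept until more than N in a day, then ask a human" slice needs very little code; a sketch with an invented N and a stand-in send_alert():

    # Sketch of the "new user-agent" slice described above: a previously
    # unseen User-Agent is allowed until it exceeds N hits in a day, at which
    # point a human is asked to classify it. N and send_alert() are invented
    # for illustration.
    from collections import Counter
    from datetime import date

    N = 200
    seen_today = Counter()        # User-Agent -> hits today
    flagged = set()               # already escalated to a human
    current_day = date.today()

    def send_alert(user_agent, hits):
        print(f"ALERT: please classify {user_agent!r} ({hits} hits today)")

    def record(user_agent):
        """Count a request; escalate once a new UA crosses the daily threshold."""
        global current_day
        if date.today() != current_day:        # reset the counters each day
            seen_today.clear()
            flagged.clear()
            current_day = date.today()
        seen_today[user_agent] += 1
        if seen_today[user_agent] > N and user_agent not in flagged:
            flagged.add(user_agent)
            send_alert(user_agent, seen_today[user_agent])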

How about defence in depth?

Posted Mar 25, 2025 4:41 UTC (Tue) by dagobayard (subscriber, #174025) [Link]

At one of my past jobs, I was the unfortunate one dealing with this. The context was advertising click fraud, as alluded to upthread.

I wish I had thought of the N approach, but alas, I'm a perfectionist, and I think in F_2 terms (true, false, black, white). I mechanized it to some extent, but I still made myself personally eyeball every new UA string. The stress was too much, and the job didn't last.

--
Ian

logged in

Posted Mar 5, 2025 3:40 UTC (Wed) by livi (guest, #173786) [Link]

Isn't the solution to provide different experiences for logged-in users? Log in and get the experience that exists now.

Users who aren't logged in get hit with JS-required challenges and whatever else.

other possible options

Posted Mar 16, 2025 14:09 UTC (Sun) by koollman (subscriber, #54689) [Link]

You can try things that will not hurt humans too much but might slow down scrapers.

Adding links/paths that humans are unlikely to follow helps with detection.
Adding some delay on rarely accessed pages (such as ones that fall out of the cache) helps limit the impact on resources.
Even a 0.2s delay is not too bad for an unknown first-time user, but it is painful for a system doing millions of requests.

Once you get enough detail on the patterns used, you can generally maintain sets of 'known good', 'known bad', and 'unknown' clients.
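
A sketch of those two ideas combined (the paths, delay, and helper functions are all placeholders): only requests that hit a trap path or miss the cache pay the extra fraction of a second.

    # Sketch of the trap-link and cache-miss-delay ideas above. The paths,
    # delay, and the cache_lookup()/render()/mark_suspect() helpers are
    # placeholders, not any site's real code.
    import time

    TRAP_PATHS = {"/never-linked-for-humans/"}    # advertised only in hidden links
    MISS_DELAY = 0.2                              # seconds; barely noticeable once

    def handle(path, client, cache_lookup, render, mark_suspect):
        if path in TRAP_PATHS:
            mark_suspect(client)                  # a human is unlikely to be here
        body = cache_lookup(path)
        if body is None:                          # cache miss: rare for humans,
            time.sleep(MISS_DELAY)                # constant for bulk scrapers
            body = render(path)
        return body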


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds