Fighting the AI scraperbot scourge
Training the models for the generative AI systems that, we are authoritatively informed, are going to transform our lives for the better requires vast amounts of data. The most prominent companies working in this area have made it clear that they feel an unalienable entitlement to whatever data they can get their virtual hands on. But that is just the companies that are being at least slightly public about what they are doing. With no specific examples to point to, I nonetheless feel quite certain that, for every company working in the spotlight, there are many others with model-building programs that they are telling nobody about. Strangely enough, these operations do not seem to talk to each other or share the data they pillage from sites across the net.
The LWN content-management system contains over 750,000 items (articles, comments, security alerts, etc) dating back to the adoption of the "new" site code in 2002. We also still have, in our archives, everything we did during the more than four years we operated prior to that change. In addition, the mailing-list archives contain many hundreds of thousands of emails. All told, if you are overcome by an irresistible urge to download everything on the site, you are going to have to generate a vast amount of traffic to obtain it all. If you somehow feel the need to do this download repeatedly, just in case something changed since yesterday, your traffic will be multiplied accordingly. Factor in some unknown number of others doing the same thing, and it can add up to an overwhelming amount of traffic.
LWN is not served by some massive set of machines just waiting to keep the scraperbots happy. The site is, we think, reasonably efficiently written, and is generally responsive. But when traffic spikes get large enough, the effects will be felt by our readers; that is when we start to get rather grumpier than usual. And it is not just us; this problem has been felt by maintainers of resources all across our community and beyond.
In discussions with others and through our own efforts, we have looked at a number of ways of dealing with this problem. Some of them are more effective than others.
For example, the first suggestion from many is to put the offending scrapers into robots.txt, telling them politely to go away. This approach offers little help, though. While the scraperbots will hungrily pull down any content on the site they can find, most of them religiously avoid ever looking at robots.txt. The people who run these systems are absolutely uninterested in our opinion about how they should be accessing our site. To make this point even more clear, most of these robots go out of their way to avoid identifying themselves as such; they try as hard as possible to look like just another reader with a web browser.
Throttling is another frequently suggested solution. The LWN site has implemented basic IP-based throttling for years; even in the pre-AI days, it would often happen that somebody tried to act on a desire to download the entire site, preferably in less than five minutes. There are also systems like commix that will attempt to exploit every command-injection vulnerability their developers can think of, at a rate of thousands per second. Throttling is necessary to deal with such actors but, for reasons that we will get into momentarily, throttling is relatively ineffective against the current crop of bots.
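As a purely illustrative sketch (this is not LWN's code, and the limits are invented), basic per-address throttling amounts to keeping a sliding window of recent request timestamps per client and refusing service once a threshold is crossed:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60           # invented numbers; a real site would tune these
MAX_HITS_PER_WINDOW = 30

_recent = defaultdict(deque)  # throttle key -> timestamps of recent requests

def allow_request(key, now=None):
    """Return True if this request is under the limit for its key."""
    now = time.monotonic() if now is None else now
    window = _recent[key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()      # drop hits that have aged out of the window
    if len(window) >= MAX_HITS_PER_WINDOW:
        return False          # over the limit; answer with HTTP 429, say
    window.append(now)
    return True
```

In the simplest case the key is just the client's IP address; the following paragraphs explain why that is no longer enough.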
Others suggest tarpits, such as Nepenthes, that will lead AI bots into a twisty little maze of garbage pages, all alike. Solutions like this bring an additional risk of entrapping legitimate search-engine scrapers that (normally) follow the rules. While LWN has not tried such a solution, we believe that this, too, would be ineffective. Among other things, these bots do not seem to care whether they are getting garbage or not, and serving garbage to bots still consumes server resources. If we are going to burn kilowatts and warm the planet, we would like the effort to be serving a better goal than that.
But there is a deeper reason why both throttling and tarpits do not help: the scraperbots have been written with these defenses in mind. They spread their HTTP activity across a set of IP addresses so that none reach the throttling threshold. In some cases, those addresses are all clearly coming from the same subnet; a certain amount of peace has been obtained by treating the entire Internet as a set of class-C subnetworks and applying a throttling threshold to each. Some operators can be slowed to a reasonable pace in this way. (Interestingly, scrapers almost never use IPv6).
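The per-subnet variant is then just a change of key: collapse each client address to its enclosing /24 before counting, so that a scraper spreading its requests across one subnet still trips the threshold. A sketch using the standard ipaddress module (the /64 choice for IPv6 is an assumption; as noted, the scrapers rarely use IPv6 at all):

```python
import ipaddress

def throttle_key(client_ip: str) -> str:
    """Map an address to its /24 (IPv4) or /64 (IPv6) network for counting."""
    addr = ipaddress.ip_address(client_ip)
    prefix = 24 if addr.version == 4 else 64
    return str(ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False))

# e.g. allow_request(throttle_key("198.51.100.37")) in the sketch above
# counts the hit against 198.51.100.0/24 rather than the single address.
```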
But, increasingly, the scraperbot traffic does not fit that pattern. Instead, traffic will come from literally millions of IP addresses, where no specific address is responsible for more than two or three hits over the course of a week. Watching the traffic on the site, one can easily see scraping efforts that are fetching a sorted list of URLs in an obvious sequence, but the same IP address will not appear twice in that sequence. The specific addresses involved come from all over the globe, with no evident pattern.
In other words, this scraping is being done by botnets, quite likely bought in underground markets and consisting of compromised machines. There really is not any other explanation that fits the observed patterns. Once upon a time, compromised systems were put to work mining cryptocurrency; now, it seems, there is more money to be had in repeatedly scraping the same web pages. When one of these botnets goes nuts, the result is indistinguishable from a distributed denial-of-service (DDOS) attack — it is a distributed denial-of-service attack. Should anybody be in doubt about the moral integrity of the people running these systems, a look at the techniques they use should make the situation abundantly clear.
That leads to the last suggestion that often is heard: use a commercial content-delivery network (CDN). These networks are working to add scraperbot protections to the DDOS protections they already have. It may come to that, but it is not a solution we favor. Exposing our traffic (and readers) to another middleman seems undesirable. Many of the techniques that they use to fend off scraperbots — such as requiring the user and/or browser to answer a JavaScript-based challenge — run counter to how we want the site to work.
So, for the time being, we are relying on a combination of throttling and some server-configuration work to clear out a couple of performance bottlenecks. Those efforts have had the effect of stabilizing the load and, for now, eliminating the site delays that we had been experiencing. None of this stops the activity in question, which is frustrating for multiple reasons, but it does prevent it from interfering with the legitimate operation of the site. It seems certain, though, that this situation will only get worse over time. Everybody wants their own special model, and governments show no interest in impeding them in any way. It is a net-wide problem, and it is increasingly unsustainable.
LWN was born in the era when the freedom to put a site onto the Internet was a joy to experience. That freedom has since been beaten back in many ways, but still exists for the most part. If, though, we reach a point where the only way to operate a site of any complexity is to hide it behind one of a tiny number of large CDN providers (each of which probably has AI initiatives of its own), the net will be a sad place indeed. The humans will have been driven off (admittedly, some may see that as a good thing) and all that will be left is AI systems incestuously scraping pages from each other.
Index entries for this article: Security/Web
Posted Feb 14, 2025 16:42 UTC (Fri)
by chexo4 (subscriber, #169500)
[Link] (36 responses)
Posted Feb 14, 2025 17:07 UTC (Fri)
by corbet (editor, #1)
[Link] (35 responses)
Posted Feb 14, 2025 17:32 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (18 responses)
Posted Feb 14, 2025 17:37 UTC (Fri)
by corbet (editor, #1)
[Link] (17 responses)
Posted Feb 14, 2025 18:07 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (16 responses)
making each URL unique per user?
Hmm. Might be problematic because it's often used for tracking.
Maybe if you made it really obvious for any humans:
Posted Feb 14, 2025 19:06 UTC (Fri)
by mb (subscriber, #50428)
[Link] (15 responses)
Posted Feb 14, 2025 21:15 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (14 responses)
Posted Feb 14, 2025 21:20 UTC (Fri)
by mb (subscriber, #50428)
[Link] (13 responses)
Posted Feb 14, 2025 21:59 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (11 responses)
Typically because it followed a honeypot link; at that point you give it a web page consisting only of such links.
The idea is that the bot will spread these links to other members of the botnet, so subsequent bots from other IPs will be immediately recognised and get the same treatment. Hopefully, over time this should direct most of the botnet over to the sacrificial server and leave the real one alone.
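To make the scheme concrete, a rough sketch (the path prefix, token format, and signing key are all invented): honeypot URLs carry an HMAC, so a later request for any of them, from whatever address, can be recognized without keeping per-URL state on the server:

```python
import hashlib, hmac, secrets

SECRET = b"replace-with-a-real-secret"   # hypothetical signing key
PREFIX = "/trap/"                        # hypothetical honeypot path prefix

def make_honeypot_link() -> str:
    """Generate a unique honeypot URL whose token we can later verify."""
    nonce = secrets.token_hex(8)
    sig = hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{PREFIX}{nonce}-{sig}"

def is_honeypot_hit(path: str) -> bool:
    """True if a requested path is one of our generated honeypot URLs."""
    if not path.startswith(PREFIX):
        return False
    try:
        nonce, sig = path[len(PREFIX):].split("-", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(sig, expected)

def trap_page(n_links: int = 50) -> str:
    """A page consisting only of further honeypot links."""
    links = "\n".join(f'<a href="{make_honeypot_link()}">more</a>'
                      for _ in range(n_links))
    return f"<html><body>{links}</body></html>"
```

Any client that asks for one of the /trap/ URLs can then be handed off to the sacrificial server.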
Posted Feb 14, 2025 22:05 UTC (Fri)
by mb (subscriber, #50428)
[Link] (10 responses)
But it's already over. You served the request and you spent the resources.
Posted Feb 14, 2025 22:27 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (9 responses)
It's not. The bot will report the links it found back to the rest of the botnet and then other bots will come for those links.
> multi terabyte timeout-less database
No database is needed.
Posted Feb 14, 2025 22:28 UTC (Fri)
by mb (subscriber, #50428)
[Link] (8 responses)
And consume traffic and CPU.
Posted Feb 14, 2025 23:01 UTC (Fri)
by malmedal (subscriber, #56172)
[Link] (7 responses)
From the sacrificial server, yes. So the real one gets less load.
Posted Feb 14, 2025 23:15 UTC (Fri)
by mb (subscriber, #50428)
[Link] (6 responses)
Which costs real, non-sacrificial money. Why would it cost less money than the "real" server?
This is a real problem.
This is a real threat to users. I am currently selecting b), because I think I can't win a).
Posted Feb 15, 2025 0:32 UTC (Sat)
by dskoll (subscriber, #1630)
[Link] (5 responses)
The sacrificial server can be less beefy than the real server because it doesn't have to generate real content that might involve DB lookups and such. And it can dribble out responses very slowly (like 10 bytes per second) to keep the bots connected but not tie up a whole lot of bandwidth, using something like this.
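The linked tool is not shown here, but the general shape of such a slow-drip responder can be sketched with asyncio; the ten-bytes-per-second figure comes from the comment above, everything else is invented:

```python
import asyncio

BYTES_PER_SECOND = 10        # rate suggested in the comment above
FILLER = b"<a href=x>.</a> "

async def tarpit(reader, writer):
    """Answer any request with an endless, very slow stream of filler."""
    await reader.read(1024)   # swallow whatever the client sent
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    try:
        while True:
            writer.write(FILLER[:BYTES_PER_SECOND])
            await writer.drain()
            await asyncio.sleep(1)
    except (ConnectionResetError, BrokenPipeError):
        pass                  # the bot gave up; that's fine
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(tarpit, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```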
Posted Feb 15, 2025 0:49 UTC (Sat)
by mb (subscriber, #50428)
[Link] (4 responses)
Posted Feb 15, 2025 1:58 UTC (Sat)
by dskoll (subscriber, #1630)
[Link]
Yes, sure, but you might be able to tie some of them up in the tar pit for a while. Ultimately, a site cannot defend against a DDOS on its own; it has to rely on its upstream provider(s) to do their part.
My reply was for the OP who asked how the sacrificial server could be run more cheaply than the real server.
Posted Feb 15, 2025 10:28 UTC (Sat)
by malmedal (subscriber, #56172)
[Link] (2 responses)
LWN does not want to do things like captcha, js-challenges or putting everything behind a login, can you think of a better approach while adhering to the stated constraints?
Posted Feb 15, 2025 10:35 UTC (Sat)
by mb (subscriber, #50428)
[Link] (1 responses)
No. That was my original point.
Posted Feb 15, 2025 10:53 UTC (Sat)
by malmedal (subscriber, #56172)
[Link]
Posted Feb 20, 2025 17:22 UTC (Thu)
by hubcapsc (subscriber, #98078)
[Link]
The article mentions: "Watching the traffic on the site, one can easily see scraping efforts
It should be possible, then, to write a program that can see the scraping efforts in the traffic.
Back in the good old days when spam was simpler, it was easy to see that you'd opened a
Posted Feb 14, 2025 17:46 UTC (Fri)
by karkhaz (subscriber, #99844)
[Link] (12 responses)
> avoid forcing readers to enable JavaScript
You already generate a different page for logged-in readers, right? Would it be easy to simply omit the JavaScript countermeasures for subscribers?
Posted Feb 14, 2025 17:50 UTC (Fri)
by corbet (editor, #1)
[Link] (11 responses)
We could definitely disable it for logged-in users (we already do that with some countermeasures). But you still have to get past the login form first.
Posted Feb 16, 2025 6:56 UTC (Sun)
by wtarreau (subscriber, #51152)
[Link] (4 responses)
There are also approaches that consist of differentiating possibly good from possibly bad actors. You'll notice, for example, that bots that scrape the site's contents do not retrieve /favicon.ico, because it's not linked to from the pages; only browsers do. This can be used to tag incoming requests as "most likely suspicious" until /favicon.ico is seen. The same goes for some of the images, like the logo on the top left, etc.
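A toy version of that heuristic (the threshold and bookkeeping are invented; a real implementation would have to be far more forgiving, since browsers cache favicon.ico and some legitimate clients never fetch it):

```python
import time

BROWSER_HINT_PATHS = {"/favicon.ico"}   # could also include the site logo, CSS, ...
TRUST_TTL = 24 * 3600                   # invented: how long a hint is remembered
_last_hint = {}                         # client address -> time of last hint request

def looks_like_browser(client_ip: str, path: str) -> bool:
    """Record hint requests; report whether this client has recently sent one."""
    now = time.time()
    if path in BROWSER_HINT_PATHS:
        _last_hint[client_ip] = now
        return True
    seen = _last_hint.get(client_ip)
    return seen is not None and now - seen < TRUST_TTL
```

Clients that never start looking browser-like would then be candidates for the slow-down treatment described below, rather than for outright blocking.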
When downloads are spread over many clients, it's common for these clients to be dumb and to only download content. That content is then processed, links are extracted and sent to a central place where they're deduplicated, and distributed again to a bunch of clients. So there's usually no continuity in the URLs visited by a given client. Some sites, such as blogs, often have a relation in their URLs that allows one to figure out whether requests for a page's objects mostly come from the same address (and session) as the main page. If objects from the same page come from ten different addresses in a short time, there's definitely an abuse, and you can flag both the session and those IPs as malicious and slow them down.
Overall, and most importantly, you must not block, only slow down. That allows you to deal with false positives (and there are still a lot of them in these situations) without hurting the site's accessibility significantly. Usually, bots that retrieve content have short timeouts because they're used to hurt web sites: their goal is not to stay hooked waiting for a site, but to download, so short timeouts let them reuse their resources to visit another place. It means that often, causing a few seconds of pause from time to time can be sufficient to significantly slow them down and prevent them from hurting your site.
Also it's important to know what costs you more during such scraping sessions: bandwidth? (then it's possible to rate-limit the data), CPU? (then it's possible to rate-limit the requests), memory? (then it's possible to limit the concurrent requests per block), etc. Let's just not act on things that could possibly degrade the site's experience if not needed. E.g. if the bandwidth is virtually unlimited, better act on other aspects, etc.
However, one important piece of advice is to *not* publicly describe the methods you've chosen. StackExchange figured this out by themselves: after publishing the article above, they later had to tweak their setup because scrapers adapted. You'd be amazed to see how many scrapers are started by developers who visit the site before and during scraping to check what they're missing. If you describe your protections in too much detail, they have all the info they need to bypass the limits. That's another benefit of the slow-down, by the way: from the scraper's perspective there's no difference between slowing down by policy and slowing down due to intense activity, and the lack of feedback to the attacker is crucial here.
Posted Feb 17, 2025 13:36 UTC (Mon)
by paavaanan (subscriber, #153846)
[Link] (1 responses)
Posted Feb 20, 2025 10:32 UTC (Thu)
by jtepe (subscriber, #145026)
[Link]
Posted Mar 25, 2025 0:13 UTC (Tue)
by dagobayard (subscriber, #174025)
[Link] (1 responses)
--
Posted Aug 30, 2025 14:46 UTC (Sat)
by wtarreau (subscriber, #51152)
[Link]
Posted Feb 19, 2025 7:21 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (5 responses)
Not 100% sure what you meant here, but I guess even the dumbest scrapers will not download the (javascript-free) login form a million times per minute when they fail to find anything else on the unauthenticated, javascript-enabled site.
Posted Feb 19, 2025 10:39 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (1 responses)
If they stick to HTTP GET and downloading content that's "freely available upon request", their only issue is copyright law, and it looks like training a model on content may not be (in and of itself) a breach of copyright. This is a nice safe space to be; it's really hard to come up with language that will make use of your scraper criminal without also making it illegal to (for example) use an outdated browser to view content, since GET is supposed to be a mere read of the site, logically speaking, so you get implicit authorization for the purposes of things like the US's CFAA, or the UK's Computer Misuse Act in as far as the site doesn't block you.
If they issue a POST request, which is explicitly intended to alter the site's behaviour in some fashion, they get into a riskier place; a login form is gating access behind an agreement you made with the site operator, for example, and you've now written code that is designed to ignore that agreement and get access anyway. That, in turn, runs the risk of bringing the CFAA and similar laws in other countries into play, elevating use of your scraper from at most a civil matter to a criminal matter. And while you might feel well-funded enough to survive a civil challenge, a criminal case is a very different beast.
Posted Feb 19, 2025 14:02 UTC (Wed)
by daroc (editor, #160859)
[Link]
I'll keep that in my ideas folder in case the bot traffic steps up another notch.
Posted Feb 19, 2025 14:44 UTC (Wed)
by corbet (editor, #1)
[Link] (2 responses)
Posted Feb 19, 2025 15:42 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (1 responses)
Right, so they will not understand the javascript-free login form much, not feel "frustrated" by it and will not scrape it more frequently than any other page. They will not scrape it more frequently in order to reach some non-existent, per-site quota either.
So, if all other pages are made inaccessible unless you're either logged in OR answer some javascript challenge, then the javascript-free login form is the only page they scrape, the entire site is just one page for them and that's a huge win from a load perspective. No?
And the site is still accessible without Javascript! You just need to be logged in.
What did I miss? (Besides an even better solution)
Posted Feb 19, 2025 15:56 UTC (Wed)
by corbet (editor, #1)
[Link]
It may come to that, but I do not want to put impediments in the way of people reading our stuff. The paywall is already a big one, but many people come into LWN via our archives, and putting barriers there would surely turn some of them away. We will put a lot of energy into exploring alternatives before we do that.
Posted Feb 15, 2025 4:37 UTC (Sat)
by DimeCadmium (subscriber, #157243)
[Link]
(And yes, that attention is appreciated)
Posted Feb 16, 2025 4:23 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
Thank you.
Posted Feb 28, 2025 14:17 UTC (Fri)
by jqpg (guest, #176259)
[Link]
1- A fake index.html asking for a simple math challenge (not a JS challenge), then go to the real index.html.
2- Put *just only* old/some types of content behind such a challenge.
3- Reject all clients that don't have some custom "http header" or some special "user agent" value.
4- Every link on every page must be suffixed (concatenated) with some "special keyword" only visible to human *eyes* to access.
5- So on, so on ...
forgive me for being *too* clever, and my English :)
Posted Feb 14, 2025 17:22 UTC (Fri)
by rgmoore (✭ supporter ✭, #75)
[Link] (3 responses)
Posted Mar 27, 2025 13:29 UTC (Thu)
by Velmont (guest, #46433)
[Link] (2 responses)
It made tons of money by selling proxying through residential IPs. Users got Hola VPN for free by selling their IP, that is, letting it be used as a proxy by companies that wanted to scrape or check sites from 'real' IPs. That service was called Luminati back then, but it is now Bright Data.
I believe the scrapers are using that company, or some competitor, though I guess Hola + Bright Data is probably the biggest provider around still. So that's why the requests seem to come from all around the world with no real system to the IPs, it is because they are *actually* coming from real residential homes and computers.
> 299. million free users
So legally obtained even, since it's in the TOS and even marketing material for Hola.
All very sad.
Posted Mar 27, 2025 15:21 UTC (Thu)
by paulj (subscriber, #341)
[Link] (1 responses)
Posted Mar 27, 2025 16:52 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link]
Posted Feb 14, 2025 17:59 UTC (Fri)
by smoogen (subscriber, #97)
[Link] (12 responses)
Posted Feb 14, 2025 18:28 UTC (Fri)
by notriddle (subscriber, #130608)
[Link] (7 responses)
It's impossible to violate the GPL without first violating plain copyright ("You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to [four freedoms]."). If something else *does* grant you that permission, like how search engines and critical reviews are allowed to excerpt primary sources under fair use, then the GPL doesn't matter.
And if it turns out that training an LLM isn't fair use, then the GPL probably isn't permissive enough, because LLMs don't seem to be able to accurately fulfill the attribution requirement.
Posted Feb 14, 2025 19:17 UTC (Fri)
by daroc (editor, #160859)
[Link]
... which, admittedly, it should probably actually say somewhere in the site footer and in the FAQ, not just on some old 'about the site' pages.
During the subscriber period (one week after the first weekly edition in which the article appears), they are plain old all-rights-reserved, though.
Posted Feb 14, 2025 23:01 UTC (Fri)
by khim (subscriber, #9252)
[Link] (3 responses)
Note that one of the reasons why courts declared that search engines have an implied license was the fact that they followed robots.txt rules. Exceptio probat regulam in casibus non exceptis principle gave them implied license. But I wonder what these guys who ignore all these rules and use botnets say about all that. Most likely something like “if we wouldn't do that then we are screwed, but if we do then small fine is the most they would push on us”… I suspect.
Posted Feb 15, 2025 16:21 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (2 responses)
It makes you wonder: are all these people writing their own scrapers, or are they using some common scraper library that doesn't have robots.txt support. In the latter case you could fix the library to default to honouring the robots.txt and at least solve it for the people who just fire and forget.
Posted Feb 15, 2025 17:15 UTC (Sat)
by Tobu (subscriber, #24111)
[Link] (1 responses)
Posted Feb 15, 2025 18:18 UTC (Sat)
by dskoll (subscriber, #1630)
[Link]
Wow, the maintainer of img2dataset is a piece of work...
I searched my logs for the default user-agent it uses to pretend to be something else, and it only ever hits images... never any real pages. So I've blocked that user-agent. It now gets 403 Forbidden.
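Most people would do this kind of blocking in the web-server configuration, but the same idea expressed as a small WSGI middleware looks roughly like this (the blocklist entry is a placeholder, not the actual user-agent string in question):

```python
BLOCKED_UA_SUBSTRINGS = ("ExampleScraper",)   # placeholder; put the offending strings here

class BlockByUserAgent:
    """WSGI middleware: answer 403 Forbidden when the User-Agent is blocklisted."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(s.lower() in ua.lower() for s in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```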
Posted Feb 16, 2025 0:25 UTC (Sun)
by edeloget (subscriber, #88392)
[Link] (1 responses)
The GPL itself might not be relevant, but surely we can craft a specific license for that usage, one that 1) plays nice with all other licenses, including non-free ones 2) is highly viral by nature.
The goal would not be to make all published text freely available, but to either force the distribution of the dataset in which the text is included or, if the dataset contains copyrighted materials that cannot be redistributed, forbid the integration of the work. The license clauses would only trigger when the work is integrated into a dataset, and the license text could even be written so that it also contaminates adjacent datasets that are used in conjunction with a dataset already covered by this license.
Of course, once the dataset is contaminated, you'd also better make sure that the models created using this dataset shall be made free (as in speech). A license that creates new freedoms for users - that does not seem impossible to do, does it? A "Dataset Freedom License".
(Also, make sure that the type of work is not enforced by the license, so that you could protect anything with it, ranging from code to images to songs...)
With such a license, the scraping bots are still an annoyance, but any indication that a particular covered work ended up in a specific dataset could be the cause of some spectacular court actions :) And I think that can make dataset creators think twice.
Posted Feb 16, 2025 7:55 UTC (Sun)
by kleptog (subscriber, #1183)
[Link]
That would require that copyright law accepts the idea that an AI model is a derivative of its input data in the copyright law sense. I really don't see that happening.
The most you can hope for is that there is a requirement that the input data was lawfully obtained, but I suspect the licence isn't going to be relevant.
Caveat: neither the legislatures in Europe nor the Supreme Court in the US have addressed the issue directly yet, so who knows. The EU's AI Act so far supports this.
Posted Feb 17, 2025 19:19 UTC (Mon)
by ringerc (subscriber, #3071)
[Link] (3 responses)
I'm not free from sin when it comes to self serving arguments about things I want to download and use under terms other than those the legal owners (if not creators) intended. But I'm also not running a multi billion dollar business on it.
Imagine for one glorious moment that the Napster theory of copyright infringement damages was applied to GenAI outfits.
Posted Feb 18, 2025 9:42 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
For better or for worse, this is not *too* far removed from how courts would actually analyze the situation (except that courts in most jurisdictions would remove the "in its entirety" part and replace it with a big complicated gray area). See e.g. https://en.wikipedia.org/wiki/Substantial_similarity
Posted Feb 18, 2025 11:14 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
Except that that is - sort of - codified in law! After all, what GenAI scrapers are doing is *exactly* *the* *same* as you reading a book. The book is copyright, and you've just copied that material from the pages of the book into your brain! And if you can't regurgitate that book word for word, then that MAY be fine.
Which is - in Europe at least - EXACTLY how AI scrapers are treated. If the *output* bears little resemblance to the training data, then copyright is not violated. If it's regurgitated exactly, then copyright is violated. European law explicitly says slurping the data IN is not a copyright violation. It says nothing about regurgitating it ... and it says nothing about other laws, like trespassing to gain access to it.
The problem of course, is the people who don't give a damn about wasting OTHER PEOPLE's money, like LWN's. In other words, people who ignore restrictions like robots.txt. I think European law treats that like a door-lock - "don't come in without permission", but the trouble is enforcing it :-(
Cheers,
Posted Feb 18, 2025 13:05 UTC (Tue)
by pizza (subscriber, #46)
[Link]
I've increasingly blocked more and more swaths of the internet from servers I control. Even after that, bandwidth usage has gone up by approximately two orders of magnitude versus 2020, and when you're paying for server resources entirely out of your own pocket, throwing more resources at the problem gets really hard to justify.
Problem is that the "legit" players are also playing this same "gorge yourself as quickly as possible'" game. For example, I've had to twice block Google's AI scraper for effectively performing DoS attacks. This behavior is distinct from their pre-ai scraper, which remains well-behaved.
[/grumble]
Posted Feb 14, 2025 18:02 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Feb 14, 2025 21:03 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Feb 14, 2025 22:49 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Feb 14, 2025 19:01 UTC (Fri)
by HenrikH (subscriber, #31152)
[Link] (7 responses)
Posted Feb 14, 2025 20:55 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
They can just forge the Referer header. I would be surprised if they did not already do that anyway.
Posted Feb 14, 2025 21:06 UTC (Fri)
by corbet (editor, #1)
[Link] (3 responses)
We also would not want to get into the business of deciding which search engines are legitimate. And even if we could do that properly, the referrer information is entirely under the control of the bots, there is no way to know that a hit actually came from a given search engine. So it's an interesting thought, but I don't see it as being workable.
Posted Feb 15, 2025 20:02 UTC (Sat)
by HenrikH (subscriber, #31152)
[Link]
Well if not paywalled then very old articles could perhaps be rate limited, aka not per ip but for all articles as such, ofc problem then is that the bots will fill up the limit preventing the real readers from accessing the old articles... Yeah this stuff is hard.
Posted Feb 19, 2025 14:12 UTC (Wed)
by draco (subscriber, #1792)
[Link] (1 responses)
That way, direct links work for archival purposes, but if you want to navigate the site, you need to log in.
Heck, based on https://lwn.net/Articles/1010764/ (farnz's CFAA comment), it wouldn't even need to be a real login, just a page where you agree to terms of service and get a cookie recording that after POSTing your agreement.
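A bare-bones sketch of that agree-and-get-a-cookie flow, using only the Python standard library (the paths, wording, and unsigned cookie are invented; a real site would at least sign the cookie and serve its actual content):

```python
from http import cookies
from wsgiref.simple_server import make_server

AGREE_FORM = (b"<html><body><form method=POST action=/agree>"
              b"<p>To browse the archives, please agree to the terms of use.</p>"
              b"<button>I agree</button></form></body></html>")

def app(environ, start_response):
    """Gate everything except the agreement form behind an 'agreed' cookie."""
    jar = cookies.SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if environ["REQUEST_METHOD"] == "POST" and environ["PATH_INFO"] == "/agree":
        # The POST records the agreement; the cookie remembers it.
        start_response("303 See Other",
                       [("Location", "/"),
                        ("Set-Cookie", "agreed=1; Path=/; HttpOnly")])
        return [b""]
    if "agreed" in jar:
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<html><body>The real content goes here.</body></html>"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [AGREE_FORM]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```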
Posted Apr 5, 2025 4:15 UTC (Sat)
by yodermk (subscriber, #3803)
[Link]
One issue might be search engines, but I suppose the main engines publish a list of their indexer IP ranges, and if a request is from one of them, you just include the links.
Posted Feb 15, 2025 8:17 UTC (Sat)
by nhippi (subscriber, #34640)
[Link] (1 responses)
Posted Feb 15, 2025 12:02 UTC (Sat)
by Wol (subscriber, #4433)
[Link]
People don't have to create a login, remember a password, any of that crap. Yes it's a hassle, but if you change it every six weeks (maybe less) with an eight week "remember me" cookie, it's not much grief to real people.
Cheers,
Posted Feb 14, 2025 19:41 UTC (Fri)
by Curan (subscriber, #66186)
[Link]
(If you want to be really nasty: redirect them to pages with code, that contains subtle errors, leading to eg. memory leaks. Though, going by the quality of all the code suggestions I've seen so far: they're on that way already.)
All that being said: defending against an actual DDoS is hard. But very often your hosting provider can help; mine does that automatically once they detect an attack. And they run their own on-premises system; no external CDNs, especially stuff like CloudFlare, are used.
Posted Feb 14, 2025 21:59 UTC (Fri)
by tux3 (subscriber, #101245)
[Link]
I naively like the nepenthes idea from an optimization perspective: assume the bots cannot be blocked or throttled; serve them pages that are very cheap on the server.
They will continue making requests - hopefully not much faster than before - but ideally a large fraction of these is steered away from the expensive path.
Well... even that idea feels like a loss.
Posted Feb 14, 2025 23:54 UTC (Fri)
by NUXI (subscriber, #70138)
[Link]
I noticed they all had rather outdated Chrome user agents though. So now anything that ignores robots.txt and has an old enough Chrome user agent gets served up the EICAR test file.
Posted Feb 15, 2025 1:13 UTC (Sat)
by pablotron (subscriber, #105494)
[Link] (8 responses)
I modified my robots.txt several weeks ago based on the information in the URL above and the amount of LLM scraper traffic seems to have dropped since then.
Posted Feb 15, 2025 1:15 UTC (Sat)
by pablotron (subscriber, #105494)
[Link]
Sorry about that...
Posted Feb 15, 2025 1:25 UTC (Sat)
by mb (subscriber, #50428)
[Link] (6 responses)
I blocked GPTBot via User-Agent as 403-Forbidden, because it ignores robots.txt. It's "good" that GPTBot still sends a User-Agent. But most "AI" idiots don't send a User-Agent.
Sad.
In over two decades I have never blocked one of the traditional search engines.
"AI" idiots stop it. You are destroying the Internet/WWW.
Posted Feb 15, 2025 2:50 UTC (Sat)
by DemiMarie (subscriber, #164188)
[Link] (5 responses)
Posted Feb 15, 2025 3:08 UTC (Sat)
by corbet (editor, #1)
[Link] (3 responses)
Posted Feb 15, 2025 17:58 UTC (Sat)
by DemiMarie (subscriber, #164188)
[Link] (2 responses)
Posted Feb 15, 2025 18:02 UTC (Sat)
by corbet (editor, #1)
[Link] (1 responses)
Posted Feb 16, 2025 7:05 UTC (Sun)
by wtarreau (subscriber, #51152)
[Link]
Posted Feb 15, 2025 17:02 UTC (Sat)
by farnz (subscriber, #17727)
[Link]
They go to a lot of effort to make it very hard to block scraperbots without catching a significant fraction of your user base in the cross-fire :-(
Posted Feb 15, 2025 4:34 UTC (Sat)
by DemiMarie (subscriber, #164188)
[Link] (1 responses)
Posted Mar 12, 2025 13:08 UTC (Wed)
by sammythesnake (guest, #17693)
[Link]
Posted Feb 15, 2025 17:29 UTC (Sat)
by Tobu (subscriber, #24111)
[Link]
A captcha, and a cookie that signs the IP and datetime of successful captcha completion, might help with the spread over residential IP addresses? I don't think it has to require JS, just to make it uneconomical to disguise the crawl in this way. The signed cookie is there so that no storage is required on the server. (Sorry for giving “free advice”, your work on the site is appreciated, and thank you for giving us an instructive update as well)
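A sketch of such a stateless signed cookie (the key handling, lifetime, and layout are all invented): the server can verify it with no lookup at all, and it is only valid from the address it was issued to:

```python
import hashlib, hmac, time

SECRET = b"rotate-me-regularly"          # hypothetical server-side key
MAX_AGE = 7 * 24 * 3600                  # hypothetical validity period

def issue_cookie(client_ip: str) -> str:
    """Issued after a successful captcha: 'timestamp:signature'."""
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, f"{client_ip}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def cookie_is_valid(cookie: str, client_ip: str) -> bool:
    """Check the signature (bound to the client address) and the age; no storage needed."""
    try:
        ts, sig = cookie.split(":", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{client_ip}:{ts}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return time.time() - int(ts) < MAX_AGE
```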
Posted Feb 15, 2025 19:19 UTC (Sat)
by sroracle (guest, #124960)
[Link] (5 responses)
Posted Feb 16, 2025 2:03 UTC (Sun)
by GoodMirek (guest, #101902)
[Link]
Posted Feb 20, 2025 15:40 UTC (Thu)
by PastyAndroid (subscriber, #168019)
[Link] (3 responses)
A solution other than CDN captchas would be ideal, from my perspective. My biggest gripe is captchas on websites using Cloudflare that do not let me past no matter what I do, because of my browser's privacy settings. It gives me the tick, then refreshes the page and asks me to endlessly re-do the captcha - I stop visiting websites that do this. I generally see that page now and just close it without even attempting said captcha.
[1] I think I'm human anyway? I do question it sometimes ;-)
Posted Feb 20, 2025 23:32 UTC (Thu)
by Klaasjan (subscriber, #4951)
[Link] (2 responses)
Posted Feb 21, 2025 14:02 UTC (Fri)
by PastyAndroid (subscriber, #168019)
[Link] (1 responses)
If you wish to cease the test and dump the data at any given time please respond with "2". Before we begin our test, let's look at some possible outcomes.
Possible outcomes of a successful test:
Warning: An error occurred at line 15,663 in safenet.py. Response 1 of 443 has stopped unexpectedly, please see the logs or further information. To disable this warning please toggle "CONFIG_IGNORE_AI_ERRORS" to false.
Posted Feb 21, 2025 15:05 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Certainly much more pleasurable than thinking I'm watching an AI pretending to be human.
Time for this stuffed bird to fly :-)
Cheers,
Posted Feb 16, 2025 1:14 UTC (Sun)
by aaronmdjones (subscriber, #119973)
[Link] (7 responses)
A Class C network is one whose IP addresses begin with the bits "1 1 0". Thus, only IP addresses 192.0.0.0 - 223.255.255.255 are Class C. It is a common misconception to say that all /24 networks are Class C networks, only because the Class C network space was originally divided up into individual allocations whose size was /24.
Class A, B and C networks have also not existed for decades. CIDR killed classful networking when it was introduced by RFC 1519 in September 1993.
Posted Feb 16, 2025 1:47 UTC (Sun)
by corbet (editor, #1)
[Link] (6 responses)
Posted Feb 16, 2025 3:35 UTC (Sun)
by aaronmdjones (subscriber, #119973)
[Link]
Posted Feb 16, 2025 3:40 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
This is a site full of pedantic nerds. So yes, it was necessary.
Posted Feb 16, 2025 18:20 UTC (Sun)
by draco (subscriber, #1792)
[Link] (3 responses)
On the other hand, the idea of class A/B/C networks is still taught in networking courses, even though it has been irrelevant and unhelpful for decades now. The sooner we stop referencing it, the better.
Posted Feb 17, 2025 12:05 UTC (Mon)
by khim (subscriber, #9252)
[Link] (2 responses)
What would that change? They are still teaching networking using the OSI model. A stillborn model for a stillborn set of protocols (are any of them still in use? I think Microsoft used X.500, which turned into LDAP later). Compared to that attempt to “learn how an airplane works using blueprints from a steam train”, the fact that they also teach A/B/C networks is minor. You have to forget most of what you were taught and relearn how things actually work when you get a networking-related job, anyway.
Posted Feb 19, 2025 19:58 UTC (Wed)
by antiphase (subscriber, #111993)
[Link] (1 responses)
I've worked with and continue to work with otherwise competent network admins who continue to make references to classful networking despite not being old enough to remember it nor having an understanding what it really means besides some inadequate proxy for subnet size, which CIDR replaces far more usefully.
It's long past time for it to be put to bed.
Posted Feb 19, 2025 20:13 UTC (Wed)
by khim (subscriber, #9252)
[Link]
Are you really sure they refer to “classful networking”? In my experience, for them “class C” is “a network with a /24 mask, that can be used with 192.168.x.0 or with 172.16.x.0”, “class B” is “a network with a /16 mask that needs 172.16.0.0 or 172.20.0.0, or maybe something in 10.x.0.0”, while “class A” is “a gigantic network, most likely 10.0.0.0” (the only one that may exist in such a twisted world). It's not related to “classful networking” as it existed years ago… but it's no worse than the craziness we get when trying to invent seven layers in a network that never had them. But they did! The problem is with you, not with them! They don't have any trouble understanding each other! Instead, you are causing trouble when you bring in things from an era long gone! But why? Why remove a useful term that helps in communication just because its original meaning is no longer relevant? One may as well [try to] ban the use of “byte” as “8-bit quantity” – equally pointless and useless in a world where other kinds of “bytes” don't exist. The result would be the same: people would laugh at you and continue to use what is convenient for them.
Posted Feb 16, 2025 11:34 UTC (Sun)
by Sesse (subscriber, #53779)
[Link] (5 responses)
Thank you for being the tome of both old and new Linux knowledge. I've been on the receiving end of these AI bots myself, and one of the most relevant solutions was just… to make it fast enough that it doesn't matter? If you've got ~750k pages, is there a reason why you can't just generate them all and cache them in Varnish or similar? (I mean, this is about CPU time and not bandwidth, right?)
For my part, I moved the only problematic vhost (a Gitweb instance with just way too many potential pages to precompute) to IPv6-only, and traffic dropped dramatically overnight :-) But I do believe this would be too painful for LWN, of course.
(As for JavaScript, I'm sure a lot of people are very vocal about this. As an ever so slight balancing, here's a “I don't really care” voice.)
Posted Feb 17, 2025 13:55 UTC (Mon)
by leigh (subscriber, #175596)
[Link] (1 responses)
Posted Feb 17, 2025 14:14 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Firstly LWN is paying for that bandwidth. And secondly, what happens if the scraping overwhelms the bandwidth? Ancient case in point - a company on an ISDN link was filtering its mail based on message header. Even downloading JUST the header, spam was arriving at their ISP faster than they could bounce it.
Cheers,
Posted Feb 17, 2025 14:09 UTC (Mon)
by daroc (editor, #160859)
[Link] (2 responses)
But making sure to do that correctly takes time, especially since subtle problems with the correctness of the cache are only likely to make themselves known when a user notices a discrepancy.
Posted Feb 17, 2025 14:11 UTC (Mon)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Feb 17, 2025 14:31 UTC (Mon)
by daroc (editor, #160859)
[Link]
So it's definitely not an insurmountable problem, just something to do with care.
Posted Feb 16, 2025 11:47 UTC (Sun)
by arnout (subscriber, #94240)
[Link]
We could easily see from the User-Agent strings that it was a lot of AI-bots, because many of them _are_ honest in their User-Agent (though they still do seem to ignore robots.txt). We first blacklisted a lot of those User-Agent strings (returning a 403), but it didn't do much more than make a dent in the traffic.
Fortunately there is a very simple solution for us. The full download URLs are normally constructed from the Buildroot metadata, they don't appear as a URL anywhere on the web. So we simply turned off directory listing, and the scrapers no longer have anything to scrape! It's still a step back because people do sometimes use directory listing (e.g. to check which versions are available and which ones are missing), but it's something we can live with.
Posted Feb 16, 2025 14:20 UTC (Sun)
by hunger (subscriber, #36242)
[Link] (2 responses)
They did a presentation at FOSDEM this year: https://fosdem.org/2025/schedule/event/fosdem-2025-5879-m...
I found the idea really interesting and it would IMHO be a good cultural fit for LWN.
Posted Feb 16, 2025 19:30 UTC (Sun)
by Henning (subscriber, #37195)
[Link] (1 responses)
Posted Feb 17, 2025 14:15 UTC (Mon)
by daroc (editor, #160859)
[Link]
For those who wish to avoid JavaScript at any cost — we do also take payment for subscriptions by check, although only on US banks.
Posted Feb 16, 2025 15:39 UTC (Sun)
by PlaguedByPenguins (subscriber, #3577)
[Link] (11 responses)
it seems crazy for LWN to spend 100's of hours of valuable time and resources and to eventually still capitulate to the "ignore robots.txt" crowd if just using javascript could solve the problem. or am I missing something?
FWIW I have no problems with it...
if people really care, then I'd suggest an LWN accounts setting that permits javascript to be disabled.
Posted Feb 16, 2025 17:26 UTC (Sun)
by Paf (subscriber, #91811)
[Link] (1 responses)
Posted Feb 17, 2025 14:17 UTC (Mon)
by daroc (editor, #160859)
[Link]
But you're right that if things get bad enough, we may be forced to come up with some kind of CAPTCHA system. Let's hope things don't get to that point.
Posted Feb 18, 2025 0:14 UTC (Tue)
by nickodell (subscriber, #125165)
[Link] (3 responses)
Posted Feb 18, 2025 9:38 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
You might argue that the attackers do not care, because they are (presumably) using a botnet anyway. But even if they are stealing the resources, they still have to allocate them, and they will likely prefer to target sites that do not require a headless browser over sites that do.
Posted Feb 19, 2025 14:12 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Feb 19, 2025 14:55 UTC (Wed)
by excors (subscriber, #95769)
[Link]
That was probably over 15 years ago, and nowadays Googlebot uses a headless Chromium to crawl some pages (https://developers.google.com/search/docs/crawling-indexi...). I don't know if there are some cheaper crawlers that still just look for JS strings, to get some of the benefit for much less cost.
Posted Feb 22, 2025 17:31 UTC (Sat)
by anton (subscriber, #25547)
[Link]
A side benefit is that many web pages that are a waste of time display very little without JavaScript. E.g., Google search has become more and more enshittified in recent years, and a month or two ago they crowned this work by requiring JavaScript. So I looked at duckduckgo again, and I find that it now produces better search results than google did (in addition to working without JavaScript).
Posted Feb 27, 2025 6:55 UTC (Thu)
by brunowolff (guest, #71160)
[Link] (3 responses)
Posted Feb 27, 2025 12:12 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (2 responses)
The converse of this is that if you break pages only for people who disable JavaScript, you break lots of bots, and very few humans.
Posted Feb 27, 2025 14:02 UTC (Thu)
by brunowolff (guest, #71160)
[Link] (1 responses)
Posted Feb 27, 2025 14:07 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
It's humans who run JavaScript - so you can attack humans, but not bots, via JavaScript.
Posted Feb 16, 2025 20:01 UTC (Sun)
by pomac (subscriber, #94901)
[Link]
If you set a tick for how often pages will be rerendered it could be used as the life time in memcached and thus limit CPU usage ...
Also in general, all the publicly available articles could be rendered in markdown and put in to git, serve it on kernel.org and mirror to GitHub, let githubs CDN handle them :)
Posted Feb 17, 2025 8:48 UTC (Mon)
by taladar (subscriber, #68407)
[Link] (1 responses)
Presumably that is because the kind of hands-off hosters used by scrapers like that invest the minimum amount of effort into their hosting offerings. IPv6 is still seen as something optional by many for some unfathomable reason so they won't invest time in learning how to offer it.
Posted Feb 17, 2025 13:00 UTC (Mon)
by ccr (guest, #142664)
[Link]
In my case, I've also noted that the scraping bots do not seem to use IPv6, but also that the vast majority of the scraping originated from cloud services of a certain nationality. All three of those cloud providers at least document IPv6 capabilities, so I'm not sure the "blame" can be placed on hosting alone .. of course the scrapers of LWN may be different from those who bombard my site, so .. shrug. :)
Eventually, after adjusting throttling settings a few times, I decided to simply drop all connections originating from those networks completely in netfilter. By networks, I mean the complete ranges assigned to those cloud providers' ASes. It's been quieter after that.
Posted Feb 17, 2025 20:09 UTC (Mon)
by babywipes (subscriber, #104169)
[Link]
I just wanted to post a sympathetic note and to thank you all for being one of the great websites on the internet that represent the, sadly, ever-shrinking population of folks who embody the original spirit of the internet. LWN is the best of the 'net and reminds me of that sense of discovery I had when I first got online in the 90s.
Posted Feb 18, 2025 9:28 UTC (Tue)
by lambert.boskamp (subscriber, #131534)
[Link]
The first, free version of LWN would provide anonymous access to some extent (as is the case today), but is protected by Cloudflare against DDoS and hence requires JavaScript. This is less convenient for drive-by users who e.g. click on historic LWN links found in kernel source, but it works. If any of these users is significantly annoyed by the captchas, they're free to think about becoming a subscriber.
The second version of the site would be for subscribers exclusively and is completely behind an authentication wall without any anonymous content at all. This version can keep JS-free as today. Subscribers can hence continue to use Lynx and be happy.
Posted Feb 18, 2025 17:55 UTC (Tue)
by IanKelling (subscriber, #89418)
[Link]
We wrote some about it here
Posted Feb 18, 2025 22:00 UTC (Tue)
by sageofredondo (subscriber, #157944)
[Link] (1 responses)
Posted Feb 19, 2025 17:00 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
To the affected sites it will look like hundreds of different users over a week got interested in parts of the site..
I think a lot of client code was written for advertising scams. You say your site has X amount of traffic with lots of users in this demographic mix. You then pay the botnet for those 'users' and serve tons of ads to bots and take a check from the advertisers. Because you want to show that 'real' content is being slurped the software was written to 'interact' in ways which will fool various captchas etc which advertisers may require to serve expensive ads. It just turns out to be useful if you want to slurp up the internet for 'free' also.
Posted Feb 23, 2025 5:33 UTC (Sun)
by jrw (subscriber, #69959)
[Link] (8 responses)
Posted Feb 23, 2025 13:11 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (7 responses)
The problem micropayments have repeatedly hit against is the cost of addressing fraud. If I can sneak an ad onto millions of pages at a cost of $0.001 per page, and have the ad collect a micropayment of $0.002 per viewer, I can make thousands of dollars of profit at the expense of a large number of defrauded consumers. Criminals are already quite happy to do similar tricks to serve malware from ads; they'd be even more happy to do it for a direct cash reward.
So you need a system (funded somehow) that will take me complaining that I don't recall visiting LWN.net or Ars Technica, and refuse to refund me, while still refunding the transactions that come from fraud. Thus far, no-one's successfully built such a system for the offline world, let alone the online world; transactions have a minimum amount not because there's no use for small transactions, but because the cost of fraud is proportional to the number of transactions, and trying to collect $0.002 per transaction to cover the cost of fraud is acceptable when transactions are $1 or more, but unacceptable when the transaction is of $0.001
Posted Feb 23, 2025 14:02 UTC (Sun)
by excors (subscriber, #95769)
[Link]
...unless you use a botnet, as reported by this article, in which case someone else is paying for the internet link (and potentially for the storage and processing power to deduplicate and filter and compress the data before sending the valuable parts back to you). Then it becomes another instance of fraud, where the person benefiting from the activity is not the person paying for the activity, and increasing the cost of traffic will have no impact.
Posted Feb 23, 2025 16:31 UTC (Sun)
by raven667 (subscriber, #5198)
[Link] (5 responses)
I wonder how many executives at telcos even remember toll calling, but I'm sure they'd like to bring back billing based on netflow data or DNS if they can take their cut, in the same way that cable TV based ISPs have reintroduced bundled billing after the migration to streaming using SAML SSO AFAICT. So you could either use a sample based guesstimate that has more room for fraud, or have a strong identity federated with your ISP as a prerequisite for access, requiring SAML/OAuth2/OIDC to handle the accounting for billing. You might end up with a system bifurcated by wealth where "free" data is predatory nonsense trying to defraud / manipulate and paid data is higher quality, although catered to the biases of the more wealthy.
In some ways it represents a regression from the original ideology of the Internet, where it was based on a flat fee to cover the infrastructure costs, and you could use it as much or as little as you like for whatever purpose, but that was assuming a more peer-like relationship between collections of endpoints rather than the more strictly client/server networks we have today, where the cost to be a publisher was fairly low and mostly covered by excess capacity in already-paid-for infrastructure (networks and personal computers/workstations), and there was mechanisms of accountability for abuse, if you got out of pocket they could call your Dean and hold your degree hostage, threatening your potential future lifestyle, until you learned to behave. In a better world instead of pretending that the Internet is some cyberspace separate from governments, a land of FREEDOM (*eagles screaming*), we'd have sensible accountability measures as part of peering so networks are only connected when there is an agreed upon ToS. That might split the network into several zones of influence with small filtered pipes in between but it could more closely align with human behavior and human centered governance systems, eg. if you don't want this kind of scraping and there wasn't already a rule against it then you'd have a clear process to talk to your representative to negotiate a change to the ToS (laws) that could be enforced against those violating community norms.
Many people are laughing now because the mechanisms for feedback and adjustment to the rules by which society operates are so broken right now that what I've described seems like a utopian pipe dream, because the trend has been toward less accountability the more power and wealth you have, when we need stronger accountability for the effects of decisions that ultimately a person(s) somewhere are making. There is a person somewhere setting up a scraping botnet, there is a person somewhere using that tainted data, those people have names and faces, they live in a place with rules of some kind, they should be identified or cut off from the network.
Is that idea the end of anonymity, yeah kind of, it should only be attempted when there are working feedback mechanisms between the people making and enforcing the rules and the people subject to the rules so that there is mutual consent and ways to adjust to keep everyone in harmony and alignment, democracy is one of those mechanisms but there are certain to be many stable and effective ways for human societies to maintain healthy feedback loops.
There is just a limit as to what an individual website can do when the problem is the fundamental rules of the system that don't provide negative feedback on abusive behavior, like a broken TCP or bufferbloat which allows one jumbo flow to starve the rest of the link.
Ok, time to get a second cup of coffee and get off the Internet ;-)
Posted Feb 23, 2025 16:52 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (3 responses)
The second part of it is that they've got strong defences against consumer-side fraud; they can, to the extent that local law permits, refuse to serve you until you've repaid the cost of your fraud, including refusing to supply service to any address you live at.
The reason this basically doesn't work for the Internet is that I can fairly trivially obtain a service (e.g. like Tor, or a commercial "VPN" provider) that takes in encrypted data in one location, and releases it onto the Internet as-if I was based in a totally different location. Unless you enforce rules requiring that your telco can see inside all encrypted packets, there's no way for my telco to distinguish "good traffic" from "bad traffic" - it all looks the same on the wire, and that makes it impossible to identify the root sources of "bad traffic" and cut them off.
Posted Feb 25, 2025 4:14 UTC (Tue)
by raven667 (subscriber, #5198)
[Link] (2 responses)
I
Posted Feb 25, 2025 10:15 UTC (Tue)
by paulj (subscriber, #341)
[Link]
Posted Feb 25, 2025 10:33 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
This has been the Internet's biggest strength and weakness from day one; it's very hard to stop a motivated individual getting access to another country's Internet-accessible resources as-if they were in the other country. As a consequence, you have to assume that you're facing the criminals of the entire world, and you have no recourse via the law since there's no guarantee that the justice system in (say) an area undergoing a civil war has any interest in dealing with your complaint that someone in their area is doing something that affects you in a rich and peaceful area of the world - especially since the person you're complaining about may "simply" be a combatant in the war offering a relay service in exchange for military gear to support their war effort.
Posted Feb 23, 2025 17:08 UTC (Sun)
by excors (subscriber, #95769)
[Link]
That sounds rather like the net neutrality debate from the past couple of decades, where ISPs want to be able to throttle bandwidth and/or charge customers more depending on which web sites they access (and also charge content providers to get themselves onto the free-to-customers list), while customers and content providers don't want that because they'll be paying more for a worse service. The executives don't merely remember it, they've been continually fighting for it, and in the US they're now winning.
Posted Feb 25, 2025 22:54 UTC (Tue)
by ju3Ceemi (subscriber, #102464)
[Link]
Posted Feb 27, 2025 19:53 UTC (Thu)
by davecb (subscriber, #1574)
[Link] (1 responses)
This is the classic "swiss cheese" model for avoiding accidents. If the accident is to happen, all the holes in all the slices must line up. Each individual slice was simple. Unless they were derived from one another, the holes also tended to be in different places.
A customer of mine used to use that technique, and reduced bad things substantially. One of the slices was a check on user-agent-equivalents. Every time a new user-agent-equivalent showed up, it was accepted until more than N were seen in a day, at which time the program emailed a human to get a decision. A more cautious customer required manual inspection of all user-agent-equivalents. I thought that was cooler, but it used a lot of humans (:-))
Posted Mar 25, 2025 4:41 UTC (Tue)
by dagobayard (subscriber, #174025)
[Link]
I wish I had thought of the N approach, but alas, I'm a perfectionist, and I think in F_2 terms (true, false, black, white). I mechanized it to some extent but still I made myself personally eyeball every new UA string. The stress was too much and the job didn't last.
Posted Mar 5, 2025 3:40 UTC (Wed)
by livi (guest, #173786)
[Link]
Users who aren’t logged in get hit with JS required challenges and whatever else.
Posted Mar 16, 2025 14:09 UTC (Sun)
by koollman (subscriber, #54689)
[Link]
Adding links/paths that humans are unlikely to follow helps with detection.
Once you get enough detail on the patterns used, you can generally build a set of 'known good', 'known bad', and 'unknown' clients.
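A minimal sketch of that kind of classification, assuming a hypothetical trap-link prefix and simple in-memory sets; a real deployment would persist and age this state:

TRAP_PREFIX = "/never-link-here/"   # assumed path: hidden from humans, linked nowhere visible

known_good = set()                  # e.g. verified search-engine addresses
known_bad = set()                   # addresses that followed a trap link

def classify(ip, path):
    """Sort a client into 'known good', 'known bad' or 'unknown'."""
    if path.startswith(TRAP_PREFIX):
        known_bad.add(ip)           # a human is very unlikely to request this
    if ip in known_bad:
        return "known bad"
    if ip in known_good:
        return "known good"
    return "unknown"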
Proof of work challenges?
That is a subset of the JavaScript-based challenges mentioned in the article. The problem is that we have long gone out of our way to avoid forcing readers to enable JavaScript, and we hear regularly that this attention is appreciated. Being forced to change that approach by these people would be ... sad.
Proof of work challenges?
We have considered honeypots — indeed, some code was even written. But when you have thousands of systems each hitting two or three pages, tripping up the one that gets assigned to the honeypot link is not going to change the situation much.
Honeypots
https://lwn.net/current/removethiswhenlinkingorbookmarking/<long unique string>
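One way a unique trap string like that could be recognized without keeping a database is to derive it from a server-side secret; this is only a guess at how the suggestion might be implemented, and SECRET, the per-client derivation, and the path layout are assumptions:

import hashlib
import hmac

SECRET = b"change-me"               # assumed server-side secret

def trap_url(client_ip):
    """Build the per-client 'long unique string' from the secret."""
    token = hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()
    return f"/current/removethiswhenlinkingorbookmarking/{token}"

def is_trap_hit(client_ip, path):
    """A fetch of the client's own trap URL marks it as a scraper."""
    return hmac.compare_digest(path, trap_url(client_ip))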
Honeypots
The bot will only send you one request per week. So blocking the IP address is basically useless.
Honeypots
It could be a human user.
Honeypots
That is the problem.
The CPU/traffic load has already happened by the time you identify the bot, and that address will then basically never hit you again, unless you keep a multi-terabyte database with no expiry, at the risk of putting your real users into that terabyte-scale ban list.
Honeypots
Lost.
Honeypots
It is a real problem for my machines, too.
And I really don't see a solution that is not either:
a) buying more resources, or
b) potentially punishing real users
Honeypots (and tarpits, oh my!)
Bot administrators are not stupid. Bots are optimized for maximal throughput, no matter what.
Honeypots (and tarpits, oh my!)
Yes, obviously. That's why I called this a "mitigation", not a "cure".
Honeypots (and tarpits, oh my!)
Honeypots
> ... that are fetching a sorted list of URLs in an obvious sequence, but the same IP address will not appear twice in that sequence.
Respond by sending back 404s on the predicted sorted list for a minute or so. There are probably other, better ideas for responses :-) ...
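A sketch of what that response could look like, with the URL list, the run length K, and the one-minute timeout all chosen purely for illustration:

import time

# Stand-in for the site's sorted URL list; a real site would use its own index.
sorted_urls = sorted(f"/Articles/{n}/" for n in range(1, 1000))
index_of = {url: i for i, url in enumerate(sorted_urls)}

K = 3                   # consecutive in-order hits that trigger the trap
recent = []             # list indexes of the most recent matching requests
poisoned = set()        # predicted next indexes that will get a 404
poison_until = 0.0      # when the fake 404s stop

def status_for(path):
    """Return the HTTP status code this sketch would send for a request."""
    global poison_until
    i = index_of.get(path)
    if i is None:
        return 200
    if time.time() < poison_until and i in poisoned:
        return 404                                  # predicted continuation of the walk
    recent.append(i)
    del recent[:-K]                                 # keep only the last K hits
    if len(recent) == K and recent == list(range(recent[0], recent[0] + K)):
        poisoned.clear()
        poisoned.update(range(i + 1, i + 1 + K))    # poison the next URLs in the list
        poison_until = time.time() + 60             # ... for a minute or so
    return 200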
spam without reading it because it came from xyz789@someweirdplace and because of how it was formatted and several other tells... milters had just come out, and I wrote a milter that caught a ton of spam with almost no false positives.
Proof of work challenges?
You would be surprised how many people struggle with the JavaScript requirement for the credit-card form; I wish we could do away with that. (OTOH, doing it that way means that credit-card numbers never pass through our systems at all, and we like that).
Proof of work challenges?
Proof of work challenges?
- sending a JS challenge only during periods of intense activity: the site needs to defend itself and triage good from bad, so users only need to support JS during those periods. I understand that this is not especially welcome here.
- a session cookie to recognize users (usually a complement to the one above): it allows checking activity per cookie, so that good actors can keep reading without problems while bad actors are rejected. Since some users might reject cookies, the option here is to slow down the first cookie-less request, so that only those users (and the bots) are slowed down during periods of high activity. You can also be certain that scrapers will learn cookies, so cookies can then be used to negatively flag them once they have been identified as certainly malicious.
- a redirect with a URL parameter: that's an alternative to a cookie, but it usually requires adapting the site to pass the same argument to all internal links (remember sites with "/?...;JSESSIONID=xxx"?).
- you can count the number of URLs per IP (or per IP range) and per period. A normal user will not read 1000 URLs in 10 minutes, even when actively searching. And since I guess the site doesn't have 100 million daily users, you can enlarge the source range to which the checks apply. Even if that covers an enterprise's proxies, an enterprise will rarely have enough readers to visit 1000 URLs in 10 minutes, while a scraper will often do far more than that, distributed over multiple addresses (a rough sketch of this kind of counting follows).
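A rough sketch of that per-range counting; the 1000-URL limit and 10-minute window are the commenter's example numbers, while the /24 and /64 bucket sizes are assumptions, and old buckets would need pruning in practice:

import ipaddress
import time
from collections import defaultdict

WINDOW = 600            # seconds: the 10-minute period from the comment
LIMIT = 1000            # URLs per source range per window

counts = defaultdict(int)   # (source range, window number) -> request count

def source_range(ip):
    """Widen a single address to its surrounding range (/24 or /64 here)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 64
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def over_limit(ip):
    """Count this request and say whether its range has crossed the limit."""
    bucket = (source_range(ip), int(time.time() // WINDOW))
    counts[bucket] += 1
    return counts[bucket] > LIMIT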
Proof of work challenges?
Ian
Proof of work challenges?
There's a separate issue for the scrapers to consider if they start engaging with login forms and the like; they end up on shakier legal ground.
Login forms and scrapers
Login forms and scrapers
The dumbest scrapers will demonstrably bash their virtual heads against the 429 "slow down" failure code indefinitely. They will repeatedly hit HTTP URLs that have been returning permanent redirects for over a decade. I would not expect smarter responses to a login form.
Proof of work challenges?
The initial comment suggested using JavaScript to filter out bots, but omitting it for subscribers. I pointed out that even subscribers have to get past the login form to be recognized as such, and will thus run into JavaScript there.
Proof of work challenges?
Don't forget w3m-like surfers; since images are not available to them, the challenge can't contain images.
A math challenge per IP, a math challenge per epoch, limited retries, and so on... combine them and try hard if you need to.
Maybe you put those "specials" into your fake index.html.
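A sketch of a text-only challenge along those lines, with the answer verified against an HMAC-signed token so the server keeps no per-challenge state; SECRET, the roughly five-minute epoch, and the token format are assumptions:

import hashlib
import hmac
import random
import time

SECRET = b"change-me"       # assumed server-side secret

def new_challenge():
    """Return a plain-text question and a signed token to embed in the form."""
    a, b = random.randint(2, 9), random.randint(2, 9)
    epoch = int(time.time() // 300)                 # token valid for a few minutes
    mac = hmac.new(SECRET, f"{a}+{b}:{epoch}".encode(), hashlib.sha256).hexdigest()
    return f"What is {a} + {b}?", f"{a}+{b}:{epoch}:{mac}"

def check_answer(token, answer):
    """Verify the signed token and the visitor's arithmetic."""
    try:
        question, epoch, mac = token.rsplit(":", 2)
        a, b = (int(x) for x in question.split("+"))
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{question}:{epoch}".encode(), hashlib.sha256).hexdigest()
    return (hmac.compare_digest(mac, expected)
            and int(epoch) >= int(time.time() // 300) - 1
            and answer.strip() == str(a + b))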
Legal remedies
Everybody wants their own special model, and governments show no interest in impeding them in any way.
If the scrapers are using botnets, I sincerely doubt the government will be able to do much to stop them. Governments don't seem to be able to do much to stop botnets more generally, so there's no reason to think they'll do any better for this specific case.
Legal remedies
> Because we believe in democratizing access to online content, our free desktop and Android versions accomplish exactly this, while you contribute a small amount of resources to our peer-to-peer community network.
Legal remedies
Licencing pipe-dream
> If something else *does* grant you that permission, like how search engines and critical reviews are allowed to excerpt primary sources under fair use, then the GPL doesn't matter.
Licencing pipe-dream
Here's an example of an image download tool that has resisted implementing robots.txt. Their README asks for nonstandard headers to opt out. PRs as simple as defining the User-Agent were not merged either.
Basic cra
Licencing pipe-dream
Wol
Licencing pipe-dream
Treat it as an optimization challenge!
Not just hand out the old articles
Putting older articles behind the paywall would defeat one of the primary purposes of LWN as a long-term archive of community discussions, events, and decisions. There are <checks> 251 links to LWN in the 6.14-rc kernel source; we don't want to break them. A lot of our (legitimate) traffic comes from places other than search engines.
Not just hand out the old articles
Wol
Nepenthes should be able to work
Sympathy
If we lived in a different world, this is where a higher structure might be equipped to cut down on this abuse. But spam and robocalls suggest we may be stuck living with that plague from now on.
these AI scrapers suck
Consider Adding a More Comprehensive robots.txt
>User-Agent: GPTBot
But "AI" idiots go crazy since a couple of months ago.
No user-agent = 403
It's not that there is no user agent... These things just pretend to be a desktop browser.
No user-agent = 403
User-Agent strings
Take your pick - they pick a user agent that is meant to disguise the bot and blend into the rest of the site's traffic. It is a fully attacker-controlled string that really has nothing of value in it.
User-Agent strings
The last scraperbot author I spoke with carefully chose user agents that matched popular browsers; the intent is that, if you blocked them by user agent, you'd also be blocking all Apple Safari, Google Chrome, Mozilla Firefox, and Microsoft Edge users on all platforms.
No user-agent = 403
Simple no-JS CAPTCHA?
Captchas for new IP addresses
Freewall
We are dealing with the same issue.
Freewall
Considering your alias, your footnote [1] and the well-known Turing test, could you please reply to this message of mine to help in determining the answer? 😀
Freewall
- Participants can be fairly certain that each other are human.
- It's possible that I could be trained to be able to pass the test using advanced train[ERROR OCCURRED]
Freewall
Wol
Classful networking
The discerning reader might have noticed that I wrote "treating the Internet like...". The fact that you understood what I meant suggests that the analogy worked. So perhaps this lecture was not strictly necessary...?
Classful networking
> I've worked with and continue to work with otherwise competent network admins who continue to make references to classful networking
Classful networking
Naïve question
Wol
Naïve question
Fighting the AI scraperbot scourge
Have you seen maptcha?
LWN works wonderfully without JavaScript, and the only time I enable it is when I renew my subscription. I have been bitten too many times by payment solutions breaking if you run a non-standard setup of your browser.
Have you seen maptcha?
what is wrong with javascript?
it seems crazy for LWN to spend hundreds of hours of valuable time and resources, and to eventually still capitulate to the "ignore robots.txt" crowd, if just using JavaScript could solve the problem.
Does Javascript solve this problem? It's reasonably straightforward to code a Python program which controls a headless browser and loads a web page. There are of course ways to fingerprint this setup, but none of them are foolproof.
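For example, something along these lines will happily execute whatever JavaScript a page serves (assuming the Playwright package and its bundled Chromium are installed; the URL is a placeholder):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/")   # any JavaScript on the page just runs
    html = page.content()               # the fully rendered document
    browser.close()

print(len(html), "bytes of rendered HTML")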
what is wrong with javascript?
I have disabled JavaScript in the browser I usually use in order to reduce the attack surface.
what is wrong with javascript?
The bots have never used JavaScript; the reason JavaScript is useful as a distinguishing factor between bots and humans is that some humans have working JavaScript, bots don't. If you break pages only for people who use JavaScript, you break lots of humans, and no bots.
what is wrong with javascript?
If the bots are running javascript code from your site, you have a way to attack them.
Bots don't run JavaScript; to the extent they support it, they do so by having pre-written (by their owners) pattern matching code that says "this looks like Google JavaScript on a Google owned domain - extract this bit and crawl the resulting URL".
what is wrong with javascript?
Just some brainfarts
Hands-off hosters?
Much sympathy. Keep up the great work.
Have two sites: free access with DDoS protection/JS, authenticated subscriber access without
Have you tried ip banning the botnet?
https://www.fsf.org/bulletin/2024/fall/fsf-sysops-cleanin...
Simple suggestion
Micro-payments may be the only answer
There's already a tiny cost to downloading more data - your Internet link is charged based on capacity, and the more you want to download, the bigger the link has to be. It's just a really tiny cost right now, because Internet links are cheap.
Micro-payments may be the only answer
Note that both telcos and cablecos protect against fraud in the same two ways: first, anyone they pay money to gets it on a long delay, and it's at the company's discretion whether they continue to let you take money from their customers via them. If you're a fraudulent content provider (my "criminal running ads that take money" example), they cut you off permanently, and they have complicated rules and procedures you have to follow to even get to a point where you can take money from them.
Telco and cableco protections against fraud
Indeed - but now you need a Great Firewall of China level of enforcement of anti-VPN rules to stop me "invisibly" leaving one jurisdiction and becoming apparently "present" in another.
Telco and cableco protections against fraud
Micro-payments may be the only answer
Understandable
If I had a vastly greater amount of brain-time, I'd read the whole content over and over again!
How about defence in depth?
Similarly, IFF no browser used by humans fails to provide a User-Agent header, exclude requests without one, too.
Then, IFF the User-Agent doesn't match one of a list of real browsers, exclude those as well (the latter requires a semi-automated mechanism, see below).
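A minimal sketch of such layered checks, where each slice is a small predicate and a request is excluded as soon as any slice rejects it; the browser allowlist here is a crude stand-in for the semi-automated mechanism mentioned above:

REAL_BROWSERS = ("Mozilla/", "Opera/")      # crude, assumed allowlist prefixes

def has_user_agent(headers):
    return bool(headers.get("User-Agent", "").strip())

def looks_like_real_browser(headers):
    return headers.get("User-Agent", "").startswith(REAL_BROWSERS)

SLICES = [has_user_agent, looks_like_real_browser]   # one function per "slice"

def allowed(headers):
    """Serve the request only if every slice lets it through."""
    return all(check(headers) for check in SLICES)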
How about defence in depth?
Ian
logged in
other possible options
Adding some delay on rarely accessed pages (such as those that fall out of the cache) helps limit the impact on resources.
Even a 0.2s delay is not too bad for an unknown first-time user, but it is painful for a system making millions of requests.
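A sketch of that idea, assuming a simple in-memory cache and using the 0.2-second figure from the comment:

import time

cache = {}                  # assumed in-memory page cache
CACHE_MISS_DELAY = 0.2      # seconds, the figure from the comment above

def render(path):
    return f"<html>page for {path}</html>"      # stand-in for real rendering

def serve(path):
    """Serve from cache instantly; delay a little when rendering a cold page."""
    page = cache.get(path)
    if page is None:
        time.sleep(CACHE_MISS_DELAY)            # negligible for a human reader,
        page = render(path)                     # but it adds up for a scraper
        cache[path] = page
    return page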