LWN: Comments on "LWN site tour 2025" https://lwn.net/Articles/1006001/ This is a special feed containing comments posted to the individual LWN article titled "LWN site tour 2025". en-us Tue, 16 Sep 2025 14:10:59 +0000 Tue, 16 Sep 2025 14:10:59 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Comments https://lwn.net/Articles/1013165/ https://lwn.net/Articles/1013165/ Fowl <div class="FormattedComment"> ... and I can now see this has been implemented! Thanks!<br> </div> Thu, 06 Mar 2025 09:34:15 +0000 Comments https://lwn.net/Articles/1009652/ https://lwn.net/Articles/1009652/ Fowl <div class="FormattedComment"> I agree that comment thread collapsing is very useful! <br> <p> One relatively minor enhancement I wish for is the ability to collapse (all the comments under) entire articles on the unread comments page. Occasionally the comments for an article get... numerous. <br> </div> Fri, 14 Feb 2025 03:32:42 +0000 No JS is a big plus https://lwn.net/Articles/1009304/ https://lwn.net/Articles/1009304/ PeeWee <div class="FormattedComment"> As a new subscriber I have to say that I like your policy on the JavaScript front. And it is also very refreshing to see this kind of "old school", no nonsense design. It is very functional, to the point and focused on what actually matters: the content. As the saying goes, if you are too concerned with optics you are trying to distract from the (lack of) content.<br> </div> Wed, 12 Feb 2025 22:40:45 +0000 Comments https://lwn.net/Articles/1008696/ https://lwn.net/Articles/1008696/ lamawithonel <div class="FormattedComment"> I've really liked the thread collapse feature you added last year. I'm hopping that's another area of focus, comments, but I'm happy if it's slow. I like the conservative approach you've taken over the years. 
Thanks for all the efforts in both directions, improvements and holding fast!<br> </div> Mon, 10 Feb 2025 02:49:25 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008622/ https://lwn.net/Articles/1008622/ mb <div class="FormattedComment"> For me most of the bot load comes from bots "indexing" the cgit web interface. There's basically an infinite amount of data in there, so it never ends. Even though cgit is completely blocked in robots.txt.<br> <p> Currently something that identifies itself as "openai.com/gptbot" ignores my Crawl-delay and also the cgit blocking. Therefore I'm forced to block it completely either on IP level or via User-Agent.<br> It downloaded almost 40k files today.<br> </div> Sat, 08 Feb 2025 16:18:44 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008621/ https://lwn.net/Articles/1008621/ apoelstra <div class="FormattedComment"> <span class="QuotedText">&gt; - No limit for link-following depth. It goes in circles without end.</span><br> <p> You can sometimes crash such bots by just having an infinitely deep directory tree. If you are serving a directory-indexed folder from nginx (or, I assume, Apache), you can do this by simply doing `mkdir do-not-follow; cd do-not-follow; ln -s .. 
do-not-follow`.<br> <p> And yes, on my site there are also only a small number of IPs that do this sort of thing, so it's sufficient to manually block them (in fact, it's sufficient to just totally ignore them since they're not wasting a lot of resources in total :)).<br> <p> I imagine the situation is pretty bad for something like LWN with hundreds of thousands of links widely spread across mailing lists and the wider Internet.<br> </div> Sat, 08 Feb 2025 15:47:27 +0000 Strip trailing spaces https://lwn.net/Articles/1008418/ https://lwn.net/Articles/1008418/ hrw <div class="FormattedComment"> KSDB looks nice, so I went to check for my commits (28 in total).<br> <p> Please make search trim trailing spaces. On Android there is a space added after autocomplete, so KSDB searched for "Juszkiewicz " and failed. <br> </div> Fri, 07 Feb 2025 05:55:59 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008406/ https://lwn.net/Articles/1008406/ ejr <div class="FormattedComment"> Y'all rock.<br> <p> And, yes, the epub version definitely works via free software on a PineNote. ;) The site does as well when on a network, also via free software as much as any of these can be.<br> </div> Fri, 07 Feb 2025 00:44:22 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008381/ https://lwn.net/Articles/1008381/ corbet Just to be clear: I know of no instances where ordinary, human readers have been mistaken for bots; we do go out of our way to avoid that. Thu, 06 Feb 2025 20:35:48 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008380/ https://lwn.net/Articles/1008380/ daroc <div class="FormattedComment"> Oh, yes. The polite bots read our robots.txt and respect the directives there. 
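The symlink trap mentioned in the thread above can be reproduced in a few commands. This is a minimal filesystem-level sketch (the directory name `do-not-follow` is just illustrative); it only bites a crawler if the web server actually serves directory indexes for that folder, and a well-behaved bot with a depth limit will never notice it:

```shell
# Create a directory that looks infinitely deep to a naive crawler:
# the symlink points back at its own parent, so the path
# do-not-follow/do-not-follow/do-not-follow/... resolves forever.
mkdir do-not-follow
cd do-not-follow
ln -s .. do-not-follow
cd ..

# Any depth of the looped path resolves back to the same directory,
# which contains just the one symlink.
ls do-not-follow/do-not-follow/do-not-follow/
```

A bot with no depth limit will keep queueing ever-longer URLs under this path; excluding the directory in robots.txt as well ensures that only crawlers that ignore the rules ever fall in.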
It's not exactly great how many of them there are, but they're not causing problems for us because we set the crawl delay to something the site can handle.<br> <p> It's the bots that never look at robots.txt, and keep hitting the site even when they get a 'Rate Limit Exceeded' error that are the real problem. But — possibly because we _do_ serve 'Rate Limit Exceeded' errors with impunity, when a non-logged in user tries to load more than a handful of pages per second — we see the same bots coming from a large variety of IP addresses. Part of the challenge is that each IP address only makes a small number of requests, so we need to be able to identify them quickly.<br> <p> Incidentally — if anyone without a LWN.net account is reading this, one of the benefits of getting a free account, even without a subscription, is that the site code is less likely to classify you as a bot. Not that this should come up very often, since we try hard not to hit our human readers with false positives, but if it does for some reason you can fix it by signing in.<br> </div> Thu, 06 Feb 2025 20:31:11 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008377/ https://lwn.net/Articles/1008377/ mb <div class="FormattedComment"> On my site these bots show a completely anti-social behavior that is against all bot standards:<br> <p> - No rate limiting at all.<br> - Completely ignoring robots.txt<br> - Fake User-Agent that mimics Safari browser.<br> - No limit for link-following depth. 
It goes in circles without end.<br> - Plus all the things you said.<br> <p> There are only a handful of source IP addresses for me, so I block them.<br> But I guess I'm just lucky with that.<br> <p> I don't know who they are and what they want to do.<br> But if I had not taken action, my machine would be completely overloaded.<br> </div> Thu, 06 Feb 2025 20:09:26 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008374/ https://lwn.net/Articles/1008374/ daroc <div class="FormattedComment"> We get a surprising number of requests to our HTTP site that immediately get redirected to HTTPS; usually, you might expect this to happen approximately once per client before they pick up on the permanent redirect. In actuality, it's about 20% of requests made to the site, often by the same client multiple times in a row. Usually, they're requests for old articles from years ago — which human readers certainly read too, but someone coming in repeatedly on port 80 for old, unpopular articles is probably a robot.<br> <p> Other "fun" behaviors include: requesting an article, and then requesting to view each comment individually, when they were just rendered on the article's page; requesting "dead" URLs, like the HTTP version of the site or old URLs that have other redirects from the site being reorganized over time; etc.<br> <p> All of these have picked up since the start of January, to the point where we've had a handful (4-5, I think) of instances where the site has had a lag spike. It's nothing that we can't deal with, necessarily, but we don't want to have to deal with it because it takes time away from all the other parts of keeping LWN.net running.<br> <p> That's not getting into the non-AI bots, which mostly try to figure out if we have any shell or SQL injection vulnerabilities by repeatedly trying to log in with nonsense usernames. 
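For the "handful of source IP addresses" case described above, the offenders usually stand out immediately in the access log. A minimal sketch, assuming a combined-format nginx or Apache log; the file name, addresses, and request paths here are made up for illustration:

```shell
# Hypothetical sample of a combined-format access log; in practice this
# would be something like /var/log/nginx/access.log.
cat > access.log <<'EOF'
203.0.113.7 - - [06/Feb/2025:20:09:26 +0000] "GET /cgit/ HTTP/1.1" 200 512
203.0.113.7 - - [06/Feb/2025:20:09:27 +0000] "GET /cgit/tree/ HTTP/1.1" 200 512
198.51.100.2 - - [06/Feb/2025:20:10:00 +0000] "GET /Articles/ HTTP/1.1" 200 512
EOF

# Count requests per client IP and list the heaviest hitters first.
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -5

# An address far above the rest is a candidate for a firewall rule, e.g.:
#   iptables -A INPUT -s 203.0.113.7 -j DROP
```

This only helps against the small-scale scrapers; as noted above, the harder problem is bots spread across a large variety of IPs where each address makes only a few requests.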
Or the bots that try to do credit-card fraud and get shot down for not having a valid transaction in their request.<br> </div> Thu, 06 Feb 2025 19:27:47 +0000 Site performance https://lwn.net/Articles/1008350/ https://lwn.net/Articles/1008350/ corbet No PHP at all, happily. <p> There is quite a bit of caching built into the LWN site code. But when you have tens of thousands of AI-scraper sites all hitting the server, there is only so much that caching will help. We've put in some countermeasures that seem to have stabilized the situation for now... but this is a net-wide problem, and I would expect it to get worse. Thu, 06 Feb 2025 17:21:24 +0000 Site performance https://lwn.net/Articles/1008345/ https://lwn.net/Articles/1008345/ jengelh <div class="FormattedComment"> <span class="QuotedText">&gt;we are facing [...] site scrapers trying to feed AI models. While we look for ways to block them to preserve site performance, we are avoiding adding any counter-measures that inconvenience our human readers.</span><br> <p> Is this some kind of {overly dynamic webpages cluttered with altmode media—images, font files, huge stylesheets, you name it} problem for which LWN is too plaintext to be overly concerned with?<br> <p> I get that LWN is running *some* form of dynamic page generation somewhere, but how much ‘PHP’ is there really to worry about, and could a static page cache not also address that?<br> </div> Thu, 06 Feb 2025 17:18:23 +0000 Nonsense requests from content-gobbling site scrapers https://lwn.net/Articles/1008241/ https://lwn.net/Articles/1008241/ amw <div class="FormattedComment"> Just out of curiosity, would it be possible to elaborate a bit on "nonsense requests from content-gobbling site scrapers trying to feed AI models"? 
What form do these requests take?<br> </div> Thu, 06 Feb 2025 13:59:34 +0000 Kernel Index https://lwn.net/Articles/1008238/ https://lwn.net/Articles/1008238/ ncultra <div class="FormattedComment"> The Kernel Index is much appreciated!<br> <p> <a href="https://lwn.net/Kernel/Index/">https://lwn.net/Kernel/Index/</a><br> </div> Thu, 06 Feb 2025 13:18:43 +0000 RSS with content? https://lwn.net/Articles/1008162/ https://lwn.net/Articles/1008162/ legoktm <div class="FormattedComment"> In case it's useful, 404media recently rolled out full-content RSS feeds for subscribers; each URL has a ?key=... for authentication.<br> <p> <a href="https://www.404media.co/404-media-now-has-a-full-text-rss-feed/">https://www.404media.co/404-media-now-has-a-full-text-rss...</a><br> </div> Wed, 05 Feb 2025 22:42:50 +0000 RSS with content? https://lwn.net/Articles/1008135/ https://lwn.net/Articles/1008135/ corbet True, we could implement a solution along those lines. Stay tuned (but I can't promise when). Wed, 05 Feb 2025 19:51:51 +0000 RSS with content? https://lwn.net/Articles/1008126/ https://lwn.net/Articles/1008126/ Cyberax <div class="FormattedComment"> A typical solution is to create a personal link with a long random token that encodes the authentication. <br> </div> Wed, 05 Feb 2025 19:42:36 +0000 RSS with content? https://lwn.net/Articles/1008129/ https://lwn.net/Articles/1008129/ mb <div class="FormattedComment"> An API-token can be used that (if leaked) only affects the ability to read the RSS feed.<br> Such tokens can typically be generated from the user's account menu.<br> <p> In this case it would basically be a personalized and maybe re-generatable RSS URL retrieved by the user from the account menu.<br> </div> Wed, 05 Feb 2025 19:40:35 +0000 Dark mode https://lwn.net/Articles/1008125/ https://lwn.net/Articles/1008125/ npws <div class="FormattedComment"> Thanks for the hint. 
So far I have used a Chrome plugin for LWN, but it breaks quite a few things.<br> </div> Wed, 05 Feb 2025 19:23:52 +0000 RSS with content? https://lwn.net/Articles/1008124/ https://lwn.net/Articles/1008124/ corbet That has been occasionally requested, and I've looked into it. Authentication and RSS don't really go well together; about the only way to do it in most readers seems to be to put the username and password, in plain text, in the fetch URL. That seems ... not entirely elegant. I wish there were a better solution. Wed, 05 Feb 2025 19:22:55 +0000 RSS with content? https://lwn.net/Articles/1008123/ https://lwn.net/Articles/1008123/ Cyberax <div class="FormattedComment"> Do you have an RSS feed with the contents of the articles, obviously for subscribers only?<br> <p> For now, I'm using FreshRSS's capability to download the linked RSS article with the supplied cookies for auth. But it'd be nice to get the full content.<br> <p> Perhaps via a personal URL?<br> </div> Wed, 05 Feb 2025 19:19:31 +0000
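The personal-URL scheme discussed in this thread — a long random token standing in for credentials, along the lines of 404media's `?key=...` feeds — can be sketched in a couple of commands. The host and path below are hypothetical, not an existing LWN endpoint:

```shell
# Mint a 256-bit random token. If it leaks, only feed access is exposed
# (not the account password), and it can be regenerated at any time.
token=$(openssl rand -hex 32)

# Hand the subscriber a personalized feed URL embedding the token;
# the server maps the token back to the account when serving the feed.
echo "https://example.org/feeds/full-text.rss?key=${token}"
```

Since the reader only ever stores an opaque URL, nothing resembling a password sits in its configuration, which sidesteps the plain-text-credentials problem corbet describes.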