LWN: Comments on "Defending mounted filesystems from the root user"

Defending mounted filesystems from the root user

adobriyan — Mon, 11 Mar 2024 05:09:14 +0000

> for preventing users from shooting themselves in the foot when trying to write an image to (the wrong) disk.

or developers from running fio job with wrong filename=
:-(

Defending mounted filesystems from the root user

lathiat — Mon, 11 Mar 2024 02:26:57 +0000

Ignoring all of the TOCTOU issues, this same functionality seems super helpful for preventing users from shooting themselves in the foot when trying to write an image to (the wrong) disk. Though it would probably also need to protect the full device, not just the partitions.

This would be a nice default anyway, even if it has some kind of override method for the weird cases.
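
A userspace approximation of that safety check can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the mount table is passed in as text in the format of /proc/self/mounts, and the prefix match relies on the usual naming convention (sda -> sda1, nvme0n1 -> nvme0n1p1), so it misses filesystems mounted through device-mapper or similar layers.

```python
def mounted_sources(mounts_text):
    """Return the mount sources (first field) from /proc/self/mounts-style text."""
    sources = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields:
            sources.add(fields[0])
    return sources


def is_safe_target(device, mounts_text):
    """Return False if the device, or a partition named after it, is mounted.

    Heuristic only: matches by name prefix, e.g. /dev/sda covers /dev/sda1.
    """
    return not any(src.startswith(device) for src in mounted_sources(mounts_text))
```

A real tool would call it as `is_safe_target("/dev/sdX", open("/proc/self/mounts").read())` before writing. Separately, open(2) documents that on Linux, opening a block device with O_EXCL fails with EBUSY if the device is in use by the system (for example, mounted), which gives a kernel-side check for the exact device being opened.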

Defending mounted filesystems from the root user

taladar — Wed, 30 Aug 2023 09:45:26 +0000

At that point you are potentially working on a fictional version of your filesystem that doesn't exist on disk for months at a time. Considering persistent storage is the main purpose of a hard disk that doesn't seem like a good idea.

Defending mounted filesystems from the root user

matthias — Tue, 29 Aug 2023 03:15:59 +0000

> Surely it would be reasonable for the kernel to have partition table drivers with an API for manipulating them. Presumably calls to modify or delete a mounted partition would fail, while calls that add new partitions or modify/delete unmounted partitions could succeed.

In a way, this is the case. Think of the on-disk partition table as a configuration "file" that tells the kernel how to configure its internal partition table. The API lets userspace ask the kernel to re-read this configuration after changing it on disk. And that request will fail if the kernel thinks it is not safe to do. Back in the day, it was entirely impossible to re-read a partition table if any filesystem on the disk was mounted, always requiring a reboot if one modified the partition table on the primary disk. Nowadays it is a bit more permissive.

You just have to mentally distinguish between the kernel partition table (which is always in RAM) and the partition table on disk. Changing the latter is no issue at all, as it will only be used by the kernel when explicitly told so or on the next boot. And this design makes a lot of sense. You can make modifications on disk that are only safe to apply after the next boot and then reboot. If the only way of changing the on-disk partition table were an API that directly manipulates the internal partition table, such changes would always require booting another OS (rescue CD etc.).
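
On Linux, the re-read request described above corresponds to the BLKRRPART ioctl issued on the whole-disk device. A minimal sketch, assuming a Linux system; the function name is hypothetical and the device path is whatever the caller supplies:

```python
import fcntl
import os

BLKRRPART = 0x125F  # _IO(0x12, 95), as defined in <linux/fs.h>


def reread_partition_table(path):
    """Ask the kernel to re-read the partition table of the device at path.

    Returns 0 on success, otherwise the errno reported by the kernel
    (typically EBUSY when the kernel considers the re-read unsafe).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)
```

Editing the on-disk table with any tool and then calling this function (as root, on a real disk) mirrors the two-step model in the comment: the disk contents change first, and the in-RAM table changes only if the kernel accepts the request.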

Defending mounted filesystems from the root user

jwarnica — Mon, 28 Aug 2023 14:03:52 +0000

We could do all of those things, or we could accept that some kernel being operated by some human running on real or virtual hardware requires a level of trust of that human.

It's a weird mental model that "root is special, protect it". See: https://xkcd.com/1200/ In a more enterprisy sense: consider that some app team has full permissions to /var/lib/pgsql, but the OS team has root, so the app team needs to open a ticket to restart the server. Yah! I guess the app team isn't able to put a NIC in promiscuous mode, but who isn't using switches?

Presume that whichever human runs the kernel has access to everything; that is either tolerable trust or a massive breach depending on the organizational requirement. And then protect the running kernels from each other. Harden the VM layer, harden the network layer, harden the APIs.

Defending mounted filesystems from the root user

geofft — Mon, 28 Aug 2023 00:31:09 +0000

I was mostly replying to the comment talking about historical filesystems - if we need to write an ext5 that makes this sort of robustness easier, that's certainly an option. We are, in any case, mostly just talking about lockdown/secure boot configs - it seems pretty reasonable to say, the upstream kernel will only let you use a very small set of filesystems when lockdown is on, and if distros want to take the risk of filesystems that haven't been audited, they can patch this check out. A lot of people run without kernel lockdown enabled (perhaps enforcing secure boot in some other way, like verifying a signature on an entire read-only rootfs) and won't be impacted by this. The people who do run with kernel lockdown should be able to expect that the kernel isn't just writing off ring-0 escalation risks.

But also I don't think punting major filesystems to FUSE is really out of the question. It was the vision of the microkernels of the '90s, which failed not because there was anything fundamentally wrong with microkernels but because overhead was high. We've learned a lot about writing efficient software that spans multiple address spaces since then (it's in many senses similar to HPC work or GPU programming), and also the physical computers are way faster. As I mentioned, without an actual benchmark, I think saying that this just has to be done in kernelspace is premature optimization.

(We also know a lot more about software fault isolation now than we did in the '90s - we could use something like eBPF or wasm or Native Client to keep these filesystems in the kernel but limit the impact of bugs.)

We Linux folks rightly make fun of Windows for having done font rendering in the kernel for so long and having had a bunch of ring-0 privilege escalation bugs as a result. It made sense in the '90s when they cared a lot about font rendering performance and basically not at all about malicious fonts; it doesn't make sense today. I don't think filesystems are a fundamentally different story.

Defending mounted filesystems from the root user

kmeyer — Sun, 27 Aug 2023 23:32:58 +0000

I mean, we're talking about ext4, xfs, and btrfs. Punting them to FUSE + LKL isn't really an option, IMO.

Defending mounted filesystems from the root user

Baughn — Sun, 27 Aug 2023 14:04:02 +0000

> * We could say "if you want to modify your kernel, either don't enable secureboot, or reimage your kernel with the appropriate changes pre-configured."

I have a computer that doesn’t boot with Secureboot disabled. They seem to be getting more common.

At the moment, I’m still able to use it as a regular computer thanks to Linux not locking itself down hard enough to stop me modifying the kernel. If a rule like that was added, then I suppose it’s game over.

Defending mounted filesystems from the root user

calumapplepie — Fri, 25 Aug 2023 22:49:17 +0000

> Yet another problem is that, according to Ts'o, the syzbot developers are unwilling to turn on this configuration option unless disabling it would be hidden behind a new CONFIG_INSECURE option (to indicate that doing so would make the system insecure). Ts'o objected to that positioning "because that's presuming a threat model that we have not all agreed is valid".

The threat model of "root is evil" is apparently a valid one supported by kernel_lockdown(7). However, it isn't valid unless the system is booted up in lockdown mode: if it isn't, root can just use kmem and such. At a minimum, we could gate edits to devices containing mounted filesystems, and access to the SCSI_GENERIC device, behind a requirement that the kernel isn't locked down. Kernel lockdown already breaks a number of things; what's a few more?

Alternatively, start with a CONFIG_LOCKDOWN_STRICT option, which when enabled tightens the restrictions of lockdown to prohibit such things as mounted block-device writes. For those users who require a root -> kernel barrier, they can enable that option, and with it some more restrictions that might break semi-niche application code. Yes, I'm considering online resizing to be 'semi-niche'. If you really require the security guarantees of a strict lockdown, you enable the config; otherwise, leave it disabled.

For those users who are just running a distro kernel, which enables CONFIG_LOCKDOWN_LSM but not STRICT because they want all the features available, this means that (for a period of time) they will be vulnerable to novel attacks using this threat model. However, the goal will be to move this patch into the basic CONFIG_LOCKDOWN eventually; thus fixing all such bugs. As we do so, we can add additional hardening behind CONFIG_LOCKDOWN_STRICT, for instance disabling a wider variety of sysfs files or locking down old drivers. You can also remove the ability to disable the lockdown LSM on the command line; a command line which can be edited for the next boot by root on most machines.

This two-phase mechanism ensures that those who want a strict lockdown will need to deal with the breakage that it causes in userspace. Those who don't need a strict lockdown, but enable lockdown anyways for hardening get to benefit over time from the work of those who need a stricter mode. It's similar to the realtime stuff; if you want a realtime kernel, you have to configure yourself a realtime kernel. If you want a kernel that actually blocks all ring0 compromise, then you have to build it yourself.

In other words: There are some folks who actually want this threat model secured, and many more who don't really care but appreciate the hardening it produces. Differentiate between the two with config options, document the difference in all the places that talk about lockdown, and let those who want it strictly secured deal with the breakage and performance regressions from it.

TLDR: Make the security model of the kernel a kconfig option, and limit features for those using the root-is-evil threat model until those features can be made secure.
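
For reference, the lockdown state that exists today can be inspected from userspace. The sketch below assumes the securityfs file described in kernel_lockdown(7), /sys/kernel/security/lockdown, which lists the modes with the active one in square brackets. The function names are hypothetical, and the proposed CONFIG_LOCKDOWN_STRICT is not something this code can detect, since it is only a suggestion in this thread.

```python
def active_lockdown_mode(text):
    """Return the bracketed (active) mode from the lockdown file's contents."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_lockdown(path="/sys/kernel/security/lockdown"):
    """Return the active mode, or None if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return active_lockdown_mode(f.read())
    except OSError:
        return None
```

A tool that wants to refuse writes to mounted block devices only on locked-down systems could branch on whether `read_lockdown()` returns something other than "none" or None.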

Defending mounted filesystems from the root user

calumapplepie — Fri, 25 Aug 2023 21:24:04 +0000

Yes... they would fix the filesystem issuing a destructive write. That is a bug. That is bad.

A filesystem failing to handle concurrent modification is less of a bug.

Defending mounted filesystems from the root user

smammy — Fri, 25 Aug 2023 02:19:41 +0000

It's always struck me as a little odd that we interact with partition tables by raw block device access from userspace. Surely it would be reasonable for the kernel to have partition table drivers with an API for manipulating them. Presumably calls to modify or delete a mounted partition would fail, while calls that add new partitions or modify/delete unmounted partitions could succeed.

Defending mounted filesystems from the root user

Karellen — Thu, 24 Aug 2023 14:16:33 +0000

> you're just deferring "something edited my FS" problems from "direct memory access" to "when I load from disk next time".

I get that. I just don't see why it's a problem. Surely checking for consistency and deciding what to do if there's a problem is easier at mount time than it is while the filesystem is in use?

Defending mounted filesystems from the root user

mathstuf — Thu, 24 Aug 2023 12:33:42 +0000

Let's say you bring in all of the metadata from the FS into memory and work from there. If you don't edit any of them, there's no need to write. However, in a situation where the backing store is editable by some other mechanism (network-mounted block device, direct writes, whatever), the metadata can be rewritten without the filesystem noticing (say, swapping two inodes in a directory listing). Without writeback, you're just deferring "something edited my FS" problems from "direct memory access" to "when I load from disk next time".

Defending mounted filesystems from the root user

farnz — Thu, 24 Aug 2023 10:29:32 +0000

The only significant gotcha is that which implementation to use (userspace or kernelspace) is not about the filesystem in use, but rather about the degree to which the backing storage and the user are trustworthy.

In one system, I might want to use both the kernelspace implementation of xfs for my root FS, using something like fs-verity to protect against a malicious root user, and the userspace implementation for home directories. For added fun, I might want the userspace implementation to run multiple instances, so that an exploit is less likely to affect other instances (only affects other instances if it can be used to write to the backing store); this comes in handy with something along the lines of Android's user-per-application model, where I won't be able to mutate in-memory state that affects another application.

Defending mounted filesystems from the root user

Karellen — Thu, 24 Aug 2023 07:12:20 +0000

> Anything less and you're just deferring discovering the bogus writes until the next mount time.

Why is that a problem?

Defending mounted filesystems from the root user

zeno_kdab — Wed, 23 Aug 2023 17:21:42 +0000

I'll agree that it does seem theoretically possible to do so. Though I am doubtful that it is a good idea, besides the already mentioned concern of practical feasibility.

Imho either you trust your hardware, and don't want your FS drivers to be slowed down by being implemented super defensively, always rechecking everything etc. Or you don't trust, but then you should be fine taking the perf hit by using FUSE or a VM to isolate the hardware handling from your host kernel.

Having said that, I always dream about a new OS kernel that transcends the monolithic/micro dichotomy by making it easy to move all kinds of drivers into userspace and back ;)

Defending mounted filesystems from the root user

leromarinvit — Wed, 23 Aug 2023 17:13:10 +0000

Sure, perfection is impossible for anything sufficiently complex. What I really meant was fixing the issues one knows about, and having the attack vector in mind when designing and writing new code. Not saying the current maintainers have to tackle all that in addition to everything they're already doing - it's clear that adding more work either leads to everything moving at a slower pace, or someone needs to step in and fund more developers. (Someone volunteering is of course also possible, but I have a hard time imagining that "I can use root privileges to make the kernel do funny things" is many people's most important itch to scratch.)

I also should probably have qualified the "never crash" with "in a way that potentially allows privilege escalation". If removable media were by default mounted using something like lklfuse, that would IMHO be a big step in the right direction. But I think this should be mainlined, or decoupled from the actual driver code so much that it can use arbitrary kernel images or modules. Using different versions of the same fs driver (with a different set of features and bugs), potentially interchangeably on the same device, sounds like a recipe for compatibility issues.

Defending mounted filesystems from the root user

draco — Wed, 23 Aug 2023 14:03:59 +0000

Not necessarily. As an analogy, let's say that the block device is cloud storage. The cloud storage has different threats than the rest of the computer.

It's fair to say that in a scenario where you're computing in malicious environments that you must be able to trust some of your hardware — if you can't trust the CPU itself, you're doomed, sure. But with a trusted computing core and IOMMU, you can (in principle) mitigate malicious I/O if you write the drivers defensively.

Defending mounted filesystems from the root user

Wol — Wed, 23 Aug 2023 09:20:02 +0000

> Problem is, traditionally, you couldn't actually design an OS where root could only do things like the above. You also needed an interface for doing more complicated stuff, and especially for doing things in kernelspace (loading modules, debugging, enabling realtime scheduling, etc.). There are a few ways around this, at least that I can think of:

Going back to Pr1mos, the ONLY thing that was hard-coded into the OS (and even that could be patched out) was that user "system" could edit the root of the permissions tree. And not really even that - it simply set over-ride permissions, which I would often use when testing stuff ...

SPAC <system> wol:none
SPAC <data> wol:none

then I would run loads of stuff in testing that could cause carnage if I'd made a mistake, secure in the knowledge that the live system was not even visible to my program.

Cheers,
Wol

Defending mounted filesystems from the root user

epa — Wed, 23 Aug 2023 06:36:14 +0000

> The difficulty is that you'd probably have to break up CAP_SYS_ADMIN for this to work

Exactly right. CAP_SYS_ADMIN is the "big kernel lock" of permissions. Or it's fcntl(). Or any other design that started out as a reasonable idea but became more and more overloaded and treated as a receptacle for anything and everything.

Defending mounted filesystems from the root user

mcassaniti — Wed, 23 Aug 2023 05:03:27 +0000

Disabling the ability to modify the whole block device means that new partitions cannot be created live (think expanding a VM disk), nor can a partition be extended. While systemd isn't everyone's favourite, systemd-sysupdate can change the partition table and overwrite non-mounted partitions as part of an A/B update process. It's likely not the only tool to do so either.

Defending mounted filesystems from the root user

Cyberax — Wed, 23 Aug 2023 03:47:21 +0000

> From their perspective, I imagine the trusted component is a pretty large subset of the operating system, and I doubt they draw the line exactly at ring 0.

Microsoft has a notion of "protected processes" that block every access to themselves, even from the Administrator user. Linux doesn't really have a similar thing. The root user can trivially ptrace any process.

Defending mounted filesystems from the root user

NYKevin — Wed, 23 Aug 2023 02:57:30 +0000

The traditional design of Unix is that root (or whatever uid=0 is called) can do a relatively small set of things:

* Open any file regardless of permissions.
* Impersonate any user with setuid(2) (or some equivalent).
* Send any signal to any process, and make other adjustments to the process's state (such as renicing it).
* Mount and unmount filesystems.
* And probably a few other highly standardized actions (i.e. *not* Linux-specific things) I've forgotten about.

Problem is, traditionally, you couldn't actually design an OS where root could only do things like the above. You also needed an interface for doing more complicated stuff, and especially for doing things in kernelspace (loading modules, debugging, enabling realtime scheduling, etc.). There are a few ways around this, at least that I can think of:

* We could try to partition off the kernelspace-modifying actions into a separate user, as you suggest, or at least into a separate set of capabilities(7) or the like. The difficulty is that you'd probably have to break up CAP_SYS_ADMIN for this to work, so it would be a lot of code churn. Ultimately, I think the existing capabilities would have to be fundamentally redesigned for this to make sense. It is not enough to split off a permission here and a permission there - we have to think logically about the transitive closure of everything that a process with capability X can ever do, directly or indirectly, and the current design does not even attempt to do that. And then we have to think about all possible combinations of capabilities, or at least all combinations that can plausibly interact with each other to escalate privileges.
* We could say "if you want to modify your kernel, either don't enable secureboot, or reimage your kernel with the appropriate changes pre-configured." The effect would be to disable the kernelspace-modifying actions altogether, and maybe even patch out their codepaths entirely so that they can't be used as ROP gadgets, but only in secureboot-enabled kernels (so that people who "just want a normal kernel" and don't want to put up with this sort of thing can ignore it). The main difficulty here is that, to my understanding, much of the existing "pre-configure your system" tooling currently lives in userspace (e.g. systemd). You'd probably need to provide a rich set of kernelspace configuration options that can be set before the system is first booted, and I'm not sure how feasible that is.
* We could partition off all of the "dangerous" permissions into a series of daemons like systemd and polkit, and administer the system by asking those daemons nicely to do it for us. That would extend secureboot trust to a much wider array of system services, which is probably undesirable (now your systemd has to be secureboot-signed?). OTOH, it's not like Microsoft maintains a strong segregation between the Windows NT kernel and the modern Windows userspace. From their perspective, I imagine the trusted component is a pretty large subset of the operating system, and I doubt they draw the line exactly at ring 0.
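
To make the capability discussion concrete, here is a small sketch that decodes the effective capability mask a process holds, as reported in the CapEff line of /proc/<pid>/status. The bit numbers come from <linux/capability.h>; the function names and the choice of which capabilities to list are illustrative assumptions, picked because they gate the kernelspace-modifying actions mentioned above (module loading, raw I/O, ptrace, realtime scheduling).

```python
# Bit positions from <linux/capability.h>; only a handful are listed here.
CAP_BITS = {
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_RAWIO": 17,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_NICE": 23,
}


def decode_caps(mask_hex):
    """Return the names (from CAP_BITS) whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return set()
```

Running `effective_caps(open("/proc/self/status").read())` as root versus as an ordinary user shows how much authority rides on a few bits, and CAP_SYS_ADMIN alone covers a long list of unrelated operations in capabilities(7).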

Defending mounted filesystems from the root user

zorg24 — Tue, 22 Aug 2023 20:06:58 +0000

The issue of automounting USB drives was actually discussed in a previous article https://lwn.net/Articles/939097/

Defending mounted filesystems from the root user

mathstuf — Tue, 22 Aug 2023 17:19:56 +0000

That sounds like a futile endeavor to me. Sure, make things *better*, but *crash-proof* when you're in the realm of forced TOCTOU races (that iSCSI situation given above), deliberately bad actors messing with inode pointers/refcounts that may cause page cache confusions, or whatever else could possibly go wrong when you're in an absolutely uncontrollable and hostile environment…

I don't know…the trust line has to go somewhere here. For example, Rust is not safe against `/proc/self/mem` editing. I'm not sure what one *could* do in the face of such power because the only thing you have is "my registers are not accessible" and "the program counter will keep moving".

Note that I am usually all about defensive programming and covering bases, but I also don't interface with hardware directly and have some baseline level of viable behavior. The tales I've heard here (and from linked blogs, etc.) make me happy about my course so far. I am extremely grateful for those that do that work, but I do not envy their jobs.

Defending mounted filesystems from the root user

willy — Tue, 22 Aug 2023 16:13:48 +0000

... unless it's GFS2 or OCFS which are designed for exactly that use case ;-)

Defending mounted filesystems from the root user

DemiMarie — Tue, 22 Aug 2023 16:08:28 +0000

Exactly! And big companies (Google, Oracle, Red Hat, etc) need to hire more people to meet that goal.

Defending mounted filesystems from the root user

zeno_kdab — Tue, 22 Aug 2023 16:03:44 +0000

If the attacker has enough physical access to plug in a malicious SATA or NVMe device, isn't it rather too late to worry about security? I'd think at that point there are plenty of hardware-based attacks possible that no OS could defend against anyway.

For external devices maybe an idea would be to just use unprivileged FUSE to mount? It seems rather unlikely to have a use case where you need maximum FS performance but at the same time can't trust your hardware...

Defending mounted filesystems from the root user

leromarinvit — Tue, 22 Aug 2023 14:10:38 +0000

Trying to detect writes to a device (or the page cache) behind the file system's back is probably a futile endeavor. But making sure to never crash, no matter what any given read returns (and no matter if that's consistent with what was read elsewhere), seems like a goal that should be attainable in principle (even if, as many have said, it's a lot of work).

Defending mounted filesystems from the root user

magfr — Tue, 22 Aug 2023 13:43:58 +0000

For added fun you can put the file system on an iSCSI device, mount it from two computers concurrently, and then start writing from both.

I do not expect the kernel to handle that scenario.

Defending mounted filesystems from the root user

mathstuf — Tue, 22 Aug 2023 13:39:18 +0000

But then you have to write all of it back because it might have changed on disk behind your back. Anything less and you're just deferring discovering the bogus writes until the next mount time.

Defending mounted filesystems from the root user

SLi — Tue, 22 Aug 2023 12:12:40 +0000

All this discussion makes me think about one of the things I'll eventually maybe get up to doing once I've done all the other important things in the world.

To me, filesystems are in many ways an exceptionally nicely contained thing. They largely follow a well defined, narrow API with well defined semantics. Exceptions to it are probably fairly easy to express. Regardless of the filesystem, you can say things like "if I write a file, then read it back without other writes to the same file, I should get the same data (or an error) back".

That is, they seem exceptionally amenable to formal specification and analysis, and from that perspective, how they are designed today seems quite ad hoc. It shouldn't be as hard as with many other systems to actually formally define the operations (up to what gets written to the disk where) and verify that the required properties hold, as well as do a lot of analysis on performance etc., play with different design ideas without needing to convert and boot kernels, etc. You could treat tolerance to bogus data in the same way, allowing a conscious decision on exactly how you are allowed to fail in different situations.

Now I'm not saying that should necessarily be the same as the code that gets executed (or even generated from it), but parts of it could well be if desired. Verifying the design should give quite a bit of confidence, and effort could be directed at the performance critical parts.
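
A far weaker cousin of formal verification, but one that shows what such a specification looks like, is model-based testing: state the property against a trivial reference model and check a real filesystem against it. The sketch below encodes only the read-after-write property quoted above. The function name, file names, and step count are arbitrary choices for illustration, and passing this check proves nothing about crash or corruption behaviour.

```python
import os
import random


def check_read_after_write(root, steps=200, seed=0):
    """Randomly write and read files under root, comparing against a model.

    The model is a dict mapping file name to the bytes last written; the
    property is that a read with no intervening write returns those bytes.
    """
    rng = random.Random(seed)
    model = {}
    names = ["a", "b", "c"]
    for _ in range(steps):
        name = rng.choice(names)
        path = os.path.join(root, name)
        if rng.random() < 0.5:
            data = bytes(rng.getrandbits(8) for _ in range(rng.randrange(64)))
            with open(path, "wb") as f:
                f.write(data)
            model[name] = data
        elif name in model:
            with open(path, "rb") as f:
                if f.read() != model[name]:
                    return False
    return True
```

Pointing `root` at a directory on the filesystem under test exercises that filesystem's implementation; the same model could in principle be reused for any filesystem, which is the sense in which the API is narrow and well defined.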

Defending mounted filesystems from the root user

pizza — Tue, 22 Aug 2023 11:57:30 +0000

> Nothing else knows what is conforming and what isn't.

And it's often only possible to tell if a given on-disk metadata structure is "conformant" after loading *every other* bit of metadata into memory and effectively doing a full consistency/fsck pass. Of course you're still vulnerable to stuff being written to disk behind your back, so the only way to handle that is to always keep the full metadata in memory, and never re-read anything from disk.

Defending mounted filesystems from the root user

epa — Tue, 22 Aug 2023 10:39:22 +0000

It may be naive, but I think the right approach is to break out the things that can be done by uid 0 into capabilities (yes, adding more capabilities -- they really should not be a scarce resource) and then introduce a slightly lower-privileged "admin" user, uid 1, which can do most of the things you'd do as root, but not the most dangerous low-level stuff. And that might allow you to guarantee that "admin" cannot break out of a Secure Boot kernel, which sounds like a more reasonable threat model than trying to retrofit security restrictions on what was traditionally meant to be an unlimited-power God-mode user account.

Defending mounted filesystems from the root user

khim — Tue, 22 Aug 2023 09:48:25 +0000

In a sane world we would have both. FUSE-filesystem to deal with USB or other untrusted sources and in-kernel implementation for root fs.

Defending mounted filesystems from the root user

pbonzini — Tue, 22 Aug 2023 08:53:43 +0000

A SATA-to-USB adapter is basically a SoC that implements both physical interfaces, plus some software that does SCSI-to-ATA emulation (because USB storage is based on SCSI). Likewise for microSD readers, except it's SCSI-to-SD of course. In both cases the cost of the hardware wildly dominates since software only has to be written once.

Defending mounted filesystems from the root user

geert — Tue, 22 Aug 2023 07:45:56 +0000

IIRC, the underlying transport for SATA and USB storage is very similar.

I can easily imagine a small and cheap device with a USB host and a USB device connector, which sits between the computer and a USB memory stick, introducing (not so) random corruptions to data read from the memory stick to attack the host.

Defending mounted filesystems from the root user

ebiggers — Tue, 22 Aug 2023 04:41:01 +0000

This article misses an important point, which is that the specific issue being discussed is writes to the block device's **page cache** while the filesystem is mounted. It's virtually impossible for filesystems to maintain memory safety in that case. Whereas it's possible (but difficult) for filesystems to maintain memory safety when their underlying storage changes.

It is helpful to not conflate these two cases. This makes it clear why it's useful to e.g. forbid writes to /dev/sda1 while still allowing /dev/sda. Even just forbidding buffered writes would solve this problem; O_DIRECT writes could still be allowed.

Defending mounted filesystems from the root user

geofft — Tue, 22 Aug 2023 02:04:25 +0000

I think sufficiently-Byzantine filesystems just need to get deprivileged, honestly.

https://github.com/lkl/linux is a fork of the Linux kernel that turns all the interesting routines into a library, with a couple of neat tech demos of what you can do with it - including a FUSE wrapper for the filesystems in the kernel. So any filesystem that's already been implemented once, in the kernel, now has a userspace version.

There are also other ways to do it, such as UML or hardware-assisted virtualization.

Yes, you will lose some performance. I think the triangle of security, performance, and nicheness is a "pick two" situation - if you want both security and performance, you will need to attract enough interest and enthusiasm to pick up the work, possibly defining newer and easier-to-handle on-disk formats as the work happens. Otherwise, you can use an old implementation that made sense in the '90s at full performance with the security of the '90s, or you can use it at the performance of the '90s (which should be enough, honestly!).

(I'd also be very curious to see what the actual performance loss is even for day-to-day filesystems, and whether there are things that can be done to address performance like reviving the zero-copy FUSE patchset. I think I actually do very few things that are ridiculously sensitive to filesystem performance per se: most of the time I'm either working with large single files like giant CSVs or git pack files or game textures, for which the filesystem is essentially a constant factor and it's the raw I/O performance that matters, or reading and writing lots of small files like source code, which can mostly stay in the VFS cache, in theory. Applications that care very much about disk performance, like databases, tend to make a large contiguously-allocated single file anyway - and they subdivide it in userspace.)

Defending mounted filesystems from the root user

geofft — Tue, 22 Aug 2023 01:49:20 +0000

Basically, filesystems are bigger than packets.

The reason the attacks here are about data in the superblock and not e.g. within an inode is because you can reasonably cache a little bit of data in from the block device and then validate it once you've read it. Maybe you load a page worth of data, and then you validate its layout, and then you can use the validated page. For instance, maybe there's a uint32 that specifies how long the filename is, which is restricted by spec to something more reasonable like 1024 bytes. If you've already copied the data into memory you trust, you can check it and then have other functions use it directly without worrying about them doing a kmalloc(4G).

For a network protocol parser (at any layer), that's all it does! It's received some bytes from the network into RAM, and then the authoritative copy of the data is in your own trusted RAM for you to handle as you like. You can parse it and interpret it and pass it on, or you can drop it. Then you get more bytes from the network. Even if you're receiving a large amount of data, you're handling one packet at a time, and each packet becomes fully yours when you receive it.

For filesystems, you have terabytes of data that you're repeatedly going back to. There's a lot of structure of superblock to directories to inodes to data. Not all of those blocks stay in memory. So maybe you've read a superblock once, determined that it's valid, and then it changes and for whatever reason the superblock is no longer in memory. Then the next function down the line might not get the same bytes that you validated. You can't copy the entire filesystem into RAM up front because half the point of a filesystem is to be bigger than what you can fit in RAM. You can't parse things as you receive them because you're doing random access.

You _can_ revalidate data each time you need it, but the argument being made is that writing code this way is a very unnatural and unpleasant experience.
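
The difference between "copy, then validate, then use" and "validate, then fetch again" can be shown with a toy model. Everything here is hypothetical: `ShiftingStore` stands in for a backing store that changes between reads, and the record layout (a 4-byte little-endian name length followed by the name) is invented for the example; only the 1024-byte limit is taken from the text above.

```python
import struct

MAX_NAME_LEN = 1024  # limit from the example in the comment


class ShiftingStore:
    """Toy backing store that returns a different version on each read."""

    def __init__(self, versions):
        self._versions = list(versions)

    def read(self):
        # Hand out versions in order; repeat the last one once exhausted.
        if len(self._versions) > 1:
            return self._versions.pop(0)
        return self._versions[0]


def parse_record(raw):
    """Validate a private copy of a record and return the name bytes."""
    if len(raw) < 4:
        raise ValueError("short record")
    (name_len,) = struct.unpack_from("<I", raw, 0)
    if name_len > MAX_NAME_LEN or 4 + name_len > len(raw):
        raise ValueError("bad name length")
    return raw[4:4 + name_len]


def read_name_snapshot(store):
    """Safe pattern: one fetch into memory we own, then check and use it."""
    raw = bytes(store.read())
    return parse_record(raw)


def read_length_double_fetch(store):
    """Unsafe pattern: check one fetch, then trust a second, unchecked fetch."""
    (checked_len,) = struct.unpack_from("<I", store.read(), 0)
    if checked_len > MAX_NAME_LEN:
        raise ValueError("bad name length")
    (used_len,) = struct.unpack_from("<I", store.read(), 0)
    return used_len  # may differ from the value that was validated
```

With a store that serves a valid record first and a hostile one second, the snapshot version returns the validated name, while the double-fetch version hands an unvalidated length to its caller. A packet parser naturally has the snapshot shape; a filesystem that evicts and re-reads blocks is repeatedly pushed toward the double-fetch shape.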