
Practical uses for a null filesystem

By Jonathan Corbet
March 12, 2026
One of the first changes merged for the upcoming 7.0 release was nullfs, an empty filesystem that cannot actually contain any files. One might logically wonder why the kernel would need such a thing. It turns out, though, that there are places where a null filesystem can come in handy. For 7.0, nullfs will be used to make life a bit easier for init programs; future releases will likely use nullfs to increase the isolation of kernel threads from the init process.

Making root actually pivotable

The process of bootstrapping a computer involves a number of tricky steps, one of which is locating and mounting the root filesystem. That task might involve digging through text files, contacting other systems over the network, assembling RAID volumes, and more. As a result, there really needs to be a temporary root filesystem, with usable commands, before the real root filesystem can be mounted. That initial filesystem usually comes in the form of an initramfs image that is bundled with the kernel binary.

That initial filesystem is the first root filesystem, but there comes a time when it must be replaced with the real root. The kernel provides a system call, pivot_root(), for just that purpose. It will cause a new filesystem to become the root filesystem, and will cause any existing processes to move from the old root to the new. There is only one tiny little problem: pivot_root() cannot be used on the initial root filesystem, which is exactly the case that matters here. From the man page:

The rootfs (initial ramfs) cannot be pivot_root()ed. The recommended method of changing the root filesystem in this case is to delete everything in rootfs, overmount rootfs with the new root, attach stdin/stdout/stderr to the new /dev/console, and exec the new init(1). Helper programs for this process exist; see switch_root(8).

Even with the availability of a helper, the need for this kind of workaround is seen by some as mildly inelegant. (The system call can be used in other contexts, such as setting up a new root for a container.)
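To make the workaround quoted from the man page concrete, here is a minimal sketch of its first two steps (emptying rootfs and overmounting it with the new root). The function names empty_directory() and overmount_root() are invented for this illustration, and the code is not taken from switch_root(8); a real helper also has to reopen /dev/console and exec the new init.

```c
#define _GNU_SOURCE
#include <ftw.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* nftw() callback: remove one entry, but keep the top-level directory. */
static int remove_entry(const char *path, const struct stat *sb,
			int typeflag, struct FTW *ftwbuf)
{
	(void)sb;
	(void)typeflag;
	if (ftwbuf->level == 0)
		return 0;	/* the starting directory itself must stay */
	return remove(path);	/* unlink() for files, rmdir() for directories */
}

/*
 * Delete everything below dir. FTW_DEPTH visits children before their
 * parent, FTW_MOUNT keeps the walk on one filesystem (so a new root
 * mounted somewhere below dir is left alone), and FTW_PHYS avoids
 * following symbolic links.
 */
int empty_directory(const char *dir)
{
	return nftw(dir, remove_entry, 16, FTW_DEPTH | FTW_MOUNT | FTW_PHYS);
}

/*
 * Move the filesystem mounted at new_root on top of "/" and enter it.
 * Returns 0 on success, -1 with errno set on failure.
 */
int overmount_root(const char *new_root)
{
	if (chdir(new_root) < 0)
		return -1;
	if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)
		return -1;
	if (chroot(".") < 0)
		return -1;
	return chdir("/");
}
```

Under these assumptions, an init program would call empty_directory("/") and then overmount_root() on the mount point of the real root (a path such as "/newroot" is only an example) before executing the new init.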

The solution is to base the filesystem tree on an empty filesystem — nullfs — upon which both the temporary and permanent root filesystems can be mounted. pivot_root() can then be used to place the permanent root below the temporary one in the mount stack, allowing the latter to be unmounted. The system can then continue the bootstrap without the need for the above-described workaround. The nullfs implementation, as found in 7.0, enables this type of operation.
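As an illustration of what this enables, the sketch below uses the pivot_root(".", ".") idiom documented in pivot_root(2), which stacks the old root on top of the new one and then detaches it. It assumes that the temporary root is an ordinary mount sitting on top of nullfs; the function name pivot_to_new_root() is invented here, and this is not code from the patch series.

```c
#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Make the filesystem mounted at new_root the root of the mount tree and
 * drop the old root. Assumes the old root is an ordinary mount (for
 * example, an initramfs mounted on top of nullfs), so that pivot_root()
 * accepts it. Returns 0 on success, -1 with errno set on failure.
 */
int pivot_to_new_root(const char *new_root)
{
	if (chdir(new_root) < 0)
		return -1;
	/* glibc has no pivot_root() wrapper, so make the system call directly. */
	if (syscall(SYS_pivot_root, ".", ".") < 0)
		return -1;
	/* The old root is now stacked on top of the new one at "/"; detach it. */
	if (umount2(".", MNT_DETACH) < 0)
		return -1;
	return 0;
}
```

Compared with the delete-and-overmount approach, nothing has to be removed by hand: once the old root is detached, the kernel releases it when its last user goes away.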

Isolating kernel threads

There are other possible uses for nullfs, though; consider the case of kernel threads. The first two processes created by a booting Linux kernel are init and kthreadd. The init process will serve as the ultimate ancestor for all user-space processes created thereafter; kthreadd, instead, is the parent for all of the kernel threads that will be spawned over the life of the system. When the ps command shows you a process with a name like "[rcu_tasks_rude_kthread]", you'll know that it is an ill-mannered child of kthreadd.

Christian Brauner (who implemented nullfs) pointed out an interesting relationship between those two processes in this patch series cover letter. They share the same initial fs_struct structure, which records the root and current working directories, meaning that they each have access to the other's filesystem state. When init forks, it explicitly disconnects its children from that structure, but kthreadd does not do that. As a result, every running kernel thread shares full access to init's filesystem state, and vice versa. Normally, all of those processes mutually trust each other, but the potential for mischief (and unfortunate bugs) is real.
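The effect of sharing this filesystem state can be observed from user space, where clone() with CLONE_FS makes two processes share their root and working directory in the same way. The following is only a user-space analogy, not kernel code, and the helper name child_chdir_moves_parent() is invented for the example.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)

/* Child body: change directory, updating the filesystem state it runs with. */
static int child_chdir(void *arg)
{
	return chdir((const char *)arg) < 0 ? 1 : 0;
}

/*
 * Run a child that calls chdir(target) and report whether the parent's
 * working directory moved as a result. Returns 1 if it did, 0 if it did
 * not, and -1 on error. The original directory is restored before returning.
 */
int child_chdir_moves_parent(int clone_flags, const char *target)
{
	char before[PATH_MAX], after[PATH_MAX];
	char *stack;
	int status, changed;
	pid_t pid;

	if (!getcwd(before, sizeof(before)))
		return -1;
	stack = malloc(STACK_SIZE);
	if (!stack)
		return -1;
	/* The stack grows down on most architectures, so pass its top. */
	pid = clone(child_chdir, stack + STACK_SIZE, clone_flags | SIGCHLD,
		    (void *)target);
	if (pid < 0 || waitpid(pid, &status, 0) < 0) {
		free(stack);
		return -1;
	}
	free(stack);
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	if (!getcwd(after, sizeof(after)))
		return -1;
	changed = strcmp(before, after) != 0;
	if (changed && chdir(before) < 0)
		return -1;
	return changed;
}
```

When called from a directory other than the target, passing CLONE_FS should report that the parent was moved, while passing 0 (an ordinary fork-like child with its own copy of the state) should not.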

Brauner's proposed solution is to isolate kernel threads from the initial fs_struct and, instead, to run each of them with a nullfs instance as its root filesystem. In that way, a kernel thread can no longer interfere with the init process; indeed, it has no access to the filesystem at all. This seems like a bit of worthwhile kernel hardening, given that most kernel threads have no need for filesystem access. As an added bonus, pivot_root() no longer needs to forcibly change the root filesystem for all of the running kernel threads, since they no longer need to be moved to the new root.

The init process, too, is separated from the initial fs_struct and given one of its own, just like any other fully independent process. At that point, the initial fs_struct is entirely unused — most of the time.

Separating kernel threads from the filesystem is almost always the right thing to do; as Brauner noted: "Offloading fs work to kthreads is really nasty [...] It's a broken concept." It is also unnecessary most of the time; after all, rcu_tasks_rude_kthread can happily go about its job of inconsiderately interrupting CPUs to force RCU grace periods without accessing any files. But there are other kernel threads that do, indeed, need occasional access to the filesystem. Enabling this access is why kernel threads have long retained their connection to the initial fs_struct. For cases like that, the patch series adds a mechanism to temporarily give a kernel thread access to the filesystem:

    scoped_with_init_fs() {
    	/* code here can perform filesystem operations */
    }

The use of the scoped guard ensures that the access is only provided within the indicated scope, with no possibility of that access being left in place accidentally.
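Scoped guards in the kernel are built on the compiler's cleanup attribute, which runs a function whenever a variable goes out of scope. The sketch below shows the general technique in user-space C (GCC or Clang); every name in it (scoped_with_fs(), struct fs_scope, current_fs) is hypothetical, and it makes no claim about how scoped_with_init_fs() is actually implemented.

```c
#include <string.h>

/* Hypothetical stand-in for the filesystem view of the current thread. */
static const char *current_fs = "nullfs";

struct fs_scope {
	const char *saved;	/* view to restore on exit */
	int done;		/* lets the for loop below run exactly once */
};

static struct fs_scope fs_scope_enter(const char *fs)
{
	struct fs_scope scope = { .saved = current_fs, .done = 0 };

	current_fs = fs;
	return scope;
}

/* Called automatically whenever the guard variable leaves scope. */
static void fs_scope_exit(struct fs_scope *scope)
{
	current_fs = scope->saved;
}

/* Run the block that follows once, with current_fs switched to fs. */
#define scoped_with_fs(fs)						\
	for (struct fs_scope scope_guard				\
		     __attribute__((cleanup(fs_scope_exit))) =		\
		     fs_scope_enter(fs);				\
	     !scope_guard.done; scope_guard.done = 1)

/* Returns 1 if the named view was active inside the scope. */
int view_inside_scope(const char *fs)
{
	scoped_with_fs(fs) {
		/* An early return still triggers fs_scope_exit(). */
		return strcmp(current_fs, fs) == 0;
	}
	return 0;
}

/* Returns 1 if the thread is currently looking at the named view. */
int current_view_is(const char *fs)
{
	return strcmp(current_fs, fs) == 0;
}
```

Because the restore step is tied to the variable's lifetime rather than to an explicit call, there is no exit path from the block, including an early return, that leaves the elevated access in place.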

There are roughly a dozen kernel threads that have to be patched to use this new mechanism. The production of core dumps, for example, naturally requires filesystem access. Unix-domain sockets need to be able to look up names in the filesystem, firmware loading must be able to find and open the file containing the firmware, the devtmpfs filesystem must be able to provide the filesystem content, and so on. So there are a number of holes punched into the wall separating kernel threads from the filesystem, but they are small, localized, and easy to find.

Brauner is careful to not set expectations too high for this work at this point: "Is it crazy? Yes. Is it likely broken? Yes. Does it at least boot? Yes." It is a significant change to some of the deepest code within the kernel, code that has had its current form for a long time. The existence of surprises seems almost certain. But, so far, nobody has questioned the goals or direction of this patch series. Greater isolation for kernel threads (and init) thus seems likely to show up in a future kernel release.




Variant for chroot ?

Posted Mar 13, 2026 4:34 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (5 responses)

What I'd like to see would be a variant of chroot() which would automatically create such a nullfs, enter it and lazy-unmount it so that once the caller dies, it's automatically unmounted. It would make it possible to completely isolate the caller. I'm using chroot() to empty directories with no permissions as an isolation for daemons and it works reasonably well because you cannot easily escape it (nowhere to mkdir+chroot+cd again). Of course the daemon must not hold any FD pointing to an external directory! In this case we could imagine something like this:

#define NULLFS_DIR (const char *)1
chroot(NULLFS_DIR); // mount nullfs and mark it lazy-umount (will it work before chdir()?)
chdir("/");

Variant for chroot ?

Posted Mar 13, 2026 5:25 UTC (Fri) by tamiko (subscriber, #115350) [Link] (2 responses)

It sounds like you're looking for the $ unshare helper program. Unsharing the current mount namespace and putting a child process into it has exactly the semantics you're looking for: when the child exits the namespace is removed as well.

The new nullfs mount has the advantage that you can start with a truly empty mount namespace instead of mounting over and sealing. That's at least my current understanding.

But out of curiosity: do you know "bubblewrap"? It is a fantastic helper tool to create lightweight sandboxes via namespaces. Best of all, you can run it as an unprivileged user. And it can do all of the sandboxing you're talking about.

Variant for chroot ?

Posted Mar 13, 2026 9:42 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (1 responses)

> It sounds like you're looking for the $ unshare helper program. Unsharing the current mount namespace and putting a child process into it has exactly the semantics you're looking for: when the child exits the namespace is removed as well.

I'm already using it for other stuff, and could indeed call unshare(CLONE_FS) in the program. But I seem to remember that abstract unix socket paths are affected by unshare(CLONE_FS). This could be an acceptable tradeoff for most use cases though.

> But out of curiosity: do you know "bubblewrap"?

No I don't.

> It is a fantastic helper tool to create lightweight sandboxes via namespaces. Best of all, you can run it as an unprivileged user. And it can do all of the sandboxing you're talking about.

I'm really talking about doing the sandboxing from within the daemon itself. Normally my programs boot, parse config files, load libraries etc, then chroot(), chdir() and drop privileges. Here I could indeed do unshare() instead of the first two steps and it would also work for unprivileged users, but maybe with an abns limitation that I'd need to recheck. Thanks for raising this hint I had just forgotten about!

Variant for chroot ?

Posted Mar 13, 2026 13:57 UTC (Fri) by qyliss (subscriber, #131684) [Link]

> But I seem to remember that abstract unix socket paths are affected by unshare(CLONE_FS). This could be an acceptable tradeoff for most use cases though.

According to network_namespaces(7), those are scoped to network namespace.

Variant for chroot ?

Posted Mar 14, 2026 1:35 UTC (Sat) by geofft (subscriber, #59789) [Link] (1 responses)

> enter it and lazy-unmount it so that once the caller dies, it's automatically unmounted

This is already the behavior of, say, a tmpfs in a mount namespace. It's reference-counted and once the last process referencing it dies, the tmpfs is cleaned up. You can see this very clearly by creating a large file and looking at memory usage before/after a kill -9 of the owning process.

Mostly for amusement purposes, I wrote a mkdtemp() alternative that works by creating a child process in a new unprivileged user+mount namespace and having it pass back a handle to a tmpfs. https://ldpreload.com/p/verytmp.c

A friend ported it to Rust and has a slightly more detailed writeup: https://blinsay.com/posts/verytmp/

(nullfs appears to not be mountable inside a userns, largely because it's a global singleton. So you don't actually need to garbage-collect it, but I guess this also means there might not be a way to actually get to it from a regular userspace process.)

Variant for chroot ?

Posted Mar 16, 2026 9:24 UTC (Mon) by taladar (subscriber, #68407) [Link]

At first I thought mounting nullfs might be safer to allow for a user than most other filesystems since it can not replace the content it is mounted over, just hide it, but I guess there is the edge case of config.d style directories where you could cut out some parts of the configuration for some setuid root tool like sudo that way.

pivot_root()?

Posted Mar 13, 2026 5:34 UTC (Fri) by pabs (subscriber, #43278) [Link] (7 responses)

What blocks pivot_root() from being able to pivot away from the initramfs to the real rootfs?

pivot_root()?

Posted Mar 13, 2026 20:49 UTC (Fri) by jbroadus (subscriber, #115605) [Link] (6 responses)

Brief explanation in the patch cover letter:

"Currently pivot_root() doesn't work on the real rootfs because it cannot be unmounted. Userspace has to do a recursive removal of the initramfs contents manually before continuing the boot."

So maybe pivot_root() could be modified but wouldn't know how to clean up the old filesystem?

pivot_root()?

Posted Mar 13, 2026 22:26 UTC (Fri) by Nikratio (subscriber, #71966) [Link] (5 responses)

I'm confused by this as well. The manpage quote says "The rootfs (initial ramfs) cannot be pivot_root()ed" and the cover letter says that "pivot_root() doesn't work on the real rootfs". So it seems there are cases where pivot_root does work - namely when the root filesystem is not of type rootfs but something like ext3. Is that true? If so, then what makes rootfs so different that pivot_root can't work there? And wouldn't it be easier to just fix that instead of the nullfs workaround?

pivot_root()?

Posted Mar 14, 2026 0:06 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

My (possibly wrong) interpretation is that this is not a filesystem-level restriction, but a mounting restriction. In other words, the problem is not that there's some specific "rootfs" filesystem that is unable to be pivoted. Rather, the problem is that, post-boot, the mount system is unable to represent a state where nothing is mounted. Presumably, the pivot_root() syscall would need to transition through such a state, and working around this limitation is either impossible or not worth the complexity.

But this is pure guesswork on my part.

pivot_root()?

Posted Mar 14, 2026 5:07 UTC (Sat) by intelfx (subscriber, #130118) [Link]

Another guesswork-grade interpretation is that the "real rootfs" is declared and/or constructed statically somewhere within Linux, and simply cannot be destroyed as a consequence of that.

But I am indeed confused as well by this limitation. Surely removing the underlying cause would have been easier than switching the entire mechanism to use nullfs.

pivot_root()?

Posted Mar 14, 2026 13:54 UTC (Sat) by Nikratio (subscriber, #71966) [Link] (1 responses)

> My (possibly wrong) interpretation is that this is not a filesystem-level restriction, but a mounting restriction. In other words, the problem is not that there's some specific "rootfs" filesystem that is unable to be pivoted. Rather, the problem is that, post-boot, the mount system is unable to represent a state where nothing is mounted. Presumably, the pivot_root() syscall would need to transition through such a state, and working around this limitation is either impossible or not worth the complexity.

I think that interpretation boils down to "the pivot_root system call fundamentally does not work", since pivoting the root filesystem is the one and only thing this system call is supposed to do. I highly doubt this would have been added if it never works.

pivot_root()?

Posted Mar 16, 2026 17:41 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

Per the man page, it changes the root filesystem *for the current mount namespace.* As described under NOTES, this is primarily useful for setting up containers. It was at one point also used during boot, but this is (apparently) no longer possible.

pivot_root()?

Posted Mar 16, 2026 13:55 UTC (Mon) by abatters (✭ supporter ✭, #6932) [Link]

I believe pivot_root() was used with the old-style initrd, but it doesn't work with the new-style initramfs. See initrd(4). Note that the old initrd code is being removed: https://lwn.net/Articles/1057769/

Is this sort of a "template filesystem?"

Posted Mar 13, 2026 10:54 UTC (Fri) by jpeisach (subscriber, #181966) [Link]

Could future filesystems pretty much use this as a place to start? Of course they still have to implement file operations, but it's something, I guess.

Confusion with 4.4BSD's nullfs

Posted Mar 13, 2026 17:48 UTC (Fri) by jrtc27 (subscriber, #107748) [Link]

Although I get why this name makes sense in isolation, this seems unnecessarily confusing to use. 4.4BSD's nullfs (still present in FreeBSD today) is what Linux calls a bind mount, and now that name is being used for a very different thing.

Is unprivileged use possible?

Posted Mar 14, 2026 1:07 UTC (Sat) by alip (subscriber, #170176) [Link] (1 responses)

Many container managers, including sydbox, offer functionality to mount a scratch tmpfs or ramfs and build the mounts on top of that. Can we now instead use nullfs for this root mount? That'd be a nice plus imho.

Is unprivileged use possible?

Posted Mar 14, 2026 1:12 UTC (Sat) by alip (subscriber, #170176) [Link]

Another useful functionality is to mount nullfs over critical paths such as /boot, /proc/sys/kernel, etc. in order to mask them.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds