Namespace reference counting and listns()
The original namespace type, now called mount namespaces, was introduced by Al Viro as just "namespaces" in 2001 (they were briefly covered in LWN at the time). UTS namespaces (which provide a different view of the system's host name), process-ID namespaces (managing the visibility of processes), and IPC namespaces (controlling the view of the System V inter-process communication features) followed as part of the 2.6.19 release in 2006. Each namespace type was added when the need arose and somebody was moved to implement it. As the use of namespaces has grown, though, some of the problems in their implementation have become more apparent.
Reference counts
For example, namespaces have complicated lifecycle requirements. A namespace must obviously continue to exist as long as there are any processes that are running within it. For many years, a namespace would automatically be deleted once the last process running within it exited. Over time, the ability to keep an empty namespace around (by opening a file descriptor referencing it or bind-mounting it into the filesystem) was added. Some namespaces (user namespaces, for example) are hierarchical; a hierarchical namespace that still has children will likewise remain in the system, even if no process is running within it.
The kernel uses a reference count on each namespace to know when it is no longer in use. C is not an object-oriented language, so there is no class hierarchy for namespaces; each namespace type is a different structure. There is a "superclass", of sorts, in the ns_common structure, which all of the other namespace structure types include; it was first added by Viro to the 3.19 release in 2014. The reference count was moved into struct ns_common by Brauner for 5.9 in 2020. That structure went through some significant changes in the 6.18 merge window to reach its current form, which includes a reference count now named __ns_ref.
This count tracks all references to a given namespace; that includes all of the types listed above, but there are others as well. The kernel will often create internal references to namespaces that can cause them to persist for a period after they are no longer used and, in theory, no longer visible to user space. There is an interesting interaction with a different feature added for 6.18, though: the ability to refer to namespaces with file handles. A file handle is an opaque binary cookie that can be used to open an object without locating it in the filesystem; see the open_by_handle_at() man page for details.
Since a file handle is not contained within a filesystem, it can outlive the object to which it refers. Or, in the case of a namespace, it can outlive the visibility of the object to which it refers. A namespace may have gone completely out of use and exist only because of internal kernel references that will, presumably, go away soon; normally, user space would not be able to open this namespace. But that changes if user space has a file handle referring to the namespace; opening that file handle will result in "resurrecting" the namespace when it was otherwise on its way out.
That is the sort of API quirk that nobody asked for and nobody intended to implement. If, however, it is allowed to continue to exist, somebody — attackers if nobody else — will surely find a way to depend on it. Eliminating this quirk is the first objective of Brauner's series.
Normally, a reference-counted structure contains a single reference count; these patches go against that practice by adding a second count, called __ns_ref_active, to struct ns_common. This count tracks the number of "active" references, which are essentially the references visible to user space. It can be thought of as a subset of __ns_ref, in that any change to __ns_ref_active should be accompanied by an equal change to __ns_ref. Internal references created by the kernel, though, will only increment __ns_ref. In the end, __ns_ref still manages the lifetime of the namespace, while __ns_ref_active manages its visibility to user space.
So, for example, an attempt to open a namespace with open_by_handle_at() will fail if __ns_ref_active is zero, even if the namespace itself still exists within the kernel. It will no longer be possible to use file handles to bring namespaces back from the dead.
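The relationship between the two counts can be sketched with a small user-space model. This is a toy illustration and not kernel code: the structure and function names (ns_model, ns_get_active(), ns_open_by_handle(), and so on) are invented for this example, and plain integers stand in for whatever atomic reference-count types the kernel actually uses.

```c
#include <stdbool.h>

/* Toy model of the two counters; all names here are hypothetical. */
struct ns_model {
    int ref;        /* models __ns_ref: controls lifetime */
    int ref_active; /* models __ns_ref_active: controls visibility */
};

/* Internal (kernel-only) reference: touches the lifetime count only. */
static void ns_get(struct ns_model *ns)
{
    ns->ref++;
}

/* Drop an internal reference; returns true when the namespace can be freed. */
static bool ns_put(struct ns_model *ns)
{
    return --ns->ref == 0;
}

/* User-visible reference: every active reference is also a lifetime reference. */
static void ns_get_active(struct ns_model *ns)
{
    ns->ref++;
    ns->ref_active++;
}

/* Drop a user-visible reference; returns true when the namespace can be freed. */
static bool ns_put_active(struct ns_model *ns)
{
    ns->ref_active--;
    return ns_put(ns);
}

/* Models opening by file handle: refused once no active references remain. */
static bool ns_open_by_handle(struct ns_model *ns)
{
    if (ns->ref_active == 0)
        return false; /* still allocated, but no longer visible */
    ns_get_active(ns);
    return true;
}
```

In this model, a namespace held only by ns_get() references has a nonzero ref but a zero ref_active, so ns_open_by_handle() refuses it even though the object has not yet been freed.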
Listing namespaces
As kernel developers added namespaces over the last 24 years, none of them ever quite got around to implementing a way to see which namespaces are active in the system. In current kernels, the only way to build a list of namespaces is to go rummaging through the /proc/PID/ns directories for every process in the system. Needless to say, that is less than optimally efficient. Even that does not yield a full list, since empty namespaces that are kept around by, for example, a bind mount do not appear in any process's /proc directory and will consequently be missed.
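For illustration, the /proc scan described above might look something like the following sketch. The helper names (parse_ns_link() and scan_proc_namespaces()) are made up for this example. It relies on the fact that each /proc/PID/ns entry is a symbolic link whose target looks like "net:[4026531992]", where the number is the namespace's inode number, and it collects the distinct inode numbers it finds for one namespace type.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse a link target such as "net:[4026531992]"; returns 0 on success. */
static int parse_ns_link(const char *target, const char *type,
                         unsigned long long *ino)
{
    size_t len = strlen(type);
    const char *digits = target + len + 2;
    char *end;

    if (strncmp(target, type, len) != 0 ||
        target[len] != ':' || target[len + 1] != '[')
        return -1;
    *ino = strtoull(digits, &end, 10);
    return (end != digits && *end == ']') ? 0 : -1;
}

/*
 * Walk /proc and store up to max distinct namespace inode numbers of the
 * given type ("net", "mnt", ...) in out.  Returns the number stored, or -1
 * if /proc cannot be opened.
 */
static int scan_proc_namespaces(const char *type, unsigned long long *out,
                                int max)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    if (!proc)
        return -1;
    while ((de = readdir(proc)) != NULL) {
        char path[512], target[64];
        unsigned long long ino;
        ssize_t n;
        int i;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue; /* not a process directory */
        snprintf(path, sizeof(path), "/proc/%s/ns/%s", de->d_name, type);
        n = readlink(path, target, sizeof(target) - 1);
        if (n < 0)
            continue; /* process exited, or access was denied */
        target[n] = '\0';
        if (parse_ns_link(target, type, &ino) != 0)
            continue;
        for (i = 0; i < count && out[i] != ino; i++)
            ; /* linear search for a duplicate */
        if (i == count && count < max)
            out[count++] = ino;
    }
    closedir(proc);
    return count;
}
```

Beyond the cost of visiting every process, this approach only sees namespaces that some readable process is currently running in; namespaces pinned solely by a bind mount or an open file descriptor never show up.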
The obvious answer is to add a new system call that allows iterating through the active namespaces; in this series, that is listns():
    struct ns_id_req {
        __u32 size;        /* sizeof(struct ns_id_req) */
        __u32 spare;       /* Reserved, must be 0 */
        __u64 ns_id;       /* Last seen namespace ID (for pagination) */
        __u32 ns_type;     /* Filter by namespace type(s) */
        __u32 spare2;      /* Reserved, must be 0 */
        __u64 user_ns_id;  /* Filter by owning user namespace */
    };

    ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
                   size_t nr_ns_ids, unsigned int flags);
A caller starts by filling in an ns_id_req structure describing the information request. The size field is the size of the structure itself, allowing for expansion in the future if need be. The ns_type field is a bitmask of the namespace types of interest; MNT_NS for mount namespaces, for example, or NET_NS for network namespaces. If only children of a given namespace are of interest, the parent's namespace ID can be put into user_ns_id. The ns_id field should be zero for the first call.
The actual listns() call takes a pointer to that structure, an array (ns_ids) to store the returned namespace IDs, and the length of that array (nr_ns_ids). The flags field must be zero in the current implementation. This call will fill in the ns_ids array with matching namespace IDs, returning the number of IDs that were put there. If the number of matching namespaces is too large to fit in the provided array, a subsequent listns() call can pick up where this one left off by placing the final ID returned by the previous call in the ns_id field of the ns_id_req structure.
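The pagination protocol described above can be sketched as a simple loop. Since listns() is not yet in a released kernel or C library, this sketch takes the lister as a function pointer and includes a stand-in, fake_listns(), that serves a fixed set of made-up IDs; both helper names and the listns_fn type are hypothetical. Two assumptions are baked in and labeled in the code: that IDs come back in ascending order, and that a return value of zero means there is nothing more to fetch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Local copy of the request structure from the patch series. */
struct ns_id_req {
    uint32_t size;
    uint32_t spare;
    uint64_t ns_id;
    uint32_t ns_type;
    uint32_t spare2;
    uint64_t user_ns_id;
};

/* Same shape as the proposed system call, so a real wrapper could be used. */
typedef ssize_t (*listns_fn)(const struct ns_id_req *req, uint64_t *ns_ids,
                             size_t nr_ns_ids, unsigned int flags);

/* Stand-in for the kernel: fixed, made-up IDs; the type filters are ignored. */
static ssize_t fake_listns(const struct ns_id_req *req, uint64_t *ns_ids,
                           size_t nr_ns_ids, unsigned int flags)
{
    static const uint64_t all[] = { 3, 5, 8, 13, 21, 34, 55 };
    size_t i, n = 0;

    if (flags != 0 || req->size != sizeof(*req))
        return -1;
    /* Assumption: IDs are ascending, so "after ns_id" means "greater than". */
    for (i = 0; i < sizeof(all) / sizeof(all[0]) && n < nr_ns_ids; i++)
        if (all[i] > req->ns_id)
            ns_ids[n++] = all[i];
    return (ssize_t)n;
}

/* Collect up to max IDs into out, asking for at most batch IDs per call. */
static ssize_t collect_ns_ids(listns_fn fn, uint32_t ns_type,
                              uint64_t *out, size_t max, size_t batch)
{
    struct ns_id_req req;
    size_t total = 0;

    memset(&req, 0, sizeof(req)); /* spare fields must be zero */
    req.size = sizeof(req);
    req.ns_type = ns_type;        /* ns_id starts at zero for the first call */

    while (total < max) {
        size_t want = (max - total < batch) ? max - total : batch;
        ssize_t got = fn(&req, out + total, want, 0);

        if (got < 0)
            return -1;
        if (got == 0)             /* assumption: zero means no more entries */
            break;
        total += (size_t)got;
        req.ns_id = out[total - 1]; /* resume after the last ID returned */
    }
    return (ssize_t)total;
}
```

A call like collect_ns_ids(fake_listns, 0, ids, 16, 3) walks the fake list three entries at a time; swapping in a wrapper around the real system call should only require matching the listns_fn signature.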
As one might expect, listns() will only return namespaces that have an __ns_ref_active count greater than zero. For the curious, the cover letter for the patch series contains several examples of how listns() can be used. It is also worth noting that, while this patch series is long, patches 22 through 72 are all tests for the new functionality; they also provide examples of how it is expected to be used.
This series is in its fourth revision; the rate of change so far suggests
that there might be another round or two in store before it is ready to go,
but there do not appear to be any fundamental objections to this work.
While Brauner has not indicated when he plans to send these changes
upstream, it seems reasonable to expect to see them during the 6.19 merge
window.
| Index entries for this article | |
|---|---|
| Kernel | Namespaces |
| Kernel | System calls/listns() |
Pagination in syscalls?
Posted Nov 3, 2025 16:34 UTC (Mon)
by hDF (subscriber, #121224)
[Link] (2 responses)

Pagination in syscalls?
Posted Nov 3, 2025 16:44 UTC (Mon)
by corbet (editor, #1)
[Link] (1 responses)
getdents() has had pagination since forever; it's pretty much necessary for any interface that could return an arbitrary amount of data.

Pagination in syscalls?
Posted Nov 3, 2025 16:55 UTC (Mon)
by johill (subscriber, #25196)
[Link]
In netlink, to solve this, a "generation" number is included in each partial dump (and updated by other code if the list changes.) If it changed anywhere while you were dumping, you know the list might be incomplete or list stale objects. In many cases, this doesn't really matter since an object might as well disappear just _after_ you created the list, etc.

Give the man a medal
Posted Nov 3, 2025 21:19 UTC (Mon)
by smurf (subscriber, #17840)
[Link] (1 responses)
> patches 22 through 72 are all tests
If only everybody was that diligent.

Give the man a medal
Posted Nov 4, 2025 3:58 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]

Namespace namespace
Posted Nov 4, 2025 8:24 UTC (Tue)
by Fowl (subscriber, #65667)
[Link]
(I guess figuring that out was part of why this wasn’t a /dev file or directory)

Finally!
Posted Nov 5, 2025 8:41 UTC (Wed)
by SPYFF (subscriber, #131114)
[Link]

How is this an API quirk? It's normal and expected behaviour with file-handles in Unix.
Posted Nov 5, 2025 11:19 UTC (Wed)
by paulj (subscriber, #341)
[Link] (3 responses)
Isn't it entirely the expected and long-standing behaviour of filehandles? A user-space programme can acquire a file-handle for something, the object itself can be removed from any indexes that make it visible (e.g., where the file-handle references a file, the programme itself or another can delete the file-name entr{y,ies} from the filesystem), and the programme could if it wishes make the referred to object visible again (adding a file-name in the filesystem again for the file-handle).
This is entirely normal, expected behaviour with file-handles, the referenced object, and whatever generally queryable indices there may be of such objects (e.g. filesystems).
I think, at best, the author means this *normal behaviour* is - for some reason - not desired for namespaces. And the reason why namespaces are special, compared to other objects, should then have been explained.

How is this an API quirk? It's normal and expected behaviour with file-handles in Unix.
Posted Nov 5, 2025 13:55 UTC (Wed)
by corbet (editor, #1)
[Link] (2 responses)
Please, do not confuse file handles with file descriptors. A file handle is best thought of, perhaps, as an alternative name for the file. Just as the possession of a file's name does not cause its persistence, holding a file handle doesn't keep a file around.

How is this an API quirk? It's normal and expected behaviour with file-handles in Unix.
Posted Nov 5, 2025 15:26 UTC (Wed)
by paulj (subscriber, #341)
[Link] (1 responses)

How is this an API quirk? It's normal and expected behaviour with file-handles in Unix.
Posted Nov 5, 2025 15:46 UTC (Wed)
by corbet (editor, #1)
[Link]
They are new with regard to namespaces, but file handles are a fairly old concept; I think they had their origin with NFS many years ago.
