
Task-level io_uring restrictions

By Jonathan Corbet
January 19, 2026
The io_uring subsystem is more than an asynchronous I/O interface for Linux; it is, for all practical purposes, an independent system-call API. It has enabled high-performance applications, but it also brings challenges for code built around classic, Unix-style system calls. For example, the seccomp() sandboxing mechanism does not work with it, causing applications using seccomp() to disable io_uring outright. Io_uring maintainer Jens Axboe is seeking to improve that situation with a rapidly evolving patch series adding a new restrictive mechanism to that subsystem.

The core feature of seccomp() is restricting access to system calls; an installed filter can examine each system call (along with its arguments) made by a thread and decide whether to allow the call to proceed or not. The operations provided by io_uring are analogous to system calls, so one might well want to restrict them in the same way. But seccomp() has no visibility into — and thus no way to control — operations requested via io_uring. Running a program under seccomp() and allowing it access to io_uring almost certainly gives that program a way to bypass the sandboxing entirely.

As it turns out, io_uring itself supports a mechanism that allows the placement of limits on io_uring operations; LWN covered an early version of this feature in 2020. To create an operation-restricted ring, a process fills in an array of io_uring_restriction structures:

    struct io_uring_restriction {
        __u16 opcode;
        union {
            __u8 register_op; /* IORING_RESTRICTION_REGISTER_OP */
            __u8 sqe_op;      /* IORING_RESTRICTION_SQE_OP */
            __u8 sqe_flags;   /* IORING_RESTRICTION_SQE_FLAGS_* */
        };
        /* Some reserved fields omitted */
    };

While the term "restriction" is used throughout the API, what these structures are doing is describing the allowed operations. Each has a sub-operation code affecting what is allowed:

  • IORING_RESTRICTION_REGISTER_OP allows a specific registration operation — an operation that affects the ring itself. These operations include registering files or buffers, setting the clock to use, and even imposing these restrictions, among many others.
  • IORING_RESTRICTION_SQE_OP enables an operation that can be queued in the ring; these include all of the I/O and networking operations supported by io_uring. The io_uring_enter() man page has a list of available operations.
  • IORING_RESTRICTION_SQE_FLAGS_ALLOWED sets the list of operation flags that are allowed to appear in io_uring operations; these flags, listed in the io_uring_enter() man page, control the sequencing of operations, use of registered buffers, and more.
  • IORING_RESTRICTION_SQE_FLAGS_REQUIRED creates a set of flags that must appear in each operation. Making a flag required implicitly sets it as being allowed as well.
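The interplay between the allowed and required flag sets may be easier to see in code. What follows is a minimal user-space model, not kernel code: the restriction_model structure and both function names are invented for illustration, and the check simply encodes the rules listed above (an opcode must be allowed explicitly, every required flag must be present, and a flag is acceptable if it is either allowed or required).

```c
#include <stdbool.h>
#include <stdint.h>

/* User-space model of one ring's restriction set; NOT the kernel's layout. */
struct restriction_model {
    uint64_t allowed_ops[4];  /* bitmap indexed by SQE opcode (0..255) */
    uint8_t  flags_allowed;   /* from IORING_RESTRICTION_SQE_FLAGS_ALLOWED */
    uint8_t  flags_required;  /* from IORING_RESTRICTION_SQE_FLAGS_REQUIRED */
};

/* Record that a given SQE opcode may be queued (hypothetical helper). */
void model_allow_op(struct restriction_model *r, uint8_t op)
{
    r->allowed_ops[op / 64] |= UINT64_C(1) << (op % 64);
}

/* Decide whether an operation with these flags would pass the model. */
bool model_sqe_permitted(const struct restriction_model *r,
                         uint8_t op, uint8_t sqe_flags)
{
    /* The opcode must have been allowed explicitly. */
    if (!(r->allowed_ops[op / 64] & (UINT64_C(1) << (op % 64))))
        return false;

    /* Every required flag must be present. */
    if ((sqe_flags & r->flags_required) != r->flags_required)
        return false;

    /* No flag outside (allowed | required) may appear. */
    if (sqe_flags & (uint8_t)~(r->flags_allowed | r->flags_required))
        return false;

    return true;
}
```

In this model, a set with flags_required = 0x01 and flags_allowed = 0x04 accepts flag values 0x01 and 0x05, but rejects 0x04 (required flag missing) and 0x03 (flag 0x02 was never allowed).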

The array of these structures can be installed with an IORING_REGISTER_RESTRICTIONS operation, after which it will be effective on the ring. This restriction mechanism is not as capable as what seccomp() can do; it cannot look at operation arguments, for example. But it is fast enough to not interfere with the performance goals of io_uring, and is sufficient to wall off significant parts of the API.
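As a rough sketch of how an application might set up an operation-restricted ring with the existing mechanism, the following uses raw system calls and the definitions in <linux/io_uring.h>. The helper names (fill_restrictions() and make_restricted_ring()) and the particular allow-list are assumptions chosen for illustration; a real program would more likely use liburing, and would still need to map the ring's queues before submitting anything.

```c
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill res[] with an illustrative allow-list; returns the entry count. */
int fill_restrictions(struct io_uring_restriction *res, int max)
{
    if (max < 4)
        return -EINVAL;
    memset(res, 0, 4 * sizeof(*res));   /* keep reserved fields zeroed */

    /* Only reads and writes may be queued... */
    res[0].opcode = IORING_RESTRICTION_SQE_OP;
    res[0].sqe_op = IORING_OP_READ;
    res[1].opcode = IORING_RESTRICTION_SQE_OP;
    res[1].sqe_op = IORING_OP_WRITE;
    /* ...optionally linked together... */
    res[2].opcode = IORING_RESTRICTION_SQE_FLAGS_ALLOWED;
    res[2].sqe_flags = IOSQE_IO_LINK;
    /* ...and the only registration left open is buffer registration. */
    res[3].opcode = IORING_RESTRICTION_REGISTER_OP;
    res[3].register_op = IORING_REGISTER_BUFFERS;
    return 4;
}

/* Returns the fd of a restricted, enabled ring, or -errno on failure. */
int make_restricted_ring(unsigned entries)
{
    struct io_uring_params p;
    struct io_uring_restriction res[4];
    int fd, nr, saved;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_R_DISABLED;  /* the ring must start out disabled */
    fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    if (fd < 0)
        return -errno;

    nr = fill_restrictions(res, 4);
    if (syscall(__NR_io_uring_register, fd,
                IORING_REGISTER_RESTRICTIONS, res, nr) < 0 ||
        syscall(__NR_io_uring_register, fd,
                IORING_REGISTER_ENABLE_RINGS, NULL, 0) < 0) {
        saved = errno;
        close(fd);
        return -saved;
    }
    return fd;
}
```

A caller would invoke make_restricted_ring(8) and treat a negative return as an error code; on systems where io_uring is disabled or blocked by a sandbox, the setup call is expected to fail.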

There is, however, a significant limitation to the current restriction mechanism: restrictions can only be applied to an existing ring, and that ring must be in the disabled state at the time. It works well for an application that, for example, needs to create a ring, add restrictions, then pass it into a container. It falls short, though, for use cases that want to allow io_uring in general, but with a specified subset of operations. Axboe's work is intended to address this limitation by allowing restrictions to be applied to a task rather than to a specific ring.

Specifically, this work started by adding a new operation, IORING_REGISTER_RESTRICTIONS_TASK, that can accept the same list of io_uring_restriction structures. That list will be stored with the calling task itself, though, rather than with a specific ring, and the restrictions will be applied to all rings subsequently created by that task. The list is applied to children during a fork, so the restrictions will apply to all child processes created after they are set up. These restrictions thus govern any rings created in the future, without the controlling task having to participate in that creation.
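The inheritance rules described here can be illustrated with a toy user-space model. Nothing below is the proposed kernel API: the structures and function names are hypothetical, the allow-list is shrunk to a single 64-bit bitmap, and the model assumes that a ring captures its creator's list at creation time, so rings created before the list was registered stay unrestricted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy allow-list: one bit per opcode, covering opcodes 0..63 only. */
struct task_model {
    bool     has_restrictions;
    uint64_t allowed_ops;
};

struct ring_model {
    bool     restricted;
    uint64_t allowed_ops;
};

/* Models IORING_REGISTER_RESTRICTIONS_TASK: the list is stored on the task. */
void model_register_task(struct task_model *t, uint64_t allowed_ops)
{
    t->has_restrictions = true;
    t->allowed_ops = allowed_ops;
}

/* Models fork(): the child starts with a copy of the parent's list. */
struct task_model model_fork(const struct task_model *parent)
{
    return *parent;
}

/* Models ring creation: a new ring picks up whatever its creator has. */
struct ring_model model_ring_create(const struct task_model *creator)
{
    struct ring_model ring = { creator->has_restrictions,
                               creator->allowed_ops };
    return ring;
}

/* Would this ring accept the given opcode? */
bool model_ring_allows(const struct ring_model *ring, unsigned op)
{
    if (!ring->restricted)
        return true;
    return op < 64 && (ring->allowed_ops & (UINT64_C(1) << op)) != 0;
}
```

In the model, a child forked after model_register_task() creates restricted rings without the parent taking any further part, which is the property the patch set is after.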

Once the restrictions have been set, they are immutable, with a couple of exceptions. The IORING_REG_RESTRICTIONS_MASK flag allows restrictions to be tightened further by removing allowed operations and flags, or by adding new required flags. The process that initially added the restrictions retains the power to modify them or remove them entirely. That process's children, instead, will remain stuck with the restrictions that were created for them.

At least, that was the state of things as of the second RFC version of the patch set. The third version made a number of changes, starting with the removal of IORING_REG_RESTRICTIONS_MASK and any other ability to change the restrictions once they have been put into place. The bigger change, though, was the addition of more flexible filtering using, inevitably, a set of BPF programs. Interestingly, that flexibility was reduced somewhat in later versions, as will be seen.

The current BPF implementation is a bit of a proof of concept. Among other things, it currently only properly filters the IORING_OP_SOCKET operation, which is the io_uring equivalent to the socket() system call. Operations can be controlled, but registration requests are not currently included in the BPF mechanism.

There is a new registration operation, IORING_REGISTER_BPF_FILTER, which adds a new BPF program to a ring; the program is associated with a specific IORING_ operation code. It will be invoked after the initial preparation for a new operation has been done; as a result, any structures provided by user space as part of the operation will have been copied into the kernel and will be available for the program to inspect. That gives these filters an advantage over seccomp(), which generally cannot access data in user space that is passed to the kernel via pointers.

The program will also be passed context specific to the operation in question; for IORING_OP_SOCKET, that context includes the address family, socket type, and protocol provided by user space. A non-zero return value from the BPF program allows the operation to proceed; otherwise it will be blocked. There can be multiple BPF programs attached to any given operation; they will be invoked in sequence, and any one of them can block an operation. While the current patch set does not implement this behavior, Axboe has said that he intends to change the behavior to "deny by default" in the future; if BPF is in use, then an operation will be disallowed unless a BPF program explicitly allows it.
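The decision logic for a chain of filters can be sketched in plain C. This is only a user-space model with ordinary functions standing in for BPF programs; the socket_ctx_model structure, the filter functions, and model_run_filters() are all invented for illustration, and the deny_by_default parameter models the future behavior Axboe has described rather than anything in the current patches.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the context passed for IORING_OP_SOCKET. */
struct socket_ctx_model {
    int family;
    int type;
    int protocol;
};

/* A filter returns non-zero to allow the operation, zero to block it. */
typedef int (*filter_fn)(const struct socket_ctx_model *ctx);

int allow_inet_only(const struct socket_ctx_model *ctx)
{
    return ctx->family == AF_INET || ctx->family == AF_INET6;
}

int allow_stream_only(const struct socket_ctx_model *ctx)
{
    return ctx->type == SOCK_STREAM;
}

/* Run every attached filter in sequence; any one of them can block. */
bool model_run_filters(const filter_fn *filters, size_t n,
                       const struct socket_ctx_model *ctx,
                       bool deny_by_default)
{
    /* With no filter attached, the outcome depends on the default policy. */
    if (n == 0)
        return !deny_by_default;

    for (size_t i = 0; i < n; i++)
        if (filters[i](ctx) == 0)
            return false;
    return true;
}
```

With both filters attached, an AF_INET stream socket passes, while an AF_INET datagram socket or an AF_UNIX stream socket is blocked by one of the two.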

By the time the patch set reached version 5 (with the "RFC" tag removed) things had changed again in an interesting way. There are two versions of BPF in the kernel, the "extended BPF" that is normally just called "BPF" in recent times, and "classic BPF", which is the earlier, BSD-derived variant that was designed for packet filtering. Classic BPF is far less capable and lacks compiler support; there have been no new users of it added to the kernel for years. But the current version of the io_uring patches now uses classic BPF rather than extended BPF.

Axboe noted that the usability of the feature is reduced by this change: "This obviously comes with a bit of pain on the usability front, as you now need to write filters in cBPF bytecode". The change was driven by the fact that classic BPF can be used by unprivileged processes, while extended BPF requires privilege (specifically, the CAP_BPF capability). For the desired use case of sandboxing containers, accessibility without privilege is important. It is worth noting that seccomp() also still uses classic BPF, for the same reason. The hooks for extended BPF are still there, but cannot be used.
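To give a feel for what writing filters in cBPF bytecode involves, here is a small sketch built with the BPF_STMT() and BPF_JUMP() macros from <linux/filter.h>, the same ones used for seccomp() filters. The context layout (sock_ctx_words) is purely hypothetical, since the real offsets are defined by the patch set, and model_run_cbpf() is a toy evaluator that understands only the three instruction forms used here; it assumes host-endian word loads, as seccomp() does, and exists only so the program can be exercised in user space.

```c
#include <linux/filter.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* HYPOTHETICAL context layout; the real one comes from the patch set. */
struct sock_ctx_words {
    uint32_t family;
    uint32_t type;
    uint32_t protocol;
};

/* Return 1 (allow) for AF_INET stream sockets, 0 (block) otherwise. */
struct sock_filter inet_stream_only[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct sock_ctx_words, family)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 0, 3),      /* else: block */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct sock_ctx_words, type)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SOCK_STREAM, 0, 1),  /* else: block */
    BPF_STMT(BPF_RET | BPF_K, 1),
    BPF_STMT(BPF_RET | BPF_K, 0),
};

/* Toy evaluator for the three instruction forms above; not a full cBPF VM. */
uint32_t model_run_cbpf(const struct sock_filter *prog, size_t len,
                        const void *ctx, size_t ctx_len)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (insn->k > ctx_len || ctx_len - insn->k < sizeof(acc))
                return 0;               /* out-of-bounds load: block */
            memcpy(&acc, (const char *)ctx + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Jump offsets are relative to the next instruction. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return 0;                   /* unsupported instruction: block */
        }
    }
    return 0;                           /* fell off the end: block */
}
```

Even this two-condition filter needs hand-computed jump offsets, which hints at the usability cost Axboe mentions.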

As one might surmise, this patch set seems to be evolving quickly, and may well have changed again by the time you read this. It seems clear, though, that it will soon be possible to control access to io_uring at a level that, previously, has not been possible. Just as brakes allow a car to go faster, fine-grained control may make io_uring available in contexts where, until now, it has been blocked.

Index entries for this article
Kernel: BPF/io_uring
Kernel: io_uring/Security
Kernel: Releases/7.0



A crazy thought

Posted Jan 19, 2026 18:48 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (28 responses)

Here's a crazy thought. What if we use an industry-proven sandboxed environment for it, the one that is battle-tested by the most aggressive Internet environment?

Hint: WASM.

Has the eBPF saga not taught you anything? You'll end up reinventing it _again_.

A crazy thought

Posted Jan 19, 2026 20:07 UTC (Mon) by ballombe (subscriber, #9523) [Link] (7 responses)

... and you'll end up making the exact same comment _again_.

A crazy thought

Posted Jan 19, 2026 21:02 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Well, yes. It'll be a fun recursive WTF.

A crazy thought

Posted Jan 20, 2026 7:14 UTC (Tue) by edeloget (subscriber, #88392) [Link] (5 responses)

At this time, I feel that eBPF is mature enough not to be systematically compared to WASM. It's also evolving faster. It's also more and more used outside the space of the Linux kernel (mostly in the network area).

If you really want a different yet proven VM for the kernel, why WASM? Why not the Java VM? Or the Lua VM? Or any other VM that has been there for years?

Since the eBPF VM is already here, I think that anybody who wants another VM in the kernel will have to implement it, instead of requiring everyone to do it. Isn't that the idea behind Open Source?

A crazy thought

Posted Jan 20, 2026 7:32 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

The Java VM is not well suited for this; it needs classloading, an object hierarchy, GC, and so on.

Lua would have been great, and it has been used successfully in BSDs in kernelspace.

> It's also evolving faster.

No, it's racing to become competitive with WASM circa 5 years ago. I'm now eagerly awaiting multithreading in eBPF.

A crazy thought

Posted Jan 20, 2026 17:23 UTC (Tue) by ballombe (subscriber, #9523) [Link] (3 responses)

> No, it's racing to become competitive with WASM circa 5 years ago. I'm now eagerly awaiting multithreading in eBPF.

WASM is still 32-bit-only and progress on that has been slow.

A crazy thought

Posted Jan 20, 2026 18:50 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

A crazy thought

Posted Jan 21, 2026 17:09 UTC (Wed) by ballombe (subscriber, #9523) [Link] (1 responses)

This is huge! Is there a version of emscripten available that supports LP64?

A crazy thought

Posted Jan 24, 2026 20:35 UTC (Sat) by ballombe (subscriber, #9523) [Link]

In fact it works already, with emscripten 4.0.37 provided node is upgraded to 2.24 and -s MEMORY64 is added to CFLAGS. Congrats to everyone involved.

A crazy thought

Posted Jan 19, 2026 20:19 UTC (Mon) by aszs (subscriber, #50252) [Link] (13 responses)

WASM prevents code from accessing memory outside its sandbox but it doesn't provide memory safety within its sandbox. To provide safety guarantees similar to BPF you would need to make some pretty big changes to WASM. For example, see https://arxiv.org/abs/2208.13583

A crazy thought

Posted Jan 19, 2026 21:03 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (12 responses)

Yes, and? eBPF also has the same property. You can create a large array and simulate malloc/free inside this array, with out-of-bounds access.

eBPF code won't be able to get _out_ of it, but it can still logically misbehave.

A crazy thought

Posted Jan 19, 2026 23:12 UTC (Mon) by aszs (subscriber, #50252) [Link] (11 responses)

talk about "Yes, and?"! :) Why would anyone do something like that? The point is that BPF has facilities to validate access to kernel structures both statically (through its verifier) and dynamically (using dynptrs) and WASM doesn't.

A crazy thought

Posted Jan 19, 2026 23:29 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

> Why would anyone do something like that?

And why would anyone write code in WASM that has use-after-free and all that?

WASM also can validate access to the kernel structures through dynptrs, you just expose the bpf_dynptr_write/bpf_dynptr_read as helpers to the WASM runtime. There is absolutely no difference here between them.

The only remaining difference is the verifier. And it has been neutered to the point of irrelevance with the addition of bpf_loop/bpf_foreach and other helper functions. So I can trivially make correct eBPF programs that require practically unbounded time.

A crazy thought

Posted Jan 20, 2026 0:06 UTC (Tue) by aszs (subscriber, #50252) [Link] (9 responses)

> And why would anyone write code in WASM that has use-after-free and all that?

I'll leave that as an exercise for the reader...

Anyway, you can't call malloc or free in bpf programs and the bpf verifier statically tracks the lifetime of the pointers the program does have access to, so it guarantees use-after-frees can't happen.

That said, a WASM-to-BPF byte code transpiler would be awesome, i'm surprised no one has made one yet.

A crazy thought

Posted Jan 20, 2026 1:28 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

> Anyway, you can't call malloc or free in bpf programs

You absolutely can: https://docs.ebpf.io/linux/kfuncs/bpf_arena_alloc_pages/ Here's a mandatory LWN article about it: https://lwn.net/Articles/961941/

> That said, a WASM-to-BPF byte code transpiler would be awesome, i'm surprised no one has made one yet.

The transpiler is trivial, but the runtime environments are too different for it to be useful.

A crazy thought

Posted Jan 20, 2026 3:59 UTC (Tue) by aszs (subscriber, #50252) [Link] (7 responses)

> You absolutely can: https://docs.ebpf.io/linux/kfuncs/bpf_arena_alloc_pages/ Here's a mandatory LWN article about it: https://lwn.net/Articles/961941/

Sure, there's lots of ways to directly and indirectly allocate memory through bpf apis. But those pointers are tracked by the verifier to prevent unsafe access (use after free, pointer arithmetic, etc.). The wasm bytecode verifier doesn't do this, so any theoretical in-kernel wasm program couldn't safely access kernel memory directly.

A crazy thought

Posted Jan 20, 2026 6:14 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> The wasm bytecode verifier doesn't do this

Yes, it does. WASM, just like eBPF, by construction can't access memory outside of its sandbox.

And just like in eBPF, WASM doesn't actually even _have_ pointers. It instead has indices in memory blocks, and they are always range-checked. Go on and try to read the spec: https://webassembly.github.io/spec/core/syntax/modules.ht...

It is very common for old C/C++ programs to re-create the malloc/free on top of a large pre-allocated block of memory in WASM and to use indexes to emulate the C/C++ pointers. And it's possible for C/C++ code in WASM to have out-of-bounds reads/writes, just like on the real hardware. So if you compile OpenSSL into WASM, it will still be vulnerable to a Heartbleed-like bug.

However, the out-of-bounds access in WASM still can NOT escape the memory blocks. They are compiled into index-based array operations, and they are always range-checked.

And of course, exactly the same scenario can happen in eBPF with arenas. The eBPF verifier won't do anything to help.

I just leaving this here

Posted Jan 20, 2026 7:20 UTC (Tue) by NHO (subscriber, #104320) [Link] (1 responses)

I just leaving this here

Posted Jan 20, 2026 7:35 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

In other words: bugs happen. Duh. Same for the eBPF, it also has had its share of CVEs.

The kernel developers themselves don't trust eBPF to be safe, so it's limited to the root user only.

A crazy thought

Posted Jan 20, 2026 14:13 UTC (Tue) by aszs (subscriber, #50252) [Link] (3 responses)

>> Yes, it does. WASM, just like eBPF, by construction can't access memory outside of its sandbox.

my first comment:

"WASM prevents code from accessing memory outside its sandbox but it doesn't provide memory safety within its sandbox. To provide safety guarantees similar to BPF you would need to make some pretty big changes to WASM. For example, see https://arxiv.org/abs/2208.13583"

We are literally going in circles. How would your in-kernel wasm jit handle the task_struct pointer returned by bpf_get_current_task_btf()? Or modifying xdp packets with apis like https://docs.ebpf.io/linux/helper-function/bpf_xdp_adjust... And do stuff like that safely without overhead?

A crazy thought

Posted Jan 20, 2026 18:48 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> We are literally going in circles. How would your in-kernel wasm jit handle the task_struct pointer returned by bpf_get_current_task_btf()?

Would you read the WASM spec, please? The task_struct will be modeled as a WASM structure type: https://webassembly.github.io/spec/core/syntax/types.html... It's accessible via _references_ only ( https://webassembly.github.io/spec/core/syntax/types.html... ):

> Reference types are opaque, meaning that neither their size nor their bit pattern can be observed. Values of reference type can be stored in tables but not in memories.

It will NOT be accessible using pointer operations from any memory blocks. That's how browsers expose complex objects into the WASM world.

Just like in eBPF (does that phrase sound familiar by now?).

> And do stuff like that safely without overhead?

Yes. Just like in eBPF.

Again, this is not something theoretical or exotic. That's how browsers work RIGHT NOW.

A crazy thought

Posted Jan 21, 2026 17:35 UTC (Wed) by notriddle (subscriber, #130608) [Link] (1 responses)

It’s probably worth pointing out that the features you’re talking about are less than a year old. https://webassembly.org/news/2025-09-17-wasm-3.0/

Not necessarily a problem, but it does mean that the idea of wasm being “battle-tested” is wrong. The features you think they should rely on are new.

A crazy thought

Posted Jan 21, 2026 18:41 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

The typed references are new, but even the first version of WASM had something similar: https://www.w3.org/TR/2019/REC-wasm-core-1-20191205/#exte...

External types were represented as tables and functions with similar limitations, they were treated as opaque objects without any ways to get to their exact binary representation (never mind modify them).

A crazy thought

Posted Jan 20, 2026 0:25 UTC (Tue) by ibukanov (subscriber, #3942) [Link] (5 responses)

WASM is much harder to protect against Spectre and similar hardware bugs than BPF.

A crazy thought

Posted Jan 20, 2026 1:09 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

No, it's not. You can do exactly the same constant blinding during code generation. It would require additional development, because it's not how people are combating SPECTRE in user-space, but nothing terribly unusual. Just another pass in the code generator.

A crazy thought

Posted Jan 20, 2026 17:46 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (3 responses)

All of these "one only needs to…" and yet…no one is doing it?

A crazy thought

Posted Jan 20, 2026 19:00 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Why would you? SPECTRE is not a major issue in userspace that WASM needs to take care of. Browsers prevent timing attacks by isolating different trust domains into different processes.

A crazy thought

Posted Jan 20, 2026 21:27 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

I'm talking about the wider WASM-over-BPF commentary.

A crazy thought

Posted Jan 20, 2026 22:36 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

I understand. My point is that no WASM runtime right now _needs_ to have SPECTRE mitigations due to the way browsers work. So nobody is working on them.

And yes, I looked at the eBPF JIT and Wasmtime and Cranelift, so I roughly understand the scope of work. It's a fair amount, but nothing extraordinary.

Also I have a deep suspicion that with all the kfuncs available to eBPF, you can find enough speculation gadgets there.

Surely opening files is the key operation to sandbox

Posted Jan 20, 2026 9:37 UTC (Tue) by epa (subscriber, #39769) [Link] (1 responses)

I haven't used io_uring (my experience goes as far as select()) but I would expect that the I/O you really need to go fast is read() and write(). Opening the file is a comparatively rare event. Yet surely most seccomp filters strictly control what files you can open (and whether for read-only, append-only, or read-write) and don't do much to filter the operations on an already open filehandle.

Can't you open the files conventionally, with calls that are filtered by seccomp in the usual way, and then pass those filehandles into io_uring, via an entry point that allows only read and write operations? Seccomp would allow that restricted form of io_uring only. That would surely cover 80% of it.

Surely opening files is the key operation to sandbox

Posted Jan 20, 2026 13:21 UTC (Tue) by alip (subscriber, #170176) [Link]

Another option is to have an API to emulate io_uring through seccomp-unotify (like Syd) or seccomp-trap (like gVisor).

Imagine...

Posted Jan 22, 2026 5:15 UTC (Thu) by milesrout (subscriber, #126894) [Link] (3 responses)

...a world in which the Linux kernel could have all this security theatre deleted. No more seccomp, no more LSMs. Imagine how much code and overhead could be deleted, how much complexity could be avoided.

None of it is necessary. We have security mechanisms in place already. They're called users and permissions. Don't run random untrusted code you downloaded from the internet without checking it.

Imagine...

Posted Jan 22, 2026 5:40 UTC (Thu) by mb (subscriber, #50428) [Link] (2 responses)

A pretty horrible world, I would say.

Not every detail about these mechanisms is good and perfect of course.
But overall it's a good idea to lock down applications to what they actually need rather than exposing them to everything that is available.

This has nothing to do with running untrusted code.
Also fully trusted and reviewed code shall be locked down just to reduce and mitigate the remaining bugs.

Imagine...

Posted Jan 22, 2026 7:10 UTC (Thu) by milesrout (subscriber, #126894) [Link] (1 responses)

Remember the counterfactual is a simpler world. Simpler kernel, simpler userspace code. There is a lot of complexity in these layers of "security in depth" and complexity is the source of bugs.

Imagine...

Posted Jan 22, 2026 7:50 UTC (Thu) by Wol (subscriber, #4433) [Link]

The problem with a simpler world is simple (as in braindead) users.

I just gave my users a completely new system with the instruction "please test". Thanks to Chinese Whispers, they used it live ... and it contained a painful flaw. It's okay being 95% correct, but if that other 5% is something that's rather noticeable to your customers - ouch! :-(

And apparently it didn't cross ANYbody's mind to think "this is the first time we've used it, and we're using it live". Especially as my default rule is "if you don't know what you're doing, do the same as last week" - the number of times I have to repeat that rule, it frustrates the hell out of me!

Cheers,
Wol

Why can't seccomp() check io_uring-generated syscalls?

Posted Feb 7, 2026 22:29 UTC (Sat) by reillyeon (subscriber, #51146) [Link] (1 responses)

This feels like a dumb question but despite the number of articles mentioning that io_uring "bypasses" seccomp() I feel like no one has explained why these two subsystems can't work together. Perhaps my mental model of how io_uring operates is incorrect, but as I understand it each io_uring operation is equivalent to a system call. Why can't these be checked, perhaps even with the exact same code as the equivalent system call, when the operation is executed? If we can solve the complexities of checking the properties of a system call why can't we do the same for an io_uring operation? Is a completely parallel restriction mechanism really required?

Why can't seccomp() check io_uring-generated syscalls?

Posted Feb 8, 2026 0:23 UTC (Sun) by corbet (editor, #1) [Link]

Io_uring really is an entirely separate interface to kernel-provided functionality. It has things like descriptorless files, for example. Many io_uring operations mirror normal system calls, but others do not.

One could probably find a way to extend seccomp() to handle it, but it would probably not be a smaller project than what is being proposed here.


Copyright © 2026, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds