A generic ring buffer for the kernel
A ring buffer is simply a circular buffer, maintained in memory, that is shared between user space and the kernel. One side of a data stream writes data into the buffer, while the other consumes it; as long as the buffer neither overflows nor underflows, data can be transferred with no system calls at all. The addition of a ring buffer can, thus, enable highly efficient data transfer in situations where the data rates are relatively high. Overstreet thinks that other kernel subsystems could benefit from ring-buffer interfaces, and would like to make it possible to add those interfaces without reinventing the wheel.
The user-space interface
On the user-space side, with the proposed interface, a ring buffer is always associated with a file descriptor, so the first step will be to open the object (such as a file, a device, or a pipe) to which the ring buffer will be attached. Then, the ring buffer is established with the new system call:
int ringbuffer(unsigned int fd, int rw, u32 size, void **buffer);
This call will request the creation of a ring buffer associated with the open file descriptor fd; the rw argument should be set to zero for read access, or to any other value for write access. The requested size of the buffer is passed in size. On a successful return, the address of the buffer will be stored in buffer.
Note that it is up to the subsystem implementing the ring buffer to decide whether to respond to this call by creating a new buffer, or by mapping an existing buffer into the caller's address space. The intent in the latter case is to allow a single ring buffer to be shared between multiple processes.
There is no flags argument in this version of the system call; that seems likely to change before any eventual merging into the mainline.
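As a rough illustration only, creating a ring buffer from user space might look something like the sketch below; the system call has no number assigned yet, so __NR_ringbuffer is a placeholder, the device path is hypothetical, and error handling is minimal:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define __NR_ringbuffer 468	/* placeholder; no number has been assigned */

    int main(void)
    {
    	void *buffer;
    	int fd = open("/dev/ringbuffer-test", O_RDONLY);	/* hypothetical device */

    	if (fd < 0) {
    		perror("open");
    		return 1;
    	}
    	/* Request a 64KB ring buffer for reading from fd */
    	if (syscall(__NR_ringbuffer, fd, 0, (uint32_t)65536, &buffer) < 0) {
    		perror("ringbuffer");
    		return 1;
    	}
    	printf("ring buffer mapped at %p\n", buffer);
    	return 0;
    }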
The returned pointer will be aimed at a region of memory starting with a copy of this structure:
    struct ringbuffer_ptrs {
    	__u32	head;
    	__u32	tail;
    	__u32	size;		/* always a power of two */
    	__u32	mask;		/* size - 1 */
    	__u32	data_offset;
    };
The buffer itself will begin at data_offset from the returned pointer; in practice, that offset will likely always be one page. The actual size of the buffer will be stored in size; as noted, it will always be a power of two for reasons that will soon become clear. The head offset is the writing position, while tail is the read position. If head and tail are equal, the buffer is empty.
Both head and tail are incremented unconditionally when data is added or consumed; there are no tests for wrapping around. Instead, these values must be ANDed with the mask value to obtain an actual offset into the buffer. This technique eliminates the need for branches in the increment path, but it also means that a little care is needed when working with the head and tail values. Testing for emptiness is easy, since the two values will always be equal in this case. A proper test for fullness looks like:
if ((head - tail) >= size) /* buffer is full, can't write to it */
(This test has been simplified; a properly written test must use atomic operations to avoid memory-ordering problems; see this test program for an example).
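To make the index arithmetic concrete, here is a minimal, single-writer sketch built around the structure above; ring_write() is an invented helper rather than part of the proposed interface, and the plain loads and stores stand in for the atomic, properly ordered accesses that real code would need:

    #include <stddef.h>
    #include <stdint.h>

    struct ringbuffer_ptrs {
    	uint32_t head;		/* write position; increments without wrapping */
    	uint32_t tail;		/* read position; increments without wrapping */
    	uint32_t size;		/* always a power of two */
    	uint32_t mask;		/* size - 1 */
    	uint32_t data_offset;	/* offset of the data area from this struct */
    };

    /* Copy up to len bytes into the ring; returns the number actually written. */
    static size_t ring_write(struct ringbuffer_ptrs *rb, const void *src, size_t len)
    {
    	unsigned char *data = (unsigned char *)rb + rb->data_offset;
    	const unsigned char *in = src;
    	uint32_t head = rb->head, tail = rb->tail;
    	uint32_t space = rb->size - (head - tail);	/* free bytes */
    	size_t n = len < space ? len : space;

    	/* The mask, not a comparison, handles wraparound at the end of the buffer */
    	for (size_t i = 0; i < n; i++)
    		data[(head + i) & rb->mask] = in[i];

    	rb->head = head + n;	/* must be a release store in real code */
    	return n;
    }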
A process writing to a ring buffer can continue to feed data into it until the buffer fills, at which point it must stop. The writer could poll the tail value to detect when the reader has caught up and more space is available, but that would not be efficient. A similar situation applies to a reader that has emptied the buffer. Either way, the process can wait for the situation to change with:
int ringbuffer_wait(unsigned int fd, int rw);
Where fd is the file descriptor associated with the ring buffer, and rw indicates whether read or write access is needed. Either way, the calling process will block until the situation with the ring changes. Indicating such a change is the responsibility of user space as well; a writer that has put data into a previously empty buffer, or a reader that has consumed data from a previously full buffer should call:
int ringbuffer_wakeup(unsigned int fd, int rw);
In the cover letter, Overstreet suggests that ringbuffer_wait() and ringbuffer_wakeup() might eventually be replaced with futex operations. This patch also mentions the need to improve ringbuffer_wait() to allow for a timeout or to specify the minimum amount of data that must be written to (or read from) the buffer before a wakeup should happen.
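Putting those pieces together, a reader's main loop might look roughly like this sketch; drain_ring() is an invented helper, the ringbuffer_wait() and ringbuffer_wakeup() declarations stand in for syscall() invocations (no libc wrappers exist), the interpretation of the rw arguments is my reading of the description, and the atomic-access details are again glossed over:

    #include <stddef.h>
    #include <stdint.h>

    /* Layout as described earlier in the article */
    struct ringbuffer_ptrs { uint32_t head, tail, size, mask, data_offset; };

    /* Proposed system calls; placeholders for syscall() invocations */
    int ringbuffer_wait(unsigned int fd, int rw);
    int ringbuffer_wakeup(unsigned int fd, int rw);

    void drain_ring(int fd, struct ringbuffer_ptrs *rb,
    		void (*consume)(const void *buf, size_t len))
    {
    	unsigned char *data = (unsigned char *)rb + rb->data_offset;

    	for (;;) {
    		uint32_t head = rb->head;	/* acquire load in real code */
    		uint32_t tail = rb->tail;

    		if (head == tail) {
    			ringbuffer_wait(fd, 0);	/* empty: block until the writer adds data */
    			continue;
    		}
    		while (tail != head) {
    			consume(&data[tail & rb->mask], 1);
    			tail++;
    		}
    		rb->tail = tail;		/* release store in real code */
    		ringbuffer_wakeup(fd, 1);	/* the buffer may have been full; wake the writer */
    	}
    }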
The kernel side
A subsystem adds ring-buffer support by providing the new ringbuffer() function in its file_operations structure:
struct ringbuffer *(*ringbuffer)(struct file *file, int rw);
This function will be called in response to a ringbuffer() call from user space. There is a whole set of meticulously undocumented support functions available to drivers for working with ring buffers, including ringbuffer_alloc() to create one, ringbuffer_read_iter() and ringbuffer_write_iter() to move data out of or into a ring buffer, etc. This patch provides a test driver showing the basics of implementing a ring buffer in kernel space.
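For orientation, a minimal driver hooking into this mechanism might look vaguely like the following; only the file_operations method comes from the posted description, while the ringbuffer_alloc() signature, the locking shortcuts, and everything else here are assumptions:

    #include <linux/fs.h>
    #include <linux/module.h>

    static struct ringbuffer *mydev_rb;

    static struct ringbuffer *mydev_ringbuffer(struct file *file, int rw)
    {
    	/* Create the buffer on first use; later callers are handed the same
    	 * buffer, so multiple openers share one ring.  (Locking omitted.) */
    	if (!mydev_rb)
    		mydev_rb = ringbuffer_alloc(PAGE_SIZE);	/* assumed signature */
    	return mydev_rb;
    }

    static const struct file_operations mydev_fops = {
    	.owner		= THIS_MODULE,
    	.ringbuffer	= mydev_ringbuffer,
    	/* open(), read(), write(), etc. omitted */
    };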
Other than the test device, there are no in-kernel users submitted with this patch set. In the cover letter, Overstreet says that the first such user will be the filesystems in user space (FUSE) subsystem, but pipes and sockets may follow thereafter. The performance benefits of the ring-buffer approach are said to be significant: "reading/writing from the ringbuffer 16 bytes at a time is ~7x faster than using read/write syscalls".
As of this writing, there have been no responses to this submission. This work is in an early state and seems likely to need to evolve somewhat before it could be considered for merging; developers are also likely to want to see a real user in the kernel. Even more interesting might be a demonstration of how some of the existing kernel ring-buffer interfaces could have been implemented using this abstraction (not that they will change now). If this work achieves its goals, it might, in a highly optimistic view, be the last ring-buffer interface added to the kernel.
Index entries for this article
Kernel: System calls/ringbuffer()
Posted Jun 7, 2024 13:22 UTC (Fri) by rrolls (subscriber, #151126):
"int rw" (which appears in not just one, but all four function signatures mentioned in this article) really just contains a boolean value, so all of these should be converted to "int flags" and a constant to indicate r/w direction.
Posted Jun 6, 2024 19:58 UTC (Thu) by flussence (guest, #85566):
There's a link in there to an lkml thread from 2004, and it may be even older than that, given how it's mentioned in an offhand way there.
Posted Jun 7, 2024 8:23 UTC (Fri) by ppisa (subscriber, #67307):
Another problem is multiple readers and writers, because only hardware transactional-memory support allows the increment of ringbuffer_ptrs->head and the write of the data at the previous head location to be kept atomic with respect to other writer CPUs. If writing is limited to a single thread, then it is not a problem: the writer checks for space, writes the data, and, as the last step, increments head with a release-mode store.
Even more complicated is the case where multiple writers and variable-sized data are allowed, and the puzzle gets even more interesting when the space-allocation operation must be non-blocking if there is space, or, under a relaxed condition, may block, but only for the duration of the allocation and not during the data-fill operation, which could require access to other resources that can block or to memory that can be swapped out. Preventing such blocking is a must if the buffer is shared and used by tasks with different priorities, some of which fall into the real-time-critical domain.
I have tried to solve most of the above problems in my ul_cbuff implementation. The data area is even separated from the mapping of the buffer control structures, which can include the required mutexes and semaphores. The implementation allows, and has been tested with, multiple writers from multiple tasks; the data in the buffer become visible to readers only when all previously allocated entries have been released by the writers. Readers can be of two categories: for one, it is guaranteed that no data are lost (that is, the buffer is marked as full until the last such reader releases its data); the other is a snapshot maker, which is guaranteed to receive only complete, available messages from the head side but does not block subsequent writes.
The system has been used for logging in critical systems, such as a hospital infusion system based on the RTEMS RTOS (https://devel.rtems.org/wiki/TBR/UserApp/AMV_Technic_I), but the test code has exercised it even on GNU/Linux, in an environment with multiple memory address spaces/processes.
See the files
https://gitlab.com/pikron/sw-base/ulut/-/blob/master/ulut...
https://gitlab.com/pikron/sw-base/ulut/-/blob/master/ulut...
at the repository
https://gitlab.com/pikron/sw-base/ulut/-/tree/master/ulut
or
https://sourceforge.net/p/ulan/ulut/ci/master/tree/
I can even post the test code.
Posted Jun 7, 2024 8:42 UTC (Fri) by zev (subscriber, #88455):
And on a different note, would it be possible to obviate the need for an explicit ringbuffer_wakeup() operation by setting the page containing struct ringbuffer_ptrs read-only and trapping the store that updates it? I suppose the page-table updates and fault-handling involved might end up making it slower in the case where it's actually used, but not having to explicitly check for it in the (hopefully?) common case of not needing it is at least aesthetically appealing.
Posted Jun 7, 2024 17:35 UTC (Fri) by Tobu (subscriber, #24111):
This is addressed in a comment inside the struct (quoted from the lkml thread):

 * We use u32s because this type is shared between the kernel and
 * userspace - ulong/size_t won't work here, we might be 32bit userland
 * and 64 bit kernel, and u64 would be preferable (reduced probability
 * of ABA) but not all architectures can atomically read/write to a u64;
 * we need to avoid torn reads/writes.

Apparently, u32 is atomic on architectures where u64 isn't.
Posted Jun 8, 2024 22:32 UTC (Sat) by koverstreet (✭ supporter ✭, #4296):
32 bits is enough; ringbuffers generally aren't supposed to be huge, they're a transport mechanism. We could add 64 bit ringbuffers, but there'd be no reason to use them most of the time.
Posted Jun 7, 2024 18:00 UTC (Fri) by abatters (✭ supporter ✭, #6932):
https://www.circlemud.org/jelson/software/emlog/
Although the site does not appear responsive at the moment.
Posted Jun 7, 2024 18:59 UTC (Fri) by koverstreet (✭ supporter ✭, #4296):
that'd be interesting
Posted Jun 7, 2024 19:42 UTC (Fri) by mathstuf (subscriber, #69389):
What's the rationale for using `unsigned` here? Is it possible to get a valid fd above 2³¹? AFAIK, all fd-creating syscalls return `int`. Why make invalid fds detectable behind a sign-change cast (or pretend that high-bit-set fd values are valid)?
Posted Jun 11, 2024 18:46 UTC (Tue) by donald.buczek (subscriber, #112892):
Why are these ringbuffers implemented as attachments to existing file descriptors of arbitrary types and not as standalone objects which could, for example, be referenced by dedicated file descriptors of a new pseudo-filesystem? That way, epoll and friends could be used for the wait operations. I fail to see why there must be a one-to-one relation between a ringbuffer and an existing file descriptor. Is the only expected application a read/write substitute?
Another question: the original motivation seems to be communication between user space and kernel space, but is the implementation now also a proper tool for user-space-only communication between threads? Maybe this second question clarifies the first one: where would I get my no-op-type file descriptor whose only purpose would be to hold a ring buffer?
Posted Jun 13, 2024 13:42 UTC (Thu) by koverstreet (✭ supporter ✭, #4296):
So once we switch to futex operations for wait/wake, you'll be able to make your own ringbuffer in shared memory, no kernel involvement.
Posted Jun 14, 2024 17:12 UTC (Fri) by donald.buczek (subscriber, #112892):
I was just curious whether an unconventional use of the proposed ringbuffer API for userspace to userspace communication would work, assuming that whatever the kernel does with the ringbuffer for the chosen file type doesn't interfere or is taken into account. It's not sensible or needed, though, because that can be done right now and you only need help from the kernel for the non-spinning wait, e.g., via futex, which is available.
Regarding my first question, why a ring buffer needs to be associated with a file descriptor: do you expect that ring buffers will only be used as an alternative API to the same data channels that are accessible by reading and writing the file descriptor? Is that why there is no standalone ring buffer: any kernel feature that supports ring-buffer communication should also support conventional I/O on the same descriptor?
Posted Jul 5, 2024 16:58 UTC (Fri) by DanilaBerezin (guest, #168271):
XKCD Standards?
https://xkcd.com/927/
XKCD Standards?
The working title for this article was exactly that, but then I realized that we had already used it back in 2010.
fuse
Hey, I like how that works. I have a whole long list of other things to add to your dance card in the near future :)
Flags
> There is no flags argument in this version of the system call; that seems likely to change before any eventual merging into the mainline.
Before I even got to this line, I was thinking that if there's anything years of reading LWN has taught me, it's that system calls should have flags arguments, and that accepting "any other value" will cause pain in the future. So I'd guess that the "int rw" argument will become a flags argument.
Further reading on the wraparound pattern
https://www.snellman.net/blog/archive/2016-12-13-ring-buf...