A generic ring buffer for the kernel
A ring buffer is simply a circular buffer, maintained in memory, that is shared between user space and the kernel. One side of a data stream writes data into the buffer, while the other consumes it; as long as the buffer neither overflows nor underflows, data can be transferred with no system calls at all. The addition of a ring buffer can, thus, enable highly efficient data transfer in situations where the data rates are relatively high. Overstreet thinks that other kernel subsystems could benefit from ring-buffer interfaces, and would like to make it possible to add those interfaces without reinventing the wheel.
The user-space interface
On the user-space side, with the proposed interface, a ring buffer is always associated with a file descriptor, so the first step will be to open the object (such as a file, a device, or a pipe) to which the ring buffer will be attached. Then, the ring buffer is established with the new system call:
int ringbuffer(unsigned int fd, int rw, u32 size, void **buffer);
This call will request the creation of a ring buffer associated with the open file descriptor fd; the rw argument should be set to zero for read access, or to any other value for write access. The requested size of the buffer is passed in size. On a successful return, the address of the buffer will be stored in buffer.
Note that it is up to the subsystem implementing the ring buffer to decide whether to respond to this call by creating a new buffer, or by mapping an existing buffer into the caller's address space. The intent in the latter case is to allow a single ring buffer to be shared between multiple processes.
There is no flags argument in this version of the system call; that seems likely to change before any eventual merging into the mainline.
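As a rough illustration only, creating a ring buffer from user space might look something like the sketch below; the system call has no number assigned yet, so __NR_ringbuffer is a placeholder, the device path is hypothetical, and error handling is minimal:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define __NR_ringbuffer 468	/* placeholder; no number has been assigned */

    int main(void)
    {
    	void *buffer;
    	int fd = open("/dev/ringbuffer-test", O_RDONLY);	/* hypothetical device */

    	if (fd < 0) {
    		perror("open");
    		return 1;
    	}
    	/* Request a 64KB ring buffer for reading from fd */
    	if (syscall(__NR_ringbuffer, fd, 0, (uint32_t)65536, &buffer) < 0) {
    		perror("ringbuffer");
    		return 1;
    	}
    	printf("ring buffer mapped at %p\n", buffer);
    	return 0;
    }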
The returned pointer will be aimed at a region of memory starting with a copy of this structure:
    struct ringbuffer_ptrs {
    	__u32	head;
    	__u32	tail;
    	__u32	size;		/* always a power of two */
    	__u32	mask;		/* size - 1 */
    	__u32	data_offset;
    };
The buffer itself will begin at data_offset from the returned pointer; in practice, that offset will likely always be one page. The actual size of the buffer will be stored in size; as noted, it will always be a power of two for reasons that will soon become clear. The head offset is the writing position, while tail is the read position. If head and tail are equal, the buffer is empty.
Both head and tail are incremented unconditionally when data is added or consumed; there are no tests for wrapping around. Instead, these values must be ANDed with the mask value to obtain an actual offset into the buffer. This technique eliminates the need for branches in the increment path, but it also means that a little care is needed when working with the head and tail values. Testing for emptiness is easy, since the two values will always be equal in this case. A proper test for fullness looks like:
if ((head - tail) >= size) /* buffer is full, can't write to it */
(This test has been simplified; a properly written test must use atomic operations to avoid memory-ordering problems; see this test program for an example).
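To make the index arithmetic concrete, here is a minimal, single-writer sketch built around the structure above; ring_write() is an invented helper rather than part of the proposed interface, and the plain loads and stores stand in for the atomic, properly ordered accesses that real code would need:

    #include <stddef.h>
    #include <stdint.h>

    struct ringbuffer_ptrs {
    	uint32_t head;		/* write position; increments without wrapping */
    	uint32_t tail;		/* read position; increments without wrapping */
    	uint32_t size;		/* always a power of two */
    	uint32_t mask;		/* size - 1 */
    	uint32_t data_offset;	/* offset of the data area from this struct */
    };

    /* Copy up to len bytes into the ring; returns the number actually written. */
    static size_t ring_write(struct ringbuffer_ptrs *rb, const void *src, size_t len)
    {
    	unsigned char *data = (unsigned char *)rb + rb->data_offset;
    	const unsigned char *in = src;
    	uint32_t head = rb->head, tail = rb->tail;
    	uint32_t space = rb->size - (head - tail);	/* free bytes */
    	size_t n = len < space ? len : space;

    	/* The mask, not a comparison, handles wraparound at the end of the buffer */
    	for (size_t i = 0; i < n; i++)
    		data[(head + i) & rb->mask] = in[i];

    	rb->head = head + n;	/* must be a release store in real code */
    	return n;
    }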
A process writing to a ring buffer can continue to feed data into it until the buffer fills, at which point it must stop. The writer could poll the tail value to detect when the reader has caught up and more space is available, but that would not be efficient. A similar situation applies to a reader that has emptied the buffer. Either way, the process can wait for the situation to change with:
int ringbuffer_wait(unsigned int fd, int rw);
Where fd is the file descriptor associated with the ring buffer, and rw indicates whether read or write access is needed. Either way, the calling process will block until the situation with the ring changes. Indicating such a change is the responsibility of user space as well; a writer that has put data into a previously empty buffer, or a reader that has consumed data from a previously full buffer should call:
int ringbuffer_wakeup(unsigned int fd, int rw);
In the cover letter, Overstreet suggests that ringbuffer_wait() and ringbuffer_wakeup() might eventually be replaced with futex operations. This patch also mentions the need to improve ringbuffer_wait() to allow for a timeout or to specify the minimum amount of data that must be written to (or read from) the buffer before a wakeup should happen.
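Putting those pieces together, a reader's main loop might look roughly like this sketch; drain_ring() is an invented helper, the ringbuffer_wait() and ringbuffer_wakeup() declarations stand in for syscall() invocations (no libc wrappers exist), the interpretation of the rw arguments is my reading of the description, and the atomic-access details are again glossed over:

    #include <stddef.h>
    #include <stdint.h>

    /* Layout as described earlier in the article */
    struct ringbuffer_ptrs { uint32_t head, tail, size, mask, data_offset; };

    /* Proposed system calls; placeholders for syscall() invocations */
    int ringbuffer_wait(unsigned int fd, int rw);
    int ringbuffer_wakeup(unsigned int fd, int rw);

    void drain_ring(int fd, struct ringbuffer_ptrs *rb,
    		void (*consume)(const void *buf, size_t len))
    {
    	unsigned char *data = (unsigned char *)rb + rb->data_offset;

    	for (;;) {
    		uint32_t head = rb->head;	/* acquire load in real code */
    		uint32_t tail = rb->tail;

    		if (head == tail) {
    			ringbuffer_wait(fd, 0);	/* empty: block until the writer adds data */
    			continue;
    		}
    		while (tail != head) {
    			consume(&data[tail & rb->mask], 1);
    			tail++;
    		}
    		rb->tail = tail;		/* release store in real code */
    		ringbuffer_wakeup(fd, 1);	/* the buffer may have been full; wake the writer */
    	}
    }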
The kernel side
A subsystem adds ring-buffer support by providing the new ringbuffer() function in its file_operations structure:
struct ringbuffer *(*ringbuffer)(struct file *file, int rw);
This function will be called in response to a ringbuffer() call from user space. There is a whole set of meticulously undocumented support functions available to drivers for working with ring buffers, including ringbuffer_alloc() to create one, ringbuffer_read_iter() and ringbuffer_write_iter() to move data out of or into a ring buffer, etc. This patch provides a test driver showing the basics of implementing a ring buffer in kernel space.
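For orientation, a minimal driver hooking into this mechanism might look vaguely like the following; only the file_operations method comes from the posted description, while the ringbuffer_alloc() signature, the locking shortcuts, and everything else here are assumptions:

    #include <linux/fs.h>
    #include <linux/module.h>

    static struct ringbuffer *mydev_rb;

    static struct ringbuffer *mydev_ringbuffer(struct file *file, int rw)
    {
    	/* Create the buffer on first use; later callers are handed the same
    	 * buffer, so multiple openers share one ring.  (Locking omitted.) */
    	if (!mydev_rb)
    		mydev_rb = ringbuffer_alloc(PAGE_SIZE);	/* assumed signature */
    	return mydev_rb;
    }

    static const struct file_operations mydev_fops = {
    	.owner		= THIS_MODULE,
    	.ringbuffer	= mydev_ringbuffer,
    	/* open(), read(), write(), etc. omitted */
    };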
Other than the test device, there are no in-kernel users submitted with this patch set. In the cover letter, Overstreet says that the first such user will be the filesystems in user space (FUSE) subsystem, but pipes and sockets may follow thereafter. The performance benefits of the ring-buffer approach are said to be significant: "reading/writing from the ringbuffer 16 bytes at a time is ~7x faster than using read/write syscalls".
As of this writing, there have been no responses to this submission. This work is in an early state and seems likely to need to evolve somewhat before it could be considered for merging; developers are also likely to want to see a real user in the kernel. Even more interesting might be a demonstration of how some of the existing kernel ring-buffer interfaces could have been implemented using this abstraction (not that they will change now). If this work achieves its goals, it might, in a highly optimistic view, be the last ring-buffer interface added to the kernel.
Index entries for this article
Kernel: System calls/ringbuffer()
Posted Jun 7, 2024 13:22 UTC (Fri) by rrolls (subscriber, #151126):
"int rw" (which appears in not just one, but all four function signatures mentioned in this article) really just contains a boolean value, so all of these should be converted to "int flags" and a constant to indicate r/w direction.
Posted Jun 6, 2024 19:58 UTC (Thu) by flussence (guest, #85566):
There's a link in there to an lkml thread from 2004, and it may be even older than that, given how it's mentioned in an offhand way there.
Posted Jun 7, 2024 8:23 UTC (Fri) by ppisa (subscriber, #67307):
Another problem is multiple readers and writers, because only hardware transactional-memory support allows the increment of ringbuffer_ptrs->head and the write of the data at the previous head location to be kept atomic with respect to other writer CPUs. If writing is limited to a single thread, then it is not a problem: the writer checks for space, writes the data, and, as the last step, increments head with a release-mode store.
Even more complicated is the case where multiple writers and variable-sized data are allowed, and the puzzle gets even more interesting when the space-allocation operation must be non-blocking if there is space, or, under a relaxed condition, may block, but only for the duration of the allocation and not during the data-fill operation, which could require access to other resources that can block or to memory that can be swapped out. Preventing such blocking is a must if the buffer is shared and used by tasks with different priorities, some of which fall into the real-time-critical domain.
I have tried to solve most of the above problems in my ul_cbuff implementation. The data area is even separated from the mapping of the buffer control structures, which can include the required mutexes and semaphores. The implementation allows, and has been tested with, multiple writers from multiple tasks; the data in the buffer become visible to readers only when all previously allocated entries have been released by the writers. Readers can be of two categories: for one, it is guaranteed that no data are lost (that is, the buffer is marked as full until the last such reader releases its data); the other is a snapshot maker, which is guaranteed to receive only complete, available messages from the head side but does not block subsequent writes.
The system has been used for logging in critical systems, such as a hospital infusion system based on the RTEMS RTOS (https://devel.rtems.org/wiki/TBR/UserApp/AMV_Technic_I), but the test code has exercised it even on GNU/Linux, in an environment with multiple memory address spaces/processes.
See the files
https://gitlab.com/pikron/sw-base/ulut/-/blob/master/ulut...
https://gitlab.com/pikron/sw-base/ulut/-/blob/master/ulut...
at the repository
https://gitlab.com/pikron/sw-base/ulut/-/tree/master/ulut
or
https://sourceforge.net/p/ulan/ulut/ci/master/tree/
I can even post the test code.
Posted Jun 7, 2024 8:42 UTC (Fri) by zev (subscriber, #88455):
And on a different note, would it be possible to obviate the need for an explicit ringbuffer_wakeup() operation by setting the page containing struct ringbuffer_ptrs read-only and trapping the store that updates it? I suppose the page-table updates and fault-handling involved might end up making it slower in the case where it's actually used, but not having to explicitly check for it in the (hopefully?) common case of not needing it is at least aesthetically appealing.
Posted Jun 7, 2024 17:35 UTC (Fri) by Tobu (subscriber, #24111):
This is addressed in a comment inside the struct (quoted from the lkml thread):

 * We use u32s because this type is shared between the kernel and
 * userspace - ulong/size_t won't work here, we might be 32bit userland
 * and 64 bit kernel, and u64 would be preferable (reduced probability
 * of ABA) but not all architectures can atomically read/write to a u64;
 * we need to avoid torn reads/writes.

Apparently, u32 is atomic on architectures where u64 isn't.
Posted Jun 8, 2024 22:32 UTC (Sat) by koverstreet (✭ supporter ✭, #4296):
32 bits is enough; ringbuffers generally aren't supposed to be huge, they're a transport mechanism. We could add 64 bit ringbuffers, but there'd be no reason to use them most of the time.
Posted Jun 7, 2024 18:00 UTC (Fri) by abatters (✭ supporter ✭, #6932):
https://www.circlemud.org/jelson/software/emlog/
Although the site does not appear responsive at the moment.
Posted Jun 7, 2024 18:59 UTC (Fri) by koverstreet (✭ supporter ✭, #4296):
that'd be interesting
Posted Jun 7, 2024 19:42 UTC (Fri) by mathstuf (subscriber, #69389):
What's the rationale for using `unsigned` here? Is it possible to get a valid fd above 2³¹? AFAIK, all fd-creating syscalls return `int`. Why make invalid fds detectable behind a sign-change cast (or pretend that high-bit-set fd values are valid)?
Posted Jun 11, 2024 18:46 UTC (Tue) by donald.buczek (subscriber, #112892):
Why are these ringbuffers implemented as attachments to existing file descriptors of arbitrary types and not as standalone objects which could, for example, be referenced by dedicated file descriptors of a new pseudo-filesystem? That way, epoll and friends could be used for the wait operations. I fail to see why there must be a one-to-one relation between a ringbuffer and an existing file descriptor. Is the only expected application a read/write substitute?
Another question: the original motivation seems to be communication between user space and kernel space, but is the implementation now also a proper tool for user-space-only communication between threads? Maybe this second question clarifies the first one: where would I get my no-op-type file descriptor whose only purpose would be to hold a ring buffer?
Posted Jun 13, 2024 13:42 UTC (Thu) by koverstreet (✭ supporter ✭, #4296):
So once we switch to futex operations for wait/wake, you'll be able to make your own ringbuffer in shared memory, no kernel involvement.
Posted Jun 14, 2024 17:12 UTC (Fri) by donald.buczek (subscriber, #112892):
I was just curious whether an unconventional use of the proposed ringbuffer API for userspace to userspace communication would work, assuming that whatever the kernel does with the ringbuffer for the chosen file type doesn't interfere or is taken into account. It's not sensible or needed, though, because that can be done right now and you only need help from the kernel for the non-spinning wait, e.g., via futex, which is available.
Regarding my first question, why a ring buffer needs to be associated with a file descriptor: do you expect that ring buffers will only be used as an alternative API to the same data channels that are accessible by reading and writing the file descriptor? Is that why there is no standalone ring buffer: any kernel feature that supports ring-buffer communication should also support conventional I/O on the same descriptor?
Posted Jul 5, 2024 16:58 UTC (Fri) by DanilaBerezin (guest, #168271):
XKCD Standards?
https://xkcd.com/927/
XKCD Standards?
The working title for this article was exactly that, but then I realized that we had already used it back in 2010.
fuse
Hey, I like how that works. I have a whole long list of other things to add to your dance card in the near future :)
Flags
> There is no flags argument in this version of the system call; that seems likely to change before any eventual merging into the mainline.
Before I even got to this line, I was thinking that if there's anything years of reading LWN has taught me, it's that system calls should have flags arguments, and that accepting "any other value" will cause pain in the future. So I'd guess that the "int rw" argument will become a flags argument.
Further reading on the wraparound pattern
https://www.snellman.net/blog/archive/2016-12-13-ring-buf...