
Improving iov_iter

By Jake Edge
June 10, 2025

LSFMM+BPF

The iov_iter interface is used to describe and iterate through buffers in the kernel. David Howells led a combined storage and filesystem session at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF) to discuss ways to improve iov_iter. His topic proposal listed a few different ideas including replacing some iov_iter types and possibly allowing mixed types in chains of iov_iter entries; he would like to make the interface itself and the uses of iov_iter in the kernel better.

Howells began with an overview. An iov_iter is a stateful description of a buffer, which can be used for I/O; it stores a position within the buffer that can be moved around. There is a set of operations that is part of the API, which includes copying data into or out of the buffer, getting a list of the pages that are part of the buffer, and getting its length. There are multiple types of iov_iter. The initial ones were for user-space buffers, with ITER_IOVEC for the arguments to readv() and writev() and ITER_UBUF for a special case where the number of iovec entries (iovcnt) is one.
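To make the user-space side concrete, here is a minimal sketch (ordinary userspace code, not kernel code) of the iovec array that readv() takes; the kernel wraps such an array in an ITER_IOVEC, or in ITER_UBUF when there is only a single segment. The helper name is made up for illustration:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Scatter one read across two separate buffers with a single system
 * call.  The kernel describes the iovec array below with an ITER_IOVEC;
 * a single buffer (iovcnt == 1) is the ITER_UBUF special case. */
static ssize_t read_into_two(int fd, char *a, size_t alen,
                             char *b, size_t blen)
{
	struct iovec iov[2] = {
		{ .iov_base = a, .iov_len = alen },
		{ .iov_base = b, .iov_len = blen },
	};
	return readv(fd, iov, 2);
}
```

The kernel-side iov_iter adds the statefulness Howells described: a position that advances through the segments as data is copied.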

There are also three iov_iter types for describing page fragments: ITER_BVEC, which is a list of (page, offset, length) tuples; ITER_FOLIOQ, which describes folios and is used by filesystems; and ITER_XARRAY, which is deprecated and describes pages stored in an XArray. The problem with ITER_XARRAY is that it requires taking the read-copy-update (RCU) read lock inside iteration operations, which means there are places where it cannot be used, he said. An ITER_KVEC is a list of kernel virtual address ranges, such as regions allocated with kmalloc(). Finally, the ITER_DISCARD type is used to simply discard the next N bytes without doing any copying, for example on a socket.
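The page-fragment types all reduce to (page, offset, length) tuples. The following userspace sketch (hypothetical names, modeled loosely on the kernel's struct bio_vec) shows how an iterator can compute the total remaining length of such a list, in the spirit of the kernel's iov_iter_count():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of an ITER_BVEC-style segment list:
 * each entry names a page (opaque here), a byte offset into it, and
 * a length.  Not an actual kernel structure. */
struct frag {
	void   *page;    /* stand-in for struct page * */
	size_t  offset;  /* byte offset within the page */
	size_t  len;     /* number of bytes in this fragment */
};

/* Sum the lengths of the remaining fragments, as iov_iter_count()
 * does for a real iterator. */
static size_t frag_total(const struct frag *v, size_t n)
{
	size_t total = 0;
	for (size_t i = 0; i < n; i++)
		total += v[i].len;
	return total;
}
```

Note that nothing in the list says where the pages came from or what their lifetime rules are, which is the root of the reference-taking problem discussed below.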

[David Howells]

One of the big problems with iov_iter, and buffer handling in general, is that as buffers are passed down into lower layers, those layers want to take page references on the buffer's pages. There is a pervasive view that all buffers have pages that references can be taken on, but that is no longer true in the folio world. There are also different lifetime rules for different kinds of memory that might be used in an iov_iter; pages might be pinned via get_user_pages(), there is slab memory (from kmalloc()), vmalloc() and vmap() memory, as well as device memory and other memory types, all of which have their own lifetimes. For example, user space could allocate GPU memory and do a direct read or write to it, which mixes several types. The bottom line is that a function that receives a buffer should not assume that it can take page references on it.

Beyond that, an array of pages may contain mixed types, Howells said. That means that cleaning up should not be done at the lower layers. Cleanup should instead be the responsibility of the caller.

A filesystem that does direct I/O will use an iov_iter to pass its buffers to a lower layer, but that layer does not know what that memory is. It is "a random set of user addresses and you don't know that you can pin them". In addition, readahead and writeback do not know how many pages or folios there are in an iov_iter that references the page cache; those operations have to iterate through the list to count them. Things are even worse if writeback_iter() is used, he said: it needs to traverse the page-cache pages once to flip their dirty bits, again to create an ITER_BVEC iov_iter, and a third time to copy the data there.

Christoph Hellwig said that he did not really follow the problem for writeback as described, which may be because he comes from a block-layer perspective. Howells, Hellwig, and Matthew Wilcox had a rapid-fire discussion about the problems reported; Howells said that he is encountering the problems with network filesystems. Both Hellwig and Wilcox suggested that Howells was trying to optimize for a corner case, which is something that should be avoided; if the code works correctly, it can be slow for cases that rarely happen.

Howells then turned to the crypto API, which uses scatter-gather lists; he would like to switch that to use iov_iter. Wilcox said that was a good idea, since kernel developers want to get rid of scatter-gather lists. Howells's idea is to add a temporary ITER_SCATTERLIST type for iov_iter as a bridge to convert crypto drivers. Hellwig strongly recommended avoiding that approach, saying that previous experience shows that other developers sometimes start using a transitional feature, which makes it hard to remove it down the road. He was concerned that direct-rendering-manager (DRM) or dma-buf developers would start using it; "I don't want to give them that rope to hang themselves."

Duplicating the crypto APIs using iov_iter and slowly converting all of the crypto pieces to use the new ones was a better approach, Hellwig said. It is only needed for parts of the crypto layer that are implementing the asynchronous APIs, "which is actually not that much". Howells disagreed, saying there were lots of places in the crypto subsystem that needed the changes and that not all of it was in C code. Hellwig said that the assembly code operated at a lower level so it was not really a concern; he said he could lend a hand to help with the conversion.

Howells and Hellwig went back and forth about problems that Howells is trying to solve in the interaction between various subsystems, including networking, crypto, block, and memory management, which have led to all of the different ITER_* types and to some developers wanting (or needing) to add more. Hellwig said that the underlying problem is that the various subsystems cannot agree on a single common way to "describe chunks of physical memory", because "in the end, that's what all of the kernel operates on". Most of that is RAM, but there are other types as well. Without a kernel-wide agreement on what that description should be, there will be a need to convert between all of the different representations.

Most people seem to think that representation should be pairs of physical addresses and lengths, perhaps with a flag, he said. That is not quite what a bio_vec is yet, but that is "the structure we think we can turn into that soon-ish". Then there will be a need to get all of the subsystems to use that; in some cases, it may make sense to "sugarcoat that in an iov_iter", but most of the low-level code should be operating on bio_vec (or whatever the name ends up being) objects.
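A sketch of what Hellwig's "pairs of physical addresses and lengths, perhaps with a flag" might look like; the name phys_vec, the fields, and the merge helper are all assumptions for illustration, since the actual structure is still being discussed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical physical-address-and-length representation; not an
 * actual kernel structure. */
struct phys_vec {
	uint64_t addr;   /* physical address of the chunk */
	uint32_t len;    /* length in bytes */
	uint32_t flags;  /* e.g. lifetime/ownership information */
};

/* Two chunks can be coalesced when they are physically contiguous
 * and share the same flags -- the property that lets lower layers
 * merge DMA segments without knowing where the memory came from. */
static bool phys_vec_mergeable(const struct phys_vec *a,
                               const struct phys_vec *b)
{
	return a->addr + (uint64_t)a->len == b->addr &&
	       a->flags == b->flags;
}
```

One appeal of such a representation is exactly what the merge helper shows: the lower layers need only contiguity and flags, not knowledge of pages, folios, or slab objects.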

Howells did not really see the path to getting to that point and wanted to talk about less-long-term solutions. He and Wilcox went back and forth some without seeming to make any real progress in understanding each other. Along the way, it became clear that there is some unhappiness because it seems like the networking-subsystem developers are unwilling to work with other parts of the kernel to solve these big-picture problems; it was unclear where things go from here—at least to me.


Index entries for this article
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



address representation and KASAN

Posted Jun 12, 2025 13:12 UTC (Thu) by TheJH (subscriber, #101155) [Link]

> Most people seem to think that representation should be pairs of physical addresses and lengths, perhaps with a flag, he said.

For what it's worth, from the perspective of address-tagging-based KASAN and the (not yet upstream) SLUB-virtual security mitigation, it would be nice if the kernel (in particular on 64-bit) stayed in the kernel-virtual-address representation as much as possible and only used physical addresses when actually initiating DMA operations or such.

For example, some configurations of KASAN rely on tag bits in the upper part of an address being preserved for detection of memory safety bugs.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds