System calls and 64-bit architectures
Adding a system call to the kernel is never done lightly. It is important to get it right before it gets merged because, once that happens, it must be maintained as part of the kernel's binary interface forever. The proposal to add preadv() and pwritev() system calls provides an excellent example of the kinds of concerns that need to be addressed when adding to the kernel ABI.
The two system calls themselves are quite straightforward. Essentially, they combine the existing pread() and readv() calls (along with the write variants, of course) into a way to do scatter/gather I/O at a particular offset in the file. As with pread(), the current file position is unaffected. The calls, which are available on various BSD systems, can be used to avoid races between an lseek() call and a read or write. Currently, applications must do some kind of locking to prevent multiple threads from stepping on each other when doing this kind of I/O.
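As a rough illustration of the behavior described above, here is a minimal sketch that assumes a C library providing preadv() with the BSD-style signature. The helper names read_record() and demo_preadv(), the record layout, and the temporary file path are invented for the example; they are not part of the proposed patches.

```c
#define _DEFAULT_SOURCE   /* assumption: asks glibc to declare preadv() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a 4-byte header and a 6-byte body that sit
 * back to back at `offset`, scattering them into two buffers in one call.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_record(int fd, off_t offset, char header[4], char body[6])
{
    assert(fd >= 0);
    struct iovec iov[2] = {
        { .iov_base = header, .iov_len = 4 },
        { .iov_base = body,   .iov_len = 6 },
    };
    return preadv(fd, iov, 2, offset);
}

/* Hypothetical demo: returns 0 if the read worked and the file position
 * was left untouched, -1 otherwise. */
int demo_preadv(void)
{
    char path[] = "/tmp/preadv-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                      /* file disappears once fd is closed */

    const char data[] = "0123HEADbody!!tail";
    size_t len = sizeof data - 1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    off_t before = lseek(fd, 0, SEEK_CUR);   /* position after the write */
    char header[4], body[6];
    ssize_t n = read_record(fd, 4, header, body);
    off_t after = lseek(fd, 0, SEEK_CUR);    /* should be unchanged */
    close(fd);

    int ok = n == 10
          && memcmp(header, "HEAD", 4) == 0
          && memcmp(body, "body!!", 6) == 0
          && before == after;
    return ok ? 0 : -1;
}
```

Because the offset travels with the call, two threads sharing the descriptor could each call read_record() without taking a lock around a separate lseek().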
The prototypes for the functions look much like readv/writev, simply adding the offset as the final parameter:
    ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
    ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);

But, because off_t is a 64-bit quantity, this causes problems on some architectures due to the way system call arguments are passed. After Gerd Hoffmann posted version 2 of the patchset, Matthew Wilcox was quick to point out a problem:
Several other architectures (ARM, PowerPC, s390, ...) have similar
constraints. Because the offset is the fourth argument, it gets placed in
the r3 and r4 32-bit registers, but some architectures need it in either
r2/r3 or r4/r5. This led some to advocate reordering the
parameters, putting the offset before iovcnt to avoid the
problem. As long as that change doesn't bubble out to user space, Hoffmann
is amenable to making the change:
"I'd *really* hate it to have the same system call with different
argument ordering on different systems though".
Most seemed to agree that the user-space interface as presented by glibc should match what the BSDs provide. It causes too many headaches for folks trying to write standards or portable code otherwise. To fix the alignment problem, the system call itself has the reordered version of the arguments. That led to Hoffmann's third version of the patchset, which still didn't solve the whole problem.
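The register arithmetic behind the reordering can be sketched with a toy model. Everything here is an assumption made for illustration: registers are numbered from r0, every argument is either 32 or 64 bits wide, and the function name assign_registers() is invented. Real ABIs differ in their register names and in how many arguments they pass in registers at all.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of passing arguments in 32-bit registers r0, r1, ...
 * widths[i] is 1 for a 32-bit argument and 2 for a 64-bit one.
 * If align_pairs is nonzero, a 64-bit argument must start in an
 * even-numbered register, so an odd register may be skipped.
 * first_reg[i] receives the first register used by argument i.
 * Returns the total number of registers consumed.
 */
int assign_registers(const int *widths, size_t nargs,
                     int align_pairs, int *first_reg)
{
    int next = 0;

    for (size_t i = 0; i < nargs; i++) {
        assert(widths[i] == 1 || widths[i] == 2);
        if (widths[i] == 2 && align_pairs && (next % 2) != 0)
            next++;              /* skip a register to reach an even one */
        first_reg[i] = next;
        next += widths[i];
    }
    return next;
}
```

In this model, the order (d, iov, iovcnt, offset) puts the offset in r3/r4 when no alignment is required, but pushes it to r4/r5 and uses six registers when pairs must be aligned. The order (d, iov, offset, iovcnt) puts the offset in r2/r3 and needs only five registers either way.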
There are multiple architectures that have both 32- and 64-bit versions, and
the 64-bit kernel must support system calls from 32-bit user-space
programs. Those programs will put 64-bit arguments into two registers,
but the 64-bit kernel will expect that argument in a single register.
Because of this, Arnd Bergmann recommended
splitting the offset into two arguments, one for the high 32 bits and
one for the low: "This is the only way I can see that lets us use a
shared compat_sys_preadv/pwritev across all 64 bit architectures".
When a 32-bit user-space program makes a system call on a 64-bit system, the compat_sys_* version is used to handle differences in the data sizes. If a pointer to a structure is passed to a system call, and that structure has a different representation in 32-bit mode than it does in 64-bit mode, the compat layer makes the translation. Because different 64-bit architectures do things differently in terms of calling conventions and alignment requirements, the only way to share compat code is to remove the 64-bit quantity from the system call interface entirely.
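The idea of splitting the offset can be sketched in a few lines. This is only an illustration of the arithmetic; the function names split_offset() and join_offset() are invented here and do not come from the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Caller side (hypothetical): split a 64-bit offset into two 32-bit
 * halves so that each half fits in a single 32-bit argument. */
void split_offset(uint64_t offset, uint32_t *lo, uint32_t *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = (uint32_t)(offset & 0xffffffffu);
    *hi = (uint32_t)(offset >> 32);
}

/* Kernel side (hypothetical): rebuild the 64-bit offset from the halves. */
uint64_t join_offset(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Since both halves are ordinary 32-bit arguments, no architecture's rules for 64-bit register pairs come into play, which is what makes a single shared compat entry point plausible.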
That just leaves one final problem to overcome: endianness. As Ralf Baechle notes, MIPS can be either little- or big-endian, so compat_sys_preadv/pwritev() need to put the two 32-bit offset values together in the proper order. He recommended moving the MIPS-specific merge_64() macro into a common compat.h include file, which could then be used by the common compat routines. So far, version 4 of the patchset has not emerged, but one suspects that the offset argument splitting and use of merge_64() will be part of it.
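To see why endianness matters, consider a convention in which a 64-bit value occupies two consecutive 32-bit registers in the same order as it would in memory. The sketch below models that assumption; merge_pair() and check_merge_pair() are invented names, and this is not the kernel's merge_64() macro, only an illustration of the kind of work such a macro has to do.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine two consecutive 32-bit registers into one
 * 64-bit value. Assumed convention: on a big-endian system the first
 * register holds the high half; on a little-endian system it holds the
 * low half.
 */
uint64_t merge_pair(uint32_t first, uint32_t second, int big_endian)
{
    uint32_t hi = big_endian ? first : second;
    uint32_t lo = big_endian ? second : first;

    return ((uint64_t)hi << 32) | lo;
}

/* A few sanity checks on the model above. */
void check_merge_pair(void)
{
    /* The same offset, as it would arrive under each convention. */
    assert(merge_pair(0x1u, 0x23456789u, 1) == 0x123456789ULL);
    assert(merge_pair(0x23456789u, 0x1u, 0) == 0x123456789ULL);
}
```

Getting the flag wrong swaps the halves, turning an offset just over 4GB into one far beyond the end of most files.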
The implementation of preadv() and pwritev() themselves is straightforward, certainly in comparison to the intricacies of passing their arguments. The VFS implementations of readv()/writev() already take an offset argument, so it was simply a matter of calling those. It is interesting to note that, as part of the review, Christoph Hellwig spotted a bug in the existing compat_sys_readv/writev() implementations which would lead to accounting information not being updated for those calls.
This is not the first time these system calls have been proposed; way back in 2005, we looked at some patches from Badari Pulavarty that added them. Other than a brief appearance in the -mm tree, they seem to have faded away. Even if this edition of preadv() and pwritev() does not make it into the mainline (so far there are no indications that it won't), the code review surrounding it was certainly useful. Getting a glimpse of the complexities around 64-bit quantities being passed to system calls was quite informative as well.
Posted Dec 18, 2008 5:22 UTC (Thu)
by kbob (guest, #1770)
[Link] (1 responses)
Posted Dec 18, 2008 5:58 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link]
Posted Dec 18, 2008 9:51 UTC (Thu)
by meuh (guest, #22042)
[Link] (2 responses)
Posted Dec 18, 2008 13:40 UTC (Thu)
by abatters (✭ supporter ✭, #6932)
[Link] (1 responses)
Posted Dec 21, 2008 22:51 UTC (Sun)
by jlokier (guest, #52227)
[Link]
Some folks asking for preadv/pwritev are actually doing so because they rejected Linux AIO for being too broken to use.
They prefer to use preadv/pwritev in user-space helper threads rather than Linux AIO because, at least with threads, it is always asynchronous.
Posted Dec 18, 2008 12:42 UTC (Thu)
by ballombe (subscriber, #9523)
[Link]
Posted Dec 18, 2008 14:56 UTC (Thu)
by liljencrantz (guest, #28458)
[Link] (2 responses)
Posted Dec 18, 2008 17:02 UTC (Thu)
by vonbrand (subscriber, #4458)
[Link] (1 responses)
Posted Dec 19, 2008 23:08 UTC (Fri)
by felixfix (subscriber, #242)
[Link]
Posted Dec 20, 2008 11:14 UTC (Sat)
by rwmj (subscriber, #5474)
[Link]
This is quite an interesting problem, and one we've also encountered
with virtualization. Paravirtualized guests are a bit like processes, and like
processes they can make hypercalls (which are a bit like system calls).
Where the complexity arises is that system administrators want to run
a mixture of 32 bit and 64 bit guests on a system (and on crazy
architectures like IA64, they can even run a mixture of big and little
endian guests). So there's a degree of complexity ensuring the
guests are all passing identical C structs to hypercalls, particularly
in the "32-on-64" case.
Rich.
Why not extend struct iovec with an offset field:
struct iovecs
{
    void *iovs_base;
    size_t iovs_len;
    off_t iovs_off;
};
ssize_t preadv(int d, const struct iovecs *iovs, int iovcnt);
ssize_t pwritev(int d, const struct iovecs *iovs, int iovcnt);
Bad things could happen if offsets go backward: a performance penalty or overlapping data. And this kind of interface is not the best with regard to error reporting.
So the kernel would have to enforce (iovs[n].iovs_off + iovs[n].iovs_len) < iovs[n+1].iovs_off
The problem is the leftover 65th bit ;-)