System calls and 64-bit architectures

By Jake Edge
December 17, 2008

Adding a system call to the kernel is never done lightly. It is important to get it right before it gets merged because, once that happens, it must be maintained as part of the kernel's binary interface forever. The proposal to add preadv() and pwritev() system calls provides an excellent example of the kinds of concerns that need to be addressed when adding to the kernel ABI.

The two system calls themselves are quite straightforward. Essentially, they combine the existing pread() and readv() calls (along with the write variants, of course) into a way to do scatter/gather I/O at a particular offset in the file. As with pread(), the current file position is unaffected. The calls, which are available on various BSD systems, can be used to avoid races between an lseek() call and a read or write. Currently, applications must do some kind of locking to prevent multiple threads from stepping on each other when doing this kind of I/O.
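The locking that applications need today can be sketched roughly as follows. This is only an illustration: the function name locked_readv_at() and the single process-wide mutex are assumptions made for the example, not something taken from the patches.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical lock shared by every thread doing positioned I/O. */
static pthread_mutex_t pos_lock = PTHREAD_MUTEX_INITIALIZER;

/* Emulate a positioned scatter read with lseek() followed by readv().
 * The mutex keeps another thread from moving the file position between
 * the two calls. Unlike pread(), this does move the file position. */
ssize_t locked_readv_at(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset)
{
    ssize_t n = -1;

    pthread_mutex_lock(&pos_lock);
    if (lseek(fd, offset, SEEK_SET) != (off_t)-1)
        n = readv(fd, iov, iovcnt);
    pthread_mutex_unlock(&pos_lock);

    return n;   /* bytes read, or -1 with errno set */
}
```

The lock only helps if every thread that touches the descriptor uses the same wrapper, which is part of why a single atomic call is attractive.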

The prototypes for the functions look much like readv/writev, simply adding the offset as the final parameter:

    ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
    ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
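As a rough illustration of how a caller might use this interface, here is a minimal sketch that assumes a C library providing preadv() with the prototype above (the BSDs do, and later glibc releases do as well). The helper name read_record_at() and the header/body layout are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* asks glibc to declare preadv() */
#endif
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: scatter one record stored at `offset` into a
 * header buffer and a body buffer with a single call. The file position
 * of fd is left untouched, so no lseek() and no locking are needed. */
ssize_t read_record_at(int fd, void *hdr, size_t hdr_len,
                       void *body, size_t body_len, off_t offset)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,  .iov_len = hdr_len  },
        { .iov_base = body, .iov_len = body_len },
    };

    /* Returns the number of bytes read, or -1 with errno set. */
    return preadv(fd, iov, 2, offset);
}
```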

But, because off_t is a 64-bit quantity, this causes problems on some architectures due to the way system call arguments are passed. After Gerd Hoffmann posted version 2 of the patchset, Matthew Wilcox was quick to point out a problem:

"Are these prototypes required? MIPS and PARISC will need wrappers to fix them if they are. These two architectures have an ABI which requires 64-bit arguments to be passed in aligned pairs of registers, but glibc doesn't know that (and given the existence of syscall(3), can't do much about it even if it knew), so some of the arguments end up in the wrong registers."

Several other architectures (ARM, PowerPC, s390, ...) have similar constraints. Because the offset is the fourth argument, it gets placed in the r3 and r4 32-bit registers, but some architectures need it in either r2/r3 or r4/r5. This led some to advocate reordering the parameters, putting the offset before iovcnt to avoid the problem. As long as that change doesn't bubble out to user space, Hoffmann is amenable to making the change: "I'd *really* hate it to have the same system call with different argument ordering on different systems though".
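The register arithmetic can be made concrete with a small model. This is a simplified sketch, not any real ABI: it assumes registers numbered from r0, one 32-bit word per register, and a hypothetical function assign_registers() that places arguments in order.

```c
/* Model of a 32-bit calling convention. Each argument occupies `words`
 * consecutive 32-bit registers (1 or 2). If align_pairs is nonzero, a
 * two-word argument must start in an even-numbered register. The first
 * register of each argument is stored in first_reg[]; the return value
 * is the number of registers consumed. */
int assign_registers(const int *words, int nargs, int align_pairs,
                     int *first_reg)
{
    int next = 0;

    for (int i = 0; i < nargs; i++) {
        if (align_pairs && words[i] == 2 && (next % 2) != 0)
            next++;              /* skip one register to reach an even one */
        first_reg[i] = next;
        next += words[i];
    }
    return next;
}
```

With the original order (fd, iov, iovcnt, offset), described as words {1, 1, 1, 2}, the model puts the offset in r3/r4 when no alignment is required and in r4/r5 when it is. With the reordered (fd, iov, offset, iovcnt), words {1, 1, 2, 1}, the offset lands in r2/r3 either way, so no register is wasted and both sides agree.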

Most seemed to agree that the user-space interface as presented by glibc should match what the BSDs provide. It causes too many headaches for folks trying to write standards or portable code otherwise. To fix the alignment problem, the system call itself has the reordered version of the arguments. That led to Hoffmann's third version of the patchset, which still didn't solve the whole problem.

There are multiple architectures that have both 32-bit and 64-bit versions, and the 64-bit kernel must support system calls from 32-bit user-space programs. Those programs will put 64-bit arguments into two registers, but the 64-bit kernel will expect that argument in a single register. Because of this, Arnd Bergmann recommended splitting the offset into two arguments, one for the high 32 bits and one for the low: "This is the only way I can see that lets us use a shared compat_sys_preadv/pwritev across all 64 bit architectures".
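The split itself is simple arithmetic. The following sketch uses hypothetical helper names (offset_high(), offset_low(), offset_join()) and is not code from the patchset; it only shows that two explicitly named 32-bit halves carry the same information as the 64-bit offset.

```c
#include <stdint.h>

/* Caller side: split a 64-bit offset into two 32-bit arguments. */
uint32_t offset_high(uint64_t off) { return (uint32_t)(off >> 32); }
uint32_t offset_low(uint64_t off)  { return (uint32_t)off; }

/* Receiving side: rebuild the offset from the two halves. Because each
 * half is a separate, explicitly named argument, no register-pair
 * convention is involved. */
uint64_t offset_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}
```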

When a 32-bit user-space program makes a system call on a 64-bit system, the compat_sys_* version is used to handle differences in the data sizes. If a pointer to a structure is passed to a system call, and that structure has a different representation in 32-bit mode than it does in 64-bit mode, the compat layer makes the translation. Because different 64-bit architectures do things differently in terms of calling conventions and alignment requirements, the only way to share compat code is to remove the 64-bit quantity from the system call interface entirely.

That just leaves one final problem to overcome: endianness. As Ralf Baechle notes, MIPS can be either little or big-endian, so the compat_sys_preadv/pwritev() routines need to put the two 32-bit offset values together in the proper way. He recommended moving the MIPS-specific merge_64() macro into a common compat.h include file, which could then be used by the common compat routines. So far, version 4 of the patchset has not emerged, but one suspects that the offset argument splitting and use of merge_64() will be part of it.
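To see why byte order matters, consider a sketch of what a merge_64()-style helper has to do. The function below is a hypothetical stand-in, not the kernel macro. It assumes the common convention in which a 64-bit value passed in a register pair has its high word in the first register on big-endian systems and its low word in the first register on little-endian ones.

```c
#include <stdint.h>

/* Hypothetical stand-in for a merge_64()-style helper. `first` and
 * `second` are the two 32-bit registers in argument order; big_endian
 * selects which of them holds the high half (an assumed convention). */
uint64_t merge_pair(uint32_t first, uint32_t second, int big_endian)
{
    if (big_endian)
        return ((uint64_t)first << 32) | second;   /* high word first */
    return ((uint64_t)second << 32) | first;       /* low word first */
}
```

The same offset therefore arrives as (high, low) on one kind of system and as (low, high) on the other, and shared compat code has to pick the right order.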

The implementation of preadv() and pwritev() themselves is straightforward, certainly in comparison to the intricacies of passing their arguments. The VFS implementations of readv()/writev() already take an offset argument, so it was simply a matter of calling those. It is interesting to note that as part of the review, Christoph Hellwig spotted a bug in the existing compat_sys_readv/writev() implementations which would lead to accounting information not being updated for those calls.

This is not the first time these system calls have been proposed; way back in 2005, we looked at some patches from Badari Pulavarty that added them. Other than a brief appearance in the -mm tree, they seem to have faded away. Even if this edition of preadv() and pwritev() does not make it into the mainline (so far there are no indications that it won't), the code review surrounding it was certainly useful. Getting a glimpse of the complexities around 64-bit quantities being passed to system calls was quite informative as well.


Posted Dec 18, 2008 5:22 UTC (Thu) by kbob (guest, #1770)

This doesn't seem like a kernel issue. If gcc can't call a function whose fourth argument is an int64_t, then either gcc or that platform's ABI is broken.

Posted Dec 18, 2008 5:58 UTC (Thu) by quotemstr (subscriber, #45331)

syscall calling convention is not the platform's stdcall.

Posted Dec 18, 2008 9:51 UTC (Thu) by meuh (guest, #22042)

Why not extend struct iovec with an offset field:

    struct iovecs
    {
      void  *iovs_base;
      size_t iovs_len;
      off_t  iovs_off;
    };

    ssize_t preadv(int d, const struct iovecs *iovs, int iovcnt);
    ssize_t pwritev(int d, const struct iovecs *iovs, int iovcnt);

Bad things could happen if offsets go backward: a performance penalty or overlapping data. And this kind of interface is not the best with regard to error reporting. So the kernel would have to enforce (iovs[n].iovs_off + iovs[n].iovs_len) < iovs[n+1].iovs_off

Posted Dec 18, 2008 13:40 UTC (Thu) by abatters (✭ supporter ✭, #6932)

While you are at it, put a direction flag in each vector so that you can submit reads and writes at the same time. And put the file descriptor in each vector too so that you can submit I/O to different files with one syscall. Then make it asynchronous and add in an optional signal upon completion. Oh wait, io_submit() already does all that.

Posted Dec 21, 2008 22:51 UTC (Sun) by jlokier (guest, #52227)

Except that Linux AIO (io_submit) isn't always asynchronous, and you can't easily tell when it will block the caller.

Some folks asking for preadv/pwritev are actually doing so because they rejected Linux AIO for being too broken to use.

They prefer to use preadv/pwritev in user-space helper threads rather than Linux AIO, because at least with threads it is always asynchronous.

Posted Dec 18, 2008 12:42 UTC (Thu) by ballombe (subscriber, #9523)

The article title seems a bit of a misnomer. The issue is with 64-bit syscall arguments on 32-bit architectures (or emulation of 32-bit architectures), rather than an issue with 64-bit architectures.

Posted Dec 18, 2008 14:56 UTC (Thu) by liljencrantz (guest, #28458)

This may be a silly question, but what is the problem with simply fixing glibc to send 65-bit data properly aligned?

Posted Dec 18, 2008 17:02 UTC (Thu) by vonbrand (subscriber, #4458)

The problem is the leftover 65th bit ;-)

Posted Dec 19, 2008 23:08 UTC (Fri) by felixfix (subscriber, #242)

That must be the security bit for packets of which I have heard mention now and then ... sort of like Perl's taint flag, maybe.

Posted Dec 20, 2008 11:14 UTC (Sat) by rwmj (subscriber, #5474)

This is quite an interesting problem, and one we've also encountered with virtualization. Paravirtualized guests are a bit like processes, and like processes they can make hypercalls (which are a bit like system calls).

The complexity arises because system administrators want to run a mixture of 32-bit and 64-bit guests on a system (and on crazy architectures like IA64, they can even run a mixture of big and little endian guests). So there's a degree of complexity in ensuring the guests are all passing identical C structs to hypercalls, particularly in the "32-on-64" case.

Rich.