mknodat() and friends
Ulrich Drepper, the maintainer of glibc, isn't just trying to add a system call; his proposal creates eleven of them. They are all variants on current file operations:
int mknodat(int dfd, const char *pathname, mode_t mode, dev_t dev);
int mkdirat(int dfd, const char *pathname, mode_t mode);
int unlinkat(int dfd, const char *pathname);
int symlinkat(const char *oldname, int newdfd, const char *newname);
int linkat(int olddfd, const char *oldname,
           int newdfd, const char *newname);
int renameat(int olddfd, const char *oldname,
             int newdfd, const char *newname);
int utimesat(int dfd, const char *filename, struct timeval *tvp);
int chownat(int dfd, const char *path, uid_t owner, gid_t group);
int openat(int dfd, const char *filename, int flags, int mode);
int newfstatat(int dfd, char *filename, struct stat *buf, int flag);
int readlinkat(int dfd, const char *pathname, char *buf, int size);
The pattern should be clear by now: each new system call extends an existing one by adding one or more "dfd" (default file descriptor) arguments. In each case, the new argument indicates a directory which is used instead of the current working directory when relative path names are provided. These calls can help applications work their way through directory trees in a race-free manner, and are also useful for implementing a virtual per-thread working directory.
There was a minor comment on the implementation: Ulrich had wanted to avoid changing an exported function, but such changes are always fair game. Beyond that, there seems to be little resistance to adding these system calls. Expect them in a future kernel.
pselect() and ppoll()
David Woodhouse, meanwhile, has been circulating a patch implementing the pselect() and ppoll() system calls. These calls each take a signal mask; that mask will be applied while the calling process waits for events, with the previous mask being restored on return. There is an emulated version of these calls in glibc now, but a truly robust implementation requires kernel support. As with most things involving signals, the new code gets somewhat complex in places. The end result, however, should be a pair of straightforward system calls which allow a process to apply a different signal mask while waiting for I/O.
unshare()
The unshare() patch by Janak Desai was first covered here last May. It allows a process to disconnect from resources which are shared with others. The target application is per-user namespaces; implementing these requires the ability to detach from the global namespace normally shared by all processes on the system. The current version of this patch implements namespace unsharing, but it also allows a process to privatize its view of virtual memory and open files.
This patch has been through a fair amount of review, and has seen a number of improvements from that process. Andrew Morton's reaction to a request to include the patch in -mm suggests that there is some work yet to be done, though. Andrew wants to see a better justification for the patch; he is also concerned about the security implications of adding a relatively obscure bit of code. The end result is that Janak still has some homework to do before this patch will make it into the kernel.
preadv() and pwritev()
The kernel currently supports the pread() and pwrite() system calls; these behave like read() and write(), with the exception that they take an explicit offset in the file. They will perform the operation at the given offset regardless of whether the "current" offset in the file has been changed by another thread, and they do not change the current offset as seen by any thread. Also supported are readv() and writev(), which perform scatter/gather I/O from the current file offset. The kernel does not have, however, any system call which combines these two modes of operation.
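Without a combined call, an application has to approximate a positioned scatter read in user space. The sketch below does so with one pread() loop per buffer; preadv_emulated() and demo_preadv() are hypothetical names. Unlike a real system call, the emulation is not atomic with respect to other writers.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for a positioned vectored read: fills
 * each buffer in vec in turn, starting at file offset pos.
 * Returns the total number of bytes read, or -1 on error. */
ssize_t preadv_emulated(int fd, const struct iovec *vec, int vlen, off_t pos)
{
    ssize_t total = 0;

    assert(vec != NULL && vlen >= 0);
    for (int i = 0; i < vlen; i++) {
        size_t done = 0;

        while (done < vec[i].iov_len) {
            ssize_t n = pread(fd, (char *)vec[i].iov_base + done,
                              vec[i].iov_len - done, pos);
            if (n < 0)
                return -1;
            if (n == 0)             /* end of file */
                return total;
            done += (size_t)n;
            pos += n;
            total += n;
        }
    }
    return total;
}

/* Demonstration: returns 0 if the scattered read saw the expected bytes
 * and left the descriptor's current offset untouched. */
int demo_preadv(const char *path)
{
    char head[3], tail[4];
    struct iovec vec[2] = {
        { .iov_base = head, .iov_len = sizeof(head) },
        { .iov_base = tail, .iov_len = sizeof(tail) },
    };
    off_t before;
    int ok, fd;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    ok = write(fd, "abcdefghij", 10) == 10;
    before = lseek(fd, 0, SEEK_CUR);    /* current offset after the write */
    ok = ok && preadv_emulated(fd, vec, 2, 2) == 7;
    ok = ok && head[0] == 'c' && head[2] == 'e';
    ok = ok && tail[0] == 'f' && tail[3] == 'i';
    ok = ok && lseek(fd, 0, SEEK_CUR) == before;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

The loop costs one system call per buffer (more on short reads), which is the overhead a single combined call would avoid.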
It turns out that there are developers who wish they had system calls along the lines of:
int preadv(unsigned int fd, struct iovec *vec, unsigned long vlen,
           loff_t pos);
int pwritev(unsigned int fd, struct iovec *vec, unsigned long vlen,
            loff_t pos);
To satisfy this need, Badari Pulavarty has created a simple implementation which is currently part of the -mm tree. It seems that Ulrich Drepper suggested an alternative to adding two new system calls, however: change the iovec structure instead. Badari ran with that idea, posting a new patch creating a new iovec type:
struct niovec
{
	void __user *iov_base;
	__kernel_size_t iov_len;
	__kernel_loff_t iov_off; /* NEW */
};
The new iov_off field is more flexible than plain preadv() in that it enables each segment in the I/O operation to have its own offset. The only downside is that the prototypes for the readv() and writev() methods in the file_operations structure must be changed, so every driver and filesystem which implements readv() and writev() breaks and must be updated. There are fewer of those than one might expect, but it is still a significant change.
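To make the per-segment idea concrete, here is a user-space sketch of the write side, built on pwrite(). The struct seg type is only an assumed analogue of the proposed niovec (its name and layout are invented for this example), and pwrite_segments() and demo_segments() are hypothetical helpers, not part of any posted patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed user-space analogue of niovec: a buffer plus its own offset. */
struct seg {
    const void *base;
    size_t len;
    off_t off;
};

/* Hypothetical helper: write every segment at its own file offset.
 * Returns the total number of bytes written, or -1 on error. */
ssize_t pwrite_segments(int fd, const struct seg *segs, int nsegs)
{
    ssize_t total = 0;

    assert(segs != NULL && nsegs >= 0);
    for (int i = 0; i < nsegs; i++) {
        size_t done = 0;

        while (done < segs[i].len) {
            ssize_t n = pwrite(fd, (const char *)segs[i].base + done,
                               segs[i].len - done,
                               segs[i].off + (off_t)done);
            if (n < 0)
                return -1;
            if (n == 0)             /* no progress; give up */
                return total;
            done += (size_t)n;
            total += n;
        }
    }
    return total;
}

/* Demonstration: two segments written out of order leave a zero-filled
 * gap between them. Returns 0 when the file looks as expected. */
int demo_segments(const char *path)
{
    const struct seg segs[2] = {
        { .base = "AAAA", .len = 4, .off = 6 },
        { .base = "bb",   .len = 2, .off = 0 },
    };
    char buf[10];
    int ok, fd;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    ok = pwrite_segments(fd, segs, 2) == 6;
    ok = ok && pread(fd, buf, sizeof(buf), 0) == 10;
    ok = ok && buf[0] == 'b' && buf[1] == 'b';
    ok = ok && buf[2] == '\0' && buf[5] == '\0';
    ok = ok && buf[6] == 'A' && buf[9] == 'A';
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

An in-kernel version would presumably handle all segments in one call; this loop only shows what the extra offset field expresses.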
It was suggested that the asynchronous I/O operations could be used instead. The AIO interface already allows for the creation of vectored operations with per-segment offsets. The downsides are that AIO is more complicated to use in user space and heavier in the kernel; incidentally, AIO support in the kernel was never completed to the point where it will support these operations anyway. Still, that is an option which may need more consideration before changing one of the fundamental interfaces used by filesystems and drivers.
splice()
Finally, there has been talk over many years of creating a splice() system call. The core idea is that a process could open a file descriptor for a data source, and another for a data sink. Then, with a call to splice(), those two streams could be connected to each other, and the data could flow from the source to the sink entirely within the kernel, with no need for user-space involvement and with minimal (or no) copying.
Some of the infrastructure was put in place one year ago when Linus created a circular pipe buffer mechanism. Now Jens Axboe has put together a simple splice() implementation which uses that mechanism. The patch is not ready for prime time yet (Jens: "I'm just posting this in the spirit of posting early"), but it is a beginning. In particular, it allows a file to be spliced to a pipe, as either the source or the sink. With a pair of splices, it is possible to set up an in-kernel file copy operation with no internal memory copying.
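The "pair of splices" idea can be sketched as below. This assumes the splice() interface that later became available on Linux through glibc (declared in <fcntl.h> under _GNU_SOURCE), which may well differ from the early patch discussed here. The helpers splice_copy() and demo_splice() are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy from in_fd to out_fd until end of file,
 * going file -> pipe -> file with two splice() calls per chunk.
 * Returns the number of bytes copied, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd)
{
    ssize_t total = 0;
    int pipefd[2];

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(pipefd) < 0)
        return -1;

    for (;;) {
        /* First splice: the file is the source, the pipe is the sink. */
        ssize_t n = splice(in_fd, NULL, pipefd[1], NULL, 65536,
                           SPLICE_F_MOVE);
        if (n < 0)
            total = -1;
        if (n <= 0)
            break;

        /* Second splice: drain the pipe into the destination file. */
        while (n > 0) {
            ssize_t m = splice(pipefd[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                goto out;
            }
            n -= m;
            total += m;
        }
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}

/* Demonstration: returns 0 if dst ends up with the contents of src. */
int demo_splice(const char *src, const char *dst)
{
    char buf[8] = { 0 };
    int in_fd, out_fd, ok;

    in_fd = open(src, O_RDWR | O_CREAT | O_TRUNC, 0600);
    out_fd = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0600);
    ok = in_fd >= 0 && out_fd >= 0;
    ok = ok && write(in_fd, "spliced", 7) == 7;
    ok = ok && lseek(in_fd, 0, SEEK_SET) == 0;
    ok = ok && splice_copy(in_fd, out_fd) == 7;
    ok = ok && pread(out_fd, buf, sizeof(buf), 0) == 7;
    ok = ok && memcmp(buf, "spliced", 7) == 0;
    if (in_fd >= 0)
        close(in_fd);
    if (out_fd >= 0)
        close(out_fd);
    unlink(src);
    unlink(dst);
    return ok ? 0 : -1;
}
```

The data never passes through a user-space buffer; the process only shuttles byte counts between the two calls. Whether the kernel can avoid copying internally depends on the filesystems involved.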
Work left for the future includes cleaning up the ("ugly," "nasty") internal interfaces, and generalizing the code so that any two file descriptors can be spliced together. The ability to splice to network sockets would be particularly useful. Some of this may take a while, so don't expect splice() to show up in the mainline in the immediate future.
Some new system calls
Posted Dec 22, 2005 4:15 UTC (Thu) by daniel (subscriber, #3181)
Re openat and friends:
A Good Thing indeed. A not so obvious application: user space filesystem stacking can be more efficient and race free with this syscall style. Too bad it wasn't done this way from the dawn of time.
I dimly recall proposing, for userspace filesystem stacking reasons, a similar set of syscalls to Ulrich a Christmas or two ago, oblivious to the existence of the Solaris syscalls of course. I think Ulrich must have been also, at that time. Maybe we need to spend more time looking at API improvements that Solaris, and Irix for that matter, have had since forever.
Regards,
Daniel
splice
Posted Dec 22, 2005 5:34 UTC (Thu) by thedevil (guest, #32913)
How is this different / not overlapping with sendfile () ?
splice
Posted Dec 22, 2005 8:12 UTC (Thu) by Ross (guest, #4065)
Using sendfile() is just a shortcut for read() followed by write() with the same buffer and length; it avoids copying into and out of userspace. The difference with splice() is that the reading and writing will happen automatically as data becomes available, without requiring userspace to perform additional system calls, determine optimal buffer sizes, etc.
I do wonder how error handling would work...
splice
Posted Dec 22, 2005 12:04 UTC (Thu) by nix (subscriber, #2304)
Plus, Linus has been heard to say that he now believes sendfile() to have been a mistake. This doesn't mean it will go away of course (it's a syscall, it's immortal) but if sendfile() can be implemented in terms of splice(), so much the better!
splice
Posted Dec 23, 2005 7:17 UTC (Fri) by thedevil (guest, #32913)
Ah, now I looked at the Linux manpage of sendfile and I understand a bit
Some new system calls
Posted Dec 22, 2005 20:54 UTC (Thu) by iabervon (subscriber, #722)
I remember openat() being relevant to the mechanism of handling file attributes as files. IIRC, the dfd could be a file, and you'd be interacting with the attribute pseudo-filesystem for that file. And you could open a directory as a file, then use openat() and get the attributes of the directory instead of the regular contents of the directory. Or, at least, there was something similar on Solaris for dealing with this sort of thing.
Multi-user systems
Posted Jan 5, 2006 6:46 UTC (Thu) by ringerc (subscriber, #3071)
This would be bliss if it became widely used on multi-user systems. If only `cp' was tweaked to use it when available the server at work would probably gain quite a bit, since we wouldn't be throwing out piles of perfectly good cached binaries and user data in favour of some disk blocks that'll only be used once.
The *at calls and Windows NT
Posted Mar 20, 2006 21:40 UTC (Mon) by Myria (guest, #36609)
A lot of people don't realize this, but Windows NT (and its descendants 2000, XP, 2003, Vista) actually does not have a "current directory" at all at the kernel level. When you open a file, you specify the directory to which the filename is relative. This is exactly like the openat, etc. proposed Linux syscalls above.
The current directory concept in Win32 is simulated by the user-mode library kernel32.dll. kernel32.dll retains 27 open directory handles, one per drive letter (plus 1 for things like UNC paths). When you open a file relative to the current directory, kernel32.dll and ntdll.dll translate your filename into a path relative to the open handles that simulate the current directory before calling the NT kernel.
I think these syscalls are a good idea. Unfortunately, the concept of a kernel-level current directory must be retained in Linux, otherwise things like a chroot jail would be impossible.
Melissa
The *at calls and Windows NT
Posted Mar 21, 2006 1:23 UTC (Tue) by BrucePerens (guest, #2510)
"Unfortunately, the concept of a kernel-level current directory must be retained in Linux, otherwise things like a chroot jail would be impossible."
The new system call that splits off the pathname space for private mounts is probably a superset of chroot(), and chroot() could be implemented on top of it. You don't need a current directory pointer for chroot() to work. Just a root.
Bruce
Copyright © 2005, Eklektix, Inc.