Glibc wrappers for (nearly all) Linux system calls
A C programmer working with glibc now would look in vain for a straightforward way to invoke a number of Linux system calls, including futex(), gettid(), getrandom(), renameat2(), execveat(), bpf(), kcmp(), seccomp(), and a number of others. The only way to get at these system calls is via the syscall() function. Over the years, there have been requests to add wrappers for a number of these system calls; in some cases, such as gettid() and futex(), the requests were summarily rejected by the (at-the-time) glibc maintainer in fairly typical style. More recently these requests have been reopened and others have been entertained, but there have been no system-call wrappers added since glibc 2.15, corresponding roughly to the 3.2 kernel.
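As an illustration of the syscall() route, here is a minimal sketch of a hand-rolled gettid(). It assumes a glibc without a gettid() wrapper, as described above; the name my_gettid() is made up for this example and is not part of any library.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_gettid */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: returns the kernel thread ID of the caller.
 * syscall() returns long; thread IDs fit in a pid_t. */
static pid_t my_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

The cost of this approach is that the compiler cannot check the argument list: syscall() is variadic, so a wrong argument count or type is only discovered at run time.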
On the face of it, adding a new system-call wrapper should be a simple exercise. The kernel has already defined an API for the system call, so it is just a matter of writing a simple function that passes the caller's arguments through to the kernel implementation. Things quickly get more complicated than that, though, for a number of reasons, but they all come down to one root cause: glibc is not just a wrapper interface for kernel-supplied functionality. Instead, it provides a (somewhat standard-defined) API that is meant to be consistent and independent of any specific operating system.
There are provisions for adding kernel-specific functions to glibc now; those functions will typically fail (with errno set to ENOSYS) when called on a kernel that does not support them. Examples of such functions include the Linux-specific epoll_wait() and related system calls. As a general rule, though, the glibc developers, as part of their role maintaining the low-level API for the GNU system, would like to avoid kernel-specific additions.
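The ENOSYS convention can be probed directly. The following is a sketch only: the helper name kernel_has_getrandom() is invented for this example, and the flag value 0x0001 is GRND_NONBLOCK from the kernel headers, written out numerically on the assumption that the installed libc headers may not define it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical probe: 1 if the running kernel implements getrandom(),
 * 0 if it does not (ENOSYS, or the headers lack the syscall number),
 * -1 if the call failed for some unrelated reason. */
static int kernel_has_getrandom(void)
{
#ifdef SYS_getrandom
    unsigned char byte;
    long ret = syscall(SYS_getrandom, &byte, sizeof byte,
                       0x0001 /* GRND_NONBLOCK */);

    if (ret >= 0)
        return 1;
    if (errno == ENOSYS)
        return 0;               /* kernel too old for this call */
    if (errno == EAGAIN)
        return 1;               /* supported, entropy pool not ready yet */
    return -1;
#else
    return 0;
#endif
}
```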
This concern has had the effect of keeping a lot of Linux system-call wrappers out of the GNU C Library. It is not necessarily that the glibc developers do not want that functionality, but figuring out how a new function would fit into the overall GNU API is not a straightforward task. The ideal interface may not (from the glibc point of view) be the one exposed by the Linux kernel, so another may need to be designed. Issues like error handling, thread safety, support on non-Linux systems, and POSIX-thread cancellation points can complicate things considerably. In many cases, it seems that few developers have wanted to run the gauntlet of getting new system-call wrappers into the library, even if the overall attitude toward such wrappers has become markedly more friendly in recent years.
Back in May 2015, Joseph Myers proposed relaxing the rules just a little bit, at least in cases when the functionality provided by a wrapper might be of general interest. In such cases, Joseph suggested, there would be no immediate need to provide support for other operating-system kernels unless somebody found the desire and the time to do the work.
Roland McGrath is, by his own admission, the hardest glibc developer to convince about the value of adding Linux-specific system calls to the library. He still does not see a clear case for adding many Linux system-call wrappers to the core library; it is only clear, he said, when the system call is to be a part of the GNU API:
I propose that we rule out adding any symbols to the core libc ABIs that are not entering the OS-independent GNU API.
Roland does not seem to believe that glibc should entirely refuse to support system calls that don't meet the above criterion, though. Instead, he suggested creating another library specifically for them. It would be called something like "libinux-syscalls" (so that one would link with "-linux-syscalls"). Functions relegated to this library should be simple wrappers, without internal state, with the idea that supporting multiple versions of the library would be possible.
There was some discussion on the details of this idea, but the core of it seems to be relatively uncontroversial. Also uncontroversial is the idea that glibc need not provide wrappers for system calls that are obsolete, that cannot be used without interfering with glibc (set_thread_area() is an example), or those that are expected to have a single caller (such as create_module()). So Carlos O'Donell has proposed a set of rules that would clear the way for the immediate addition of operating-system-independent system calls into the core and the addition of a system-dependent library for the rest.
Of course, "immediate" is a relative term. Any system-call wrappers will still need to be properly implemented and documented, with test cases and more. There is also, in some cases, the more fundamental question of what the API should look like. Consider the case of the futex() system call, which provides access to a fast mutual-exclusion mechanism. As defined by the kernel, futex() is a multiplexer interface, with a single entry point providing access to a range of different operations.
Torvald Riegel made the case that exposing this multiplexer interface would do a disservice to glibc users:
He proposed exposing a different API based around several functions with names like futex_wake() and futex_wait(); he also posted a patch implementing this interface.
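To make the contrast concrete, here is a rough sketch of what per-operation functions layered over the multiplexed futex() call could look like. This is not Torvald's patch: the names my_futex_wait() and my_futex_wake() and their signatures are assumptions made for this example, and only two of the many futex operations are covered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE */
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: sleep while *addr still equals expected.
 * Returns 0 when woken, -1 with errno set otherwise (EAGAIN if the
 * value had already changed, ETIMEDOUT on timeout, EINTR on a signal). */
static long my_futex_wait(uint32_t *addr, uint32_t expected,
                          const struct timespec *timeout)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, timeout, NULL, 0);
}

/* Hypothetical wrapper: wake up to nwaiters threads blocked on addr.
 * Returns the number of threads woken, or -1 with errno set. */
static long my_futex_wake(uint32_t *addr, int nwaiters)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}
```

Each function has a fixed, checkable argument list, whereas the raw call reinterprets its trailing arguments depending on the operation code.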
Joseph, while not disagreeing with that interface, insisted that the C library should provide direct access to the raw system call, saying: "The fact that, with hindsight, we might not have designed an API the way it was in fact designed does not mean we should embed that viewpoint in the choice of APIs provided to users". In the end, the two seemed to agree that both types of interface should, in some cases, be provided. If the C library can provide a useful higher-level interface, that may be appropriate to add, but more direct access to the system call as provided by the kernel should be there too.
The end result of all this is that we are likely to see a break in the logjam that has kept new system-call wrappers out of glibc. Some new wrappers could even conceivably show up in the 2.23 release, which can be expected sometime around February 2016. Even if the attitude and rules have changed, though, this is still glibc we are talking about, so catching up with the kernel may take a while yet. But one can take comfort in the fact that a path is now visible, even if it may yet be a slow one.
Posted Aug 21, 2015 0:11 UTC (Fri)
by ldo (guest, #40946)
[Link] (23 responses)
The fundamental problem is that the errno convention has outlived its usefulness. The Linux kernel calls return an error code directly, but to be POSIX-compatible, glibc has to squirrel these away in errno. Which requires all this complicated wrapper code, as well as a whole extra mechanism to make errno thread-safe.
I recently hit the situation where the write(2) call didn’t write all the bytes I gave it to disk, with no error indication in errno. The man page only says this can happen
In other words, you don’t know why it happened.
Posted Aug 21, 2015 0:16 UTC (Fri)
by dlang (guest, #313)
[Link]
It has nothing to do with errno, and a lot to do with the fact that there are a LOT of possible things that can cause write() to not write everything out, and the number of possible reasons is going to increase over time. How many thousands of error messages do you want to have to define (and then handle)?
Posted Aug 21, 2015 0:31 UTC (Fri)
by proski (subscriber, #104)
[Link] (9 responses)
Posted Aug 21, 2015 3:35 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (8 responses)
True, but C functions can return compound objects.
Modern ABIs would return the values in registers, and an optimizing compiler could elide the existence of an independent struct writeret object altogether. That kernels still only return a single integer value is more about not evolving with the times. Pre-ANSI C didn't permit passing compound objects by value, only pointers, so ABIs and compilers didn't have to consider optimizing that case. In 1989 ANSI C changed that to permit passing structs and unions, but not arrays. For a long time ABIs and compilers would always pass the values on the stack, and it was considered poor practice to make use of the feature in performance-critical code. But modern ABIs (e.g. AMD64) can pass the member values through registers. So there's no cost to using smallish structs as function parameters or return values.
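A user-space sketch of the struct-returning idea follows. It is only an approximation: it is built on the existing write(), so it cannot report a byte count and an error from the same call, which is the thing a real kernel-level change would add. The member is named err rather than errno because errno is a macro in <errno.h>.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Result pair returned by value; small enough to travel in registers
 * on ABIs such as AMD64. */
struct writeret {
    size_t n;    /* bytes written, 0 on failure */
    int    err;  /* 0 on success, otherwise an errno value */
};

/* Sketch of writex(): same arguments as write(), but the error is
 * returned in the struct instead of being left in errno. */
static struct writeret writex(int fd, const void *buf, size_t count)
{
    struct writeret r = { 0, 0 };
    ssize_t n = write(fd, buf, count);

    if (n < 0)
        r.err = errno;
    else
        r.n = (size_t)n;
    return r;
}
```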
Posted Aug 21, 2015 8:36 UTC (Fri)
by ehiggs (subscriber, #90713)
[Link]
Posted Aug 21, 2015 22:04 UTC (Fri)
by kleptog (subscriber, #1183)
[Link] (5 responses)
Well, and the fact that you can't just change the userland/kernel interface like that. Promises have been made about what happens to all the registers and there isn't anywhere to put any extra return values in a backward compatible way.
On top of that, even if the kernel could return the info, you can't change the POSIX API either so errno is here to stay.
Posted Aug 21, 2015 23:23 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Posted Aug 22, 2015 9:06 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (2 responses)
FWIW I disagree with the OP, it is nice to be able to know that a syscall either succeeded or failed and not some halfway state. If you did a write and the write was short the write still succeeded. POSIX does specify that if the write size is less than PIPE_BUF length then it will succeed or fail atomically. If you have to write your code to handle the case where some data was written but you also have to handle an error code, that just feels more fragile.
All the cases where it would be useful to return more information the specific syscall has made allowances for it, for example recvmsg(). The fact that there are syscalls that are badly designed is a problem with the interface and not the mechanism. I find the ip/tc tools use of netlink here pretty bad, they return EINVAL and you have to hope there's something useful in the kernel log. Would it have killed them to add an extra field for "extended error code"?
Posted Aug 23, 2015 8:03 UTC (Sun)
by lsl (subscriber, #86508)
[Link] (1 responses)
Only when actually writing to a pipe, not in the general case.
I just discovered that Linux (since 3.4) implements a Plan-9-style pipe mode where reads from a pipe match up with previous writes (provided the latter weren't greater than PIPE_BUF bytes). See the pipe(2) Linux manpage for the O_DIRECT flag to pipe2. Very nice.
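For the curious, a small sketch of that packet mode, assuming Linux 3.4 or later. The helper name packet_pipe_first_read() is made up; it writes two messages and returns the size of the first read, which in packet mode should match the first write (3 bytes) even though the buffer could hold both messages.

```c
#define _GNU_SOURCE
#include <fcntl.h>    /* O_DIRECT */
#include <unistd.h>   /* pipe2(), read(), write(), close() */

/* Hypothetical demo: returns the byte count of the first read from a
 * packet-mode pipe, or -1 if the pipe could not be created or used. */
static ssize_t packet_pipe_first_read(void)
{
    int fds[2];
    char buf[64];
    ssize_t n = -1;

    if (pipe2(fds, O_DIRECT) != 0)
        return -1;              /* e.g. EINVAL on kernels before 3.4 */

    /* Two separate writes become two separate packets. */
    if (write(fds[1], "abc", 3) == 3 && write(fds[1], "defgh", 5) == 5)
        n = read(fds[0], buf, sizeof buf);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```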
Posted Aug 23, 2015 10:35 UTC (Sun)
by kleptog (subscriber, #1183)
[Link]
How is this different to socketpair(AF_UNIX, SOCK_DGRAM) ? The only reason I can think of is that you want it to work on systems without UNIX domain sockets...
Posted Aug 22, 2015 2:58 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted Aug 28, 2015 12:44 UTC (Fri)
by justincormack (subscriber, #70439)
[Link]
Posted Aug 21, 2015 0:35 UTC (Fri)
by ncm (guest, #165)
[Link] (2 responses)
Posted Aug 21, 2015 1:46 UTC (Fri)
by k8to (guest, #15413)
[Link] (1 responses)
There are some harder-to-get-right network situations where you want to differentiate between trying again and not trying again, and not differentiating doesn't give you good behavior. But maybe you mean most of the time we screw that up too?
Posted Aug 21, 2015 2:29 UTC (Fri)
by ncm (guest, #165)
[Link]
Posted Aug 21, 2015 2:20 UTC (Fri)
by deater (subscriber, #11746)
[Link]
There is work underway though to address this, by improving the syscall error handling. I'm not sure how generic of a solution it is though.
Posted Aug 21, 2015 2:35 UTC (Fri)
by gutschke (subscriber, #27910)
[Link] (1 responses)
I have a completely different issue with "errno". The bulk of the time that I had to make raw system calls has been in extremely low-level code. When the code executes, I can't make much of a guarantee about the execution environment. Quite frequently, there is no such thing as an "errno" variable (e.g. because I just called clone(), and didn't set up thread local storage yet).
This means, I would need libinux to have zero dependencies on any libc code. No accesses to "errno", no accesses to thread local storage, no cancellation points, no locking, no calls to atfork handlers, nothing! But things get even more complicated than that. By default, the dynamic linker lazily resolves library functions. This means, whenever I make a call into libinux, there is a chance that the dynamic loader gets called and makes all sorts of calls that are incompatible with my particular requirements.
In other words, all of libinux would either need to be inline functions, or there needs to be a way to fully resolve its symbols on demand. It is quite possible that my needs are a little unusual, as I have been writing very low-level and Linux specific code. But that's probably something that people will end up wanting to use libinux for.
Other than that, yes, I am fully in favor of libinux giving easy and direct access to all Linux system calls. That feature is long overdue and would be very welcome. I also feel that having wrapper functions that make system calls easier to use is wonderful. I sometimes need the exact raw system calls; and when I do, I am OK with researching the idiosyncrasies of the kernel API and making sure I get things right. But most of the time, I don't need this much control and I actually appreciate having helpers that allow the compiler to make sure I don't do anything stupid.
Posted Aug 21, 2015 23:39 UTC (Fri)
by ncm (guest, #165)
[Link]
Posted Aug 21, 2015 6:36 UTC (Fri)
by epa (subscriber, #39769)
[Link]
Is there scope for adding non-errno versions of calls to POSIX?
Posted Aug 21, 2015 7:26 UTC (Fri)
by Yorick (guest, #19241)
[Link]
Posted Aug 21, 2015 21:02 UTC (Fri)
by vapier (guest, #15768)
[Link] (2 responses)
some functions, like ptrace(), have overlap between valid values and errors. in some cases it returns arbitrary data, so you cannot know whether 0xffffffff is because the data was 0xffffffff or -EPERM (on a 32bit system). you simply have to make assumptions that it's always valid based on other syscalls.
glibc doesn't treat *all* negative values as being errors -- it caps it at different values. on x86, it treats [-1,-4095] (or should it be [-4095,-1] ?) as an errno value. that way you aren't limiting yourself to 31bits, but (2^32 - 4096) possible valid values.
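a sketch of that decoding step, for illustration only. the name decode_syscall_ret() and the MAX_ERRNO macro are made up here; this mimics the convention described above rather than reproducing glibc's actual internal macros.

```c
#include <errno.h>

#define MAX_ERRNO 4095UL

/* Hypothetical decoder for a raw kernel return value: anything in
 * [-4095, -1] (viewed as unsigned: the top 4095 values) is an error
 * code, everything else is a valid result. */
static long decode_syscall_ret(unsigned long raw)
{
    if (raw >= -MAX_ERRNO) {
        errno = (int)-raw;      /* e.g. raw == -13 becomes errno 13 */
        return -1;
    }
    return (long)raw;           /* large values such as addresses pass through */
}
```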
further, the convention for returning errno values isn't consistent across architectures. some (most) will normalize into one register, but a few split it -- at least ia64 & mips do. that way there is no confusion whether there was an error.
further further, some syscalls have to deal with raw C calling conventions. namely, some ABIs (like arm, mips, and ppc) require uint64_t to be split on even/odd pairs. so instead of doing:
further further further, just because you call a specific syscall by name, it does not mean it's going to be the same across architectures. alpha is a pretty big example of this -- there are many syscalls that don't exist like __NR_getpid. instead they named it __NR_getxpid. they made a lot of decisions so as to be compatible with OSF (after all, surely OSF is more important than this toy "linux" project, and will obviously outlive it). or g'luck trying to do something as simple as mmap -- there's __NR_mmap, __NR_mmap2, and arches are not consistent as to how the offsets are used (maybe they're shifted ?).
the syscall(2) man page has a lot of good discussion in it.
Posted Aug 22, 2015 13:52 UTC (Sat)
by hrw (subscriber, #44826)
[Link] (1 responses)
Should I use __NR_stat, __NR_fstat, __NR_fstat64 or maybe __NR_newfstatat? Will my code run properly on all architectures if I use one of them or should I add some #ifdefs for architecture checks?
Even x86 has 3 architectures now (x86, x86-64, x32) which have different set of syscalls defined.
Posted Sep 23, 2015 3:24 UTC (Wed)
by vapier (guest, #15768)
[Link]
as for the syscalls you quoted, there are other stat variants (stat64 and fstatat64 at least). there's really no guarantee your code will compile or run properly when you call the syscalls directly. C libraries provide stable APIs/ABIs, including emulating newer functionality when the kernel is old (e.g. the *at syscalls could be emulated in userspace when the kernel was too old by utilizing /proc/self/fd/, but you'd have to call glibc's fstatat and not the kernel's syscall(__NR_newfstatat)).
Posted Sep 2, 2015 9:12 UTC (Wed)
by mirabilos (subscriber, #84359)
[Link]
Posted Aug 21, 2015 8:36 UTC (Fri)
by xnox (guest, #63320)
[Link] (1 responses)
Posted Aug 21, 2015 19:47 UTC (Fri)
by wahern (subscriber, #37304)
[Link]
I partition the negative range using a simple prefix system, which makes it easy to mix-and-match components. For example, from my DNS library,
Helpfully, strerror must always return a valid string for all integer values, even for unknown values.
So if an application-specific value accidentally leaks to a component that doesn't understand the protocol it's relatively benign. In fact, most strerror implementations will include the value in the message, so you'll actually get useful output if it gets passed to strerror. But usually each component will define its own strerror interface that forwards to strerror or a sub-component's strerror.
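The forwarding scheme described above can be sketched roughly as follows. Everything named app_* or APP_* is invented for this illustration and is not taken from the DNS library mentioned in the comment.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical component: its error codes live in a negative range
 * tagged with the prefix "app", well away from positive errno values. */
#define APP_EBASE -(('a' << 24) | ('p' << 16) | ('p' << 8) | 64)

enum app_errno {
    APP_EPARSE = APP_EBASE,
    APP_ETIMEOUT,
    APP_ELAST
};

/* Component-level strerror: handles its own codes and forwards
 * everything else (system errno values, unknown codes) to strerror(). */
static const char *app_strerror(int error)
{
    switch (error) {
    case APP_EPARSE:
        return "app: parse error";
    case APP_ETIMEOUT:
        return "app: timed out";
    default:
        return strerror(error);
    }
}
```

A caller can therefore pass any int it received, whether it came from the component or straight from errno, to the same function.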
It's the most useful and practical error reporting method I've found, at least for C code, and especially for C libraries. Of course, sometimes a routine is better defined (easier to use, more intuitive) by returning the error value through a reference, rather than as the return value. But that's just a variation on the theme. Unless you know you'll always operate in a closed software ecosystem, every other scheme is just chasing a dragon like other classic rookie mistakes: constantly writing configuration parsers, relying too heavily on malloc/free replacements, and reinventing logging instead of using stderr or perhaps syslog. Billions of man hours have been wasted down those rabbit holes.
Posted Aug 21, 2015 14:18 UTC (Fri)
by pr1268 (guest, #24648)
[Link] (9 responses)
Umm, I have a problem with the "libinux" part of that. Granted, the associated command-line linker directive may be easier to understand, but everyone1 knows that the shared library name to link is the same as the actual .so file, minus that extension and also without the leading "lib" (immediately following the -l switch, of course). Why not just call it "liblinux-syscalls" and tell programmers to link it with -llinux-syscalls? Having two letter l's isn't so bad; hasn't anyone ever used -llzma, -llcms, or -llo? I think Roland's idea is fine, just not the part about naming a proposed linked library something awkward and possibly confusing programmers to thinking it is a new and different command-line option. 1 I'm making a silly generalization here, of course.
Posted Aug 21, 2015 14:23 UTC (Fri)
by gevaerts (subscriber, #21521)
[Link] (5 responses)
Posted Aug 21, 2015 16:03 UTC (Fri)
by barryascott (subscriber, #80640)
[Link]
nroff -man ls.1
Posted Aug 21, 2015 23:27 UTC (Fri)
by ncm (guest, #165)
[Link]
Posted Aug 26, 2015 7:19 UTC (Wed)
by pr1268 (guest, #24648)
[Link] (2 responses)
Well then, why stop here? Why not get creative with DSO names? I propose: (Just kidding.) ;-)
Posted Aug 27, 2015 9:00 UTC (Thu)
by njd27 (subscriber, #5770)
[Link] (1 responses)
Posted Sep 9, 2015 9:47 UTC (Wed)
by jengelh (guest, #33263)
[Link]
Posted Aug 21, 2015 19:26 UTC (Fri)
by xtifr (guest, #143)
[Link] (2 responses)
1 I'm making a silly generalization as well, of course. :)
As gevaerts points out, there is definite precedent for this, and once noticed, it's never forgotten. And the consequences of getting it wrong are: link fails, need to link again. Not exactly earth-shattering. And the precedent was even associated with the same project: libiberty was what you used when you wanted to take advantage of some advanced glibc features on another vendor's libc.
I'd hope that this would only affect a pretty tiny percentage of code, in any case. It might turn out to be a very important tiny percentage—critical daemons and the like—but I hope people haven't given up on the idea of writing reasonably portable code in general. Or of not making their tangled web of #ifdefs any more tangled than they really need to be!
Posted Aug 22, 2015 2:54 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Aug 23, 2015 2:19 UTC (Sun)
by xtifr (guest, #143)
[Link]
Time To Get Rid Of errno
... if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes.
struct writeret { size_t n; int err; };  /* member cannot be named "errno": that is a macro in <errno.h> */
struct writeret writex(int, const void *, size_t);
To go along with this change, it would be nice if C could also destructure return values of anonymous structs.
e.g.
struct { size_t n; int errno; } write(int fd, const void *buf, size_t count);
(n, errno) = write(fd, buf, count);
errno, schmerrno.
https://lwn.net/Articles/652326/
It will be interesting if that actually gets merged.
The write design may not be wonderful, but the standard procedure upon a short write is to try again with the remainder of the data to be written. Then you will get the real reason in errno (unless it was just a temporary condition the first time).
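That retry procedure is commonly written as a loop like the one below. This is a generic sketch, not code from any particular project; the name write_all() is chosen for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling write() until all count bytes are out or a real error
 * shows up. Returns 0 on success, -1 with errno set on failure. */
static int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;

    while (count > 0) {
        ssize_t n = write(fd, p, count);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;          /* the retry exposes the actual cause */
        }
        p += n;                 /* short write: advance and try again */
        count -= (size_t)n;
    }
    return 0;
}
```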
syscall(SYS_readahead, fd, (uint32_t)(offset), (uint32_t)(offset >> 32), count)
you have to insert a 0 after the fd by hand:
syscall(SYS_readahead, fd, 0, (uint32_t)(offset), (uint32_t)(offset >> 32), count)
If everything else is returned via reference, why -errno? In my code I also usually return errno directly, but without negating it. I return negative values for application-defined errors. All C- and POSIX-defined errno values must be positive, and no unix-like system uses the negative range for implementation-specific errno values; nor does Windows AFAIK. That way my libraries can pass-through system errors, and I don't have to define ad hoc types for error reporting (which sounds like a good idea in principle until you have to glue together a dozen different libraries; even using enums causes headaches because of GCC and clang warnings).
#define DNS_EBASE -(('d' << 24) | ('n' << 16) | ('s' << 8) | 64)

enum dns_errno {
    DNS_ENOBUFS = DNS_EBASE,
    DNS_EILLEGAL,
    DNS_EORDER,
    DNS_ESECTION,
    DNS_EUNKNOWN,
    DNS_EADDRESS,
    DNS_ENOQUERY,
    DNS_ENOANSWER,
    DNS_EFETCHED,
    DNS_ESERVICE, /* EAI_SERVICE */
    DNS_ENONAME,  /* EAI_NONAME */
    DNS_EFAIL,    /* EAI_FAIL */
    DNS_ELAST,
}; /* dns_errno */

/* for documentation only; will always be type int */
#define dns_error_t int
...
dns_error_t dns_res_submit(struct dns_resolver *, const char *, enum dns_type, enum dns_class);
struct dns_packet *dns_res_fetch(struct dns_resolver *, dns_error_t *);
The strerror function maps the number in errnum to a message string. Typically,
the values for errnum come from errno, but strerror shall map any value of type
int to a message.
C11 (N1570) 7.24.6.2p2
libinux ain't right
It would be called something like "libinux-syscalls" (so that one would link with "-linux-syscalls").
libinux ain't right - why not add more?
It's clearly inspired by libiberty. There is a precedent for this!
...possibly confusing programmers to thinking it is a new and different command-line option.
But everyone1 knows that an argument starting with -l refers to a library, so there's no possibility of confusion!