By Michael Kerrisk
January 30, 2013
We are accustomed to thinking of a system call as
being a direct service request to the kernel. However, in reality, most
system call invocations are mediated by wrapper functions in the GNU C
library (glibc). These wrapper functions eliminate work that the programmer
would otherwise need to do in order to employ a system call. But it turns
out that glibc does not provide wrapper functions for all system calls,
including a few that see somewhat frequent use. The question of what (if
anything) to do about this situation has arisen a few times in the last few
months on the libc-alpha mailing list, and has recently surfaced once more.
A system call allows a program to request a service—for example,
open a file or create a new process—from the kernel. At the assembler
level, making a system call requires the caller to
assign the unique system call number and the argument values to particular
registers, and then execute a special instruction (e.g., SYSENTER on modern
x86 architectures) that switches the processor to kernel mode to execute
the system-call handling code. Upon return, the kernel places the system
call's result status into a particular register and executes a special
instruction (e.g., SYSEXIT on x86) that returns the processor to user
mode. The usual convention for the result status is that a non-negative
value means success, while a negative value means failure. A negative
result status is the negated error number (errno) that indicates
the cause of the failure.
All of the details of making a system call are normally hidden from the
user by the C library, which provides a corresponding wrapper function and
header file definitions for most system calls. The wrapper function accepts
the system call arguments as function arguments on the stack, initializes
registers using those arguments, and executes the assembler instruction
that switches to kernel mode. When the kernel returns control to user mode,
the wrapper function examines the result status, assigns the (negated)
error number to errno in the case of a negative result, and
returns either -1 to indicate an error or the non-negative result status as
the return value of the wrapper function. In many cases, the wrapper
function is quite simple, performing only the steps just described. (In
those cases, the wrapper is actually autogenerated from
syscalls.list files in the glibc source that tabulate the types
of each system call's return value and arguments.) However, in a few cases
the wrapper function may do some extra work such as repackaging arguments
or maintaining some state information inside the C library.
The C library thus acts as a kind of gatekeeper on the API that the kernel
presents to user space. Until the C library provides a wrapper function,
along with suitable header files that define the calling signature and any
constant and structure definitions used by the system call, users must
do some manual work to make a system call.
That manual work includes defining the structures and constants needed
by the system call and then invoking the syscall() library
function, which handles the details of making the system call—copying
arguments to registers, switching to kernel mode, and then setting
errno once the kernel returns control to user space. Any system
call can be invoked in this manner, including those for which the C library
already provides a wrapper. Thus, for example, one can bypass the wrapper
function for read() and invoke the system call directly by
writing:
    nread = syscall(SYS_read, fd, buf, len);
The first argument to syscall() is the number of the system
call to be invoked; SYS_read is a constant whose
definition is provided by including <sys/syscall.h>
(syscall() itself is declared in <unistd.h>).
The C library used by most Linux developers is of course the GNU C
library. Normally, glibc tracks kernel system call changes quite
closely, adding wrapper functions and suitable header file definitions to
the library as new system calls are added to the kernel. Thus, manually
coding system calls is normally only needed when trying to use the
latest system calls that have not yet appeared in the most recent iteration
of glibc's six-month release cycle or when using a recent kernel on a
system that has a significantly older version of glibc.
However, for some system calls, glibc support never appears. The
question of how the decision is made on whether to support a particular
system call in glibc has once again become a topic of discussion on the
libc-alpha mailing list. The most recent discussion started when Kees Cook,
the implementer of the recently added
finit_module() system call, submitted a rudimentary patch to add glibc
support for the system call. In response, Joseph Myers and Mike Frysinger
noted various pieces that were missing from the patch, with Joseph
adding that "in the
kexec_load discussion last May / June, doubts were expressed about whether
some existing module-related syscalls really should have had functions in
glibc."
The module-related system calls—init_module(),
delete_module(), and so on—are among those for which glibc
does not provide support. The situation is in fact slightly more complex
in the case of these system calls: glibc does not provide any header file
support for these system calls but does, through an accident of history,
export a wrapper function ABI for the calls.
The earlier discussion that Joseph referred to took place when
Maximilian Attems attempted to add a header file to glibc to provide
support for the kexec_load() system call, stating that his aim was "to axe the
syscall maze in kexec-tools itself and have this syscall supported in
glibc." One of the primary glibc maintainers, Roland McGrath, had a rather different take on the
necessity of such a change, stating "I'm not really convinced this
is worthwhile. Calling 'syscall' seems quite sufficient for such arcane
and rarely-used calls." In other words, adding support for these
system calls clutters the glibc ABI and requires (a small amount of) extra
code in order to satisfy the needs of a handful of users who could just use
the syscall() mechanism.
Andreas Jaeger, who had reviewed earlier versions of Maximilian's
patch, noted that
"linux/syscalls.list already [has] similar esoteric syscalls like
create_module without any header support. I wouldn't object to do this for
kexec_load as well". Roland agreed
that the kexec_load() system call is a similar case, but felt that
this point wasn't quite germane, since adding the module system calls to
the glibc ABI was a "dubious" historical step that can't be reversed for
compatibility reasons.
But in the recent discussion of finit_module(), Mike Frysinger
spoke in favor of adding full glibc support
for module-related system calls such as init_module(). Dave
Miller made a similar argument even more
succinctly:
    "It makes no sense for every tool that wants to support
    doing things with kernel modules to do the syscall()
    thing, propagating potential errors in argument signatures
    into more than one location instead of getting it right in
    one canonical place, libc."
In other words, employing syscall() can be error prone: there is
no checking of argument types nor even checking that sufficient arguments
have been passed.
Joseph Myers felt that the earlier
kexec_load() discussions hadn't fully settled the issue, and was
interested in having some concrete data on how many system calls don't have
glibc wrappers. Your editor subsequently donned his man-pages maintainer
hat and grepped the man pages in section 2 to determine which system calls
do not have full glibc support in the form of a wrapper function and header
files. The resulting list turns out to be
quite long, running to nearly 40 Linux system calls. However, the
story is not quite so simple, since some of those system calls are obsolete
(e.g., tkill(), sysctl(), and query_module())
and others are intended for use only by the kernel or glibc (e.g.,
restart_syscall()). Yet others have wrappers in the C library,
although the wrappers have significantly different names and provide some
piece of extra functionality on top of the system call (e.g.,
rt_sigqueueinfo() has a wrapper in the form of the sigqueue()
library function). Clearly, no wrapper is required for those system calls,
and once they are excluded there remain perhaps 15 to 20 system calls
that might be candidates to have glibc support added.
Motohiro Kosaki considered that the
remaining system calls could be separated into two categories: those with
only one or a few application uses and those that seemed to him to have
more widespread application use. Motohiro was agnostic about whether
the former category (which includes the module-related system calls,
kcmp(), and kexec_load()) required a wrapper. However, in
his opinion the system calls in the latter category (which includes system
calls such as ioprio_set(), ioprio_get(), and
gettid()) clearly merited having full glibc support.
The lack of glibc support for gettid(), which returns the
caller's kernel thread ID, is an especially noteworthy case. A
long-standing glibc bug report
requesting that glibc add support for this system call gained little traction
with the previous glibc maintainer. However, excluding that system call is
rather anomalous, since it is quite frequently used and the kernel exposes
thread IDs via various /proc interfaces, and glibc exposes various
kernel APIs that can employ kernel thread IDs (for example,
sched_setaffinity(), fcntl(), and the
SIGEV_THREAD_ID notification mode for POSIX timers).
The discussion has petered out in the last few days, despite Mike
Frysinger's attempt to further push the debate along by reading and
summarizing the various pro and contra arguments in a single email. As noted by various
participants in the discussion, adding glibc wrappers for some currently
unsupported system calls would seem to have some worthwhile benefits. It
would also help to avoid the confusing situation where programmers
sometimes end up searching for a glibc wrapper function and header file
definitions that don't exist. It remains to be seen whether these arguments
will be sufficient to persuade Roland in the face of his concerns about
cluttering the glibc ABI and adding extra code to the library for the
benefit of what he believes is a relatively small number of users.