Extended error reporting

By Jonathan Corbet
February 17, 2010

Linux contains a number of system calls which do complex things; they take large structures as input, operate on significant internal state, and, perhaps, return some sort of complicated output data. The normal status returned from these system calls, however, is compressed down into a single integer called errno. Application programmers dealing with certain subsystems (Video4Linux2 being your editor's favorite in this regard) will all be well familiar with the process of trying to figure out what the problem is when the kernel says only "it failed."

Andi Kleen describes the problem this way:

I always describe that as the "ed approach to error handling". Instead of giving an error message you just give ?. Just ? happens to be EINVAL in Linux.

My favourite example of this is the configuration of the networking queueing disciplines, which configure complicated data structures and algorithms and in many cases have tens of different error conditions based on the input parameters -- and they all just report EINVAL.

It would be nice to provide application developers with better information than this. A brief discussion covered some of the options:

  - Use printk() to put information into the system logfile. This approach is widely used, but it bloats the kernel with string data, risks flooding the logs, and the resulting information may not be easily accessible to an unprivileged programmer.
  - Extend specific system calls to enable them to provide richer status information. Just adding a new version of ioctl() would address many of the worst problems.
  - Create an errno-like mechanism by which any system call could return extended information. That information could be an error string, some sort of special code, or, as Alan Cox suggested, a pointer to the structure field which caused the problem.
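To make the last option more concrete, here is a user-space sketch of what "point at the offending field" might look like. Everything in it is an assumption made for illustration: `struct queue_config`, `struct ext_error`, and `queue_config_check()` do not exist in the kernel, and the sketch reports a byte offset within the structure rather than a pointer, which is one possible reading of the suggestion and not a design anybody proposed in this form.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request structure, standing in for a large ioctl() argument. */
struct queue_config {
    unsigned int limit;   /* must be non-zero */
    unsigned int burst;   /* must not exceed limit */
    unsigned int flags;   /* only bits in QC_KNOWN_FLAGS may be set */
};

#define QC_KNOWN_FLAGS   0x3u
#define EXT_ERR_NO_FIELD ((size_t)-1)

/* Hypothetical extended status: an errno value plus the offending field's offset. */
struct ext_error {
    int code;             /* 0 on success, otherwise an errno value */
    size_t field_offset;  /* offset into struct queue_config, or EXT_ERR_NO_FIELD */
};

/* Validate a request; every failure is EINVAL, but the offset says where. */
struct ext_error queue_config_check(const struct queue_config *cfg)
{
    struct ext_error err = { 0, EXT_ERR_NO_FIELD };

    assert(cfg != NULL);
    if (cfg->limit == 0) {
        err.code = EINVAL;
        err.field_offset = offsetof(struct queue_config, limit);
    } else if (cfg->burst > cfg->limit) {
        err.code = EINVAL;
        err.field_offset = offsetof(struct queue_config, burst);
    } else if (cfg->flags & ~QC_KNOWN_FLAGS) {
        err.code = EINVAL;
        err.field_offset = offsetof(struct queue_config, flags);
    }
    return err;
}
```

A caller can compare `field_offset` against `offsetof()` for each member to learn which one was rejected, without the kernel having to carry any message strings.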

One could certainly argue that the narrow errno mechanism is showing its age and could use an upgrade. Any enhancements, though, would be Linux-specific and non-POSIX, which always tends to limit their uptake. They would also have to be lived with forever, and, thus, would require careful design. So we're unlikely to see a solution in the mainline anytime soon, even if somebody does take up the challenge.

Extended error reporting

Posted Feb 19, 2010 4:23 UTC (Fri) by PaulWay (guest, #45600)

I'd like to put in a mention here of libexplain - http://libexplain.sourceforge.net - which Peter Miller has worked on for many years. Its purpose is somewhat different - you use it in your code (e.g. 'explain_open') in place of your regular system calls, and if a call fails it prints out a complete message explaining why it failed. This even breaks down to which directory or file was missing in the path you've given to open.

But I think Peter's experience and general approach here would be invaluable as groundwork for making errors better in the kernel.
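To illustrate the kind of after-the-fact diagnosis described above, here is a rough sketch that does not use libexplain and makes no claim about how libexplain is implemented. The function `first_bad_component()` is made up for this example: after a failed open(), it walks the path one component at a time and reports the first prefix that stat() rejects.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: copy into 'out' the shortest prefix of 'path' that
 * stat() rejects. Returns the errno from that stat(), 0 if every prefix
 * exists (with 'out' set to ""), ENOENT for an empty path, or ENAMETOOLONG
 * if 'out' cannot hold the path.
 */
int first_bad_component(const char *path, char *out, size_t outlen)
{
    size_t len = strlen(path);
    struct stat st;

    assert(out != NULL && outlen > 0);
    if (len == 0)
        return ENOENT;
    if (len >= outlen)
        return ENAMETOOLONG;

    for (size_t i = 1; i <= len; i++) {
        /* A prefix ends just before each '/' and at the end of the string. */
        if (path[i] != '/' && path[i] != '\0')
            continue;
        memcpy(out, path, i);
        out[i] = '\0';
        if (stat(out, &st) != 0)
            return errno;
    }
    out[0] = '\0';
    return 0;
}
```

Because the check runs after the original call has already failed, it is inherently racy: the filesystem may have changed in between, so the answer is a hint and not a guarantee.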

Have fun,

Paul

Extended error reporting

Posted Feb 22, 2010 10:39 UTC (Mon) by mjcoder (guest, #54432)

What about some kind of exception handling - just in a way similar to Windows? I guess that it would mean massive changes to the kernel, but getting rid of this ancient way of error reporting (with simple return codes) seems to be necessary.

Extended error reporting

Posted Feb 22, 2010 11:05 UTC (Mon) by nix (subscriber, #2304)

Find a way to do that without violating POSIX, adding a huge new layer of
non-POSIX infrastructure (which nobody will ever use for portable code),
or replicating every single POSIX call (which all set errno) with some
other call (which throws this kernel->userspace exception thing).

Now find a way to throw exceptions from the kernel into userspace without
violating MS's patents on exactly that.

Extended error reporting

Posted Feb 22, 2010 11:22 UTC (Mon) by mjcoder (guest, #54432)

I'm not sure about it, but is POSIX compatibility required at kernel level - or is it enough to get POSIX compatibility at "libc" level?

Extended error reporting

Posted Feb 22, 2010 13:46 UTC (Mon) by nix (subscriber, #2304)

What's the point of an extended error reporting interface if it's only
exposed to the libc? Of *course* applications will have to be able to see
it.

Extended error reporting

Posted Feb 24, 2010 11:25 UTC (Wed) by mjcoder (guest, #54432)

Just imagine: you would be able to have several subsystems (like POSIX with limited error information) on top of a stable, feature-rich kernel. I really like this idea. Something like Wine could be implemented as another subsystem ... or even BSD compatibility (if needed).

Extended error reporting

Posted Feb 24, 2010 22:41 UTC (Wed) by nix (subscriber, #2304)

I'm just imagining what Al would say about the possibility of adding extra
personalities to the Linux kernel for something like this (personality()
is what Linux uses for this sort of thing). I don't think I can invent
elaborate enough ways of saying 'hell no'.

(Among other things, this would partition the space of processes *again*:
you'd now have 32-bit, 64-bit, 32-bit-extended-error, and
64-bit-extended-error processes... and, uniquely, these extended-error
processes would require *source code changes*. I don't think this is going
to fly, not least because you'd need to convince Ulrich Drepper to do this
to libc, or write a new one. Writing a new one is likely to be less
difficult :) )

Extended error reporting

Posted Feb 24, 2010 19:50 UTC (Wed) by nevets (subscriber, #11875)

What about adding a char array to the task_struct that any syscall function could write to? Then add another syscall that lets the user access this string.

This may be the easiest way to handle such a thing. It does not break any POSIX compliance, but allows new applications to test if this syscall exists, and if so, after a failure of another syscall, it can query more information about what happened.
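A user-space simulation can show the shape of this idea. It is only a sketch under stated assumptions: a thread-local buffer stands in for the char array in task_struct, `ext_err_get()` stands in for the new query syscall, and `fake_set_rate()` is an invented "syscall" with two failure modes. None of these names exist in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define EXT_ERR_MAX 128

/* Stand-in for the char array in task_struct: one buffer per thread. */
static _Thread_local char ext_err_buf[EXT_ERR_MAX];

/* What a syscall implementation would call just before returning an error. */
void ext_err_set(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(ext_err_buf, sizeof(ext_err_buf), fmt, ap);
    va_end(ap);
}

/*
 * Stand-in for the query syscall: copy out the last message.
 * Returns the message length, or -ERANGE if 'dst' is too small.
 */
int ext_err_get(char *dst, size_t len)
{
    size_t need = strlen(ext_err_buf) + 1;

    assert(dst != NULL);
    if (len < need)
        return -ERANGE;
    memcpy(dst, ext_err_buf, need);
    return (int)(need - 1);
}

/* Invented "syscall": two distinct mistakes, one errno value. */
int fake_set_rate(unsigned int rate, unsigned int ceiling)
{
    ext_err_buf[0] = '\0';  /* clear stale detail on entry */
    if (rate == 0) {
        ext_err_set("rate must be non-zero");
        return -EINVAL;
    }
    if (rate > ceiling) {
        ext_err_set("rate %u exceeds ceiling %u", rate, ceiling);
        return -EINVAL;
    }
    return 0;
}
```

An application that knows about the extra call would invoke `ext_err_get()` only after seeing a failure; older applications never call it and observe no change in behavior. Clearing the buffer on entry is a choice made for this sketch, so that a stale message is never attributed to a later call.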

Extended error reporting

Posted Feb 27, 2010 15:17 UTC (Sat) by jengelh (guest, #33263)

iptables has so far "solved" this by also doing checks in userspace, or by having the kernel at least printk a hint as to what's wrong.