Automatic async-signal-unsafe checking?

Posted Jul 2, 2024 17:35 UTC (Tue) by rweikusat2 (subscriber, #117920)
In reply to: Automatic async-signal-unsafe checking? by geofft
Parent article: Serious vulnerability fixed with OpenSSH 9.8

I fail to see any difference between "just don't call functions which aren't async signal sage" and your elaboration about just that, because that's what it boils down to. What precisely is or isn't "nontrivial work" is in the eye of the beholder. OpenSSH has had two vulnerabilities because of this approach in the last 25 years. And OpenSSH is a high-profile security-related program developed by experts in this area.

An alternate option is to defer handling of asynchronous signals until a time when the process can safely be interrupted. For a program structured around a synchronous I/O multiplexing loop, this will be when the process has again called into the kernel to be notified when more I/O is possible. System call variants exists specifically for this use case, namely the p-variants of select, poll etc. This obviously also requires that all handled signals are blocked while signal handlers themselves are executing as they will have been unblocked by the I/O multiplexing call. An advantage of this is that it doesn't need userspace code for dispatching signals.

Yet another option is to block all signals an application wants to handle and receive signals as input on a signalfd file descriptor. This needs some form of userspace signal dispatching but will work in environments where sigaction isn't easily usable (eg, perl).

Automatic async-signal-unsafe checking?

Posted Jul 2, 2024 21:40 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (4 responses)

> What precisely is or isn't "nontrivial work" is in the eye of the beholder.

The comment you are replying to just explained exactly what "nontrivial work" is. It's anything other than setting a flag, writing one byte into a self-pipe (or the like), or calling _exit(2). These are not examples. They are an exhaustive list. Literally any other code under the sun is "nontrivial."

> System call variants exists specifically for this use case, namely the p-variants of select, poll etc. This obviously also requires that all handled signals are blocked while signal handlers themselves are executing as they will have been unblocked by the I/O multiplexing call. An advantage of this is that it doesn't need userspace code for dispatching signals.

This works great, right up until one of your libraries tries to use alarm(2) (or sleep(3), usleep(3), setitimer(2), etc.) to do something interesting, and finds that signals are blocked and it does not work. And yes, that probably does go against the "no nontrivial work" rule in most practical cases, but in userspace, the sad reality is that you do not own all of the code in your process, and sometimes have to live with libraries doing ill-advised things.

(Never mind the even crazier possibility that one of your libraries will decide it is a good idea to manipulate the signal mask!)

Automatic async-signal-unsafe checking?

Posted Jul 3, 2024 14:18 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

Parent article: Serious vulnerability fixed with OpenSSH 9.8

>> What precisely is or isn't "nontrivial work" is in the eye of the beholder.
> The comment you are replying to just explained exactly what "nontrivial work" is. It's anything other than setting a > flag, writing one byte into a self-pipe (or the like), or calling _exit(2). These are not examples. They are an
> exhaustive list. Literally any other code under the sun is "nontrivial."

The comment reflects the opinion of the author what's considered "non-trivial". The set of async signal safe function is significantly larger than just write and _exit which very strongly suggests that other people do not share this opinion. For another example, OpenBSD (and also, some Linux distributions) have an asycnc-signal-safe syslog function specifically to enable it to be used safely from asynchronous signal handlers.

Automatic async-signal-unsafe checking?

Posted Jul 3, 2024 22:48 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

Operating systems will happily let you carry out all manner of bad ideas. The fact that something is supported doesn't magically make it into a good idea.

Automatic async-signal-unsafe checking?

Posted Jul 4, 2024 1:52 UTC (Thu) by geofft (subscriber, #59789) [Link] (1 responses)

You are continuing to misunderstand or mischaracterize my proposal. I am not making the proposal of restricting your signal handlers to the entire set of async-signal-safe functions, so I don't know why you are commenting on how large that set is - it is not relevant to my proposal.

My proposal is to permit a narrow and well-defined set of operations, much smaller than the set of async-signal-safe functions. I am not expressing the opinion that all other functions are async-signal-unsafe. There are many other async-signal-safe functions; you should, nonetheless, not call them.

Contrary to what you say, OpenSSH has not tried my proposal. They have tried the proposal you are attacking - of restricting signal handlers to async-signal-safe calls, and paying attention to how that set changes on various OSes. That is the proposal that (as you point out) has failed in practice. We should not adopt it. I am not calling for people to adopt it.

For context, here is the patch in TFA that OpenSSH suggests to fix the vulnerability in existing releases:

diff --git a/log.c b/log.c
index 9fc1a2e2e..191ff4a5a 100644
--- a/log.c
+++ b/log.c
@@ -451,12 +451,14 @@ void
 sshsigdie(const char *file, const char *func, int line, int showfunc,
     LogLevel level, const char *suffix, const char *fmt, ...)
 {
+#ifdef SYSLOG_R_SAFE_IN_SIGHAND
 	va_list args;
 
 	va_start(args, fmt);
 	sshlogv(file, func, line, showfunc, SYSLOG_LEVEL_FATAL,
 	    suffix, fmt, args);
 	va_end(args);
+#endif
 	_exit(1);
 }

It's very clear to see that they're taking the approach of calling or not calling functions depending on whether they are async-signal-safe, and this is exactly what I am claiming one should not do. Under my proposal, there should not be any function calls to other user-written functions in a signal handler - the very existence of this sshsigdie() function is a design flaw, because it is not immediately obvious that it is called from a signal handler and subject to async-signal-safety restrictions. (In fact, it's called indirectly from a macro sigdie(), further obscuring the link between the signal handler and this code.) The idea of doing conditional compilation based on the current OS's async signal safety guarantees is an indication that you are already well down an incorrect path.

My proposal is to change the code in OpenSSH such that the SIGALRM handler is not responsible for ending the child process, it is just responsible for notifying whatever code was being called that the child process should shut down. If a read() from the remote end returns EINTR and a flag is set, that's a sign to call the regular code to exit the process. The handler should not be doing it.

My proposal is that, if you want to use alarm()/SIGALRM for a timeout, your alarm handling should be in the course of how you process all other events - much as it would be if you were using timerfd for a timeout on an OS that has it. If you want to syslog, pass a message to your main program to syslog. The fact that things like timerfd exist is, I would argue, evidence that this is in fact the consensus view. So is the fact that we have had numerous new types of kernel features in the last few decades exposed via pollable file descriptors and none via new signals (not counting the real-time signals, which are a rather different sort of thing from signals like SIGALRM and SIGWINCH).

In fact, it looks like the patch that OpenSSH applied in 9.8 to fix the bug was to do exactly this sort of change - change the signal handler to exit without logging anything, and move responsibility for logging into the parent sshd process.

And this patch points out what's actually going on in the vulnerability: it's a timeout that needs to be able to interrupt a blocking PAM call. The approach of setting a flag doesn't work here, because you're in code that you don't control in a PAM module which cannot check your flag; you have to call _exit(). But by the same token, your proposal of blocking signals outside of pselect() etc. wouldn't work here either: if you only unblocked SIGALRM while you're inside a pselect() call of your own, you would never interrupt the stuck PAM module.

Automatic async-signal-unsafe checking?

Posted Jul 4, 2024 5:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> The approach of setting a flag doesn't work here, because you're in code that you don't control in a PAM module which cannot check your flag; you have to call _exit().

These days, I would move the timeout to the parent process entirely, and instead I will send heartbeat messages to it.