This is why we can't have safe cancellation points

April 13, 2016

This article was contributed by Neil Brown

Signals have been described as an "unfixable design" aspect of Unix. A recent discussion on the linux-kernel mailing list served to highlight some of the difficulties yet again. There were two sides to the discussion, one that focused on solving a problem by working with the known challenges and the existing semantics, and one that sought to fix the purportedly unfixable.

The context for this debate is the pthread_cancel(3) interface in the Pthreads POSIX threading API. Canceling a thread is conceptually similar to killing a process, though with significantly different implications for resource management. When a process is killed, the resources it holds, like open file descriptors, file locks, or memory allocations, will automatically be released.

In contrast, when a single thread in a multi-threaded process is terminated, the resources it was using cannot automatically be cleaned up since other threads might be using them. If a multi-threaded process needs to be able to terminate individual threads — if for example it turns out that the work they are doing is no longer needed — it must keep track of which resources have been allocated and where they are used. These resources can then be cleaned up, if a thread is canceled, by a cleanup handler registered with pthread_cleanup_push(3). For this to be achievable, there must be provision for a thread to record the allocation and deallocation of resources atomically with respect to the actual allocation or deallocation. To support this Pthreads introduces the concept of "cancellation points".

These cancellation points are optional and can be disabled with a call to pthread_setcanceltype(3). If the cancel type is set to PTHREAD_CANCEL_ASYNCHRONOUS then a cancellation can happen at any time. This is useful if the thread is not performing any resource allocation or not even making any system calls at all. In this article, though, we'll be talking about the case where cancellation points are enabled.

On cancellation points and their implementation

From the perspective of an application, a "cancellation point" is any one of a number of POSIX function calls such as open(), and read(), and many others. If a cancellation request arrives at a time when none of these functions is running, it must take effect when the next cancellation-point function is called. Rather than performing the normal function of the call, it must call all cleanup handlers and cause the thread to exit.

If the cancellation occurs while one of these function calls is waiting for an event, the function must stop waiting. If it can still complete successfully, such as a read() call for which some data has been received but a larger amount was requested, then it may complete and the cancellation will be delayed until the next cancellation point. If the call cannot complete successfully, the cancellation must happen within that call. The thread must clean up and exit and the interrupted function will not return.

From the perspective of a library implementing the POSIX Pthreads API, such as the musl C library (which was the focus of the discussions), the main area of interest is the handling of system calls that can block waiting for an event, and how this interacts with resource allocation. Assuming that pthread_cancel() is implemented by sending a signal, and there aren't really any alternatives, the exact timing of the arrival of the cancellation signal can be significant.

If the signal arrives after the function has checked for any pending cancellation, but before actually making a system call that might block, then it is critical that the system call is not made at all. The signal handler must not simply return but must arrange to perform the required cleanup and exit, possibly using a mechanism like longjmp().
If the signal arrives during or immediately after a system call that performs some sort of resource allocation or de-allocation, then the signal handler must behave differently. It must let the normal flow of code continue so that the results can be recorded to guide future cleanup. That code should notice if the system call was aborted by a cancellation signal and start cancellation processing. The signal handler cannot safely do that directly; it must simply set a flag for other code to deal with.

There are quite a number of system calls that can both wait for an event and allocate resources; accept() is a good example as it waits for an incoming network connection and then allocates and returns a file descriptor describing that connection. For this class of system calls, both requirements must be met: a signal arriving immediately before the system call must be handled differently than a signal arriving during or immediately after the system call.

There are precisely three Linux system calls for which the distinction between "before" and "after" is straightforward to manage: pselect(), ppoll(), and epoll_pwait(). Each of these takes a sigset_t argument that lists some signals that are normally blocked before the system call is entered. These system calls will unblock the listed signals, perform the required action, then block them again before returning to the calling thread. This behavior allows a caller to block the cancellation signal, check if a signal has already arrived, and then proceed to make the system call without any risk of the signal being delivered just before the system call actually starts. Rich Felker, the primary author of musl, did lament that if all system calls took a sigset_t and used it this way, then implementing cancellation points correctly would be trivial. Of course, as he acknowledged, "this is obviously not a practical change to make."

Without this ability to unblock signals as part of every system call, many implementations of Pthread cancellation are racy. The ewontfix.com web site goes into quite some detail on this race and its history and reports that the approach taken in glibc is:

    ENABLE_ASYNC_CANCEL();
    ret = DO_SYSCALL(...);
    RESTORE_OLD_ASYNC_CANCEL();
    return ret;

where ENABLE_ASYNC_CANCEL() directs the signal handler to terminate the thread immediately and RESTORE_OLD_ASYNC_CANCEL() directs it to restore the behavior appropriate for the pthread_setcanceltype() setting.

If the signal is delivered before or during the system call this works correctly. If, however, the signal is delivered after the system call completes but before RESTORE_OLD_ASYNC_CANCEL() is called, then any resource allocation or deallocation performed by the system call will go unrecorded. The ewontfix.com site provides a simple test case that reportedly can demonstrate this race.

A clever hack

The last piece of background before we can understand the debate about signal handling is that musl has a solution for this difficulty that is "clever" if you ask Andy Lutomirski and "a hack" if you ask Linus Torvalds. The solution is almost trivially obvious once the problem is described as above so it should be no surprise that the description was developed with the solution firmly in mind.

The signal handler's behavior must differ depending on whether the signal arrives just before or just after a system call. The handler can make this determination by looking at the code address (i.e. instruction pointer) that control will return to when the handler completes. The details of getting this address may require poking around on the stack and will differ between different architectures but the information is reliably available.

As Lutomirski explained when starting the thread, musl uses a single code fragment (a thunk) like:

    cancellable_syscall:
        test whether a cancel is queued
        jnz cancel_me
        int $0x80
    end_cancellable_syscall:

to make cancellable system calls. ("int $0x80" is the traditional way to enter the kernel for a system call by behaving like an interrupt). If the signal handler finds the return address to be at or beyond cancellable_syscall but before end_cancellable_syscall, then it must arrange for termination to happen without ever returning to that code or letting the system call be performed. If it has any other value, then it must record that a cancel has been requested so that the next cancellable system call can detect that and jump to cancel_me.

This "clever hack" works correctly and is race free, but is not perfect. Different architectures have different ways to enter a system call, including sysenter on x86_64 and svc (supervisor call) on ARM. For 32-bit x86 code there are three possibilities depending on the particular hardware: int $0x80 always works but is not always the fastest. The syscall and sysenter instructions may be available and are significantly faster. To achieve best results, the preferred way to make system calls on a 32-bit x86 CPU is to make an indirect call through the kernel_vsyscall() entry point in the "vDSO" virtual system call area. This function will use whichever instruction is best for the current platform. If musl tried to use this for cancellable system calls it would run into difficulties, though, as it has no way to know where the instruction is, or to be certain that any other instructions that run before the system call are all located before that instruction in memory. So musl currently uses int $0x80 on 32-bit x86 systems and suffers the performance cost.

Cancellation for faster system calls

Now, at last, we come to Lutomirski's simple patch that started the thread of discussion. This patch adds a couple of new entry points to the vDSO, the important one for us is pending_syscall_return_address, which determines if the current signal happened during kernel_vsyscall handling and reports the address of the system call instruction. The caller can then determine if the signal happened before, during, or after that system call.

Neither Linus nor Ingo Molnar liked this approach, though their exact reasons weren't made clear. Part of the reason may have been that the semantics of cancellation appear clumsy so it is hard to justify much effort to support them. According to Molnar, "it's a really bad interface to rely on". Even Lutomirski expressed surprise that musl "didn't take the approach of 'pthread cancellation is not such a great idea -- let's just not support it'." Szabolcs Nagy's succinct response "because of standards" seemed to settle that issue.

One clear complaint from Molnar was that there was "so much complexity" and it is true that the code would require some deep knowledge to fully understand. This concern is borne out by the fact that Lutomirski, who has that knowledge, hastily withdrew his first and second attempts. While complexity is best avoided where possible, complexity should not be, by itself, itself a justification for keeping something out of Linux.

Torvalds and Molnar contributed both by exploring the issues to flesh out the shared understanding and by proposing extra semantics that could be added to the Linux signal facility so that a more direct approach could be used.

Molnar proposed "sticky signals" that could be enabled with an extra flag when setting up a signal handler. The idea was that if the signal is handled other than while a system call is active, then the signal remains pending but is blocked in a special new way. When the next system call is attempted, it is aborted with EINTR and the signal is only then cleared. This change would remove the requirement that the signal handler must not allow the system call to be entered at all if the signal arrives just before the call, since the system call would now immediately exit.

Torvalds's proposal was similar but involved "synchronous" signals. He saw the root problem being that signals can happen at any time and this is what leads to races. If a signal were marked as "synchronous" then it would only be delivered during a system call. This is exactly the effect achieved with pselect() and friends and so could result in a race-free implementation.

The problem with both of these approaches is that they are not selective in the correct way. POSIX does not declare all system calls to be cancellation points and, in fact, does not refer to system calls at all. It is only certain API functions that are defined as cancellation points and, as Torvalds clearly agreed that being able to use the faster system call entry made available in the vDSO was important, but neither he nor Molnar managed to provide a workable alternative to the solution proposed by Lutomirski.

Felker made his feelings on the progress of the discussion quite clear:

I'm really frustrated that, again and again, we have kernel folks with no experience with libc implementation trying to redesign something that already has a simple zero-cost design that works on all existing systems, and proposing things that have a mix of immediately-obvious flaws and potential future problems we haven't even thought of yet.

It is certainly important to get the best design, and exploring alternatives to understand why they were rejected is a valid part of the oversight provided by a maintainer. When that leads to the design being improved, we can all rejoice. When it leads to an understanding that the original design, while not as elegant as might be hoped, is the best we can have, it shouldn't prevent that design from being accepted. Once Lutomirski is convinced that he has all the problems resolved, it is to be hoped that a re-submission results in further progress towards efficient race-free cancellation points. Maybe that would even provide the incentive to get race-free cancellation points in other libraries like glibc.

Index entries for this article
Kernel	System calls
GuestArticles	Brown, Neil

This is why we can't have safe cancellation points

Posted Apr 14, 2016 1:49 UTC (Thu) by kjp (guest, #39639) [Link] (10 responses)

FYI I just learned ewontfix's author is the musl author.

http://ewontfix.com/4/ is also a good one. Things that make you shake your head and wonder how anything works. Also apparently everything assumes malloc + threads + fork is safe, but I saw some recent glibc bugs on that too.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 8:52 UTC (Thu) by gutschke (subscriber, #27910) [Link] (9 responses)

fork() and threads are just mutually incompatible. Period.

The preferred solution in the POSIX world is to fork() a helper process as the first thing in main(), before any threads get created. And this helper is then used whenever another process needs to be created. Unfortunately, this is not always possible in existing code bases. It also breaks, if libraries decide to start threads before main() executes. And it can make reaping of child processes more difficult.

If the above concerns pose an insurmountable problem, there are ways to make forking possible; but they require direct system calls, intimate knowledge of the underlying libc implementation and of the kernel, and also knowledge of how the compiler and the dynamic linker interact. It's amazingly difficult to get right.

For many years, I have maintained commercial code that had to implement this hack. It generally worked fine, but the amount of engineering effort to get there was insanely high, and every couple of years new bugs popped up, when the tool chain gradually changed.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 22:46 UTC (Thu) by pikhq (subscriber, #98351) [Link] (6 responses)

The actually preferred POSIX solution is to use the posix_spawn function, which works correctly in the presence of threads. I admit it's a bit of a little-known function, but it's a handy one.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 23:51 UTC (Thu) by wahern (subscriber, #37304) [Link] (5 responses)

posix_spawn exists solely because POSIX makes implementing fork optional so that systems without virtual memory can still meet the letter of the standard.

posix_spawn doesn't magically solve thread race problems. Using fork+exec+dup2 can be just as safe posix_spawn and be significantly more clear. For example, descriptors without the FD_CLOEXEC descriptor flag set will still leak into the new process instance unless you explicitly close them. But which is easier: tediously filling out a posix_spawn_file_actions object, or simply calling close? Often the latter.

Implementations generally implement posix_spawn by using vfork+exec. That does mean that they bypass pthread_atfork handlers. But that's not intrinsically safer: it can cut both ways--maybe one was going to close a descriptor. Anyhow, pthread_atfork handlers are fundamentally broken wrt the original intention, and the specification admits this.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 12:07 UTC (Fri) by oshepherd (guest, #90163) [Link] (4 responses)

posix_spawn is not optional in POSIX. Yes, one of the motivating reasons for it is that it permits profiles of POSIX for no-MMU platforms to spawn new processes, but that is not the only motivation. posix_spawn can also be more efficient (because it does not have to go through the completely generic fork path, and it additionally makes error handling much easier (Have you ever tried reporting execve, dup or close errors back to the parent process? It's not a trivial matter...)

I don't buy the argument that fork+dup+close+exec is easier. Not once you actually handle errors properly or do other things that a Robust Application (TM) should.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 14:04 UTC (Fri) by MrWim (subscriber, #47432) [Link] (1 responses)

> posix_spawn is not optional in POSIX.

A minor point: I understood the parent to mean that fork is optional in POSIX, whereas posix_spawn is mandatory.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 15:05 UTC (Fri) by oshepherd (guest, #90163) [Link]

Uh, thinko. I meant to say "fork is not optional in POSIX." Indeed, posix_spawn is optional, but fork is not (that said, if there are any important platforms where POSIX spawn doesn't exist - I'm thinking probably OS X here - it'd be relatively easy for somebody to produce a compatibility shim which implemented it on top of fork/execve/pipe/close/dup/etc)

This is why we can't have safe cancellation points

Posted Apr 21, 2016 5:13 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

You're completely right: posix_spawn is optional (part of the Spawn extension) and fork is not. I even had the standard open but didn't bother confirming what I asserted. Shameful....

Regarding error checking: routines like posix_spawn_file_actions_addclose can fail with ENOMEM, and AFAICT both the glibc and musl implementations allocate memory even on the first add--musl for each individual action, glibc for 8 actions. Because allocation can fail even on Linux with OOM (e.g. policy-based resource limits), regardless of allocation size, correct code needs to check for failure on each individual descriptor action added to the queue. So posix_spawn requires the same number of error checks.

OTOH, even the most pedantic of developers could choose to ignore errors from close() (or posix_close() if and when http://austingroupbugs.net/view.php?id=529 is adopted) in the child process. Apropos this article, you no longer need to worry about close being a thread cancellation point in the child, and blocking all signals is easy, so EINTR won't happen. And EBADF shouldn't happen in correctly written software.[1] Alternatively, you could choose to set FD_CLOEXEC, which doesn't even have an EINTR failure mode. Arguably dup2 could correctly be ignored--I have a hard time imagining a failure condition where dup'ing a descriptor over an already open stdio descriptor could fail, though that does depend on some assumptions and it's not something I would do anyhow.

Point being, explicit fork+exec could in some situations take less code than posix_spawn because you could elide some error checks. And I can't imagine a situation where it could take appreciably more code.

More importantly, though, is the point that posix_spawn doesn't solve threading race conditions. The only possibly plus in this regard is that posix_spawn will correctly block signals during the operation so that, e.g., a signal handler isn't wrongly called in the child but before exec.[2] Conspicuously missing, on the other hand, is the ability to set the umask in the child process. Setting or even querying the umask simply can't be done in a race-free manner in a threaded application, unless no other thread relies on the umask, or if you fork and report back the umask.

While there's nothing intrinsically wrong with using posix_spawn, it shouldn't be used for the wrong reasons. You still have to carefully consider the important stuff.

[1] EBADF invariably means you have a bug in your application, often a thread race or in single-threaded non-blocking I/O code an ordering issue. I refuse to ignore EBADF in my event loop and polling libraries (unlike libevent and similar libraries) despite people complaining to me how annoying it is to propagate it. Such a bug could easily lead to stalled network I/O. I'm convinced it's is a very common problem in non-blocking I/O networking daemons, but that its rare enough that people chaulk it up to network hiccups. So I propagate EBADF when manipulating a descriptor event because it's not the library's prerogative to hide such an error, and it can't possibly know whether the error is benign, recoverable, or panic-worthy. Though as with ENOMEM, library state remains consistent after the error so that recovery isn't foreclosed.

[2] pthread_sigmask has no failure mode when used correctly, so it's just two lines of condition-less code when using fork+exec. Though I learned a few years ago over on comp.unix.programmer that one should initialize a sigset_t object with sigemptyset before passing as the _output_ argument to pthread_sigmask and similar routines. Some implementations will logical-OR the signal set, rather than writing over the entire sigset_t object. See also http://pubs.opengroup.org/onlinepubs/9699919799/functions.... I admit this is one case where using posix_spawn has a clear benefit over fork+exec. I just don't think that in the grand scheme of things it amounts to much. Descriptor leakages and umask races, for example, are arguably far-and-above the bigger problem, especially from a security perspective, and posix_spawn provides no benefit and in some cases is more limited.

This is why we can't have safe cancellation points

Posted Apr 25, 2016 14:39 UTC (Mon) by nix (subscriber, #2304) [Link]

Arguably dup2 could correctly be ignored--I have a hard time imagining a failure condition where dup'ing a descriptor over an already open stdio descriptor could fail, though that does depend on some assumptions and it's not something I would do anyhow.

A brief glance at do_dup2() in the kernel (or, for that matter, at a sufficiently recent manpage) reveals that it can fail with -EBUSY if the file descriptor it's being asked to dup over is still being opened, so (just as with -EINTR) a retry loop would be needed for perfect safety in this situation.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 23:30 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

Forking a helper process from main just to execute other programs sounds like cargo cult advice.

Calling fork from a thread is just as safe or unsafe as another common asynchronous context: signals. Pthread mutexes aren't async-signal safe, either. So you _can_ use fork from a threaded application, just make sure that between fork and exec you only call async-safe routines--typically syscalls. It's perfectly safe to call routines like close, dup2, mmap, etc. Definitely do not use FILE handles, malloc, etc.

Threading in general is incompatible with any code which doesn't pay attention to global state and other race conditions. But if you use a third-party library which doesn't internally implement locking, but which is very carefully designed not to touch global state and which behaves correctly as long the caller synchronizes on an instance object before calling its methods, that's not something I would call "incompatible" with threading even though safe behavior it's not automatic. The POSIX standard and implementations are likewise very carefully designed to permit mixed-use, but likewise require you to heed certain well-specified constraints. It's not magic.

The lesson here isn't that it's impossible to mix these things; it's that you really need to pay attention. If you can't be bothered to think these things through; if you can't be bothered to read the freely available and relatively concise standards, if only to confirm what you've read on Stack Overflow or elsewhere; if you can't be bothered to otherwise catalog and verify your assumptions; then not only should you not mix forking and threading, you probably shouldn't be doing any kind of threaded programming, period.

That advice applies to everybody, regardless of skill, myself included, because you devise solutions to problems based on your resources. If you don't have the time (presuming you have the knowledge and faculties) to implement something correctly using certain tools, you should use safer tools. This is why even the most expert of C lovers regularly use scripting languages, and why people using so-called "safe" languages use even safer languages on occasion.

FWIW, forking and threading is well-defined by POSIX. From the specification for fork in IEEE Std 1003.1, 2013 Edition,

A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called. Fork handlers may be established by means of the pthread_atfork() function in order to maintain application invariants across fork() calls. When the application calls fork() from a signal handler and any of the fork handlers registered by pthread_atfork() calls a function that is not async-signal-safe, the behavior is undefined.

Likewise, the advice that "the only safe thing to do from a signal handler is to set a flag" is also not true. It's intended to scare junior programmers (or lazy programmers) away from interfaces that are difficult to use. They require enough experience and knowledge to be able verify correct usage. But IMO, if you're programming any non-trivial application in C or C++, you should at least understand how to _disprove_ correct usage. Which is to say, you should be able to explain _why_ code implementing a signal handler is unsafe, even if you're not sure if or how it could be made safe, and even if it requires consulting a specification or other primary source.

A corollary is that good C programmers should be well practiced at reading the C, POSIX, or other relevant standard. Manual pages, especially on Linux, are definitely not up to snuff, FWIW. They're woefully incomplete, and too often resort to scare tactics. Whereas the POSIX specification is more complete, well laid-out, and has useful overviews and discussions relevant to subsystems and specific routines. It's the best starting point, though for many situations there's no substitute for verifying behavior by consulting the source code.

This is why we can't have safe cancellation points

Posted Apr 18, 2016 17:10 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Thank you for dispelling some of the incorrect received wisdom one sees circulating around programming communities. Signals are not magic. Fork is not magic. All of these things are just powerful tools that become useful when understood. It really is a wonder anything works at all when people program with the kind of incorrect models you rail against.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 5:24 UTC (Thu) by mjthayer (guest, #39183) [Link] (28 responses)

It is still early in the morning, so I may write something silly now... if the interface is so bad it is hardly worth supporting and there is a solution now (int 0x80) which works, albeit slowly, is it feasible to only use this solution when cancelling is enabled? Or some similar fast and slow path solution?

This is why we can't have safe cancellation points

Posted Apr 14, 2016 7:51 UTC (Thu) by khim (subscriber, #9252) [Link] (27 responses)

For embedded library like newlib - may be. For general-purpose library like musl - nope. You don't "enable" cancelling, you just use it.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 19:40 UTC (Thu) by luto (guest, #39314) [Link] (25 responses)

You do enable cancellation, though -- by default, system calls are not cancellable AFAIK.

So musl could (and, AFAICT, does) use int $0x80 only for cancellable syscalls.

(FWIW, the actual meat of the patch I wrote was fine, I think. The issue was that building vdso code at all is a giant mess and I broke the build system.)

This is why we can't have safe cancellation points

Posted Apr 14, 2016 21:18 UTC (Thu) by khim (subscriber, #9252) [Link] (23 responses)

Uhm. Perhaps I'm misreading something but AFAICS the cancelability state and type of any newly created threads, including the thread in which main() was first invoked, shall be PTHREAD_CANCEL_ENABLE and PTHREAD_CANCEL_DEFERRED respectively means exactly what I wrote: you don't enable cancellation, you just use it. You can disable it, sure - but that's not default.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 21:28 UTC (Thu) by luto (guest, #39314) [Link] (22 responses)

IMO that's simply daft.

Perhaps someone should attempt to change the standard to work the other way (default is PTHREAD_CANCEL_DISABLE and PTHREAD_CANCEL_DEFERRED). After all, no sensible program uses cancellation, so why make them pay the price?

This is why we can't have safe cancellation points

Posted Apr 14, 2016 22:30 UTC (Thu) by khim (subscriber, #9252) [Link]

POSIX had that requirement for the last 20 years or so, I'm afraid it's too late to change it.

If someone wants to introduce drastic, potentially disruptive, change to POSIX then it would be significantly more sane to just make them optional in POSIX and remove them from libraries like Musl and GLibc, don't you think?

This is why we can't have safe cancellation points

Posted Apr 15, 2016 0:04 UTC (Fri) by neilbrown (subscriber, #359) [Link] (6 responses)

> After all, no sensible program uses cancellation, so why make them pay the price?

This is the part of the story that didn't make much sense to me. Why do you think cancellation is such a bad idea?
I appreciate that it only gets about a 3 or 4 on Rusty's API Design scale but they provide an extremely light-weight mechanism to protect threads from dying at awkward moments.
The only alternative I can see is for an application to use an ad-hoc signaling mechanism and for threads to only use non-blocking versions of 'accept' and other interfaces that allocate resources.
Apart from wheel-reinvention, that would be an interface that libraries couldn't share.
You and Rich have made it clear that cancellation can be implemented correctly and efficiently. So there isn't really any price to be paid. Let's just do it and move on ???

This is why we can't have safe cancellation points

Posted Apr 15, 2016 0:11 UTC (Fri) by luto (guest, #39314) [Link] (5 responses)

Using it correctly is a real pain. Using it correctly in C++ is even worse.

AFAICT the only way to use it safely is to have cancellation off *except* at very carefully selected points and to turn it on at those points. Every cancellation point then needs to be aware that the thread can go away without unwinding.

ISTM any code that actually does this would be better off using ppoll, etc.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 0:38 UTC (Fri) by neilbrown (subscriber, #359) [Link] (4 responses)

This is assertion without substance.

Surely I can:
1/ create a data structure that contains a list of all resources I might hold (file descriptor, byte range locks).
2/ register a cleanup handler which walks that data structure and frees everything.
3/ write simple wrappers for open/accept/whatever which record the results in the data structure
4/ just call those wrappers, never the bare API.

Then if I ever get canceled, everything will be cleaned up nicely.

I would need to disable cancellation while manipulating a data structure shared with other threads, but I see cancellation more as being appropriate for largely independent threads.

What specific risks do you see if cancellation is mostly enabled?

This is why we can't have safe cancellation points

Posted Apr 15, 2016 7:21 UTC (Fri) by khim (subscriber, #9252) [Link] (2 responses)

What specific risks do you see if cancellation is mostly enabled?

I think the problem is simple inefficiency.

Cancellation support is not free - even if it's not used.

And even your "simple" scheme includes many steps and couldn't arrive in a random program by accident.

Surely if you change a design of your program that much to make it possible to use cancellation you could as well go and create wrapper for pthread_create which will call pthread_setcancelstate(p), too?

This is why we can't have safe cancellation points

Posted Apr 15, 2016 8:00 UTC (Fri) by neilbrown (subscriber, #359) [Link] (1 responses)

> I think the problem is simple inefficiency.

Specifically? The solution used by musl costs almost nothing except on x86_32 and the change to make it work well on x86_32 has zero extra performance cost.

> And even your "simple" scheme includes many steps and couldn't arrive in a random program by accident.

I'm failing to parse that... Certainly you wouldn't put any code in any program by accident (I hope) ??
The random program probably never cancels threads so it wouldn't want these steps anyway.

> Surely if you change a design of your program that much to make it possible to use cancellation you could as well go and create wrapper for pthread_create which will call pthread_setcancelstate(p), too?

I fail to see how this would solve anything at all.

Many applications never cancel any threads. They are irrelevant. They need do nothing and they suffer no cost (maybe a couple of instructions per syscall. If you can't afford that, hand-code your systemcalls).

Some applications do find value in the ability to cancel threads. Those threads clearly need to be prepared to be canceled. Being prepared is not zero work, but it is not too onerous.

If the thread is not doing any resource allocation, maybe just computing pi to a few million bits, then it can deliberately request async cancellation and go about its business.
If the thread is allocating resources then it naturally needs to make sure they get de-allocated. Any code already needs to worry about this. Code that can be canceled needs to do maybe 10% more work.
It can disable cancellation over a short allocate/use/deallocate sequence that won't block. Or it can register a cleanup helper and record the allocation in some array or something.
If your allocations follow a strict LIFO discipline you can even
- alloc
- push cleanup handler
- use the allocation
- pop the cleanup handler

Which makes for nice clean code with the certainty that the cleanup handler will run even if the thread is canceled.
The point of deferred cancellation is that this can be done with no locking, no extra system calls. It just needs a little care - like not logging any messages between the allocation and pushing the cleanup handler.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 15:10 UTC (Fri) by khim (subscriber, #9252) [Link]

Many applications never cancel any threads. They are irrelevant.

99.9% of all programs (and I've picked conservative number) are irrelevant? That's novel idea to me.

They need do nothing and they suffer no cost (maybe a couple of instructions per syscall. If you can't afford that, hand-code your systemcalls).

When proportion is this skewed even these two instructions make no sense: why should 99.9% of all the apps suffer at all if this could be avoided?

The natural response would: because I could just take bits and pieces from these 99.9% apps and use these to build these rare few apps which do use cancellation.

But as you've shown you couldn't just take random working code from working library, plug it in a program which uses cancellation and hope that the end result would work.

ALL code must be carefully designed in such a program. And if ALL code is specifically written for such a program then additional burden of adding couple of pthread_setcancelstate calls here and there wouldn't be large at all!

The argument that "hey, I don't know where and how threads are created in this large program" wouldn't fly: if you don't know even that much about your program/library/whatever then how could you be sure that you control is enough to even try to attempt to use cancellation of threads?

It just needs a little care - like not logging any messages between the allocation and pushing the cleanup handler.

Sure. But if that's called "a little care" then "you also need to call pthread_setcancelstate(PTHREAD_CANCEL_ENABLE) in each thread" wouldn't a large problem...

This is why we can't have safe cancellation points

Posted Apr 15, 2016 17:17 UTC (Fri) by nix (subscriber, #2304) [Link]

Quite. The program I work on for my day job does precisely that, with subsidiary threads whose primary job is ptrace()ing and waitpid()ing, but which may be commanded to go away by a controlling 'master' thread with many more jobs. All the work of such subsidiary threads is associated with a single structure relating to a specific subprocess under monitoring, and it is easy (and good style) to make sure that this structure is properly freeable at all times (you need that for decent error handling anyway). The cleanup handler then just needs to do the same 'shut down and free everything associated with the process structure' that we have to do when the subprocess dies anyway (indeed, we simply call the cleanup handler by hand in that situation).

I've had problems with the multithreading in that code, but they were all races associated with mutexes and condition variables. The nature of synchronous cancellation has caused me zero problems.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 19:07 UTC (Fri) by ballombe (subscriber, #9523) [Link] (13 responses)

There are lot of programs using cancellation without mixing them with syscall and they work fine.
The trick is to have the parent allocate and free all resources in advance, and use robust
data structures like stacks.
If C++ is broken, do not use it for thread.

The fact that something is broken in some corner case does not make is useless in other case.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 19:16 UTC (Fri) by luto (guest, #39314) [Link] (12 responses)

This code that supposedly works fine only works fine if it never calls into libraries, since those libraries might use malloc, or open, or mutexes, or anything else that acquires resources.

pthread cancellation is very dangerous, is useful only for specialized cases, and IMO should never have been enabled by default.

If it were simply disabled by default, then this performance issue would be irrelevant.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 19:40 UTC (Fri) by nix (subscriber, #2304) [Link] (10 responses)

But this breaks all programs that assume it is enabled by default and then proceed to call pthread_cancel() on their own threads.

Changing longstanding defaults like this constitutes a break of userspace. You need a new -D flag (which, perhaps, sets a new ELF note, or simply triggers the linking in of a new crt1.o which flips the default) to ensure that this only happens to programs that are prepared for it.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 19:49 UTC (Fri) by luto (guest, #39314) [Link] (9 responses)

Agreed. I'm not saying that changing the default is actually wise. But it might be enough of a simplification and a performance benefit to make it worthwhile.

This is why we can't have safe cancellation points

Posted Apr 15, 2016 20:19 UTC (Fri) by nix (subscriber, #2304) [Link] (8 responses)

A performance benefit? What performance cost is there to a single address comparison in a relatively rare path? (And as for complexity cost, well, hell, that sort of backward-compatibility burden is why systems get more complex over time, but that doesn't mean we can cavalierly throw users over the wall. codesearch.debian.net shows quite a lot of users, and yes, many of them are real users. :) )

(Now I'd agree that *asynchronous* cancellation is nearly impossible to program to and has an even smaller use case than synchronous cancellation, but even *it* is useful sometimes, particularly as a transient thing; e.g. when a thread that otherwise is synchronously cancellable is doing a long-running computation that it knows does no syscalls and can be safely unwound from the cleanup handler.)

This is why we can't have safe cancellation points

Posted Apr 15, 2016 20:34 UTC (Fri) by luto (guest, #39314) [Link] (7 responses)

It prevents syscall inlining. The impact is small but nonzero.

This is why we can't have safe cancellation points

Posted Apr 17, 2016 23:12 UTC (Sun) by nix (subscriber, #2304) [Link] (6 responses)

Hmmm. I see now -- when HJ's sycall inlining does turn up, or anything like it, you don't have one address to compare to any more, you have a great heap of them all across glibc, and there is obviously no way to do any similar comparison (I can think of ridiculously overdesigned ways to do it involving searching a tree of address ranges, but they'd all be *far* too slow and blow the dcache sky-high: just no).

Given that syscall inlining isn't something you can possibly turn on and off at runtime -- the inlining is, after all, into glibc, so you'd need multiple copies of glibc via hwcaps, which seems total overkill for this and would totally negate any saving via massive icache bloat -- you'd not be able to fix this by changing a *default*. You'd need to basically give up on fixing this race, or give up on fixing it this way, or break cancellation completely for everyone (a total non-starter).

Hmm. Too late at night, but I'll think on this. Either I have a niggling germ of a possible idea for a fix for this at the edge of my brain, or I'm just tired and hallucinating. (Or both!)

This is why we can't have safe cancellation points

Posted Apr 18, 2016 0:21 UTC (Mon) by luto (guest, #39314) [Link] (1 responses)

I think you could do it the way the kernel does the exception table -- just make a sorted list of pairs of starts and ends of cancellable regions. You only need to check it when your cancellation signal is delivered, and the data cache impact of *that* is basically irrelevant.

But you could do it by flipping the default if you're willing to accept a branch: just test the cancellable flag and jump out of line if needed. This is no worse than the existing musl thing in which each cancellable syscall needs to test the cancallable flag anyway to see if it needs to cancel even without a signal being sent.

This is why we can't have safe cancellation points

Posted Apr 18, 2016 10:28 UTC (Mon) by nix (subscriber, #2304) [Link]

Oh, I assumed you were necessarily taking the cost of the cancellable test anyway (the branch is near-zero cost in the common case, because it obviously has a prediction hint). Were you trying to avoid even that?

This is why we can't have safe cancellation points

Posted Apr 18, 2016 4:20 UTC (Mon) by neilbrown (subscriber, #359) [Link] (3 responses)

> you don't have one address to compare to any more

No, but you probably have one sequence of op-codes to compare.
The comparisons might be a little more complex than "memcmp" but could you not test "is this EIP value within a thunk" by comparing surrounding bytes against the standard thunk at each of the (very few) possible offsets?

This is why we can't have safe cancellation points

Posted Apr 18, 2016 10:27 UTC (Mon) by nix (subscriber, #2304) [Link] (2 responses)

Good point. And yes, you do have one sequence, it's extremely stereotyped, and it'll always be in the icache (at syscall entry, anyway). (On x86-64 -- on x86-32, this is irrelevant, because there, even the 'inlined' syscalls (INTERNAL_SYSCALL users) are still doing a GOT lookup and an indirect jump (on x86-32, anyway) to the vDSO syscall entry point.)

If you can get away with scanning for this only when cancellation is actually detected, it seems that the cost would be very low, though the complexity would obviously be higher than a simple address comparison, and it would tie that part of the kernel to these fairly fine and arch-dependent details of glibc's implementation, in a way that would probably not be spotted fast if it broke :(

This is why we can't have safe cancellation points

Posted Apr 18, 2016 11:10 UTC (Mon) by itvirta (guest, #49997) [Link] (1 responses)

> And yes, you do have one sequence, it's extremely stereotyped, and it'll always be in the icache

Stupid question: Does the instruction cache help if you're reading the instruction bytes as data?

This is why we can't have safe cancellation points

Posted Apr 20, 2016 16:56 UTC (Wed) by nix (subscriber, #2304) [Link]

Hm, no, but the L2+ caches are unified on many models (e.g. on all the Intel x86-64 CPUs I have access to, Nehalem and later), and getting stuff from L2 cache is still immensely faster than getting it from RAM, fast enough that you can often consider it free for applications like this.

This is why we can't have safe cancellation points

Posted Apr 16, 2016 13:52 UTC (Sat) by ballombe (subscriber, #9523) [Link]

> This code that supposedly works fine only works fine if it never calls into libraries, since those libraries might use malloc, or open, or mutexes, or anything else that acquires resources.

Yes, so ? This is a static property of the code.

This is why we can't have safe cancellation points

Posted Apr 19, 2016 14:40 UTC (Tue) by mjthayer (guest, #39183) [Link]

Another potentially daft idea: have the signal handler do nothing but set the cancel flag, but also have a flag to say when the thread is in a critical code section, that is, about to start a cancellable system call, already (or nearly already) checked the cancel flag and not yet (or only just) exited the system call. Then pthread_cancel() in the library just keeps resending the signal until the critical section flag is cleared again. Not nice, but not much cost for people who don't need cancellation.

This is why we can't have safe cancellation points

Posted Apr 18, 2016 17:11 UTC (Mon) by quotemstr (subscriber, #45331) [Link]

Ah, now if only musl supported dlclose, by the same logic. dlclose is also one of those routines that's perfectly safe when you understand how to use it. Felker just doesn't like it.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 15:41 UTC (Thu) by karkhaz (subscriber, #99844) [Link] (1 responses)

Editor: Musl's lead developer is called Rich Felker, not Rick.

This is why we can't have safe cancellation points

Posted Apr 14, 2016 15:45 UTC (Thu) by jake (editor, #205) [Link]

> Editor: Musl's lead developer is called Rich Felker, not Rick.

Indeed. Fixed now. Thanks for the report.

(I will take this opportunity to remind folks that typo reports and such should go to lwn@lwn.net)

jake

This is why we can't have safe cancellation points

Posted Apr 15, 2016 17:21 UTC (Fri) by pm215 (subscriber, #98099) [Link]

A lot of this post sounded remarkably familiar, because it turns out that QEMU's linux-user code (where it emulates a binary for one architecture on a host with a different architecture, passing system calls through to the host) needs to do a very similar trick with a signal handler that has to look at the interrupted PC to see whether it was just before or just after the syscall instruction.

(For QEMU the problem that has to be solved is making sure that incoming signals interrupt emulated guest system calls -- if the signal arrives before we execute the host syscall instruction we must abandon emulation of the guest syscall, otherwise we might block forever. There's no way to close the race window completely without having the signal handler check the PC to see "did we actually execute that instruction yet?".)