
How to fix an ancient GDB problem

By Jonathan Corbet
September 29, 2022

The GDB debugger has a long history; it was first created in 1986. It may thus be unsurprising that some GDB development happens over relatively long time frames but, even when taking that into account, the existence of an open bug first reported in 2007 may be a little surprising. At the 2022 GNU Tools Cauldron, GDB maintainer Pedro Alves talked about why this problem has been difficult to solve, and what the eventual solution looks like.

The problem in question, Alves said, has to do with the handling of keyboard interrupts, which normally result from the user hitting control-C. The user's normal expectation is that an interrupt within GDB while the target program is running will stop the program and return the GDB prompt. If, however, that program has blocked the SIGINT signal, the interrupt will never be delivered. At best, GDB will not stop; at worst, the entire debugging session can become stuck and need to be killed from another terminal. GDB users, it seems, tend not to like that behavior.

This problem results from how GDB handles both terminals and interrupt signals. A "session", in the Unix sense, is a set of process groups, all of which share a single controlling terminal. Normally, the debugged process runs in the same session as — and shares the terminal with — GDB, but GDB puts that process into a different process group. Multiple process groups can share a terminal, but only one of those — the foreground group — will receive signals generated by the user at that terminal. GDB normally runs as the foreground group but, when it runs the target program, it designates that program's group as the foreground group instead.
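The relationship between sessions and process groups can be observed directly. The following is a minimal sketch, not GDB's code: a forked child moves itself into a new process group while staying in its parent's session. The function name `child_in_own_process_group` is made up for this illustration, and the step of designating a foreground group (which needs a real controlling terminal) is left out.

```python
import os


def child_in_own_process_group() -> dict:
    """Fork a child that moves into its own process group, and compare IDs."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stay in the parent's session, but become the leader of a
        # new process group, as a debugger does for the program it runs.
        try:
            os.close(read_fd)
            os.setpgid(0, 0)
            os.write(write_fd, f"{os.getsid(0)} {os.getpgrp()}".encode())
        finally:
            os._exit(0)

    # Parent: read the session and process-group IDs reported by the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_sid, child_pgid = map(int, pipe.read().split())
    os.waitpid(pid, 0)
    return {
        "same_session": child_sid == os.getsid(0),
        "own_group": child_pgid == pid,
        "different_group": child_pgid != os.getpgrp(),
    }
```

All three values should come back true: the child shares the session (and thus the terminal) but sits in a process group of its own.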

Normally, if the target process receives a SIGINT signal, it will be intercepted by GDB; that happens as part of how tracing with ptrace() works. GDB will respond by stopping the target program and putting out a prompt; the signal is never actually delivered to that program. If, however, the program has blocked SIGINT then the signal remains pending; since it is never delivered, ptrace() has nothing to intercept. That can result in everything getting stuck. There are other paths to the same situation; sigwait() calls, for example, can consume pending signals in a way that causes them to never actually be delivered.
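Both behaviors can be reproduced without a debugger. The sketch below is only an illustration (the function name `blocked_sigint_demo` is invented here): it blocks SIGINT, sends the signal to the calling thread, and shows that the signal sits in the pending set until sigwait() consumes it, so no handler ever runs. It uses a thread-directed signal to stay deterministic, whereas a terminal would send a process-directed one.

```python
import signal
import threading


def blocked_sigint_demo():
    """Return (pending_before, consumed_signal, pending_after)."""
    # Block SIGINT in the calling thread and remember the previous mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # Send SIGINT to this thread. Because it is blocked, it is not
        # delivered: no handler runs and no KeyboardInterrupt is raised.
        signal.pthread_kill(threading.get_ident(), signal.SIGINT)
        pending_before = signal.SIGINT in signal.sigpending()

        # sigwait() takes the signal off the pending set directly, so it
        # is consumed without ever being delivered to a handler.
        consumed = signal.sigwait({signal.SIGINT})
        pending_after = signal.SIGINT in signal.sigpending()
        return pending_before, consumed, pending_after
    finally:
        # Safe to unblock: nothing is pending any more.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

A tracer that depends on seeing the signal at delivery time has nothing to observe at any point in this sequence.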

The solution, Alves said, is the same as for any other problem in computer science: add another layer of indirection. In this case, that layer takes the form of a pseudo-terminal (PTY) that is given to the target process rather than the real controlling terminal. GDB then acts as an intermediary between the two terminals. Any output written by the target program to the PTY is simply copied to the real terminal. Input is a bit trickier, since the target can have changed the terminal's modes; GDB has to put the real terminal into raw mode, then copy all of the input from the real terminal into the PTY. When the target is not running, the terminal is put back into "readline mode" for interaction with GDB.
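The output half of that arrangement can be sketched in a few lines. This is an assumption-laden illustration rather than GDB's implementation: the function name `relay_child_output` is hypothetical, the "target" is a forked child that writes one message, and the steps of giving the child its own session and controlling terminal, and of copying input in raw mode, are omitted.

```python
import os
import pty
import select


def relay_child_output(message: bytes) -> bytes:
    """Run a child that writes to a PTY; collect what the parent would relay."""
    master_fd, slave_fd = pty.openpty()
    pid = os.fork()
    if pid == 0:
        # Child (stand-in for the target): it only sees the PTY's slave side.
        try:
            os.close(master_fd)
            os.write(slave_fd, message)
        finally:
            os._exit(0)

    # Parent (stand-in for the debugger): read from the master side. A real
    # relay would copy each chunk to the real terminal instead of storing it.
    chunks = []
    exited = False
    while True:
        readable, _, _ = select.select([master_fd], [], [], 0.05)
        if readable:
            chunks.append(os.read(master_fd, 1024))
        elif exited:
            break  # child is gone and the PTY has been drained
        else:
            exited = os.waitpid(pid, os.WNOHANG)[0] == pid

    os.close(slave_fd)
    os.close(master_fd)
    return b"".join(chunks)
```

Because every byte passes through the parent, the parent decides when it reaches the real terminal. Note that the terminal line discipline may translate newlines on the way through.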

Now, the target can do anything it wants with SIGINT without affecting GDB, which, as the foreground process on the real terminal, can handle events directly. Since that terminal is in raw mode, that means recognizing the interrupt character and responding accordingly. There are other advantages as well; since GDB remains in control of when output goes to the (real) terminal, it can avoid intermixing its own output with that from the target. Another advantage is that GDB is now able to preserve the user's thread selection (the specific thread that debugging activity is focused on) after an interrupt; this wasn't possible before.
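With the real terminal in raw mode, the kernel no longer turns control-C into a signal, so the intermediary has to spot the byte itself. A minimal sketch of that check follows; the name `filter_input` and the default of `0x03` for the interrupt character are assumptions made for this example, not details taken from the patch set.

```python
from typing import Tuple

# control-C as a raw byte; assumed to be the configured interrupt character.
INTERRUPT_CHAR = b"\x03"


def filter_input(data: bytes,
                 interrupt_char: bytes = INTERRUPT_CHAR) -> Tuple[bytes, bool]:
    """Split raw terminal input into bytes to forward and an interrupt flag.

    Bytes before the interrupt character are meant for the target's PTY.
    The interrupt character and anything after it are not forwarded; the
    caller would stop the target and return to its own prompt instead.
    """
    index = data.find(interrupt_char)
    if index == -1:
        return data, False
    return data[:index], True
```

Making the interrupt character a parameter means a different key could be chosen, in which case control-C would simply be forwarded to the target like any other byte.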

There is, he said, an "escape hatch" for anybody wanting the previous behavior; it needs to be there to support other Unix systems in any case.

There were a few other remaining problems, he said. The process that creates a session is considered the "session leader" by the kernel; if that process exits, the processes in the session's foreground process group will be sent a SIGHUP signal. Most applications are not prepared for that and will be killed as a result. Now that the target has its own terminal, it becomes the session leader once it starts. If that process forks and exits, its child processes are likely to meet an untimely end, which is not the debugging experience that the user is likely to have had in mind.

The solution in this case is a variation on the double-fork technique; before launching the target, GDB will fork twice, with the first process doing nothing but waiting. It will become the session leader; since it doesn't exit, no SIGHUP signals will be generated if the target does surprising things.
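The shape of that technique can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch set: the function name `launch_with_idle_session_leader` is invented, the "target" merely reports its IDs and exits, and the PTY setup is left out.

```python
import os


def launch_with_idle_session_leader() -> dict:
    """Fork twice so an idle process, not the target, leads the new session."""
    read_fd, write_fd = os.pipe()
    leader_pid = os.fork()
    if leader_pid == 0:
        # First fork: create a new session and do nothing but wait.
        try:
            os.close(read_fd)
            os.setsid()
            target_pid = os.fork()
            if target_pid == 0:
                # Second fork: stand-in for the target program. It lives in
                # the leader's session without being the session leader.
                os.write(write_fd, f"{os.getpid()} {os.getsid(0)}".encode())
                os._exit(0)
            os.close(write_fd)
            os.waitpid(target_pid, 0)
        finally:
            os._exit(0)

    # Original process (stand-in for the debugger): collect the report.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        target_pid, target_sid = map(int, pipe.read().split())
    os.waitpid(leader_pid, 0)
    return {
        "leader_pid": leader_pid,
        "target_pid": target_pid,
        "target_sid": target_sid,
    }
```

The target's session ID matches the waiting process's PID rather than its own, so the target can fork and exit without the session losing its leader.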

GDB still has to be able to stop programs that block SIGINT; for obvious reasons, it cannot use SIGINT for that purpose. The solution here, he said, is to use SIGSTOP, which cannot be blocked, instead.
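The difference between the two signals is easy to demonstrate outside of a debugger. In this sketch (the function name `stop_process_that_blocks_sigint` is made up, and plain `kill()` plus `waitpid()` stand in for what GDB does through ptrace), a child blocks SIGINT and keeps running when sent one, but is stopped by SIGSTOP.

```python
import os
import signal
import time


def stop_process_that_blocks_sigint() -> dict:
    """Show that SIGSTOP stops a child that SIGINT cannot reach."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: block SIGINT, report readiness, then idle forever.
        try:
            os.close(read_fd)
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
            os.write(write_fd, b"r")
            while True:
                time.sleep(1)
        finally:
            os._exit(0)

    # Parent: wait until the child has blocked SIGINT.
    os.close(write_fd)
    os.read(read_fd, 1)
    os.close(read_fd)

    # SIGINT stays pending in the child, so its state does not change.
    os.kill(pid, signal.SIGINT)
    time.sleep(0.2)
    reaped, _ = os.waitpid(pid, os.WNOHANG | os.WUNTRACED)
    survived_sigint = reaped == 0

    # SIGSTOP cannot be blocked, caught, or ignored.
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status) and os.WSTOPSIG(status) == signal.SIGSTOP

    # Clean up the stopped child.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return {"survived_sigint": survived_sigint, "stopped_by_sigstop": stopped}
```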

As is often the case, Emacs users present their own special challenges. Emacs uses control-C for its own purposes, and remaps SIGINT to control-G instead. In cases like this, the user almost certainly wants control-C to be passed through to the target. The answer is a GDB command that allows the user to specify which key should interrupt the process and return to the GDB prompt.

This patch was first prototyped in 2019, but didn't make it to the GDB list until 2021. There were a few problems that turned up at that point, including the session-leader difficulty. Those have all been resolved, and Alves intends to post the patch set again sometime soon. His objective, he concluded, is to post it at least once per year until the problem is finally solved.

[Thanks to LWN subscribers for supporting my travel to this event.]




How to fix an ancient GDB problem

Posted Sep 29, 2022 22:41 UTC (Thu) by dullfire (guest, #111432) [Link] (2 responses)

Wow. I've long felt gdb to be extremely unstable. I suspect much of it was this bug. I have tried digging into the issues... however getting into cases where gdb segfaults... And attempting to run gdb under gdb yields the top level gdb segfault... is NOT a happy place to be (the segfaults are definitely unrelated to this... and sadly I never managed to figure out what was causing them).

Anyhow, this is sure to make my debugging time less of "fury/anguish at my debugging tools". Thanks for the explanation and write up.

How to fix an ancient GDB problem

Posted Sep 30, 2022 7:36 UTC (Fri) by fw (subscriber, #26023) [Link] (1 responses)

The segfault when running GDB under GDB might come from Guile support, or more precisely, from the Boehm-Demers-Weiser garbage collector library Guile uses: in some version, that library probes mapping boundaries, deliberately triggering segfaults, rather than enumerating shared objects using the dl_iterate_phdr function. If that's indeed the problem you could build GDB without Guile support, or apply this garbage collector patch.

How to fix an ancient GDB problem

Posted Sep 30, 2022 13:17 UTC (Fri) by gray_-_wolf (subscriber, #131074) [Link]

I've encountered this too (when trying to debug segfaulting of guile itself). Yet another option (that I've used) is to just continue over the initial segfault. Worked for me.

How to fix an ancient GDB problem

Posted Sep 30, 2022 2:53 UTC (Fri) by nyanpasu64 (guest, #135579) [Link] (3 responses)

My personal experience with a possibly-related bug: when debugging programs which handle Ctrl+C themselves (like PipeWire and Audacious), pressing Ctrl+C will trigger the app's graceful shutdown process (terminating execution) rather than breaking into the debugger like I want. Will the changes to gdb fix this issue?

Also is "a GDB command that allows the user to specify which key should interrupt the process and return to the GDB prompt" already present, or will it be added in the rewrite?

How to fix an ancient GDB problem

Posted Sep 30, 2022 15:58 UTC (Fri) by ballombe (subscriber, #9523) [Link]

> My personal experience with a possibly-related bug: when debugging programs which handle Ctrl+C themselves (like PipeWire and Audacious), pressing Ctrl+C will trigger the app's graceful shutdown process (terminating execution) rather than breaking into the debugger like I want. Will the changes to gdb fix this issue?

Unlikely. Usually 'graceful shutdown' is implemented by forking and waiting for the child to terminate (or crash).
This means that you are running gdb against the wrong program, so of course you do not get anything useful.
You need to tell gdb to follow the fork.

How to fix an ancient GDB problem

Posted Oct 3, 2022 18:30 UTC (Mon) by palves (guest, #91099) [Link] (1 responses)

> My personal experience with a possibly-related bug: when debugging programs which handle Ctrl+C themselves
> (like PipeWire and Audacious), pressing Ctrl+C will trigger the app's graceful shutdown process
> (terminating execution) rather than breaking into the debugger like I want. Will the changes to gdb fix this issue?

Back when v2 was posted to the list last year, someone tested it with PipeWire specifically, and confirmed it works there.

From https://inbox.sourceware.org/gdb-patches/1c54ccee2e4a2980... :

>> "I was recently fixing some bug in Pipewire. To be exact, I was working on
>> `pipewire-media-session` (one of daemons that Pipewire is comprised of). Process
>> of debugging basically included setting breakpoints somewhere, then hopefully
>> triggering the breakpoint, then inspecting the state. Nothing unusual.
>>
>> So, the thing that was slowing me down and annoying is that I couldn't just
>> pause the debuggee through the usual means of ^C, then add a breakpoint.
>> Pressing ^C was causing the process to exit!
>>
>> Pipewire project seems to have a number of various daemons, and all of them set
>> SIGINT handlers (judging by git-grepping for SIGINT). So basically, it is hard
>> to debug Pipewire with GDB due to GDB not being able to interactively pause the
>> process.
>>
>> Now, I tested the patchset, and I confirm it solves the problem! GDB built from
>> the palves' branch pauses pipewire-media-session correctly on ^C! "

..

> Also is "a GDB command that allows the user to specify which key should interrupt the process
> and return to the GDB prompt" already present, or will it be added in the rewrite?

It does not exist yet. It will be added in the next version of the patches.

How to fix an ancient GDB problem

Posted Nov 18, 2022 21:04 UTC (Fri) by Hi-Angel (guest, #110915) [Link]

> Back when v2 was posted to the list last year, someone tested it with PipeWire specifically, and confirmed it works there.

That was me. The patchset still hasn't been merged…? Oh, my :c

How to fix an ancient GDB problem

Posted Sep 30, 2022 7:53 UTC (Fri) by rwmj (subscriber, #5474) [Link] (3 responses)

Perhaps I misunderstand how this works, but if GDB connects to or disconnects from an existing process is it able to change the process's terminal to the new pty and then reset it back to the regular terminal later?

How to fix an ancient GDB problem

Posted Sep 30, 2022 13:19 UTC (Fri) by gray_-_wolf (subscriber, #131074) [Link]

I would assume this will work only when starting new process and not when attaching to existing one?

How to fix an ancient GDB problem

Posted Sep 30, 2022 13:51 UTC (Fri) by fw (subscriber, #26023) [Link] (1 responses)

When attaching to a running process, you just have to hit Ctrl+C in the terminal running GDB, as before. This mode of operation hasn't really suffered from these problems, as far as I understand it.

How to fix an ancient GDB problem

Posted Oct 4, 2022 11:52 UTC (Tue) by palves (guest, #91099) [Link]

> When attaching to a running process, you just have to hit Ctrl+C in the terminal running GDB, as before. This mode of
> operation hasn't really suffered from these problems, as far as I understand it.

When you attach to a process running in another terminal, and hit Ctrl-C in the terminal running GDB, that Ctrl-C is turned into a SIGINT sent to GDB. So that half of the problem does not exist in that scenario: GDB sees the SIGINT first, not the inferior. However, currently, in the "I am attached" scenario, GDB forwards that SIGINT to the target process, using plain "kill(pid, SIGINT)", and then relies on ptrace intercepting that SIGINT. If the target process blocks SIGINT, then you're back to square one. To get that scenario working properly, we still need part of the proposal in place, specifically the part about pausing the target process in a different way, with SIGSTOP.

How to fix an ancient GDB problem

Posted Sep 30, 2022 8:24 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)

Sounds like what we really need is to be able to attach an eBPF program to the PTY that can manipulate the I/O and customise the signal handling/generation.

I'm only half joking.

How to fix an ancient GDB problem

Posted Sep 30, 2022 14:43 UTC (Fri) by leromarinvit (subscriber, #56850) [Link]

That would actually be useful I think. Some time ago, I half-seriously looked into in-kernel ways to colorize stderr (so as to include programs that don't use glibc, so LD_PRELOAD tricks don't work), and BPF would be a nice, non-intrusive (at least for the actual colorization logic) solution if it had the capabilities.

Since the tty code can't be built as a module, the only other way to achieve what I wanted would either require rebuilding the kernel (which I'm usually too lazy for) or ugly runtime patching.

why blocking sigint ?

Posted Sep 30, 2022 14:13 UTC (Fri) by ballombe (subscriber, #9523) [Link] (5 responses)

Why would a program block SIGINT?
I can understand installing a signal handler, but blocking it outright?
This does not seem very user friendly.

why blocking sigint ?

Posted Sep 30, 2022 14:46 UTC (Fri) by farnz (subscriber, #17727) [Link] (3 responses)

I've worked on a codebase that blocked SIGINT whenever it entered a critical section, and unblocked it afterwards. This was a poor way to avoid being killed by the end user hitting Ctrl-C to escape the program at an inopportune moment (not least because it didn't account for Ctrl-\ or Ctrl-Z), but it worked well enough for the use case in question, since users just spammed Ctrl-C until it died.

A proper fix was to replace this with a signal handler, and teach the critical sections to check to see if there had been a SIGINT while they were in progress (exiting immediately once you were out of the critical section). But that took maintenance work being done for other reasons - it wasn't something that was going to get fixed until I had to do some work to port to a new OS.

why blocking sigint ?

Posted Sep 30, 2022 15:15 UTC (Fri) by ballombe (subscriber, #9523) [Link] (2 responses)

> A proper fix was to replace this with a signal handler, and teach the critical sections to check to see if there had been a SIGINT while they were in progress (exiting immediately once you were out of the critical section).

Yes, that is how my project does it too. It just adds a signal handler that sets a flag, and checks the flag when leaving the critical section. Much nicer than blocking SIGINT.
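For readers who have not seen the pattern, here is a rough sketch of it. The names `InterruptFlag` and `run_critical_section` are invented for this example and are not taken from any project mentioned in this thread; Python also requires that signal handlers be installed from the main thread.

```python
import signal


class InterruptFlag:
    """Records that SIGINT arrived instead of acting on it immediately."""

    def __init__(self) -> None:
        self.requested = False

    def handler(self, signum, frame) -> None:
        # Do as little as possible in the handler: just set the flag.
        self.requested = True


def run_critical_section(work) -> bool:
    """Run work() uninterrupted; return True if SIGINT arrived meanwhile."""
    flag = InterruptFlag()
    previous = signal.signal(signal.SIGINT, flag.handler)
    try:
        work()
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGINT, previous)
    # The caller checks the result and exits cleanly if it is True.
    return flag.requested
```

Since the signal is actually delivered to a handler here, a debugger tracing the program still gets to see it, unlike in the blocked case.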

why blocking sigint ?

Posted Sep 30, 2022 15:17 UTC (Fri) by farnz (subscriber, #17727) [Link]

Yes, but this also requires competent programmers, as opposed to people who flail around until they find something that fixes the reported bug (users killing the program with Ctrl-C when it can't safely be interrupted without triggering a full DB recheck).

why blocking sigint ?

Posted Oct 3, 2022 18:06 UTC (Mon) by iabervon (subscriber, #722) [Link]

If you're really just going to set a flag, the fact that a blocked signal is pending is effectively a flag that the kernel handles for you without any fuss and without interrupting syscalls. If you install a signal handler, you get to find out how well your program handles getting a ton of short writes or EINTR in a critical section when the user hammers ctrl-c. Your program should handle it well, but that path is probably not being exercised at all under normal operation.

why blocking sigint ?

Posted Sep 30, 2022 19:52 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Signal handlers suck, because they can only call fully reentrant functions. It is often desirable to run code on a "real" stack (i.e. not from the signal handler), at which point you might decide to forego the signal handler altogether, by doing one of the following instead:

* Make a signalfd(2) and add it to your main event loop's select(2)/poll(2)/etc. call.
* Make a dedicated "handle the signals" thread, block all the signals you care about (including SIGINT), and call sigwaitinfo(2) from that thread.
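
A rough sketch of the second option, for illustration only: the names `signal_thread` and `demo_signal_thread` are made up, `sigtimedwait()` is used in place of `sigwaitinfo()` so the example cannot hang, and the signal is sent straight to the waiting thread to keep the example deterministic (a terminal would send it to the whole process).

```python
import signal
import threading


def signal_thread(results: list, timeout: float = 5.0) -> None:
    """Dedicated thread: synchronously wait for SIGINT on a normal stack."""
    info = signal.sigtimedwait({signal.SIGINT}, timeout)
    results.append(None if info is None else info.si_signo)


def demo_signal_thread() -> list:
    results = []
    # Block SIGINT before starting the thread; new threads inherit the mask,
    # so no thread in this sketch ever runs an asynchronous SIGINT handler.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        worker = threading.Thread(target=signal_thread, args=(results,))
        worker.start()
        # Send SIGINT to the waiting thread; sigtimedwait() consumes it.
        signal.pthread_kill(worker.ident, signal.SIGINT)
        worker.join()
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return results
```

The signal is handled in ordinary code, with none of the reentrancy restrictions, but it is never "delivered" in the sense that ptrace() can observe.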

You might also block signals for a short time while the thread or process is holding a mutex, in cases where this could cause deadlock or violate an invariant.

Frankly, there is absolutely nothing in either POSIX or Linux to suggest that blocking signals is "wrong" so long as the process actually responds to them in a valid and appropriate way. SIGINT means "please stop what you're doing," not "please push a weird extra frame on the stack so the debugger can break in."

Since only 2007?

Posted Oct 13, 2022 20:40 UTC (Thu) by rmano (guest, #49886) [Link]


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds