
The ORCs are coming

By Jonathan Corbet
July 20, 2017
There are a few reasons for wanting the ability to get proper stack traces out of the kernel, including profiling, tracing, and debugging kernel crashes. Historically, the kernel's tracebacks have been unreliable for a number of reasons, most of which have been fixed in recent years. Now it seems likely that the 4.14 kernel will include a new mechanism that should put our traceback problems behind us — for now.

The state of the kernel's call stack can be surprisingly hard to interpret. Usually, it is made up of ordinary C function calls, but assembly-language code, interrupts, processor traps, etc. tend to confuse the picture. A confusing stack can, naturally, cause the "unwinder" code that tries to derive the current call chain from that stack's contents to do strange things; as a result, the kernel has long eschewed any sort of complicated unwinding code. For the most part, developers who deal with kernel tracebacks have learned to cope with occasional bad data.

The live patching effort, though, depends on accurate call-stack information for its consistency model; in short, it needs to be able to tell which functions appear in the call stack of any thread in the system. Getting there involved implementing the compile-time stack validation mechanism to ensure that all kernel code keeps the stack in reasonable condition at all times. The final step is a proper unwinder that uses this now-reliable stack information.

Last May, an attempt to add such an unwinder based on the DWARF debugging records emitted by the compiler ran into trouble when Linus Torvalds saw it. He noted that, the last time this experiment was tried, the unwinder ran into continual problems from changes to assembly-language code or problems with incorrect DWARF records and, as a result, proved to be unmaintainable. Thus, he said: "I do not ever again want to see fancy unwinders with complex state machine handling used by the oopsing code." So DWARF, which requires that sort of complexity, did not appear to be a good option.

That might have been the end of the story, given that Torvalds was firm in his position, but Josh Poimboeuf mentioned an idea he had been pondering for a bit. The objtool utility that performs stack validation at compile time builds a model of the state of the stack at every point in the built kernel. Perhaps, he thought, objtool could emit the debugging records to make that information available to the unwinder in a format rather simpler than DWARF. The result could be a more reliable unwinder using a more efficient data format that, importantly, is fully under the control of the kernel community and, one would hope, relatively unlikely to break.

Two months or so later, the result is the ORC unwinder. The name ostensibly stands for "oops rewind capability", though it's obviously a play on DWARF (which, in turn, is a play on the ELF executable format). The new ORC format is simple at its core; it is based on this structure:

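    struct orc_entry {
        s16         sp_offset;  /* offset added to the register named by sp_reg */
        s16         bp_offset;  /* offset applied to the register named by bp_reg */
        unsigned    sp_reg:4;   /* which register the stack offset is relative to */
        unsigned    bp_reg:4;   /* which register bp_offset is relative to */
        unsigned    type:2;     /* ORC_TYPE_CALL, ORC_TYPE_REGS, or ORC_TYPE_REGS_IRET */
    };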

The purpose of an orc_entry structure is to tell the unwinder code how to orient itself on the stack. There is one of these structures associated with each executable address in the kernel, along with a simple data structure allowing the unwinder to find the correct entry given a program-counter address.
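
As a rough illustration of that lookup step, the sketch below assumes the entries are kept in an array sorted by starting address, with each row covering every address up to the start of the next row. The `orc_row` type and the `orc_find()` function are hypothetical names invented for this example; the article does not describe the actual in-kernel data structure, so this should not be read as the real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical table row (not the kernel's layout): each row covers the
 * addresses from 'start' up to, but not including, the next row's 'start'.
 */
struct orc_row {
    uint64_t start;      /* first program-counter value covered */
    int16_t  sp_offset;  /* stand-in for the orc_entry payload */
};

/*
 * Return the row covering 'pc', or NULL if 'pc' lies before the first row.
 * 'rows' must be sorted by ascending 'start'.
 */
const struct orc_row *orc_find(const struct orc_row *rows, size_t n,
                               uint64_t pc)
{
    size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */

    assert(rows != NULL || n == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (rows[mid].start <= pc)
            lo = mid + 1;   /* rows[mid] starts at or before pc */
        else
            hi = mid;       /* rows[mid] starts after pc */
    }

    /* lo is now the number of rows whose start is <= pc */
    return lo ? &rows[lo - 1] : NULL;
}
```

A binary search keeps the lookup logarithmic in the number of rows, which matters if the unwinder is called frequently by tracing or profiling code.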

The interpretation of the structure depends on the type field. If it is ORC_TYPE_CALL, the code is running within a normal C-style call frame, and the beginning of that frame can be found by adding the sp_offset value to the value found in the register indicated by sp_reg. If, instead, type is ORC_TYPE_REGS, then that sum points to a pt_regs structure describing the processor (and stack) state prior to a system call. Finally, ORC_TYPE_REGS_IRET says that sp_reg and sp_offset can be used to find a return frame for a hardware interrupt. Those three possibilities appear to be enough to describe any situation that will be encountered, at least on the x86 architecture. (The bp_reg and bp_offset fields don't appear to have much use in the current implementation.)
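
The three cases can be sketched in user-space C as follows. This is only an illustration of the description above, under several assumptions: the numeric values of the ORC_TYPE_* constants, the `frame_ref` result type, the `orc_locate()` function, and the idea of passing register values in a plain array indexed by sp_reg are all invented for the example, and `int16_t` stands in for the kernel's `s16`.

```c
#include <assert.h>
#include <stdint.h>

/* Numeric values are assumed for illustration only. */
enum orc_type { ORC_TYPE_CALL, ORC_TYPE_REGS, ORC_TYPE_REGS_IRET };

struct orc_entry {
    int16_t  sp_offset;
    int16_t  bp_offset;
    unsigned sp_reg:4;
    unsigned bp_reg:4;
    unsigned type:2;
};

/* Hypothetical result: what kind of object lives at 'addr'. */
enum frame_kind { FRAME_CALL, FRAME_PT_REGS, FRAME_IRET, FRAME_UNKNOWN };

struct frame_ref {
    enum frame_kind kind;
    uint64_t        addr;
};

/*
 * 'regs' is a hypothetical snapshot of register values, indexed by the
 * 4-bit sp_reg field (so 16 slots are enough).
 */
struct frame_ref orc_locate(const struct orc_entry *e,
                            const uint64_t regs[16])
{
    struct frame_ref f;

    /* All three types share the same sum: register value plus sp_offset. */
    f.addr = regs[e->sp_reg] + (uint64_t)(int64_t)e->sp_offset;

    switch (e->type) {
    case ORC_TYPE_CALL:       /* start of a normal C call frame */
        f.kind = FRAME_CALL;
        break;
    case ORC_TYPE_REGS:       /* address of a saved pt_regs structure */
        f.kind = FRAME_PT_REGS;
        break;
    case ORC_TYPE_REGS_IRET:  /* return frame for a hardware interrupt */
        f.kind = FRAME_IRET;
        break;
    default:                  /* fourth bit pattern, not described here */
        f.kind = FRAME_UNKNOWN;
        break;
    }
    return f;
}
```

The point of the sketch is that no state machine is involved: one table entry, one addition, and a three-way switch are enough to find the next thing to look at on the stack.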

The resulting mechanism is far simpler than the DWARF mechanism. Among other things, that means it's quite a bit faster — a factor of at least 20x is claimed. Unwinding performance may not matter much when responding to a kernel oops, but it's a big deal for function tracing and profiling. The ORC approach is also claimed to be more reliable than telling the compiler to use frame pointers, and it doesn't suffer from the significant performance hit that frame pointers bring with them. And, as noted above, the ORC format is entirely under the control of the kernel community, so it shouldn't break with new compiler versions and, if it does, kernel developers can fix it.

Of course, it's hard to predict just how creative the compiler developers of the future may be when it comes to breaking call-stack information. Poimboeuf acknowledges that risk in the patch posting, but notes that:

If newer versions of GCC come up with some optimizations which break objtool, we may need to revisit the current implementation. Some possible solutions would be asking GCC to make the optimizations more palatable, or having objtool use DWARF as an additional input, or creating a GCC plugin to assist objtool with its analysis.

The other disadvantage is that the ORC format takes more space than DWARF, occupying 1MB or so of extra memory. Poimboeuf suggested that the memory use could be reduced if it turns out to be a real problem. "However, it will probably require sacrificing some combination of speed and simplicity".

Torvalds has not yet made his feelings known regarding the ORC patches, though he had in the past indicated that he thought the combination of objtool and a simpler format might work. Ingo Molnar, meanwhile, has applied the patches to the tip tree, indicating that they are likely to show up in a 4.14 pull request. So, barring last-minute problems, the multi-year effort to get a reliable stack unwinder in the kernel may be close to completion.




The ORCs are coming

Posted Jul 20, 2017 19:53 UTC (Thu) by jhoblitt (subscriber, #77733) [Link] (4 responses)

I haven't tried to grok the patchset... Will ORC track file/lineno or just symbol name?

The ORCs are coming

Posted Jul 20, 2017 19:57 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)

ORC will track neither of those things; it just provides the information needed to make sense of the kernel stack. The kallsyms mechanism can associate symbols with addresses, as always.

The ORCs are coming

Posted Jul 20, 2017 20:58 UTC (Thu) by Sesse (subscriber, #53779) [Link] (2 responses)

A related question; will this eventually seep down into userspace, so that we can get reliable perf backtraces without frame pointers? (Yes, there's --call-graph=dwarf, but it requires dumping the entire stack to the perf trace, since DWARF is too slow to trace in realtime. So it makes for slow, huge traces.)

User space

Posted Jul 20, 2017 21:06 UTC (Thu) by corbet (editor, #1) [Link] (1 responses)

That's a question that came up in the conversation; I didn't manage to work it into the article, sorry. There is definitely interest in doing that, and it seems possible, but nobody is working on it at the moment.

User space

Posted Jul 22, 2017 2:02 UTC (Sat) by alkbyby (subscriber, #61687) [Link]

User space is likely to need more complex unwinding support, since it has a wider set of possible compilers/runtimes and programming language features. That is, at least unwinding RBP is likely to be needed.

Is objtool for x86_64 only?

Posted Aug 1, 2017 5:36 UTC (Tue) by alison (subscriber, #63752) [Link] (4 responses)

There are a legion of readers who work exclusively on ARM processors. I'm disappointed to read the entire article and then realize, looking at the source tree, that "at least on the x86 architecture" meant that the ORC format is not supported on any processors I work with. It would be nice to see a clearer statement of architecture-specificity in the first paragraph.

Is objtool for x86_64 only?

Posted Aug 1, 2017 12:24 UTC (Tue) by corbet (editor, #1) [Link] (3 responses)

Lots of kernel features show up on x86 first; it doesn't usually take all that long for the interesting ones to spread to the other architectures. I don't know of anybody working on ARMing the ORCs at the moment, but it would not surprise me if it happened fairly soon once the x86 stuff lands.

Is objtool for x86_64 only?

Posted Aug 1, 2017 13:38 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

ARMing ORCs seems distinctly dangerous to me: they are not renowned for friendliness as neighbours. However, in this case the ARMs have already taken over the world so I'm not sure there's much the ORCs could do. :P

Is objtool for x86_64 only?

Posted Aug 3, 2017 9:50 UTC (Thu) by Wol (subscriber, #4433) [Link]

Ring the hobbits, maybe?

Cheers,
Wol

Is objtool for x86_64 only?

Posted Sep 14, 2017 3:54 UTC (Thu) by ajdlinux (subscriber, #82125) [Link]

ARM people feel neglected? I work exclusively on powerpc! :P

From what I can tell of the initially-x86-only features that do get ported to other architectures, arm/arm64 gets them first, powerpc can be (an often distant) second, and everything else may well be never...

The ORCs are coming

Posted Aug 1, 2017 14:32 UTC (Tue) by vbabka (subscriber, #91706) [Link]

Matt Fleming also posted a blog post yesterday about the ORC unwinder, with more details on some aspects of the topic: http://www.codeblueprint.co.uk/2017/07/31/the-orc-unwinde...

The ORCs are coming

Posted Aug 6, 2017 15:08 UTC (Sun) by vineetg (subscriber, #85161) [Link]

Is this x86-specific, or can it be adapted to all arches?

ARMing ORCs

Posted Sep 7, 2017 19:46 UTC (Thu) by vomlehn (guest, #45588) [Link]

Nice. Having done truly shameful things to get MIPS stack backtraces, I spent some time on an approach like this but changed jobs and it fell out of my universe. I have, apparently, a stack backtrace fetish and would take on the ARM version if I could just clone myself. Sigh.

The ORCs are coming

Posted Sep 14, 2017 7:09 UTC (Thu) by johill (subscriber, #25196) [Link]

Now that this is merged ...

What happens if we crash in eBPF JIT'ed code? Clearly there cannot be any ORC annotation for that? I'm not sure the JIT ever emits stack usage though.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds