|From:||Mathieu Desnoyers <firstname.lastname@example.org>|
|To:||email@example.com, firstname.lastname@example.org, email@example.com|
|Subject:||[RFC] Common Trace Format Requirements (v1.4)|
|Date:||Tue, 5 Oct 2010 15:32:57 -0400|
|Cc:||Steven Rostedt <firstname.lastname@example.org>, email@example.com, Peter Zijlstra <firstname.lastname@example.org>, Frederic Weisbecker <email@example.com>, Elena Zannoni <firstname.lastname@example.org>, Arnaldo Carvalho de Melo <email@example.com>, Thomas Gleixner <firstname.lastname@example.org>, Aaron Spear <email@example.com>, Tasneem Brutch - SISA <firstname.lastname@example.org>, Tim Bird <email@example.com>, firstname.lastname@example.org, Andi Kleen <email@example.com>, "Paul E. McKenney" <firstname.lastname@example.org>, Philippe Maisonneuve <Philippe.Maisonneuve@windriver.com>, Felix Burton <Felix.Burton@windriver.com>, Andrew McDermott <Andrew.McDermott@windriver.com>, Stefan Hajnoczi <email@example.com>|
Mathieu Desnoyers, EfficiOS Inc. The goal of the present document is to gather the trace format requirements from the embedded, telecom, high-performance and kernel communities. It consists of an overview of the trace format, tracer and trace analyzer requirements to consider for a Common Trace Format proposal. This document includes requirements from: Steven Rostedt <firstname.lastname@example.org> Dominique Toupin <email@example.com> Aaron Spear <firstname.lastname@example.org> Philippe Maisonneuve <Philippe.Maisonneuve@windriver.com> Felix Burton <Felix.Burton@windriver.com> Andrew McDermott <Andrew.McDermott@windriver.com> "Frank Ch. Eigler" <email@example.com> Michel Dagenais <firstname.lastname@example.org> Stefan Hajnoczi <email@example.com> Multi-Core Association Tool Infrastructure Workgroup (http://www.multicore-association.org/workgroup/tiwg.php) * Trace Format Requirements These are requirements on the trace format per se. This section discusses the layout of data in the trace, explaining the rationale behind the choices. The rationale for the trace format choices may refer to the tracer and trace analyzer requirements stated below. This section starts by presenting the common trace model, and then specifies the requirements of an instance of this model specifically tailored to efficient kernel- and user-space tracing requirements. 1) Architecture This high-level model is meant to be an industry-wide, common model, fulfilling the tracing requirements. It is meant to be application-, architecture-, and language-agnostic. 1.1) Core model - Event An event is an information record contained within the trace. - Events must be in physical order within a section. Their physical position relative to other events within the section specify their order relative to other events within the same section. - Event type (numeric identifier: maps to metadata) - Unique ID assigned within a section. - Event payload - Variable event size - Size limitations: maximum event size should be configurable. - Size information available through metadata. - Support various data alignment for architectures, standards, and languages: - Natural alignment of data for architectures with slow non-aligned writes. - Packed layout of headers for architecture with efficient non-aligned writes. - Section A section within the trace can be thought of as the ELF sections in a ELF binary. They contain a sequence of physically contiguous event records. - Multi-level section identifier - e.g.: section name / CPU number - Contains a subset of event types The parallel with ELF sections is used here to conceptually demonstrate the idea of section, but the similarity stops there. A trace is peculiar in that we have to continuously append to each sections, and we need to have ideally no interaction between sections. Therefore, for storage, recording all sections into a single file is not recommended; a directory made of one file per section is better suited. - Metadata Metadata is the description of the setting of the environment of the application. Defines the basic types of the domains. Will define the mapping between the event, and the type of the event fields. The metadata scope (what it describes) is a whole trace, which consists of one or many sections. The metadata can be either contained in the trace (better usability for telecom scenarios) or added alongside the trace data by a separate module (for DSP scenarios). Metadata checksumming (only for statically generated metadata) and/or versioning can be used to ensure consistency between sections and metadata in the latter. - Trace version - Major number (increment breaks compabilility) - Minor number (increment keeps compatibility) - Describe the invariant properties of the environment where the trace was generated. - Contain unique domain identifier (kernel, process ID and timestamp, hypervisor) - Describes the runtime environment. - Report target bitness - Report target byte order - Data types (see section 1.2 Extensions below) - Architecture-agnostic (text-based) - Ought to be parsed with a regular grammar - Mapping to event types, e.g. (section, event) tuples, with: ( section identifier, event numerical identifier ) - Description of event context fields (per section) - Can be streamed along with the trace as a trace section - Support dynamic addition of new event types while trace is active (required to support module/shared object loading and dynamic probes) - Metadata section should be efficient and reliable. Additional information could be kept in separate sections, outside of metadata. - Metadata description language not imposed by standard - Metadata format identifier placed at the beginning of the metadata. 1.2) Extensions (optional capabilities) - Event - Optional context (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node id, event ordering identifier, timestamp, current hardware performance counter information, event size) - Optional ordering capability across sections: - Ordering identifier required for trace containing many event streams - Either timestamp-based or based on unique sequence numbers - Optional time-flow capability: per-event timestamps - It should be possible to have context information only in some event records within a section. E.g., timestamp written every few events. - Section - Optional context applying to all events contained in that section (thread id, virtual cpu id, execution mode (irq/bh/thread), CPU/board/node id) - Support piece-wise compression - Support checksumming - Metadata - Execution environment information - Data types available: integer, strings, arrays, sequence, floats, structures, maps (aka enumerations), bitfields, ... - Describe type alignment. - Describe type size. - Describe type signedness. - Other type examples: - gcc "vector" type. (packed data) http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html - gcc complex type (e.g. complex short, float, double...) - gcc _Fract and _Accum http://gcc.gnu.org/wiki/FixedPointArithmetic http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html - Describes trace capabilities, for instance: - Event ordering across sections - Time flow information - In event header - Or possibly payload of pre-specified sections and/or events - Ability to perform event ordering across traces - Optional per-event "current state tracking" information. This per-event taxonomy allows automated creation of a state machine that keeps track of state updates within the taxonomy tree. Described in an file-system path-like taxonomy with additional  operator which indicates a lookup by value, e.g.: * For events in the trace stream updating the current state only based on information known from the context (either derived from the per-section or per-event context information): E.g., associated with a scheduling change event: "cpu[section/cpu]/thread = field/next_pid" Updates the current value of the current section's cpu "thread" attribute (e.g. currently running thread). E.g., associated with a system call: "thread[cpu[section/cpu]/thread]/syscall[field/syscall_id]/id = field/syscall_id" Updates the state value of the current thread "syscall" attribute. * For events in the trace stream targeting a path that depends on other fields into that same event (would be common for full system state dump at trace start): E.g., associated with a thread listing event: "thread[field/pid]/pid = field/pid" E.g., associated with a thread memory maps listing event: "thread[field/pid]/mmap[field/address]/address = field/address" "thread[field/pid]/mmap[field/address]/end = field/end" "thread[field/pid]/mmap[field/address]/flags = field/flags" "thread[field/pid]/mmap[field/address]/pgoff = field/pgoff" "thread[field/pid]/mmap[field/address]/inode = field/inode" All per-event context information (e.g. repeating the current PID and CPU for each event) can be represented with this taxonomy, e.g., in the section description: "section/pid = field/pid" "section/cpu = field/cpu" 2) Linux-specific Model (Linux instance, specific to the reference implementation) Instance of the model specifically tailored to the Linux kernel and C programs/libraries requirements. Allows for either packed events, or events aligned following the ISO/C standard. - Event - Payload - Initially support ISO C naturally aligned and packed type layouts. - Each section represented as a trace stream (typically 1 trace stream per cpu per section) to allow the tracer to easily append to these sections. Identifier: section name / CPU ID Each section has a CPU ID identifier in its context information. - Trace stream - Should have no hard-coded limit on size of a file generated by saving the trace stream (64 bit file position is fine) - Event lost count should be localized. It should apply to a limited time interval and to a tracefile, hence to a specific section, so the trace analyzer can provide basic information about what kind of events were lost and where they were lost in the trace. - A stream is divided into packets, which each consists of one or many event records. - Should be optionally compressible piece-wise (packet per packet). - Optional checksum on the packet content (except packet header), with a selection of checksum algorithms. Performed on a per-packet basis. - Packet headers should contain a sequence number to help UDP streaming reassembly. - Packet headers should be allowed to contain extra space reserved for encapsulation into a UDP packet encapsulation without copy. - Compact representation - Minimize the overhead in terms of disk/network/serial port/memory bandwidth. - A compact representation can keep more information in smaller buffers, thus needs less memory to keep the same amount of information around. Also useful to improve cache locality in flight recorder mode. - Natural alignment of headers for architectures with slow non-aligned writes. - Packed layout of headers for architecture with efficient non-aligned writes. - Should have a 1 to 1 mapping between the memory buffers and the generated trace files: allows zero-copy with splice(). - Use target endianness - Portable across different host target (tracer)/host (analyzer) architectures - It should be possible to generate metadata from descriptions written in header files (extraction with C preprocessor macros is one solution). * Requirements on the Tracers Higher-level tracer requirements that seem appropriate to support some of the trace format requirements stated above. Enumerating these higher-level requirements influence the trace format in many ways. For instance, a requirement for compactness leads to schemes where all information repetition should be eliminated. Thus the need for optional per-section context information. Another example is the requirement for speed and streaming. The requirement for speed ans treaming leads to zero-copy implementations, which imply that the trace format should be written natively by the tracer. The tracer requirements stated in this section are stated to ensure that the trace format structure makes it possible for a tracer to cope with the requirements, not to require that all tracer do so. *Fast* - Low-overhead - Handle large trace throughput (multi-GB per minutes) - Scalable to high number of cores - Per-cpu memory buffers - Scalability and performance-aware synchronization *Compact* - Environments without filesystem - Need to buffer events in target RAM to send them in group a host for analysis - Ability to tune the size of buffers and transmission medium to minimize the impact on the traced system. - Streaming (live monitoring) - Through sockets (USB, network) - Through serial ports - There must be a related protocol for streaming this event data. - Availability of flight recorder (synonym: overwrite) mode - Exclusive ownership of reader data. - Buffer size should be per group of events. - Output trace to disk - Trace buffers available in crash dump to allow post-mortem analysis - Fine-grained timestamps - Lockless (lock-free, ideally wait-free; aka starvation-free) - Buffer introspection: event written, read and lost counts. - Ability to iteratively narrow the level of details and traced time window following an initial high level "state" overview provided by an initial trace collecting everything. - Support kernel module instrumentation - Standard way(s) for a host to upload/access trace log data from a target/JTAG device/simulator/etc. - Conditional tracing in kernel space. - Compatibility with power management subsystem (trace collection shall not be a reason for waking up a device) - Well defined and stable trace configuration and control API across kernel versions. - Create and run more than one trace session in parallel at the same time - monitoring from system administrators - field engineered to troubleshoot a specific problem * Trace Analyzer Requirements The trace analyzer requirements stated in this section are stated to ensure that the trace format structure makes it possible for a trace analyzer to cope with the requirements, not to require that all trace analyzers do so. - Ability to cope with huge traces (> 10 GB) - Should be possible to do a binary search on the file to find events by time at least. (combined with smart indexing/ summary data perhaps) - File format should be as dense as possible, but not at the expense of analysis performance (faster is more important than bigger since disks are getting cheaper) - Must not be required to scan through all events in order to start analyzing (by time anyway) - Support live viewing of trace streams - Standard description of a trace event context. (PERI-XML calls it "Dimensions") - Manage system-wide event scoping with the following hierarchy: (address space identifier, section name, event name) -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds