|  | ====================================== | 
|  | XRay Flight Data Recorder Trace Format | 
|  | ====================================== | 
|  |  | 
|  | :Version: 1 as of 2017-07-20 | 
|  |  | 
|  | .. contents:: | 
|  | :local: | 
|  |  | 
|  |  | 
|  | Introduction | 
|  | ============ | 
|  |  | 
|  | When gathering XRay traces in Flight Data Recorder mode, each thread of an | 
|  | application will claim buffers to fill with trace data, which at some point | 
|  | is finalized and flushed. | 
|  |  | 
|  | A goal of the profiler is to minimize overhead, the flushed data directly | 
|  | corresponds to the buffer. | 
|  |  | 
|  | This document describes the format of a trace file. | 
|  |  | 
|  |  | 
|  | General | 
|  | ======= | 
|  |  | 
|  | Each trace file corresponds to a sequence of events in a particular thread. | 
|  |  | 
|  | The file has a header followed by a sequence of discriminated record types. | 
|  |  | 
|  | The endianness of byte fields matches the endianness of the platform which | 
|  | produced the trace file. | 
|  |  | 
|  |  | 
|  | Header Section | 
|  | ============== | 
|  |  | 
|  | A trace file begins with a 32-byte header. | 
|  |  | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  | | Field             | Size (bytes)    | Description                            | | 
|  | +===================+=================+========================================+ | 
|  | | version           | ``2``           | Anticipates versioned  readers. This   | | 
|  | |                   |                 | document describes the format when     | | 
|  | |                   |                 | version == 1                           | | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  | | type              | ``2``           | An enumeration encoding the type of    | | 
|  | |                   |                 | trace. Flight Data Recorder mode       | | 
|  | |                   |                 | traces have type == 1                  | | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  | | bitfield          | ``4``           | Holds parameters that are not aligned  | | 
|  | |                   |                 | to bytes. Further described below.     | | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  | | cycle_frequency   | ``8``           | The frequency in hertz of the CPU      | | 
|  | |                   |                 | oscillator used to measure duration of | | 
|  | |                   |                 | events in ticks.                       | | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  | | buffer_size       | ``8``           | The size in bytes of the data portion  | | 
|  | |                   |                 | of the trace following the header.     | | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  | | reserved          | ``8``           | Reserved for future use.               | | 
|  | +-------------------+-----------------+----------------------------------------+ | 
|  |  | 
|  | The bitfield parameter of the file header is composed of the following fields. | 
|  |  | 
|  | +-------------------+----------------+-----------------------------------------+ | 
|  | | Field             | Size (bits)    | Description                             | | 
|  | +===================+================+=========================================+ | 
|  | | constant_tsc      | ``1``          | Whether the platform's timestamp        | | 
|  | |                   |                | counter used to record ticks between    | | 
|  | |                   |                | events ticks at a constant frequency    | | 
|  | |                   |                | despite CPU frequency changes.          | | 
|  | |                   |                | 0 == non-constant. 1 == constant.       | | 
|  | +-------------------+----------------+-----------------------------------------+ | 
|  | | nonstop_tsc       | ``1``          | Whether the tsc continues to count      | | 
|  | |                   |                | despite whether the CPU is in a low     | | 
|  | |                   |                | power state. 0 == stop. 1 == non-stop.  | | 
|  | +-------------------+----------------+-----------------------------------------+ | 
|  | | reserved          | ``30``         | Not meaningful.                         | | 
|  | +-------------------+----------------+-----------------------------------------+ | 
|  |  | 
|  |  | 
|  | Data Section | 
|  | ============ | 
|  |  | 
|  | Following the header in a trace is a data section with size matching the | 
|  | buffer_size field in the header. | 
|  |  | 
|  | The data section is a stream of elements of different types. | 
|  |  | 
|  | There are a few categories of data in the sequence. | 
|  |  | 
|  | - ``Function Records``: Function Records contain the timing of entry into and | 
|  | exit from function execution. Function Records have 8 bytes each. | 
|  |  | 
|  | - ``Metadata Records``: Metadata records serve many purposes. Mostly, they | 
|  | capture information that may be too costly to record for each function, but | 
|  | that is required to contextualize the fine-grained timings. They also are used | 
|  | as markers for user-defined Event Data payloads. Metadata records have 16 | 
|  | bytes each. | 
|  |  | 
|  | - ``Event Data``: Free form data may be associated with events that are traced | 
|  | by the binary and encode data defined by a handler function. Event data is | 
|  | always preceded with a marker record which indicates how large it is. | 
|  |  | 
|  | - ``Function Arguments``: The arguments to some functions are included in the | 
|  | trace. These are either pointer addresses or primitives that are read and | 
|  | logged independently of their types in a high level language. To the tracer, | 
|  | they are all numbers. Function Records that have attached arguments will | 
|  | indicate their presence on the function entry record. We only support logging | 
|  | contiguous function argument sequences starting with argument zero, which will | 
|  | be the "this" pointer for member function invocations. For example, we don't | 
|  | support logging the first and third argument. | 
|  |  | 
|  | A reader of the memory format must maintain a state machine. The format makes no | 
|  | attempt to pad for alignment, and it is not seekable. | 
|  |  | 
|  |  | 
|  | Function Records | 
|  | ---------------- | 
|  |  | 
|  | Function Records have an 8-byte layout. This layout encodes information to | 
|  | reconstruct a call stack of instrumented function and their durations. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bits)  | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | discriminant  | ``1``        | Indicates whether a reader should read a      | | 
|  | |               |              | Function or Metadata record. Set to ``0`` for | | 
|  | |               |              | Function records.                             | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | action        | ``3``        | Specifies whether the function is being       | | 
|  | |               |              | entered, exited, or is a non-standard entry   | | 
|  | |               |              | or exit produced by optimizations.            | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | function_id   | ``28``       | A numeric ID for the function. Resolved to a  | | 
|  | |               |              | name via the xray instrumentation map. The    | | 
|  | |               |              | instrumentation map is built by xray at       | | 
|  | |               |              | compile time into an object file and pairs    | | 
|  | |               |              | the function ids to addresses. It is used for | | 
|  | |               |              | patching and as a lookup into the binary's    | | 
|  | |               |              | symbols to obtain names.                      | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | tsc_delta     | ``32``       | The number of ticks of the timestamp counter  | | 
|  | |               |              | since a previous record recorded a delta or   | | 
|  | |               |              | other TSC resetting event.                    | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  | On little-endian machines, the bitfields are ordered from least significant bit | 
|  | bit to most significant bit. A reader can read an 8 bit value and apply the mask | 
|  | ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned | 
|  | shift right by ``0x04`` to obtain the function_id field. | 
|  |  | 
|  | On big-endian machine, the bitfields are written in order from most significant | 
|  | bit to least significant bit. A reader would read an 8 bit value and unsigned | 
|  | shift right by 7 bits for the discriminant. The function_id field could be | 
|  | obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. | 
|  |  | 
|  | Function action types are as follows. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Type          | Number       | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | Entry         | ``0``        | Typical function entry.                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Exit          | ``1``        | Typical function exit.                        | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Tail_Exit     | ``2``        | An exit from a function due to tail call      | | 
|  | |               |              | optimization.                                 | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Entry_Args    | ``3``        | A function entry that records arguments.      | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  | Entry_Args records do not contain the arguments themselves. Instead, metadata | 
|  | records for each of the logged args follow the function record in the stream. | 
|  |  | 
|  |  | 
|  | Metadata Records | 
|  | ---------------- | 
|  |  | 
|  | Interspersed throughout the buffer are 16-byte Metadata records. For typically | 
|  | instrumented binaries, they will be sparser than Function records, and they | 
|  | provide a fuller picture of the binary execution state. | 
|  |  | 
|  | Metadata record layout is partially record dependent, but they share a common | 
|  | structure. | 
|  |  | 
|  | The same bit field rules described for function records apply to the first byte | 
|  | of MetadataRecords. Within this byte, little endian machines use lsb to msb | 
|  | ordering and big endian machines use msb to lsb ordering. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size         | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | discriminant  | ``1 bit``    | Indicates whether a reader should read a      | | 
|  | |               |              | Function or Metadata record. Set to ``1`` for | | 
|  | |               |              | Metadata records.                             | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | record_kind   | ``7 bits``   | The type of Metadata record.                  | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | data          | ``15 bytes`` | A data field used differently for each record | | 
|  | |               |              | type.                                         | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  | Here is a table of the enumerated record kinds. | 
|  |  | 
|  | +--------+---------------------------+ | 
|  | | Number | Type                      | | 
|  | +========+===========================+ | 
|  | | 0      | NewBuffer                 | | 
|  | +--------+---------------------------+ | 
|  | | 1      | EndOfBuffer               | | 
|  | +--------+---------------------------+ | 
|  | | 2      | NewCPUId                  | | 
|  | +--------+---------------------------+ | 
|  | | 3      | TSCWrap                   | | 
|  | +--------+---------------------------+ | 
|  | | 4      | WallTimeMarker            | | 
|  | +--------+---------------------------+ | 
|  | | 5      | CustomEventMarker         | | 
|  | +--------+---------------------------+ | 
|  | | 6      | CallArgument              | | 
|  | +--------+---------------------------+ | 
|  |  | 
|  |  | 
|  | NewBuffer Records | 
|  | ----------------- | 
|  |  | 
|  | Each buffer begins with a NewBuffer record immediately after the header. | 
|  | It records the thread ID of the thread that the trace belongs to. | 
|  |  | 
|  | Its data segment is as follows. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bytes) | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | thread_Id     | ``2``        | Thread ID for buffer.                         | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | reserved      | ``13``       | Unused.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  |  | 
|  | WallClockTime Records | 
|  | --------------------- | 
|  |  | 
|  | Following the NewBuffer record, each buffer records an absolute time as a frame | 
|  | of reference for the durations recorded by timestamp counter deltas. | 
|  |  | 
|  | Its data segment is as follows. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bytes) | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | seconds       | ``8``        | Seconds on absolute timescale. The starting   | | 
|  | |               |              | point is unspecified and depends on the       | | 
|  | |               |              | implementation and platform configured by the | | 
|  | |               |              | tracer.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | microseconds  | ``4``        | The microsecond component of the time.        | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | reserved      | ``3``        | Unused.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  |  | 
|  | NewCpuId Records | 
|  | ---------------- | 
|  |  | 
|  | Each function entry invokes a routine to determine what CPU is executing. | 
|  | Typically, this is done with readtscp, which reads the timestamp counter at the | 
|  | same time. | 
|  |  | 
|  | If the tracing detects that the execution has switched CPUs or if this is the | 
|  | first instrumented entry point, the tracer will output a NewCpuId record. | 
|  |  | 
|  | Its data segment is as follows. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bytes) | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | cpu_id        | ``2``        | CPU Id.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | absolute_tsc  | ``8``        | The absolute value of the timestamp counter.  | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | reserved      | ``5``        | Unused.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  |  | 
|  | TSCWrap Records | 
|  | --------------- | 
|  |  | 
|  | Since each function record uses a 32 bit value to represent the number of ticks | 
|  | of the timestamp counter since the last reference, it is possible for this value | 
|  | to overflow, particularly for sparsely instrumented binaries. | 
|  |  | 
|  | When this delta would not fit into a 32 bit representation, a reference absolute | 
|  | timestamp counter record is written in the form of a TSCWrap record. | 
|  |  | 
|  | Its data segment is as follows. | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bytes) | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | absolute_tsc  | ``8``        | Timestamp counter value.                      | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | reserved      | ``7``        | Unused.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  |  | 
|  | CallArgument Records | 
|  | -------------------- | 
|  |  | 
|  | Immediately following an Entry_Args type function record, there may be one or | 
|  | more CallArgument records that contain the traced function's parameter values. | 
|  |  | 
|  | The order of the CallArgument Record sequency corresponds one to one with the | 
|  | order of the function parameters. | 
|  |  | 
|  | CallArgument data segment: | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bytes) | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | argument      | ``8``        | Numeric argument (may be pointer address).    | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | reserved      | ``7``        | Unused.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  |  | 
|  | CustomEventMarker Records | 
|  | ------------------------- | 
|  |  | 
|  | XRay provides the feature of logging custom events. This may be leveraged to | 
|  | record tracing info for RPCs or similarly trace data that is application | 
|  | specific. | 
|  |  | 
|  | Custom Events themselves are an unstructured (application defined) segment of | 
|  | memory with arbitrary size within the buffer. They are preceded by | 
|  | CustomEventMarkers to indicate their presence and size. | 
|  |  | 
|  | CustomEventMarker data segment: | 
|  |  | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | Field         | Size (bytes) | Description                                   | | 
|  | +===============+==============+===============================================+ | 
|  | | event_size    | ``4``        | Size of preceded event.                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | absolute_tsc  | ``8``        | A timestamp counter of the event.             | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  | | reserved      | ``3``        | Unused.                                       | | 
|  | +---------------+--------------+-----------------------------------------------+ | 
|  |  | 
|  |  | 
|  | EndOfBuffer Records | 
|  | ------------------- | 
|  |  | 
|  | An EndOfBuffer record type indicates that there is no more trace data in this | 
|  | buffer. The reader is expected to seek past the remaining buffer_size expressed | 
|  | before the start of buffer and look for either another header or EOF. | 
|  |  | 
|  |  | 
|  | Format Grammar and Invariants | 
|  | ============================= | 
|  |  | 
|  | Not all sequences of Metadata records and Function records are valid data. A | 
|  | sequence should be parsed as a state machine. The expectations for a valid | 
|  | format can be expressed as a context free grammar. | 
|  |  | 
|  | This is an attempt to explain the format with statements in EBNF format. | 
|  |  | 
|  | - Format := Header ThreadBuffer* EOF | 
|  |  | 
|  | - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End | 
|  |  | 
|  | - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent | 
|  |  | 
|  | - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type | 
|  |  | 
|  | - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory | 
|  |  | 
|  | - End := EndOfBuffer RemainingBufferSizeToSkip | 
|  |  | 
|  |  | 
|  | Function Record Order | 
|  | --------------------- | 
|  |  | 
|  | There are a few clarifications that may help understand what is expected of | 
|  | Function records. | 
|  |  | 
|  | - Functions with an Exit are expected to have a corresponding Entry or | 
|  | Entry_Args function record precede them in the trace. | 
|  |  | 
|  | - Tail_Exit Function records record the Function ID of the function whose return | 
|  | address the program counter will take. In other words, the final function that | 
|  | would be popped off of the call stack if tail call optimization was not used. | 
|  |  | 
|  | - Not all functions marked for instrumentation are necessarily in the trace. The | 
|  | tracer uses heuristics to preserve the trace for non-trivial functions. | 
|  |  | 
|  | - Not every entry must have a traced Exit or Tail Exit. The buffer may run out | 
|  | of space or the program may request for the tracer to finalize toreturn the | 
|  | buffer before an instrumented function exits. |