| # MLIR Bytecode Format |
| |
| This document describes the MLIR bytecode format and its encoding. |
| This format is versioned and stable: we don't plan to ever break |
| compatibility, that is a dialect should be able to deserialize any |
| older bytecode. Similarly, we support back-deployment so that an |
| older version of the format can be targetted. |
| |
| That said, it is important to realize that the promises of the |
| bytecode format are made assuming immutable dialects: the format |
| allows backward and forward compatibility, but only when nothing |
| in a dialect changes (operations, types, attributes definitions). |
| |
| A dialect can opt-in to handle its own versioning through the |
| `BytecodeDialectInterface`. Some hooks are exposed to the dialect |
| to allow managing a version encoded into the bytecode file. The |
| version is loaded lazily and allows to retrieve the version |
| information while decoding the input IR, and gives an opportunity |
| to each dialect for which a version is present to perform IR |
| upgrades post-parsing through the `upgradeFromVersion` method. |
| There is no restriction on what kind of information a dialect |
| is allowed to encode to model its versioning. |
| |
| [TOC] |
| |
| ## Magic Number |
| |
| MLIR uses the following four-byte magic number to |
| indicate bytecode files: |
| |
| '\[‘M’<sub>8</sub>, ‘L’<sub>8</sub>, ‘ï’<sub>8</sub>, ‘R’<sub>8</sub>\]' |
| |
| In hex: |
| |
| '\[‘4D’<sub>8</sub>, ‘4C’<sub>8</sub>, ‘EF’<sub>8</sub>, ‘52’<sub>8</sub>\]' |
| |
| ## Format Overview |
| |
| An MLIR Bytecode file is comprised of a byte stream, with a few simple |
| structural concepts layered on top. |
| |
| ### Primitives |
| |
| #### Fixed-Width Integers |
| |
| ``` |
| byte ::= `0x00`...`0xFF` |
| ``` |
| |
| Fixed width integers are unsigned integers of a known byte size. The values are |
| stored in little-endian byte order. |
| |
| TODO: Add larger fixed width integers as necessary. |
| |
| #### Variable-Width Integers |
| |
| Variable width integers, or `VarInt`s, provide a compact representation for |
| integers. Each encoded VarInt consists of one to nine bytes, which together |
| represent a single 64-bit value. The MLIR bytecode utilizes the "PrefixVarInt" |
| encoding for VarInts. This encoding is a variant of the |
| [LEB128 ("Little-Endian Base 128")](https://en.wikipedia.org/wiki/LEB128) |
| encoding, where each byte of the encoding provides up to 7 bits for the value, |
| with the remaining bit used to store a tag indicating the number of bytes used |
| for the encoding. This means that small unsigned integers (less than 2^7) may be |
| stored in one byte, unsigned integers up to 2^14 may be stored in two bytes, |
| etc. |
| |
| The first byte of the encoding includes a length prefix in the low bits. This |
| prefix is a bit sequence of '0's followed by a terminal '1', or the end of the |
| byte. The number of '0' bits indicate the number of _additional_ bytes, not |
| including the prefix byte, used to encode the value. All of the remaining bits |
| in the first byte, along with all of the bits in the additional bytes, provide |
| the value of the integer. Below are the various possible encodings of the prefix |
| byte: |
| |
| ``` |
| xxxxxxx1: 7 value bits, the encoding uses 1 byte |
| xxxxxx10: 14 value bits, the encoding uses 2 bytes |
| xxxxx100: 21 value bits, the encoding uses 3 bytes |
| xxxx1000: 28 value bits, the encoding uses 4 bytes |
| xxx10000: 35 value bits, the encoding uses 5 bytes |
| xx100000: 42 value bits, the encoding uses 6 bytes |
| x1000000: 49 value bits, the encoding uses 7 bytes |
| 10000000: 56 value bits, the encoding uses 8 bytes |
| 00000000: 64 value bits, the encoding uses 9 bytes |
| ``` |
| |
| ##### Signed Variable-Width Integers |
| |
| Signed variable width integer values are encoded in a similar fashion to |
| [varints](#variable-width-integers), but employ |
| [zigzag encoding](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding). |
| This encoding uses the low bit of the value to indicate the sign, which allows |
| for more efficiently encoding negative numbers. If a negative value were encoded |
| using a normal [varint](#variable-width-integers), it would be treated as an |
| extremely large unsigned value. Using zigzag encoding allows for a smaller |
| number of active bits in the value, leading to a smaller encoding. Below is the |
| basic computation for generating a zigzag encoding: |
| |
| ``` |
| (value << 1) ^ (value >> 63) |
| ``` |
| |
| #### Strings |
| |
| Strings are blobs of characters with an associated length. |
| |
| ### Sections |
| |
| ``` |
| section { |
| idAndIsAligned: byte // id | (hasAlign << 7) |
| length: varint, |
| |
| alignment: varint?, |
| padding: byte[], // Padding bytes are always `0xCB`. |
| |
| data: byte[] |
| } |
| ``` |
| |
| Sections are a mechanism for grouping data within the bytecode. They enable |
| delayed processing, which is useful for out-of-order processing of data, |
| lazy-loading, and more. Each section contains a Section ID, whose high bit |
| indicates if the section has alignment requirements, a length (which allows for |
| skipping over the section), and an optional alignment. When an alignment is |
| present, a variable number of padding bytes (0xCB) may appear before the section |
| data. The alignment of a section must be a power of 2. |
| |
| ## MLIR Encoding |
| |
| Given the generic structure of MLIR, the bytecode encoding is actually fairly |
| simplistic. It effectively maps to the core components of MLIR. |
| |
| ### Top Level Structure |
| |
| The top-level structure of the bytecode contains the 4-byte "magic number", a |
| version number, a null-terminated producer string, and a list of sections. Each |
| section is currently only expected to appear once within a bytecode file. |
| |
| ``` |
| bytecode { |
| magic: "MLïR", |
| version: varint, |
| producer: string, |
| sections: section[] |
| } |
| ``` |
| |
| ### String Section |
| |
| ``` |
| strings { |
| numStrings: varint, |
| reverseStringLengths: varint[], |
| stringData: byte[] |
| } |
| ``` |
| |
| The string section contains a table of strings referenced within the bytecode, |
| more easily enabling string sharing. This section is encoded first with the |
| total number of strings, followed by the sizes of each of the individual strings |
| in reverse order. The remaining encoding contains a single blob containing all |
| of the strings concatenated together. |
| |
| ### Dialect Section |
| |
| The dialect section of the bytecode contains all of the dialects referenced |
| within the encoded IR, and some information about the components of those |
| dialects that were also referenced. |
| |
| ``` |
| dialect_section { |
| numDialects: varint, |
| dialectNames: dialect_name_group[], |
| opNames: dialect_ops_group[] // ops grouped by dialect |
| } |
| |
| dialect_name_group { |
| nameAndIsVersioned: varint // (dialectID << 1) | (hasVersion), |
| version: dialect_version_section // only if versioned |
| } |
| |
| dialect_version_section { |
| size: varint, |
| version: byte[] |
| } |
| |
| dialect_ops_group { |
| dialect: varint, |
| numOpNames: varint, |
| opNames: op_name_group[] |
| } |
| |
| op_name_group { |
| nameAndIsRegistered: varint // (nameID << 1) | (isRegisteredOp) |
| } |
| ``` |
| |
| Dialects are encoded as a `varint` containing the index to the name string |
| within the string section, plus a flag indicating whether the dialect is |
| versioned. Operation names are encoded in groups by dialect, with each group |
| containing the dialect, the number of operation names, and the array of indexes |
| to each name within the string section. The version is encoded as a nested |
| section for each dialect. |
| |
| ### Attribute/Type Sections |
| |
| Attributes and types are encoded using two [sections](#sections), one section |
| (`attr_type_section`) containing the actual encoded representation, and another |
| section (`attr_type_offset_section`) containing the offsets of each encoded |
| attribute/type into the previous section. This structure allows for attributes |
| and types to always be lazily loaded on demand. |
| |
| ``` |
| attr_type_section { |
| attrs: attribute[], |
| types: type[] |
| } |
| attr_type_offset_section { |
| numAttrs: varint, |
| numTypes: varint, |
| offsets: attr_type_offset_group[] |
| } |
| |
| attr_type_offset_group { |
| dialect: varint, |
| numElements: varint, |
| offsets: varint[] // (offset << 1) | (hasCustomEncoding) |
| } |
| |
| attribute { |
| encoding: ... |
| } |
| type { |
| encoding: ... |
| } |
| ``` |
| |
| Each `offset` in the `attr_type_offset_section` above is the size of the |
| encoding for the attribute or type and a flag indicating if the encoding uses |
| the textual assembly format, or a custom bytecode encoding. We avoid using the |
| direct offset into the `attr_type_section`, as a smaller relative offsets |
| provides more effective compression. Attributes and types are grouped by |
| dialect, with each `attr_type_offset_group` in the offset section containing the |
| corresponding parent dialect, number of elements, and offsets for each element |
| within the group. |
| |
| #### Attribute/Type Encodings |
| |
| In the abstract, an attribute/type is encoded in one of two possible ways: via |
| its assembly format, or via a custom dialect defined encoding. |
| |
| ##### Assembly Format Fallback |
| |
| In the case where a dialect does not define a method for encoding the attribute |
| or type, the textual assembly format of that attribute or type is used as a |
| fallback. For example, a type `!bytecode.type<42>` would be encoded as the null |
| terminated string "!bytecode.type<42>". This ensures that every attribute and |
| type can be encoded, even if the owning dialect has not yet opted in to a more |
| efficient serialization. |
| |
| TODO: We shouldn't redundantly encode the dialect name here, we should use a |
| reference to the parent dialect instead. |
| |
| ##### Dialect Defined Encoding |
| |
| As an alternative to the assembly format fallback, dialects may also provide a |
| custom encoding for their attributes and types. Custom encodings are very |
| beneficial in that they are significantly smaller and faster to read and write. |
| |
| Dialects can opt-in to providing custom encodings by implementing the |
| `BytecodeDialectInterface`. This interface provides hooks, namely |
| `readAttribute`/`readType` and `writeAttribute`/`writeType`, that will be used |
| by the bytecode reader and writer. These hooks are provided a reader and writer |
| implementation that can be used to encode various constructs in the underlying |
| bytecode format. A unique feature of this interface is that dialects may choose |
| to only encode a subset of their attributes and types in a custom bytecode |
| format, which can simplify adding new or experimental components that aren't |
| fully baked. |
| |
| When implementing the bytecode interface, dialects are responsible for all |
| aspects of the encoding. This includes the indicator for which kind of attribute |
| or type is being encoded; the bytecode reader will only know that it has |
| encountered an attribute or type of a given dialect, it doesn't encode any |
| further information. As such, a common encoding idiom is to use a leading |
| `varint` code to indicate how the attribute or type was encoded. |
| |
| ### Resource Section |
| |
| Resources are encoded using two [sections](#sections), one section |
| (`resource_section`) containing the actual encoded representation, and another |
| section (`resource_offset_section`) containing the offsets of each encoded |
| resource into the previous section. |
| |
| ``` |
| resource_section { |
| resources: resource[] |
| } |
| resource { |
| value: resource_bool | resource_string | resource_blob |
| } |
| resource_bool { |
| value: byte |
| } |
| resource_string { |
| value: varint |
| } |
| resource_blob { |
| alignment: varint, |
| size: varint, |
| padding: byte[], |
| blob: byte[] |
| } |
| |
| resource_offset_section { |
| numExternalResourceGroups: varint, |
| resourceGroups: resource_group[] |
| } |
| resource_group { |
| key: varint, |
| numResources: varint, |
| resources: resource_info[] |
| } |
| resource_info { |
| key: varint, |
| size: varint |
| kind: byte, |
| } |
| ``` |
| |
| Resources are grouped by the provider, either an external entity or a dialect, |
| with each `resource_group` in the offset section containing the corresponding |
| provider, number of elements, and info for each element within the group. For |
| each element, we record the key, the value kind, and the encoded size. We avoid |
| using the direct offset into the `resource_section`, as a smaller relative |
| offsets provides more effective compression. |
| |
| ### IR Section |
| |
| The IR section contains the encoded form of operations within the bytecode. |
| |
| ``` |
| ir_section { |
| block: block; // Single block without arguments. |
| } |
| ``` |
| |
| #### Operation Encoding |
| |
| ``` |
| op { |
| name: varint, |
| encodingMask: byte, |
| location: varint, |
| |
| attrDict: varint?, |
| |
| numResults: varint?, |
| resultTypes: varint[], |
| |
| numOperands: varint?, |
| operands: varint[], |
| |
| numSuccessors: varint?, |
| successors: varint[], |
| |
| numUseListOrders: varint?, |
| useListOrders: uselist[], |
| |
| regionEncoding: varint?, // (numRegions << 1) | (isIsolatedFromAbove) |
| |
| // regions are stored in a section if isIsolatedFromAbove |
| regions: (region | region_section)[] |
| } |
| |
| uselist { |
| indexInRange: varint?, |
| useListEncoding: varint, // (numIndices << 1) | (isIndexPairEncoding) |
| indices: varint[] |
| } |
| ``` |
| |
| The encoding of an operation is important because this is generally the most |
| commonly appearing structure in the bytecode. A single encoding is used for |
| every type of operation. Given this prevalence, many of the fields of an |
| operation are optional. The `encodingMask` field is a bitmask which indicates |
| which of the components of the operation are present. |
| |
| ##### Location |
| |
| The location is encoded as the index of the location within the attribute table. |
| |
| ##### Attributes |
| |
| If the operation has attribues, the index of the operation attribute dictionary |
| within the attribute table is encoded. |
| |
| ##### Results |
| |
| If the operation has results, the number of results and the indexes of the |
| result types within the type table are encoded. |
| |
| ##### Operands |
| |
| If the operation has operands, the number of operands and the value index of |
| each operand is encoded. This value index is the relative ordering of the |
| definition of that value from the start of the first ancestor isolated region. |
| |
| ##### Successors |
| |
| If the operation has successors, the number of successors and the indexes of the |
| successor blocks within the parent region are encoded. |
| |
| ##### Use-list orders |
| |
| The reference use-list order is assumed to be the reverse of the global |
| enumeration of all the op operands that one would obtain with a pre-order walk |
| of the IR. This order is naturally obtained by building blocks of operations |
| op-by-op. However, some transformations may shuffle the use-lists with respect |
| to this reference ordering. If any of the results of the operation have a |
| use-list order that is not sorted with respect to the reference use-list order, |
| an encoding is emitted such that it is possible to reconstruct such order after |
| parsing the bytecode. The encoding represents an index map from the reference |
| operand order to the current use-list order. A bit flag is used to detect if |
| this encoding is of type index-pair or not. When the bit flag is set to zero, |
| the element at `i` represent the position of the use `i` of the reference list |
| into the current use-list. When the bit flag is set to `1`, the encoding |
| represent index pairs `(i, j)`, which indicate that the use at position `i` of |
| the reference list is mapped to position `j` in the current use-list. When only |
| less than half of the elements in the current use-list are shuffled with respect |
| to the reference use-list, the index-pair encoding is used to reduce the |
| bytecode memory requirements. |
| |
| ##### Regions |
| |
| If the operation has regions, the number of regions and if the regions are |
| isolated from above are encoded together in a single varint. Afterwards, each |
| region is encoded inline. |
| |
| #### Region Encoding |
| |
| ``` |
| region { |
| numBlocks: varint, |
| |
| numValues: varint?, |
| blocks: block[] |
| } |
| ``` |
| |
| A region is encoded first with the number of blocks within. If the region is |
| non-empty, the number of values defined directly within the region are encoded, |
| followed by the blocks of the region. |
| |
| #### Block Encoding |
| |
| ``` |
| block { |
| encoding: varint, // (numOps << 1) | (hasBlockArgs) |
| arguments: block_arguments?, // Optional based on encoding |
| ops : op[] |
| } |
| |
| block_arguments { |
| numArgs: varint?, |
| args: block_argument[] |
| numUseListOrders: varint?, |
| useListOrders: uselist[], |
| } |
| |
| block_argument { |
| typeAndLocation: varint, // (type << 1) | (hasLocation) |
| location: varint? // Optional, else unknown location |
| } |
| ``` |
| |
| A block is encoded with an array of operations and block arguments. The first |
| field is an encoding that combines the number of operations in the block, with a |
| flag indicating if the block has arguments. |
| |
| Use-list orders are attached to block arguments similarly to how they are |
| attached to operation results. |