This document is intended to capture some of the alternatives considered and open debates in the design of MLIR, along with the rationale for certain decisions we made. This is not intended to be a “finely groomed” document - we prefer the ability to dump in interesting tidbits without worrying too much about their consistency or readability.
MLIR is a compiler intermediate representation with similarities to traditional three-address SSA representations (like LLVM IR or SIL), but which introduces notions from the polyhedral loop optimization works as first class concepts. This hybrid design is optimized to represent, analyze, and transform high level dataflow graphs as well as target-specific code generated for high performance data parallel systems. Beyond its representational capabilities, its single continuous design provides a framework to lower from dataflow graphs to high performance target specific code.
MLIR stands for one of “Multi-Level IR” or “Multi-dimensional Loop IR” or “Machine Learning IR” or “Mid Level IR”, we prefer the first. This document only provides the rationale behind MLIR -- its actual specification document and other content is hosted elsewhere.
The Multi-Level Intermediate Representation (MLIR) is intended for easy expression and optimization of computations involving deep loop nests and dense matrices of high dimensionality. It is thus well-suited to deep learning computations in particular. Yet it is general enough to also represent arbitrary sequential computation. The representation allows high-level optimization and parallelization for a wide range of parallel architectures including those with deep memory hierarchies --- general-purpose multicores, GPUs, and specialized neural network accelerators.
MLIR uses ideas drawn from IRs of LLVM and Swift for lower level constructs while combining them with ideas from the polyhedral abstraction to represent loop nests, multidimensional data (tensors), and transformations on these entities as first class concepts in the IR.
MLIR is a multi-level IR, i.e., it represents code at a domain-specific representation such as HLO or TensorFlow graphs, all the way down to the machine level. MLIR is able to represent arbitrary control flow and arbitrary data accesses, and is general enough to represent nearly all sequential computation. This is a key distinction from existing polyhedral representation implementations (such as LLVM Polly) that are able to use the polyhedral abstraction in a way isolated from the LLVM IR and only for affine loop nests, i.e., portions of the code where array accesses, loop bounds, and conditionals are regular (involve linear functions of loop iterators and constant symbols). The presence of statically unpredictable data accesses or control flow does not preclude representation in MLIR, but only limits to a certain extent the ability to reason about and apply transformations using the polyhedral abstraction.
Maps, sets, and relations with affine constraints are the core structures underlying a polyhedral representation of high-dimensional loop nests and multidimensional arrays. These structures are represented as textual expressions in a form close to their mathematical form. These structures are used to capture loop nests, tensor data structures, and how they are reordered and mapped for a target architecture. All structured or “conforming” loops are captured as part of the polyhedral information, and so are tensor variables, their layouts, and subscripted accesses to these tensors in memory.
The information captured in the IR allows a compact expression of all loop transformations, data remappings, explicit copying necessary for explicitly addressed memory in accelerators, mapping to pre-tuned expert-written primitives, and mapping to specialized vector instructions. Loop transformations that can be easily implemented include the body of affine transformations: these subsume all traditional loop transformations (unimodular and non-unimodular) such as loop tiling, interchange, permutation, skewing, scaling, relative shifting, reversal, fusion, and distribution/fission. Transformations on data layout such as padding and transforming to blocked layouts are also represented well via affine layout maps.
MLIR‘s design allows a progressive lowering to target-specific forms. Besides high-level transformations for loop nests and data layouts that a typical mid-level optimizer is expected to deal with, MLIR is also designed to perform certain low-level scheduling and mapping decisions that a typical backend IR is entrusted with: these include mapping to specialized vector instructions, auto-vectorization, and software pipelining. The need to support these transformations stems from the fact that neural network accelerators have specialized units that deal with large chunks of data whose computation maps back to chunks of more than one loop of the loop nests as viewed by a program at a level closer to the original specification. Such specialized units or instructions operate on multidimensional data chunks from a programmer’s viewpoint. It thus makes it hard or infeasible for a backend operating on a very low-level IR close to assembly to lift and reconstruct loops and perform such a mapping. This is in contrast to classic instruction selection and scheduling in today's compilers that primarily only deals with the body of the innermost loop. MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor libraries operating on data at higher levels (or at the highest level) of the memory hierarchy.
In summary, MLIR is convenient for and closed under the kind of transformations needed to lower to general-purpose as well as specialized accelerators. It also allows one to build modular and reusable target independent and target dependent passes.
This section sheds light on some of the design decisions -- some of these are indirectly implied by the specification document.
The ‘load’ and ‘store’ instructions are specifically crafted to fully resolve to an element of a memref. These instructions take as arguments n+1 indices for an n-ranked tensor. This disallows the equivalent of pointer arithmetic or the ability to index into the same memref in other ways (something which C arrays allow for example). Furthermore, for the affine constructs, the compiler can follow use-def chains (e.g. through affine.apply operations or through the map attributes of affine operations) to precisely analyze references at compile-time using polyhedral techniques. This is possible because of the restrictions on dimensions and symbols.
A scalar of element-type (a primitive type or a vector type) that is stored in memory is modeled as a 0-d memref. This is also necessary for scalars that are live out of for loops and if conditionals in a function, for which we don't yet have an SSA representation -- an extension to allow that is described later in this doc.
The current MLIR disallows use of symbols in types. For example, when a tensor or memref dimension is statically unknown, it is denoted in the type as ‘?’. An SSA symbol is then bound to it when a memref is created. The actual value of the unknown dimension can be queried using the “dim” builtin as shown below.
Example:
func foo(...) { %A = memref.alloc <8x?xf32, #lmap> (%N) ... call bar(%A) : (memref<8x?xf32, #lmap>) } func bar(%A : memref<8x?xf32, #lmap>) { // Type of %A indicates that %A has dynamic shape with 8 rows // and unknown number of columns. The number of columns is queried // dynamically using dim instruction. %N = memref.dim %A, 1 : memref<8x?xf32, #lmap> affine.for %i = 0 to 8 { affine.for %j = 0 to %N { // A[i,j] += 1 %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap> %s2 = add %s1, 1 affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap> } } return }
An alternative design is to embed the reference to symbols directly in the type - memref<8x%Nxf32>. We went for the current approach in MLIR because it simplifies the design --- types remain immutable when the values of symbols change.
MLIR Regions represent SSA using “block arguments” rather than PHI instructions used in LLVM. This choice is representationally identical (the same constructs can be represented in either form) but block arguments have several advantages:
For more context, block arguments were previously used in the Swift SIL Intermediate Representation, and described in a talk on YouTube. The section of interest starts here.
Index types are intended to be used for platform-specific “size” values and may appear in subscripts, sizes of aggregate types and affine expressions. They are also tightly coupled with affine.apply
and affine.load/store operations; having index
type is a necessary precondition of a value to be acceptable by these operations.
We allow index
types in tensors, vectors, and memrefs as a code generation strategy has to map index
to an implementation type and hence needs to be able to materialize corresponding values. However, the target might lack support for vector
values with the target specific equivalent of the index
type.
Data layout information such as the bit width or the alignment of types may be target and ABI-specific and thus should be configurable rather than imposed by the compiler. Especially, the layout of compound or index
types may vary. MLIR specifies default bit widths for certain primitive types, in particular for integers and floats. It is equal to the number that appears in the type definition, e.g. the bit width of i32
is 32
, so is the bit width of f32
. The bit width is not necessarily related to the amount of memory (in bytes) or the register size (in bits) that is necessary to store the value of the given type. For example, vector<3xi57>
is likely to be lowered to a vector of four 64-bit integers, so that its storage requirement is 4 x 64 / 8 = 32
bytes, rather than (3 x 57) ceildiv 8 = 22
bytes as can be naively computed from the bit width. MLIR makes such data layout information configurable using attributes that can be queried during lowering, for example, when allocating a compound type.
The data layout of dialect-specific types is undefined at MLIR level. Yet dialects are free to define their own quantities and make them available via the data layout infrastructure.
Integers in the builtin MLIR type system have a bitwidth (note that the index
type has a symbolic width equal to the machine word size), and they may additionally have signedness semantics. The purpose is to satisfy the needs of different dialects, which can model different levels of abstractions. Certain abstraction, especially closer to source language, might want to differentiate signedness with integer types; while others, especially closer to machine instruction, might want signless integers. Instead of forcing each abstraction to adopt the same integer modelling or develop its own one in house, Integer type provides this as an option to help code reuse and consistency.
For the standard dialect, the choice is to have signless integer types. An integer value does not have an intrinsic sign, and it‘s up to the specific op for interpretation. For example, ops like arith.addi
and arith.muli
do two’s complement arithmetic, but some other operations get a sign, e.g. arith.divsi
vs arith.divui
.
LLVM uses the same design, which was introduced in a revamp rolled out in the LLVM 2.0 integer type. Prior to that, from LLVM 1.0 to 1.9, LLVM uses signed types like “sbyte” and “ubyte”. This shift was important and has served LLVM well over the years. The reason this is important is that it is a good thing for an intermediate representation to represent the same computation with the same instruction. Signed types got in the way, because (e.g.) an “add of an sbyte” does the same computation as an “add of a ubyte”, but the type system made them look artificially different. This split also required casts like “cast from sbyte to ubyte” which do nothing at the machine level. Removing signs from the type system eliminated these problems, making the compiler simpler.
More information about this split is available in an old talk on youtube talking about LLVM 2.0.
Note that this rationale only applies to the “standard ops” dialect in which we can express an opinion about its design. Other dialects generally try to model an external system, and should aim to reflect its design as closely as possible.
The MLIR “Arithmetic” dialect splits many integer and floating point operations into different categories, for example arith.addf
vs arith.addi
and arith.cmpf
vs arith.cmpi
(following the design of LLVM). These instructions are polymorphic on the number of elements in the type though, for example addf
is used with scalar floats, vectors of floats, and tensors of floats (LLVM does the same thing with its scalar/vector types).
This split is important because floating point and integer operations are quite different in practice: for example, floating point values include NaN‘s, so integer comparisons and floating point comparisons should use different comparison opcodes. On the arithmetic side of things, floating point operations support rounding modes, floating point contractions, “fast math”, and integers may want to have two’s complement overflow behavior or be undefined on various forms of wrapping for performance.
We are a long way from this sort of thing being a priority to care about in MLIR, but since we have experience and know the right way to do this, we'd rather design it in from the beginning.
Note that this rationale only applies to the “standard ops” dialect in which we can express an opinion about its design. Other dialects generally try to model an external system, and should aim to reflect its design as closely as possible.
Since integers are signless, it is necessary to define the sign for integer comparison operations. This sign indicates how to treat the foremost bit of the integer: as sign bit or as most significant bit. For example, comparing two i4
values 0b1000
and 0b0010
yields different results for unsigned (8 > 3
) and signed (-8 < 3
) interpretations. This difference is only significant for order comparisons, but not for equality comparisons. Indeed, for the latter all bits must have the same value independently of the sign. Since both arguments have exactly the same bit width and cannot be padded by this operation, it is impossible to compare two values whose bit representations would differ while the values are interpreted as equal.
Unlike arithmetic, comparison operators share several common properties, e.g. they cannot be considered associative. In practice, comparisons are sometimes implemented by the same instruction or its variants so it makes sense to group them together at the IR level.
An alternative would be introducing ten distinct operators for all currently supported kinds of integer comparisons. These operators would have increased the number of “reserved” names used by standard operations as well as the size of the C++ API while their implementations would have been mostly identical.
The comparison kind is internally an integer attribute. However, for the sake of readability by humans, custom assembly form accepts string literals that are mapped to the underlying integer values: cmpi "eq", %lhs, %rhs
better implies integer equality comparison than cmpi 0, %lhs, %rhs
where it is unclear what gets compared to what else. This syntactic sugar is possible thanks to parser logic redefinitions for custom assembly form of non-builtin operations. Supporting it in the full notation would have required changing how the main parsing algorithm works and may have unexpected repercussions. While it had been possible to store the predicate as string attribute, it would have rendered impossible to implement switching logic based on the comparison kind and made attribute validity checks (one out of ten possible kinds) more complex.
We considered representing regions through ArrayAttr
s containing a list of a special type IRBlockAttr
, which in turn would contain a list of operations. All attributes in MLIR are unique’d within the context, which would make the IR inside the regions immortal for no good reason.
We considered attaching a “force-inline” attribute on a function and/or a function call
operation. Even the minimal region support (use cases in affine.for and affine.if existing before the regions) requires access to the values defined in the dominating block, which is not supported by functions. Conceptually, function bodies are instances of regions rather than the inverse; regions can also be device kernels, alternative sections, etc.
region
operationThis would mean we have a special kind of operation that is allowed to have regions while other operations are not. Such distinction is similar to the Stmt/Op difference we have had and chose to remove to make the IR simpler and more flexible. It would also require analyses and passes to consider the interplay between operations (e.g., an affine.for
operation must be followed by a region operation). Finally, a region operation can be introduced using the current implementation, among other operations and without being special in any sense.
Being able to use values defined outside the region implies that use-def chains may contain uses from different nested regions. Consequently, IR transformations and analyses can pull the instruction defining the value across region boundaries, for example in case of TableGen-defined canonicalization patterns. This would not be the case if all used values had been passed as region arguments. One of the motivations for introducing regions in the IR is precisely to enable cross-region analyses and transformations that are simpler than inter-procedural transformations. Having uses from different regions appear in the same use-def chain, contrary to an additional data structure maintaining correspondence between function call arguments as uses of the original definitions and formal arguments as new definitions, enables such simplification. Since individual operations now belong to blocks, which belong to regions, it is always possible to check if the definition of the value belongs to the same region as its particular use. The risk is that any IR traversal will need to handle explicitly this situation and it is easy to forget a check (or conversely it isn’t easy to design the right check in a tablegen pattern for example): traversing use-def chains potentially crosses implicitly semantic barriers, making it possible to unknowingly break region semantics. This is expected to be caught in the verifier after the transformation.
At the same time, one may choose to pass certain or all values as region arguments to explicitly break the use-def chains in the current proposal. This can be combined with an attribute-imposed semantic requirement disallowing the body of the region to refer to any value from outside it.
This section describes the design decisions that shaped the dialect extensible type system present in MLIR.
There are two different interactions between dialects that are important to understand. When types of a dialect are:
In operations of other dialects
In types of other dialects
Following the separation between the built-in and standard dialect, it makes sense to separate built-in types and standard dialect types. Built-in types are required for the validity of the IR itself, e.g. the function type (which appears in function signatures and generic assembly forms of operations). Integer, float, vector, memref and tensor types, while important, are not necessary for IR validity.
MLIR supports unregistered operations in generic assembly form. MLIR also supports a similar concept for types. When parsing, if the dialect for dialect type has not been registered the type is modeled as an ‘OpaqueType’. This allows for types to be round-tripped without needing to link in the dialect library that defined them. No additional information about opaque types, outside of parsing/printing, will be available.
Dialect extended types are represented as string literals wrapped inside of the dialect namespace. This means that the parser delegates to the dialect for parsing specific type instances. This differs from the representation of dialect defined operations, of which have an identifier name that the parser uses to identify and parse them.
This representation was chosen for several reasons:
Dialect type parsing cannot plug into the existing parser infrastructure as operations do with the OpAsmParser/Printer. Operations have a defined syntax structure that is the same across all dialects. Types, on the other hand, may have many different, and sometimes conflicting, parsing constraints that would be difficult/unmaintainable to provide within a single interface.
This also has the added benefit of encouraging dialects to reuse existing external type parsers. For example, an LLVM dialect may provide an MLIR LLVM type that is simply a wrapper around LLVM types. The LLVM dialect would then use the existing LLVM type parsing infrastructure.
Example:
%s = "foo"() : () -> !llvm<"i32*">
Unlike operations, types generally do not have a formal canonical name. For example, function types have no defined keyword and integer types are defined by a regular expression to support arbitrary bitwidth. Dialects with existing type systems, e.g. LLVM, are likely to provide wrappers around their existing type systems. For these wrapper types there is no simple canonical name, it's logical to think of these types as existing within the namespace of the dialect. If a dialect wishes to assign a canonical name to a type, it can be done via type aliases.
The MLIR type system provides first class support for defining tuple types. This is due to the fact that Tuple
represents a universal concept that is likely to, and has already begun to, present itself in many different dialects. Though this type is first class in the type system, it merely serves to provide a common mechanism in which to represent this concept in MLIR. As such, MLIR provides no standard operations for interfacing with tuple
types. It is up to dialect authors to provide operations, e.g. extract_tuple_element, to interpret and manipulate them. When possible, operations should prefer to use multiple results instead. These provide a myriad of benefits, such as alleviating any need for tuple-extract operations that merely get in the way of analysis and transformation.
MLIR decides to support both generic and custom assembly forms under the following considerations:
MLIR is an open system; it is designed to support modular and pluggable dialects. Depending on whether there exists a corresponding dialect and whether the dialect is plugged in, operations may or may not be registered into MLIR system. Yet we still need a way to investigate these operations. So the generic assembly form is mandated by this aspect of MLIR system. It provides a default textual form for operations.
On the other hand, an assembly form is for assisting developers to investigate the IR. The generic form serves as a safe fallback but it can be too verbose for certain ops. Therefore, MLIR gives each dialect the choice to define a custom assembly form for each operation according to the operation's semantics and specific needs. The custom assembly form can de-duplicate information from the operation to derive a more concise form, thus better facilitating the comprehension of the IR.
This section describes a few very simple examples that help understand how MLIR represents computation.
// A simple linear search in every row of a matrix for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { // dynamic control flow if (a[i][j] == key) { s[i] = j; break; } } }
The presence of dynamic control flow leads to an inner non-affine function nested in an outer function that uses affine loops.
func @search(%A: memref<?x?xi32>, %S: <?xi32>, %key : i32) { %ni = memref.dim %A, 0 : memref<?x?xi32> // This loop can be parallelized affine.for %i = 0 to %ni { call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32) } return } func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i : i32) { %nj = memref.dim %A, 1 : memref<?x?xi32> br ^bb1(0) ^bb1(%j: i32) %p1 = arith.cmpi "lt", %j, %nj : i32 cond_br %p1, ^bb2, ^bb5 ^bb2: %v = affine.load %A[%i, %j] : memref<?x?xi32> %p2 = arith.cmpi "eq", %v, %key : i32 cond_br %p2, ^bb3(%j), ^bb4 ^bb3(%j: i32) affine.store %j, %S[%i] : memref<?xi32> br ^bb5 ^bb4: %jinc = arith.addi %j, 1 : i32 br ^bb1(%jinc) ^bb5: return }
As per the MLIR spec, the restrictions on dimensions and symbol identifiers to be used with the affine.apply operation only apply to accesses inside affine.for
and affine.if
operations. However, an analysis of accesses inside the called function (@search_body
) is necessary to determine if the %i
loop could be parallelized: such function access analysis is calling context sensitive.
Loop bounds that are not affine lead to a nesting of functions as shown below.
for (i = 0; i < N; i++) for (j = 0; j < N; j++) // Non-affine loop bound for k loop. for (k = 0; k < pow(2, j); k++) for (l = 0; l < N; l++) { // block loop body ... }
func @outer_nest(%n : index) { affine.for %i = 0 to %n { affine.for %j = 0 to %n { %pow = call @pow(2, %j) : (index, index) -> index call @inner_nest(%pow, %n) : ... } } return } func @inner_nest(%m : index, %n : index) { affine.for %k = 0 to %m { affine.for %l = 0 to %n { ... } } return }
The following example illustrates a reference implementation of a 2D convolution, which uses an integer set #domain
to represent valid input data in a dilated convolution.
// Dilation factors S0 and S1 can be constant folded if constant at compile time. #domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0, S3 - d0 - 1 >= 0, S4 - d1 - 1 >= 0) // Identity map (shown here for illustration). #map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6) // Affine map from output to input coordinate space. // d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w // S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation // S4 = h_pad_low, S5 = w_pad_low // %out0 = %0#1 * %h_stride + %0#4 * %h_kernel_dilation - %h_pad_low // %out1= %0#2 * %w_stride + %0#5 * %w_kernel_dilation - %w_pad_low #map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - %S4) #map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - %S5) // Semi-affine map to undilated input coordinate space. // d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation. #map2_0 = (d0, d1) [S0, S1] -> (d0 / S0) #map2_1 = (d0, d1) [S0, S1] -> (d1 / S1) // Conv2D shapes: // input: [batch, input_height, input_width, input_feature] // kernel: [kernel_height, kernel_width, input_feature, output_feature] // output: [batch, output_height, output_width, output_feature] func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>, %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>, %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) { affine.for %b = 0 to %batch { affine.for %oh = 0 to %output_height { affine.for %ow = 0 to %output_width { affine.for %of = 0 to %output_feature { affine.for %kh = 0 to %kernel_height { affine.for %kw = 0 to %kernel_width { affine.for %if = 0 to %input_feature { // Calculate input indices. %1_0 = affine.apply #map1_0 (%0#1, %0#2, %0#4, %0#5) [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation, %h_pad_low, %w_pad_low] %1_1 = affine.apply #map1_1 (%0#1, %0#2, %0#4, %0#5) [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation, %h_pad_low, %w_pad_low] // Check if access is not in padding. affine.if #domain(%1_0, %1_1) [%h_base_dilation, %w_kernel_dilation, %h_bound, %w_bound] { %2_0 = affine.apply #map2 (%1_0, %1_1) %2_1 = affine.apply #map2 (%1_0, %1_1) // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices] call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1) } } } } } } } } return }
TODO: (Add more examples showing the IR for a variety of interesting cases)
This is a list of some design alternatives and extensions that we discussed in detail but did not include in the spec or postponed them for future consideration on demand. We will revisit these discussions when we have more implementation experience and learn more about the challenges and limitations of our current design in practice.
The current MLIR uses a representation of polyhedral schedules using a tree of if/for loops. We extensively debated the tradeoffs involved in the typical unordered polyhedral instruction representation (where each instruction has multidimensional schedule information), discussed the benefits of schedule tree forms, and eventually decided to go with a syntactic tree of affine if/else conditionals and affine for loops. Discussion of the tradeoff was captured in this document: MLIR: The case for a simplified polyhedral form.
At a high level, we have two alternatives here:
This representation is based on a simplified form of the domain/schedule representation used by the polyhedral compiler community. Domains represent what has to be executed while schedules represent the order in which domain elements are interleaved. We model domains as non-piece-wise convex integer sets, and schedules as affine functions; however, the former can be disjunctive, and the latter can be piece-wise affine relations. In the schedule tree representation, domain and schedules for instructions are represented in a tree-like structure which is called a schedule tree. Each non-leaf node of the tree is an abstract polyhedral dimension corresponding to an abstract fused loop for each ML instruction that appears in that branch. Each leaf node is an ML Instruction.
// A tiled matmul code (128x128x128) represented in schedule tree form // #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5) #intset_ij = (i, j) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N-1 >= 0 #intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0, -j + M-1 >= 0, k >= 0, -k + N - 1 >= 0) func @matmul(%A, %B, %C, %M, %N, %K) : (...) { // %M, N, K are symbols // t1, t2, t3, t4, t5, t6 are abstract polyhedral loops mldim %t1 : {S1,S2,S3,S4,S5} floordiv (i, 128) { mldim %t2 : {S1,S2,S3,S4,S5} floordiv (j, 128) { // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2) call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K) with @intset_ij(%i, %j) [%M, %N, %K] mldim %t3 : {S2,S3,S4,S5} floordiv (k, 128) { // (%i, %j, %k) = affine.apply (d0, d1, d2) // -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3) call dma_mem_to_scratchpad(%A, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K] // (%i, %j, %k) = affine.apply (d0, d1, d2) // -> (128*d0, 128*d1, 128*d2) (%t1, %t2, %t3) call dma_mem_to_scratchpad(%B, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K] mldim %t4 : {S4} i mod 128 { mldim %t5 : {S4} j mod 128 { mldim %t6 : {S4} k mod 128 { // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6) call matmul_body(A, B, C, %i, %j, %k, %M, %N, %K) with #inset_ijk(%i, %j, %k) [%M, %N, %K] } // end mld4im t6 } // end mldim t5 } // end mldim t4 } // end mldim t3 // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2) call $dma_scratchpad_to_mem_C ... with #intset(%i, %j) [%M, %N, %K] } // end mldim t2 } // end mldim t1 return }
The current MLIR spec includes affine maps and integer sets, but not affine relations. Affine relations are a natural way to model read and write access information, which can be very useful to capture the behavior of external library calls where no implementation is available, high-performance vendor libraries, or user-provided / user-tuned routines.
An affine relation is a relation between input and output dimension identifiers while being symbolic on a list of symbolic identifiers and with affine constraints on the identifiers.
Syntax:
// Affine relation definition at the top of file affine-rel-def ::= affine-rel-id `=` affine-relation-inline affine-rel-id ::= `##` prefixed-id affine-relation-inline ::= `(` input-dims `)` (`[` symbols `]`)? `->` `(` output-dims `)` : affine-constraint-conjunction input-dims ::= bare-id-list output-dims ::= bare-id-list symbols ::= bare-id-list affine-rel ::= affine-rel-id | affine-relation-inline // Usage affine-rel-spec ::= affine-rel dim-and-symbol-use-list
All identifiers appearing in input-dims, output-dims, and symbol-dims are pairwise distinct. All affine-constraint non-terminals in the above syntax are allowed to contain identifiers only from input-dims, output-dims, and symbol-dims.
Affine relations are used to model read, write, may_read, and may_write sets of functions in the IR. The output dimension identifiers correspond to the data dimensions.
Example:
// read relation: two elements ( d0 <= r0 <= d0+1 ) ##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0 func @count (%A : memref<128xf32>, %pos : i32) -> f32 reads: {%A ##aff_rel9 (%pos)} writes: /* empty */ may_reads: /* empty */ may_writes: /* empty */ { bb0 (%0, %1: memref<128xf32>, i64): %val = affine.load %A [%pos] %val = affine.load %A [%pos + 1] %p = arith.mulf %val, %val : f32 return %p : f32 }
MLIR supports values of a Function type. Instead of having first-class IR concept for functions, one could define an operation with a body region that defines a function value. The particularity of functions is that their names are globally visible and can be referred to before being defined, unlike SSA values that must be defined first. Implementing a “function definition” operation would require to relax some of the SSA constraints in a region, and also make the IR Module a region as well. It would also affect the core infrastructure (e.g., function passes) only for the sake of concept unification.
Instead of inspecting the types of arguments of the first block, one could give the region itself a type. This type would be redundant with block argument types, which must have values and create room for type mismatches. While functions do have types that are partly redundant with the arguments of the first block in the function, this is necessary to support function declarations that do not have a body which we can refer to in order to obtain the argument types. A region is always contained in an operation or a function that can be queried to obtain the “type” of the region if necessary.
A type on a region can be justified if Regions were to be considered separately from the enclosing entity (operation or function) and had their own semantics that should be checked.
Regions could be annotated with dialect attributes to use attribute verification hooks. An operation could take multiple regions as arguments, and each of them may require different attributes. However, there are currently very few practical cases where this would be necessary. Instead, one could simulate per-region attributes with array attributes attached to the entity containing the region (operation or function). This decreases the overall complexity of the IR and enables more concise and op-specific forms, e.g., when all regions of an op have the same attribute that can be only mentioned once. Since the semantics of the region is entirely defined by the enclosing entity, it also makes sense to have attributes attached to that entity rather than to the region itself.
This can be reconsidered in the future if we see a non-neglectable amount of use cases.
Having read, write, may_read, and may_write sets for external functions which include opaque ones, high-performance vendor libraries such as CuDNN, CuB, MKL, FFT libraries, user-provided/optimized functions, or data movement runtimes such as DMA ones is a powerful feature. It allows the compiler to perform analysis, composition/transformation in the presence of such calls and with loops around such calls on sub-tensors. For user-provided or custom hand-tuned functions, the read/write/may_read/may_write sets could be provided a-priori by a user as part of the external function signature or they could be part of a database.
TODO: Design this, and update to use function attribute syntax.
Example:
##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1 func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>) -> f32 [ reads: {%M, ##rel9() } writes: /* empty */ may_reads: /* empty */ may_writes: /* empty */ ] func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>, %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32, #layout_map0>) [ reads: {%M, ##rel9() } writes: /* empty */ may_reads: /* empty */ may_writes: /* empty */ ]
Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor dimensions where there is symmetry: use integer set (affine constraints) to model tensor data space (instead of just extents). Requires some changes to the IR and the in-memory form.
Layout maps
Proposal 2(a) requires non-trivial changes to the IR and the in-memory representation. 2(b) requires no change, but impacts how cost models look at index and layout maps.
affine.if
and affine.for
Extensions for “Escaping Scalars”We considered providing a representation for SSA values that are live out of if/else
conditional bodies and loop carried in affine.for
loops. We ultimately abandoned this approach due to its complexity. In the current design of MLIR, scalar variables cannot escape for loops or if instructions. In situations, where escaping is necessary, we use zero-dimensional tensors and memrefs instead of scalars.
TODO: This whole section is obsolete and should be updated to use block arguments and a yield like terminator in for/if instructions.
The abandoned design of supporting escaping scalars is as follows:
Syntax:
[<out-var-list> =] for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step> [with <in-var-list>] { <loop-instruction-list> }
out-var-list is a comma separated list of SSA values defined in the loop body and used outside the loop body. in-var-list is a comma separated list of SSA values used inside the loop body and their initializers. loop-instruction-list is a list of instructions that may also include a yield instruction.
Example:
// Return sum of elements in 1-dimensional mref A func i32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) { %init = 0 %result = affine.for %i = 0 to N with %tmp(%init) { %value = affine.load %A[%i] %sum = %value + %tmp yield %sum } return %result : i32 }
Syntax:
<out-var-list> = affine.if (<cond-list>) {...} [else {...}]
Out-var-list is a list of SSA values defined by the if-instruction. The values are arguments to the yield-instruction that occurs in both then and else clauses when else clause is present. When if instruction contains only if clause, the escaping value defined in the then clause should be merged with the value the variable had before the if instruction. The design captured here does not handle this situation.
Example:
// Compute sum of half of the array func i32 @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) { %s0 = 0 %s1 = affine.for %i = 1 ... N step 1 with %s2 (%s0) { %s3 = if (%i >= %N / 2) { %v0 = affine.load %A[%i] %s4 = %s2 + %v0 yield %s4 } yield %s3 } return %s1 : i32 }
People want compilers to go fast, and one simple way to do that is to multi-thread them. There are multiple strategies for this, but a simple one is to optimize and compile separate functions in parallel. LLVM's original pass manager anticipated this demand, and the CallGraphSCCPass manager is even designed to support this as well, but unfortunately, a few early design decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO are forced to split programs into separate LLVM modules/context and optimize those chunks independently.
The problem is that LLVM has several objects in its IR that are globally uniqued and also mutable: notably constants like i32 0
. In LLVM, these constants are Value
‘s, which allow them to be used as operands to instructions, and that they also have SSA use lists. Because these things are uniqued, every i32 0
in any function shares a use list. This means that optimizing multiple functions in parallel won’t work (at least without some sort of synchronization on the use lists, which would be unbearably inefficient).
MLIR now supports a multithreaded pass manager. We do this through several design choices:
This allows MLIR function passes to support efficient multithreaded compilation and code generation.