Bufferization in MLIR is the process of converting ops with tensor
semantics to ops with memref
semantics. MLIR provides an infrastructure that bufferizes an entire program in a single pass (One-Shot Bufferize). This infrastructure bufferizes all ops that implement the BufferizableOpInterface
can be bufferized.
MLIR has an older bufferization infrastructure built around dialect conversion. Most dialect conversion bufferization patterns have been migrated to One-Shot Bufferize, but some functionality such as function boundary bufferization still depends on dialect conversion and its type converter. New projects should use One-Shot Bufferize, as the dialect conversion-based bufferization will eventually be deprecated. Moreover, One-Shot Bufferize results in better bufferization with fewer memory allocations and buffer copies. This documentation is mostly about One-Shot Bufferize, but also describes how to gradually migrate a project from dialect conversion-based bufferization to One-Shot Bufferize.
One-Shot Bufferize is a new tensor bufferization pass designed for IR in destination-passing style, and with aggressive in-place bufferization.
One-Shot Bufferize is:
Monolithic: A single MLIR pass does the entire work, whereas the previous bufferization in MLIR was split across multiple passes residing in different dialects. In One-Shot Bufferize, BufferizableOpInterface
implementations are spread across different dialects.
A whole-function at a time analysis. In-place bufferization decisions are made by analyzing SSA use-def chains on tensors. Op interface implementations not only provide the rewrite logic from tensor ops to memref ops, but also helper methods for One-Shot Bufferize‘s analysis to query information about an op’s bufferization/memory semantics.
Extensible via an op interface: All ops that implement BufferizableOpInterface
can be bufferized.
2-Pass: Bufferization is internally broken down into 2 steps: First, analyze the entire IR and make bufferization decisions. Then, bufferize (rewrite) the IR. The analysis has access to exact SSA use-def information. It incrementally builds alias and equivalence sets and does not rely on a posteriori-alias analysis from preallocated memory.
Greedy: Operations are analyzed one-by-one and it is decided on the spot whether a tensor OpOperand must be copied or not. Heuristics determine the order of analysis.
Modular: The current One-Shot Analysis can be replaced with a different analysis. The result of the analysis are queried by the bufferization via AnalysisState
, in particular AnalysisState::isInPlace
. Any derived class of AnalysisState
that implements a small number virtual functions can serve as a custom analysis. It is even possible to run One-Shot Bufferize without any analysis (AlwaysCopyAnalysisState
), in which case One-Shot Bufferize behaves exactly like the old dialect conversion-based bufferization (i.e., copy every buffer before writing to it).
To reduce complexity, One-Shot Bufferize should be run after other transformations, typically as one of the last steps right before lowering memref ops. Many transformations are easier in tensor land; e.g., tile/fuse/… on tensors first, then bufferize the remaining IR.
From an architecture perspective, One-Shot Bufferize consists of BufferizableOpInterface (and its implementations) and an analysis of tensor SSA values that decides if a buffer can be used directly or must be copied. The [bufferize] method of the op interface inspects analysis results and rewrites tensor ops into memref ops.
The high-level goal of every bufferization technique is to: 1. Use as little memory as possible. 2. Copy as little memory as possible.
This implies reusing already allocated buffers when possible, turning bufferization into an algorithmically complex problem with similarities to register allocation.
Depending on the concrete use case, there may be additional bufferization requirements. If the contents of a buffer are expensive to compute, there could be a tradeoff between recomputation and compute once and copy. On the contrary, it may not even be possible to allocate new buffers at runtime on some architectures.
Bufferization is an algorithmically complex problem. Given an op with a tensor result, bufferization has to choose a memref buffer in which the result can be stored. It is always safe to allocate a brand new buffer, but such a bufferization strategy would be unacceptable for high-performance codegen. When choosing an already existing buffer, we must be careful not to accidentally overwrite data that is still needed later in the program.
To simplify this problem, One-Shot Bufferize was designed to take advantage of destination-passing style. This form exists in itself independently of bufferization and is tied to SSA semantics: many ops are “updating” part of their input SSA variable. For example the LLVM instruction insertelement
is inserting an element inside a vector. Since SSA values are immutable, the operation returns a copy of the input vector with the element inserted. Another example in MLIR is linalg.generic
, which always has an extra outs
operand which provides the initial values to update (for example when the operation is doing a reduction).
This input is referred to as “destination” in the following (quotes are important as this operand isn't modified in place but copied) and comes into place in the context of bufferization as a possible “anchor” for the bufferization algorithm. This allows the user to shape the input in a form that guarantees close to optimal bufferization result when carefully choosing the SSA value used as “destination”.
For every tensor result, a “destination-passing” style op has a corresponding tensor operand. If there aren't any other uses of this tensor, the bufferization can alias it with the op result and perform the operation “in-place” by reusing the buffer allocated for this “destination” input.
As an example, consider the following op: %0 = tensor.insert %cst into %t[%idx] : tensor<?xf32>
%t
is the “destination” in this example. When choosing a buffer for the result %0
, denoted as buffer(%0)
, One-Shot Bufferize considers only two options:
buffer(%0) = buffer(%t)
: alias the “destination” tensor with the result and perform the operation in-place.buffer(%0)
is a newly allocated buffer.There may be other buffers in the same function that could potentially be used for buffer(%0)
, but those are not considered by One-Shot Bufferize to keep the bufferization simple. One-Shot Bufferize could be extended to consider such buffers in the future to achieve a better quality of bufferization.
Tensor ops that are not in destination-passing style always bufferized to a memory allocation. E.g.:
%0 = tensor.generate %sz { ^bb0(%i : index): %cst = arith.constant 0.0 : f32 tensor.yield %cst : f32 } : tensor<?xf32>
The result of tensor.generate
does not have a “destination” operand, so bufferization allocates a new buffer. This could be avoided by choosing an op such as linalg.generic
, which can express the same computation with a “destination” operand, as specified behind outputs (outs
):
#map = affine_map<(i) -> (i)> %0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} outs(%t : tensor<?xf32>) { ^bb0(%arg0 : f32): %cst = arith.constant 0.0 : f32 linalg.yield %cst : f32 } -> tensor<?xf32>
At first glance, the above linalg.generic
op may not seem very useful because the output tensor %t
is entirely overwritten. Why pass the tensor %t
as an operand in the first place? As an example, this can be useful for overwriting a slice of a tensor:
%t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32> %0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32> %1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1] : tensor<?xf32> into tensor<?xf32>
The above example bufferizes to a memref.subview
, followed by a “linalg.generic
on memrefs” that overwrites the memory of the subview, assuming that the slice %t
has no other user. The tensor.insert_slice
then bufferizes to a no-op (in the absence of RaW conflicts such as a subsequent read of %s
).
RaW conflicts are detected with an analysis of SSA use-def chains (details later). One-Shot Bufferize works best if there is a single SSA use-def chain, where the result of a tensor op is the operand of the next tensor ops, e.g.:
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) %2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
Buffer copies are likely inserted if the SSA use-def chain splits at some point, e.g.:
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) %2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
One-Shot Bufferize has debug flags (test-analysis-only print-conflicts
) that print the results of the analysis and explain to the user why buffer copies were inserted.
MLIR provides a pass -one-shot-bufferize
that performs an analysis and bufferizes all ops with tensor semantics that implement BufferizableOpInterface
. For modularity reasons, these op interface implementations are typically external models that live in a dialect‘s “Transforms” build unit. (External models are a mechanism for implementing an op interface in a different build unit.) It is the user’s responsibility to ensure that all needed external models are registered before running One-Shot Bufferize.
By default, One-Shot Bufferize fails when it encounters an op with tensor semantics (i.e., tensor result or tensor operand) that is not bufferizable (i.e., does not implement BufferizableOpInterface
). This can be avoided with allow-unknown-ops
. In that case, One-Shot Bufferize inserts to_memref
/to_tensor
ops around the bufferization boundary. These ops are named versions of unrealized_conversion_cast
. Note that One-Shot Bufferize's analysis can currently not analyze these ops, so input IR with such ops may fail bufferization. Therefore, running One-Shot Bufferize multiple times in a sequence is also not supported at the moment.
One-Shot Bufferize can be configured to bufferize only ops from a set of dialects with dialect-filter
. This can be useful for gradually migrating from dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize must run first in such a case, because dialect conversion-based bufferization generates to_tensor
/to_memref
ops which One-Shot Bufferize cannot analyze.
One-Shot Bufferize can also be called programmatically with bufferization::runOneShotBufferize
. Alternatively, bufferization::bufferizeOp
skips the analysis and inserts a copy on every buffer write, just like the dialect conversion-based bufferization.
Important: this pass is deprecated, please use the ownership based buffer deallocation pass instead
One-Shot Bufferize deallocates all buffers that it allocates. This is in contrast to the dialect conversion-based bufferization that delegates this job to the -buffer-deallocation
pass. By default, One-Shot Bufferize rejects IR where a newly allocated buffer is returned from a block. Such IR will fail bufferization.
A new buffer allocation is returned from a block when the result of an op that is not in destination-passing style is returned. E.g.:
%0 = scf.if %c -> (tensor<?xf32>) { %1 = tensor.generate ... -> tensor<?xf32> scf.yield %1 : tensor<?xf32> } else { scf.yield %another_tensor : tensor<?xf32> }
The scf.yield
in the “else” branch is OK, but the scf.yield
in the “then” branch will be rejected.
Another case in which a buffer allocation may be returned is when a buffer copy must be inserted due to a RaW conflict. E.g.:
%0 = scf.if %c -> (tensor<?xf32>) { %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32> "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> () ... scf.yield %1 : tensor<?xf32> } else { scf.yield %yet_another_tensor : tensor<?xf32> }
In the above example, a buffer copy of buffer(%another_tensor)
(with %cst
inserted) is yielded from the “then” branch.
Note: Buffer allocations that are returned from a function are not deallocated. It is the caller's responsibility to deallocate the buffer. For the full function boundary ABI for MemRefs w.r.t. buffer deallocation refer to the Function Boundary ABI section. In the future, this could be automated with allocation hoisting (across function boundaries) or reference counting.
One-Shot Bufferize leaks all memory and does not generate any buffer deallocations. The -buffer-deallocation-pipeline
has to be run afterwards to insert the deallocation operations.
Recommended compilation pipeline:
one-shot-bufferize | it's recommended to perform all bufferization here at latest, | <- any allocations inserted after this point have to be handled V manually expand-realloc V ownership-based-buffer-deallocation V canonicalize <- mostly for scf.if simplifications V buffer-deallocation-simplification V <- from this point onwards no tensor values are allowed lower-deallocations V CSE V canonicalize
One-Shot Bufferize does not deallocate any buffers that it allocates. This job is delegated to the -ownership-based-buffer-deallocation
pass, i.e., after running One-Shot Bufferize, the result IR may have a number of memref.alloc
ops, but no memref.dealloc
ops. This pass processes operations implementing FunctionOpInterface
one-by-one without analysing the call-graph. This means, that there have to be some rules on how MemRefs are handled when being passed from one function to another. The rest of the pass revolves heavily around the bufferization.dealloc
operation which is inserted at the end of each basic block with appropriate operands and should be optimized using the Buffer Deallocation Simplification pass (--buffer-deallocation-simplification
) and the regular canonicalizer (--canonicalize
). Lowering the result of the -ownership-based-buffer-deallocation
pass directly using --convert-bufferization-to-memref
without beforehand optimization is not recommended as it will lead to very inefficient code (the runtime-cost of bufferization.dealloc
is O(|memrefs|^2+|memref|*|retained|)
).
The Buffer Deallocation pass operates on the level of operations implementing the FunctionOpInterface
. Such operations can take MemRefs as arguments, but also return them. To ensure compatibility among all functions (including external ones), some rules have to be enforced:
For external functions (e.g., library functions written externally in C), the externally provided implementation has to adhere to these rules and they are just assumed by the buffer deallocation pass. Functions on which the deallocation pass is applied and the implementation is accessible are modified by the pass such that the ABI is respected (i.e., buffer copies are inserted as necessary).
bufferization.dealloc
operationsbufferization.dealloc
operations are unconditionally inserted at the end of each basic block (just before the terminator). The majority of the pass is about finding the correct operands for this operation. There are three variadic operand lists to be populated, the first contains all MemRef values that may need to be deallocated, the second list contains their associated ownership values (of i1
type), and the third list contains MemRef values that are still needed at a later point and should thus not be deallocated. This operation allows us to deal with any kind of aliasing behavior: it lowers to runtime aliasing checks when not enough information can be collected statically. When enough aliasing information is statically available, operands or the entire op may fold away.
Ownerships
To do so, we use a concept of ownership indicators of memrefs which materialize as an i1
value for any SSA value of memref
type, indicating whether the basic block in which it was materialized has ownership of this MemRef. Ideally, this is a constant true
or false
, but might also be a non-constant SSA value. To keep track of those ownership values without immediately materializing them (which might require insertion of bufferization.clone
operations or operations checking for aliasing at runtime at positions where we don't actually need a materialized value), we use the Ownership
class. This class represents the ownership in three states forming a lattice on a partial order:
forall X in SSA values. uninitialized < unique(X) < unknown forall X, Y in SSA values. unique(X) == unique(Y) iff X and Y always evaluate to the same value unique(X) != unique(Y) otherwise
Intuitively, the states have the following meaning:
arith.select
either has the ownership of the then or else case depending on the condition value, inserting another arith.select
for the ownership values can perform the merge and provide a ‘Unique’ ownership for the result), however, in the general case this ‘Unknown’ state has to be assigned.Implied by the above partial order, the pass combines two ownerships in the following way:
Ownership 1 | Ownership 2 | Combined Ownership |
---|---|---|
uninitialized | uninitialized | uninitialized |
unique(X) | uninitialized | unique(X) |
unique(X) | unique(X) | unique(X) |
unique(X) | unique(Y) | unknown |
unknown | unique | unknown |
unknown | uninitialized | unknown |
+ symmetric cases |
Collecting the list of MemRefs that potentially need to be deallocated
For a given block, the list of MemRefs that potentially need to be deallocated at the end of that block is computed by keeping track of all values for which the block potentially takes over ownership. This includes MemRefs provided as basic block arguments, interface handlers for operations like memref.alloc
and func.call
, but also liveness information in regions with multiple basic blocks. More concretely, it is computed by taking the MemRefs in the ‘in’ set of the liveness analysis of the current basic block B, appended by the MemRef block arguments and by the set of MemRefs allocated in B itself (determined by the interface handlers), then subtracted (also determined by the interface handlers) by the set of MemRefs deallocated in B.
Note that we don't have to take the intersection of the liveness ‘in’ set with the ‘out’ set of the predecessor block because a value that is in the ‘in’ set must be defined in an ancestor block that dominates all direct predecessors and thus the ‘in’ set of this block is a subset of the ‘out’ sets of each predecessor.
memrefs = filter((liveIn(block) U allocated(block) U arguments(block)) \ deallocated(block), isMemRef)
The list of conditions for the second variadic operands list of bufferization.dealloc
is computed by querying the stored ownership value for each of the MemRefs collected as described above. The ownership state is updated by the interface handlers while processing the basic block.
Collecting the list of MemRefs to retain
Given a basic block B, the list of MemRefs that have to be retained can be different for each successor block S. For the two basic blocks B and S and the values passed via block arguments to the destination block S, we compute the list of MemRefs that have to be retained in B by taking the MemRefs in the successor operand list of the terminator and the MemRefs in the ‘out’ set of the liveness analysis for B intersected with the ‘in’ set of the destination block S.
This list of retained values makes sure that we cannot run into use-after-free situations even if no aliasing information is present at compile-time.
toRetain = filter(successorOperands + (liveOut(fromBlock) insersect liveIn(toBlock)), isMemRef)
The pass uses liveness analysis and a few interfaces:
FunctionOpInterface
CallOpInterface
MemoryEffectOpInterface
RegionBranchOpInterface
RegionBranchTerminatorOpInterface
Due to insufficient information provided by the interface, it also special-cases on the cf.cond_br
operation and makes some assumptions about operations implementing the RegionBranchOpInterface
at the moment, but improving the interfaces would allow us to remove those dependencies in the future.
The Buffer Deallocation pass has some requirements and limitations on the input IR. These are checked in the beginning of the pass and errors are emitted accordingly:
RegionBranchOpInterface
, an error is emitted because the pass cannot know the semantics of the nested region (and does not make any default assumptions on it).RegionBranchTerminatorOpInterface
or BranchOpInterface
, but not both. Terminators with more than one successor are not supported (except cf.cond_br
). This is not a fundamental limitation, but there is no use-case justifying the more complex implementation at the moment.The following example contains a few interesting cases:
arith.select
initially has ‘Unknown’ assigned as ownership, but once the bufferization.dealloc
operation is inserted it is put in the ‘retained’ list (since it has uses in a later basic block) and thus the ‘Unknown’ ownership can be replaced with a ‘Unique’ ownership using the corresponding result of the dealloc operation.cf.cond_br
operation has more than one successor and thus has to insert two bufferization.dealloc
operations (one for each successor). While they have the same list of MemRefs to deallocate (because they perform the deallocations for the same block), it must be taken into account that some MemRefs remain live for one branch but not the other (thus set intersection is performed on the live-out of the current block and the live-in of the target block). Also, cf.cond_br
supports separate forwarding operands for each successor. To make sure that no MemRef is deallocated twice (because there are two bufferization.dealloc
operations with the same MemRefs to deallocate), the condition operands are adjusted to take the branch condition into account. While a generic lowering for such terminator operations could be implemented, a specialized implementation can take all the semantics of this particular operation into account and thus generate a more efficient lowering.func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) { %alloc = memref.alloc() : memref<?xi8> %alloca = memref.alloca() : memref<?xi8> %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8> cf.cond_br %br_cond, ^bb1(%alloc : memref<?xi8>), ^bb1(%memref : memref<?xi8>) ^bb1(%bbarg: memref<?xi8>): test.copy(%bbarg, %select) : (memref<?xi8>, memref<?xi8>) return }
After running --ownership-based-buffer-deallocation
, it looks as follows:
// Function boundary ABI: ownership of `%memref` will never be acquired. func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) { %false = arith.constant false %true = arith.constant true // The ownership of a MemRef defined by the `memref.alloc` operation is always // assigned to be 'true'. %alloc = memref.alloc() : memref<?xi8> // The ownership of a MemRef defined by the `memref.alloca` operation is // always assigned to be 'false'. %alloca = memref.alloca() : memref<?xi8> // The ownership of %select will be the join of the ownership of %alloc and // the ownership of %alloca, i.e., of %true and %false. Because the pass does // not know about the semantics of the `arith.select` operation (unless a // custom handler is implemented), the ownership join will be 'Unknown'. If // the materialized ownership indicator of %select is needed, either a clone // has to be created for which %true is assigned as ownership or the result // of a `bufferization.dealloc` where %select is in the retain list has to be // used. %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8> // We use `memref.extract_strided_metadata` to get the base memref since it is // not allowed to pass arbitrary memrefs to `memref.dealloc`. This property is // already enforced for `bufferization.dealloc` %base_buffer_memref, ... = memref.extract_strided_metadata %memref : memref<?xi8> -> memref<i8>, index, index, index %base_buffer_alloc, ... = memref.extract_strided_metadata %alloc : memref<?xi8> -> memref<i8>, index, index, index %base_buffer_alloca, ... = memref.extract_strided_metadata %alloca : memref<?xi8> -> memref<i8>, index, index, index // The deallocation conditions need to be adjusted to incorporate the branch // condition. In this example, this requires only a single negation, but might // also require multiple arith.andi operations. %not_br_cond = arith.xori %true, %br_cond : i1 // There are two dealloc operations inserted in this basic block, one per // successor. Both have the same list of MemRefs to deallocate and the // conditions only differ by the branch condition conjunct. // Note, however, that the retained list differs. Here, both contain the // %select value because it is used in both successors (since it's the same // block), but the value passed via block argument differs (%memref vs. // %alloc). %10:2 = bufferization.dealloc (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca : memref<i8>, memref<i8>, memref<i8>) if (%false, %br_cond, %false) retain (%alloc, %select : memref<?xi8>, memref<?xi8>) %11:2 = bufferization.dealloc (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca : memref<i8>, memref<i8>, memref<i8>) if (%false, %not_br_cond, %false) retain (%memref, %select : memref<?xi8>, memref<?xi8>) // Because %select is used in ^bb1 without passing it via block argument, we // need to update it's ownership value here by merging the ownership values // returned by the dealloc operations %new_ownership = arith.select %br_cond, %10#1, %11#1 : i1 // The terminator is modified to pass along the ownership indicator values // with each MemRef value. cf.cond_br %br_cond, ^bb1(%alloc, %10#0 : memref<?xi8>, i1), ^bb1(%memref, %11#0 : memref<?xi8>, i1) // All non-entry basic blocks are modified to have an additional i1 argument for // each MemRef value in the argument list. ^bb1(%13: memref<?xi8>, %14: i1): // 2 preds: ^bb0, ^bb0 test.copy(%13, %select) : (memref<?xi8>, memref<?xi8>) %base_buffer_13, ... = memref.extract_strided_metadata %13 : memref<?xi8> -> memref<i8>, index, index, index %base_buffer_select, ... = memref.extract_strided_metadata %select : memref<?xi8> -> memref<i8>, index, index, index // Here, we don't have a retained list, because the block has no successors // and the return has no operands. bufferization.dealloc (%base_buffer_13, %base_buffer_select : memref<i8>, memref<i8>) if (%14, %new_ownership) return }
The semantics of the bufferization.dealloc
operation provide a lot of opportunities for optimizations which can be conveniently split into patterns using the greedy pattern rewriter. Some of those patterns need access to additional analyses such as an analysis that can determine whether two MemRef values must, may, or never originate from the same buffer allocation. These patterns are collected in the Buffer Deallocation Simplification pass, while patterns that don't need additional analyses are registered as part of the regular canonicalizer pass. This pass is best run after --ownership-based-buffer-deallocation
followed by --canonicalize
.
The pass applies patterns for the following simplifications:
bufferization.dealloc
operations only containing this value in the ‘memref’ list when it is guaranteed to not alias with any other value in the ‘memref’ list. This avoids at least one aliasing check at runtime and enables using a more efficient lowering for this new bufferization.dealloc
operation.The -lower-deallocations
pass transforms all bufferization.dealloc
operations to memref.dealloc
operations and may also insert operations from the scf
, func
, and arith
dialects to make deallocations conditional and check whether two MemRef values come from the same allocation at runtime (when the buffer-deallocation-simplification
pass wasn't able to determine it statically).
The same lowering of the bufferization.dealloc
operation is also part of the -convert-bufferization-to-memref
conversion pass which also lowers all the other operations of the bufferization dialect.
We distinguish multiple cases in this lowering pass to provide an overall more efficient lowering. In the general case, a library function is created to avoid quadratic code size explosion (relative to the number of operands of the dealloc operation). The specialized lowerings aim to avoid this library function because it requires allocating auxiliary MemRefs of index values.
A library function is generated to avoid code-size blow-up. On a high level, the base-memref of all operands is extracted as an index value and stored into specifically allocated MemRefs and passed to the library function which then determines whether they come from the same original allocation. This information is needed to avoid double-free situations and to correctly retain the MemRef values in the retained
list.
Dealloc Operation Lowering
This lowering supports all features the dealloc operation has to offer. It computes the base pointer of each memref (as an index), stores it in a new memref helper structure and passes it to the helper function generated in buildDeallocationLibraryFunction
. The results are stored in two lists (represented as MemRefs) of booleans passed as arguments. The first list stores whether the corresponding condition should be deallocated, the second list stores the ownership of the retained values which can be used to replace the result values of the bufferization.dealloc
operation.
Example:
%0:2 = bufferization.dealloc (%m0, %m1 : memref<2xf32>, memref<5xf32>) if (%cond0, %cond1) retain (%r0, %r1 : memref<1xf32>, memref<2xf32>)
lowers to (simplified):
%c0 = arith.constant 0 : index %c1 = arith.constant 1 : index %dealloc_base_pointer_list = memref.alloc() : memref<2xindex> %cond_list = memref.alloc() : memref<2xi1> %retain_base_pointer_list = memref.alloc() : memref<2xindex> %m0_base_pointer = memref.extract_aligned_pointer_as_index %m0 memref.store %m0_base_pointer, %dealloc_base_pointer_list[%c0] %m1_base_pointer = memref.extract_aligned_pointer_as_index %m1 memref.store %m1_base_pointer, %dealloc_base_pointer_list[%c1] memref.store %cond0, %cond_list[%c0] memref.store %cond1, %cond_list[%c1] %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0 memref.store %r0_base_pointer, %retain_base_pointer_list[%c0] %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1 memref.store %r1_base_pointer, %retain_base_pointer_list[%c1] %dyn_dealloc_base_pointer_list = memref.cast %dealloc_base_pointer_list : memref<2xindex> to memref<?xindex> %dyn_cond_list = memref.cast %cond_list : memref<2xi1> to memref<?xi1> %dyn_retain_base_pointer_list = memref.cast %retain_base_pointer_list : memref<2xindex> to memref<?xindex> %dealloc_cond_out = memref.alloc() : memref<2xi1> %ownership_out = memref.alloc() : memref<2xi1> %dyn_dealloc_cond_out = memref.cast %dealloc_cond_out : memref<2xi1> to memref<?xi1> %dyn_ownership_out = memref.cast %ownership_out : memref<2xi1> to memref<?xi1> call @dealloc_helper(%dyn_dealloc_base_pointer_list, %dyn_retain_base_pointer_list, %dyn_cond_list, %dyn_dealloc_cond_out, %dyn_ownership_out) : (...) %m0_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c0] : memref<2xi1> scf.if %m0_dealloc_cond { memref.dealloc %m0 : memref<2xf32> } %m1_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c1] : memref<2xi1> scf.if %m1_dealloc_cond { memref.dealloc %m1 : memref<5xf32> } %r0_ownership = memref.load %dyn_ownership_out[%c0] : memref<2xi1> %r1_ownership = memref.load %dyn_ownership_out[%c1] : memref<2xi1> memref.dealloc %dealloc_base_pointer_list : memref<2xindex> memref.dealloc %retain_base_pointer_list : memref<2xindex> memref.dealloc %cond_list : memref<2xi1> memref.dealloc %dealloc_cond_out : memref<2xi1> memref.dealloc %ownership_out : memref<2xi1> // replace %0#0 with %r0_ownership // replace %0#1 with %r1_ownership
Library function
A library function is built per compilation unit that can be called at bufferization dealloc sites to determine whether two MemRefs come from the same allocation and their new ownerships.
The generated function takes two MemRefs of indices and three MemRefs of booleans as arguments:
This helper function is supposed to be called once for each bufferization.dealloc
operation to determine the deallocation need and new ownership indicator for the retained values, but does not perform the deallocation itself.
Generated code:
func.func @dealloc_helper( %dyn_dealloc_base_pointer_list: memref<?xindex>, %dyn_retain_base_pointer_list: memref<?xindex>, %dyn_cond_list: memref<?xi1>, %dyn_dealloc_cond_out: memref<?xi1>, %dyn_ownership_out: memref<?xi1>) { %c0 = arith.constant 0 : index %c1 = arith.constant 1 : index %true = arith.constant true %false = arith.constant false %num_dealloc_memrefs = memref.dim %dyn_dealloc_base_pointer_list, %c0 %num_retain_memrefs = memref.dim %dyn_retain_base_pointer_list, %c0 // Zero initialize result buffer. scf.for %i = %c0 to %num_retain_memrefs step %c1 { memref.store %false, %dyn_ownership_out[%i] : memref<?xi1> } scf.for %i = %c0 to %num_dealloc_memrefs step %c1 { %dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%i] %cond = memref.load %dyn_cond_list[%i] // Check for aliasing with retained memrefs. %does_not_alias_retained = scf.for %j = %c0 to %num_retain_memrefs step %c1 iter_args(%does_not_alias_aggregated = %true) -> (i1) { %retain_bp = memref.load %dyn_retain_base_pointer_list[%j] %does_alias = arith.cmpi eq, %retain_bp, %dealloc_bp : index scf.if %does_alias { %curr_ownership = memref.load %dyn_ownership_out[%j] %updated_ownership = arith.ori %curr_ownership, %cond : i1 memref.store %updated_ownership, %dyn_ownership_out[%j] } %does_not_alias = arith.cmpi ne, %retain_bp, %dealloc_bp : index %updated_aggregate = arith.andi %does_not_alias_aggregated, %does_not_alias : i1 scf.yield %updated_aggregate : i1 } // Check for aliasing with dealloc memrefs in the list before the // current one, i.e., // `fix i, forall j < i: check_aliasing(%dyn_dealloc_base_pointer[j], // %dyn_dealloc_base_pointer[i])` %does_not_alias_any = scf.for %j = %c0 to %i step %c1 iter_args(%does_not_alias_agg = %does_not_alias_retained) -> (i1) { %prev_dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%j] %does_not_alias = arith.cmpi ne, %prev_dealloc_bp, %dealloc_bp %updated_alias_agg = arith.andi %does_not_alias_agg, %does_not_alias scf.yield %updated_alias_agg : i1 } %dealloc_cond = arith.andi %does_not_alias_any, %cond : i1 memref.store %dealloc_cond, %dyn_dealloc_cond_out[%i] : memref<?xi1> } return }
Currently, there are two special lowerings for common cases to avoid the library function and thus unnecessary memory load and store operations and function calls:
One memref, no retained
Lower a simple case without any retained values and a single MemRef. Ideally, static analysis can provide enough information such that the buffer-deallocation-simplification
pass is able to split the dealloc operations up into this simple case as much as possible before running this pass.
Example:
bufferization.dealloc (%arg0 : memref<2xf32>) if (%arg1)
is lowered to
scf.if %arg1 { memref.dealloc %arg0 : memref<2xf32> }
In most cases, the branch condition is either constant ‘true’ or ‘false’ and can thus be optimized away entirely by the canonicalizer pass.
One memref, arbitrarily many retained
A special case lowering for the deallocation operation with exactly one MemRef, but an arbitrary number of retained values. The size of the code produced by this lowering is linear to the number of retained values.
Example:
%0:2 = bufferization.dealloc (%m : memref<2xf32>) if (%cond) retain (%r0, %r1 : memref<1xf32>, memref<2xf32>) return %0#0, %0#1 : i1, i1
is lowered to
%m_base_pointer = memref.extract_aligned_pointer_as_index %m %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0 %r0_does_not_alias = arith.cmpi ne, %m_base_pointer, %r0_base_pointer %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1 %r1_does_not_alias = arith.cmpi ne, %m_base_pointer, %r1_base_pointer %not_retained = arith.andi %r0_does_not_alias, %r1_does_not_alias : i1 %should_dealloc = arith.andi %not_retained, %cond : i1 scf.if %should_dealloc { memref.dealloc %m : memref<2xf32> } %true = arith.constant true %r0_does_alias = arith.xori %r0_does_not_alias, %true : i1 %r0_ownership = arith.andi %r0_does_alias, %cond : i1 %r1_does_alias = arith.xori %r1_does_not_alias, %true : i1 %r1_ownership = arith.andi %r1_does_alias, %cond : i1 return %r0_ownership, %r1_ownership : i1, i1
One-Shot Bufferize bufferizes ops from top to bottom. This works well when all ops are bufferizable. However, when encountering a non-bufferizable tensor with allow-unknown-ops
, One-Shot Bufferize must insert to_memref
ops at the bufferization boundary and decide on a memref type. By default, One-Shot Bufferize choose the most dynamic memref type wrt. layout maps. E.g.:
%0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) %1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32>
When bufferizing the above IR, One-Shot Bufferize inserts a to_memref
ops with dynamic offset and strides:
%0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) %0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>> %1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>>
All users of %0
have fully dynamic layout maps. This ensures that the bufferized IR composes well with future bufferizations of unbufferizable_op
(maybe bufferized by another pass), regardless of the exact memref type of the future bufferization. If the op turns out to be bufferized to an op with a simpler memref type (e.g., identity layout map), we expect that canonicalization patterns would clean up unnecessarily dynamic layout maps. (Some of these canonicalization patterns may not be implemented yet.)
One-Shot Bufferize tries to infer the most precise memref type when bufferizing an op. If the entire IR is bufferizable, we do not have to resort to conservatively use fully dynamic layout maps. In that case, we also do not have to rely on canonicalization patterns to clean up the bufferized IR.
Note: There are some bufferizable ops for which a percise layout map cannot be inferred. E.g., a tensor.cast
from a tensor<*xf32>
to a tensor<?x?xf32>
must be bufferized to a memref.cast
with a memref type that has a fully dynamic layout map.
One-Shot Bufferize has an option unknown-type-conversion
to control the generation of layout maps when no precise layout can be inferred:
fully-dynamic-layout-map
uses fully dynamic layout maps and is the default behavior. This composes well when IR is partially bufferized.identity-layout-map
uses static identity layout maps. This option can be useful for legacy code that cannot handle memref types with layout maps. Note that this setting can lead to additional buffer copies when folding a to_tensor
/to_memref
pair with memref types that are not cast-compatible.Note: The unknown-type-conversion
option does not affect layout maps of function signatures. There is a separate function-signature-type-conversion
option that controls layout maps of function parameters and function results.
Custom ops can be bufferized if they implement BufferizableOpInterface
. Users must at least implement the following interface methods.
bufferizesToMemoryRead
: Return true
if the buffer of the given tensor OpOperand is read.bufferizesToMemoryWrite
: Return true
if the buffer of the given tensor OpOperand is written (if bufferizing in-place).getAliasingOpResult
: Return the OpResults that may share the same buffer as the given OpOperand. This interface method describes to OpOperand-to-OpResult mapping wrt. destination-passing style.bufferRelation
: Return BufferRelation::Equivalent
if the given OpResult is the exact same memref as the aliasing OpOperand after bufferization (in case of in-place bufferization). Otherwise, (e.g., they overlap but are not necessarily the exact same memrefs), BufferRelation::Unknown
should be returned. Additional buffer relations will be added in the future, but BufferRelation::Unknown
is always safe.bufferize
: Rewrite the op with the given rewriter. Ops should be replaced with bufferization::replaceOpWithBufferizedValues
.To get a better intuition of the interface methods, we invite users to take a look at existing implementations in MLIR, e.g., the implementation of tensor.insert
or tensor.extract
.
To get a better understanding of why One-Shot Bufferize introduced a buffer copy, users can run the pass with test-analysis-only print-conflicts
. Every tensor op is then annotated with an attribute that has a boolean value for each tensor OpOperand. true
means that the OpOperand bufferizes in-place. false
means that the OpOperand bufferizes out-of-place and a buffer copy will be inserted.
There are two reasons why a buffer copy may be inserted.
memref.global
buffers that are the result of arith.constant
ops are never modified.In the first case, print-conflicts
illustrates the conflict in the form of a (“read”, “conflicting write”, “last write”) tuple.
To get a better understanding of the SSA Use-Def Chain Analysis and the RaW conflict detection algorithm, we invite interested users to read the design document and watch the corresponding ODM talk (slides). can be used to bufferize a program in a single pass, as long as each op
Both dialect conversion-based bufferization and One-Shot Bufferize generate to_tensor
/to_memref
ops at the bufferization boundary (when run with allow-unknown-ops
). They can be combined and run in sequence. However, One-Shot Bufferize must run first because it cannot analyze those boundary ops. To update existing code step-by-step, it may be useful to specify a dialect filter for One-Shot Bufferize, so that dialects can be switched over one-by-one.
One-Shot Bufferize does currently not support function graph bufferization. I.e., CallOp
, ReturnOp
and function bbArgs are not bufferizable. Users can run the existing --func-bufferize
bufferization pass after One-Shot Bufferize.
Alternatively, users can try ModuleBufferization
, which is an extension of One-Shot Bufferize. This bufferization is still under development and does not support arbitrary IR. In essence, returning a tensor from a function is not supported, unless it is equivalent to a function bbArg. In that case, the corresponding return value can simply be dropped during bufferization.
Disclaimer: Most dialect conversion-based bufferization has been migrated to One-Shot Bufferize. New users should use One-Shot Bufferize (with or without analysis). The following documentation is only for existing users of dialect conversion-based bufferization.
This system is a simple application of MLIR's dialect conversion infrastructure. The bulk of the code related to bufferization is a set of ordinary ConversionPattern
's that dialect authors write for converting ops that operate on tensor
's to ops that operate on memref
's. A set of conventions and best practices are followed that allow these patterns to be run across multiple independent passes (rather than requiring a single huge atomic conversion pass), which makes the compilation pipelines scalable, robust, and easy to debug.
This document is targeted at people looking to utilize MLIR's bufferization functionality, along with people who want to extend it to cover their own ops.
NOTE: Before reading this document, please watch the talk “Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization Infrastructure” (slides, recording). That talk gives a high-level overview of the bufferization infrastructure and important conceptual details related to using the MLIR dialect conversion infrastructure.
Bufferization itself does not free any of the buffers that have been allocated, nor does it do anything particularly intelligent with the placement of buffers w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist of:
buffer-hoisting
, buffer-loop-hoisting
, and promote-buffers-to-stack
, which do optimizations that are only exposed after bufferization.After buffer deallocation has been completed, the program will be quite difficult to transform due to the presence of the deallocation ops. Thus, other optimizations such as linalg fusion on memrefs should be done before that stage.
Bufferization consists of running multiple partial bufferization passes, followed by one finalizing bufferization pass.
There is typically one partial bufferization pass per dialect (though other subdivisions are possible). For example, for a dialect X
there will typically be a pass X-bufferize
that knows how to bufferize all the ops in that dialect. By running pass X-bufferize
for each dialect X
in the program, all the ops in the program are incrementally bufferized.
Partial bufferization passes create programs where only some ops have been bufferized. These passes will create materializations (also sometimes called “casts”) that convert between the tensor
and memref
type, which allows bridging between ops that have been bufferized and ops that have not yet been bufferized.
Finalizing bufferizations complete the bufferization process, and guarantee that there are no tensors remaining in the program. This involves eliminating the materializations. The pass finalizing-bufferize
provides a minimal pass that only eliminates materializations and issues an error if any unbufferized ops exist in the program.
However, it is possible for a finalizing bufferization to do more than just eliminate materializations. By adding patterns (just as a partial bufferization would), it is possible for a finalizing bufferization pass to simultaneously bufferize ops and eliminate materializations. This has a number of disadvantages discussed in the talk and should generally be avoided.
As a concrete example, we will look at the bufferization pipeline from the mlir-npcomp
reference backend (code). The code, slightly simplified and annotated, is reproduced here:
// Partial bufferization passes. pm.addPass(createTensorConstantBufferizePass()); pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect. pm.addNestedPass<func::FuncOp>(createSCFBufferizePass()); pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass()); pm.addNestedPass<func::FuncOp>(createTensorBufferizePass()); pm.addPass(createFuncBufferizePass()); // Finalizing bufferization pass. pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass());
Looking first at the partial bufferization passes, we see that there are a sequence of FuncOp
passes (which run in parallel on functions). These function passes are bracketed by arith-bufferize
and func-bufferize
, which are module passes (and thus serialize the parallel compilation process). These two passes must be module passes because they make changes to the top-level module.
The bulk of the bufferization work is done by the function passes. Most of these passes are provided as part of the upstream MLIR distribution and bufferize their respective dialects (e.g. scf-bufferize
bufferizes the scf
dialect). The tcp-bufferize
pass is an exception -- it is a partial bufferization pass used to bufferize the downstream tcp
dialect, and fits in perfectly with all the other passes provided upstream.
The last pass is the finalizing bufferization pass. The mlir-npcomp
reference backend has arranged that all ops are bufferized by partial bufferizations, so that the upstream finalizing-bufferize
pass can be used as the finalizing bufferization pass. This gives excellent diagnostics when something goes wrong with the bufferization process, such as due to an op that wasn't handled by any pattern.
The contract of a partial bufferization pass is that a subset of ops (or kinds of ops, customizable by a ConversionTarget) get bufferized.
A partial bufferization pass is just a pass that uses the dialect conversion framework to apply ConversionPattern
s with a tensor
to memref
type conversion.
To describe how to write such a pass, we will walk through an example, the tensor-bufferize
pass (code, test) that bufferizes the tensor
dialect. Note that these passes have been replaced with a BufferizableOpInterface
-based implementation in the meantime, so we have to take a looker at an older version of the code.
The bulk of the code in the pass will be a set of conversion patterns, with a simple example being BufferizeCastOp).
class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> { public: using OpConversionPattern::OpConversionPattern; LogicalResult matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor, ConversionPatternRewriter &rewriter) const override { auto resultType = getTypeConverter()->convertType(op.getType()); rewriter.replaceOpWithNewOp<MemRefCastOp>(op, resultType, adaptor.source()); return success(); } };
See the talk for more details on how to write these patterns.
The pass itself is very small, and follows the basic pattern of any dialect conversion pass.
void mlir::populateTensorBufferizePatterns( BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) { patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter, patterns.getContext()); } struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> { void runOnOperation() override { auto *context = &getContext(); BufferizeTypeConverter typeConverter; RewritePatternSet patterns(context); ConversionTarget target(*context); populateTensorBufferizePatterns(typeConverter, patterns); target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>(); target.addLegalDialect<func::FuncDialect>(); if (failed( applyPartialConversion(getOperation(), target, std::move(patterns)))) signalPassFailure(); } };
The pass has all the hallmarks of a dialect conversion pass that does type conversions: a TypeConverter
, a RewritePatternSet
, and a ConversionTarget
, and a call to applyPartialConversion
. Note that a function populateTensorBufferizePatterns
is separated, so that power users can use the patterns independently, if necessary (such as to combine multiple sets of conversion patterns into a single conversion call, for performance).
One convenient utility provided by the MLIR bufferization infrastructure is the BufferizeTypeConverter
, which comes pre-loaded with the necessary conversions and materializations between tensor
and memref
.
In this case, the BufferizationOpsDialect
is marked as legal, so the bufferization.to_tensor
and bufferization.to_memref
ops, which are inserted automatically by the dialect conversion framework as materializations, are legal. There is a helper populateBufferizeMaterializationLegality
(code) which helps with this in general.
scf
dialect.RegionBranchOpInterface
(that is, they use regions to represent control flow).lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp
(code), which is well-commented and covers how to correctly convert ops that contain regions.func
, call
, and BranchOpInterface
ops.The contract of a finalizing bufferization pass is that all tensors are gone from the program.
The easiest way to write a finalizing bufferize pass is to not write one at all! MLIR provides a pass finalizing-bufferize
which eliminates the bufferization.to_tensor
/ bufferization.to_memref
materialization ops inserted by partial bufferization passes and emits an error if that is not sufficient to remove all tensors from the program.
This pass is sufficient when partial bufferization passes have bufferized all the ops in the program, leaving behind only the materializations. When possible, it is recommended to structure your pass pipeline this way, as this has the significant advantage that if an op does not get bufferized (due to a missing pattern, bug in the code, etc.), finalizing-bufferize
will emit a nice clean error, and the IR seen by finalizing-bufferize
will only contain only one unbufferized op.
However, before the current bufferization infrastructure was put in place, bufferization could only be done as a single finalizing bufferization mega-pass that used the populate*BufferizePatterns
functions from multiple dialects to simultaneously bufferize everything at once. Thus, one might see code in downstream projects structured this way. This structure is not recommended in new code. A helper, populateEliminateBufferizeMaterializationsPatterns
(code) is available for such passes to provide patterns that eliminate bufferization.to_tensor
and bufferization.to_memref
.
func-bufferize
was changed to be a partial conversion pass, and there is a new finalizing-bufferize
which serves as a general finalizing bufferization pass.BufferizableOpInterface
. New users should use One-Shot Bufferize instead of dialect conversion-based bufferization.