| # Bufferization |
| |
| [TOC] |
| |
| ## Overview |
| |
| Bufferization in MLIR is the process of converting ops with `tensor` semantics |
| to ops with `memref` semantics. MLIR provides an infrastructure that bufferizes |
| an entire program in a single pass (*One-Shot Bufferize*). This infrastructure |
bufferizes all ops that implement the
[`BufferizableOpInterface`](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td).
| |
| MLIR has an older bufferization infrastructure built around |
| [dialect conversion](DialectConversion.md). Most dialect conversion |
| bufferization patterns have been migrated to One-Shot Bufferize, but some |
| functionality such as function boundary bufferization still depends on dialect |
| conversion and its type converter. New projects should use One-Shot Bufferize, |
| as the dialect conversion-based bufferization will eventually be deprecated. |
| Moreover, One-Shot Bufferize results in better bufferization with fewer memory |
| allocations and buffer copies. This documentation is mostly about One-Shot |
| Bufferize, but also describes how to gradually migrate a project from dialect |
| conversion-based bufferization to One-Shot Bufferize. |
| |
| ## What is One-Shot Bufferize? |
| |
| One-Shot Bufferize is a new tensor bufferization pass designed for IR in |
| [destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf), |
| and with aggressive in-place bufferization. |
| |
| One-Shot Bufferize is: |
| |
| * **Monolithic**: A single MLIR pass does the entire work, whereas the |
| previous bufferization in MLIR was split across multiple passes residing in |
| different dialects. In One-Shot Bufferize, `BufferizableOpInterface` |
| implementations are spread across different dialects. |
| |
| * A **whole-function at a time analysis**. In-place bufferization decisions |
| are made by analyzing SSA use-def chains on tensors. Op interface |
| implementations not only provide the rewrite logic from tensor ops to memref |
| ops, but also helper methods for One-Shot Bufferize's analysis to query |
| information about an op's bufferization/memory semantics. |
| |
| * **Extensible** via an op interface: All ops that implement |
| `BufferizableOpInterface` can be bufferized. |
| |
| * **2-Pass**: Bufferization is internally broken down into 2 steps: First, |
| analyze the entire IR and make bufferization decisions. Then, bufferize |
| (rewrite) the IR. The analysis has access to exact SSA use-def information. |
It incrementally builds alias and equivalence sets and does not rely on an
a-posteriori alias analysis of preallocated memory.
| |
| * **Greedy**: Operations are analyzed one-by-one and it is decided on the spot |
| whether a tensor OpOperand must be copied or not. Heuristics determine the |
| order of analysis. |
| |
* **Modular**: The current One-Shot Analysis can be replaced with a different
analysis. The results of the analysis are queried by the bufferization via
`AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
of `AnalysisState` that implements a small number of virtual functions can
serve as a custom analysis. It is even possible to run One-Shot Bufferize
without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
Bufferize behaves exactly like the old dialect conversion-based
bufferization (i.e., copy every buffer before writing to it).
| |
| To reduce complexity, One-Shot Bufferize should be |
| [run after other transformations](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373), |
| typically as one of the last steps right before lowering memref ops. Many |
| transformations are easier in tensor land; e.g., tile/fuse/… on tensors first, |
| then bufferize the remaining IR. |
| |
| From an architecture perspective, One-Shot Bufferize consists of |
| [BufferizableOpInterface](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td) |
| (and its implementations) and an |
| [analysis](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L164) |
| of tensor SSA values that decides if a buffer can be used directly or must be |
copied. The `bufferize` method of the op interface inspects analysis results
and rewrites tensor ops into memref ops.
| |
| ## Goals of Bufferization |
| |
The high-level goal of every bufferization technique is to:

1. Use as little memory as possible.
2. Copy as little memory as possible.
| |
| This implies reusing already allocated buffers when possible, turning |
| bufferization into an algorithmically complex problem with similarities to |
| register allocation. |
| |
Depending on the concrete use case, there may be additional bufferization
requirements. If the contents of a buffer are expensive to compute, there could
be a tradeoff between *recomputation* and *compute once and copy*. Conversely,
it may not even be possible to allocate new buffers at runtime on some
architectures.
| |
| ## Destination-Passing Style |
| |
| Bufferization is an algorithmically complex problem. Given an op with a tensor |
| result, bufferization has to choose a memref buffer in which the result can be |
| stored. It is always safe to allocate a brand new buffer, but such a |
| bufferization strategy would be unacceptable for high-performance codegen. When |
| choosing an already existing buffer, we must be careful not to accidentally |
| overwrite data that is still needed later in the program. |
| |
To simplify this problem, One-Shot Bufferize was designed to take advantage of
*destination-passing style*. This form exists independently of bufferization
and is tied to SSA semantics: many ops are "updating" part of their input SSA
variables. For example, the LLVM instruction
[`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
inserts an element into a vector. Since SSA values are immutable, the
operation returns a copy of the input vector with the element inserted.
Another example in MLIR is `linalg.generic`, which always has an extra `outs`
operand that provides the initial values to update (for example, when the
operation performs a reduction).
| |
This input is referred to as the "destination" in the following (the quotes are
important, as this operand is not modified in place but copied). In the context
of bufferization, it serves as a possible "anchor" for the bufferization
algorithm: by carefully choosing the SSA value used as the "destination", the
user can shape the input IR so that a close-to-optimal bufferization result is
guaranteed.
| |
| For every tensor result, a "destination-passing" style op has a corresponding |
| tensor operand. If there aren't any other uses of this tensor, the bufferization |
| can alias it with the op result and perform the operation "in-place" by reusing |
| the buffer allocated for this "destination" input. |
| |
| As an example, consider the following op: `%0 = tensor.insert %cst into |
| %t[%idx] : tensor<?xf32>` |
| |
| `%t` is the "destination" in this example. When choosing a buffer for the result |
| `%0`, denoted as `buffer(%0)`, One-Shot Bufferize considers only two options: |
| |
| 1. `buffer(%0) = buffer(%t)` : alias the "destination" tensor with the |
| result and perform the operation in-place. |
| 2. `buffer(%0)` is a newly allocated buffer. |
| |
| There may be other buffers in the same function that could potentially be used |
| for `buffer(%0)`, but those are not considered by One-Shot Bufferize to keep the |
| bufferization simple. One-Shot Bufferize could be extended to consider such |
| buffers in the future to achieve a better quality of bufferization. |
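
For illustration, these two options could bufferize to IR along the following
lines (a sketch; `%t_buffer` is a hypothetical name for the buffer chosen for
`%t`):

```mlir
// Option 1: in-place. buffer(%0) = buffer(%t). The new element is simply
// stored into the existing buffer.
memref.store %cst, %t_buffer[%idx] : memref<?xf32>

// Option 2: out-of-place. buffer(%0) is a new allocation. The contents of
// buffer(%t) are copied first, then the new element is stored.
%c0 = arith.constant 0 : index
%sz = memref.dim %t_buffer, %c0 : memref<?xf32>
%alloc = memref.alloc(%sz) : memref<?xf32>
memref.copy %t_buffer, %alloc : memref<?xf32> to memref<?xf32>
memref.store %cst, %alloc[%idx] : memref<?xf32>
```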
| |
Tensor ops that are not in destination-passing style always bufferize to a new
memory allocation. E.g.:
| |
| ```mlir |
| %0 = tensor.generate %sz { |
| ^bb0(%i : index): |
| %cst = arith.constant 0.0 : f32 |
| tensor.yield %cst : f32 |
| } : tensor<?xf32> |
| ``` |
| |
The result of `tensor.generate` does not have a "destination" operand, so
bufferization allocates a new buffer. This could be avoided by choosing an op
such as `linalg.generic`, which can express the same computation with a
"destination" operand, specified via `outs`:
| |
| ```mlir |
| #map = affine_map<(i) -> (i)> |
| %0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} |
| outs(%t : tensor<?xf32>) { |
| ^bb0(%arg0 : f32): |
| %cst = arith.constant 0.0 : f32 |
| linalg.yield %cst : f32 |
| } -> tensor<?xf32> |
| ``` |
| |
| At first glance, the above `linalg.generic` op may not seem very useful because |
| the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an |
| operand in the first place? As an example, this can be useful for overwriting a |
| slice of a tensor: |
| |
| ```mlir |
| %t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32> |
| %0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32> |
| %1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1] |
| : tensor<?xf32> into tensor<?xf32> |
| ``` |
| |
| The above example bufferizes to a `memref.subview`, followed by a |
| "`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming |
| that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes |
| to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`). |
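
Sketched bufferized form of the above example (assuming the analysis decides on
in-place bufferization throughout; `%s_buffer` is a hypothetical name for the
buffer of `%s`):

```mlir
// tensor.extract_slice bufferizes to a subview of the original buffer.
%t_view = memref.subview %s_buffer[%idx] [%sz] [1]
    : memref<?xf32> to memref<?xf32, strided<[1], offset: ?>>
// The "linalg.generic on memrefs" overwrites the memory of the subview.
linalg.generic ... outs(%t_view : memref<?xf32, strided<[1], offset: ?>>) { ... }
// tensor.insert_slice becomes a no-op: the data is already in place.
```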
| |
RaW conflicts are detected with an analysis of SSA use-def chains (details
later). One-Shot Bufferize works best if there is a single SSA use-def chain,
where the result of a tensor op is the operand of the next tensor op, e.g.:
| |
| ```mlir |
| %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>) |
| ``` |
| |
| Buffer copies are likely inserted if the SSA use-def chain splits at some point, |
| e.g.: |
| |
| ```mlir |
| %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) |
| ``` |
| |
One-Shot Bufferize has debug flags (`test-analysis-only` and `print-conflicts`)
that print the results of the analysis and explain to the user why buffer
copies were inserted.
| |
| ## Using One-Shot Bufferize |
| |
| MLIR provides a pass |
| [`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize) |
| that performs an analysis and bufferizes all ops with tensor semantics that |
| implement `BufferizableOpInterface`. For modularity reasons, these op interface |
| implementations are typically external models that live in a dialect's |
| "Transforms" build unit. (External models are a mechanism for implementing an op |
| interface in a different build unit.) It is the user's responsibility to ensure |
| that all needed external models are registered before running One-Shot |
| Bufferize. |
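
As an illustration, such a registration could look as follows. This is a
sketch: the exact headers and registration functions depend on the dialects
involved and may change between MLIR versions.

```c++
#include "mlir/Dialect/Linalg/Transforms/BufferizableOpInterfaceImpl.h"
#include "mlir/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.h"
#include "mlir/IR/DialectRegistry.h"

using namespace mlir;

// Register the external BufferizableOpInterface models of all dialects that
// may appear in the input IR before running One-Shot Bufferize.
void registerBufferizationExternalModels(DialectRegistry &registry) {
  linalg::registerBufferizableOpInterfaceExternalModels(registry);
  tensor::registerBufferizableOpInterfaceExternalModels(registry);
}
```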
| |
By default, One-Shot Bufferize fails when it encounters an op with tensor
semantics (i.e., tensor result or tensor operand) that is not bufferizable
(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
`to_memref`/`to_tensor` ops around the bufferization boundary. These ops are
named versions of `unrealized_conversion_cast`. Note that One-Shot Bufferize's
analysis currently cannot analyze these ops, so input IR with such ops may fail
bufferization. For the same reason, running One-Shot Bufferize multiple times
in a sequence is not supported at the moment.
| |
| One-Shot Bufferize can be configured to bufferize only ops from a set of |
| dialects with `dialect-filter`. This can be useful for gradually migrating from |
| dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize |
| must run first in such a case, because dialect conversion-based bufferization |
| generates `to_tensor`/`to_memref` ops which One-Shot Bufferize cannot analyze. |
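
A possible invocation with a dialect filter could look as follows (a sketch of
the pass option syntax; `input.mlir` is a placeholder):

```
mlir-opt input.mlir \
  --one-shot-bufferize="dialect-filter=tensor,linalg allow-unknown-ops"
```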
| |
| One-Shot Bufferize can also be called programmatically with |
| [`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167). |
| Alternatively, |
| [`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78) |
| skips the analysis and inserts a copy on every buffer write, just like the |
| dialect conversion-based bufferization. |
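
A minimal programmatic invocation could look as follows (a sketch; the exact
option fields may differ between MLIR versions):

```c++
#include "mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h"
#include "mlir/IR/BuiltinOps.h"

using namespace mlir;

LogicalResult bufferizeWithAnalysis(ModuleOp module) {
  bufferization::OneShotBufferizationOptions options;
  options.allowUnknownOps = true; // Tolerate non-bufferizable ops.
  return bufferization::runOneShotBufferize(module, options);
}
```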
| |
| ## Buffer Deallocation |
| |
**Important: This pass is deprecated. Use the ownership-based buffer
deallocation pass instead.**
| |
In this deprecated mode, One-Shot Bufferize deallocates all buffers that it
allocates. This is in contrast to the dialect conversion-based bufferization,
which delegates this job to the
[`-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-buffer-deallocation-adds-all-required-dealloc-operations-for-all-allocations-in-the-input-program)
pass. By default, One-Shot Bufferize rejects IR where a newly allocated buffer
is returned from a block; such IR fails bufferization.
| |
| A new buffer allocation is returned from a block when the result of an op that |
| is not in destination-passing style is returned. E.g.: |
| |
| ```mlir |
| %0 = scf.if %c -> (tensor<?xf32>) { |
| %1 = tensor.generate ... -> tensor<?xf32> |
| scf.yield %1 : tensor<?xf32> |
| } else { |
| scf.yield %another_tensor : tensor<?xf32> |
| } |
| ``` |
| |
| The `scf.yield` in the "else" branch is OK, but the `scf.yield` in the "then" |
| branch will be rejected. |
| |
| Another case in which a buffer allocation may be returned is when a buffer copy |
| must be inserted due to a RaW conflict. E.g.: |
| |
| ```mlir |
| %0 = scf.if %c -> (tensor<?xf32>) { |
| %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32> |
| "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> () |
| ... |
| scf.yield %1 : tensor<?xf32> |
| } else { |
| scf.yield %yet_another_tensor : tensor<?xf32> |
| } |
| ``` |
| |
| In the above example, a buffer copy of `buffer(%another_tensor)` (with `%cst` |
| inserted) is yielded from the "then" branch. |
| |
| Note: Buffer allocations that are returned from a function are not deallocated. |
| It is the caller's responsibility to deallocate the buffer. For the full |
| function boundary ABI for MemRefs w.r.t. buffer deallocation refer to the |
| [*Function Boundary ABI*](#function-boundary-abi) section. In the future, this |
| could be automated with allocation hoisting (across function boundaries) or |
| reference counting. |
| |
When this deprecated deallocation mode is disabled, One-Shot Bufferize does not
generate any buffer deallocations and thus leaks all memory. The
`-buffer-deallocation-pipeline` has to be run afterwards to insert the
deallocation operations.
| |
| ## Ownership-based Buffer Deallocation |
| |
| Recommended compilation pipeline: |
| ``` |
| one-shot-bufferize |
| | it's recommended to perform all bufferization here at latest, |
| | <- any allocations inserted after this point have to be handled |
| V manually |
| expand-realloc |
| V |
| ownership-based-buffer-deallocation |
| V |
| canonicalize <- mostly for scf.if simplifications |
| V |
| buffer-deallocation-simplification |
| V <- from this point onwards no tensor values are allowed |
| lower-deallocations |
| V |
| CSE |
| V |
| canonicalize |
| ``` |
| |
One-Shot Bufferize does not deallocate any buffers that it allocates. This job
is delegated to the
[`-ownership-based-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-ownership-based-buffer-deallocation)
pass, i.e., after running One-Shot Bufferize, the result IR may have a number of
`memref.alloc` ops but no `memref.dealloc` ops. This pass processes operations
implementing `FunctionOpInterface` one-by-one without analyzing the call graph.
This means that there have to be [some rules](#function-boundary-abi) on how
MemRefs are handled when being passed from one function to another. The rest of
the pass revolves heavily around the `bufferization.dealloc` operation, which is
inserted at the end of each basic block with appropriate operands and should be
optimized using the Buffer Deallocation Simplification pass
(`--buffer-deallocation-simplification`) and the regular canonicalizer
(`--canonicalize`). Lowering the result of the
`-ownership-based-buffer-deallocation` pass directly using
`--convert-bufferization-to-memref` without prior optimization is not
recommended, as it will lead to very inefficient code (the runtime cost of
`bufferization.dealloc` is `O(|memrefs|^2 + |memrefs| * |retained|)`).
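
On the command line, the recommended pipeline could be assembled roughly as
follows (a sketch; `input.mlir` is a placeholder and pass names may evolve):

```
mlir-opt input.mlir \
  --one-shot-bufferize \
  --expand-realloc \
  --ownership-based-buffer-deallocation \
  --canonicalize \
  --buffer-deallocation-simplification \
  --lower-deallocations \
  --cse \
  --canonicalize
```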
| |
| ### Function boundary ABI |
| |
| The Buffer Deallocation pass operates on the level of operations implementing |
| the `FunctionOpInterface`. Such operations can take MemRefs as arguments, but |
| also return them. To ensure compatibility among all functions (including |
| external ones), some rules have to be enforced: |
| * When a MemRef is passed as a function argument, ownership is never acquired. |
| It is always the caller's responsibility to deallocate such MemRefs. |
| * Returning a MemRef from a function always passes ownership to the caller, |
| i.e., it is also the caller's responsibility to deallocate memrefs returned |
| from a called function. |
* A function must not return a MemRef with the same allocated base buffer as
one of its arguments (in this case, a copy has to be created); see the sketch
below. Note that in this context, two subviews of the same buffer that don't
overlap are also considered to alias.
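
The following sketch illustrates the third rule: the function must not return
`%arg0` directly, so a copy is created (e.g., with `bufferization.clone`)
before returning.

```mlir
func.func @return_arg(%arg0: memref<?xf32>) -> memref<?xf32> {
  // Returning %arg0 directly would hand ownership of a buffer the function
  // does not own back to the caller, so a copy is returned instead.
  %clone = bufferization.clone %arg0 : memref<?xf32> to memref<?xf32>
  return %clone : memref<?xf32>
}
```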
| |
For external functions (e.g., library functions written externally in C), the
buffer deallocation pass simply assumes that these rules hold; it is up to the
externally provided implementation to adhere to them. Functions whose
implementation is accessible and on which the deallocation pass is applied are
modified by the pass such that the ABI is respected (i.e., buffer copies are
inserted when necessary).
| |
| ### Inserting `bufferization.dealloc` operations |
| |
`bufferization.dealloc` operations are unconditionally inserted at the end of
each basic block (just before the terminator). The majority of the pass is about
finding the correct operands for this operation. There are three variadic
operand lists to be populated: the first contains all MemRef values that may
need to be deallocated, the second contains their associated ownership values
(of `i1` type), and the third contains MemRef values that are still needed at a
later point and should thus not be deallocated. This operation allows us to deal
with any kind of aliasing behavior: it lowers to runtime aliasing checks when
not enough information can be collected statically. When enough aliasing
information is statically available, operands or the entire op may fold away.
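
Schematically, the operation has the following shape (cf. the lowering examples
later in this document):

```mlir
// Deallocate %m0 if %cond0 is true and %m1 if %cond1 is true, unless the
// respective MemRef aliases with one of the retained values %r0 or %r1. The
// results are the updated ownership indicators for %r0 and %r1.
%ownerships:2 = bufferization.dealloc (%m0, %m1 : memref<8xf32>, memref<8xf32>)
                  if (%cond0, %cond1)
                  retain (%r0, %r1 : memref<8xf32>, memref<8xf32>)
```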
| |
| **Ownerships** |
| |
To compute the ownership operands, we use the concept of ownership indicators
for MemRefs: an ownership indicator materializes as an `i1` value for an SSA
value of `memref` type and indicates whether the basic block in which it was
materialized has ownership of this MemRef. Ideally, this is a constant `true`
or `false`, but it might also be a non-constant SSA value. To keep track of
those ownership values without immediately materializing them (which might
require the insertion of `bufferization.clone` operations or operations
checking for aliasing at runtime at positions where we don't actually need a
materialized value), we use the `Ownership` class. This class represents the
ownership in three states forming a lattice on a partial order:
| ``` |
| forall X in SSA values. uninitialized < unique(X) < unknown |
| forall X, Y in SSA values. |
| unique(X) == unique(Y) iff X and Y always evaluate to the same value |
| unique(X) != unique(Y) otherwise |
| ``` |
Intuitively, the states have the following meaning:
* Uninitialized: The ownership is not initialized yet; this is the default
state. Once an operation is finished processing, the ownership of all
operation results with MemRef type should no longer be uninitialized.
* Unique: There is a specific SSA value that can be queried to check ownership
without materializing any additional IR.
* Unknown: No specific SSA value is available without materializing additional
IR. Typically, this is because two ownerships in the 'Unique' state would have
to be merged manually. (E.g., the result of an `arith.select` has either the
ownership of the "then" or the "else" case, depending on the condition value;
inserting another `arith.select` on the ownership values can perform the merge
and provide a 'Unique' ownership for the result.) However, in the general
case, the 'Unknown' state has to be assigned.
| |
| Implied by the above partial order, the pass combines two ownerships in the |
| following way: |
| |
| | Ownership 1 | Ownership 2 | Combined Ownership | |
| |:--------------|:--------------|:-------------------| |
| | uninitialized | uninitialized | uninitialized | |
| | unique(X) | uninitialized | unique(X) | |
| | unique(X) | unique(X) | unique(X) | |
| | unique(X) | unique(Y) | unknown | |
| | unknown | unique | unknown | |
| | unknown | uninitialized | unknown | |

The remaining combinations follow by symmetry.
| |
| **Collecting the list of MemRefs that potentially need to be deallocated** |
| |
For a given block, the list of MemRefs that potentially need to be deallocated
at the end of that block is computed by keeping track of all values for which
the block potentially takes over ownership. This includes MemRefs provided as
basic block arguments and MemRefs allocated in the block itself (as reported by
the interface handlers for operations like `memref.alloc` and `func.call`), but
also relies on liveness information in regions with multiple basic blocks. More
concretely, the list is computed by taking the MemRefs in the 'in' set of the
liveness analysis of the current basic block B, adding the MemRef block
arguments and the set of MemRefs allocated in B itself (determined by the
interface handlers), and then subtracting the set of MemRefs deallocated in B
(also determined by the interface handlers).
| |
| Note that we don't have to take the intersection of the liveness 'in' set with |
| the 'out' set of the predecessor block because a value that is in the 'in' set |
| must be defined in an ancestor block that dominates all direct predecessors and |
| thus the 'in' set of this block is a subset of the 'out' sets of each |
| predecessor. |
| |
| ``` |
| memrefs = filter((liveIn(block) U |
| allocated(block) U arguments(block)) \ deallocated(block), isMemRef) |
| ``` |
| |
The list of conditions for the second variadic operand list of
`bufferization.dealloc` is computed by querying the stored ownership value for
each of the MemRefs collected as described above. The ownership state is
updated by the interface handlers while processing the basic block.
| |
| **Collecting the list of MemRefs to retain** |
| |
Given a basic block B, the list of MemRefs that have to be retained can be
different for each successor block S. The list of MemRefs to be retained in B
with respect to S consists of the MemRefs in the successor operand list of the
terminator, plus the MemRefs in the 'out' set of the liveness analysis for B
intersected with the 'in' set of the destination block S.
| |
| This list of retained values makes sure that we cannot run into use-after-free |
| situations even if no aliasing information is present at compile-time. |
| |
| ``` |
toRetain = filter(successorOperands + (liveOut(fromBlock) intersect
liveIn(toBlock)), isMemRef)
| ``` |
| |
| ### Supported interfaces |
| |
| The pass uses liveness analysis and a few interfaces: |
| * `FunctionOpInterface` |
| * `CallOpInterface` |
| * `MemoryEffectOpInterface` |
| * `RegionBranchOpInterface` |
| * `RegionBranchTerminatorOpInterface` |
| |
Due to insufficient information provided by these interfaces, the pass also
special-cases the `cf.cond_br` operation and currently makes some assumptions
about operations implementing the `RegionBranchOpInterface`; improving the
interfaces would allow us to remove those dependencies in the future.
| |
| ### Limitations |
| |
| The Buffer Deallocation pass has some requirements and limitations on the input |
| IR. These are checked in the beginning of the pass and errors are emitted |
| accordingly: |
* The set of interfaces the pass operates on must be implemented (correctly).
E.g., if an operation with a nested region does not implement the
`RegionBranchOpInterface`, an error is emitted because the pass cannot know
the semantics of the nested region (and does not make any default assumptions
about it).
* No explicit control-flow loops are present. Currently, only loops using
structured control flow are supported. However, this limitation could be
lifted in the future.
* Deallocation operations should not be present already. The pass is likely to
handle them correctly in most cases, but this is not officially supported yet
due to insufficient testing.
| * Terminators must implement either `RegionBranchTerminatorOpInterface` or |
| `BranchOpInterface`, but not both. Terminators with more than one successor |
| are not supported (except `cf.cond_br`). This is not a fundamental |
| limitation, but there is no use-case justifying the more complex |
| implementation at the moment. |
| |
| ### Example |
| |
| The following example contains a few interesting cases: |
* Basic block arguments are modified to also pass along the ownership
indicator, but not for entry blocks of non-private functions (assuming the
`private-function-dynamic-ownership` pass option is disabled), where the
function boundary ABI is applied instead. "Private" in this context refers to
functions that cannot be called externally.
| * The result of `arith.select` initially has 'Unknown' assigned as ownership, |
| but once the `bufferization.dealloc` operation is inserted it is put in the |
| 'retained' list (since it has uses in a later basic block) and thus the |
| 'Unknown' ownership can be replaced with a 'Unique' ownership using the |
| corresponding result of the dealloc operation. |
| * The `cf.cond_br` operation has more than one successor and thus has to |
| insert two `bufferization.dealloc` operations (one for each successor). |
| While they have the same list of MemRefs to deallocate (because they perform |
| the deallocations for the same block), it must be taken into account that |
| some MemRefs remain *live* for one branch but not the other (thus set |
| intersection is performed on the *live-out* of the current block and the |
| *live-in* of the target block). Also, `cf.cond_br` supports separate |
| forwarding operands for each successor. To make sure that no MemRef is |
| deallocated twice (because there are two `bufferization.dealloc` operations |
| with the same MemRefs to deallocate), the condition operands are adjusted to |
| take the branch condition into account. While a generic lowering for such |
| terminator operations could be implemented, a specialized implementation can |
| take all the semantics of this particular operation into account and thus |
| generate a more efficient lowering. |
| |
| ```mlir |
| func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) { |
| %alloc = memref.alloc() : memref<?xi8> |
| %alloca = memref.alloca() : memref<?xi8> |
| %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8> |
| cf.cond_br %br_cond, ^bb1(%alloc : memref<?xi8>), ^bb1(%memref : memref<?xi8>) |
| ^bb1(%bbarg: memref<?xi8>): |
| test.copy(%bbarg, %select) : (memref<?xi8>, memref<?xi8>) |
| return |
| } |
| ``` |
| |
| After running `--ownership-based-buffer-deallocation`, it looks as follows: |
| |
| ```mlir |
| // Since this is not a private function, the signature will not be modified even |
| // when private-function-dynamic-ownership is enabled. Instead the function |
| // boundary ABI has to be applied which means that ownership of `%memref` will |
| // never be acquired. |
| func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) { |
| %false = arith.constant false |
| %true = arith.constant true |
| |
| // The ownership of a MemRef defined by the `memref.alloc` operation is always |
| // assigned to be 'true'. |
| %alloc = memref.alloc() : memref<?xi8> |
| |
| // The ownership of a MemRef defined by the `memref.alloca` operation is |
| // always assigned to be 'false'. |
| %alloca = memref.alloca() : memref<?xi8> |
| |
| // The ownership of %select will be the join of the ownership of %alloc and |
| // the ownership of %alloca, i.e., of %true and %false. Because the pass does |
| // not know about the semantics of the `arith.select` operation (unless a |
| // custom handler is implemented), the ownership join will be 'Unknown'. If |
| // the materialized ownership indicator of %select is needed, either a clone |
| // has to be created for which %true is assigned as ownership or the result |
| // of a `bufferization.dealloc` where %select is in the retain list has to be |
| // used. |
| %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8> |
| |
// We use `memref.extract_strided_metadata` to get the base memref since it is
// not allowed to pass arbitrary memrefs to `memref.dealloc`. This property is
// already enforced for `bufferization.dealloc`.
| %base_buffer_memref, ... = memref.extract_strided_metadata %memref |
| : memref<?xi8> -> memref<i8>, index, index, index |
| %base_buffer_alloc, ... = memref.extract_strided_metadata %alloc |
| : memref<?xi8> -> memref<i8>, index, index, index |
| %base_buffer_alloca, ... = memref.extract_strided_metadata %alloca |
| : memref<?xi8> -> memref<i8>, index, index, index |
| |
| // The deallocation conditions need to be adjusted to incorporate the branch |
| // condition. In this example, this requires only a single negation, but might |
| // also require multiple arith.andi operations. |
| %not_br_cond = arith.xori %true, %br_cond : i1 |
| |
| // There are two dealloc operations inserted in this basic block, one per |
| // successor. Both have the same list of MemRefs to deallocate and the |
| // conditions only differ by the branch condition conjunct. |
| // Note, however, that the retained list differs. Here, both contain the |
| // %select value because it is used in both successors (since it's the same |
| // block), but the value passed via block argument differs (%memref vs. |
| // %alloc). |
| %10:2 = bufferization.dealloc |
| (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca |
| : memref<i8>, memref<i8>, memref<i8>) |
| if (%false, %br_cond, %false) |
| retain (%alloc, %select : memref<?xi8>, memref<?xi8>) |
| |
| %11:2 = bufferization.dealloc |
| (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca |
| : memref<i8>, memref<i8>, memref<i8>) |
| if (%false, %not_br_cond, %false) |
| retain (%memref, %select : memref<?xi8>, memref<?xi8>) |
| |
// Because %select is used in ^bb1 without passing it via block argument, we
// need to update its ownership value here by merging the ownership values
// returned by the dealloc operations.
| %new_ownership = arith.select %br_cond, %10#1, %11#1 : i1 |
| |
| // The terminator is modified to pass along the ownership indicator values |
| // with each MemRef value. |
| cf.cond_br %br_cond, ^bb1(%alloc, %10#0 : memref<?xi8>, i1), |
| ^bb1(%memref, %11#0 : memref<?xi8>, i1) |
| |
| // All non-entry basic blocks are modified to have an additional i1 argument for |
| // each MemRef value in the argument list. |
| ^bb1(%13: memref<?xi8>, %14: i1): // 2 preds: ^bb0, ^bb0 |
| test.copy(%13, %select) : (memref<?xi8>, memref<?xi8>) |
| |
| %base_buffer_13, ... = memref.extract_strided_metadata %13 |
| : memref<?xi8> -> memref<i8>, index, index, index |
| %base_buffer_select, ... = memref.extract_strided_metadata %select |
| : memref<?xi8> -> memref<i8>, index, index, index |
| |
| // Here, we don't have a retained list, because the block has no successors |
| // and the return has no operands. |
| bufferization.dealloc (%base_buffer_13, %base_buffer_select |
| : memref<i8>, memref<i8>) |
| if (%14, %new_ownership) |
| return |
| } |
| ``` |
| |
| ## Buffer Deallocation Simplification Pass |
| |
| The [semantics of the `bufferization.dealloc` operation](https://mlir.llvm.org/docs/Dialects/BufferizationOps/#bufferizationdealloc-bufferizationdeallocop) |
| provide a lot of opportunities for optimizations which can be conveniently split |
| into patterns using the greedy pattern rewriter. Some of those patterns need |
| access to additional analyses such as an analysis that can determine whether two |
| MemRef values must, may, or never originate from the same buffer allocation. |
| These patterns are collected in the Buffer Deallocation Simplification pass, |
| while patterns that don't need additional analyses are registered as part of the |
| regular canonicalizer pass. This pass is best run after |
| `--ownership-based-buffer-deallocation` followed by `--canonicalize`. |
| |
| The pass applies patterns for the following simplifications: |
* Remove MemRefs from the 'retain' list when they are guaranteed to not alias
with any value in the 'memref' operand list. This avoids an additional
aliasing check with the removed value.
* Split off a value in the 'memref' list into a new `bufferization.dealloc`
operation that contains only this value in its 'memref' list when it is
guaranteed to not alias with any other value in the 'memref' list. This avoids
at least one aliasing check at runtime and enables using a more efficient
lowering for the new `bufferization.dealloc` operation.
* Remove values from the 'memref' operand list when they are guaranteed to
alias with at least one value in the 'retain' list and may not alias any other
value in the 'retain' list.
| |
| ## Lower Deallocations Pass |
| |
| The `-lower-deallocations` pass transforms all `bufferization.dealloc` |
| operations to `memref.dealloc` operations and may also insert operations from |
| the `scf`, `func`, and `arith` dialects to make deallocations conditional and |
| check whether two MemRef values come from the same allocation at runtime (when |
| the `buffer-deallocation-simplification` pass wasn't able to determine it |
| statically). |
| |
| The same lowering of the `bufferization.dealloc` operation is also part of the |
| `-convert-bufferization-to-memref` conversion pass which also lowers all the |
| other operations of the bufferization dialect. |
| |
| We distinguish multiple cases in this lowering pass to provide an overall more |
| efficient lowering. In the general case, a library function is created to avoid |
| quadratic code size explosion (relative to the number of operands of the dealloc |
| operation). The specialized lowerings aim to avoid this library function because |
| it requires allocating auxiliary MemRefs of index values. |
| |
| ### Generic Lowering |
| |
| A library function is generated to avoid code-size blow-up. On a high level, the |
| base-memref of all operands is extracted as an index value and stored into |
| specifically allocated MemRefs and passed to the library function which then |
| determines whether they come from the same original allocation. This information |
| is needed to avoid double-free situations and to correctly retain the MemRef |
| values in the `retained` list. |
| |
| **Dealloc Operation Lowering** |
| |
This lowering supports all features the dealloc operation has to offer. It
computes the base pointer of each memref (as an index), stores it in a
new memref helper structure and passes it to the helper function generated
in `buildDeallocationLibraryFunction`. The results are stored in two lists
(represented as MemRefs) of booleans passed as arguments. The first list
stores whether the corresponding MemRef should be deallocated, the
second list stores the ownership of the retained values which can be used
to replace the result values of the `bufferization.dealloc` operation.
| |
| Example: |
| ``` |
| %0:2 = bufferization.dealloc (%m0, %m1 : memref<2xf32>, memref<5xf32>) |
| if (%cond0, %cond1) |
| retain (%r0, %r1 : memref<1xf32>, memref<2xf32>) |
| ``` |
| lowers to (simplified): |
| ``` |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %dealloc_base_pointer_list = memref.alloc() : memref<2xindex> |
| %cond_list = memref.alloc() : memref<2xi1> |
| %retain_base_pointer_list = memref.alloc() : memref<2xindex> |
| %m0_base_pointer = memref.extract_aligned_pointer_as_index %m0 |
| memref.store %m0_base_pointer, %dealloc_base_pointer_list[%c0] |
| %m1_base_pointer = memref.extract_aligned_pointer_as_index %m1 |
| memref.store %m1_base_pointer, %dealloc_base_pointer_list[%c1] |
| memref.store %cond0, %cond_list[%c0] |
| memref.store %cond1, %cond_list[%c1] |
| %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0 |
| memref.store %r0_base_pointer, %retain_base_pointer_list[%c0] |
| %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1 |
| memref.store %r1_base_pointer, %retain_base_pointer_list[%c1] |
| %dyn_dealloc_base_pointer_list = memref.cast %dealloc_base_pointer_list : |
| memref<2xindex> to memref<?xindex> |
| %dyn_cond_list = memref.cast %cond_list : memref<2xi1> to memref<?xi1> |
| %dyn_retain_base_pointer_list = memref.cast %retain_base_pointer_list : |
| memref<2xindex> to memref<?xindex> |
| %dealloc_cond_out = memref.alloc() : memref<2xi1> |
| %ownership_out = memref.alloc() : memref<2xi1> |
| %dyn_dealloc_cond_out = memref.cast %dealloc_cond_out : |
| memref<2xi1> to memref<?xi1> |
| %dyn_ownership_out = memref.cast %ownership_out : |
| memref<2xi1> to memref<?xi1> |
| call @dealloc_helper(%dyn_dealloc_base_pointer_list, |
| %dyn_retain_base_pointer_list, |
| %dyn_cond_list, |
| %dyn_dealloc_cond_out, |
| %dyn_ownership_out) : (...) |
| %m0_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c0] : memref<2xi1> |
| scf.if %m0_dealloc_cond { |
| memref.dealloc %m0 : memref<2xf32> |
| } |
| %m1_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c1] : memref<2xi1> |
| scf.if %m1_dealloc_cond { |
| memref.dealloc %m1 : memref<5xf32> |
| } |
| %r0_ownership = memref.load %dyn_ownership_out[%c0] : memref<2xi1> |
| %r1_ownership = memref.load %dyn_ownership_out[%c1] : memref<2xi1> |
| memref.dealloc %dealloc_base_pointer_list : memref<2xindex> |
| memref.dealloc %retain_base_pointer_list : memref<2xindex> |
| memref.dealloc %cond_list : memref<2xi1> |
| memref.dealloc %dealloc_cond_out : memref<2xi1> |
| memref.dealloc %ownership_out : memref<2xi1> |
| // replace %0#0 with %r0_ownership |
| // replace %0#1 with %r1_ownership |
| ``` |
| |
| **Library function** |
| |
| A library function is built per compilation unit that can be called at |
| bufferization dealloc sites to determine whether two MemRefs come from the same |
| allocation and their new ownerships. |
| |
| The generated function takes two MemRefs of indices and three MemRefs of |
| booleans as arguments: |
| * The first argument A should contain the result of the |
| extract_aligned_pointer_as_index operation applied to the MemRefs to be |
| deallocated |
| * The second argument B should contain the result of the |
| extract_aligned_pointer_as_index operation applied to the MemRefs to be |
| retained |
| * The third argument C should contain the conditions as passed directly |
| to the deallocation operation. |
| * The fourth argument D is used to pass results to the caller. Those |
| represent the condition under which the MemRef at the corresponding |
| position in A should be deallocated. |
* The fifth argument E is used to pass results to the caller. It
provides the ownership value corresponding to the MemRef at the same
position in B.
| |
| This helper function is supposed to be called once for each |
| `bufferization.dealloc` operation to determine the deallocation need and |
| new ownership indicator for the retained values, but does not perform the |
| deallocation itself. |
| |
| Generated code: |
| ``` |
| func.func @dealloc_helper( |
| %dyn_dealloc_base_pointer_list: memref<?xindex>, |
| %dyn_retain_base_pointer_list: memref<?xindex>, |
| %dyn_cond_list: memref<?xi1>, |
| %dyn_dealloc_cond_out: memref<?xi1>, |
| %dyn_ownership_out: memref<?xi1>) { |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %true = arith.constant true |
| %false = arith.constant false |
| %num_dealloc_memrefs = memref.dim %dyn_dealloc_base_pointer_list, %c0 |
| %num_retain_memrefs = memref.dim %dyn_retain_base_pointer_list, %c0 |
| // Zero initialize result buffer. |
| scf.for %i = %c0 to %num_retain_memrefs step %c1 { |
| memref.store %false, %dyn_ownership_out[%i] : memref<?xi1> |
| } |
| scf.for %i = %c0 to %num_dealloc_memrefs step %c1 { |
| %dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%i] |
| %cond = memref.load %dyn_cond_list[%i] |
| // Check for aliasing with retained memrefs. |
| %does_not_alias_retained = scf.for %j = %c0 to %num_retain_memrefs |
| step %c1 iter_args(%does_not_alias_aggregated = %true) -> (i1) { |
| %retain_bp = memref.load %dyn_retain_base_pointer_list[%j] |
| %does_alias = arith.cmpi eq, %retain_bp, %dealloc_bp : index |
| scf.if %does_alias { |
| %curr_ownership = memref.load %dyn_ownership_out[%j] |
| %updated_ownership = arith.ori %curr_ownership, %cond : i1 |
| memref.store %updated_ownership, %dyn_ownership_out[%j] |
| } |
| %does_not_alias = arith.cmpi ne, %retain_bp, %dealloc_bp : index |
| %updated_aggregate = arith.andi %does_not_alias_aggregated, |
| %does_not_alias : i1 |
| scf.yield %updated_aggregate : i1 |
| } |
| // Check for aliasing with dealloc memrefs in the list before the |
| // current one, i.e., |
| // `fix i, forall j < i: check_aliasing(%dyn_dealloc_base_pointer[j], |
| // %dyn_dealloc_base_pointer[i])` |
| %does_not_alias_any = scf.for %j = %c0 to %i step %c1 |
| iter_args(%does_not_alias_agg = %does_not_alias_retained) -> (i1) { |
| %prev_dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%j] |
| %does_not_alias = arith.cmpi ne, %prev_dealloc_bp, %dealloc_bp |
| %updated_alias_agg = arith.andi %does_not_alias_agg, %does_not_alias |
| scf.yield %updated_alias_agg : i1 |
| } |
| %dealloc_cond = arith.andi %does_not_alias_any, %cond : i1 |
| memref.store %dealloc_cond, %dyn_dealloc_cond_out[%i] : memref<?xi1> |
| } |
| return |
| } |
| ``` |
| |
| ### Specialized Lowerings |
| |
| Currently, there are two special lowerings for common cases to avoid the library |
| function and thus unnecessary memory load and store operations and function |
| calls: |
| |
| **One memref, no retained** |
| |
| Lower a simple case without any retained values and a single MemRef. Ideally, |
| static analysis can provide enough information such that the |
| `buffer-deallocation-simplification` pass is able to split the dealloc |
| operations up into this simple case as much as possible before running this |
| pass. |
| |
| Example: |
| ```mlir |
| bufferization.dealloc (%arg0 : memref<2xf32>) if (%arg1) |
| ``` |
| is lowered to |
| ```mlir |
| scf.if %arg1 { |
| memref.dealloc %arg0 : memref<2xf32> |
| } |
| ``` |
| |
| In most cases, the branch condition is either constant 'true' or 'false' and can |
| thus be optimized away entirely by the canonicalizer pass. |
| |
| **One memref, arbitrarily many retained** |
| |
A special-case lowering for the deallocation operation with exactly one MemRef,
but an arbitrary number of retained values. The size of the code produced by
this lowering is linear in the number of retained values.
| |
| Example: |
| ```mlir |
| %0:2 = bufferization.dealloc (%m : memref<2xf32>) if (%cond) |
| retain (%r0, %r1 : memref<1xf32>, memref<2xf32>) |
| return %0#0, %0#1 : i1, i1 |
| ``` |
| is lowered to |
| ```mlir |
| %m_base_pointer = memref.extract_aligned_pointer_as_index %m |
| %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0 |
| %r0_does_not_alias = arith.cmpi ne, %m_base_pointer, %r0_base_pointer |
| %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1 |
| %r1_does_not_alias = arith.cmpi ne, %m_base_pointer, %r1_base_pointer |
| %not_retained = arith.andi %r0_does_not_alias, %r1_does_not_alias : i1 |
| %should_dealloc = arith.andi %not_retained, %cond : i1 |
| scf.if %should_dealloc { |
| memref.dealloc %m : memref<2xf32> |
| } |
| %true = arith.constant true |
| %r0_does_alias = arith.xori %r0_does_not_alias, %true : i1 |
| %r0_ownership = arith.andi %r0_does_alias, %cond : i1 |
| %r1_does_alias = arith.xori %r1_does_not_alias, %true : i1 |
| %r1_ownership = arith.andi %r1_does_alias, %cond : i1 |
| return %r0_ownership, %r1_ownership : i1, i1 |
| ``` |
| |
| ## Memory Layouts |
| |
One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
ops are bufferizable. However, when encountering a non-bufferizable tensor with
`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
bufferization boundary and decide on a memref type. By default, One-Shot
Bufferize chooses the most dynamic memref type w.r.t. layout maps. E.g.:
| |
| ```mlir |
| %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) |
| %1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32> |
| ``` |
| |
When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
dynamic offset and strides:
| |
| ```mlir |
| %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) |
| %0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>> |
| %1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>> |
| ``` |
| |
| All users of `%0` have fully dynamic layout maps. This ensures that the |
| bufferized IR composes well with future bufferizations of `unbufferizable_op` |
| (maybe bufferized by another pass), regardless of the exact memref type of the |
| future bufferization. If the op turns out to be bufferized to an op with a |
| simpler memref type (e.g., identity layout map), we expect that canonicalization |
| patterns would clean up unnecessarily dynamic layout maps. (Some of these |
| canonicalization patterns may not be implemented yet.) |
| |
One-Shot Bufferize tries to infer the most precise memref type when bufferizing
an op. If the entire IR is bufferizable, we do not have to resort to
conservative, fully dynamic layout maps. In that case, we also do not have to
rely on canonicalization patterns to clean up the bufferized IR.
| |
Note: There are some bufferizable ops for which a precise layout map cannot be
| inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>` |
| must be bufferized to a `memref.cast` with a memref type that has a fully |
| dynamic layout map. |
| |
| One-Shot Bufferize has an option `unknown-type-conversion` to control the |
| generation of layout maps when no precise layout can be inferred: |
| |
| * `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default |
| behavior. This composes well when IR is partially bufferized. |
| * `identity-layout-map` uses static identity layout maps. This option can be |
| useful for legacy code that cannot handle memref types with layout maps. |
| Note that this setting can lead to additional buffer copies when folding a |
| `to_tensor`/`to_memref` pair with memref types that are not cast-compatible. |
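
For the example from the previous section, `identity-layout-map` would produce
a memref type without a layout attribute (sketch):

```mlir
%0_m = bufferization.to_memref %0 : memref<?x?xf32>
%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32>
```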
| |
| Note: The `unknown-type-conversion` option does not affect layout maps of |
| function signatures. There is a separate `function-signature-type-conversion` |
| option that controls layout maps of function parameters and function results. |
| |
| ## Extending One-Shot Bufferize |
| |
| Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users |
| must at least implement the following interface methods. |
| |
| * `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor |
| OpOperand is read. |
| * `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor |
| OpOperand is written (if bufferizing in-place). |
* `getAliasingOpResult`: Return the OpResults that may share the same buffer
as the given OpOperand. This interface method describes the
OpOperand-to-OpResult mapping w.r.t. destination-passing style.
* `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
is the exact same memref as the aliasing OpOperand after bufferization (in
case of in-place bufferization). Otherwise (e.g., if they overlap but are not
necessarily the exact same memrefs), `BufferRelation::Unknown` should be
returned. Additional buffer relations will be added in the future, but
`BufferRelation::Unknown` is always safe.
| * `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced |
| with `bufferization::replaceOpWithBufferizedValues`. |
| |
| To get a better intuition of the interface methods, we invite users to take a |
| look at existing implementations in MLIR, e.g., the implementation of |
| `tensor.insert` or `tensor.extract`. |
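
As a rough illustration, an external model for a hypothetical single-operand,
single-result op `my_dialect.my_op` (a made-up op for this sketch) could look
as follows. The exact method signatures evolve with MLIR, so consult the
interface definition for the authoritative API.

```c++
#include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"

using namespace mlir;
using namespace mlir::bufferization;

// Hypothetical destination-passing style op: it reads its tensor operand and,
// when bufferizing in-place, writes the result into the operand's buffer.
struct MyOpInterface
    : public BufferizableOpInterface::ExternalModel<MyOpInterface,
                                                    my_dialect::MyOp> {
  bool bufferizesToMemoryRead(Operation *op, OpOperand &opOperand,
                              const AnalysisState &state) const {
    return true; // The contents of the operand's buffer are read.
  }

  bool bufferizesToMemoryWrite(Operation *op, OpOperand &opOperand,
                               const AnalysisState &state) const {
    return true; // The operand's buffer is written (if in-place).
  }

  SmallVector<OpResult> getAliasingOpResult(Operation *op, OpOperand &opOperand,
                                            const AnalysisState &state) const {
    return {op->getResult(0)}; // The result aliases with the "destination".
  }

  BufferRelation bufferRelation(Operation *op, OpResult opResult,
                                const AnalysisState &state) const {
    return BufferRelation::Equivalent;
  }

  LogicalResult bufferize(Operation *op, RewriterBase &rewriter,
                          const BufferizationOptions &options) const {
    FailureOr<Value> buffer = getBuffer(rewriter, op->getOperand(0), options);
    if (failed(buffer))
      return failure();
    // ... create the memref-based form of the op here ...
    replaceOpWithBufferizedValues(rewriter, op, *buffer);
    return success();
  }
};
```

Such a model is typically attached to the op in a registration function like
the ones mentioned in the section on using One-Shot Bufferize.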
| |
| ## Debugging Buffer Copies |
| |
To get a better understanding of why One-Shot Bufferize introduced a buffer
copy, users can run the pass with `test-analysis-only` and `print-conflicts`.
Every tensor op is then annotated with an attribute that has a boolean value
for each tensor OpOperand. `true` means that the OpOperand bufferizes in-place.
`false` means that the OpOperand bufferizes out-of-place and a buffer copy will
be inserted.
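
A typical invocation could look as follows (a sketch; `input.mlir` is a
placeholder):

```
mlir-opt input.mlir \
  --one-shot-bufferize="test-analysis-only print-conflicts"
```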
| |
| There are two reasons why a buffer copy may be inserted. |
| |
| 1. Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the |
| overwritten data is still needed. |
| 2. The buffer is not writable. E.g., `memref.global` buffers that are the |
| result of `arith.constant` ops are never modified. |
| |
| In the first case, `print-conflicts` illustrates the conflict in the form of a |
| ("read", "conflicting write", "last write") tuple. |
| |
| ## Understanding the SSA Use-Def Chain Analysis |
| |
| To get a better understanding of the SSA Use-Def Chain Analysis and the RaW |
| conflict detection algorithm, we invite interested users to read the |
| [design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf) |
| and watch the corresponding [ODM talk](https://youtu.be/TXEo59CYS9A) |
| ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)). |
| |
| ## Migrating from Dialect Conversion-based Bufferization |
| |
| Both dialect conversion-based bufferization and One-Shot Bufferize generate |
| `to_tensor`/`to_memref` ops at the bufferization boundary (when run with |
| `allow-unknown-ops`). They can be combined and run in sequence. However, |
| One-Shot Bufferize must run first because it cannot analyze those boundary ops. |
| To update existing code step-by-step, it may be useful to specify a dialect |
| filter for One-Shot Bufferize, so that dialects can be switched over one-by-one. |
| |
| ## Bufferization Function Graphs |
| |
One-Shot Bufferize currently does not support function graph bufferization.
I.e., `CallOp`, `ReturnOp` and function bbArgs are not bufferizable. Users can
run the existing `--func-bufferize` bufferization pass after One-Shot
Bufferize.
| |
| Alternatively, users can try |
| [`ModuleBufferization`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Linalg/ComprehensiveBufferize/ModuleBufferization.h#L31), |
| which is an extension of One-Shot Bufferize. This bufferization is still under |
| development and does not support arbitrary IR. In essence, returning a tensor |
| from a function is not supported, unless it is equivalent to a function bbArg. |
| In that case, the corresponding return value can simply be dropped during |
| bufferization. |
| |
| ## Dialect Conversion-based Bufferization |
| |
| Disclaimer: Most dialect conversion-based bufferization has been migrated to |
| One-Shot Bufferize. New users should use One-Shot Bufferize (with or without |
| analysis). The following documentation is only for existing users of dialect |
| conversion-based bufferization. |
| |
This system is a simple application of MLIR's dialect conversion
infrastructure. The bulk of the code related to bufferization is a set of
ordinary `ConversionPattern`s that dialect authors write for converting ops
that operate on `tensor`s to ops that operate on `memref`s. A set of
conventions and best practices are followed that allow these patterns to be run
across multiple independent passes (rather than requiring a single huge atomic
conversion pass), which makes the compilation pipelines scalable, robust, and
easy to debug.
| |
| This document is targeted at people looking to utilize MLIR's bufferization |
| functionality, along with people who want to extend it to cover their own ops. |
| |
| <a name="the-talk">**NOTE:**</a> Before reading this document, please watch the |
| talk "Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization |
| Infrastructure" |
| ([slides](https://drive.google.com/file/d/1FVbzCXxZzS9LBLuvpPNLWJD-XDkt54ky/view?usp=sharing), |
| [recording](https://drive.google.com/file/d/1VfVajitgf8ZPnd-HRkJvaJiFLhBsluXN/view?usp=sharing)). |
| That talk gives a high-level overview of the bufferization infrastructure and |
| important conceptual details related to using the MLIR dialect conversion |
| infrastructure. |
| |
| ### Bufferization's place in a compilation pipeline |
| |
| Bufferization itself does not free any of the buffers that have been allocated, |
| nor does it do anything particularly intelligent with the placement of buffers |
| w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist |
| of: |
| |
| 1. Bufferization |
| 1. Buffer optimizations such as `buffer-hoisting`, `buffer-loop-hoisting`, and |
| `promote-buffers-to-stack`, which do optimizations that are only exposed |
| after bufferization. |
| 1. Finally, running the [buffer deallocation](BufferDeallocationInternals.md) |
| pass. |
| |
| After buffer deallocation has been completed, the program will be quite |
| difficult to transform due to the presence of the deallocation ops. Thus, other |
| optimizations such as linalg fusion on memrefs should be done before that stage. |
| |
| ### General structure of the bufferization process |
| |
| Bufferization consists of running multiple *partial* bufferization passes, |
| followed by one *finalizing* bufferization pass. |
| |
| There is typically one partial bufferization pass per dialect (though other |
| subdivisions are possible). For example, for a dialect `X` there will typically |
| be a pass `X-bufferize` that knows how to bufferize all the ops in that dialect. |
| By running pass `X-bufferize` for each dialect `X` in the program, all the ops |
| in the program are incrementally bufferized. |
| |
| Partial bufferization passes create programs where only some ops have been |
| bufferized. These passes will create *materializations* (also sometimes called |
| "casts") that convert between the `tensor` and `memref` type, which allows |
| bridging between ops that have been bufferized and ops that have not yet been |
| bufferized. |
| |
Finalizing bufferizations complete the bufferization process and guarantee that
no tensors remain in the program. This involves eliminating the
materializations. The `finalizing-bufferize` pass is a minimal pass that only
eliminates materializations and issues an error if any unbufferized ops exist
in the program.
| |
| However, it is possible for a finalizing bufferization to do more than just |
| eliminate materializations. By adding patterns (just as a partial bufferization |
| would), it is possible for a finalizing bufferization pass to simultaneously |
| bufferize ops and eliminate materializations. This has a number of disadvantages |
| discussed in the talk and should generally be avoided. |
| |
| ### Example |
| |
| As a concrete example, we will look at the bufferization pipeline from the |
| `mlir-npcomp` reference backend |
| ([code](https://github.com/llvm/mlir-npcomp/blob/97d6d04d41216e73d40b89ffd79620973fc14ce3/lib/RefBackend/RefBackend.cpp#L232)). |
| The code, slightly simplified and annotated, is reproduced here: |
| |
| ```c++ |
| // Partial bufferization passes. |
| pm.addPass(createTensorConstantBufferizePass()); |
| pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect. |
| pm.addNestedPass<func::FuncOp>(createSCFBufferizePass()); |
| pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass()); |
| pm.addNestedPass<func::FuncOp>(createTensorBufferizePass()); |
| pm.addPass(createFuncBufferizePass()); |
| |
| // Finalizing bufferization pass. |
| pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass()); |
| ``` |
| |
| Looking first at the partial bufferization passes, we see that there is a |
| sequence of `FuncOp` passes (which run in parallel on functions). These function |
| passes are bracketed by `tensor-constant-bufferize` and `func-bufferize`, which |
| are module passes (and thus serialize the parallel compilation process). These |
| two passes must be module passes because they make changes to the top-level |
| module. |
| |
| The bulk of the bufferization work is done by the function passes. Most of these |
| passes are provided as part of the upstream MLIR distribution and bufferize |
| their respective dialects (e.g. `scf-bufferize` bufferizes the `scf` dialect). |
| The `tcp-bufferize` pass is an exception -- it is a partial bufferization pass |
| used to bufferize the downstream `tcp` dialect, and fits in perfectly with all |
| the other passes provided upstream. |
| |
| The last pass is the finalizing bufferization pass. The `mlir-npcomp` reference |
| backend has arranged that all ops are bufferized by partial bufferizations, so |
| that the upstream `finalizing-bufferize` pass can be used as the finalizing |
| bufferization pass. This gives excellent diagnostics when something goes wrong |
| with the bufferization process, such as an op that wasn't handled by any |
| pattern. |
| |
| ### How to write a partial bufferization pass |
| |
| The contract of a partial bufferization pass is that a subset of ops (or kinds |
| of ops, customizable via a `ConversionTarget`) get bufferized. |
| |
| A partial bufferization pass is just a pass that uses the |
| [dialect conversion](DialectConversion.md) framework to apply |
| `ConversionPattern`s with a `tensor` to `memref` type conversion. |
| |
| To describe how to write such a pass, we will walk through an example, the |
| `tensor-bufferize` pass |
| ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23), |
| [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Tensor/bufferize.mlir#L1)) |
| that bufferizes the `tensor` dialect. Note that these passes have since been |
| replaced with a `BufferizableOpInterface`-based implementation, so we have to |
| look at an older version of the code. |
| |
| The bulk of the code in the pass will be a set of conversion patterns, with a |
| simple example being |
| [BufferizeCastOp](https://github.com/llvm/llvm-project/blob/2bf6e443e54604c7818c4d1a1837f3d091023270/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23). |
| |
| ```c++ |
| // Convert `tensor.cast` to an equivalent cast on memrefs. |
| class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> { |
| public: |
|   using OpConversionPattern::OpConversionPattern; |
|   LogicalResult |
|   matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor, |
|                   ConversionPatternRewriter &rewriter) const override { |
|     // The type converter maps the tensor result type to a memref type. |
|     auto resultType = getTypeConverter()->convertType(op.getType()); |
|     // The adaptor holds the already-converted (memref) operands. |
|     rewriter.replaceOpWithNewOp<memref::CastOp>(op, resultType, |
|                                                 adaptor.source()); |
|     return success(); |
|   } |
| }; |
| ``` |
| |
| See [the talk](#the-talk) for more details on how to write these patterns. |
| |
| The |
| [pass itself](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L57) |
| is very small, and follows the basic pattern of any dialect conversion pass. |
| |
| ```c++ |
| void mlir::populateTensorBufferizePatterns( |
|     BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) { |
|   patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter, |
|                                                     patterns.getContext()); |
| } |
| |
| struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> { |
|   void runOnOperation() override { |
|     auto *context = &getContext(); |
|     BufferizeTypeConverter typeConverter; |
|     RewritePatternSet patterns(context); |
|     ConversionTarget target(*context); |
| |
|     populateTensorBufferizePatterns(typeConverter, patterns); |
|     // The ops this pass is responsible for must be converted away. |
|     target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>(); |
|     target.addLegalDialect<func::FuncDialect>(); |
|     // The materialization ops inserted by the framework may remain. |
|     populateBufferizeMaterializationLegality(target); |
| |
|     if (failed( |
|             applyPartialConversion(getOperation(), target, std::move(patterns)))) |
|       signalPassFailure(); |
|   } |
| }; |
| ``` |
| |
| The pass has all the hallmarks of a dialect conversion pass that does type |
| conversions: a `TypeConverter`, a `RewritePatternSet`, a `ConversionTarget`, and |
| a call to `applyPartialConversion`. Note that the function |
| `populateTensorBufferizePatterns` is factored out, so that power users can use |
| the patterns independently, if necessary (such as to combine multiple sets of |
| conversion patterns into a single conversion call, for performance). |
| |
| One convenient utility provided by the MLIR bufferization infrastructure is the |
| `BufferizeTypeConverter`, which comes pre-loaded with the necessary conversions |
| and materializations between `tensor` and `memref`. |
| |
| In this case, the `bufferization.to_tensor` and `bufferization.to_memref` ops, |
| which are inserted automatically by the dialect conversion framework as |
| materializations, are marked as legal so that they may remain in the IR after |
| the partial conversion. The helper `populateBufferizeMaterializationLegality` |
| ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L53)) |
| sets up this legality in general. |
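| |
| The helper is essentially a one-liner; a sketch of what it amounts to: |
| |
| ```c++ |
| // Mark the materialization ops as always legal, so that a partial conversion |
| // may leave them behind for later bufferization passes to resolve. |
| void populateBufferizeMaterializationLegality(ConversionTarget &target) { |
|   target.addLegalOp<bufferization::ToTensorOp, bufferization::ToMemrefOp>(); |
| } |
| ``` |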
| |
| ### Other partial bufferization examples |
| |
| - `scf-bufferize` |
| ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/SCF/Transforms/Bufferize.cpp#L1), |
| [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/SCF/bufferize.mlir#L1)) |
| |
| - Bufferizes ops from the `scf` dialect. |
| - This is an example of how to bufferize ops that implement |
| `RegionBranchOpInterface` (that is, they use regions to represent |
| control flow). |
| - The bulk of the work is done by |
| `lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp` |
| ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp#L1)), |
| which is well-commented and covers how to correctly convert ops that |
| contain regions; a short usage sketch follows this list. |
| |
| - `func-bufferize` |
| ([code](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/lib/Dialect/Func/Transforms/FuncBufferize.cpp#L1), |
| [test](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/test/Dialect/Func/func-bufferize.mlir#L1)) |
| |
| - Bufferizes `func`, `call`, and `BranchOpInterface` ops. |
| - This is an example of how to bufferize ops that have multi-block |
| regions. |
| - This is an example of a pass that is not split along dialect |
| subdivisions. |
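| |
| As referenced in the `scf-bufferize` item above, a partial bufferization pass |
| can be built on top of the structural type conversion helpers. The following |
| sketch assumes the upstream `populateSCFStructuralTypeConversionsAndLegality` |
| entry point; exact signatures may differ across versions: |
| |
| ```c++ |
| void SCFBufferizePass::runOnOperation() { |
|   auto *context = &getContext(); |
|   BufferizeTypeConverter typeConverter; |
|   RewritePatternSet patterns(context); |
|   ConversionTarget target(*context); |
| |
|   // Materialization ops may remain after this partial conversion. |
|   populateBufferizeMaterializationLegality(target); |
|   // Registers patterns and dynamic legality rules for scf.if, scf.for, etc. |
|   scf::populateSCFStructuralTypeConversionsAndLegality(typeConverter, patterns, |
|                                                        target); |
|   if (failed(applyPartialConversion(getOperation(), target, |
|                                     std::move(patterns)))) |
|     signalPassFailure(); |
| } |
| ``` |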
| |
| ### How to write a finalizing bufferization pass |
| |
| The contract of a finalizing bufferization pass is that all tensors are gone |
| from the program. |
| |
| The easiest way to write a finalizing bufferize pass is to not write one at all! |
| MLIR provides a pass `finalizing-bufferize` which eliminates the |
| `bufferization.to_tensor` / `bufferization.to_memref` materialization ops |
| inserted by partial bufferization passes and emits an error if that is not |
| sufficient to remove all tensors from the program. |
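| |
| Structurally, such a pass is a full conversion whose only patterns remove the |
| materializations. A sketch modeled on the upstream implementation (details such |
| as the exact base class are elided): |
| |
| ```c++ |
| void FinalizingBufferizePass::runOnOperation() { |
|   auto *context = &getContext(); |
|   BufferizeTypeConverter typeConverter; |
|   RewritePatternSet patterns(context); |
|   ConversionTarget target(*context); |
| |
|   // The only patterns fold away bufferization.to_tensor / to_memref pairs. |
|   populateEliminateBufferizeMaterializationsPatterns(typeConverter, patterns); |
| |
|   // An op is legal iff none of its operand or result types is a tensor, so |
|   // applyFullConversion emits an error for any op that still uses tensors. |
|   target.markUnknownOpDynamicallyLegal( |
|       [&](Operation *op) { return typeConverter.isLegal(op); }); |
| |
|   if (failed(applyFullConversion(getOperation(), target, std::move(patterns)))) |
|     signalPassFailure(); |
| } |
| ``` |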
| |
| This pass is sufficient when partial bufferization passes have bufferized all |
| the ops in the program, leaving behind only the materializations. When possible, |
| it is recommended to structure your pass pipeline this way, as this has the |
| significant advantage that if an op does not get bufferized (due to a missing |
| pattern, a bug in the code, etc.), `finalizing-bufferize` will emit a nice, |
| clean error, and the IR seen by `finalizing-bufferize` will contain only the |
| unbufferized op in question. |
| |
| However, before the current bufferization infrastructure was put in place, |
| bufferization could only be done as a single finalizing bufferization mega-pass |
| that used the `populate*BufferizePatterns` functions from multiple dialects to |
| simultaneously bufferize everything at once. Thus, one might see code in |
| downstream projects structured this way. This structure is not recommended in |
| new code. A helper, `populateEliminateBufferizeMaterializationsPatterns` |
| ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L58)), |
| is available for such passes to provide the patterns that eliminate |
| `bufferization.to_tensor` and `bufferization.to_memref`. |
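| |
| Schematically, such a legacy mega-pass differs from the finalizing sketch above |
| only in that the per-dialect pattern sets are added to the same conversion (the |
| `populate*BufferizePatterns` calls below stand in for whichever dialects a |
| project uses): |
| |
| ```c++ |
| // Not recommended for new code: bufferize everything in one full conversion. |
| populateTensorBufferizePatterns(typeConverter, patterns); |
| // ... populate*BufferizePatterns(...) calls for other dialects ... |
| populateEliminateBufferizeMaterializationsPatterns(typeConverter, patterns); |
| target.markUnknownOpDynamicallyLegal( |
|     [&](Operation *op) { return typeConverter.isLegal(op); }); |
| if (failed(applyFullConversion(getOperation(), target, std::move(patterns)))) |
|   signalPassFailure(); |
| ``` |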
| |
| ### Changes since [the talk](#the-talk) |
| |
| - `func-bufferize` was changed to be a partial conversion pass, and there is a |
| new `finalizing-bufferize` which serves as a general finalizing |
| bufferization pass. |
| - Most partial bufferization passes have been reimplemented in terms of |
| `BufferizableOpInterface`. New users should use One-Shot Bufferize instead |
| of dialect conversion-based bufferization. |