|  | # Bufferization | 
|  |  | 
|  | [TOC] | 
|  |  | 
|  | ## Overview | 
|  |  | 
|  | Bufferization in MLIR is the process of converting ops with `tensor` semantics | 
|  | to ops with `memref` semantics. There are multiple MLIR passes that are related | 
|  | to bufferization. These passes typically run as one of the last steps in a | 
pass pipeline, right before lowering `memref` ops to LLVM. That is because
|  | many transformations are easier or only supported in tensor land; e.g., | 
|  | [tile/fuse/… on tensors first](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373), | 
|  | then bufferize the remaining IR. | 
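
As a rough illustration (a hedged sketch, not the exact output of any pass),
bufferization rewrites tensor IR such as the following into memref IR, where
`%t_m` denotes the buffer chosen for the tensor `%t`:

```mlir
// Tensor IR (before bufferization):
%e = tensor.extract %t[%idx] : tensor<8xf32>

// Corresponding memref IR (after bufferization):
%e = memref.load %t_m[%idx] : memref<8xf32>
```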
|  |  | 
|  |  | 
|  |  | 
|  | The most important bufferization pass is *One-Shot Bufferize*: This pass | 
|  | rewrites `tensor` IR to `memref` IR. There are additional helper passes that | 
|  | preprocess IR (e.g., so that IR can be bufferized more efficiently), perform | 
|  | buffer-level optimizations such as allocation hoisting, and | 
|  | [insert buffer deallocation ops](OwnershipBasedBufferDeallocation.md) so that | 
|  | the resulting `memref` IR has no memory leaks. | 
|  |  | 
|  | ## Deprecated Passes | 
|  |  | 
|  | The old dialect conversion-based bufferization passes have been deprecated and | 
|  | should not be used anymore. Most of those passes have already been removed from | 
MLIR. One-Shot Bufferize produces better bufferization results with fewer
|  | memory allocations and buffer copies. | 
|  |  | 
|  | The buffer deallocation pass has been deprecated in favor of the ownership-based | 
|  | buffer deallocation pipeline. The deprecated pass has some limitations that may | 
|  | cause memory leaks in the resulting IR. | 
|  |  | 
|  | ## What is One-Shot Bufferize? | 
|  |  | 
|  | One-Shot Bufferize is a tensor bufferization pass designed for IR in | 
|  | [destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf), | 
|  | and with aggressive in-place bufferization. | 
|  |  | 
|  | One-Shot Bufferize is: | 
|  |  | 
*   **Monolithic**: A single MLIR pass performs the entire bufferization.
|  |  | 
|  | *   **Extensible** via an op interface: All ops that implement | 
|  | `BufferizableOpInterface` can be bufferized. | 
|  |  | 
*   A **whole-function-at-a-time analysis**. In-place bufferization decisions
|  | are made by analyzing SSA use-def chains on tensors. Op interface | 
|  | implementations not only provide the rewrite logic from tensor ops to memref | 
|  | ops, but also helper methods for One-Shot Bufferize's analysis to query | 
|  | information about an op's bufferization/memory semantics. | 
|  |  | 
|  | *   **2-Phase**: Bufferization is internally broken down into 2 steps: First, | 
|  | analyze the entire IR and make bufferization decisions. Then, bufferize | 
|  | (rewrite) the IR. The analysis has access to exact SSA use-def information. | 
It incrementally builds alias and equivalence sets and does not rely on an
a posteriori alias analysis of preallocated memory.
|  |  | 
*   **Greedy**: Operations are analyzed one by one, and it is decided on the
spot whether a tensor OpOperand must be copied. Heuristics determine the
order of analysis.
|  |  | 
|  | *   **Modular**: The current One-Shot Analysis can be replaced with a different | 
analysis. The results of the analysis are queried by the bufferization via
`AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
of `AnalysisState` that implements a small number of virtual functions can
|  | serve as a custom analysis. It is even possible to run One-Shot Bufferize | 
|  | without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot | 
|  | Bufferize copies every buffer before writing to it. | 
|  |  | 
|  | Note that One-Shot Bufferize does not deallocate buffers. That is done by the | 
|  | [Ownership-based Buffer Deallocation passes](OwnershipBasedBufferDeallocation.md). | 
|  |  | 
|  | ## Goals of Bufferization | 
|  |  | 
|  | The high-level goal of every bufferization technique is to: | 
|  |  | 
|  | 1. Use as little memory as possible. | 
|  | 2. Copy as little memory as possible. | 
|  |  | 
|  | This implies reusing already allocated buffers when possible, turning | 
|  | bufferization into an algorithmically complex problem with similarities to | 
|  | register allocation. | 
|  |  | 
|  | Depending on the concrete use case, there may be additional bufferization | 
|  | requirements. If the contents of a buffer are expensive to compute, there could | 
be a tradeoff between *recomputation* and *compute once and copy*. Conversely,
on some architectures it may not even be possible to allocate new buffers at
runtime.
|  |  | 
|  | ## Destination-Passing Style | 
|  |  | 
|  | Bufferization is an algorithmically complex problem. Given an op with a tensor | 
|  | result, bufferization has to choose a memref buffer in which the result can be | 
|  | stored. It is always safe to allocate a brand new buffer, but such a | 
|  | bufferization strategy would be unacceptable for high-performance codegen. When | 
|  | choosing an already existing buffer, we must be careful not to accidentally | 
|  | overwrite data that is still needed later in the program. | 
|  |  | 
|  | To simplify this problem, One-Shot Bufferize was designed to take advantage of | 
*destination-passing style* (DPS). In MLIR, DPS ops should implement the
|  | [`DestinationStyleOpInterface`](https://github.com/llvm/llvm-project/blob/792d437b56adfb3416daf8105942d4899fb82763/mlir/include/mlir/Interfaces/DestinationStyleOpInterface.td). | 
DPS exists independently of bufferization and is tied to SSA
semantics: many ops are "updating" a part of their input SSA variables. For
|  | example the LLVM instruction | 
|  | [`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction) | 
|  | is inserting an element inside a vector. Since SSA values are immutable, the | 
|  | operation returns a copy of the input vector with the element inserted. | 
Another example in MLIR is `linalg.generic` on tensors, which always has an
extra `outs` operand for each result that provides the initial values to
update (for example, when the operation performs a reduction).
|  |  | 
`outs` operands are referred to as "destinations" in the following (the quotes
are important, as these operands are not modified in place but copied) and come
into play in the context of bufferization as possible "anchors" for the
bufferization algorithm. This allows the user to shape the input IR such that
carefully chosen "destination" SSA values yield a close-to-optimal
bufferization result.
|  |  | 
|  | For every tensor result, a DPS op has a corresponding tensor operand. If there | 
|  | aren't any other conflicting uses of this tensor, the bufferization can alias | 
|  | it with the op result and perform the operation "in-place" by reusing the buffer | 
|  | allocated for this "destination" input. | 
|  |  | 
|  | As an example, consider the following op: `%r = tensor.insert %f into | 
|  | %t[%idx] : tensor<5xf32>` | 
|  |  | 
|  |  | 
|  |  | 
|  | `%t` is the "destination" in this example. When choosing a buffer for the result | 
|  | `%r`, denoted as `buffer(%r)`, One-Shot Bufferize considers only two options: | 
|  |  | 
|  | 1.  `buffer(%r) = buffer(%t)`: store the result in the existing `buffer(%t)`. | 
Note that this is not always possible, e.g., if the old contents of
`buffer(%t)` are still needed. One-Shot Bufferize's main task is to detect
|  | such cases and fall back to the second option when necessary. | 
|  | 2.  `buffer(%r)` is a newly allocated buffer. | 
|  |  | 
|  | There may be other buffers in the same function that could potentially be used | 
|  | for `buffer(%r)`, but those are not considered by One-Shot Bufferize to keep the | 
|  | bufferization simple. One-Shot Bufferize could be extended to consider such | 
|  | buffers in the future to achieve a better quality of bufferization. | 
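
For illustration, here is a hedged sketch of what the two options above could
look like after bufferization of the `tensor.insert` example (identity layouts
assumed; `%t_m` denotes `buffer(%t)`):

```mlir
// Option 1: in-place. buffer(%r) = buffer(%t).
memref.store %f, %t_m[%idx] : memref<5xf32>

// Option 2: out-of-place. buffer(%r) is a new allocation, initialized
// with a copy of buffer(%t) before the store.
%alloc = memref.alloc() : memref<5xf32>
memref.copy %t_m, %alloc : memref<5xf32> to memref<5xf32>
memref.store %f, %alloc[%idx] : memref<5xf32>
```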
|  |  | 
Tensor ops that are not in destination-passing style always bufferize to a new
memory allocation. E.g.:
|  |  | 
|  | ```mlir | 
|  | %0 = tensor.generate %sz { | 
|  | ^bb0(%i : index): | 
|  | %cst = arith.constant 0.0 : f32 | 
|  | tensor.yield %cst : f32 | 
|  | } : tensor<?xf32> | 
|  | ``` | 
|  |  | 
|  | The result of `tensor.generate` does not have a "destination" operand, so | 
|  | bufferization allocates a new buffer. This could be avoided by instead using an | 
|  | op such as `linalg.generic`, which can express the same computation with a | 
|  | "destination" operand, as specified behind outputs (`outs`): | 
|  |  | 
|  | ```mlir | 
|  | #map = affine_map<(i) -> (i)> | 
|  | %0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} | 
|  | outs(%t : tensor<?xf32>) { | 
|  | ^bb0(%arg0 : f32): | 
|  | %cst = arith.constant 0.0 : f32 | 
|  | linalg.yield %cst : f32 | 
|  | } -> tensor<?xf32> | 
|  | ``` | 
|  |  | 
|  | At first glance, the above `linalg.generic` op may not seem very useful because | 
|  | the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an | 
|  | operand in the first place? As an example, this can be useful for overwriting a | 
|  | slice of a tensor: | 
|  |  | 
|  | ```mlir | 
|  | %t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32> | 
|  | %0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32> | 
|  | %1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1] | 
|  | : tensor<?xf32> into tensor<?xf32> | 
|  | ``` | 
|  |  | 
|  | The above example bufferizes to a `memref.subview`, followed by a | 
|  | "`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming | 
|  | that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes | 
|  | to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`). | 
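
A hedged sketch of the bufferized IR (`%s_m` denotes `buffer(%s)`; the body
mirrors the `linalg.generic` shown earlier):

```mlir
#map = affine_map<(i) -> (i)>
%t_m = memref.subview %s_m[%idx] [%sz] [1]
    : memref<?xf32> to memref<?xf32, strided<[1], offset: ?>>
linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
    outs(%t_m : memref<?xf32, strided<[1], offset: ?>>) {
^bb0(%arg0 : f32):
  %cst = arith.constant 0.0 : f32
  linalg.yield %cst : f32
}
// The tensor.insert_slice bufferizes to a no-op: the linalg.generic has
// already written the result into the right place of buffer(%s).
```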
|  |  | 
|  | RaW conflicts are detected with an analysis of SSA use-def chains (details | 
|  | later). One-Shot Bufferize works best if there is a single SSA use-def chain, | 
|  | where the result of a tensor op is the operand of the next tensor ops, e.g.: | 
|  |  | 
|  | ```mlir | 
|  | %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) | 
|  | %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) | 
|  | %2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>) | 
|  | ``` | 
|  |  | 
|  | Buffer copies are likely inserted if the SSA use-def chain splits at some point, | 
|  | e.g.: | 
|  |  | 
|  | ```mlir | 
|  | %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) | 
|  | %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) | 
|  |  | 
|  | // "yet_another_op" likely needs to read the data of %0, so "another_op" cannot | 
|  | // in-place write to buffer(%0). | 
|  | %2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) | 
|  | ``` | 
|  |  | 
|  | ## Tensor / MemRef Boundary | 
|  |  | 
|  | The bufferization dialect provides a few helper ops to connect tensor IR (that | 
|  | should be bufferized) with existing buffers (that may be allocated/provided by | 
|  | a different runtime/library/etc.). | 
|  |  | 
|  | `bufferization.to_memref %t` returns the future buffer of a tensor SSA value. | 
|  | `bufferization.to_tensor %m` returns a tensor SSA value for a given MemRef | 
|  | buffer. `bufferization.materialize_in_destination` indicates that a tensor value | 
|  | should materialize in a certain buffer. | 
|  |  | 
|  | Consider the following example, where a TOSA matmul result should materialize in | 
|  | an existing buffer `%C`: | 
|  |  | 
|  | ```mlir | 
|  | // Batched TOSA matrix multiplication. %A and %B are the | 
|  | // inputs, %C is the output. | 
|  | func.func @test_matmul(%A: memref<1x17x19xf32>, | 
|  | %B: memref<1x19x29xf32>, | 
|  | %C: memref<1x17x29xf32>) { | 
|  |  | 
|  | %A_tensor = bufferization.to_tensor %A restrict : memref<1x17x19xf32> | 
|  | %B_tensor = bufferization.to_tensor %B restrict : memref<1x19x29xf32> | 
|  |  | 
|  | %0 = tosa.matmul %A_tensor, %B_tensor | 
|  | : (tensor<1x17x19xf32>, tensor<1x19x29xf32>) -> | 
|  | tensor<1x17x29xf32> | 
|  |  | 
|  | bufferization.materialize_in_destination | 
|  | %0 in restrict writable %C | 
|  | : (tensor<1x17x29xf32>, memref<1x17x29xf32>) -> () | 
|  |  | 
|  | return | 
|  | } | 
|  | ``` | 
|  |  | 
|  | Note that all bufferization ops in this example have the `restrict` unit | 
|  | attribute set. This attribute is similar to the C restrict keyword and indicates | 
|  | that there is no other `to_tensor` or `materialize_in_destination` op with | 
|  | the same or an aliasing MemRef operand. Only such | 
|  | `to_tensor`/`materialize_in_destination` ops are supported. The `restrict` | 
|  | attribute gives strong aliasing guarantees to the bufferization analysis and | 
allows us to look only at the tensor IR in a program. (Ops that do not operate
on tensors are ignored by One-Shot Bufferize.)
|  |  | 
|  | Also note that `tosa.matmul` cannot be bufferized as is: there is no | 
|  | `BufferizableOpInterface` implementation for that op. However, the op can be | 
|  | lowered to a combination of `tensor.empty` and `linalg.matmul`, which can be | 
|  | bufferized. | 
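
A hedged sketch of such a lowering (shown with `linalg.batch_matmul`, since
`tosa.matmul` is batched; the exact IR produced by the TOSA-to-Linalg patterns
may differ):

```mlir
%empty = tensor.empty() : tensor<1x17x29xf32>
%zero = arith.constant 0.0 : f32
%filled = linalg.fill ins(%zero : f32)
    outs(%empty : tensor<1x17x29xf32>) -> tensor<1x17x29xf32>
%0 = linalg.batch_matmul
    ins(%A_tensor, %B_tensor : tensor<1x17x19xf32>, tensor<1x19x29xf32>)
    outs(%filled : tensor<1x17x29xf32>) -> tensor<1x17x29xf32>
```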
|  |  | 
|  | ## Using One-Shot Bufferize | 
|  |  | 
|  | MLIR provides a pass | 
|  | [`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize) | 
|  | that performs an analysis and bufferizes all ops with tensor semantics that | 
|  | implement `BufferizableOpInterface`. For modularity reasons, these op interface | 
|  | implementations are typically external models that live in a dialect's | 
|  | "Transforms" build unit. (External models are a mechanism for implementing an op | 
|  | interface in a different build unit.) It is the user's responsibility to ensure | 
|  | that all needed external models are registered before running One-Shot | 
|  | Bufferize. | 
|  |  | 
|  | By default, One-Shot Bufferize fails when it encounters an op with tensor | 
|  | semantics (i.e., tensor result or tensor operand) that is not bufferizable | 
|  | (i.e., does not implement `BufferizableOpInterface`). This can be avoided with | 
|  | `allow-unknown-ops`. In that case, One-Shot Bufferize inserts | 
|  | `to_memref`/`to_tensor` ops around the bufferization boundary. | 
|  |  | 
|  | One-Shot Bufferize can be configured to bufferize only ops from a set of | 
|  | dialects with `dialect-filter`. This can be useful for gradually migrating from | 
|  | dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize | 
|  | must run first in such a case, because dialect conversion-based bufferization | 
|  | generates `to_tensor` ops without the `restrict` unit attribute, which One-Shot | 
|  | Bufferize cannot analyze. | 
|  |  | 
|  | One-Shot Bufferize can also be called programmatically with | 
|  | [`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167). | 
|  | Alternatively, | 
|  | [`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78) | 
|  | skips the analysis and inserts a copy on every buffer write, just like the | 
|  | dialect conversion-based bufferization. | 
|  |  | 
|  | By default, function boundaries are not bufferized. This is because there are | 
|  | currently limitations around function graph bufferization: recursive | 
|  | calls are not supported. As long as there are no recursive calls, function | 
|  | boundary bufferization can be enabled with `bufferize-function-boundaries`. Each | 
|  | tensor function argument and tensor function result is then turned into a | 
|  | memref. The layout map of the memref type can be controlled with | 
|  | `function-boundary-type-conversion`. | 
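
A hedged sketch of a bufferized function boundary (with the default fully
dynamic layout maps):

```mlir
// Before bufferization:
func.func @foo(%t: tensor<?xf32>) -> tensor<?xf32> { ... }

// After bufferization with bufferize-function-boundaries:
func.func @foo(%m: memref<?xf32, strided<[?], offset: ?>>)
    -> memref<?xf32, strided<[?], offset: ?>> { ... }
```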
|  |  | 
|  | ## Memory Layouts | 
|  |  | 
|  | One-Shot Bufferize bufferizes ops from top to bottom. This works well when all | 
ops are bufferizable. However, when it encounters a non-bufferizable op (with
`allow-unknown-ops`), One-Shot Bufferize must insert `to_memref` ops at the
bufferization boundary and decide on a memref type. By default, One-Shot
Bufferize chooses the most dynamic memref type wrt. layout maps. E.g.:
|  |  | 
|  | ```mlir | 
|  | %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) | 
|  | %1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32> | 
|  | ``` | 
|  |  | 
When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
|  | dynamic offset and strides: | 
|  |  | 
|  | ```mlir | 
|  | %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) | 
|  | %0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>> | 
|  | %1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>> | 
|  | ``` | 
|  |  | 
|  | All users of `%0` have fully dynamic layout maps. This ensures that the | 
|  | bufferized IR composes well with future bufferizations of `unbufferizable_op` | 
|  | (maybe bufferized by another pass), regardless of the exact memref type of the | 
|  | future bufferization. If the op turns out to be bufferized to an op with a | 
|  | simpler memref type (e.g., identity layout map), we expect that canonicalization | 
|  | patterns would clean up unnecessarily dynamic layout maps. (Some of these | 
|  | canonicalization patterns may not be implemented yet.) | 
|  |  | 
|  | One-Shot Bufferize tries to infer the most precise memref type when bufferizing | 
an op. If the entire IR is bufferizable, there is no need to resort to
conservative, fully dynamic layout maps, and we also do not have to rely on
canonicalization patterns to clean up the bufferized IR.
|  |  | 
Note: There are some bufferizable ops for which a precise layout map cannot be
|  | inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>` | 
|  | must be bufferized to a `memref.cast` with a memref type that has a fully | 
|  | dynamic layout map. | 
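
In IR, this might look as follows (a hedged sketch; `%t_m` denotes the buffer
of `%t`):

```mlir
// Tensor IR:
%0 = tensor.cast %t : tensor<*xf32> to tensor<?x?xf32>

// Bufferized IR: the layout of the result cannot be inferred from the
// unranked operand, so a fully dynamic layout map must be used.
%0_m = memref.cast %t_m
    : memref<*xf32> to memref<?x?xf32, strided<[?, ?], offset: ?>>
```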
|  |  | 
|  | One-Shot Bufferize has an option `unknown-type-conversion` to control the | 
|  | generation of layout maps when no precise layout can be inferred: | 
|  |  | 
|  | *   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default | 
|  | behavior. This composes well when IR is partially bufferized. | 
*   `identity-layout-map` uses static identity layout maps. This option can be
useful for legacy code that cannot handle memref types with layout maps.
Note that this setting can lead to additional buffer copies when folding a
`to_tensor`/`to_memref` pair with memref types that are not cast-compatible,
as shown in the sketch below.
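
A hedged sketch of such a case: the pair below cannot fold to a plain
`memref.cast`, because a strided layout cannot be cast to the identity layout,
so a new buffer may be allocated and copied instead:

```mlir
%t = bufferization.to_tensor %m : memref<?xf32, strided<[?], offset: ?>>
%m2 = bufferization.to_memref %t : memref<?xf32>

// Possible lowering of the non-cast-compatible pair:
%c0 = arith.constant 0 : index
%dim = memref.dim %m, %c0 : memref<?xf32, strided<[?], offset: ?>>
%alloc = memref.alloc(%dim) : memref<?xf32>
memref.copy %m, %alloc : memref<?xf32, strided<[?], offset: ?>> to memref<?xf32>
```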
|  |  | 
|  | Note: The `unknown-type-conversion` option does not affect layout maps of | 
function signatures. There is a separate `function-boundary-type-conversion`
option that controls layout maps of function parameters and function results.
|  |  | 
|  | ## Extending One-Shot Bufferize | 
|  |  | 
|  | Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users | 
|  | must at least implement the following interface methods. | 
|  |  | 
|  | *   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor | 
|  | OpOperand is read. | 
|  | *   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor | 
|  | OpOperand is written (if bufferizing in-place). | 
*   `getAliasingOpResult`: Return the OpResults that may share the same buffer
as the given OpOperand. This interface method describes the
OpOperand-to-OpResult mapping wrt. destination-passing style.
|  | *   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult | 
|  | is the exact same memref as the aliasing OpOperand after bufferization (in | 
case of in-place bufferization). Otherwise (e.g., if they overlap but are not
necessarily the exact same memrefs), `BufferRelation::Unknown` should be
|  | returned. Additional buffer relations will be added in the future, but | 
|  | `BufferRelation::Unknown` is always safe. | 
|  | *   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced | 
|  | with `bufferization::replaceOpWithBufferizedValues`. | 
|  |  | 
|  | To get a better intuition of the interface methods, we invite users to take a | 
|  | look at existing implementations in MLIR, e.g., the implementation of | 
|  | `tensor.insert` or `tensor.extract`. | 
|  |  | 
|  | Interface implementations of DPS ops (that implement | 
|  | `DestinationStyleOpInterface`) can derive from | 
|  | `DstBufferizableOpInterfaceExternalModel`, which provides all necessary | 
|  | method implementations except for `bufferize`. | 
|  |  | 
|  | ## Debugging Buffer Copies | 
|  |  | 
|  | To get a better understanding of why One-Shot Bufferize introduced a buffer | 
|  | copy, users can run the pass with `test-analysis-only print-conflicts`. Every | 
|  | tensor op is then annotated with an attribute that has a boolean value for each | 
|  | tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false` | 
|  | means that the OpOperand bufferizes out-of-place and a buffer copy will be | 
|  | inserted. | 
|  |  | 
|  | There are two reasons why a buffer copy may be inserted. | 
|  |  | 
|  | 1.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the | 
|  | overwritten data is still needed. | 
2.  The buffer is not writable. E.g., `memref.global` buffers that are the
result of `arith.constant` ops are never modified (see the sketch after this
list).
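
For the second case, a hedged sketch: a constant tensor bufferizes to a
read-only `memref.global`, so an in-place write is not possible and the data is
first copied into a fresh allocation:

```mlir
// Tensor IR:
%cst = arith.constant dense<0.0> : tensor<3xf32>
%0 = tensor.insert %f into %cst[%idx] : tensor<3xf32>

// Bufferized IR (sketch):
memref.global "private" constant @__constant_3xf32 : memref<3xf32> = dense<0.0>

%cst_m = memref.get_global @__constant_3xf32 : memref<3xf32>
%alloc = memref.alloc() : memref<3xf32>
memref.copy %cst_m, %alloc : memref<3xf32> to memref<3xf32>
memref.store %f, %alloc[%idx] : memref<3xf32>
```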
|  |  | 
|  | In the first case, `print-conflicts` illustrates the conflict in the form of a | 
|  | ("read", "conflicting write", "last write") tuple. | 
|  |  | 
|  | A RaW conflict consists of three parts, in the following order according to | 
|  | op dominance: | 
|  |  | 
|  | 1. **Definition:** A tensor `%t` is defined. | 
|  | 2. **Conflicting Write:** An operation writes to `buffer(%t)`. | 
|  | 3. **Read:** An operation reads `%t`. | 
|  |  | 
|  | When such a RaW conflict is detected during the analysis phase, One-Shot | 
|  | Bufferize will insert a buffer copy for the conflicting write. | 
|  |  | 
|  | **Example** | 
|  |  | 
|  | ```mlir | 
|  | // RUN: mlir-opt %s -one-shot-bufferize="bufferize-function-boundaries test-analysis-only print-conflicts" | 
|  | func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) { | 
|  | // Create a new tensor with [%arg0, %arg0, %arg0]. | 
|  | %0 = tensor.from_elements %arg0, %arg0, %arg0 : tensor<3xf32> | 
|  |  | 
|  | // Insert something into the new tensor. | 
|  | %1 = tensor.insert %arg1 into %0[%arg2] : tensor<3xf32> | 
|  |  | 
|  | // Read from the old tensor. | 
|  | %r = tensor.extract %0[%arg3] : tensor<3xf32> | 
|  |  | 
|  | // Return the extracted value and the result of the insertion. | 
|  | func.return %r, %1 : f32, tensor<3xf32> | 
|  | } | 
|  | ``` | 
|  |  | 
|  | The output IR is as follows: | 
|  |  | 
|  | ```mlir | 
|  | func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) { | 
|  | %from_elements = tensor.from_elements %arg0, %arg0, %arg0 {"C_0[DEF: result 0]"} : tensor<3xf32> | 
|  | %inserted = tensor.insert %arg1 into %from_elements[%arg2] {"C_0[CONFL-WRITE: 1]", __inplace_operands_attr__ = ["none", "false", "none"]} : tensor<3xf32> | 
|  | %extracted = tensor.extract %from_elements[%arg3] {"C_0[READ: 0]", __inplace_operands_attr__ = ["true", "none"]} : tensor<3xf32> | 
|  | return {__inplace_operands_attr__ = ["none", "true"]} %extracted, %inserted : f32, tensor<3xf32> | 
|  | } | 
|  | ``` | 
|  |  | 
|  | Note that the IR was not bufferized. It was merely annotated with the results | 
|  | of the bufferization analysis. Every operation with tensor semantics has a | 
|  | `__inplace_operands_attr__` attribute with one value per operand. If an operand | 
|  | is not a tensor, the respective value is `none`. Otherwise, if the operand was | 
|  | decided to be bufferized in-place, the value is `true`. A value of `false` | 
|  | indicates a buffer copy. In the above example, a buffer copy would be inserted | 
|  | for `tensor.insert`, so that it does not overwrite `buffer(%from_elements)`, | 
|  | which is still needed for `tensor.extract`. | 
|  |  | 
For each RaW conflict (there is only one in this example), three `C_i`
attributes were added:
|  |  | 
|  | * `C_0[DEF: result 0]`: A tensor is defined: 0-th result of | 
|  | `tensor.from_elements`. | 
|  | * `C_0[CONFL-WRITE: 1]`: An operation (if bufferized in-place) would write into | 
|  | the future buffer of the defined tensor: 1-st operand of `tensor.insert`. | 
|  | * `C_0[READ: 0]`: An operation reads the tensor definition: 0-th operand of | 
|  | `tensor.extract`. | 
|  |  | 
|  | The fully bufferized IR (with the inserted buffer copy) is as follows: | 
|  |  | 
|  | ```mlir | 
|  | func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, memref<3xf32>) { | 
|  | %c2 = arith.constant 2 : index | 
|  | %c1 = arith.constant 1 : index | 
|  | %c0 = arith.constant 0 : index | 
|  | %alloc = memref.alloc() {alignment = 64 : i64} : memref<3xf32> | 
|  | memref.store %arg0, %alloc[%c0] : memref<3xf32> | 
|  | memref.store %arg0, %alloc[%c1] : memref<3xf32> | 
|  | memref.store %arg0, %alloc[%c2] : memref<3xf32> | 
|  | %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<3xf32> | 
|  | memref.copy %alloc, %alloc_0 : memref<3xf32> to memref<3xf32> | 
|  | memref.store %arg1, %alloc_0[%arg2] : memref<3xf32> | 
|  | %0 = memref.load %alloc[%arg3] : memref<3xf32> | 
|  | return %0, %alloc_0 : f32, memref<3xf32> | 
|  | } | 
|  | ``` | 
|  |  | 
|  | To get a better understanding of the SSA Use-Def Chain Analysis and the RaW | 
|  | conflict detection algorithm, interested users may want to refer to: | 
|  |  | 
|  | * [Original design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf) | 
* [ODM talk](https://youtu.be/TXEo59CYS9A) ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)).
|  | * [LLVM Dev Meeting 2023 tutorial slides](https://m-sp.org/downloads/llvm_dev_2023.pdf) | 
|  |  | 
|  | ## Migrating from Dialect Conversion-based Bufferization | 
|  |  | 
|  | Both dialect conversion-based bufferization and One-Shot Bufferize generate | 
|  | `to_tensor`/`to_memref` ops at the bufferization boundary (when run with | 
|  | `allow-unknown-ops`). They can be combined and run in sequence. However, | 
|  | One-Shot Bufferize must run first because it cannot analyze those boundary ops. | 
|  | To update existing code step-by-step, it may be useful to specify a dialect | 
|  | filter for One-Shot Bufferize, so that dialects can be switched over one-by-one. | 
|  |  | 
|  | ## Dialect Conversion-based Bufferization | 
|  |  | 
|  | Disclaimer: Most dialect conversion-based bufferization has been migrated to | 
|  | One-Shot Bufferize. New users should use One-Shot Bufferize (with or without | 
|  | analysis). The following documentation is only for existing users of dialect | 
|  | conversion-based bufferization. | 
|  |  | 
|  | This system is a simple application of MLIR's dialect conversion infrastructure. | 
|  | The bulk of the code related to bufferization is a set of ordinary | 
|  | `ConversionPattern`'s that dialect authors write for converting ops that operate | 
|  | on `tensor`'s to ops that operate on `memref`'s. A set of conventions and best | 
|  | practices are followed that allow these patterns to be run across multiple | 
|  | independent passes (rather than requiring a single huge atomic conversion pass), | 
|  | which makes the compilation pipelines scalable, robust, and easy to debug. | 
|  |  | 
|  | This document is targeted at people looking to utilize MLIR's bufferization | 
|  | functionality, along with people who want to extend it to cover their own ops. | 
|  |  | 
|  | <a name="the-talk">**NOTE:**</a> Before reading this document, please watch the | 
|  | talk "Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization | 
|  | Infrastructure" | 
|  | ([slides](https://drive.google.com/file/d/1FVbzCXxZzS9LBLuvpPNLWJD-XDkt54ky/view?usp=sharing), | 
|  | [recording](https://drive.google.com/file/d/1VfVajitgf8ZPnd-HRkJvaJiFLhBsluXN/view?usp=sharing)). | 
|  | That talk gives a high-level overview of the bufferization infrastructure and | 
|  | important conceptual details related to using the MLIR dialect conversion | 
|  | infrastructure. | 
|  |  | 
|  | ### Bufferization's place in a compilation pipeline | 
|  |  | 
|  | Bufferization itself does not free any of the buffers that have been allocated, | 
|  | nor does it do anything particularly intelligent with the placement of buffers | 
|  | w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist | 
|  | of: | 
|  |  | 
|  | 1.  Bufferization | 
|  | 1.  Buffer optimizations such as `buffer-hoisting`, `buffer-loop-hoisting`, and | 
|  | `promote-buffers-to-stack`, which do optimizations that are only exposed | 
|  | after bufferization. | 
|  | 1.  Finally, running the [ownership-based buffer deallocation](OwnershipBasedBufferDeallocation.md) | 
|  | pass. | 
|  |  | 
|  | After buffer deallocation has been completed, the program will be quite | 
|  | difficult to transform due to the presence of the deallocation ops. Thus, other | 
|  | optimizations such as linalg fusion on memrefs should be done before that stage. | 
|  |  | 
|  | ### General structure of the bufferization process | 
|  |  | 
|  | Bufferization consists of running multiple *partial* bufferization passes, | 
|  | followed by one *finalizing* bufferization pass. | 
|  |  | 
|  | There is typically one partial bufferization pass per dialect (though other | 
|  | subdivisions are possible). For example, for a dialect `X` there will typically | 
|  | be a pass `X-bufferize` that knows how to bufferize all the ops in that dialect. | 
|  | By running pass `X-bufferize` for each dialect `X` in the program, all the ops | 
|  | in the program are incrementally bufferized. | 
|  |  | 
|  | Partial bufferization passes create programs where only some ops have been | 
|  | bufferized. These passes will create *materializations* (also sometimes called | 
|  | "casts") that convert between the `tensor` and `memref` type, which allows | 
|  | bridging between ops that have been bufferized and ops that have not yet been | 
|  | bufferized. | 
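
A hedged sketch: after a partial bufferization pass that handled only
`tensor.extract`, a materialization bridges the still-unbufferized producer and
the already-bufferized consumer (op names hypothetical):

```mlir
// %t is produced by a not-yet-bufferized op and is still a tensor.
%t = "some_dialect.unbufferized_op"() : () -> tensor<?xf32>

// Materialization ("cast") inserted by the partial bufferization pass:
%m = bufferization.to_memref %t : memref<?xf32>

// Already-bufferized consumer:
%e = memref.load %m[%idx] : memref<?xf32>
```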
|  |  | 
|  | Finalizing bufferizations complete the bufferization process, and guarantee that | 
|  | there are no tensors remaining in the program. This involves eliminating the | 
materializations. The `finalizing-bufferize` pass is a minimal pass that
|  | only eliminates materializations and issues an error if any unbufferized ops | 
|  | exist in the program. | 
|  |  | 
|  | However, it is possible for a finalizing bufferization to do more than just | 
|  | eliminate materializations. By adding patterns (just as a partial bufferization | 
|  | would), it is possible for a finalizing bufferization pass to simultaneously | 
|  | bufferize ops and eliminate materializations. This has a number of disadvantages | 
|  | discussed in the talk and should generally be avoided. | 
|  |  | 
|  | ### Example | 
|  |  | 
|  | As a concrete example, we will look at the bufferization pipeline from the | 
|  | `mlir-npcomp` reference backend | 
|  | ([code](https://github.com/llvm/mlir-npcomp/blob/97d6d04d41216e73d40b89ffd79620973fc14ce3/lib/RefBackend/RefBackend.cpp#L232)). | 
|  | The code, slightly simplified and annotated, is reproduced here: | 
|  |  | 
|  | ```c++ | 
|  | // Partial bufferization passes. | 
|  | pm.addPass(createTensorConstantBufferizePass()); | 
|  | pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect. | 
|  | pm.addNestedPass<func::FuncOp>(createSCFBufferizePass()); | 
|  | pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass()); | 
|  | pm.addNestedPass<func::FuncOp>(createTensorBufferizePass()); | 
|  | pm.addPass(createFuncBufferizePass()); | 
|  |  | 
|  | // Finalizing bufferization pass. | 
|  | pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass()); | 
|  | ``` | 
|  |  | 
Looking first at the partial bufferization passes, we see a sequence of
`FuncOp` passes (which run in parallel on functions). These function passes are
bracketed by `tensor-constant-bufferize` and `func-bufferize`, which are module
|  | passes (and thus serialize the parallel compilation process). These two passes | 
|  | must be module passes because they make changes to the top-level module. | 
|  |  | 
|  | The bulk of the bufferization work is done by the function passes. Most of these | 
|  | passes are provided as part of the upstream MLIR distribution and bufferize | 
|  | their respective dialects (e.g. `scf-bufferize` bufferizes the `scf` dialect). | 
|  | The `tcp-bufferize` pass is an exception -- it is a partial bufferization pass | 
|  | used to bufferize the downstream `tcp` dialect, and fits in perfectly with all | 
|  | the other passes provided upstream. | 
|  |  | 
|  | The last pass is the finalizing bufferization pass. The `mlir-npcomp` reference | 
|  | backend has arranged that all ops are bufferized by partial bufferizations, so | 
|  | that the upstream `finalizing-bufferize` pass can be used as the finalizing | 
|  | bufferization pass. This gives excellent diagnostics when something goes wrong | 
|  | with the bufferization process, such as due to an op that wasn't handled by any | 
|  | pattern. | 
|  |  | 
|  | ### How to write a partial bufferization pass | 
|  |  | 
|  | The contract of a partial bufferization pass is that a subset of ops (or kinds | 
|  | of ops, customizable by a ConversionTarget) get bufferized. | 
|  |  | 
|  | A partial bufferization pass is just a pass that uses the | 
|  | [dialect conversion](DialectConversion.md) framework to apply | 
|  | `ConversionPattern`s with a `tensor` to `memref` type conversion. | 
|  |  | 
|  | To describe how to write such a pass, we will walk through an example, the | 
|  | `tensor-bufferize` pass | 
|  | ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23), | 
|  | [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Tensor/bufferize.mlir#L1)) | 
that bufferizes the `tensor` dialect. Note that these passes have since been
replaced with a `BufferizableOpInterface`-based implementation, so we have to
take a look at an older version of the code.
|  |  | 
|  | The bulk of the code in the pass will be a set of conversion patterns, with a | 
|  | simple example being | 
[BufferizeCastOp](https://github.com/llvm/llvm-project/blob/2bf6e443e54604c7818c4d1a1837f3d091023270/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23).
|  |  | 
|  | ``` | 
|  | class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> { | 
|  | public: | 
|  | using OpConversionPattern::OpConversionPattern; | 
|  | LogicalResult | 
|  | matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor, | 
|  | ConversionPatternRewriter &rewriter) const override { | 
|  | auto resultType = getTypeConverter()->convertType(op.getType()); | 
|  | rewriter.replaceOpWithNewOp<MemRefCastOp>(op, resultType, adaptor.source()); | 
|  | return success(); | 
|  | } | 
|  | }; | 
|  | ``` | 
|  |  | 
|  | See [the talk](#the-talk) for more details on how to write these patterns. | 
|  |  | 
|  | The | 
|  | [pass itself](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L57) | 
|  | is very small, and follows the basic pattern of any dialect conversion pass. | 
|  |  | 
|  | ``` | 
|  | void mlir::populateTensorBufferizePatterns( | 
|  | BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) { | 
|  | patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter, | 
|  | patterns.getContext()); | 
|  | } | 
|  |  | 
|  | struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> { | 
|  | void runOnOperation() override { | 
|  | auto *context = &getContext(); | 
|  | BufferizeTypeConverter typeConverter; | 
|  | RewritePatternSet patterns(context); | 
|  | ConversionTarget target(*context); | 
|  |  | 
|  | populateTensorBufferizePatterns(typeConverter, patterns); | 
|  | target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>(); | 
|  | target.addLegalDialect<func::FuncDialect>(); | 
|  |  | 
|  | if (failed( | 
|  | applyPartialConversion(getOperation(), target, std::move(patterns)))) | 
|  | signalPassFailure(); | 
|  | } | 
|  | }; | 
|  | ``` | 
|  |  | 
The pass has all the hallmarks of a dialect conversion pass that does type
conversions: a `TypeConverter`, a `RewritePatternSet`, a `ConversionTarget`,
and a call to `applyPartialConversion`. Note that the function
`populateTensorBufferizePatterns` is factored out, so that power users can use
the patterns independently, if necessary (such as to combine multiple sets of
conversion patterns into a single conversion call, for performance).
|  |  | 
|  | One convenient utility provided by the MLIR bufferization infrastructure is the | 
|  | `BufferizeTypeConverter`, which comes pre-loaded with the necessary conversions | 
|  | and materializations between `tensor` and `memref`. | 
|  |  | 
In this case, the `bufferization` dialect is marked as legal, so the
|  | `bufferization.to_tensor` and `bufferization.to_memref` ops, which are inserted | 
|  | automatically by the dialect conversion framework as materializations, are | 
|  | legal. There is a helper `populateBufferizeMaterializationLegality` | 
|  | ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L53)) | 
|  | which helps with this in general. | 
|  |  | 
|  | ### Other partial bufferization examples | 
|  |  | 
|  | -   `scf-bufferize` | 
|  | ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/SCF/Transforms/Bufferize.cpp#L1), | 
|  | [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/SCF/bufferize.mlir#L1)) | 
|  |  | 
|  | -   Bufferizes ops from the `scf` dialect. | 
|  | -   This is an example of how to bufferize ops that implement | 
|  | `RegionBranchOpInterface` (that is, they use regions to represent | 
|  | control flow). | 
|  | -   The bulk of the work is done by | 
|  | `lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp` | 
|  | ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp#L1)), | 
|  | which is well-commented and covers how to correctly convert ops that | 
|  | contain regions. | 
|  |  | 
|  | -   `func-bufferize` | 
|  | ([code](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/lib/Dialect/Func/Transforms/FuncBufferize.cpp#L1), | 
|  | [test](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/test/Dialect/Func/func-bufferize.mlir#L1)) | 
|  |  | 
|  | -   Bufferizes `func`, `call`, and `BranchOpInterface` ops. | 
|  | -   This is an example of how to bufferize ops that have multi-block | 
|  | regions. | 
|  | -   This is an example of a pass that is not split along dialect | 
|  | subdivisions. | 
|  |  | 
|  | ### How to write a finalizing bufferization pass | 
|  |  | 
|  | The contract of a finalizing bufferization pass is that all tensors are gone | 
|  | from the program. | 
|  |  | 
|  | The easiest way to write a finalizing bufferize pass is to not write one at all! | 
|  | MLIR provides a pass `finalizing-bufferize` which eliminates the | 
|  | `bufferization.to_tensor` / `bufferization.to_memref` materialization ops | 
|  | inserted by partial bufferization passes and emits an error if that is not | 
|  | sufficient to remove all tensors from the program. | 
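
A hedged sketch of the kind of materialization round-trip that
`finalizing-bufferize` folds away:

```mlir
// Before finalizing-bufferize: a to_tensor/to_memref pair left over from
// partial bufferization passes.
%t = bufferization.to_tensor %m : memref<?xf32>
%m2 = bufferization.to_memref %t : memref<?xf32>
%e = memref.load %m2[%idx] : memref<?xf32>

// After: the pair is eliminated and %m is used directly.
%e = memref.load %m[%idx] : memref<?xf32>
```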
|  |  | 
|  | This pass is sufficient when partial bufferization passes have bufferized all | 
|  | the ops in the program, leaving behind only the materializations. When possible, | 
|  | it is recommended to structure your pass pipeline this way, as this has the | 
|  | significant advantage that if an op does not get bufferized (due to a missing | 
|  | pattern, bug in the code, etc.), `finalizing-bufferize` will emit a nice clean | 
error, and the IR seen by `finalizing-bufferize` will contain only the
unbufferized op.
|  |  | 
|  | However, before the current bufferization infrastructure was put in place, | 
|  | bufferization could only be done as a single finalizing bufferization mega-pass | 
|  | that used the `populate*BufferizePatterns` functions from multiple dialects to | 
|  | simultaneously bufferize everything at once. Thus, one might see code in | 
|  | downstream projects structured this way. This structure is not recommended in | 
|  | new code. A helper, `populateEliminateBufferizeMaterializationsPatterns` | 
|  | ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L58)) | 
|  | is available for such passes to provide patterns that eliminate | 
|  | `bufferization.to_tensor` and `bufferization.to_memref`. | 
|  |  | 
|  | ### Changes since [the talk](#the-talk) | 
|  |  | 
|  | -   `func-bufferize` was changed to be a partial conversion pass, and there is a | 
|  | new `finalizing-bufferize` which serves as a general finalizing | 
|  | bufferization pass. | 
|  | -   Most partial bufferization passes have been reimplemented in terms of | 
|  | `BufferizableOpInterface`. New users should use One-Shot Bufferize instead | 
|  | of dialect conversion-based bufferization. |