| # Bufferization |
| |
| [TOC] |
| |
| ## Overview |
| |
| Bufferization in MLIR is the process of converting ops with `tensor` semantics |
| to ops with `memref` semantics. MLIR provides an infrastructure that bufferizes |
| an entire program in a single pass (*One-Shot Bufferize*). This infrastructure |
bufferizes all ops that implement the
[`BufferizableOpInterface`](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td).
| |
| MLIR has an older bufferization infrastructure built around |
| [dialect conversion](DialectConversion.md). Most dialect conversion |
| bufferization patterns have been migrated to One-Shot Bufferize, but some |
| functionality such as function boundary bufferization still depends on dialect |
| conversion and its type converter. New projects should use One-Shot Bufferize, |
| as the dialect conversion-based bufferization will eventually be deprecated. |
| Moreover, One-Shot Bufferize results in better bufferization with fewer memory |
| allocations and buffer copies. This documentation is mostly about One-Shot |
| Bufferize, but also describes how to gradually migrate a project from dialect |
| conversion-based bufferization to One-Shot Bufferize. |
| |
| ## What is One-Shot Bufferize? |
| |
| One-Shot Bufferize is a new tensor bufferization pass designed for IR in |
| [destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf), |
| and with aggressive in-place bufferization. |
| |
| One-Shot Bufferize is: |
| |
| * **Monolithic**: A single MLIR pass does the entire work, whereas the |
| previous bufferization in MLIR was split across multiple passes residing in |
| different dialects. In One-Shot Bufferize, `BufferizableOpInterface` |
| implementations are spread across different dialects. |
| |
| * A **whole-function at a time analysis**. In-place bufferization decisions |
| are made by analyzing SSA use-def chains on tensors. Op interface |
| implementations not only provide the rewrite logic from tensor ops to memref |
| ops, but also helper methods for One-Shot Bufferize's analysis to query |
| information about an op's bufferization/memory semantics. |
| |
| * **Extensible** via an op interface: All ops that implement |
| `BufferizableOpInterface` can be bufferized. |
| |
| * **2-Pass**: Bufferization is internally broken down into 2 steps: First, |
| analyze the entire IR and make bufferization decisions. Then, bufferize |
| (rewrite) the IR. The analysis has access to exact SSA use-def information. |
It incrementally builds alias and equivalence sets and does not rely on an
a-posteriori alias analysis of preallocated memory.
| |
| * **Greedy**: Operations are analyzed one-by-one and it is decided on the spot |
| whether a tensor OpOperand must be copied or not. Heuristics determine the |
| order of analysis. |
| |
* **Modular**: The current One-Shot Analysis can be replaced with a different
analysis. The results of the analysis are queried by the bufferization via
`AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
of `AnalysisState` that implements a small number of virtual functions can
serve as a custom analysis. It is even possible to run One-Shot Bufferize
without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
Bufferize behaves exactly like the old dialect conversion-based
bufferization (i.e., copy every buffer before writing to it).
| |
| To reduce complexity, One-Shot Bufferize should be |
| [run after other transformations](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373), |
| typically as one of the last steps right before lowering memref ops. Many |
| transformations are easier in tensor land; e.g., tile/fuse/… on tensors first, |
| then bufferize the remaining IR. |
| |
| From an architecture perspective, One-Shot Bufferize consists of |
| [BufferizableOpInterface](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td) |
| (and its implementations) and an |
| [analysis](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L164) |
| of tensor SSA values that decides if a buffer can be used directly or must be |
copied. The `bufferize` method of the op interface inspects analysis results
and rewrites tensor ops into memref ops.
| |
| ## Goals of Bufferization |
| |
The high-level goal of every bufferization technique is to:

1. Use as little memory as possible.
2. Copy as little memory as possible.
| |
| This implies reusing already allocated buffers when possible, turning |
| bufferization into an algorithmically complex problem with similarities to |
| register allocation. |
| |
Depending on the concrete use case, there may be additional bufferization
requirements. If the contents of a buffer are expensive to compute, there could
be a tradeoff between *recomputation* and *compute once and copy*. Conversely,
it may not even be possible to allocate new buffers at runtime on some
architectures.
| |
| ## Destination-Passing Style |
| |
| Bufferization is an algorithmically complex problem. Given an op with a tensor |
| result, bufferization has to choose a memref buffer in which the result can be |
| stored. It is always safe to allocate a brand new buffer, but such a |
| bufferization strategy would be unacceptable for high-performance codegen. When |
| choosing an already existing buffer, we must be careful not to accidentally |
| overwrite data that is still needed later in the program. |
| |
To simplify this problem, One-Shot Bufferize was designed to take advantage of
*destination-passing style*. This form exists independently of bufferization
and is tied to SSA semantics: many ops are "updating" part of their input SSA
variables. For example, the LLVM instruction
[`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
inserts an element into a vector. Since SSA values are immutable, the
operation returns a copy of the input vector with the element inserted.
Another example in MLIR is `linalg.generic`, which always has an extra `outs`
operand that provides the initial values to update (for example, when the
operation performs a reduction).
| |
This input is referred to as the "destination" in the following (the quotes are
important, as this operand is not modified in place but copied). In the context
of bufferization, it serves as a possible "anchor" for the bufferization
algorithm: by carefully choosing the SSA value used as the "destination", the
user can shape the input IR so that a close-to-optimal bufferization result is
guaranteed.
| |
| For every tensor result, a "destination-passing" style op has a corresponding |
| tensor operand. If there aren't any other uses of this tensor, the bufferization |
| can alias it with the op result and perform the operation "in-place" by reusing |
| the buffer allocated for this "destination" input. |
| |
| As an example, consider the following op: `%0 = tensor.insert %cst into |
| %t[%idx] : tensor<?xf32>` |
| |
| `%t` is the "destination" in this example. When choosing a buffer for the result |
| `%0`, denoted as `buffer(%0)`, One-Shot Bufferize considers only two options: |
| |
| 1. `buffer(%0) = buffer(%t)` : alias the "destination" tensor with the |
| result and perform the operation in-place. |
| 2. `buffer(%0)` is a newly allocated buffer. |
| |
| There may be other buffers in the same function that could potentially be used |
| for `buffer(%0)`, but those are not considered by One-Shot Bufferize to keep the |
| bufferization simple. One-Shot Bufferize could be extended to consider such |
| buffers in the future to achieve a better quality of bufferization. |
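
For illustration, these two options could bufferize to IR along the following
lines (a sketch; `%t_buffer` is a hypothetical name for the buffer chosen for
`%t`):

```mlir
// Option 1: in-place. buffer(%0) = buffer(%t). The new element is simply
// stored into the existing buffer.
memref.store %cst, %t_buffer[%idx] : memref<?xf32>

// Option 2: out-of-place. buffer(%0) is a new allocation. The contents of
// buffer(%t) are copied first, then the new element is stored.
%c0 = arith.constant 0 : index
%sz = memref.dim %t_buffer, %c0 : memref<?xf32>
%alloc = memref.alloc(%sz) : memref<?xf32>
memref.copy %t_buffer, %alloc : memref<?xf32> to memref<?xf32>
memref.store %cst, %alloc[%idx] : memref<?xf32>
```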
| |
Tensor ops that are not in destination-passing style always bufferize to a new
memory allocation. E.g.:
| |
| ```mlir |
| %0 = tensor.generate %sz { |
| ^bb0(%i : index): |
| %cst = arith.constant 0.0 : f32 |
| tensor.yield %cst : f32 |
| } : tensor<?xf32> |
| ``` |
| |
The result of `tensor.generate` does not have a "destination" operand, so
bufferization allocates a new buffer. This could be avoided by choosing an op
such as `linalg.generic`, which can express the same computation with a
"destination" operand, specified via `outs`:
| |
| ```mlir |
| #map = affine_map<(i) -> (i)> |
| %0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} |
| outs(%t : tensor<?xf32>) { |
| ^bb0(%arg0 : f32): |
| %cst = arith.constant 0.0 : f32 |
| linalg.yield %cst : f32 |
| } -> tensor<?xf32> |
| ``` |
| |
| At first glance, the above `linalg.generic` op may not seem very useful because |
| the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an |
| operand in the first place? As an example, this can be useful for overwriting a |
| slice of a tensor: |
| |
| ```mlir |
| %t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32> |
| %0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32> |
| %1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1] |
| : tensor<?xf32> into tensor<?xf32> |
| ``` |
| |
| The above example bufferizes to a `memref.subview`, followed by a |
| "`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming |
| that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes |
| to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`). |
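
Sketched bufferized form of the above example (assuming the analysis decides on
in-place bufferization throughout; `%s_buffer` is a hypothetical name for the
buffer of `%s`):

```mlir
// tensor.extract_slice bufferizes to a subview of the original buffer.
%t_view = memref.subview %s_buffer[%idx] [%sz] [1]
    : memref<?xf32> to memref<?xf32, strided<[1], offset: ?>>
// The "linalg.generic on memrefs" overwrites the memory of the subview.
linalg.generic ... outs(%t_view : memref<?xf32, strided<[1], offset: ?>>) { ... }
// tensor.insert_slice becomes a no-op: the data is already in place.
```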
| |
RaW conflicts are detected with an analysis of SSA use-def chains (details
later). One-Shot Bufferize works best if there is a single SSA use-def chain,
where the result of a tensor op is the operand of the next tensor op, e.g.:
| |
| ```mlir |
| %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>) |
| ``` |
| |
| Buffer copies are likely inserted if the SSA use-def chain splits at some point, |
| e.g.: |
| |
| ```mlir |
| %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) |
| %2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>) |
| ``` |
| |
One-Shot Bufferize has debug flags (`test-analysis-only` and `print-conflicts`)
that print the results of the analysis and explain to the user why buffer
copies were inserted.
| |
| ## Using One-Shot Bufferize |
| |
| MLIR provides a pass |
| [`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize) |
| that performs an analysis and bufferizes all ops with tensor semantics that |
| implement `BufferizableOpInterface`. For modularity reasons, these op interface |
| implementations are typically external models that live in a dialect's |
| "Transforms" build unit. (External models are a mechanism for implementing an op |
| interface in a different build unit.) It is the user's responsibility to ensure |
| that all needed external models are registered before running One-Shot |
| Bufferize. |
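
As an illustration, such a registration could look as follows. This is a
sketch: the exact headers and registration functions depend on the dialects
involved and may change between MLIR versions.

```c++
#include "mlir/Dialect/Linalg/Transforms/BufferizableOpInterfaceImpl.h"
#include "mlir/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.h"
#include "mlir/IR/DialectRegistry.h"

using namespace mlir;

// Register the external BufferizableOpInterface models of all dialects that
// may appear in the input IR before running One-Shot Bufferize.
void registerBufferizationExternalModels(DialectRegistry &registry) {
  linalg::registerBufferizableOpInterfaceExternalModels(registry);
  tensor::registerBufferizableOpInterfaceExternalModels(registry);
}
```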
| |
By default, One-Shot Bufferize fails when it encounters an op with tensor
semantics (i.e., tensor result or tensor operand) that is not bufferizable
(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
`to_memref`/`to_tensor` ops around the bufferization boundary. These ops are
named versions of `unrealized_conversion_cast`. Note that One-Shot Bufferize's
analysis currently cannot analyze these ops, so input IR with such ops may fail
bufferization. For the same reason, running One-Shot Bufferize multiple times
in a sequence is not supported at the moment.
| |
| One-Shot Bufferize can be configured to bufferize only ops from a set of |
| dialects with `dialect-filter`. This can be useful for gradually migrating from |
| dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize |
| must run first in such a case, because dialect conversion-based bufferization |
| generates `to_tensor`/`to_memref` ops which One-Shot Bufferize cannot analyze. |
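
A possible invocation with a dialect filter could look as follows (a sketch of
the pass option syntax; `input.mlir` is a placeholder):

```
mlir-opt input.mlir \
  --one-shot-bufferize="dialect-filter=tensor,linalg allow-unknown-ops"
```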
| |
| One-Shot Bufferize can also be called programmatically with |
| [`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167). |
| Alternatively, |
| [`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78) |
| skips the analysis and inserts a copy on every buffer write, just like the |
| dialect conversion-based bufferization. |
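
A minimal programmatic invocation could look as follows (a sketch; the exact
option fields may differ between MLIR versions):

```c++
#include "mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h"
#include "mlir/IR/BuiltinOps.h"

using namespace mlir;

LogicalResult bufferizeWithAnalysis(ModuleOp module) {
  bufferization::OneShotBufferizationOptions options;
  options.allowUnknownOps = true; // Tolerate non-bufferizable ops.
  return bufferization::runOneShotBufferize(module, options);
}
```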
| |
| ## Buffer Deallocation |
| |
**Important: This pass is deprecated. Use the ownership-based buffer
deallocation pass instead.**
| |
In this deprecated mode, One-Shot Bufferize deallocates all buffers that it
allocates. This is in contrast to the dialect conversion-based bufferization,
which delegates this job to the
[`-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-buffer-deallocation-adds-all-required-dealloc-operations-for-all-allocations-in-the-input-program)
pass. By default, One-Shot Bufferize rejects IR where a newly allocated buffer
is returned from a block; such IR fails bufferization.
| |
| A new buffer allocation is returned from a block when the result of an op that |
| is not in destination-passing style is returned. E.g.: |
| |
| ```mlir |
| %0 = scf.if %c -> (tensor<?xf32>) { |
| %1 = tensor.generate ... -> tensor<?xf32> |
| scf.yield %1 : tensor<?xf32> |
| } else { |
| scf.yield %another_tensor : tensor<?xf32> |
| } |
| ``` |
| |
| The `scf.yield` in the "else" branch is OK, but the `scf.yield` in the "then" |
| branch will be rejected. |
| |
| Another case in which a buffer allocation may be returned is when a buffer copy |
| must be inserted due to a RaW conflict. E.g.: |
| |
| ```mlir |
| %0 = scf.if %c -> (tensor<?xf32>) { |
| %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32> |
| "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> () |
| ... |
| scf.yield %1 : tensor<?xf32> |
| } else { |
| scf.yield %yet_another_tensor : tensor<?xf32> |
| } |
| ``` |
| |
| In the above example, a buffer copy of `buffer(%another_tensor)` (with `%cst` |
| inserted) is yielded from the "then" branch. |
| |
| Note: Buffer allocations that are returned from a function are not deallocated. |
| It is the caller's responsibility to deallocate the buffer. For the full |
| function boundary ABI for MemRefs w.r.t. buffer deallocation refer to the |
| [*Function Boundary ABI*](#function-boundary-abi) section. In the future, this |
| could be automated with allocation hoisting (across function boundaries) or |
| reference counting. |
| |
When this deprecated deallocation mode is disabled, One-Shot Bufferize does not
generate any buffer deallocations and thus leaks all memory. The
`-buffer-deallocation-pipeline` has to be run afterwards to insert the
deallocation operations.
| |
| ## Ownership-based Buffer Deallocation |
| |
| Recommended compilation pipeline: |
| ``` |
| one-shot-bufferize |
| | it's recommended to perform all bufferization here at latest, |
| | <- any allocations inserted after this point have to be handled |
| V manually |
| expand-realloc |
| V |
| ownership-based-buffer-deallocation |
| V |
| canonicalize <- mostly for scf.if simplifications |
| V |
| buffer-deallocation-simplification |
| V <- from this point onwards no tensor values are allowed |
| lower-deallocations |
| V |
| CSE |
| V |
| canonicalize |
| ``` |
| |
One-Shot Bufferize does not deallocate any buffers that it allocates. This job
is delegated to the
[`-ownership-based-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-ownership-based-buffer-deallocation)
pass, i.e., after running One-Shot Bufferize, the result IR may have a number of
`memref.alloc` ops but no `memref.dealloc` ops. This pass processes operations
implementing `FunctionOpInterface` one-by-one without analyzing the call graph.
This means that there have to be [some rules](#function-boundary-abi) on how
MemRefs are handled when being passed from one function to another. The rest of
the pass revolves heavily around the `bufferization.dealloc` operation, which is
inserted at the end of each basic block with appropriate operands and should be
optimized using the Buffer Deallocation Simplification pass
(`--buffer-deallocation-simplification`) and the regular canonicalizer
(`--canonicalize`). Lowering the result of the
`-ownership-based-buffer-deallocation` pass directly using
`--convert-bufferization-to-memref` without prior optimization is not
recommended, as it will lead to very inefficient code (the runtime cost of
`bufferization.dealloc` is `O(|memrefs|^2 + |memrefs| * |retained|)`).
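
On the command line, the recommended pipeline could be assembled roughly as
follows (a sketch; `input.mlir` is a placeholder and pass names may evolve):

```
mlir-opt input.mlir \
  --one-shot-bufferize \
  --expand-realloc \
  --ownership-based-buffer-deallocation \
  --canonicalize \
  --buffer-deallocation-simplification \
  --lower-deallocations \
  --cse \
  --canonicalize
```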
| |
| ### Function boundary ABI |
| |
| The Buffer Deallocation pass operates on the level of operations implementing |
| the `FunctionOpInterface`. Such operations can take MemRefs as arguments, but |
| also return them. To ensure compatibility among all functions (including |
| external ones), some rules have to be enforced: |
| * When a MemRef is passed as a function argument, ownership is never acquired. |
| It is always the caller's responsibility to deallocate such MemRefs. |
| * Returning a MemRef from a function always passes ownership to the caller, |
| i.e., it is also the caller's responsibility to deallocate memrefs returned |
| from a called function. |
* A function must not return a MemRef with the same allocated base buffer as
one of its arguments (in this case, a copy has to be created); see the sketch
below. Note that in this context, two subviews of the same buffer that don't
overlap are also considered to alias.
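
The following sketch illustrates the third rule: the function must not return
`%arg0` directly, so a copy is created (e.g., with `bufferization.clone`)
before returning.

```mlir
func.func @return_arg(%arg0: memref<?xf32>) -> memref<?xf32> {
  // Returning %arg0 directly would hand ownership of a buffer the function
  // does not own back to the caller, so a copy is returned instead.
  %clone = bufferization.clone %arg0 : memref<?xf32> to memref<?xf32>
  return %clone : memref<?xf32>
}
```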
| |
For external functions (e.g., library functions written externally in C), the
buffer deallocation pass simply assumes that these rules hold; it is up to the
externally provided implementation to adhere to them. Functions whose
implementation is accessible and on which the deallocation pass is applied are
modified by the pass such that the ABI is respected (i.e., buffer copies are
inserted when necessary).
| |
| ### Inserting `bufferization.dealloc` operations |
| |
`bufferization.dealloc` operations are unconditionally inserted at the end of
each basic block (just before the terminator). The majority of the pass is about
finding the correct operands for this operation. There are three variadic
operand lists to be populated: the first contains all MemRef values that may
need to be deallocated, the second contains their associated ownership values
(of `i1` type), and the third contains MemRef values that are still needed at a
later point and should thus not be deallocated. This operation allows us to deal
with any kind of aliasing behavior: it lowers to runtime aliasing checks when
not enough information can be collected statically. When enough aliasing
information is statically available, operands or the entire op may fold away.
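
Schematically, the operation has the following shape (cf. the lowering examples
later in this document):

```mlir
// Deallocate %m0 if %cond0 is true and %m1 if %cond1 is true, unless the
// respective MemRef aliases with one of the retained values %r0 or %r1. The
// results are the updated ownership indicators for %r0 and %r1.
%ownerships:2 = bufferization.dealloc (%m0, %m1 : memref<8xf32>, memref<8xf32>)
                  if (%cond0, %cond1)
                  retain (%r0, %r1 : memref<8xf32>, memref<8xf32>)
```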
| |
| **Ownerships** |
| |
To compute the ownership operands, we use the concept of ownership indicators
for MemRefs: an ownership indicator materializes as an `i1` value for an SSA
value of `memref` type and indicates whether the basic block in which it was
materialized has ownership of this MemRef. Ideally, this is a constant `true`
or `false`, but it might also be a non-constant SSA value. To keep track of
those ownership values without immediately materializing them (which might
require the insertion of `bufferization.clone` operations or operations
checking for aliasing at runtime at positions where we don't actually need a
materialized value), we use the `Ownership` class. This class represents the
ownership in three states forming a lattice on a partial order:
| ``` |
| forall X in SSA values. uninitialized < unique(X) < unknown |
| forall X, Y in SSA values. |
| unique(X) == unique(Y) iff X and Y always evaluate to the same value |
| unique(X) != unique(Y) otherwise |
| ``` |
Intuitively, the states have the following meaning:
* Uninitialized: The ownership is not initialized yet; this is the default
state. Once an operation is finished processing, the ownership of all
operation results with MemRef type should no longer be uninitialized.
* Unique: There is a specific SSA value that can be queried to check ownership
without materializing any additional IR.
* Unknown: No specific SSA value is available without materializing additional
IR. Typically, this is because two ownerships in the 'Unique' state would have
to be merged manually. (E.g., the result of an `arith.select` has either the
ownership of the "then" or the "else" case, depending on the condition value;
inserting another `arith.select` on the ownership values can perform the merge
and provide a 'Unique' ownership for the result.) However, in the general
case, the 'Unknown' state has to be assigned.
| |
| Implied by the above partial order, the pass combines two ownerships in the |
| following way: |
| |
| | Ownership 1 | Ownership 2 | Combined Ownership | |
| |:--------------|:--------------|:-------------------| |
| | uninitialized | uninitialized | uninitialized | |
| | unique(X) | uninitialized | unique(X) | |
| | unique(X) | unique(X) | unique(X) | |
| | unique(X) | unique(Y) | unknown | |
| | unknown | unique | unknown | |
| | unknown | uninitialized | unknown | |

The remaining combinations follow by symmetry.
| |
| **Collecting the list of MemRefs that potentially need to be deallocated** |
| |
For a given block, the list of MemRefs that potentially need to be deallocated
at the end of that block is computed by keeping track of all values for which
the block potentially takes over ownership. This includes MemRefs provided as
basic block arguments and MemRefs allocated in the block itself (as reported by
the interface handlers for operations like `memref.alloc` and `func.call`), but
also relies on liveness information in regions with multiple basic blocks. More
concretely, the list is computed by taking the MemRefs in the 'in' set of the
liveness analysis of the current basic block B, adding the MemRef block
arguments and the set of MemRefs allocated in B itself (determined by the
interface handlers), and then subtracting the set of MemRefs deallocated in B
(also determined by the interface handlers).
| |
| Note that we don't have to take the intersection of the liveness 'in' set with |
| the 'out' set of the predecessor block because a value that is in the 'in' set |
| must be defined in an ancestor block that dominates all direct predecessors and |
| thus the 'in' set of this block is a subset of the 'out' sets of each |
| predecessor. |
| |
| ``` |
| memrefs = filter((liveIn(block) U |
| allocated(block) U arguments(block)) \ deallocated(block), isMemRef) |
| ``` |
| |
The list of conditions for the second variadic operand list of
`bufferization.dealloc` is computed by querying the stored ownership value for
each of the MemRefs collected as described above. The ownership state is
updated by the interface handlers while processing the basic block.
| |
| **Collecting the list of MemRefs to retain** |
| |
Given a basic block B, the list of MemRefs that have to be retained can be
different for each successor block S. The list of MemRefs to be retained in B
with respect to S consists of the MemRefs in the successor operand list of the
terminator, plus the MemRefs in the 'out' set of the liveness analysis for B
intersected with the 'in' set of the destination block S.
| |
| This list of retained values makes sure that we cannot run into use-after-free |
| situations even if no aliasing information is present at compile-time. |
| |
| ``` |
toRetain = filter(successorOperands + (liveOut(fromBlock) intersect
liveIn(toBlock)), isMemRef)
| ``` |
| |
| ### Supported interfaces |
| |
| The pass uses liveness analysis and a few interfaces: |
| * `FunctionOpInterface` |
| * `CallOpInterface` |
| * `MemoryEffectOpInterface` |
| * `RegionBranchOpInterface` |
| * `RegionBranchTerminatorOpInterface` |
| |
Due to insufficient information provided by these interfaces, the pass also
special-cases the `cf.cond_br` operation and currently makes some assumptions
about operations implementing the `RegionBranchOpInterface`; improving the
interfaces would allow us to remove those dependencies in the future.
| |
| ### Limitations |
| |
| The Buffer Deallocation pass has some requirements and limitations on the input |
| IR. These are checked in the beginning of the pass and errors are emitted |
| accordingly: |
* The set of interfaces the pass operates on must be implemented (correctly).
E.g., if an operation with a nested region does not implement the
`RegionBranchOpInterface`, an error is emitted because the pass cannot know
the semantics of the nested region (and does not make any default assumptions
about it).
* No explicit control-flow loops are present. Currently, only loops using
structured control flow are supported. However, this limitation could be
lifted in the future.
* Deallocation operations should not be present already. The pass is likely to
handle them correctly in most cases, but this is not officially supported yet
due to insufficient testing.
| * Terminators must implement either `RegionBranchTerminatorOpInterface` or |
| `BranchOpInterface`, but not both. Terminators with more than one successor |
| are not supported (except `cf.cond_br`). This is not a fundamental |
| limitation, but there is no use-case justifying the more complex |
| implementation at the moment. |
| |
| ### Example |
| |
| The following example contains a few interesting cases: |
* Basic block arguments are modified to also pass along the ownership
indicator, but not for entry blocks of non-private functions (assuming the
`private-function-dynamic-ownership` pass option is disabled), where the
function boundary ABI is applied instead. "Private" in this context refers to
functions that cannot be called externally.
| * The result of `arith.select` initially has 'Unknown' assigned as ownership, |
| but once the `bufferization.dealloc` operation is inserted it is put in the |
| 'retained' list (since it has uses in a later basic block) and thus the |
| 'Unknown' ownership can be replaced with a 'Unique' ownership using the |
| corresponding result of the dealloc operation. |
| * The `cf.cond_br` operation has more than one successor and thus has to |
| insert two `bufferization.dealloc` operations (one for each successor). |
| While they have the same list of MemRefs to deallocate (because they perform |
| the deallocations for the same block), it must be taken into account that |
| some MemRefs remain *live* for one branch but not the other (thus set |
| intersection is performed on the *live-out* of the current block and the |
| *live-in* of the target block). Also, `cf.cond_br` supports separate |
| forwarding operands for each successor. To make sure that no MemRef is |
| deallocated twice (because there are two `bufferization.dealloc` operations |
| with the same MemRefs to deallocate), the condition operands are adjusted to |
| take the branch condition into account. While a generic lowering for such |
| terminator operations could be implemented, a specialized implementation can |
| take all the semantics of this particular operation into account and thus |
| generate a more efficient lowering. |
| |
| ```mlir |
| func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) { |
| %alloc = memref.alloc() : memref<?xi8> |
| %alloca = memref.alloca() : memref<?xi8> |
| %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8> |
| cf.cond_br %br_cond, ^bb1(%alloc : memref<?xi8>), ^bb1(%memref : memref<?xi8>) |
| ^bb1(%bbarg: memref<?xi8>): |
| test.copy(%bbarg, %select) : (memref<?xi8>, memref<?xi8>) |
| return |
| } |
| ``` |
| |
| After running `--ownership-based-buffer-deallocation`, it looks as follows: |
| |
| ```mlir |
| // Since this is not a private function, the signature will not be modified even |
| // when private-function-dynamic-ownership is enabled. Instead the function |
| // boundary ABI has to be applied which means that ownership of `%memref` will |
| // never be acquired. |
| func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) { |
| %false = arith.constant false |
| %true = arith.constant true |
| |
| // The ownership of a MemRef defined by the `memref.alloc` operation is always |
| // assigned to be 'true'. |
| %alloc = memref.alloc() : memref<?xi8> |
| |
| // The ownership of a MemRef defined by the `memref.alloca` operation is |
| // always assigned to be 'false'. |
| %alloca = memref.alloca() : memref<?xi8> |
| |
| // The ownership of %select will be the join of the ownership of %alloc and |
| // the ownership of %alloca, i.e., of %true and %false. Because the pass does |
| // not know about the semantics of the `arith.select` operation (unless a |
| // custom handler is implemented), the ownership join will be 'Unknown'. If |
| // the materialized ownership indicator of %select is needed, either a clone |
| // has to be created for which %true is assigned as ownership or the result |
| // of a `bufferization.dealloc` where %select is in the retain list has to be |
| // used. |
| %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8> |
| |
// We use `memref.extract_strided_metadata` to get the base memref since it is
// not allowed to pass arbitrary memrefs to `memref.dealloc`. This property is
// already enforced for `bufferization.dealloc`.
| %base_buffer_memref, ... = memref.extract_strided_metadata %memref |
| : memref<?xi8> -> memref<i8>, index, index, index |
| %base_buffer_alloc, ... = memref.extract_strided_metadata %alloc |
| : memref<?xi8> -> memref<i8>, index, index, index |
| %base_buffer_alloca, ... = memref.extract_strided_metadata %alloca |
| : memref<?xi8> -> memref<i8>, index, index, index |
| |
| // The deallocation conditions need to be adjusted to incorporate the branch |
| // condition. In this example, this requires only a single negation, but might |
| // also require multiple arith.andi operations. |
| %not_br_cond = arith.xori %true, %br_cond : i1 |
| |
| // There are two dealloc operations inserted in this basic block, one per |
| // successor. Both have the same list of MemRefs to deallocate and the |
| // conditions only differ by the branch condition conjunct. |
| // Note, however, that the retained list differs. Here, both contain the |
| // %select value because it is used in both successors (since it's the same |
| // block), but the value passed via block argument differs (%memref vs. |
| // %alloc). |
| %10:2 = bufferization.dealloc |
| (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca |
| : memref<i8>, memref<i8>, memref<i8>) |
| if (%false, %br_cond, %false) |
| retain (%alloc, %select : memref<?xi8>, memref<?xi8>) |
| |
| %11:2 = bufferization.dealloc |
| (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca |
| : memref<i8>, memref<i8>, memref<i8>) |
| if (%false, %not_br_cond, %false) |
| retain (%memref, %select : memref<?xi8>, memref<?xi8>) |
| |
// Because %select is used in ^bb1 without passing it via block argument, we
// need to update its ownership value here by merging the ownership values
// returned by the dealloc operations.
| %new_ownership = arith.select %br_cond, %10#1, %11#1 : i1 |
| |
| // The terminator is modified to pass along the ownership indicator values |
| // with each MemRef value. |
| cf.cond_br %br_cond, ^bb1(%alloc, %10#0 : memref<?xi8>, i1), |
| ^bb1(%memref, %11#0 : memref<?xi8>, i1) |
| |
| // All non-entry basic blocks are modified to have an additional i1 argument for |
| // each MemRef value in the argument list. |
| ^bb1(%13: memref<?xi8>, %14: i1): // 2 preds: ^bb0, ^bb0 |
| test.copy(%13, %select) : (memref<?xi8>, memref<?xi8>) |
| |
| %base_buffer_13, ... = memref.extract_strided_metadata %13 |
| : memref<?xi8> -> memref<i8>, index, index, index |
| %base_buffer_select, ... = memref.extract_strided_metadata %select |
| : memref<?xi8> -> memref<i8>, index, index, index |
| |
| // Here, we don't have a retained list, because the block has no successors |
| // and the return has no operands. |
| bufferization.dealloc (%base_buffer_13, %base_buffer_select |
| : memref<i8>, memref<i8>) |
| if (%14, %new_ownership) |
| return |
| } |
| ``` |
| |
| ## Buffer Deallocation Simplification Pass |
| |
| The [semantics of the `bufferization.dealloc` operation](https://mlir.llvm.org/docs/Dialects/BufferizationOps/#bufferizationdealloc-bufferizationdeallocop) |
| provide a lot of opportunities for optimizations which can be conveniently split |
| into patterns using the greedy pattern rewriter. Some of those patterns need |
| access to additional analyses such as an analysis that can determine whether two |
| MemRef values must, may, or never originate from the same buffer allocation. |
| These patterns are collected in the Buffer Deallocation Simplification pass, |
| while patterns that don't need additional analyses are registered as part of the |
| regular canonicalizer pass. This pass is best run after |
| `--ownership-based-buffer-deallocation` followed by `--canonicalize`. |
| |
| The pass applies patterns for the following simplifications: |
* Remove MemRefs from the 'retain' list when they are guaranteed to not alias
with any value in the 'memref' operand list. This avoids an additional
aliasing check with the removed value.
* Split off a value in the 'memref' list into a new `bufferization.dealloc`
operation that contains only this value in its 'memref' list when it is
guaranteed to not alias with any other value in the 'memref' list. This avoids
at least one aliasing check at runtime and enables using a more efficient
lowering for the new `bufferization.dealloc` operation.
* Remove values from the 'memref' operand list when they are guaranteed to
alias with at least one value in the 'retain' list and may not alias any other
value in the 'retain' list.
| |
| ## Lower Deallocations Pass |
| |
| The `-lower-deallocations` pass transforms all `bufferization.dealloc` |
| operations to `memref.dealloc` operations and may also insert operations from |
| the `scf`, `func`, and `arith` dialects to make deallocations conditional and |
| check whether two MemRef values come from the same allocation at runtime (when |
| the `buffer-deallocation-simplification` pass wasn't able to determine it |
| statically). |
| |
| The same lowering of the `bufferization.dealloc` operation is also part of the |
| `-convert-bufferization-to-memref` conversion pass which also lowers all the |
| other operations of the bufferization dialect. |
| |
| We distinguish multiple cases in this lowering pass to provide an overall more |
| efficient lowering. In the general case, a library function is created to avoid |
| quadratic code size explosion (relative to the number of operands of the dealloc |
| operation). The specialized lowerings aim to avoid this library function because |
| it requires allocating auxiliary MemRefs of index values. |
| |
| ### Generic Lowering |
| |
| A library function is generated to avoid code-size blow-up. On a high level, the |
| base-memref of all operands is extracted as an index value and stored into |
| specifically allocated MemRefs and passed to the library function which then |
| determines whether they come from the same original allocation. This information |
| is needed to avoid double-free situations and to correctly retain the MemRef |
| values in the `retained` list. |
| |
| **Dealloc Operation Lowering** |
| |
This lowering supports all features the dealloc operation has to offer. It
computes the base pointer of each memref (as an index), stores it in a
new memref helper structure and passes it to the helper function generated
in `buildDeallocationLibraryFunction`. The results are stored in two lists
(represented as MemRefs) of booleans passed as arguments. The first list
stores whether the corresponding MemRef should be deallocated, the
second list stores the ownership of the retained values which can be used
to replace the result values of the `bufferization.dealloc` operation.
| |
| Example: |
| ``` |
| %0:2 = bufferization.dealloc (%m0, %m1 : memref<2xf32>, memref<5xf32>) |
| if (%cond0, %cond1) |
| retain (%r0, %r1 : memref<1xf32>, memref<2xf32>) |
| ``` |
| lowers to (simplified): |
| ``` |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %dealloc_base_pointer_list = memref.alloc() : memref<2xindex> |
| %cond_list = memref.alloc() : memref<2xi1> |
| %retain_base_pointer_list = memref.alloc() : memref<2xindex> |
| %m0_base_pointer = memref.extract_aligned_pointer_as_index %m0 |
| memref.store %m0_base_pointer, %dealloc_base_pointer_list[%c0] |
| %m1_base_pointer = memref.extract_aligned_pointer_as_index %m1 |
| memref.store %m1_base_pointer, %dealloc_base_pointer_list[%c1] |
| memref.store %cond0, %cond_list[%c0] |
| memref.store %cond1, %cond_list[%c1] |
| %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0 |
| memref.store %r0_base_pointer, %retain_base_pointer_list[%c0] |
| %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1 |
| memref.store %r1_base_pointer, %retain_base_pointer_list[%c1] |
| %dyn_dealloc_base_pointer_list = memref.cast %dealloc_base_pointer_list : |
| memref<2xindex> to memref<?xindex> |
| %dyn_cond_list = memref.cast %cond_list : memref<2xi1> to memref<?xi1> |
| %dyn_retain_base_pointer_list = memref.cast %retain_base_pointer_list : |
| memref<2xindex> to memref<?xindex> |
| %dealloc_cond_out = memref.alloc() : memref<2xi1> |
| %ownership_out = memref.alloc() : memref<2xi1> |
| %dyn_dealloc_cond_out = memref.cast %dealloc_cond_out : |
| memref<2xi1> to memref<?xi1> |
| %dyn_ownership_out = memref.cast %ownership_out : |
| memref<2xi1> to memref<?xi1> |
| call @dealloc_helper(%dyn_dealloc_base_pointer_list, |
| %dyn_retain_base_pointer_list, |
| %dyn_cond_list, |
| %dyn_dealloc_cond_out, |
| %dyn_ownership_out) : (...) |
| %m0_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c0] : memref<2xi1> |
| scf.if %m0_dealloc_cond { |
| memref.dealloc %m0 : memref<2xf32> |
| } |
| %m1_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c1] : memref<2xi1> |
| scf.if %m1_dealloc_cond { |
| memref.dealloc %m1 : memref<5xf32> |
| } |
| %r0_ownership = memref.load %dyn_ownership_out[%c0] : memref<2xi1> |
| %r1_ownership = memref.load %dyn_ownership_out[%c1] : memref<2xi1> |
| memref.dealloc %dealloc_base_pointer_list : memref<2xindex> |
| memref.dealloc %retain_base_pointer_list : memref<2xindex> |
| memref.dealloc %cond_list : memref<2xi1> |
| memref.dealloc %dealloc_cond_out : memref<2xi1> |
| memref.dealloc %ownership_out : memref<2xi1> |
| // replace %0#0 with %r0_ownership |
| // replace %0#1 with %r1_ownership |
| ``` |
| |
| **Library function** |
| |
| A library function is built per compilation unit that can be called at |
| bufferization dealloc sites to determine whether two MemRefs come from the same |
| allocation and their new ownerships. |
| |
| The generated function takes two MemRefs of indices and three MemRefs of |
| booleans as arguments: |
| * The first argument A should contain the result of the |
| extract_aligned_pointer_as_index operation applied to the MemRefs to be |
| deallocated |
| * The second argument B should contain the result of the |
| extract_aligned_pointer_as_index operation applied to the MemRefs to be |
| retained |
| * The third argument C should contain the conditions as passed directly |
| to the deallocation operation. |
| * The fourth argument D is used to pass results to the caller. Those |
| represent the condition under which the MemRef at the corresponding |
| position in A should be deallocated. |
* The fifth argument E is used to pass results to the caller. It
provides the ownership value corresponding to the MemRef at the same
position in B.
| |
| This helper function is supposed to be called once for each |
| `bufferization.dealloc` operation to determine the deallocation need and |
| new ownership indicator for the retained values, but does not perform the |
| deallocation itself. |
| |
| Generated code: |
| ``` |
| func.func @dealloc_helper( |
| %dyn_dealloc_base_pointer_list: memref<?xindex>, |
| %dyn_retain_base_pointer_list: memref<?xindex>, |
| %dyn_cond_list: memref<?xi1>, |
| %dyn_dealloc_cond_out: memref<?xi1>, |
| %dyn_ownership_out: memref<?xi1>) { |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1 : index |
| %true = arith.constant true |
| %false = arith.constant false |
| %num_dealloc_memrefs = memref.dim %dyn_dealloc_base_pointer_list, %c0 |
| %num_retain_memrefs = memref.dim %dyn_retain_base_pointer_list, %c0 |
| // Zero initialize result buffer. |
| scf.for %i = %c0 to %num_retain_memrefs step %c1 { |
| memref.store %false, %dyn_ownership_out[%i] : memref<?xi1> |
| } |
| scf.for %i = %c0 to %num_dealloc_memrefs step %c1 { |
| %dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%i] |
| %cond = memref.load %dyn_cond_list[%i] |
| // Check for aliasing with retained memrefs. |
| %does_not_alias_retained = scf.for %j = %c0 to %num_retain_memrefs |
| step %c1 iter_args(%does_not_alias_aggregated = %true) -> (i1) { |
| %retain_bp = memref.load %dyn_retain_base_pointer_list[%j] |
| %does_alias = arith.cmpi eq, %retain_bp, %dealloc_bp : index |
| scf.if %does_alias { |
| %curr_ownership = memref.load %dyn_ownership_out[%j] |
| %updated_ownership = arith.ori %curr_ownership, %cond : i1 |
| memref.store %updated_ownership, %dyn_ownership_out[%j] |
| } |
| %does_not_alias = arith.cmpi ne, %retain_bp, %dealloc_bp : index |
| %updated_aggregate = arith.andi %does_not_alias_aggregated, |
| %does_not_alias : i1 |
| scf.yield %updated_aggregate : i1 |
| } |
| // Check for aliasing with dealloc memrefs in the list before the |
| // current one, i.e., |
| // `fix i, forall j < i: check_aliasing(%dyn_dealloc_base_pointer[j], |
| // %dyn_dealloc_base_pointer[i])` |
| %does_not_alias_any = scf.for %j = %c0 to %i step %c1 |
| iter_args(%does_not_alias_agg = %does_not_alias_retained) -> (i1) { |
| %prev_dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%j] |
| %does_not_alias = arith.cmpi ne, %prev_dealloc_bp, %dealloc_bp |
| %updated_alias_agg = arith.andi %does_not_alias_agg, %does_not_alias |
| scf.yield %updated_alias_agg : i1 |
| } |
| %dealloc_cond = arith.andi %does_not_alias_any, %cond : i1 |
| memref.store %dealloc_cond, %dyn_dealloc_cond_out[%i] : memref<?xi1> |
| } |
| return |
| } |
| ``` |
| |
| ### Specialized Lowerings |
| |
| Currently, there are two special lowerings for common cases to avoid the library |
| function and thus unnecessary memory load and store operations and function |
| calls: |
| |
| **One memref, no retained** |
| |
| Lower a simple case without any retained values and a single MemRef. Ideally, |
| static analysis can provide enough information such that the |
| `buffer-deallocation-simplification` pass is able to split the dealloc |
| operations up into this simple case as much as possible before running this |
| pass. |
| |
| Example: |
| ```mlir |
| bufferization.dealloc (%arg0 : memref<2xf32>) if (%arg1) |
| ``` |
| is lowered to |
| ```mlir |
| scf.if %arg1 { |
| memref.dealloc %arg0 : memref<2xf32> |
| } |
| ``` |
| |
| In most cases, the branch condition is either constant 'true' or 'false' and can |
| thus be optimized away entirely by the canonicalizer pass. |
| |
| **One memref, arbitrarily many retained** |
| |
A special-case lowering for the deallocation operation with exactly one MemRef,
but an arbitrary number of retained values. The size of the code produced by
this lowering is linear in the number of retained values.
| |
| Example: |
| ```mlir |
| %0:2 = bufferization.dealloc (%m : memref<2xf32>) if (%cond) |
| retain (%r0, %r1 : memref<1xf32>, memref<2xf32>) |
| return %0#0, %0#1 : i1, i1 |
| ``` |
| is lowered to |
| ```mlir |
| %m_base_pointer = memref.extract_aligned_pointer_as_index %m |
| %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0 |
| %r0_does_not_alias = arith.cmpi ne, %m_base_pointer, %r0_base_pointer |
| %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1 |
| %r1_does_not_alias = arith.cmpi ne, %m_base_pointer, %r1_base_pointer |
| %not_retained = arith.andi %r0_does_not_alias, %r1_does_not_alias : i1 |
| %should_dealloc = arith.andi %not_retained, %cond : i1 |
| scf.if %should_dealloc { |
| memref.dealloc %m : memref<2xf32> |
| } |
| %true = arith.constant true |
| %r0_does_alias = arith.xori %r0_does_not_alias, %true : i1 |
| %r0_ownership = arith.andi %r0_does_alias, %cond : i1 |
| %r1_does_alias = arith.xori %r1_does_not_alias, %true : i1 |
| %r1_ownership = arith.andi %r1_does_alias, %cond : i1 |
| return %r0_ownership, %r1_ownership : i1, i1 |
| ``` |
| |
| ## Memory Layouts |
| |
One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
ops are bufferizable. However, when encountering a non-bufferizable tensor with
`allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
bufferization boundary and decide on a memref type. By default, One-Shot
Bufferize chooses the most dynamic memref type w.r.t. layout maps. E.g.:
| |
| ```mlir |
| %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) |
| %1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32> |
| ``` |
| |
When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` op with
dynamic offset and strides:
| |
| ```mlir |
| %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>) |
| %0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>> |
| %1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>> |
| ``` |
| |
| All users of `%0` have fully dynamic layout maps. This ensures that the |
| bufferized IR composes well with future bufferizations of `unbufferizable_op` |
| (maybe bufferized by another pass), regardless of the exact memref type of the |
| future bufferization. If the op turns out to be bufferized to an op with a |
| simpler memref type (e.g., identity layout map), we expect that canonicalization |
| patterns would clean up unnecessarily dynamic layout maps. (Some of these |
| canonicalization patterns may not be implemented yet.) |
| |
One-Shot Bufferize tries to infer the most precise memref type when bufferizing
an op. If the entire IR is bufferizable, we do not have to resort to
conservative, fully dynamic layout maps. In that case, we also do not have to
rely on canonicalization patterns to clean up the bufferized IR.
| |
Note: There are some bufferizable ops for which a precise layout map cannot be
| inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>` |
| must be bufferized to a `memref.cast` with a memref type that has a fully |
| dynamic layout map. |
| |
| One-Shot Bufferize has an option `unknown-type-conversion` to control the |
| generation of layout maps when no precise layout can be inferred: |
| |
| * `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default |
| behavior. This composes well when IR is partially bufferized. |
| * `identity-layout-map` uses static identity layout maps. This option can be |
| useful for legacy code that cannot handle memref types with layout maps. |
| Note that this setting can lead to additional buffer copies when folding a |
| `to_tensor`/`to_memref` pair with memref types that are not cast-compatible. |
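
For the example from the previous section, `identity-layout-map` would produce
a memref type without a layout attribute (sketch):

```mlir
%0_m = bufferization.to_memref %0 : memref<?x?xf32>
%1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32>
```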
| |
| Note: The `unknown-type-conversion` option does not affect layout maps of |
| function signatures. There is a separate `function-signature-type-conversion` |
| option that controls layout maps of function parameters and function results. |
| |
| ## Extending One-Shot Bufferize |
| |
| Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users |
| must at least implement the following interface methods. |
| |
| * `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor |
| OpOperand is read. |
| * `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor |
| OpOperand is written (if bufferizing in-place). |
* `getAliasingOpResult`: Return the OpResults that may share the same buffer
as the given OpOperand. This interface method describes the
OpOperand-to-OpResult mapping w.r.t. destination-passing style.
* `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
is the exact same memref as the aliasing OpOperand after bufferization (in
case of in-place bufferization). Otherwise (e.g., if they overlap but are not
necessarily the exact same memrefs), `BufferRelation::Unknown` should be
returned. Additional buffer relations will be added in the future, but
`BufferRelation::Unknown` is always safe.
| * `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced |
| with `bufferization::replaceOpWithBufferizedValues`. |
| |
| To get a better intuition of the interface methods, we invite users to take a |
| look at existing implementations in MLIR, e.g., the implementation of |
| `tensor.insert` or `tensor.extract`. |
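
As a rough illustration, an external model for a hypothetical single-operand,
single-result op `my_dialect.my_op` (a made-up op for this sketch) could look
as follows. The exact method signatures evolve with MLIR, so consult the
interface definition for the authoritative API.

```c++
#include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"

using namespace mlir;
using namespace mlir::bufferization;

// Hypothetical destination-passing style op: it reads its tensor operand and,
// when bufferizing in-place, writes the result into the operand's buffer.
struct MyOpInterface
    : public BufferizableOpInterface::ExternalModel<MyOpInterface,
                                                    my_dialect::MyOp> {
  bool bufferizesToMemoryRead(Operation *op, OpOperand &opOperand,
                              const AnalysisState &state) const {
    return true; // The contents of the operand's buffer are read.
  }

  bool bufferizesToMemoryWrite(Operation *op, OpOperand &opOperand,
                               const AnalysisState &state) const {
    return true; // The operand's buffer is written (if in-place).
  }

  SmallVector<OpResult> getAliasingOpResult(Operation *op, OpOperand &opOperand,
                                            const AnalysisState &state) const {
    return {op->getResult(0)}; // The result aliases with the "destination".
  }

  BufferRelation bufferRelation(Operation *op, OpResult opResult,
                                const AnalysisState &state) const {
    return BufferRelation::Equivalent;
  }

  LogicalResult bufferize(Operation *op, RewriterBase &rewriter,
                          const BufferizationOptions &options) const {
    FailureOr<Value> buffer = getBuffer(rewriter, op->getOperand(0), options);
    if (failed(buffer))
      return failure();
    // ... create the memref-based form of the op here ...
    replaceOpWithBufferizedValues(rewriter, op, *buffer);
    return success();
  }
};
```

Such a model is typically attached to the op in a registration function like
the ones mentioned in the section on using One-Shot Bufferize.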
| |
| ## Debugging Buffer Copies |
| |
To get a better understanding of why One-Shot Bufferize introduced a buffer
copy, users can run the pass with `test-analysis-only` and `print-conflicts`.
Every tensor op is then annotated with an attribute that has a boolean value
for each tensor OpOperand. `true` means that the OpOperand bufferizes in-place.
`false` means that the OpOperand bufferizes out-of-place and a buffer copy will
be inserted.
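
A typical invocation could look as follows (a sketch; `input.mlir` is a
placeholder):

```
mlir-opt input.mlir \
  --one-shot-bufferize="test-analysis-only print-conflicts"
```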
| |
| There are two reasons why a buffer copy may be inserted. |
| |
| 1. Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the |
| overwritten data is still needed. |
| 2. The buffer is not writable. E.g., `memref.global` buffers that are the |
| result of `arith.constant` ops are never modified. |
| |
| In the first case, `print-conflicts` illustrates the conflict in the form of a |
| ("read", "conflicting write", "last write") tuple. |
| |
| ## Understanding the SSA Use-Def Chain Analysis |
| |
| To get a better understanding of the SSA Use-Def Chain Analysis and the RaW |
| conflict detection algorithm, we invite interested users to read the |
| [design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf) |
| and watch the corresponding [ODM talk](https://youtu.be/TXEo59CYS9A) |
| ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)). |
| |
| ## Migrating from Dialect Conversion-based Bufferization |
| |
| Both dialect conversion-based bufferization and One-Shot Bufferize generate |
| `to_tensor`/`to_memref` ops at the bufferization boundary (when run with |
| `allow-unknown-ops`). They can be combined and run in sequence. However, |
| One-Shot Bufferize must run first because it cannot analyze those boundary ops. |
| To update existing code step-by-step, it may be useful to specify a dialect |
| filter for One-Shot Bufferize, so that dialects can be switched over one-by-one. |
| |
| ## Bufferization Function Graphs |
| |
One-Shot Bufferize currently does not support function graph bufferization.
I.e., `CallOp`, `ReturnOp` and function bbArgs are not bufferizable. Users can
run the existing `--func-bufferize` bufferization pass after One-Shot
Bufferize.
| |
| Alternatively, users can try |
| [`ModuleBufferization`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Linalg/ComprehensiveBufferize/ModuleBufferization.h#L31), |
| which is an extension of One-Shot Bufferize. This bufferization is still under |
| development and does not support arbitrary IR. In essence, returning a tensor |
| from a function is not supported, unless it is equivalent to a function bbArg. |
| In that case, the corresponding return value can simply be dropped during |
| bufferization. |
| |
| ## Dialect Conversion-based Bufferization |
| |
| Disclaimer: Most dialect conversion-based bufferization has been migrated to |
| One-Shot Bufferize. New users should use One-Shot Bufferize (with or without |
| analysis). The following documentation is only for existing users of dialect |
| conversion-based bufferization. |
| |
This system is a simple application of MLIR's dialect conversion
infrastructure. The bulk of the code related to bufferization is a set of
ordinary `ConversionPattern`s that dialect authors write for converting ops
that operate on `tensor`s to ops that operate on `memref`s. A set of
conventions and best practices are followed that allow these patterns to be run
across multiple independent passes (rather than requiring a single huge atomic
conversion pass), which makes the compilation pipelines scalable, robust, and
easy to debug.
| |
| This document is targeted at people looking to utilize MLIR's bufferization |
| functionality, along with people who want to extend it to cover their own ops. |
| |
| <a name="the-talk">**NOTE:**</a> Before reading this document, please watch the |
| talk "Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization |
| Infrastructure" |
| ([slides](https://drive.google.com/file/d/1FVbzCXxZzS9LBLuvpPNLWJD-XDkt54ky/view?usp=sharing), |
| [recording](https://drive.google.com/file/d/1VfVajitgf8ZPnd-HRkJvaJiFLhBsluXN/view?usp=sharing)). |
| That talk gives a high-level overview of the bufferization infrastructure and |
| important conceptual details related to using the MLIR dialect conversion |
| infrastructure. |
| |
| ### Bufferization's place in a compilation pipeline |
| |
| Bufferization itself does not free any of the buffers that have been allocated, |
| nor does it do anything particularly intelligent with the placement of buffers |
| w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist |
| of: |
| |
| 1. Bufferization |
| 1. Buffer optimizations such as `buffer-hoisting`, `buffer-loop-hoisting`, and |
| `promote-buffers-to-stack`, which do optimizations that are only exposed |
| after bufferization. |
| 1. Finally, running the [buffer deallocation](BufferDeallocationInternals.md) |
| pass. |
| |
| After buffer deallocation has been completed, the program will be quite |
| difficult to transform due to the presence of the deallocation ops. Thus, other |
| optimizations such as linalg fusion on memrefs should be done before that stage. |
| |
| ### General structure of the bufferization process |
| |
| Bufferization consists of running multiple *partial* bufferization passes, |
| followed by one *finalizing* bufferization pass. |
| |
| There is typically one partial bufferization pass per dialect (though other |
| subdivisions are possible). For example, for a dialect `X` there will typically |
| be a pass `X-bufferize` that knows how to bufferize all the ops in that dialect. |
| By running pass `X-bufferize` for each dialect `X` in the program, all the ops |
| in the program are incrementally bufferized. |
| |
| Partial bufferization passes create programs where only some ops have been |
| bufferized. These passes will create *materializations* (also sometimes called |
| "casts") that convert between the `tensor` and `memref` type, which allows |
| bridging between ops that have been bufferized and ops that have not yet been |
| bufferized. |
| |
Finalizing bufferizations complete the bufferization process and guarantee that
no tensors remain in the program. This involves eliminating the
materializations. The `finalizing-bufferize` pass is a minimal pass that only
eliminates materializations and issues an error if any unbufferized ops exist
in the program.
| |
| However, it is possible for a finalizing bufferization to do more than just |
| eliminate materializations. By adding patterns (just as a partial bufferization |
| would), it is possible for a finalizing bufferization pass to simultaneously |
| bufferize ops and eliminate materializations. This has a number of disadvantages |
| discussed in the talk and should generally be avoided. |
| |
| ### Example |
| |
| As a concrete example, we will look at the bufferization pipeline from the |
| `mlir-npcomp` reference backend |
| ([code](https://github.com/llvm/mlir-npcomp/blob/97d6d04d41216e73d40b89ffd79620973fc14ce3/lib/RefBackend/RefBackend.cpp#L232)). |
| The code, slightly simplified and annotated, is reproduced here: |
| |
| ```c++ |
| // Partial bufferization passes. |
| pm.addPass(createTensorConstantBufferizePass()); |
| pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect. |
| pm.addNestedPass<func::FuncOp>(createSCFBufferizePass()); |
| pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass()); |
| pm.addNestedPass<func::FuncOp>(createTensorBufferizePass()); |
| pm.addPass(createFuncBufferizePass()); |
| |
| // Finalizing bufferization pass. |
| pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass()); |
| ``` |
| |
| Looking first at the partial bufferization passes, we see that there is a |
| sequence of `FuncOp` passes (which run in parallel on functions). These function |
| passes are bracketed by `tensor-constant-bufferize` and `func-bufferize`, which |
| are module passes (and thus serialize the parallel compilation process). These |
| two passes must be module passes because they make changes to the top-level |
| module. |
| |
| The bulk of the bufferization work is done by the function passes. Most of these |
| passes are provided as part of the upstream MLIR distribution and bufferize |
| their respective dialects (e.g. `scf-bufferize` bufferizes the `scf` dialect). |
| The `tcp-bufferize` pass is an exception -- it is a partial bufferization pass |
| used to bufferize the downstream `tcp` dialect, and fits in perfectly with all |
| the other passes provided upstream. |
| |
| The last pass is the finalizing bufferization pass. The `mlir-npcomp` reference |
| backend has arranged that all ops are bufferized by partial bufferizations, so |
| that the upstream `finalizing-bufferize` pass can be used as the finalizing |
| bufferization pass. This gives excellent diagnostics when something goes wrong |
| with the bufferization process, such as an op that wasn't handled by any |
| pattern. |
| |
| ### How to write a partial bufferization pass |
| |
| The contract of a partial bufferization pass is that a subset of ops (or kinds |
| of ops, customizable via a `ConversionTarget`) get bufferized. |
| |
| A partial bufferization pass is just a pass that uses the |
| [dialect conversion](DialectConversion.md) framework to apply |
| `ConversionPattern`s with a `tensor` to `memref` type conversion. |
| |
| To describe how to write such a pass, we will walk through an example, the |
| `tensor-bufferize` pass |
| ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23), |
| [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Tensor/bufferize.mlir#L1)) |
| that bufferizes the `tensor` dialect. Note that these passes have since been |
| replaced with a `BufferizableOpInterface`-based implementation, so we have to |
| look at an older version of the code. |
| |
| The bulk of the code in the pass will be a set of conversion patterns, with a |
| simple example being |
| [BufferizeCastOp](https://github.com/llvm/llvm-project/blob/2bf6e443e54604c7818c4d1a1837f3d091023270/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23). |
| |
| ```c++ |
| // Convert `tensor.cast` to an equivalent cast on memrefs. |
| class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> { |
| public: |
|   using OpConversionPattern::OpConversionPattern; |
|   LogicalResult |
|   matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor, |
|                   ConversionPatternRewriter &rewriter) const override { |
|     // The type converter maps the tensor result type to a memref type. |
|     auto resultType = getTypeConverter()->convertType(op.getType()); |
|     // The adaptor holds the already-converted (memref) operands. |
|     rewriter.replaceOpWithNewOp<memref::CastOp>(op, resultType, |
|                                                 adaptor.source()); |
|     return success(); |
|   } |
| }; |
| ``` |
| |
| See [the talk](#the-talk) for more details on how to write these patterns. |
| |
| The |
| [pass itself](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L57) |
| is very small, and follows the basic pattern of any dialect conversion pass. |
| |
| ```c++ |
| void mlir::populateTensorBufferizePatterns( |
|     BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) { |
|   patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter, |
|                                                     patterns.getContext()); |
| } |
| |
| struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> { |
|   void runOnOperation() override { |
|     auto *context = &getContext(); |
|     BufferizeTypeConverter typeConverter; |
|     RewritePatternSet patterns(context); |
|     ConversionTarget target(*context); |
| |
|     populateTensorBufferizePatterns(typeConverter, patterns); |
|     // The ops this pass is responsible for must be converted away. |
|     target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>(); |
|     target.addLegalDialect<func::FuncDialect>(); |
|     // The materialization ops inserted by the framework may remain. |
|     populateBufferizeMaterializationLegality(target); |
| |
|     if (failed( |
|             applyPartialConversion(getOperation(), target, std::move(patterns)))) |
|       signalPassFailure(); |
|   } |
| }; |
| ``` |
| |
| The pass has all the hallmarks of a dialect conversion pass that does type |
| conversions: a `TypeConverter`, a `RewritePatternSet`, a `ConversionTarget`, and |
| a call to `applyPartialConversion`. Note that the function |
| `populateTensorBufferizePatterns` is factored out, so that power users can use |
| the patterns independently, if necessary (such as to combine multiple sets of |
| conversion patterns into a single conversion call, for performance). |
| |
| One convenient utility provided by the MLIR bufferization infrastructure is the |
| `BufferizeTypeConverter`, which comes pre-loaded with the necessary conversions |
| and materializations between `tensor` and `memref`. |
| |
| In this case, the `bufferization.to_tensor` and `bufferization.to_memref` ops, |
| which are inserted automatically by the dialect conversion framework as |
| materializations, are marked as legal so that they may remain in the IR after |
| the partial conversion. The helper `populateBufferizeMaterializationLegality` |
| ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L53)) |
| sets up this legality in general. |
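| |
| The helper is essentially a one-liner; a sketch of what it amounts to: |
| |
| ```c++ |
| // Mark the materialization ops as always legal, so that a partial conversion |
| // may leave them behind for later bufferization passes to resolve. |
| void populateBufferizeMaterializationLegality(ConversionTarget &target) { |
|   target.addLegalOp<bufferization::ToTensorOp, bufferization::ToMemrefOp>(); |
| } |
| ``` |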
| |
| ### Other partial bufferization examples |
| |
| - `scf-bufferize` |
| ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/SCF/Transforms/Bufferize.cpp#L1), |
| [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/SCF/bufferize.mlir#L1)) |
| |
| - Bufferizes ops from the `scf` dialect. |
| - This is an example of how to bufferize ops that implement |
| `RegionBranchOpInterface` (that is, they use regions to represent |
| control flow). |
| - The bulk of the work is done by |
| `lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp` |
| ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp#L1)), |
| which is well-commented and covers how to correctly convert ops that |
| contain regions; a short usage sketch follows this list. |
| |
| - `func-bufferize` |
| ([code](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/lib/Dialect/Func/Transforms/FuncBufferize.cpp#L1), |
| [test](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/test/Dialect/Func/func-bufferize.mlir#L1)) |
| |
| - Bufferizes `func`, `call`, and `BranchOpInterface` ops. |
| - This is an example of how to bufferize ops that have multi-block |
| regions. |
| - This is an example of a pass that is not split along dialect |
| subdivisions. |
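| |
| As referenced in the `scf-bufferize` item above, a partial bufferization pass |
| can be built on top of the structural type conversion helpers. The following |
| sketch assumes the upstream `populateSCFStructuralTypeConversionsAndLegality` |
| entry point; exact signatures may differ across versions: |
| |
| ```c++ |
| void SCFBufferizePass::runOnOperation() { |
|   auto *context = &getContext(); |
|   BufferizeTypeConverter typeConverter; |
|   RewritePatternSet patterns(context); |
|   ConversionTarget target(*context); |
| |
|   // Materialization ops may remain after this partial conversion. |
|   populateBufferizeMaterializationLegality(target); |
|   // Registers patterns and dynamic legality rules for scf.if, scf.for, etc. |
|   scf::populateSCFStructuralTypeConversionsAndLegality(typeConverter, patterns, |
|                                                        target); |
|   if (failed(applyPartialConversion(getOperation(), target, |
|                                     std::move(patterns)))) |
|     signalPassFailure(); |
| } |
| ``` |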
| |
| ### How to write a finalizing bufferization pass |
| |
| The contract of a finalizing bufferization pass is that all tensors are gone |
| from the program. |
| |
| The easiest way to write a finalizing bufferize pass is to not write one at all! |
| MLIR provides a pass `finalizing-bufferize` which eliminates the |
| `bufferization.to_tensor` / `bufferization.to_memref` materialization ops |
| inserted by partial bufferization passes and emits an error if that is not |
| sufficient to remove all tensors from the program. |
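| |
| Structurally, such a pass is a full conversion whose only patterns remove the |
| materializations. A sketch modeled on the upstream implementation (details such |
| as the exact base class are elided): |
| |
| ```c++ |
| void FinalizingBufferizePass::runOnOperation() { |
|   auto *context = &getContext(); |
|   BufferizeTypeConverter typeConverter; |
|   RewritePatternSet patterns(context); |
|   ConversionTarget target(*context); |
| |
|   // The only patterns fold away bufferization.to_tensor / to_memref pairs. |
|   populateEliminateBufferizeMaterializationsPatterns(typeConverter, patterns); |
| |
|   // An op is legal iff none of its operand or result types is a tensor, so |
|   // applyFullConversion emits an error for any op that still uses tensors. |
|   target.markUnknownOpDynamicallyLegal( |
|       [&](Operation *op) { return typeConverter.isLegal(op); }); |
| |
|   if (failed(applyFullConversion(getOperation(), target, std::move(patterns)))) |
|     signalPassFailure(); |
| } |
| ``` |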
| |
| This pass is sufficient when partial bufferization passes have bufferized all |
| the ops in the program, leaving behind only the materializations. When possible, |
| it is recommended to structure your pass pipeline this way, as this has the |
| significant advantage that if an op does not get bufferized (due to a missing |
| pattern, a bug in the code, etc.), `finalizing-bufferize` will emit a nice, |
| clean error, and the IR seen by `finalizing-bufferize` will contain only the |
| unbufferized op in question. |
| |
| However, before the current bufferization infrastructure was put in place, |
| bufferization could only be done as a single finalizing bufferization mega-pass |
| that used the `populate*BufferizePatterns` functions from multiple dialects to |
| simultaneously bufferize everything at once. Thus, one might see code in |
| downstream projects structured this way. This structure is not recommended in |
| new code. A helper, `populateEliminateBufferizeMaterializationsPatterns` |
| ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L58)), |
| is available for such passes to provide the patterns that eliminate |
| `bufferization.to_tensor` and `bufferization.to_memref`. |
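| |
| Schematically, such a legacy mega-pass differs from the finalizing sketch above |
| only in that the per-dialect pattern sets are added to the same conversion (the |
| `populate*BufferizePatterns` calls below stand in for whichever dialects a |
| project uses): |
| |
| ```c++ |
| // Not recommended for new code: bufferize everything in one full conversion. |
| populateTensorBufferizePatterns(typeConverter, patterns); |
| // ... populate*BufferizePatterns(...) calls for other dialects ... |
| populateEliminateBufferizeMaterializationsPatterns(typeConverter, patterns); |
| target.markUnknownOpDynamicallyLegal( |
|     [&](Operation *op) { return typeConverter.isLegal(op); }); |
| if (failed(applyFullConversion(getOperation(), target, std::move(patterns)))) |
|   signalPassFailure(); |
| ``` |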
| |
| ### Changes since [the talk](#the-talk) |
| |
| - `func-bufferize` was changed to be a partial conversion pass, and there is a |
| new `finalizing-bufferize` which serves as a general finalizing |
| bufferization pass. |
| - Most partial bufferization passes have been reimplemented in terms of |
| `BufferizableOpInterface`. New users should use One-Shot Bufferize instead |
| of dialect conversion-based bufferization. |