| # Generic DAG Rewriter Infrastructure Rationale |
| |
| This document details the rationale behind a general DAG-to-DAG rewrite |
| infrastructure for MLIR. For up-to-date documentation on the user-facing API,
| please see the main [Pattern Rewriting document](../PatternRewriter.md).
| |
| ## Introduction and Motivation |
| |
| The goal of a compiler IR is to represent code - at various levels of |
| abstraction which pose different sets of tradeoffs in terms of representational |
| capabilities and ease of transformation. However, the ability to represent code |
| is not itself very useful - you also need to be able to implement those |
| transformations. |
| |
| There are many different types of compiler transformations, but this document |
| focuses on a particularly important class of transformation that comes up |
| repeatedly at scale, and is important for the goals of MLIR: matching one DAG of |
| operations, and replacing with another. This is an integral part of many |
| compilers and necessary for peephole optimizations like "eliminate identity |
| nodes" or "replace x+0 with x", a generalized canonicalization framework (e.g. |
| Instruction Combiner in LLVM), as well as a useful abstraction to implement
| optimization algorithms for IR at multiple levels.
| |
| A particular strength of MLIR (and a major difference vs other compiler |
| infrastructures like LLVM, GCC, XLA, TensorFlow, etc) is that it uses a single |
| compiler IR to represent code at multiple levels of abstraction: an MLIR |
| operation can be a "TensorFlow operation", an "XLA HLO", an Affine Loop Nest, an |
| LLVM IR instruction (transitively including X86, Lanai, PTX, and other target |
| specific instructions), or anything else that the MLIR operation system can |
| reasonably express. Given that MLIR spans such a wide range of different problem |
| scopes, a single infrastructure for performing graph-to-graph rewrites can help |
| solve many diverse domain challenges. |
| |
| [Static single assignment](https://en.wikipedia.org/wiki/Static_single_assignment_form) |
| (SSA) representations like MLIR make it easy to access the operands and "users" |
| of an operation. As such, a natural abstraction for these graph-to-graph |
| rewrites is that of DAG pattern matching: clients define DAG tile patterns |
| (where a tile is a sequence of operations defining a subgraph of the DAG), and |
| each pattern includes a result DAG to produce and the cost of the result (or, |
| inversely, the benefit of doing the replacement). A common infrastructure |
| efficiently finds and performs the rewrites. |
| |
| While this concept is simple, the details are more nuanced. This document |
| defines and explores a set of abstractions that can solve a wide range of
| problems and that can be applied to the many different sorts of problems that
| MLIR faces - and is expected to face - over time. We do this by separating the
| pattern application algorithm from the "driver" of the computation loop, and
| by making space for the patterns to be defined declaratively.
| |
| ### Constant folding |
| |
| A degenerate but pervasive case of DAG-to-DAG pattern matching is constant |
| folding: an operation whose operands contain constants can often be folded to a |
| result constant value. |
| |
| MLIR operations may override a |
| [`fold`](../Canonicalization.md/#canonicalizing-with-the-fold-method) routine, which |
| exposes a simpler API compared to a general DAG-to-DAG pattern matcher, and |
| allows it to be applied in cases where a generic matcher would not be safe. For
| example, a DAG rewrite can remove arbitrary nodes in the current function, which
| could invalidate iterators. Constant folding as an API does not remove any |
| nodes, it just provides a (list of) constant values and allows the clients to |
| update their data structures as necessary. |
| |
| ## Related Work |
| |
| There is a huge amount of related work to consider, given that nearly every |
| compiler in existence has to solve this problem many times over. One unifying |
| problem is that all of these systems are designed to solve one particular, and
| usually narrow, problem; MLIR, on the other hand, would like to solve many of
| these problems within a single infrastructure. Here are a few related graph
| rewrite systems, along with the pros and cons of their work (the most similar
| design to the infrastructure present in MLIR is the LLVM DAG-to-DAG instruction
| selection algorithm).
| |
| ### AST-Level Pattern Matchers |
| |
| The literature is full of source-to-source translators which transform |
| identities in order to improve performance (e.g. transforming `X*0` into `0`). |
| One large example is the GCC `fold` function, which performs |
| [many optimizations](https://github.com/gcc-mirror/gcc/blob/master/gcc/fold-const.c) |
| on ASTs. Clang has |
| [similar routines](https://clang.llvm.org/docs/InternalsManual.html#constant-folding-in-the-clang-ast) |
| for simple constant folding of expressions (as required by the C++ standard) but |
| doesn't perform general optimizations on its ASTs. |
| |
| The primary downside of AST optimizers is that you can't see across operations |
| that have multiple uses. It is |
| [well known in literature](https://llvm.org/pubs/2008-06-LCTES-ISelUsingSSAGraphs.pdf) |
| that DAG pattern matching is more powerful than tree pattern matching, but on |
| the other hand, DAG pattern matching can lead to duplication of computation |
| which needs to be checked for. |
| |
| ### "Combiners" and other peephole optimizers |
| |
| Compilers end up with a lot of peephole optimizers for various things, e.g. the |
| GCC |
| ["combine" routines](https://github.com/gcc-mirror/gcc/blob/master/gcc/combine.c) |
| (which try to merge two machine instructions into a single one), the LLVM |
| [Inst Combine](https://github.com/llvm/llvm-project/tree/main/llvm/lib/Transforms/InstCombine) |
| [pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions), |
| LLVM's |
| [DAG Combiner](https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/DAGCombiner.cpp), |
| the Swift compiler's |
| [SIL Combiner](https://github.com/apple/swift/tree/main/lib/SILOptimizer/SILCombiner), |
| etc. These generally match one or more operations and produce zero or more |
| operations as a result. The LLVM |
| [Legalization](https://github.com/llvm/llvm-project/tree/main/llvm/lib/CodeGen/SelectionDAG) |
| infrastructure has a different outer loop but otherwise works the same way. |
| |
| These passes have a lot of diversity, but also have a unifying structure: they |
| mostly have a worklist outer loop which visits operations. They then use a |
| visitor pattern (or equivalent) to switch over the class of operation and |
| dispatch to a method. That method contains a long list of hand-written C++ code |
| that pattern-matches various special cases. LLVM introduced a "match" function |
| that allows writing patterns in a somewhat more declarative style using template |
| metaprogramming (MLIR has similar facilities). Here's a simple example: |
| |
| ```c++ |
| // Y - (X + 1) --> ~X + Y |
| if (match(Op1, m_OneUse(m_Add(m_Value(X), m_One())))) |
| return BinaryOperator::CreateAdd(Builder.CreateNot(X), Op0); |
| ``` |
| |
| Here is a somewhat more complicated one (this is not the biggest or most |
| complicated :) |
| |
| ```c++ |
| // C2 is ODD |
| // LHS = XOR(Y,C1), Y = AND(Z,C2), C1==(C2+1) => LHS == NEG(OR(Z, ~C2)) |
| // ADD(LHS, RHS) == SUB(RHS, OR(Z, ~C2)) |
| if (match(LHS, m_Xor(m_Value(Y), m_APInt(C1)))) |
| if (C1->countTrailingZeros() == 0) |
| if (match(Y, m_And(m_Value(Z), m_APInt(C2))) && *C1 == (*C2 + 1)) { |
| Value NewOr = Builder.CreateOr(Z, ~(*C2)); |
| return Builder.CreateSub(RHS, NewOr, "sub"); |
| } |
| ``` |
| |
| These systems are simple to set up, and pattern-matching templates have some
| advantages (they are extensible for new sorts of sub-patterns and look compact
| at the point of use). On the other hand, they have lots of well-known problems,
| for example:
| |
| * These patterns are very error prone to write, and contain lots of |
| redundancies. |
| *   The IR being matched often has identities (e.g. when matching commutative
|     operators) and the C++ code has to handle them manually - take a look at
|     [the full code](https://github.com/llvm/llvm-project/blob/c0b5000bd848303320c03f80fbf84d71e74518c9/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L767)
|     for `checkForNegativeOperand`, which defines the second pattern above.
| * The matching code compiles slowly, both because it generates tons of code |
| and because the templates instantiate slowly. |
| * Adding new patterns (e.g. for count leading zeros in the example above) is |
| awkward and doesn't often happen. |
| * The cost model for these patterns is not really defined - it is emergent |
| based on the order the patterns are matched in code. |
| * They are non-extensible without rebuilding the compiler. |
| * It isn't practical to apply theorem provers and other tools to these |
| patterns - they cannot be reused for other purposes. |
| |
| In addition to structured "combiners" like these, there are lots of ad-hoc |
| systems like the |
| [LLVM Machine code peephole optimizer](http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/PeepholeOptimizer.cpp?view=markup) |
| which are related. |
| |
| ### LLVM's DAG-to-DAG Instruction Selection Infrastructure |
| |
| The instruction selection subsystem in LLVM is the result of many years' worth
| iteration and discovery, driven by the need for LLVM to support code generation |
| for lots of targets, the complexity of code generators for modern instruction |
| sets (e.g. X86), and the fanatical pursuit of reusing code across targets. Eli |
| Bendersky wrote a |
| [nice short overview](https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1) |
| of how this works, and the |
| [LLVM documentation](https://llvm.org/docs/CodeGenerator.html#select-instructions-from-dag) |
| describes it in more depth, including its advantages and limitations. It allows
| writing patterns like this:
|
| ```tablegen
| def : Pat<(or GR64:$src, (not (add GR64:$src, 1))), |
| (BLCI64rr GR64:$src)>; |
| ``` |
| |
| This example defines a matcher for the
| ["blci" instruction](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_\(Trailing_Bit_Manipulation\))
| in the
| [X86 target description](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrInfo.td);
| there are many others in that file (look for `Pat<>` patterns, since they
| aren't entangled in details of the compiler like assembler/disassembler
| generation logic).
| |
| For the purposes of MLIR, there is much to like about this system, for example: |
| |
| * It is defined in a declarative format. |
| * It is extensible to target-defined operations. |
| * It automates matching across identities, like commutative patterns. |
| * It allows custom abstractions and intense factoring of target-specific |
| commonalities. |
| * It generates compact code - it compiles into a state machine, which is |
| interpreted. |
| * It allows the instruction patterns to be defined and reused for multiple |
| purposes. |
| * The patterns are "type checked" at compile time, detecting lots of bugs |
| early and eliminating redundancy from the pattern specifications. |
| * It allows the use of general C++ code for weird/complex cases. |
| |
| While there is a lot that is good here, there are also a few undesirable bits: |
| |
| * The representation is specifically designed and only applicable for |
| instruction selection, meaning that the directly adjacent problems like the |
| DAGCombiner and Legalizer can't use it. |
| *   This isn't extensible at compiler runtime; you have to rebuild the compiler
|     to extend it.
| * The error messages when failing to match a pattern |
| [are not exactly optimal](https://www.google.com/search?q=llvm+cannot+select). |
| * It has lots of implementation problems and limitations (e.g. can't write a |
| pattern for a multi-result operation) as a result of working with the |
| awkward SelectionDAG representation and being designed and implemented on |
| demand. |
| * Organic growth over time has left lots of sharp edges. |
| |
| ### Summary |
| |
| MLIR faces a wide range of pattern matching and graph rewrite problems, and one |
| of the major advantages of having a common representation for code at multiple |
| levels is that it allows for investing in - and highly leveraging - a single |
| infrastructure for doing this sort of work. |
| |
| ## Goals |
| |
| We'd like this infrastructure to encompass many problems in the MLIR space,
| including 1-to-N expansions (e.g. in type legalization during instruction
| selection, when an add of one bit width may be split into multiple adds of a
| smaller bit width),
| M-to-1 patterns (e.g. when converting a multiply+add into a single muladd |
| operation), as well as general M-to-N patterns (e.g. instruction selection for |
| target instructions). Patterns have a benefit associated with them, and the |
| common infrastructure should be responsible for sorting out the highest benefit |
| match for a given application. |
| |
| We separate the task of picking a particular optimal pattern from a given root |
| node, the algorithm used to rewrite an entire graph given a particular set of |
| goals, and the definition of the patterns themselves. We do this because DAG
| tile pattern matching is NP-complete, which means that there are no known
| polynomial-time algorithms to optimally solve this problem. Additionally, we
| would like to support
| iterative rewrite algorithms that progressively transform the input program |
| through multiple steps. Furthermore, we would like to support many different |
| sorts of clients across the MLIR stack, and they may have different tolerances |
| for compile time cost, different demands for optimality, and other algorithmic |
| goals or constraints. |
| |
| We aim for MLIR transformations to be easy to implement and to reduce the
| likelihood of compiler bugs. We expect there to be a very large number of
| patterns that are defined over time, and we believe that these sorts of patterns |
| will have a very large number of legality/validity constraints - many of which |
| are difficult to reason about in a consistent way, may be target specific, and |
| whose implementation may be particularly bug-prone. As such, we aim to design |
| the API around pattern definition to be simple, resilient to programmer errors,
| and to allow a separation of concerns between the legality of the generated
| nodes and the pattern being defined.
| |
| Finally, error handling is a topmost concern: we want pattern match failures to
| be diagnosable in a reasonable way. This is a difficult problem in general, as
| the space of malfunctions is too great to be fully enumerated and handled
| optimally, but MLIR is already designed to represent the provenance of an |
| operation well. The aim of the pattern rewriting infrastructure is simply to |
| propagate that provenance information precisely, as well as diagnose pattern |
| match failures with the rationale for why a set of patterns do not apply. |
| |
| ### Non goals |
| |
| The pattern infrastructure does not aim to solve all compiler problems; it is
| simply a DAG-to-DAG pattern matching system. Compiler algorithms that require
| global dataflow analysis (e.g. common subexpression elimination, conditional
| constant propagation, and many, many others) will not be directly solved by this
| infrastructure. |
| |
| This infrastructure is limited to DAG patterns, which (by definition) prevent |
| the patterns from seeing across cycles in a graph. In an SSA-based IR like MLIR, |
| this means that these patterns don't see across basic block arguments. We
| consider this acceptable given the set of problems we are trying to solve - we
| don't know of any other system that attempts to match across block boundaries,
| and consider the payoff of worrying about this to be low.
| |
| This design includes the ability for DAG patterns to have associated benefits, |
| but those benefits are defined in terms of magic numbers (typically equal to the |
| number of nodes being replaced). For any given application, the units of magic |
| numbers will have to be defined. |