| # MLIR: Incremental Application to Graph Algorithms in ML Frameworks |
| |
The existing documentation about MLIR focuses on long-term vision, how its
pieces fit together, and the benefits of modular and composable infrastructure
in the vast and distant future. While this viewpoint appeals to some, it causes
concern for others who care about the "here and now" - why does it make sense
to make a "revolutionary" change when any individual problem can be fixed in
place?
| |
This document explains that the adoption of MLIR to solve graph-based problems
*isn't* a revolutionary change: it is an incremental series of steps which
build on each other, each of which delivers local value. This document also
addresses some points of confusion that keep coming up.
| |
One note: even though a major advantage of MLIR is that it can span the full
spectrum from graph algorithms down to low-level code generation, this document
focuses on the use of MLIR for **graph-level algorithms**. MLIR will also
unlock exciting code generation opportunities (particularly given its novel
approach to integrating state-of-the-art polyhedral techniques), but issues
that touch on MLIR's relationship to XLA, Eigen, etc., are out of scope for
this particular doc.
| |
| This document uses TensorFlow as the example given that it is the focus of our |
| immediate work, but we believe that the same viewpoint could be useful for |
| people working in the context of other ML frameworks that may consider adopting |
| MLIR in the future. |
| |
## How is MLIR relevant?
| |
MLIR is an overloaded acronym which unpacks as "Multi-Level Intermediate
Representation". Its high-level purpose is to provide mechanics for describing
and transforming programs and computations in a flexible way. It provides
common compiler infrastructure for things like constant folding, dead code
elimination, and graph rewriting, which is independent of the representational
choices picked by a given dialect (e.g. its concurrency semantics). It was
built with a specific focus on compile-time and memory efficiency, accurate
propagation of source location information (important for reporting
high-quality errors and warnings), and testability.
| |
TensorFlow has numerous subsystems (some of which are proprietary, e.g.
TensorRT, nGraph, Core ML, etc.) as well as translation layers between these
different subsystems, and these translation layers face similar challenges. (As
an aside, the internals of each of these subsystems could often benefit from
MLIR infrastructure, but that isn't a focus of this doc.)
| |
A key observation that MLIR makes is that these subsystems often have two
things going on: they are both particular data structures and encodings (e.g.
HLO graphs, TF Lite's flat buffer format, TensorFlow's Graph format, the ONNX
abstraction, etc.) and an abstraction of computation (a specific way of
modeling a convolution, a set of supported operations, etc.).
| |
| MLIR uses a standard IR (i.e., a set of data structures) for representing these |
| computations - this allows a huge amount of shared infrastructure across these |
| problem domains. MLIR then allows the definition of domain-specific "dialects" |
| that describe the set of operations that are legal and supported for a given |
| application. This means that the actual translations between data structures are |
| kept as simple as possible - and are thus relatively easy to make "correct". |
| This allows the common compiler infrastructure to handle the mapping problems |
| and the other issues within the domain. |
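
To make this concrete, here is a minimal C++ sketch (against current MLIR
APIs, which evolve over time; `runSharedCleanups` is an illustrative name, not
an MLIR API) of dialect-independent infrastructure in action: the same cleanup
pipeline runs over a module regardless of which dialect's operations it holds.

```c++
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// The same dialect-independent cleanups run over a module whether it holds
// TensorFlow, TF Lite, or XLA HLO dialect operations.
mlir::LogicalResult runSharedCleanups(mlir::ModuleOp module) {
  mlir::PassManager pm(module.getContext());
  pm.addPass(mlir::createCanonicalizerPass()); // folding + pattern rewrites
  pm.addPass(mlir::createCSEPass());           // common subexpression elim.
  return pm.run(module);
}
```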
| |
MLIR's design is directly informed by the experience of building (and then
living with) intermediate representations like LLVM IR, LLVM SelectionDAG, the
LLVM machine instruction representation, and Swift SIL, and it learns new
lessons from TensorFlow and XLA HLO, as well as from building countless
research and production systems on top of them. Our goal is to drag the state
of the art in compilers forward, not to merely apply a few well-known
techniques to the machine learning domain.
| |
## What does adoption mean?
| |
| The point of this document is not to advocate for rewriting any particular |
| subsystem in TensorFlow - indeed, the burden required to justify a rewrite is |
| high, and often very specific to that subsystem. That said, there are several |
| subsystems that are about to get rewritten or substantially revised anyway, so |
| we use those as examples to concretely describe the benefits that MLIR provides |
| in these cases and what it will take. The subsystems discussed are: |
| |
1. the TF Lite TOCO translator, which we need to improve in terms of error
   reporting and reliability, and generalize to support more ops, and
1. the TF/XLA bridge, which needs to improve usability by merging some of its
   usage models, support dynamic shapes, and generalize guest subsystem
   support to TensorRT and nGraph.
1. Grappler is another subsystem that is likely to get substantial revisions in
   the future, and would definitely benefit from the MLIR framework, but there
   are no known plans to do that work at this point, so we don't discuss it
   further.
| |
Adopting MLIR for each of these works the same way - and, in fact, the work to
support TF Lite is mostly a subset of the larger work to support the
functionality of the TF/XLA bridge. TF Lite and the TF/XLA bridge include
several compiler passes (things like encapsulate, functionalize control flow,
lowering of ops, fusion, constant folding, shape inference, etc.).
| |
| MLIR supports converting from TensorFlow Graphs to MLIR and back, which means |
| that we can start by putting in a no-op translation to MLIR and back into the |
| pipeline, and verify that nothing breaks. Then we can work on replacing the |
| compiler transformations one by one by reimplementing them (with the improved |
| algorithms that we're planning). |
| |
This is a development plan: we wouldn't actually ship a TensorFlow that just
uses MLIR for a single pass. In practice, we'll keep the MLIR path gated behind
a flag, build out a replacement for an entire subsystem (e.g. the TOCO
translator), and when the time is right, we'll do A/B comparisons, eventually
make the switch, and phase out the old code over time.
| |
| ## What benefit does MLIR provide? |
| |
| The adoption plan above might sound like it only makes things worse in the |
| immediate term - we have two implementations of the same functionality, we are |
| dividing our efforts, etc. In order for this to be worth it, we should have a |
| good sense that we are building towards an improved future that will make |
| customers and TensorFlow engineers happier when it lands. Here we describe a few |
| of the benefits that MLIR provides, in no particular order: |
| |
| ### A Lossless Human Editable Textual Representation |
| |
| The MLIR in-memory data structure has a human readable and writable format, as |
| well as [a specification](../LangRef.md) for that format - built just like any |
| other programming language. Important properties of this format are that it is |
| compact, easy to read, and lossless. You can dump an MLIR program out to disk |
| and munge around with it, then send it through a few more passes. |
| |
If you haven't worked with a system that works this way, it is hard to
overstate how big of a deal this is in practice: it means that you can call
`foo->dump()` on an IR object to see its full contents, that you can diff the
IR before and after a change, delta reduce IR files, and many other things.
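
As a hedged sketch of what losslessness means programmatically (API spellings
current as of recent MLIR and subject to change; `roundTrip` is an
illustrative name), printing a module and re-parsing the text yields an
equivalent module:

```c++
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Parser/Parser.h"
#include "llvm/Support/raw_ostream.h"

#include <string>

// Because the textual form is lossless, print-then-parse round-trips the IR.
mlir::OwningOpRef<mlir::ModuleOp> roundTrip(mlir::ModuleOp module) {
  std::string text;
  llvm::raw_string_ostream os(text);
  module.print(os); // the same form that `dump()` writes to stderr
  os.flush();
  return mlir::parseSourceString<mlir::ModuleOp>(text, module.getContext());
}
```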
| |
| ### A Graph Verification Pass |
| |
| Like many other popular compiler infrastructures, MLIR provides infrastructure |
| and implementation for a "verifier" which checks that the IR is well formed. The |
| MLIR verifier is a simple framework that makes it easy to provide a single |
| source of truth for those correctness properties and is general across all |
| Dialects (e.g. TF Graph, TF Lite flat buffer, XLA HLO, etc). |
| |
| A verifier pass is sort of like a 'super assertion' that catches mistakes in |
| program transformations early, making you as an engineer more productive, making |
| the product more reliable, and making it easier to track down bugs when they |
| appear - because the verifier can be run at any time, either as a compiler pass |
| or with a single function call. |
| |
| While MLIR provides a well-considered infrastructure for IR verification, and |
| has simple checks for existing TensorFlow operations, there is a lot that should |
| be added here and lots of opportunity to get involved! |
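
Concretely, the "single function call" form looks roughly like this (a sketch;
`checkInvariants` is an illustrative name, and the verifier API spelling may
evolve):

```c++
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/Verifier.h"
#include "llvm/Support/ErrorHandling.h"

// Run the verifier as a 'super assertion' right after a transformation.
void checkInvariants(mlir::ModuleOp module) {
  // verify() emits diagnostics describing any violated invariant.
  if (mlir::failed(mlir::verify(module)))
    llvm::report_fatal_error("IR failed to verify after transformation");
}
```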
| |
| ### Designed for Testability |
| |
| There are many aspects of this in MLIR, but we'll focus on compiler |
| transformations since they are the easiest to understand. Compiler |
| transformations are modeled as subclasses of the `Pass` C++ class, which are |
driven by the `mlir-opt` tool. When combined with a lossless textual
representation, it becomes really easy to write unit tests for compiler
transformations. For example, here is a simple test showing that "x-x" is
turned into zero:
| |
| ```mlir |
| // RUN: mlir-opt %s -canonicalize | FileCheck %s |
| func @test_subi_zero_cfg(%arg0: i32) -> i32 { |
| %y = arith.subi %arg0, %arg0 : i32 |
| return %y: i32 |
| } |
| // CHECK-LABEL: func @test_subi_zero_cfg(%arg0: i32) |
| // CHECK-NEXT: %c0_i32 = arith.constant 0 : i32 |
// CHECK-NEXT: return %c0_i32
| ``` |
| |
| The "CHECK" comments are interpreted by the |
| [LLVM FileCheck tool](https://llvm.org/docs/CommandGuide/FileCheck.html), which |
| is sort of like a really advanced grep. This test is fully self-contained: it |
| feeds the input into the [canonicalize pass](../Canonicalization.md), and checks |
| that the output matches the CHECK lines. See the `test/Transforms` directory for |
| more examples. In contrast, standard unit testing exposes the API of the |
| underlying framework to lots and lots of tests (making it harder to refactor and |
| move the API), typically requires a lot more code, and exacerbates issues with |
| link time. For examples, see |
| [the TEST_F functions in TensorFlow's testsuite](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc). |
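
For reference, the transformation under test plugs into this workflow as a
`Pass` subclass. A rough sketch in current MLIR style (`MySimplifyPass` and
its `-my-simplify` flag are illustrative; registration details have changed
across MLIR versions):

```c++
#include "mlir/Pass/Pass.h"

namespace {
// A skeletal pass that mlir-opt could drive as `-my-simplify`.
struct MySimplifyPass
    : public mlir::PassWrapper<MySimplifyPass, mlir::OperationPass<>> {
  llvm::StringRef getArgument() const final { return "my-simplify"; }
  llvm::StringRef getDescription() const final {
    return "Illustrative simplification pass";
  }
  void runOnOperation() override {
    // Walk getOperation() and apply rewrites; signalPassFailure() on error.
  }
};
} // namespace

// Once registered (e.g. via mlir::PassRegistration<MySimplifyPass>), the pass
// is available to mlir-opt and thus to FileCheck-style tests like the above.
```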
| |
MLIR has been pervasively designed around this sort of testability, allowing
us to put in place a culture that expects every behavior-changing commit to
include a test case, and for these test cases to be stable and reliable over
time, since they test exactly what they are supposed to. End-to-end
integration tests are still super useful for some things, of course!
| |
| ### Infrastructure for Warnings and Error Diagnostics and Location Tracking |
| |
MLIR benefits from the lessons learned from building other compilers -
including Clang, which
[set the standard](http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html)
for quality of implementation in C/C++ compiler diagnostics. Drawing from this
experience (and fixing mistakes in LLVM), MLIR requires that operations and
functions carry abstract location information and that transformations
propagate this information; it also provides standardized mechanisms to emit
errors and warnings, as well as hooks for clients to capture and report them
in custom ways.
| |
| Why is this important? In practice, many graph-to-graph translators can fail |
| (e.g. TF Lite when an unsupported op is used) and it is important to be able to |
| report the error up through to the user in the most precise way possible, in |
| order for it to be actionable. This includes tracking rewrites through fusions |
| and fissions of ops, mapping back into language / API specific domains, etc. |
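
In API terms this looks roughly like the following sketch
(`rejectUnsupportedOp` is an illustrative name; whichever diagnostic handler
the client registered decides how the error is ultimately rendered):

```c++
#include "mlir/IR/Operation.h"

// Emit an actionable error anchored at the op's source location, with an
// attached note; registered diagnostic handlers capture and render it.
mlir::LogicalResult rejectUnsupportedOp(mlir::Operation *op) {
  op->emitError("op is not supported by this translator")
          .attachNote()
      << "while converting the enclosing graph";
  return mlir::failure();
}
```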
| |
| More selfishly for infrastructure hackers, this is a huge boon because it means |
| that it is easy to write good tests for this: the testing tools for MLIR capture |
| the diagnostics produced by passes (using the standard diagnostic hooks) and |
| check that they match the expected diagnostics in the testcase. For example, to |
| test the dependence analysis infra in the code generator, Andy Davis wrote a |
| simple pass that checks dependencies and emits them as "notes", allowing him to |
| write tests like this: |
| |
| ```mlir |
| // RUN: mlir-opt %s -memref-dependence-check -verify-diagnostics |
| func @different_memrefs() { |
| %m.a = memref.alloc() : memref<100xf32> |
| %m.b = memref.alloc() : memref<100xf32> |
| %c0 = arith.constant 0 : index |
| %c1 = arith.constant 1.0 : f32 |
| memref.store %c1, %m.a[%c0] : memref<100xf32> |
| // expected-note@-1 {{dependence from memref access 0 to access 1 = false}} |
| %v0 = memref.load %m.b[%c0] : memref<100xf32> |
| return |
| } |
| ``` |
| |
Note that a major limitation of this is that MLIR suffers from "garbage in,
garbage out": if the locations coming into MLIR are imprecise, there is nothing
it can do to recover them. There is work underway in TensorFlow/Python to
improve the situation, and Swift for TensorFlow already has perfect location
tracking due to its design.
| |
| ### Shape Information Captured in the IR |
| |
In TensorFlow Graphs, each op takes and returns values using a very simple
type system (TF_DataType) in which each value is a tensor of unknown rank and
dimensions. At the same time, many graphs have statically knowable shapes for
wide swaths of the computation, and even dynamically shaped operations often
have statically knowable dimensions. Many analyses and transformations benefit
from and use this information when available, but because TensorFlow graphs
don't capture it (e.g. serialize it to proto), passes have to recompute it on
demand with ShapeRefiner.
| |
| The [MLIR Tensor Type](../Dialects/Builtin.md/#rankedtensortype) directly |
| captures shape information, so you can have things like: |
| |
| ```mlir |
%z = tf.Add %x, %y : tensor<128 x 8 x ? x f32>
| ``` |
| |
Capturing this in the IR is expected to speed up transformations (by avoiding
recomputing the same information over and over again), which in turn makes it
practical to apply stronger shape analysis algorithms. It also makes the IR
easier to work with, because on-the-side representations can get out of date
and because the API is better from an ergonomics perspective.
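
As a small illustration (a sketch against recent MLIR APIs;
`hasStaticLeadingDim` is an illustrative helper, not an MLIR API), a
transformation can read shape facts straight off the type instead of
recomputing them with an on-the-side analysis:

```c++
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Value.h"

// For a value of type tensor<128x8x?xf32>: rank 3, dims 0 and 1 static,
// dim 2 dynamic - all readable directly from the IR, no ShapeRefiner run.
bool hasStaticLeadingDim(mlir::Value value) {
  auto type = llvm::dyn_cast<mlir::RankedTensorType>(value.getType());
  return type && type.getRank() > 0 && !type.isDynamicDim(0);
}
```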
| |
| ### Unified Graph Rewriting Infrastructure |
| |
| This is still a work in progress, but we have sightlines towards a |
| [general rewriting infrastructure](RationaleGenericDAGRewriter.md) for |
| transforming DAG tiles into other DAG tiles, using a declarative pattern format. |
DAG-to-DAG rewriting is a generalized solution for many common compiler
optimizations, lowerings, and other rewrites, and having a unified IR enables
us to invest in building a single high-quality implementation.
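
To give a taste of the C++ that such declarative patterns expand into (a
sketch using the `arith` ops from the test above; `SubSelfToZero` is an
illustrative name):

```c++
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/PatternMatch.h"

// DAG-to-DAG rewrite: replace `arith.subi %x, %x` with a zero constant,
// the same transformation the canonicalization test above checks for.
struct SubSelfToZero : public mlir::OpRewritePattern<mlir::arith::SubIOp> {
  using OpRewritePattern::OpRewritePattern;

  mlir::LogicalResult
  matchAndRewrite(mlir::arith::SubIOp op,
                  mlir::PatternRewriter &rewriter) const override {
    if (op.getLhs() != op.getRhs())
      return mlir::failure();
    rewriter.replaceOpWithNewOp<mlir::arith::ConstantOp>(
        op, rewriter.getZeroAttr(op.getType()));
    return mlir::success();
  }
};
```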
| |
| Declarative pattern rules are preferable to imperative C++ code for a number of |
| reasons: they are more compact, easier to reason about, can have checkers |
| written against them, and new tools can be built that inspect and manipulate the |
| declarative patterns in interesting ways - e.g. applying theorem provers to |
| them. It will be exciting to see this ecosystem develop as the infrastructure |
| matures. |
| |
| ### Clarified Semantics for TensorFlow Operations |
| |
| One of the challenging things about working with TensorFlow is that there are |
| many invariants and behaviors that need to be preserved and known about when |
| working with Graphs, and these can be difficult to reason about and lead to |
bugs. Things like 'dead values', Switch and Merge nodes, concurrency
semantics, nodes that execute even when passed a dead value, and
multiple-device program representation all add complexities that can make it
challenging to reason about whether a transformation or analysis is correct in
general. Even something as simple as constant folding or transforming integer
`x-x` into `0` is non-trivial, because you need to consider control dependence
edges.
| |
One of our major goals for the TensorFlow dialect of MLIR is to sort out these
situations and upgrade existing TensorFlow graphs to semantics that are easier
to reason about. The solutions to these problems are all still being debated,
but those discussions have already yielded a lot of potential answers:
introducing a `tf_dead_or<x>` type for switch/merge, modeling TF operations
using futures/async semantics, etc. None of these particular battles are
critical for MLIR to succeed (because of its "meta" nature, the abstraction
decisions of any given dialect are up to that dialect to make), but each one
that works out will make it easier to work with and transform TensorFlow
operations. We expect these issues to get nailed down in the next couple of
months as the MLIR effort moves beyond TF Lite / TOCO support. The discussions
that are happening now are super valuable and making progress.
| |
| ### Ergonomics |
| |
A minor-in-theory, but important-in-practice point is that MLIR is designed to
make transforming code easier, more memory efficient, and less error prone
than other systems. `TensorFlow::Graph` has implementation issues where the
same information is stored redundantly in different places (and must be
manually kept in sync), it represents certain constructs unusually (e.g. the
function library, which makes it very difficult to add or remove functions,
for example during interprocedural transformations), and it stores information
in the graph that is used by the executor but isn't necessary for program
transformation.
| |
TensorFlow has made a lot of progress in this area over the years, and there
are lots of ideas about further improvements in the future. We are happy that
MLIR addresses these needs (making it much easier to implement correct program
transformations) today, and we are committed to pushing hard to make it
better.
| |
| ### Compile Time Performance and Memory Use |
| |
| MLIR has been designed to be memory and compile-time efficient in its algorithms |
| and data structures, using immutable and uniqued structures, low level |
| bit-packing, and other well-known techniques to avoid unnecessary heap |
| allocations, and allow simple and safe multithreaded optimization of MLIR |
| programs. There are other reasons to believe that the MLIR implementations of |
| common transformations will be more efficient than the Python and C++ |
| TensorFlow::Graph implementations of the same things, given the current |
| implementation details of TensorFlow. |
| |
That said, this is very much a theory at this point. When the new
implementations of the various subsystems are available, we will see what
happens in practice: there will be no reason to speculate - we can measure.
| |
| ## Common Questions and Concerns |
| |
| Here we address some frequently asked questions and concerns. |
| |
| ### Isn't MLIR a big dependency to take on? |
| |
We've heard that at least some people are concerned that MLIR is a "big"
dependency to take on, and could result in large code size. Here are some key
points about MLIR:
| |
| 1. The entire MLIR codebase is a pretty small C++ code base in absolute terms |
| compared to what goes into a modern ML framework. |
| 1. Like LLVM, MLIR is designed as a set of libraries that clients can link in |
   or ignore as they wish. For example, the transformations in MLIR are kept
| separate from the core IR abstractions, and dialect specific code (e.g. |
| TensorFlow, TF-Lite, XLA, etc) is all independently selectable by the build |
| system. Clients that don't care about XLA don't link in that code, whether |
| they are a TF-Lite system or a client that is completely unrelated to |
| TensorFlow. |
| 1. MLIR's only third party dependency is on LLVM, but it doesn't depend on LLVM |
| IR or any other heavy dependency - it just depends on LLVM's support library |
| which provides efficient hash tables and other |
| [memory efficient data structures that the STL does not](http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task). |
| There have been discussions about splitting this set of libraries out to its |
| own subproject in LLVM that the LLVM IR project depends on. This would be |
| great for MLIR as well as other LLVM subprojects. |
1. TensorFlow and many other frameworks already use LLVM - in those cases,
   MLIR would not be pulling in an additional dependency at all.
| |
| ### How does MLIR represent {control flow, concurrency, …} semantics in TensorFlow? |
| |
| MLIR provides a dialect that is an isomorphic 1-1 mapping between TensorFlow |
| graphs and MLIR, as well as a pretty complete translator back and forth (the |
| only known gap is that a few TF_DataType enums aren't handled yet). MLIR is a |
| "Multi-Level IR", which allows it to represent code with different abstraction |
| levels, so the ability to faithfully represent TensorFlow code in a completely |
| backwards compatible way (even if there are some historical warts!) is critical. |
| |
In *addition* to the isomorphic mapping, we are actively working on efforts to
raise the abstraction level for working with TensorFlow graphs in MLIR. Doing
so would make it even easier to write TensorFlow transformations than it is
today, and would provide a path to migrating TF 1.x graphs forward into the TF
2.x world. For example, because MLIR has an extensible type system, we can
directly model in the type whether a Tensor value can ever be "dead" - similar
to the use of optional types in modern programming languages.
| |
| These discussions occasionally cause confusion because there are several issues |
| being mixed up into one: |
| |
| * What are the current semantics of TensorFlow graphs, and what invariants can |
| we rely on? |
| * What should the semantics be in TensorFlow 2.0? |
| * What do programs rely on in practice, and if it is unfriendly, can we |
| migrate it? |
| * Can we find a way to make it so transforms don't have to worry about the |
| complexities of Switch/Merge, by using higher level control flow |
| representations? (tentative answer: yes) |
| * How should MLIR represent async vs sync operations, what invariants are |
| provided, how does this dovetail with control flow? |
| * When is it safe and beneficial to perform optimizations that might reduce |
| parallelism? |
| |
| All of these questions have a "conservative/safe fallback": we can continue |
| providing exactly the same abstractions that TensorFlow always has. That said, |
we are trying hard to level-up the representation (taking advantage of the
"Multi-Level" part of MLIR) because doing so will make it much, much easier to
write analyses and transformations than it currently is in TensorFlow.
| |
### Non-Goals

It is important to point out things that MLIR does not aim to do. For example,
there is no runtime component to MLIR: the TensorFlow executor, the TF Lite
FlatBuffer interpreter, and other existing runtimes should be used as-is.

Another non-goal is a stable binary encoding: MLIR doesn't currently support
one. We will certainly add this at some point, but existing formats should be
used for serialization and distribution in the meantime.