| # 'acc' Dialect |
| |
| The `acc` dialect is an MLIR dialect for representing the OpenACC |
| programming model. OpenACC is a standardized directive-based model which |
| is used with C, C++, and Fortran to enable programmers to expose |
| parallelism in their code. The descriptive approach used by OpenACC |
| allows targeting of parallel multicore and accelerator targets like GPUs |
| by giving the compiler the freedom of how to parallelize for specific |
| architectures. OpenACC also provides the ability to optimize the |
| parallelism through increasingly more prescriptive clauses. |
| |
| This dialect models the constructs from the |
| [OpenACC 3.3 specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.3-final.pdf). |
| |
| This document describes the design of the OpenACC dialect in MLIR. It |
| lists and explains design goals and design choices along with their |
| rationale. It also describes specifics with regard to `acc` dialect |
| operations, types, and attributes. |
| |
| [TOC] |
| |
| ## Dialect Design Goals |
| |
| * Needs to have a complete representation of the OpenACC language. |
| - A frontend requires this in order to properly generate a |
| representation of possible `acc` pragmas in MLIR. Additionally, |
| this dialect is expected to be further lowered when materializing |
| its semantics. Without a complete representation, a frontend might |
| choose a lower abstraction (such as a direct runtime call), but this |
| would impact the ability to do analysis and optimizations on the |
| dialect. |
| * Allow representation at the same semantic level as the OpenACC |
| language while having the capability to represent nuances of the source |
| language semantics (such as Fortran descriptors) in an agnostic manner. |
| - Using abstractions that closely model the OpenACC language |
| simplifies frontend implementation. It also allows for easier |
| debugging of the IR. However, sometimes source-language-specific |
| behavior is needed when materializing OpenACC. In these cases, such |
| as privatization of C++ objects with a default constructor, the |
| frontend fills in the `recipe` along with the `private` operation, |
| which can be packaged neatly with the `acc` dialect operations. |
| * Be able to regenerate the semantic equivalent of the user pragmas from |
| the dialect (including bounds, names, clauses, modifiers, etc). |
| - This is a strong measure of making sure that the dialect is not |
| lossy in semantics. It also makes it possible to generate |
| appropriate and useful debug information outside of the frontend. |
| * Be dialect agnostic so that it can be used and coexist with other |
| dialects including but not limited to `hlfir`, `fir`, `llvm`, `cir`. |
| - Directive-based models such as OpenACC are always used with a |
| source language, so the `acc` dialect coexisting with other |
| dialect(s) is necessary by construction. Through proper |
| abstractions, neither the `acc` dialect nor the source language |
| dialect should have dependencies on each other; where needed, |
| interfaces should be used to ensure the `acc` dialect can verify |
| expected properties. |
| * The dialect must allow dataflow to be modeled accurately and |
| performantly using MLIR's existing facilities. |
| - Appropriate dataflow modeling is important for analyses and IR |
| reasoning, even for something as simple as walking the uses. Therefore |
| operations, like data operations, are expected to generate results |
| which can be used in modeling behavior. For example, consider an |
| `acc copyin` clause. After the `acc.copyin` operation, a pointer |
| which lives on the device should be distinguishable from one that |
| lives in host memory. |
| * Be friendly to MLIR optimization passes by implementing common |
| interfaces. |
| - Interfaces, such as `MemoryEffects`, are the key way MLIR |
| transformations and analyses are designed to interact with the IR. |
| In order for the operations in the `acc` dialect to be optimizable |
| (either directly or even indirectly by not blocking optimizations |
| of nested IR), implementing relevant common interfaces is needed. |
| |
| The design philosophy of the `acc` dialect follows directly from these |
| goals: all current and planned operations, attributes, and types must |
| adhere to them. |
| |
| ## Operation Categories |
| |
| The OpenACC dialect includes high-level operations (which retain |
| the same semantic meaning as their OpenACC language equivalent), |
| intermediate-level operations (which are used to decompose clauses |
| from constructs), and low-level operations (which encode specifics |
| associated with the source language in a generic way). |
| |
| The high-level operations list contains the following OpenACC language |
| constructs and their corresponding operations: |
| * `acc parallel` → `acc.parallel` |
| * `acc kernels` → `acc.kernels` |
| * `acc serial` → `acc.serial` |
| * `acc data` → `acc.data` |
| * `acc loop` → `acc.loop` |
| * `acc enter data` → `acc.enter_data` |
| * `acc exit data` → `acc.exit_data` |
| * `acc host_data` → `acc.host_data` |
| * `acc init` → `acc.init` |
| * `acc shutdown` → `acc.shutdown` |
| * `acc update` → `acc.update` |
| * `acc set` → `acc.set` |
| * `acc wait` → `acc.wait` |
| * `acc atomic read` → `acc.atomic.read` |
| * `acc atomic write` → `acc.atomic.write` |
| * `acc atomic update` → `acc.atomic.update` |
| * `acc atomic capture` → `acc.atomic.capture` |
| |
| This second group contains operations which are used to represent |
| either decomposed constructs or clauses for more accurate modeling: |
| * `acc routine` → `acc.routine` + `acc.routine_info` attribute |
| * `acc declare` → `acc.declare_enter` + `acc.declare_exit` or |
| `acc.declare` |
| * `acc {construct} copyin` → `acc.copyin` (before region) + |
| `acc.delete` (after region) |
| * `acc {construct} copy` → `acc.copyin` (before region) + |
| `acc.copyout` (after region) |
| * `acc {construct} copyout` → `acc.create` (before region) + |
| `acc.copyout` (after region) |
| * `acc {construct} attach` → `acc.attach` (before region) + |
| `acc.detach` (after region) |
| * `acc {construct} create` → `acc.create` (before region) + |
| `acc.delete` (after region) |
| * `acc {construct} present` → `acc.present` (before region) + |
| `acc.delete` (after region) |
| * `acc {construct} no_create` → `acc.nocreate` (before region) + |
| `acc.delete` (after region) |
| * `acc {construct} deviceptr` → `acc.deviceptr` |
| * `acc {construct} private` → `acc.private` |
| * `acc {construct} firstprivate` → `acc.firstprivate` |
| * `acc {construct} reduction` → `acc.reduction` |
| * `acc cache` → `acc.cache` |
| * `acc update device` → `acc.update_device` |
| * `acc update host` → `acc.update_host` |
| * `acc host_data use_device` → `acc.use_device` |
| * `acc declare device_resident` → `acc.declare_device_resident` |
| * `acc declare link` → `acc.declare_link` |
| * `acc exit data delete` → `acc.delete` (with `structured` flag as |
| false) |
| * `acc exit data detach` → `acc.detach` (with `structured` flag as |
| false) |
| * `acc {construct} {data_clause}(var[lb:ub])` → `acc.bounds` |
| |
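The entry/exit pairing of the structured data clauses listed above can be summarized as a lookup table. The following is only an illustrative Python sketch: the `DECOMPOSITION` table and the `decompose` helper are hypothetical names, not part of MLIR, and the table covers only the clauses for which the list gives an operation before the region.

```python
# Entry/exit decomposition of structured data clauses, transcribed from the
# list above. None means the list names no exit operation for that clause.
DECOMPOSITION = {
    "copyin": ("acc.copyin", "acc.delete"),
    "copy": ("acc.copyin", "acc.copyout"),
    "copyout": ("acc.create", "acc.copyout"),
    "attach": ("acc.attach", "acc.detach"),
    "create": ("acc.create", "acc.delete"),
    "present": ("acc.present", "acc.delete"),
    "no_create": ("acc.nocreate", "acc.delete"),
    "deviceptr": ("acc.deviceptr", None),
}


def decompose(clause):
    """Return the (entry_op, exit_op) pair for a structured data clause."""
    if clause not in DECOMPOSITION:
        raise KeyError(f"no decomposition listed for clause '{clause}'")
    return DECOMPOSITION[clause]


# An `acc copy` clause becomes an entry `acc.copyin` and an exit `acc.copyout`.
entry_op, exit_op = decompose("copy")
print(entry_op, exit_op)
```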
| The low-level operations are: |
| * `acc.private.recipe` |
| * `acc.reduction.recipe` |
| * `acc.firstprivate.recipe` |
| * `acc.global_ctor` |
| * `acc.global_dtor` |
| * `acc.yield` |
| * `acc.terminator` |
| |
| The semantics and reasoning of the low-level operations are further |
| explained in the sections below. |
| |
| ### Data Operations |
| |
| #### Data Clause Decomposition |
| The data clauses are decomposed from their constructs for better |
| dataflow modeling in MLIR. There are multiple reasons for this which |
| are consistent with the dialect goals: |
| * Correctly represents dataflow. Data clauses have different effects |
| at entry to a region and at exit from it. |
| * Makes it easier to attach interfaces such as `MemoryEffects` to a |
| single operation. This can better reflect semantics (like the fact that |
| an `acc.copyin` operation only reads host memory). |
| * Operations can be moved or optimized individually (e.g. by `CSE`). |
| * Easier to keep track of debug information. Line location can point to |
| the text representing the data clause instead of the construct. |
| Additionally, attributes can be used to keep track of variable names in |
| clauses without having to walk the IR tree in an attempt to recover the |
| information (this makes the `acc` dialect more agnostic with regard to |
| the other dialect it is used with). |
| * Clear operation ordering since all data operations are on the same |
| list. |
| |
| Each of the `acc` dialect data operations represents either the |
| entry or the exit portion of the data action specification. Thus, |
| `acc.copyin` represents the semantics defined in section |
| `2.7.7 copyin clause` whose wording starts with |
| `At entry to a region`. The decomposed exit operation `acc.delete` |
| represents the second part of that section, whose wording starts with |
| `At exit from the region`. The `delete` action may be performed |
| after checking and updating of the relevant reference counters noted. |
| |
| The `acc` data operations, even when decomposed, retain their original |
| data clause in a `dataClause` attribute on the operation so that this |
| information can be recovered during debugging. For example, `acc copy` |
| does not translate to an `acc.copy` operation, but instead to |
| `acc.copyin` for entry and `acc.copyout` for exit. Both of the decomposed |
| operations hold a `dataClause` field that specifies this was an |
| `acc copy`. |
| |
| The link between the decomposed entry and exit operations is the SSA |
| value produced by the entry operation. Namely, it is the `accPtr` result |
| which is used both in the `dataOperands` of the operation used for the |
| construct and in the `accPtr` operand of the exit operation. |
| |
| #### Bounds |
| |
| OpenACC data clauses allow the use of bounds specifiers as per |
| `2.7.1 Data Specification in Data Clauses`. However, array dimensions |
| for the data are not always required in the clause if the source |
| language's type system captures this information - the user can just |
| specify the variable name in the data clause. So the `acc.bounds` |
| operation is an important piece to ensure uniform representation of both |
| explicit user set dimensions and implicit type-based dimensions. It |
| contains several key features to allow properly encoding sizes in a |
| manner flexible and agnostic to the source language's dialect: |
| * Multi-dimensional arrays can be represented by using multiple ordered |
| `acc.bounds` operations. |
| * Bounds are required to be zero-normalized. This works well with the |
| `PointerLikeType` requirement in data clauses, since a lowerbound of 0 |
| means looking at data at the zero offset from the pointer. This |
| requirement also works well in ensuring the `acc` dialect is agnostic to |
| the source language dialect since it prevents ambiguity such as the case |
| of Fortran arrays where the lower bound is not a fixed value. |
| * If the source dialect does not encode the dimensions in the type (e.g. |
| `!fir.array<?x?xi32>`) but instead encodes it in some other way (such as |
| through descriptors), then the frontend must fill in the `acc.bounds` |
| operands with appropriate information (such as loads from descriptor). |
| The `acc.bounds` operation also permits a lossy source dialect, such |
| as when the frontend uses aggressive pointer decay and cannot represent |
| the dimensions in the type system (e.g. using `!llvm.ptr` for arrays). |
| Both of these aspects show the flexibility of the `acc.bounds` operation: |
| it keeps the representation agnostic, since the `acc` dialect is not |
| expected to be able to understand how to extract dimension information |
| from the types of the source dialect. |
| * The OpenACC specification allows either extent or upperbound in the |
| data clause depending on whether it is Fortran or C and C++. The |
| `acc.bounds` operation is rich enough to accept either or both - for |
| convenience in lowering to the dialect and for ability to precisely |
| capture the meaning from the clause. |
| * The stride, either in units or bytes, can also be captured in the |
| `acc.bounds` operation. This is also an important part of being able to |
| accept a source language's arrays without forcing the frontend to |
| normalize them in some way. For example, consider a case where in a |
| parent function, a whole array is mapped to the device. Then only a view |
| with a non-1 stride is passed to a child function (e.g. a Fortran array |
| slice with non-1 stride). A `copy` operation of this data in the child |
| should be able to avoid remapping this array. If instead the operation |
| required normalizing the array (such as making it contiguous), then the |
| unexpected disjoint mapping of the same host data would be error-prone |
| since it would result in multiple mappings to the device. |
| |
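Zero-normalization can be illustrated with a small Python sketch. It assumes the usual OpenACC subarray spellings, `a(lower:upper)` in Fortran and `a[start:length]` in C and C++; the `Bounds` record and both helper functions are hypothetical and only mirror the lowerbound, upperbound, and extent values discussed above, not the actual operands of `acc.bounds`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    lowerbound: int  # zero-based offset of the first element in this dimension
    upperbound: int  # zero-based offset of the last element (inclusive)
    extent: int      # number of elements covered


def normalize_fortran_section(decl_lb, sec_lb, sec_ub):
    """Zero-normalize a Fortran section `a(sec_lb:sec_ub)` of an array whose
    declared lower bound in this dimension is `decl_lb` (bounds inclusive)."""
    if not decl_lb <= sec_lb <= sec_ub:
        raise ValueError("section must start at or above the declared lower bound")
    lower = sec_lb - decl_lb
    upper = sec_ub - decl_lb
    return Bounds(lower, upper, upper - lower + 1)


def normalize_c_section(start, length):
    """Zero-normalize a C/C++ section `a[start:length]` (start plus length)."""
    if start < 0 or length <= 0:
        raise ValueError("start must be non-negative and length positive")
    return Bounds(start, start + length - 1, length)


# `a(12:15)` of an array declared as `a(10:20)` and the C section `a[2:4]`
# both describe four elements starting at zero-based offset 2.
print(normalize_fortran_section(10, 12, 15))
print(normalize_c_section(2, 4))
```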
| #### Counters |
| |
| The data operations also maintain semantics described in the OpenACC |
| specification related to runtime counters. More specifically, consider |
| the specification of the entry portion of `acc copyin` in section 2.7.7: |
| ``` |
| At entry to a region, the structured reference counter is used. On an |
| enter data directive, the dynamic reference counter is used. |
| - If var is present and is not a null pointer, a present increment |
| action with the appropriate reference counter is performed. |
| - If var is not present, a copyin action with the appropriate reference |
| counter is performed. |
| - If var is a pointer reference, an attach action is performed. |
| ``` |
| The `acc.copyin` operation includes these semantics, including those |
| related to attach, which is specified through the `varPtrPtr` operand. |
| The `structured` flag on the operation is important since the |
| `structured reference counter` should be used when the flag is true; and |
| the `dynamic reference counter` should be used when it is false. |
| |
| At exit from structured regions (`acc data`, `acc kernels`), the |
| `acc copyin` clause is decomposed to `acc.delete` (with the |
| `structured` flag as true). The semantics of `acc.delete` are |
| also consistent with the OpenACC specification noted for the exit |
| portion of the `acc copyin` clause: |
| ``` |
| At exit from the region: |
| - If the structured reference counter for var is zero, no action is |
| taken. |
| - Otherwise, a detach action is performed if var is a pointer reference, |
| and a present decrement action with the structured reference counter is |
| performed if var is not a null pointer. If both structured and dynamic |
| reference counters are zero, a delete action is performed. |
| ``` |
| |
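The counter behavior quoted above can be modeled with a toy Python class. This is a simplified sketch and not a runtime API: `DeviceData` and its methods are hypothetical, the attach and detach actions are left out, and every variable is assumed to be a non-null, non-pointer object. The copyin action is modeled as an increment from zero, which gives the counter the same value as a first mapping.

```python
class DeviceData:
    """Toy model of per-variable reference counters; not a real runtime API."""

    def __init__(self):
        self.structured = {}  # var -> structured reference counter
        self.dynamic = {}     # var -> dynamic reference counter
        self.present = set()  # vars that currently have device memory
        self.log = []         # data actions that moved or freed memory

    def copyin(self, var, structured):
        """Entry semantics of `acc.copyin` with the given `structured` flag."""
        counters = self.structured if structured else self.dynamic
        if var not in self.present:
            # copyin action: allocate device memory and copy from the host.
            self.present.add(var)
            self.log.append(("copyin", var))
        # Otherwise this is only a present increment action.
        counters[var] = counters.get(var, 0) + 1

    def delete(self, var):
        """Exit semantics of `acc.delete` with the `structured` flag true."""
        if self.structured.get(var, 0) == 0:
            return  # structured counter is zero: no action is taken
        self.structured[var] -= 1  # present decrement action
        if self.structured[var] == 0 and self.dynamic.get(var, 0) == 0:
            # Both counters are zero: delete action frees the device memory.
            self.present.discard(var)
            self.log.append(("delete", var))


# Two nested structured regions map `a`; only the outermost entry copies
# data and only the outermost exit deletes it.
dev = DeviceData()
dev.copyin("a", structured=True)
dev.copyin("a", structured=True)
dev.delete("a")
dev.delete("a")
print(dev.log)
```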
| ### Types |
| |
| There are a few acc dialect type categories to describe: |
| * type of acc data clause operation input `varPtr` |
| - The type of `varPtr` must be pointer-like. This is done by |
| attaching the `PointerLikeType` interface to the appropriate MLIR |
| type. Although memory/storage concept is a lower level abstraction, |
| it is useful because the OpenACC model distinguishes between host |
| and device memory explicitly - and the mapping between the two is |
| done through pointers. Thus, by explicitly requiring it in the |
| dialect, the appropriate language frontend must create storage or |
| use a type that satisfies the mapping constraint. |
| * type of result of acc data clause operations |
| - The result type of an acc data clause operation is exactly the same |
| as the type of `varPtr`. This was done intentionally instead of |
| introducing an `acc.ref/ptr` type so that IR compatibility and the |
| existing strong type checking of the source dialect can be maintained. |
| This is needed since the `acc` dialect must live within another |
| dialect whose type system is unknown to it. The only constraint is |
| that the appropriate dialect type must use the `PointerLikeType` |
| interface. |
| * type of decomposed clauses |
| - Decomposed clauses, such as `acc.bounds` and `acc.declare_enter`, |
| produce types to allow their results to be used only in specific |
| operations. |
| |
| ### Recipes |
| |
| Recipes are a generic way to express source language specific semantics. |
| |
| There are currently two categories of recipes, but the recipe concept |
| can be extended for any additional low-level information that needs |
| to be captured for successful lowering of OpenACC. The two categories |
| are: |
| * recipes used in the context of privatization associated with a |
| construct |
| * recipes used in the context of additional specification of data |
| semantics |
| |
| The intention of the recipes is to specify how materialization of an |
| action, such as privatization, should be done when the semantics |
| of the action need to be interpreted and lowered, such as before |
| generating LLVM dialect. |
| |
| The recipes used for privatization provide a source-language-independent |
| way of specifying the creation of a local variable of a given type. This |
| means using the appropriate `alloca` instruction and being able to |
| specify default initialization or a default constructor. |
| |
| ### Routine |
| |
| The routine directive is used to note that a procedure should be made |
| available for the accelerator in a way that is consistent with its |
| modifiers, such as those that describe the parallelism. In the `acc` |
| dialect, an acc routine is represented through two joint pieces, an |
| attribute and an operation: |
| * The `acc.routine` operation is simply a specifier which notes which |
| symbol (or string) the acc routine is needed for, along with the |
| associated parallelism. This defines a symbol that can be referenced in |
| the attribute. |
| * The `acc.routine_info` attribute is an attribute used on the source |
| dialect specific operation which specifies one or multiple `acc.routine` |
| symbols. Typically, this is attached to `func.func` which either |
| provides the declaration (in case of externals) or provides the |
| actual body of the acc routine in the dialect that the source language |
| was translated to. |
| |
| ### Declare |
| |
| OpenACC `declare` is a mechanism which declares a definition of a global |
| or a local to be accessible to the accelerator, with an implicit lifetime |
| matching that of the scope in which it was declared. Thus, `declare` |
| semantics are represented through multiple operations and attributes: |
| * `acc.declare` - This is a structured operation which contains an |
| MLIR region and can be used in a similar manner to `acc.data` to specify |
| an implicit data region with specific procedure lifetime. This is |
| typically used inside `func.func` after variable declarations. |
| * `acc.declare_enter` - This is an unstructured operation which is |
| used as a decomposed form of `acc declare`. It effectively allows the |
| entry operation to exist in a scope different than the exit operation. |
| It can also be used along with `acc.declare_exit`, which consumes its |
| token, to define a scoped region without using an MLIR region. This |
| operation is |
| also used in `acc.global_ctor`. |
| * `acc.declare_exit` - The matching equivalent of `acc.declare_enter` |
| except that it specifies exit semantics. This operation is typically |
| used inside a `func.func` at the exit points or with `acc.global_dtor`. |
| * `acc.global_ctor` - Lives at the same level as source dialect globals |
| and is used to specify data actions to be done at program entry. This |
| is used in conjunction with source dialect globals whose lifetime is |
| not just a single procedure. |
| * `acc.global_dtor` - Defines the exit data actions that should be done |
| at program exit. Typically used to revert the actions of |
| `acc.global_ctor`. |
| |
| The attributes: |
| * `acc.declare` - This is a facility for easier determination of |
| variables which are `acc declare`'d. This attribute is used on |
| operations producing globals and on operations producing locals such as |
| dialect specific `alloca`'s. A variable must carry this attribute in |
| order to appear in a data mapping operation associated with any of the |
| `acc.declare*` operations. |
| * `acc.declare_action` - Since the OpenACC specification allows |
| declaration of variables that have yet to be allocated, this attribute |
| is used at the allocation and deallocation points. More specifically, |
| this attribute captures symbols of functions to be called to perform |
| an action either pre-allocate, post-allocate, pre-deallocate, or |
| post-deallocate. Calls to these functions should be materialized when |
| lowering OpenACC semantics to ensure proper data actions are done |
| after the allocation/deallocation. |
| |
| ## OpenACC Transforms and Analyses |
| |
| The design goal for the `acc` dialect is to be friendly to MLIR |
| optimization passes including CSE and LICM. Additionally, since it is |
| designed to recover original clauses, it makes late verification and |
| analysis possible in the MLIR framework outside of the frontend. |
| |
| This section describes a few MLIR-level passes to which the `acc` |
| dialect design should be friendly. It currently only outlines the |
| possibilities intended by the design, and the passes described do not |
| necessarily exist. |
| |
| ### Verification |
| |
| Since the OpenACC dialect is not lossy with regard to its |
| representation, it is possible to do OpenACC language semantic checking |
| at the MLIR level. What follows is a list of the various semantic checks |
| needed. |
| |
| This first list is required to be done in the frontend because the `acc` |
| dialect operations must be valid when constructed: |
| * Ensure that only listed clauses are allowed for each directive. |
| * Ensure that only listed modifiers are allowed for each clause. |
| |
| However, the following are semantic checks that can be done at the |
| MLIR-level (either in a separate pass or as part of the operation |
| verifier): |
| * Perform the validity checks that each modifier needs (e.g. `num_gangs` |
| may need a positive integer). |
| * Ensure valid clause nesting. |
| * Validate restrictions on clauses which cannot appear with others. |
| * Validate that no conflicting clauses are used on variables. |
| |
| Note that some of these checks can be even more precise when done at the |
| MLIR level because optimizations like inlining and constant propagation |
| expose detail that wouldn't have been visible in the frontend. |
| |
| ### Implicit Data Attributes |
| |
| The OpenACC specification includes a section on `2.6.2 Variables with |
| Implicitly Determined Data Attributes`. This section describes |
| the data actions that should be applied to a variable for which the |
| user did not specify a data action. The action depends on the |
| construct being used and also on the default clause. However, the point |
| to note here is that variables which are live-in to the acc region |
| must employ some data mapping so the data can be passed to the |
| accelerator. |
| |
| One possible optimization that affects the data attributes needed is |
| `Scalar Replacement of Aggregates (SROA)`. The `acc` dialect should |
| not prevent this from happening on the source dialect. |
| |
| Because it is intended to be possible to apply optimizations across an |
| `acc` region, the analysis/transformation pass that applies the implicit |
| data attributes should be run as late as possible - ideally right before |
| any outlining process which uses the `acc` region body to create an |
| accelerator procedure. It is expected that existing MLIR facilities, |
| such as `mlir::Liveness`, will work for the `acc` region and thus can be |
| used to perform this analysis. |
| |
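The live-in idea can be sketched in a few lines of Python. This is a simplified illustration rather than `mlir::Liveness`: the region is modeled as a single block given as an ordered list of `(results, operands)` pairs, and both helper names are hypothetical. A value needs an implicit data attribute when it is used in the region, is not defined there, and is not already covered by an explicit data operation.

```python
def live_ins(region_ops):
    """Return values used inside a region before being defined in it.

    `region_ops` is an ordered list of (results, operands) pairs, a
    hypothetical stand-in for the operations of a single-block region.
    """
    defined = set()
    live = []
    for results, operands in region_ops:
        for value in operands:
            if value not in defined and value not in live:
                live.append(value)
        defined.update(results)
    return live


def needs_implicit_mapping(region_ops, explicitly_mapped):
    """Live-in values that no explicit data operation already covers."""
    return [v for v in live_ins(region_ops) if v not in explicitly_mapped]


# %t is defined inside the region, so only %a, %i and %b are live-in.
ops = [(["%t"], ["%a", "%i"]), ([], ["%t", "%b"])]
print(live_ins(ops))
print(needs_implicit_mapping(ops, {"%a"}))
```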
| ### Redundant Clause Elimination |
| |
| The data operations are modeled in a way where data entry operations |
| look like loads and data exit operations look like stores. Thus these |
| operations are intended to be optimized in the following ways: |
| * Be able to eliminate redundant operations such as when an `acc.copyin` |
| dominates another. |
| * Be able to hoist/sink such operations out of loops. |
| |
| ## Operations TOC |
| |
| [include "Dialects/OpenACCDialectOps.md"] |
| |