| =================================== |
| Memory Model Relaxation Annotations |
| =================================== |
| |
| .. contents:: |
| :local: |
| |
| Introduction |
| ============ |
| |
| Memory Model Relaxation Annotations (MMRAs) are target-defined properties |
| on instructions that can be used to selectively relax constraints placed |
| by the memory model. For example: |
| |
| * The use of ``VulkanMemoryModel`` in a SPIRV program allows certain |
| memory operations to be reordered across ``acquire`` or ``release`` |
| operations. |
| * OpenCL APIs expose primitives to only fence a specific set of address |
| spaces. Carrying that information to the backend can enable the |
| use of faster synchronization instructions, rather than fencing all |
| address spaces everytime. |
| |
| MMRAs offer an opt-in system for targets to relax the default LLVM |
| memory model. |
| As such, they are attached to an operation using LLVM metadata which |
| can always be dropped without affecting correctness. |
| |
| Definitions |
| =========== |
| |
| memory operation |
| A load, a store, an atomic, or a function call that is marked as |
| accessing memory. |
| |
| synchronizing operation |
| An instruction that synchronizes memory with other threads (e.g. |
| an atomic or a fence). |
| |
| tag |
| Metadata attached to a memory or synchronizing operation |
| that represents some target-defined property regarding memory |
| synchronization. |
| |
| An operation may have multiple tags that each represent a different |
| property. |
| |
| A tag is composed of a pair of metadata string: a *prefix* and a *suffix*. |
| |
| In LLVM IR, the pair is represented using a metadata tuple. |
| In other cases (comments, documentation, etc.), we may use the |
| ``prefix:suffix`` notation. |
| For example: |
| |
| .. code-block:: |
| :caption: Example: Tags in Metadata |
| |
| !0 = !{!"scope", !"workgroup"} # scope:workgroup |
| !1 = !{!"scope", !"device"} # scope:device |
| !2 = !{!"scope", !"system"} # scope:system |
| |
| .. note:: |
| |
| The only semantics relevant to the optimizer is the |
| "compatibility" relation defined below. All other |
| semantics are target defined. |
| |
| Tags can also be organised in lists to allow operations |
| to specify all of the tags they belong to. Such a list |
| is referred to as a "set of tags". |
| |
| .. code-block:: |
| :caption: Example: Set of Tags in Metadata |
| |
| !0 = !{!"scope", !"workgroup"} |
| !1 = !{!"sync-as", !"private"} |
| !2 = !{!0, !2} |
| |
| .. note:: |
| |
| If an operation does not have MMRA metadata, it's treated as if |
| it has an empty list (``!{}``) of tags. |
| |
| Note that it is not an error if a tag is not recognized by the |
| instruction it is applied to, or by the current target. |
| Such tags are simply ignored. |
| |
| Both synchronizing operations and memory operations can have |
| zero or more tags attached to them using the ``!mmra`` syntax. |
| |
| For the sake of readability in examples below, |
| we use a (non-functional) short syntax to represent MMMRA metadata: |
| |
| .. code-block:: |
| :caption: Short Syntax Example |
| |
| store %ptr1 # foo:bar |
| store %ptr1 !mmra !{!"foo", !"bar"} |
| |
| These two notations can be used in this document and are strictly |
| equivalent. However, only the second version is functional. |
| |
| compatibility |
| Two sets of tags are said to be *compatible* iff, for every unique |
| tag prefix P present in at least one set: |
| |
| - the other set contains no tag with prefix P, or |
| - at least one tag with prefix P is common to both sets. |
| |
| The above definition implies that an empty set is always compatible |
| with any other set. This is an important property as it ensures that |
| if a transform drops the metadata on an operation, it can never affect |
| correctness. In other words, the memory model cannot be relaxed further |
| by deleting metadata from instructions. |
| |
| .. _HappensBefore: |
| |
| The *happens-before* Relation |
| ============================== |
| |
| Compatibility checks can be used to opt out of the *happens-before* relation |
| established between two instructions. |
| |
| Ordering |
| When two instructions' metadata are not compatible, any program order |
| between them are not in *happens-before*. |
| |
| For example, consider two tags ``foo:bar`` and |
| ``foo:baz`` exposed by a target: |
| |
| .. code-block:: |
| |
| A: store %ptr1 # foo:bar |
| B: store %ptr2 # foo:baz |
| X: store atomic release %ptr3 # foo:bar |
| |
| In the above figure, ``A`` is compatible with ``X``, and hence ``A`` |
| happens-before ``X``. But ``B`` is not compatible with |
| ``X``, and hence it is not happens-before ``X``. |
| |
| Synchronization |
| If an synchronizing operation has one or more tags, then whether it |
| synchronizes-with and participates in the ``seq_cst`` order with |
| other operations is target dependent. |
| |
| Whether the following example synchronizes with another sequence depends |
| on the target-defined semantics of ``foo:bar`` and ``foo:bux``. |
| |
| .. code-block:: |
| |
| fence release # foo:bar |
| store atomic %ptr1 # foo:bux |
| |
| Examples |
| -------- |
| |
| Example 1: |
| .. code-block:: |
| |
| A: store ptr addrspace(1) %ptr2 # sync-as:1 vulkan:nonprivate |
| B: store atomic release ptr addrspace(1) %ptr3 # sync-as:0 vulkan:nonprivate |
| |
| A and B are not ordered relative to each other |
| (no *happens-before*) because their sets of tags are not compatible. |
| |
| Note that the ``sync-as`` value does not have to match the ``addrspace`` value. |
| e.g. In Example 1, a store-release to a location in ``addrspace(1)`` wants to |
| only synchronize with operations happening in ``addrspace(0)``. |
| |
| Example 2: |
| .. code-block:: |
| |
| A: store ptr addrspace(1) %ptr2 # sync-as:1 vulkan:nonprivate |
| B: store atomic release ptr addrspace(1) %ptr3 # sync-as:1 vulkan:nonprivate |
| |
| The ordering of A and B is unaffected because their set of tags are |
| compatible. |
| |
| Note that A and B may or may not be in *happens-before* due to other reasons. |
| |
| Example 3: |
| .. code-block:: |
| |
| A: store ptr addrspace(1) %ptr2 # sync-as:1 vulkan:nonprivate |
| B: store atomic release ptr addrspace(1) %ptr3 # vulkan:nonprivate |
| |
| The ordering of A and B is unaffected because their set of tags are |
| compatible. |
| |
| Example 4: |
| .. code-block:: |
| |
| A: store ptr addrspace(1) %ptr2 # sync-as:1 |
| B: store atomic release ptr addrspace(1) %ptr3 # sync-as:2 |
| |
| A and B do not have to be ordered relative to each other |
| (no *happens-before*) because their sets of tags are not compatible. |
| |
| Use-cases |
| ========= |
| |
| SPIRV ``NonPrivatePointer`` |
| --------------------------- |
| |
| MMRAs can support the SPIRV capability |
| ``VulkanMemoryModel``, where synchronizing operations only affect |
| memory operations that specify ``NonPrivatePointer`` semantics. |
| |
| The example below is generated from a SPIRV program using the |
| following recipe: |
| |
| - Add ``vulkan:nonprivate`` to every synchronizing operation. |
| - Add ``vulkan:nonprivate`` to every non-atomic memory operation |
| that is marked ``NonPrivatePointer``. |
| - Add ``vulkan:private`` to tags of every non-atomic memory operation |
| that is not marked ``NonPrivatePointer``. |
| |
| .. code-block:: |
| |
| Thread T1: |
| A: store %ptr1 # vulkan:nonprivate |
| B: store %ptr2 # vulkan:private |
| X: store atomic release %ptr3 # vulkan:nonprivate |
| |
| Thread T2: |
| Y: load atomic acquire %ptr3 # vulkan:nonprivate |
| C: load %ptr2 # vulkan:private |
| D: load %ptr1 # vulkan:nonprivate |
| |
| Compatibility ensures that operation ``A`` is ordered |
| relative to ``X`` while operation ``D`` is ordered relative to ``Y``. |
| If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``. |
| No such relation can be inferred about operations ``B`` and ``C``. |
| |
| .. note:: |
| The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_ |
| considers all atomic operation non-private. |
| |
| Whether ``vulkan:nonprivate`` would be specified on atomic operations is |
| an implementation detail, as an atomic operation is always ``nonprivate``. |
| The implementation may choose to be explicit and emit IR with |
| ``vulkan:nonprivate`` on every atomic operation, or it could choose to |
| only emit ``vulkan::private`` and assume ``vulkan:nonprivate`` |
| by default. |
| |
| Operations marked with ``vulkan:private`` effectively opt out of the |
| happens-before order in a SPIRV program since they are incompatible |
| with every synchronizing operation. Note that SPIRV operations that |
| are not marked ``NonPrivatePointer`` are not entirely private to the |
| thread --- they are implicitly synchronized at the start or end of a |
| thread by the Vulkan *system-synchronizes-with* relationship. This |
| example assumes that the target-defined semantics of |
| ``vulkan:private`` correctly implements this property. |
| |
| This scheme is general enough to express the interoperability of SPIRV |
| programs with other environments. |
| |
| .. code-block:: |
| |
| Thread T1: |
| A: store %ptr1 # vulkan:nonprivate |
| X: store atomic release %ptr2 # vulkan:nonprivate |
| |
| Thread T2: |
| Y: load atomic acquire %ptr2 # foo:bar |
| B: load %ptr1 |
| |
| In the above example, thread ``T1`` originates from a SPIRV program |
| while thread ``T2`` originates from a non-SPIRV program. Whether ``X`` |
| can synchronize with ``Y`` is target defined. If ``X`` synchronizes |
| with ``Y``, then ``A`` happens before ``B`` (because A/X and |
| Y/B are compatible). |
| |
| Implementation Example |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Consider the implementation of SPIRV ``NonPrivatePointer`` on a target |
| where all memory operations are cached, and the entire cache is |
| flushed or invalidated at a ``release`` or ``acquire`` respectively. A |
| possible scheme is that when translating a SPIRV program, memory |
| operations marked ``NonPrivatePointer`` should not be cached, and the |
| cache contents should not be touched during an ``acquire`` and |
| ``release`` operation. |
| |
| This could be implemented using the tags that share the ``vulkan:`` prefix, |
| as follows: |
| |
| - For memory operations: |
| |
| - Operations with ``vulkan:nonprivate`` should bypass the cache. |
| - Operations with ``vulkan:private`` should be cached. |
| - Operations that specify neither or both should conservatively |
| bypass the cache to ensure correctness. |
| |
| - For synchronizing operations: |
| |
| - Operations with ``vulkan:nonprivate`` should not flush or |
| invalidate the cache. |
| - Operations with ``vulkan:private`` should flush or invalidate the cache. |
| - Operations that specify neither or both should conservatively |
| flush or invalidate the cache to ensure correctness. |
| |
| .. note:: |
| In such an implementation, dropping the metadata on an operation, while |
| not affecting correctness, may have big performance implications. |
| e.g. an operation bypasses the cache when it shouldn't. |
| |
| Memory Types |
| ------------ |
| |
| MMRAs may express the selective synchronization of |
| different memory types. |
| |
| As an example, a target may expose an ``sync-as:<N>`` tag to |
| pass information about which address spaces are synchronized by the |
| execution of a synchronizing operation. |
| |
| .. note:: |
| Address spaces are used here as a common example, but this concept |
| can apply for other "memory types". What "memory types" means here is |
| up to the target. |
| |
| .. code-block:: |
| |
| # let 1 = global address space |
| # let 3 = local address space |
| |
| Thread T1: |
| A: store %ptr1 # sync-as:1 |
| B: store %ptr2 # sync-as:3 |
| X: store atomic release ptr addrspace(0) %ptr3 # sync-as:3 |
| |
| Thread T2: |
| Y: load atomic acquire ptr addrspace(0) %ptr3 # sync-as:3 |
| C: load %ptr2 # sync-as:3 |
| D: load %ptr1 # sync-as:1 |
| |
| In the above figure, ``X`` and ``Y`` are atomic operations on a |
| location in the ``global`` address space. If ``X`` synchronizes with |
| ``Y``, then ``B`` happens-before ``C`` in the ``local`` address |
| space. But no such statement can be made about operations ``A`` and |
| ``D``, although they are peformed on a location in the ``global`` |
| address space. |
| |
| Implementation Example: Adding Address Space Information to Fences |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Languages such as OpenCL C provide fence operations such as |
| ``atomic_work_item_fence`` that can take an explicit address |
| space to fence. |
| |
| By default, LLVM has no means to carry that information in the IR, so |
| the information is lost during lowering to LLVM IR. This means that |
| targets such as AMDGPU have to conservatively emit instructions to |
| fence all address spaces in all cases, which can have a noticeable |
| performance impact in high-performance applications. |
| |
| MMRAs may be used to preserve that information at the IR level, all the |
| way through code generation. For example, a fence that only affects the |
| global address space ``addrspace(1)`` may be lowered as |
| |
| .. code-block:: |
| |
| fence release # sync-as:1 |
| |
| and the target may use the presence of ``sync-as:1`` to infer that it |
| must only emit instruction to fence the global address space. |
| |
| Note that as MMRAs are opt in, a fence that does not have MMRA metadata |
| could still be lowered conservatively, so this optimization would only |
| apply if the front-end emits the MMRA metadata on the fence instructions. |
| |
| Additional Topics |
| ================= |
| |
| .. note:: |
| |
| The following sections are informational. |
| |
| Performance Impact |
| ------------------ |
| |
| MMRAs are a way to capture optimization opportunities in the program. |
| But when an operation mentions no tags or conflicting tags, |
| the target may need to produce conservative code to ensure correctness |
| at the cost of performance. This can happen in the following situations: |
| |
| 1. When a target first introduces MMRAs, the |
| frontend might not have been updated to emit them. |
| 2. An optimization may drop MMRA metadata. |
| 3. An optimization may add arbitrary tags to an operation. |
| |
| Note that targets can always choose to ignore (or even drop) MMRAs |
| and revert to the default behavior/codegen heuristics without |
| affecting correctness. |
| |
| Consequences of the Absence of *happens-before* |
| ----------------------------------------------- |
| |
| In the :ref:`happens-before<HappensBefore>` section, we defined how an |
| *happens-before* relation between two instruction can be broken |
| by leveraging compatibility between MMRAs. When the instructions |
| are incompatible and there is no *happens-before* relation, we say |
| that the instructions "do not have to be ordered relative to each |
| other". |
| |
| "Ordering" in this context is a very broad term which covers both |
| static and runtime aspects. |
| |
| When there is no ordering constraint, we *could* statically reorder |
| the instructions in an optimizer transform if the reordering does |
| not break other constraints as single location coherence. |
| Static reordering is one consequence of breaking *happens-before*, |
| but is not the most interesting one. |
| |
| Run-time consequences are more interesting. When there is an |
| *happens-before* relation between instructions, the target has to emit |
| synchronization code to ensure other threads will observe the effects of |
| the instructions in the right order. |
| |
| For instance, the target may have to wait for previous loads & stores to |
| finish before starting a fence-release, or there may be a need to flush a |
| memory cache before executing the next instruction. |
| In the absence of *happens-before*, there is no such requirement and |
| no waiting or flushing is required. This may noticeably speed up |
| execution in some cases. |
| |
| Combining Operations |
| -------------------- |
| |
| If a pass can combine multiple memory or synchronizing operations |
| into one, it needs to be able to combine MMRAs. One possible way to |
| achieve this is by doing a prefix-wise union of the tag sets. |
| |
| Let A and B be two tags set, and U be the prefix-wise union of A and B. |
| For every unique tag prefix P present in A or B: |
| |
| * If either A or B has no tags with prefix P, no tags with prefix |
| P are added to U. |
| * If both A and B have at least one tag with prefix P, all tags with prefix |
| P from both sets are added to U. |
| |
| Passes should avoid aggressively combining MMRAs, as this can result |
| in significant losses of information. While this cannot affect |
| correctness, it may affect performance. |
| |
| As a general rule of thumb, common passes such as SimplifyCFG that |
| aggressively combine/reorder operations should only combine |
| instructions that have identical sets of tags. |
| Passes that combine less frequently, or that are well aware of the cost |
| of combining the MMRAs can use the prefix-wise union described above. |
| |
| Examples: |
| |
| .. code-block:: |
| |
| A: store release %ptr1 # foo:x, foo:y, bar:x |
| B: store release %ptr2 # foo:x, bar:y |
| |
| # Unique prefixes P = [foo, bar] |
| # "foo:x" is common to A and B so it's added to U. |
| # "bar:x" != "bar:y" so it's not added to U. |
| U: store release %ptr3 # foo:x |
| |
| .. code-block:: |
| |
| A: store release %ptr1 # foo:x, foo:y |
| B: store release %ptr2 # foo:x, bux:y |
| |
| # Unique prefixes P = [foo, bux] |
| # "foo:x" is common to A and B so it's added to U. |
| # No tags have the prefix "bux" in A. |
| U: store release %ptr3 # foo:x |
| |
| .. code-block:: |
| |
| A: store release %ptr1 |
| B: store release %ptr2 # foo:x, bar:y |
| |
| # Unique prefixes P = [foo, bar] |
| # No tags with "foo" or "bar" in A, so no tags added. |
| U: store release %ptr3 |