|  | =================================== | 
|  | Memory Model Relaxation Annotations | 
|  | =================================== | 
|  |  | 
|  | .. contents:: | 
|  | :local: | 
|  |  | 
|  | Introduction | 
|  | ============ | 
|  |  | 
|  | Memory Model Relaxation Annotations (MMRAs) are target-defined properties | 
|  | on instructions that can be used to selectively relax constraints placed | 
|  | by the memory model. For example: | 
|  |  | 
|  | * The use of ``VulkanMemoryModel`` in a SPIRV program allows certain | 
|  | memory operations to be reordered across ``acquire`` or ``release`` | 
|  | operations. | 
|  | * OpenCL APIs expose primitives to only fence a specific set of address | 
|  | spaces. Carrying that information to the backend can enable the | 
|  | use of faster synchronization instructions, rather than fencing all | 
|  | address spaces every time. | 
|  |  | 
|  | MMRAs offer an opt-in system for targets to relax the default LLVM | 
|  | memory model. | 
|  | As such, they are attached to an operation using LLVM metadata which | 
|  | can always be dropped without affecting correctness. | 
|  |  | 
|  | Definitions | 
|  | =========== | 
|  |  | 
|  | memory operation | 
|  | A load, a store, an atomic, or a function call that is marked as | 
|  | accessing memory. | 
|  |  | 
|  | synchronizing operation | 
|  | An instruction that synchronizes memory with other threads (e.g. | 
|  | an atomic or a fence). | 
|  |  | 
|  | tag | 
|  | Metadata attached to a memory or synchronizing operation | 
|  | that represents some target-defined property regarding memory | 
|  | synchronization. | 
|  |  | 
|  | An operation may have multiple tags that each represent a different | 
|  | property. | 
|  |  | 
|  | A tag is composed of a pair of metadata strings: a *prefix* and a *suffix*. | 
|  |  | 
|  | In LLVM IR, the pair is represented using a metadata tuple. | 
|  | In other cases (comments, documentation, etc.), we may use the | 
|  | ``prefix:suffix`` notation. | 
|  | For example: | 
|  |  | 
|  | .. code-block:: | 
|  | :caption: Example: Tags in Metadata | 
|  |  | 
|  | !0 = !{!"scope", !"workgroup"}  # scope:workgroup | 
|  | !1 = !{!"scope", !"device"}     # scope:device | 
|  | !2 = !{!"scope", !"system"}     # scope:system | 
|  |  | 
|  | .. note:: | 
|  |  | 
|  | The only semantics relevant to the optimizer is the | 
|  | "compatibility" relation defined below. All other | 
|  | semantics are target defined. | 
|  |  | 
|  | Tags can also be organised in lists to allow operations | 
|  | to specify all of the tags they belong to. Such a list | 
|  | is referred to as a "set of tags". | 
|  |  | 
|  | .. code-block:: | 
|  | :caption: Example: Set of Tags in Metadata | 
|  |  | 
|  | !0 = !{!"scope", !"workgroup"} | 
|  | !1 = !{!"sync-as", !"private"} | 
|  | !2 = !{!0, !1} | 
|  |  | 
|  | .. note:: | 
|  |  | 
|  | If an operation does not have MMRA metadata, it's treated as if | 
|  | it has an empty list (``!{}``) of tags. | 
|  |  | 
|  | Note that it is not an error if a tag is not recognized by the | 
|  | instruction it is applied to, or by the current target. | 
|  | Such tags are simply ignored. | 
|  |  | 
|  | Both synchronizing operations and memory operations can have | 
|  | zero or more tags attached to them using the ``!mmra`` syntax. | 
|  |  | 
|  | For the sake of readability in examples below, | 
|  | we use a (non-functional) short syntax to represent MMRA metadata: | 
|  |  | 
|  | .. code-block:: | 
|  | :caption: Short Syntax Example | 
|  |  | 
|  | store %ptr1 # foo:bar | 
|  | store %ptr1 !mmra !{!"foo", !"bar"} | 
|  |  | 
|  | These two notations can be used in this document and are strictly | 
|  | equivalent. However, only the second version is functional. | 
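To make the link between the short ``prefix:suffix`` notation and the functional metadata form concrete, here is a minimal Python sketch that renders short-syntax tags as the metadata text shown in the examples above. The helper names (``parse_tag``, ``render_tag``, ``render_tag_set``) are hypothetical and not part of LLVM; the sketch only produces text and does not validate it against any IR parser.

```python
def parse_tag(text):
    """Split a 'prefix:suffix' string into a (prefix, suffix) pair."""
    prefix, sep, suffix = text.partition(":")
    if not sep or not prefix or not suffix:
        raise ValueError(f"expected 'prefix:suffix', got {text!r}")
    return prefix, suffix


def render_tag(tag):
    """Render one tag as a metadata tuple of two strings."""
    prefix, suffix = tag
    return f'!{{!"{prefix}", !"{suffix}"}}'


def render_tag_set(short_tags):
    """Render one numbered node per tag, then a node listing them all."""
    tags = [parse_tag(t) for t in short_tags]
    lines = [f"!{i} = {render_tag(tag)}" for i, tag in enumerate(tags)]
    refs = ", ".join(f"!{i}" for i in range(len(tags)))
    lines.append(f"!{len(tags)} = !{{{refs}}}")
    return lines


for line in render_tag_set(["scope:workgroup", "sync-as:private"]):
    print(line)
```

The last node printed is the "set of tags" that references the individual tag nodes by number.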
|  |  | 
|  | compatibility | 
|  | Two sets of tags are said to be *compatible* iff, for every unique | 
|  | tag prefix P present in at least one set: | 
|  |  | 
|  | - the other set contains no tag with prefix P, or | 
|  | - at least one tag with prefix P is common to both sets. | 
|  |  | 
|  | The above definition implies that an empty set is always compatible | 
|  | with any other set. This is an important property as it ensures that | 
|  | if a transform drops the metadata on an operation, it can never affect | 
|  | correctness. In other words, the memory model cannot be relaxed further | 
|  | by deleting metadata from instructions. | 
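The compatibility relation can be sketched in a few lines of Python. This is an illustrative model only, assuming a tag is represented as a ``(prefix, suffix)`` pair and a set of tags as a Python ``set``; the function names are hypothetical and do not correspond to any LLVM API.

```python
def prefixes(tags):
    """Return the set of prefixes used by a set of (prefix, suffix) tags."""
    return {prefix for prefix, _ in tags}


def compatible(a, b):
    """Return True iff the tag sets a and b are compatible."""
    a, b = set(a), set(b)
    # A prefix found in only one set satisfies the first condition,
    # so only prefixes present in both sets need a common tag.
    for p in prefixes(a) & prefixes(b):
        if not any(prefix == p for prefix, _ in a & b):
            return False
    return True


bar = {("foo", "bar")}
baz = {("foo", "baz")}
print(compatible(bar, baz))        # False: "foo" is shared, no common tag
print(compatible(bar, bar | baz))  # True: foo:bar is common to both
print(compatible(set(), baz))      # True: the empty set constrains nothing
```

The last call illustrates why dropping metadata is always safe: an operation without tags is compatible with every other operation.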
|  |  | 
|  | .. _HappensBefore: | 
|  |  | 
|  | The *happens-before* Relation | 
|  | ============================== | 
|  |  | 
|  | Compatibility checks can be used to opt out of the *happens-before* relation | 
|  | established between two instructions. | 
|  |  | 
|  | Ordering | 
|  | When the metadata of two instructions is not compatible, any program order | 
|  | between them is not in *happens-before*. | 
|  |  | 
|  | For example, consider two tags ``foo:bar`` and | 
|  | ``foo:baz`` exposed by a target: | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | A: store %ptr1                 # foo:bar | 
|  | B: store %ptr2                 # foo:baz | 
|  | X: store atomic release %ptr3  # foo:bar | 
|  |  | 
|  | In the above example, ``A`` is compatible with ``X``, and hence ``A`` | 
|  | happens-before ``X``. But ``B`` is not compatible with | 
|  | ``X``, and hence there is no *happens-before* between ``B`` and ``X``. | 
|  |  | 
|  | Synchronization | 
|  | If a synchronizing operation has one or more tags, then whether it | 
|  | synchronizes-with and participates in the ``seq_cst`` order with | 
|  | other operations is target dependent. | 
|  |  | 
|  | Whether the following example synchronizes with another sequence depends | 
|  | on the target-defined semantics of ``foo:bar`` and ``foo:bux``. | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | fence release               # foo:bar | 
|  | store atomic %ptr1          # foo:bux | 
|  |  | 
|  | Examples | 
|  | -------- | 
|  |  | 
|  | Example 1: | 
|  | .. code-block:: | 
|  |  | 
|  | A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate | 
|  | B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate | 
|  |  | 
|  | A and B are not ordered relative to each other | 
|  | (no *happens-before*) because their sets of tags are not compatible. | 
|  |  | 
|  | Note that the ``sync-as`` value does not have to match the ``addrspace`` value. | 
|  | For instance, in Example 1, a store-release to a location in ``addrspace(1)`` | 
|  | is meant to synchronize only with operations happening in ``addrspace(0)``. | 
|  |  | 
|  | Example 2: | 
|  | .. code-block:: | 
|  |  | 
|  | A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate | 
|  | B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate | 
|  |  | 
|  | The ordering of A and B is unaffected because their sets of tags are | 
|  | compatible. | 
|  |  | 
|  | Note that A and B may or may not be in *happens-before* due to other reasons. | 
|  |  | 
|  | Example 3: | 
|  | .. code-block:: | 
|  |  | 
|  | A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate | 
|  | B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate | 
|  |  | 
|  | The ordering of A and B is unaffected because their sets of tags are | 
|  | compatible. | 
|  |  | 
|  | Example 4: | 
|  | .. code-block:: | 
|  |  | 
|  | A: store ptr addrspace(1) %ptr2                 # sync-as:1 | 
|  | B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2 | 
|  |  | 
|  | A and B do not have to be ordered relative to each other | 
|  | (no *happens-before*) because their sets of tags are not compatible. | 
|  |  | 
|  | Use-cases | 
|  | ========= | 
|  |  | 
|  | SPIRV ``NonPrivatePointer`` | 
|  | --------------------------- | 
|  |  | 
|  | MMRAs can support the SPIRV capability | 
|  | ``VulkanMemoryModel``, where synchronizing operations only affect | 
|  | memory operations that specify ``NonPrivatePointer`` semantics. | 
|  |  | 
|  | The example below is generated from a SPIRV program using the | 
|  | following recipe: | 
|  |  | 
|  | - Add ``vulkan:nonprivate`` to every synchronizing operation. | 
|  | - Add ``vulkan:nonprivate`` to every non-atomic memory operation | 
|  | that is marked ``NonPrivatePointer``. | 
|  | - Add ``vulkan:private`` to tags of every non-atomic memory operation | 
|  | that is not marked ``NonPrivatePointer``. | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | Thread T1: | 
|  | A: store %ptr1                 # vulkan:nonprivate | 
|  | B: store %ptr2                 # vulkan:private | 
|  | X: store atomic release %ptr3  # vulkan:nonprivate | 
|  |  | 
|  | Thread T2: | 
|  | Y: load atomic acquire %ptr3   # vulkan:nonprivate | 
|  | C: load %ptr2                  # vulkan:private | 
|  | D: load %ptr1                  # vulkan:nonprivate | 
|  |  | 
|  | Compatibility ensures that operation ``A`` is ordered | 
|  | relative to ``X`` while operation ``D`` is ordered relative to ``Y``. | 
|  | If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``. | 
|  | No such relation can be inferred about operations ``B`` and ``C``. | 
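The reasoning in this example can be modelled with a short Python sketch. It assumes tags are ``(prefix, suffix)`` pairs, that ``X`` synchronizes with ``Y`` (which is target defined), and that the only question is whether MMRAs remove an ordering that would otherwise exist. The function names are hypothetical.

```python
def compatible(a, b):
    """Compatibility check as defined in the Definitions section."""
    a, b = set(a), set(b)
    shared_prefixes = {p for p, _ in a} & {p for p, _ in b}
    return all(any(t[0] == p for t in a & b) for p in shared_prefixes)


def order_preserved(op_t1, release, acquire, op_t2):
    """True if MMRAs do not remove the ordering of op_t1 before op_t2.

    Assumes `release` synchronizes with `acquire`, which is target defined.
    """
    return compatible(op_t1, release) and compatible(acquire, op_t2)


NONPRIVATE = {("vulkan", "nonprivate")}
PRIVATE = {("vulkan", "private")}

# A and D against X and Y: the ordering is preserved.
print(order_preserved(NONPRIVATE, NONPRIVATE, NONPRIVATE, NONPRIVATE))  # True
# B and C against X and Y: no happens-before can be inferred.
print(order_preserved(PRIVATE, NONPRIVATE, NONPRIVATE, PRIVATE))        # False
```

A result of ``True`` only means that the tags do not break the ordering; whether the ordering exists in the first place still depends on the usual memory model rules.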
|  |  | 
|  | .. note:: | 
|  | The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_ | 
|  | considers all atomic operations non-private. | 
|  |  | 
|  | Whether ``vulkan:nonprivate`` would be specified on atomic operations is | 
|  | an implementation detail, as an atomic operation is always ``nonprivate``. | 
|  | The implementation may choose to be explicit and emit IR with | 
|  | ``vulkan:nonprivate`` on every atomic operation, or it could choose to | 
|  | only emit ``vulkan:private`` and assume ``vulkan:nonprivate`` | 
|  | by default. | 
|  |  | 
|  | Operations marked with ``vulkan:private`` effectively opt out of the | 
|  | happens-before order in a SPIRV program since they are incompatible | 
|  | with every synchronizing operation. Note that SPIRV operations that | 
|  | are not marked ``NonPrivatePointer`` are not entirely private to the | 
|  | thread --- they are implicitly synchronized at the start or end of a | 
|  | thread by the Vulkan *system-synchronizes-with* relationship. This | 
|  | example assumes that the target-defined semantics of | 
|  | ``vulkan:private`` correctly implements this property. | 
|  |  | 
|  | This scheme is general enough to express the interoperability of SPIRV | 
|  | programs with other environments. | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | Thread T1: | 
|  | A: store %ptr1                 # vulkan:nonprivate | 
|  | X: store atomic release %ptr2  # vulkan:nonprivate | 
|  |  | 
|  | Thread T2: | 
|  | Y: load atomic acquire %ptr2   # foo:bar | 
|  | B: load %ptr1 | 
|  |  | 
|  | In the above example, thread ``T1`` originates from a SPIRV program | 
|  | while thread ``T2`` originates from a non-SPIRV program. Whether ``X`` | 
|  | can synchronize with ``Y`` is target defined. If ``X`` synchronizes | 
|  | with ``Y``, then ``A`` happens-before ``B`` (because A/X and | 
|  | Y/B are compatible). | 
|  |  | 
|  | Implementation Example | 
|  | ~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Consider the implementation of SPIRV ``NonPrivatePointer`` on a target | 
|  | where all memory operations are cached, and the entire cache is | 
|  | flushed or invalidated at a ``release`` or ``acquire`` respectively. A | 
|  | possible scheme is that when translating a SPIRV program, memory | 
|  | operations marked ``NonPrivatePointer`` should not be cached, and the | 
|  | cache contents should not be touched during an ``acquire`` or | 
|  | ``release`` operation. | 
|  |  | 
|  | This could be implemented using the tags that share the ``vulkan:`` prefix, | 
|  | as follows: | 
|  |  | 
|  | - For memory operations: | 
|  |  | 
|  | - Operations with ``vulkan:nonprivate`` should bypass the cache. | 
|  | - Operations with ``vulkan:private`` should be cached. | 
|  | - Operations that specify neither or both should conservatively | 
|  | bypass the cache to ensure correctness. | 
|  |  | 
|  | - For synchronizing operations: | 
|  |  | 
|  | - Operations with ``vulkan:nonprivate`` should not flush or | 
|  | invalidate the cache. | 
|  | - Operations with ``vulkan:private`` should flush or invalidate the cache. | 
|  | - Operations that specify neither or both should conservatively | 
|  | flush or invalidate the cache to ensure correctness. | 
|  |  | 
|  | .. note:: | 
|  | In such an implementation, dropping the metadata on an operation, while | 
|  | not affecting correctness, may have significant performance implications. | 
|  | For example, an operation may bypass the cache when it does not need to. | 
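The decision rules of this scheme can be written down as a small Python sketch. It is a model of the policy only, assuming tags are ``(prefix, suffix)`` pairs; the function names are hypothetical, and a real target would make these decisions during instruction selection rather than in a helper like this.

```python
KNOWN_SUFFIXES = {"private", "nonprivate"}


def vulkan_suffixes(tags):
    """Collect the recognized suffixes of all tags with the vulkan prefix."""
    # Unrecognized tags are ignored, as for any other MMRA.
    return {s for p, s in tags if p == "vulkan"} & KNOWN_SUFFIXES


def bypasses_cache(tags):
    """Memory operation: True if the access should bypass the cache."""
    # Only an unambiguous vulkan:private allows the access to be cached.
    return vulkan_suffixes(tags) != {"private"}


def flushes_or_invalidates(tags):
    """Synchronizing operation: True if the cache must be flushed/invalidated."""
    # Only an unambiguous vulkan:nonprivate allows leaving the cache alone.
    return vulkan_suffixes(tags) != {"nonprivate"}


print(bypasses_cache({("vulkan", "private")}))             # False
print(bypasses_cache(set()))                               # True
print(flushes_or_invalidates({("vulkan", "nonprivate")}))  # False
print(flushes_or_invalidates(set()))                       # True
```

Both functions fall back to the conservative answer when the tags are missing or conflicting, which is what makes dropping the metadata safe but potentially slow.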
|  |  | 
|  | Memory Types | 
|  | ------------ | 
|  |  | 
|  | MMRAs may express the selective synchronization of | 
|  | different memory types. | 
|  |  | 
|  | As an example, a target may expose a ``sync-as:<N>`` tag to | 
|  | pass information about which address spaces are synchronized by the | 
|  | execution of a synchronizing operation. | 
|  |  | 
|  | .. note:: | 
|  | Address spaces are used here as a common example, but this concept | 
|  | can apply to other "memory types". What "memory types" means here is | 
|  | up to the target. | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | # let 1 = global address space | 
|  | # let 3 = local address space | 
|  |  | 
|  | Thread T1: | 
|  | A: store %ptr1                                  # sync-as:1 | 
|  | B: store %ptr2                                  # sync-as:3 | 
|  | X: store atomic release ptr addrspace(1) %ptr3  # sync-as:3 | 
|  |  | 
|  | Thread T2: | 
|  | Y: load atomic acquire ptr addrspace(1) %ptr3   # sync-as:3 | 
|  | C: load %ptr2                                   # sync-as:3 | 
|  | D: load %ptr1                                   # sync-as:1 | 
|  |  | 
|  | In the above example, ``X`` and ``Y`` are atomic operations on a | 
|  | location in the ``global`` address space. If ``X`` synchronizes with | 
|  | ``Y``, then ``B`` happens-before ``C`` in the ``local`` address | 
|  | space. But no such statement can be made about operations ``A`` and | 
|  | ``D``, although they are performed on a location in the ``global`` | 
|  | address space. | 
|  |  | 
|  | Implementation Example: Adding Address Space Information to Fences | 
|  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
|  |  | 
|  | Languages such as OpenCL C provide fence operations such as | 
|  | ``atomic_work_item_fence`` that can take an explicit address | 
|  | space to fence. | 
|  |  | 
|  | By default, LLVM has no means to carry that information in the IR, so | 
|  | the information is lost during lowering to LLVM IR. This means that | 
|  | targets such as AMDGPU have to conservatively emit instructions to | 
|  | fence all address spaces in all cases, which can have a noticeable | 
|  | performance impact in high-performance applications. | 
|  |  | 
|  | MMRAs may be used to preserve that information at the IR level, all the | 
|  | way through code generation. For example, a fence that only affects the | 
|  | global address space ``addrspace(1)`` may be lowered as | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | fence release # sync-as:1 | 
|  |  | 
|  | and the target may use the presence of ``sync-as:1`` to infer that it | 
|  | only needs to emit instructions to fence the global address space. | 
|  |  | 
|  | Note that as MMRAs are opt-in, a fence that does not have MMRA metadata | 
|  | could still be lowered conservatively, so this optimization would only | 
|  | apply if the front-end emits the MMRA metadata on the fence instructions. | 
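As an illustration of how a backend might use this information, the following Python sketch selects the address spaces a fence has to cover. It is a model under stated assumptions: the target is hypothetical and only knows the two address spaces named in this document, tags are ``(prefix, suffix)`` pairs, and the function name is made up for this example.

```python
# Hypothetical target with the two address spaces named in this document.
ADDRESS_SPACES = {1: "global", 3: "local"}


def address_spaces_to_fence(tags):
    """Return the address spaces a fence with these tags must cover."""
    selected = set()
    for prefix, suffix in tags:
        # Tags that are not recognized sync-as tags are ignored.
        if prefix == "sync-as" and suffix.isdigit() and int(suffix) in ADDRESS_SPACES:
            selected.add(int(suffix))
    # Without usable sync-as tags, conservatively fence everything.
    return selected or set(ADDRESS_SPACES)


print(sorted(address_spaces_to_fence({("sync-as", "1")})))  # [1]
print(sorted(address_spaces_to_fence(set())))               # [1, 3]
```

The second call shows the conservative fallback: a fence without MMRA metadata is still lowered as a fence on all address spaces.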
|  |  | 
|  | Additional Topics | 
|  | ================= | 
|  |  | 
|  | .. note:: | 
|  |  | 
|  | The following sections are informational. | 
|  |  | 
|  | Performance Impact | 
|  | ------------------ | 
|  |  | 
|  | MMRAs are a way to capture optimization opportunities in the program. | 
|  | But when an operation mentions no tags or conflicting tags, | 
|  | the target may need to produce conservative code to ensure correctness | 
|  | at the cost of performance. This can happen in the following situations: | 
|  |  | 
|  | 1. When a target first introduces MMRAs, the | 
|  | frontend might not have been updated to emit them. | 
|  | 2. An optimization may drop MMRA metadata. | 
|  | 3. An optimization may add arbitrary tags to an operation. | 
|  |  | 
|  | Note that targets can always choose to ignore (or even drop) MMRAs | 
|  | and revert to the default behavior/codegen heuristics without | 
|  | affecting correctness. | 
|  |  | 
|  | Consequences of the Absence of *happens-before* | 
|  | ----------------------------------------------- | 
|  |  | 
|  | In the :ref:`happens-before<HappensBefore>` section, we defined how a | 
|  | *happens-before* relation between two instructions can be broken | 
|  | by leveraging compatibility between MMRAs. When the instructions | 
|  | are incompatible and there is no *happens-before* relation, we say | 
|  | that the instructions "do not have to be ordered relative to each | 
|  | other". | 
|  |  | 
|  | "Ordering" in this context is a very broad term which covers both | 
|  | static and runtime aspects. | 
|  |  | 
|  | When there is no ordering constraint, we *could* statically reorder | 
|  | the instructions in an optimizer transform if the reordering does | 
|  | not break other constraints such as single-location coherence. | 
|  | Static reordering is one consequence of breaking *happens-before*, | 
|  | but is not the most interesting one. | 
|  |  | 
|  | Run-time consequences are more interesting. When there is a | 
|  | *happens-before* relation between instructions, the target has to emit | 
|  | synchronization code to ensure other threads will observe the effects of | 
|  | the instructions in the right order. | 
|  |  | 
|  | For instance, the target may have to wait for previous loads and stores to | 
|  | finish before starting a fence-release, or there may be a need to flush a | 
|  | memory cache before executing the next instruction. | 
|  | In the absence of *happens-before*, there is no such requirement and | 
|  | no waiting or flushing is required. This may noticeably speed up | 
|  | execution in some cases. | 
|  |  | 
|  | Combining Operations | 
|  | -------------------- | 
|  |  | 
|  | If a pass can combine multiple memory or synchronizing operations | 
|  | into one, it needs to be able to combine MMRAs. One possible way to | 
|  | achieve this is by doing a prefix-wise union of the tag sets. | 
|  |  | 
|  | Let A and B be two tag sets, and U be the prefix-wise union of A and B. | 
|  | For every unique tag prefix P present in A or B: | 
|  |  | 
|  | * If either A or B has no tags with prefix P, no tags with prefix | 
|  | P are added to U. | 
|  | * If both A and B have at least one tag with prefix P, all tags with prefix | 
|  | P from both sets are added to U. | 
|  |  | 
|  | Passes should avoid aggressively combining MMRAs, as this can result | 
|  | in significant losses of information. While this cannot affect | 
|  | correctness, it may affect performance. | 
|  |  | 
|  | As a general rule of thumb, common passes such as SimplifyCFG that | 
|  | aggressively combine/reorder operations should only combine | 
|  | instructions that have identical sets of tags. | 
|  | Passes that combine less frequently, or that are well aware of the cost | 
|  | of combining the MMRAs can use the prefix-wise union described above. | 
|  |  | 
|  | Examples: | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | A: store release %ptr1  # foo:x, foo:y, bar:x | 
|  | B: store release %ptr2  # foo:x, bar:y | 
|  |  | 
|  | # Unique prefixes P = [foo, bar] | 
|  | # A and B both have "foo" tags, so all "foo" tags are added to U. | 
|  | # A and B both have "bar" tags, so all "bar" tags are added to U. | 
|  | U: store release %ptr3  # foo:x, foo:y, bar:x, bar:y | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | A: store release %ptr1  # foo:x, foo:y | 
|  | B: store release %ptr2  # foo:x, bux:y | 
|  |  | 
|  | # Unique prefixes P = [foo, bux] | 
|  | # A and B both have "foo" tags, so all "foo" tags are added to U. | 
|  | # No tags have the prefix "bux" in A, so no "bux" tags are added. | 
|  | U: store release %ptr3  # foo:x, foo:y | 
|  |  | 
|  | .. code-block:: | 
|  |  | 
|  | A: store release %ptr1 | 
|  | B: store release %ptr2  # foo:x, bar:y | 
|  |  | 
|  | # Unique prefixes P = [foo, bar] | 
|  | # No tags with "foo" or "bar" in A, so no tags added. | 
|  | U: store release %ptr3 |
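The prefix-wise union can also be sketched in Python. The sketch follows the two rules as stated in this section (a prefix is kept only if both sets use it, and then all tags with that prefix from both sets are kept). Tags are modelled as ``(prefix, suffix)`` pairs and the function names are hypothetical, not an LLVM API.

```python
def prefixes(tags):
    """Return the set of prefixes used by a set of (prefix, suffix) tags."""
    return {prefix for prefix, _ in tags}


def combine(a, b):
    """Return the prefix-wise union of the tag sets a and b."""
    a, b = set(a), set(b)
    # A prefix survives only if both sets have at least one tag with it.
    common = prefixes(a) & prefixes(b)
    # For surviving prefixes, keep every tag from both sets.
    return {tag for tag in a | b if tag[0] in common}


a = {("scope", "workgroup"), ("sync-as", "1")}
b = {("scope", "device")}
print(sorted(combine(a, b)))      # [('scope', 'device'), ('scope', 'workgroup')]
print(sorted(combine(set(), b)))  # []
```

With this rule, the combined set is compatible with every set that either input is compatible with, so the combined operation is never less ordered than the operations it replaces.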