| =================================== | 
 | Memory Model Relaxation Annotations | 
 | =================================== | 
 |  | 
 | .. contents:: | 
 |    :local: | 
 |  | 
 | Introduction | 
 | ============ | 
 |  | 
 | Memory Model Relaxation Annotations (MMRAs) are target-defined properties | 
 | on instructions that can be used to selectively relax constraints placed | 
 | by the memory model. For example: | 
 |  | 
 | * The use of ``VulkanMemoryModel`` in a SPIRV program allows certain | 
 |   memory operations to be reordered across ``acquire`` or ``release`` | 
 |   operations. | 
 | * OpenCL APIs expose primitives to only fence a specific set of address | 
 |   spaces. Carrying that information to the backend can enable the | 
 |   use of faster synchronization instructions, rather than fencing all | 
 |   address spaces everytime. | 
 |  | 
 | MMRAs offer an opt-in system for targets to relax the default LLVM | 
 | memory model. | 
 | As such, they are attached to an operation using LLVM metadata which | 
 | can always be dropped without affecting correctness. | 
 |  | 
 | Definitions | 
 | =========== | 
 |  | 
 | memory operation | 
 |     A load, a store, an atomic, or a function call that is marked as | 
 |     accessing memory. | 
 |  | 
 | synchronizing operation | 
 |     An instruction that synchronizes memory with other threads (e.g. | 
 |     an atomic or a fence). | 
 |  | 
 | tag | 
 |     Metadata attached to a memory or synchronizing operation | 
 |     that represents some target-defined property regarding memory | 
 |     synchronization. | 
 |  | 
 |     An operation may have multiple tags that each represent a different | 
 |     property. | 
 |  | 
 |     A tag is composed of a pair of metadata string: a *prefix* and a *suffix*. | 
 |  | 
 |     In LLVM IR, the pair is represented using a metadata tuple. | 
 |     In other cases (comments, documentation, etc.), we may use the | 
 |     ``prefix:suffix`` notation. | 
 |     For example: | 
 |  | 
 |     .. code-block:: | 
 |       :caption: Example: Tags in Metadata | 
 |  | 
 |       !0 = !{!"scope", !"workgroup"}  # scope:workgroup | 
 |       !1 = !{!"scope", !"device"}     # scope:device | 
 |       !2 = !{!"scope", !"system"}     # scope:system | 
 |  | 
 |     .. note:: | 
 |  | 
 |       The only semantics relevant to the optimizer is the | 
 |       "compatibility" relation defined below. All other | 
 |       semantics are target defined. | 
 |  | 
 |     Tags can also be organised in lists to allow operations | 
 |     to specify all of the tags they belong to. Such a list | 
 |     is referred to as a "set of tags". | 
 |  | 
 |     .. code-block:: | 
 |       :caption: Example: Set of Tags in Metadata | 
 |  | 
 |       !0 = !{!"scope", !"workgroup"} | 
 |       !1 = !{!"sync-as", !"private"} | 
 |       !2 = !{!0, !2} | 
 |  | 
 |     .. note:: | 
 |  | 
 |       If an operation does not have MMRA metadata, it's treated as if | 
 |       it has an empty list (``!{}``) of tags. | 
 |  | 
 |     Note that it is not an error if a tag is not recognized by the | 
 |     instruction it is applied to, or by the current target. | 
 |     Such tags are simply ignored. | 
 |  | 
 |     Both synchronizing operations and memory operations can have | 
 |     zero or more tags attached to them using the ``!mmra`` syntax. | 
 |  | 
 |     For the sake of readability in examples below, | 
 |     we use a (non-functional) short syntax to represent MMMRA metadata: | 
 |  | 
 |     .. code-block:: | 
 |       :caption: Short Syntax Example | 
 |  | 
 |       store %ptr1 # foo:bar | 
 |       store %ptr1 !mmra !{!"foo", !"bar"} | 
 |  | 
 |     These two notations can be used in this document and are strictly | 
 |     equivalent. However, only the second version is functional. | 
 |  | 
 | compatibility | 
 |     Two sets of tags are said to be *compatible* iff, for every unique | 
 |     tag prefix P present in at least one set: | 
 |  | 
 |     - the other set contains no tag with prefix P, or | 
 |     - at least one tag with prefix P is common to both sets. | 
 |  | 
 |     The above definition implies that an empty set is always compatible | 
 |     with any other set. This is an important property as it ensures that | 
 |     if a transform drops the metadata on an operation, it can never affect | 
 |     correctness. In other words, the memory model cannot be relaxed further | 
 |     by deleting metadata from instructions. | 
 |  | 
 | .. _HappensBefore: | 
 |  | 
 | The *happens-before* Relation | 
 | ============================== | 
 |  | 
 | Compatibility checks can be used to opt out of the *happens-before* relation | 
 | established between two instructions. | 
 |  | 
 | Ordering | 
 |     When two instructions' metadata are not compatible, any program order | 
 |     between them are not in *happens-before*. | 
 |  | 
 |     For example, consider two tags ``foo:bar`` and | 
 |     ``foo:baz`` exposed by a target: | 
 |  | 
 |     .. code-block:: | 
 |  | 
 |        A: store %ptr1                 # foo:bar | 
 |        B: store %ptr2                 # foo:baz | 
 |        X: store atomic release %ptr3  # foo:bar | 
 |  | 
 |     In the above figure, ``A`` is compatible with ``X``, and hence ``A`` | 
 |     happens-before ``X``. But ``B`` is not compatible with | 
 |     ``X``, and hence it is not happens-before ``X``. | 
 |  | 
 | Synchronization | 
 |     If an synchronizing operation has one or more tags, then whether it | 
 |     synchronizes-with and participates in the  ``seq_cst`` order with | 
 |     other operations is target dependent. | 
 |  | 
 |     Whether the following example synchronizes with another sequence depends | 
 |     on the target-defined semantics of ``foo:bar`` and ``foo:bux``. | 
 |  | 
 |     .. code-block:: | 
 |  | 
 |        fence release               # foo:bar | 
 |        store atomic %ptr1          # foo:bux | 
 |  | 
 | Examples | 
 | -------- | 
 |  | 
 | Example 1: | 
 |     .. code-block:: | 
 |  | 
 |       A: store ptr addrspace(1) %ptr2                  # sync-as:1 vulkan:nonprivate | 
 |       B: store atomic release ptr addrspace(1) %ptr3   # sync-as:0 vulkan:nonprivate | 
 |  | 
 |     A and B are not ordered relative to each other | 
 |     (no *happens-before*) because their sets of tags are not compatible. | 
 |  | 
 |     Note that the ``sync-as`` value does not have to match the ``addrspace`` value. | 
 |     e.g. In Example 1, a store-release to a location in ``addrspace(1)`` wants to | 
 |     only synchronize with operations happening in ``addrspace(0)``. | 
 |  | 
 | Example 2: | 
 |     .. code-block:: | 
 |  | 
 |       A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate | 
 |       B: store atomic release ptr addrspace(1) %ptr3  # sync-as:1 vulkan:nonprivate | 
 |  | 
 |     The ordering of A and B is unaffected because their set of tags are | 
 |     compatible. | 
 |  | 
 |     Note that A and B may or may not be in *happens-before* due to other reasons. | 
 |  | 
 | Example 3: | 
 |     .. code-block:: | 
 |  | 
 |       A: store ptr addrspace(1) %ptr2                 # sync-as:1 vulkan:nonprivate | 
 |       B: store atomic release ptr addrspace(1) %ptr3  # vulkan:nonprivate | 
 |  | 
 |     The ordering of A and B is unaffected because their set of tags are | 
 |     compatible. | 
 |  | 
 | Example 4: | 
 |     .. code-block:: | 
 |  | 
 |       A: store ptr addrspace(1) %ptr2                 # sync-as:1 | 
 |       B: store atomic release ptr addrspace(1) %ptr3  # sync-as:2 | 
 |  | 
 |     A and B do not have to be ordered relative to each other | 
 |     (no *happens-before*) because their sets of tags are not compatible. | 
 |  | 
 | Use-cases | 
 | ========= | 
 |  | 
 | SPIRV ``NonPrivatePointer`` | 
 | --------------------------- | 
 |  | 
 | MMRAs can support the SPIRV capability | 
 | ``VulkanMemoryModel``, where synchronizing operations only affect | 
 | memory operations that specify ``NonPrivatePointer`` semantics. | 
 |  | 
 | The example below is generated from a SPIRV program using the | 
 | following recipe: | 
 |  | 
 | - Add ``vulkan:nonprivate`` to every synchronizing operation. | 
 | - Add ``vulkan:nonprivate`` to every non-atomic memory operation | 
 |   that is marked ``NonPrivatePointer``. | 
 | - Add ``vulkan:private`` to tags of every non-atomic memory operation | 
 |   that is not marked ``NonPrivatePointer``. | 
 |  | 
 | .. code-block:: | 
 |  | 
 |    Thread T1: | 
 |     A: store %ptr1                 # vulkan:nonprivate | 
 |     B: store %ptr2                 # vulkan:private | 
 |     X: store atomic release %ptr3  # vulkan:nonprivate | 
 |  | 
 |    Thread T2: | 
 |     Y: load atomic acquire %ptr3   # vulkan:nonprivate | 
 |     C: load %ptr2                  # vulkan:private | 
 |     D: load %ptr1                  # vulkan:nonprivate | 
 |  | 
 | Compatibility ensures that operation ``A`` is ordered | 
 | relative to ``X`` while operation ``D`` is ordered relative to ``Y``. | 
 | If ``X`` synchronizes with ``Y``, then ``A`` happens-before ``D``. | 
 | No such relation can be inferred about operations ``B`` and ``C``. | 
 |  | 
 | .. note:: | 
 |    The `Vulkan Memory Model <https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#memory-model-non-private>`_ | 
 |    considers all atomic operation non-private. | 
 |  | 
 |    Whether ``vulkan:nonprivate`` would be specified on atomic operations is | 
 |    an implementation detail, as an atomic operation is always ``nonprivate``. | 
 |    The implementation may choose to be explicit and emit IR with | 
 |    ``vulkan:nonprivate`` on every atomic operation, or it could choose to | 
 |    only emit ``vulkan::private`` and assume ``vulkan:nonprivate`` | 
 |    by default. | 
 |  | 
 | Operations marked with ``vulkan:private`` effectively opt out of the | 
 | happens-before order in a SPIRV program since they are incompatible | 
 | with every synchronizing operation. Note that SPIRV operations that | 
 | are not marked ``NonPrivatePointer`` are not entirely private to the | 
 | thread --- they are implicitly synchronized at the start or end of a | 
 | thread by the Vulkan *system-synchronizes-with* relationship. This | 
 | example assumes that the target-defined semantics of | 
 | ``vulkan:private`` correctly implements this property. | 
 |  | 
 | This scheme is general enough to express the interoperability of SPIRV | 
 | programs with other environments. | 
 |  | 
 | .. code-block:: | 
 |  | 
 |    Thread T1: | 
 |    A: store %ptr1                 # vulkan:nonprivate | 
 |    X: store atomic release %ptr2  # vulkan:nonprivate | 
 |  | 
 |    Thread T2: | 
 |    Y: load atomic acquire %ptr2   # foo:bar | 
 |    B: load %ptr1 | 
 |  | 
 | In the above example, thread ``T1`` originates from a SPIRV program | 
 | while thread ``T2`` originates from a non-SPIRV program. Whether ``X`` | 
 | can synchronize with ``Y`` is target defined.  If ``X`` synchronizes | 
 | with ``Y``, then ``A`` happens before ``B`` (because A/X and | 
 | Y/B are compatible). | 
 |  | 
 | Implementation Example | 
 | ~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Consider the implementation of SPIRV ``NonPrivatePointer`` on a target | 
 | where all memory operations are cached, and the entire cache is | 
 | flushed or invalidated at a ``release`` or ``acquire`` respectively. A | 
 | possible scheme is that when translating a SPIRV program, memory | 
 | operations marked ``NonPrivatePointer`` should not be cached, and the | 
 | cache contents should not be touched during an ``acquire`` and | 
 | ``release`` operation. | 
 |  | 
 | This could be implemented using the tags that share the ``vulkan:`` prefix, | 
 | as follows: | 
 |  | 
 | - For memory operations: | 
 |  | 
 |   - Operations with ``vulkan:nonprivate`` should bypass the cache. | 
 |   - Operations with ``vulkan:private`` should be cached. | 
 |   - Operations that specify neither or both should conservatively | 
 |     bypass the cache to ensure correctness. | 
 |  | 
 | - For synchronizing operations: | 
 |  | 
 |   - Operations with ``vulkan:nonprivate`` should not flush or | 
 |     invalidate the cache. | 
 |   - Operations with ``vulkan:private`` should flush or invalidate the cache. | 
 |   - Operations that specify neither or both should conservatively | 
 |     flush or invalidate the cache to ensure correctness. | 
 |  | 
 | .. note:: | 
 |    In such an implementation, dropping the metadata on an operation, while | 
 |    not affecting correctness, may have big performance implications. | 
 |    e.g. an operation bypasses the cache when it shouldn't. | 
 |  | 
 | Memory Types | 
 | ------------ | 
 |  | 
 | MMRAs may express the selective synchronization of | 
 | different memory types. | 
 |  | 
 | As an example, a target may expose an ``sync-as:<N>`` tag to | 
 | pass information about which address spaces are synchronized by the | 
 | execution of a synchronizing operation. | 
 |  | 
 | .. note:: | 
 |   Address spaces are used here as a common example, but this concept | 
 |   can apply for other "memory types". What "memory types" means here is | 
 |   up to the target. | 
 |  | 
 | .. code-block:: | 
 |  | 
 |    # let 1 = global address space | 
 |    # let 3 = local address space | 
 |  | 
 |    Thread T1: | 
 |    A: store %ptr1                                  # sync-as:1 | 
 |    B: store %ptr2                                  # sync-as:3 | 
 |    X: store atomic release ptr addrspace(0) %ptr3  # sync-as:3 | 
 |  | 
 |    Thread T2: | 
 |    Y: load atomic acquire ptr addrspace(0) %ptr3   # sync-as:3 | 
 |    C: load %ptr2                                   # sync-as:3 | 
 |    D: load %ptr1                                   # sync-as:1 | 
 |  | 
 | In the above figure, ``X`` and ``Y`` are atomic operations on a | 
 | location in the ``global``  address space. If ``X`` synchronizes with | 
 | ``Y``, then ``B`` happens-before ``C`` in the ``local`` address | 
 | space. But no such statement can be made about operations ``A`` and | 
 | ``D``, although they are peformed on a location in the ``global`` | 
 | address space. | 
 |  | 
 | Implementation Example: Adding Address Space Information to Fences | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 |  | 
 | Languages such as OpenCL C provide fence operations such as | 
 | ``atomic_work_item_fence`` that can take an explicit address | 
 | space to fence. | 
 |  | 
 | By default, LLVM has no means to carry that information in the IR, so | 
 | the information is lost during lowering to LLVM IR. This means that | 
 | targets such as AMDGPU have to conservatively emit instructions to | 
 | fence all address spaces in all cases, which can have a noticeable | 
 | performance impact in high-performance applications. | 
 |  | 
 | MMRAs may be used to preserve that information at the IR level, all the | 
 | way through code generation. For example, a fence that only affects the | 
 | global address space ``addrspace(1)`` may be lowered as | 
 |  | 
 | .. code-block:: | 
 |  | 
 |     fence release # sync-as:1 | 
 |  | 
 | and the target may use the presence of ``sync-as:1`` to infer that it | 
 | must only emit instruction to fence the global address space. | 
 |  | 
 | Note that as MMRAs are opt in, a fence that does not have MMRA metadata | 
 | could still be lowered conservatively, so this optimization would only | 
 | apply if the front-end emits the MMRA metadata on the fence instructions. | 
 |  | 
 | Additional Topics | 
 | ================= | 
 |  | 
 | .. note:: | 
 |  | 
 |   The following sections are informational. | 
 |  | 
 | Performance Impact | 
 | ------------------ | 
 |  | 
 | MMRAs are a way to capture optimization opportunities in the program. | 
 | But when an operation mentions no tags or conflicting tags, | 
 | the target may need to produce conservative code to ensure correctness | 
 | at the cost of performance. This can happen in the following situations: | 
 |  | 
 | 1. When a target first introduces MMRAs, the | 
 |    frontend might not have been updated to emit them. | 
 | 2. An optimization may drop MMRA metadata. | 
 | 3. An optimization may add arbitrary tags to an operation. | 
 |  | 
 | Note that targets can always choose to ignore (or even drop) MMRAs | 
 | and revert to the default behavior/codegen heuristics without | 
 | affecting correctness. | 
 |  | 
 | Consequences of the Absence of *happens-before* | 
 | ----------------------------------------------- | 
 |  | 
 | In the :ref:`happens-before<HappensBefore>` section, we defined how an | 
 | *happens-before* relation between two instruction can be broken | 
 | by leveraging compatibility between MMRAs. When the instructions | 
 | are incompatible and there is no *happens-before* relation, we say | 
 | that the instructions "do not have to be ordered relative to each | 
 | other". | 
 |  | 
 | "Ordering" in this context is a very broad term which covers both | 
 | static and runtime aspects. | 
 |  | 
 | When there is no ordering constraint, we *could* statically reorder | 
 | the instructions in an optimizer transform if the reordering does | 
 | not break other constraints as single location coherence. | 
 | Static reordering is one consequence of breaking *happens-before*, | 
 | but is not the most interesting one. | 
 |  | 
 | Run-time consequences are more interesting. When there is an | 
 | *happens-before* relation between instructions, the target has to emit | 
 | synchronization code to ensure other threads will observe the effects of | 
 | the instructions in the right order. | 
 |  | 
 | For instance, the target may have to wait for previous loads & stores to | 
 | finish before starting a fence-release, or there may be a need to flush a | 
 | memory cache before executing the next instruction. | 
 | In the absence of *happens-before*, there is no such requirement and | 
 | no waiting or flushing is required. This may noticeably speed up | 
 | execution in some cases. | 
 |  | 
 | Combining Operations | 
 | -------------------- | 
 |  | 
 | If a pass can combine multiple memory or synchronizing operations | 
 | into one, it needs to be able to combine MMRAs. One possible way to | 
 | achieve this is by doing a prefix-wise union of the tag sets. | 
 |  | 
 | Let A and B be two tags set, and U be the prefix-wise union of A and B. | 
 | For every unique tag prefix P present in A or B: | 
 |  | 
 | * If either A or B has no tags with prefix P, no tags with prefix | 
 |   P are added to U. | 
 | * If both A and B have at least one tag with prefix P, all tags with prefix | 
 |   P from both sets are added to U. | 
 |  | 
 | Passes should avoid aggressively combining MMRAs, as this can result | 
 | in significant losses of information. While this cannot affect | 
 | correctness, it may affect performance. | 
 |  | 
 | As a general rule of thumb, common passes such as SimplifyCFG that | 
 | aggressively combine/reorder operations should only combine | 
 | instructions that have identical sets of tags. | 
 | Passes that combine less frequently, or that are well aware of the cost | 
 | of combining the MMRAs can use the prefix-wise union described above. | 
 |  | 
 | Examples: | 
 |  | 
 | .. code-block:: | 
 |  | 
 |     A: store release %ptr1  # foo:x, foo:y, bar:x | 
 |     B: store release %ptr2  # foo:x, bar:y | 
 |  | 
 |     # Unique prefixes P = [foo, bar] | 
 |     # "foo:x" is common to A and B so it's added to U. | 
 |     # "bar:x" != "bar:y" so it's not added to U. | 
 |     U: store release %ptr3  # foo:x | 
 |  | 
 | .. code-block:: | 
 |  | 
 |     A: store release %ptr1  # foo:x, foo:y | 
 |     B: store release %ptr2  # foo:x, bux:y | 
 |  | 
 |     # Unique prefixes P = [foo, bux] | 
 |     # "foo:x" is common to A and B so it's added to U. | 
 |     # No tags have the prefix "bux" in A. | 
 |     U: store release %ptr3  # foo:x | 
 |  | 
 | .. code-block:: | 
 |  | 
 |     A: store release %ptr1 | 
 |     B: store release %ptr2  # foo:x, bar:y | 
 |  | 
 |     # Unique prefixes P = [foo, bar] | 
 |     # No tags with "foo" or "bar" in A, so no tags added. | 
 |     U: store release %ptr3 |