
 .. _amdgpu-execution-synchronization:

 ================================
 AMDGPU Execution Synchronization
 ================================

 .. contents::
    :local:

 .. _amdgpu-execution-synchronization-barriers:

 This document covers different ways of synchronizing execution of threads on AMD GPUs.

 .. note::

   This document is not exhaustive. There may be more ways of synchronizing execution
   that are not covered by this document.

 Barriers
 ========

 This section covers execution synchronization using barrier-style primitives.

 .. _amdgpu-execution-synchronization-barriers-execution-model:

 Execution Model
 ---------------

 This section contains a formal execution model that can be used to model the behavior of
 barriers on AMDGPU targets.

 .. note::

   The barrier execution model is experimental and subject to change.

 Threads can synchronize execution by performing barrier operations on barrier *objects* as described below:

 * Each barrier *object* has the following state:

   * An unsigned positive integer *expected count*: counts the number of *arrive* operations
     expected for this barrier *object*.
   * A non-negative integer *arrive count*: counts the number of *arrive* operations
     already performed on this barrier *object*.

       * The initial value of *arrive count* is zero.
       * When an operation causes *arrive count* to be equal to *expected count*, the barrier is completed,
         and the *arrive count* is reset to zero.

 * Barrier *objects* exist within a *scope* (see :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`),
   and each instance of a barrier *object* can only be accessed by threads in the same *scope* instance.
 * *Barrier-mutually-exclusive* is a symmetric relation between barrier *objects* that share resources
   in a way that restricts how a thread can use them at the same time.
 * Barrier operations are performed on barrier *objects*. A barrier operation is a dynamic instance
   of one of the following:

   * Barrier *init*

     * Barrier *init* takes an additional unsigned positive integer argument *k*.
     * Sets the *expected count* of the *barrier object* to *k*.
     * Resets the *arrive count* of the *barrier object* to zero.

   * Barrier *join*.

     * Allows the thread that executes the operation to *wait* on a barrier *object*.

   * Barrier *drop*.

     * Decrements *expected count* of the barrier *object* by one.

   * Barrier *arrive*.

     * Increments the *arrive count* of the barrier *object* by one.
     * If supported, an additional argument to *arrive* can also update the *expected count* of the
       barrier *object* before the *arrive count* is incremented;
       the new *expected count* cannot be less than or equal to the *arrive count*,
       otherwise the behavior is undefined.

   * Barrier *wait*.

     * Introduces execution dependencies between threads; this operation depends on
       other barrier operations to complete.

 * Barrier modification operations are barrier operations that modify the barrier *object* state:

   * Barrier *init*.
   * Barrier *drop*.
   * Barrier *arrive*.

 * *Thread-barrier-order<BO>* is the subset of *program-order* that only
   relates barrier operations performed on a barrier *object* ``BO``.
 * All barrier modification operations on a barrier *object* ``BO`` occur in a strict total order called
   *barrier-modification-order<BO>*; it is the order in which ``BO`` observes barrier
   operations that change its state. For any valid *barrier-modification-order<BO>*, the
   following must be true:

   * Let ``A`` and ``B`` be two barrier modification operations where ``A -> B`` in
     *thread-barrier-order<BO>*, then ``A -> B`` is also in *barrier-modification-order<BO>*.
   * The first element in *barrier-modification-order<BO>* is always a barrier *init*, otherwise
     the behavior is undefined.

 * *barrier-participates-in* relates barrier operations to the barrier *waits* that depend on them
   to complete. A barrier operation ``X`` *barrier-participates-in* a barrier *wait* ``W``
   if and only if all of the following are true:

   * ``X`` and ``W`` are both performed on the same barrier *object* ``BO``.
   * ``X`` is a barrier *arrive* or *drop* operation.
   * ``X`` does not *barrier-participate-in* another distinct barrier *wait* ``W'`` in the same thread as ``W``.
   * ``W -> X`` not in *thread-barrier-order<BO>*.
   * All dependent constraints and relations are satisfied as well. [0]_

 * For the set ``S`` consisting of all barrier operations that *barrier-participate-in* a barrier *wait* ``W`` for some
   barrier *object* ``BO``:

   * The elements of ``S`` all exist in a continuous, uninterrupted interval of *barrier-modification-order<BO>*.
   * The *arrive count* of ``BO`` is zero before the first operation of ``S`` in *barrier-modification-order<BO>*.
   * The *arrive count* and *expected count* of ``BO`` are equal after the last operation of ``S`` in
     *barrier-modification-order<BO>*. The *arrive count* and *expected count* of ``BO`` cannot
     be equal at any other point in ``S``.

 * A barrier *join* ``J`` is *barrier-joined-before* a barrier operation ``X`` if and only if all
   of the following are true:

   * ``J -> X`` in *thread-barrier-order<BO>*.
   * ``X`` is not a barrier *join*.
   * There is no barrier *join* or *drop* ``JD`` where ``J -> JD -> X`` in *thread-barrier-order<BO>*.
   * There is no barrier *join* ``J'`` on a distinct barrier *object* ``BO'`` such that ``J -> J' -> X`` in
     *program-order*, and ``BO`` *barrier-mutually-exclusive* ``BO'``.

 * A barrier operation ``A`` *barrier-executes-before* another barrier operation ``B`` if any of the
   following is true:

   * ``A -> B`` in *program-order*.
   * ``A -> B`` in *barrier-participates-in*.
   * ``A`` *barrier-executes-before* some barrier operation ``X``, and ``X``
     *barrier-executes-before* ``B``.

 * *Barrier-executes-before* is consistent with *barrier-modification-order<BO>*
   for every barrier object ``BO``.
 * For every barrier *drop* ``D`` performed on a barrier *object* ``BO``:

   * There is a barrier *join* ``J`` such that ``J -> D`` in *barrier-joined-before*;
     otherwise, the behavior is undefined.
   * ``D`` cannot cause the *expected count* of ``BO`` to become negative; otherwise, the behavior is undefined.

 * For every pair of barrier *arrive* ``A`` and barrier *drop* ``D`` performed on a barrier *object*
   ``BO``, such that ``A -> D`` in *thread-barrier-order<BO>*, one of the following must be true:

   * ``A`` does not *barrier-participate-in* any barrier *wait*.
   * ``A`` *barrier-participates-in* at least one barrier *wait* ``W``
     such that ``W -> D`` in *barrier-executes-before*.

 * For every barrier *wait* ``W`` performed on a barrier *object* ``BO``:

   * There is a barrier *join* ``J`` such that ``J -> W`` in *barrier-joined-before*, and
     ``J`` must *barrier-executes-before* at least one operation ``X`` that
     *barrier-participates-in* ``W``; otherwise, the behavior is undefined.

 * *barrier-phase-with* is a symmetric relation over barrier operations defined as the
   transitive closure of: *barrier-participates-in* and its inverse relation.
 * For every barrier operation ``A`` that *barrier-participates-in* a barrier *wait* ``W`` on a barrier *object* ``BO``:

   * There is no barrier operation ``X`` on ``BO`` such that ``A -> X -> W`` in
     *barrier-executes-before*, and ``X`` *barrier-phase-with* a non-empty set of operations
     that does not include ``W``.

 .. note::

   Barriers only synchronize execution and do not affect the visibility of memory operations between threads.
   Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
   to determine how to synchronize memory operations through *barrier-executes-before*.


 .. [0] The definition of *barrier-participates-in* (in its current state) is non-deterministic and
        will be improved in the future: Within a valid execution, there may be multiple ways
        to build *barrier-participates-in*, however there is only one way to build it that also satisfies all
        other relations and constraints that depend on *barrier-participates-in* and relations derived from it.
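
 The state transitions of a barrier *object* described above can be illustrated with a small
 executable sketch. This is an informal Python model, not part of the formal model and not an
 AMDGPU API: the ``BarrierObject`` class, its method names, and the ``completed_phases`` counter
 are hypothetical and exist only for illustration. Barrier *join* and *wait* are not modeled,
 and cases that the model declares undefined behavior are reported as exceptions.

 .. code-block:: python

    class BarrierObject:
        """Toy model of the state of one barrier object (hypothetical names)."""

        def __init__(self):
            self.expected_count = None  # unknown until the first barrier init
            self.arrive_count = 0       # the initial arrive count is zero
            self.completed_phases = 0   # illustration only, not part of the model

        def init(self, k):
            """Barrier init: set the expected count to k, reset the arrive count."""
            if k <= 0:
                raise ValueError("k must be a positive integer")
            self.expected_count = k
            self.arrive_count = 0

        def arrive(self):
            """Barrier arrive: increment the arrive count by one."""
            self._require_init()
            self.arrive_count += 1
            self._complete_if_counts_match()

        def drop(self):
            """Barrier drop: decrement the expected count by one."""
            self._require_init()
            if self.expected_count == 0:
                raise RuntimeError("undefined behavior: expected count would be negative")
            self.expected_count -= 1
            self._complete_if_counts_match()

        def _require_init(self):
            if self.expected_count is None:
                raise RuntimeError("undefined behavior: barrier object is not initialized")

        def _complete_if_counts_match(self):
            # When an operation makes both counts equal, the barrier completes
            # and the arrive count is reset to zero.
            if self.arrive_count == self.expected_count:
                self.arrive_count = 0
                self.completed_phases += 1


    barrier = BarrierObject()
    barrier.init(3)
    barrier.arrive()
    barrier.arrive()
    barrier.drop()  # expected count becomes 2, which matches the arrive count

 In this run the *drop* lowers the *expected count* to two, so the two earlier *arrive*
 operations are enough to complete the barrier and the *arrive count* returns to zero.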

 Informational Notes
 ~~~~~~~~~~~~~~~~~~~

 Informally, we can deduce from the above formal model that execution barriers behave as follows:

 * *Barrier-executes-before* relates the dynamic instances of operations from different threads together.
   For example, if ``A -> B`` in *barrier-executes-before*, then the execution of ``A`` must complete
   before the execution of ``B`` can complete.

   * This property can also be combined with *program-order*. For example, let ``X`` and ``Y`` be two
     (non-barrier) operations where ``X -> A`` and ``B -> Y`` in *program-order*; then the execution
     of ``X`` completes before the execution of ``Y`` does.

 * Barriers do not complete "out-of-thin-air"; a barrier *wait* ``W`` cannot depend on a barrier operation
   ``X`` to complete if ``W -> X`` in *barrier-executes-before*.
 * It is undefined behavior to operate on an uninitialized barrier object.
 * It is undefined behavior for a barrier *wait* to never complete.
 * It is not mandatory to *drop* a barrier after *joining* it.
 * A thread may not *arrive* and then *drop* a barrier *object* unless the barrier completes before the
   barrier *drop*. Incrementing the *arrive count* and decrementing the *expected count* directly
   after may cause undefined behavior.
 * *Joining* a barrier is only useful if the thread will *wait* on that same barrier *object* later.
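
 The ordering argument in the first note can be sketched as a reachability computation over
 two edge sets. The Python snippet below is only an illustration under simplifying assumptions:
 the ``executes_before`` function and the operation names are hypothetical, operations are plain
 strings, and non-barrier operations are included in the closure so that the combination with
 *program-order* is visible directly.

 .. code-block:: python

    def executes_before(program_order, participates_in):
        """Return the transitive closure of both edge sets as a set of pairs."""
        closure = set(program_order) | set(participates_in)
        changed = True
        while changed:
            changed = False
            for a, b in list(closure):
                for c, d in list(closure):
                    if b == c and (a, d) not in closure:
                        closure.add((a, d))
                        changed = True
        return closure


    # Thread 0 runs X, then the arrive A. Thread 1 runs the wait W, then Y.
    program_order = [("X", "A"), ("W", "Y")]
    participates_in = [("A", "W")]  # A barrier-participates-in W
    order = executes_before(program_order, participates_in)

 Here ``("X", "Y")`` ends up in ``order``, while ``("W", "A")`` does not: the *wait* depends on
 the *arrive*, never the other way around.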

 Barrier Implementations on AMDGPU Targets
 -----------------------------------------

 ``s_barrier``
 ~~~~~~~~~~~~~

 The ``s_barrier`` family of instructions is the primary barrier implementation on AMD GPUs.

 ``s_barrier`` instructions can only be used to synchronize threads at a wavefront granularity.
 ``s_barrier`` instructions are convergent within a wave, and thus can only be performed
 in wave-uniform control flow.

 The ``s_barrier`` family of instructions is available in some form on all GFX targets,
 and has evolved over time. The sub-sections below cover the capabilities offered by every major
 iteration of this feature separately.

 GFX6-11
 +++++++

 Targets from GFX6 through GFX11 (inclusive) do not have the "split barrier" feature.
 The barrier *arrive* and barrier *wait* operations **cannot** be performed independently
 using ``s_barrier``.

 There is only one *workgroup barrier* object of ``workgroup`` scope that is implicitly used
 by all ``s_barrier`` instructions.

 The following code sequences can be used to implement the barrier operations defined by the
 :ref:`execution synchronization model<amdgpu-execution-synchronization-barriers-execution-model>` using
 ``s_barrier`` on GFX6 through GFX11:

 .. table:: s_barrier GFX6-11
     :name: amdgpu-execution-synchronization-barriers-sbarrier-gfx6-11
     :widths: 15 15 70

     ===================== ====================== ===========================================================
     Barrier Operation(s)  Barrier *Object*       AMDGPU Machine Code
     ===================== ====================== ===========================================================
     **Init, Join and Drop**
     --------------------------------------------------------------------------------------------------------
     *init*                - *Workgroup barrier*  Automatically initialized by the hardware when a workgroup
                                                  is launched. The *expected count* of this barrier is set
                                                  to the number of waves in the workgroup.

     *join*                - *Workgroup barrier*  Any thread launched within a workgroup automatically *joins*
                                                  this barrier *object*.

     *drop*                - *Workgroup barrier*  When a thread ends, it automatically *drops* this barrier
                                                  *object* if it had previously *joined* it.

     **Arrive and Wait**
     --------------------------------------------------------------------------------------------------------
     *arrive* then *wait*  - *Workgroup barrier*  | **BackOffBarrier**
                                                  | ``s_barrier``
                                                  | **No BackOffBarrier**
                                                  | ``s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)``
                                                  | ``s_waitcnt_vscnt null, 0x0``
                                                  | ``s_barrier``

                                                  - If the target does not have the BackOffBarrier feature,
                                                    then there cannot be any outstanding memory operations
                                                    before issuing the ``s_barrier`` instruction.
                                                  - The waitcnts can independently be moved earlier, or
                                                    removed entirely as long as the associated
                                                    counter remains at zero before issuing the
                                                    ``s_barrier`` instruction.
                                                  - The ``s_barrier`` instruction cannot complete
                                                    before all waves of the workgroup have launched.

     *arrive*              - *Workgroup barrier*  Not available separately, see *arrive* then *wait*

     *wait*                - *Workgroup barrier*  Not available separately, see *arrive* then *wait*
     ===================== ====================== ===========================================================
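
 The choice between the two *arrive* then *wait* sequences in the table can be written down as
 a small helper. This Python sketch is illustrative only: ``arrive_then_wait_sequence`` and its
 ``has_back_off_barrier`` parameter are hypothetical names, and the instruction strings are
 copied verbatim from the table above. It always emits the full set of waitcnts, even though
 the table allows moving or removing them when the counters are already known to be zero.

 .. code-block:: python

    def arrive_then_wait_sequence(has_back_off_barrier):
        """Return the machine code lines for *arrive* then *wait* on GFX6-11."""
        if has_back_off_barrier:
            # Outstanding memory operations are allowed before the barrier.
            return ["s_barrier"]
        # Without BackOffBarrier, all memory counters must be zero first.
        return [
            "s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)",
            "s_waitcnt_vscnt null, 0x0",
            "s_barrier",
        ]


    with_feature = arrive_then_wait_sequence(True)
    without_feature = arrive_then_wait_sequence(False)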

 GFX12
 +++++

 GFX12 targets have the split-barrier feature, and also allow ``s_barrier`` instructions to use
 one of multiple barrier *objects* available per workgroup. ``s_barrier`` instructions use the
 barrier ID operand to determine the barrier *object* they operate on.

 GFX12.5 additionally introduces new barrier *objects* that offer more flexibility for synchronizing the execution
 of a subset of waves of a workgroup, or synchronizing execution across workgroups within a workgroup cluster, via
 ``s_barrier``.

 .. note::

   Check :ref:`the table below<amdgpu-execution-synchronization-barriers-sbarrier-ids-gfx12>` to determine
   which barrier IDs are available to ``s_barrier`` instructions on a given target.

 The following code sequences can be used to implement the barrier operations defined by the
 :ref:`execution synchronization model<amdgpu-execution-synchronization-barriers-execution-model>` using
 ``s_barrier`` on GFX12.0 and up:

 .. table:: s_barrier GFX12
     :name: amdgpu-execution-synchronization-barriers-sbarrier-gfx2
     :widths: 15 15 70

     ===================== =========================== ===========================================================
     Barrier Operation(s)  Barrier ID                  AMDGPU Machine Code
     ===================== =========================== ===========================================================
     **Init, Join and Drop**
     -------------------------------------------------------------------------------------------------------------
     *init*                - ``-2``, ``-1``            Automatically initialized by the hardware when a workgroup
                                                       is launched. The *expected count* of this barrier is set
                                                       to the number of waves in the workgroup.

     *init*                - ``-4``, ``-3``            Automatically initialized by the hardware when a workgroup
                                                       is launched as part of a workgroup cluster.
                                                       The *expected count* of this barrier is set to the number
                                                       of workgroups in the workgroup cluster.

     *init*                - ``0``                     Automatically initialized by the hardware and always
                                                       available. This barrier *object* is opaque and immutable
                                                       as all operations other than barrier *join* are no-ops.

     *init*                - ``[1, 16]``               | ``s_barrier_init <N>``

                                                       - ``<N>`` is an immediate constant, or stored in the lower
                                                         half of ``m0``.
                                                       - The value to set as the *expected count* of the barrier
                                                         is stored in the upper half of ``m0``.

     *join*                - ``-2``, ``-1``            Any thread launched within a workgroup automatically *joins*
                                                       this barrier *object*.

     *join*                - ``-4``, ``-3``            Any thread launched within a workgroup cluster
                                                       automatically *joins* this barrier *object*.

     *join*                - ``0``                     | ``s_barrier_join <N>``
                           - ``[1, 16]``
                                                       - ``<N>`` is an immediate constant, or stored in the lower
                                                         half of ``m0``.

     *drop*                - ``0``                     | ``s_barrier_leave``
                           - ``[1, 16]``
                                                       - ``s_barrier_leave`` takes no operand. It can only be used
                                                         to *drop* a barrier *object* ``BO`` if ``BO`` was
                                                         previously *joined* using ``s_barrier_join``.
                                                       - *Drops* the barrier *object* ``BO`` if and only if
                                                         there is a barrier *join* ``J`` such that ``J`` is
                                                         *barrier-joined-before* this barrier
                                                         *drop* operation.

     *drop*                - ``-2``, ``-1``            When a thread ends, it automatically *drops* this barrier
                           - ``-4``, ``-3``            *object* if it had previously *joined* it.

     **Arrive and Wait**
     -------------------------------------------------------------------------------------------------------------

     *arrive*              - ``-4``, ``-3``            | ``s_barrier_signal <N>``
                           - ``-2``, ``-1``            | Or
                           - ``0``                     | ``s_barrier_signal_isfirst <N>``
                           - ``[1, 16]``
                                                       - ``<N>`` is an immediate constant, or stored in bits ``[4:0]`` of ``m0``.
                                                       - The ``_isfirst`` variant sets ``SCC=1`` if this wave is the first
                                                         to signal the barrier, otherwise ``SCC=0``.
                                                       - For barrier *objects* ``[1, 16]``: When using ``m0`` as an operand,
                                                         if there is a non-zero value contained in the bits ``[22:16]`` of ``m0``,
                                                         the *expected count* of the barrier *object* is set to that value before
                                                         the *arrive count* of the barrier *object* is incremented.
                                                         The new *expected count* value must be greater than or equal to the
                                                         *arrive count*, otherwise the behavior is undefined.
                                                       - For barrier *objects* ``-4`` and ``-3``
                                                         (``cluster`` barriers): only one wave
                                                         per workgroup may arrive at the barrier on behalf of
                                                         its entire workgroup. However, any wave within the workgroup
                                                         cluster can then *wait* on this barrier *object*.
                                                       - This is a no-op on the *NULL named barrier object*
                                                         (barrier *object* ``0``).

     *wait*                - ``-4``, ``-3``            ``s_barrier_wait <N>``.
                           - ``-2``, ``-1``
                           - ``0``                     - ``<N>`` is an immediate constant.
                           - ``[1, 16]``               - For barrier *objects* ``-2`` and ``-1``: This instruction
                                                         cannot complete before all waves of the
                                                         workgroup have launched.
                                                       - For barrier *objects* ``-4`` and ``-3`` (``cluster`` barriers):
                                                         This instruction cannot complete before all waves of the
                                                         workgroup cluster have launched.
                                                       - This is a no-op on the *NULL named barrier object*
                                                         (barrier *object* ``0``).
                                                       - For *named barrier objects*, this instruction always waits on the
                                                         last *named barrier object* that the thread has *joined*, even
                                                         if it is different from the *barrier object* passed to the
                                                         instruction.
     ===================== =========================== ===========================================================
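
 The ``m0`` layouts mentioned in the table for *named barrier objects* can be sketched as
 follows. This Python snippet is an illustration under stated assumptions, not a definitive
 encoding: the function names are hypothetical, "lower half" and "upper half" are taken to mean
 bits ``[15:0]`` and ``[31:16]`` of a 32-bit ``m0``, and only barrier IDs ``[1, 16]`` are handled.

 .. code-block:: python

    def pack_m0_for_init(barrier_id, expected_count):
        """m0 for s_barrier_init: ID in the lower half, expected count in the upper half."""
        if not 1 <= barrier_id <= 16:
            raise ValueError("only named barrier IDs [1, 16] are handled")
        if not 1 <= expected_count <= 0xFFFF:
            raise ValueError("expected count must be positive and fit in 16 bits")
        return (expected_count << 16) | barrier_id


    def pack_m0_for_signal(barrier_id, new_expected_count=0):
        """m0 for s_barrier_signal: ID in bits [4:0], optional count in bits [22:16]."""
        if not 1 <= barrier_id <= 16:
            raise ValueError("only named barrier IDs [1, 16] are handled")
        if not 0 <= new_expected_count <= 0x7F:
            raise ValueError("new expected count must fit in bits [22:16]")
        # A zero in bits [22:16] leaves the expected count unchanged.
        return (new_expected_count << 16) | (barrier_id & 0x1F)


    init_m0 = pack_m0_for_init(3, 8)
    signal_m0 = pack_m0_for_signal(5)
    signal_update_m0 = pack_m0_for_signal(5, 4)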


 The following barrier IDs are available:

 .. table:: s_barrier IDs GFX12
     :name: amdgpu-execution-synchronization-barriers-sbarrier-ids-gfx12
     :widths: 15 15 15 55

     =============== ============== ============ ==============================================================
     Barrier ID      Scope          Availability Description
     =============== ============== ============ ==============================================================
     ``-4``          ``cluster``    GFX12.5      *Cluster trap barrier*; *cluster barrier object* for use by
                                                 all workgroups of a workgroup cluster. Dedicated for the trap
                                                 handler and only available in privileged execution mode
                                                 (not accessible by the shader).

     ``-3``          ``cluster``    GFX12.5      *Cluster user barrier*; *cluster barrier object* for use by
                                                 all workgroups of a workgroup cluster.

     ``-2``          ``workgroup``  GFX12 (all)  *Workgroup trap barrier*, dedicated for the trap handler and
                                                 only available in privileged execution mode
                                                 (not accessible by the shader).

     ``-1``          ``workgroup``  GFX12 (all)  *Workgroup barrier*.

     ``0``           ``workgroup``  GFX12.5      *NULL named barrier object*. *Barrier-mutually-exclusive* with
                                                 barriers ``[1, 16]``.

     ``[1, 16]``     ``workgroup``  GFX12.5      *Named barrier object*. All barrier *objects* in this range are
                                                 *barrier-mutually-exclusive* with other barriers in ``[0, 16]``.
     =============== ============== ============ ==============================================================


 Informally, we can note that:

 * All operations on the *NULL named barrier object* other than *join* are no-ops.

   * As the *NULL named barrier object* (barrier ID ``0``) is *barrier-mutually-exclusive* with all other
     *named barrier objects* (barrier IDs ``[1, 16]``), a thread can use a *join* on the *NULL*
     barrier as a way to "unjoin" a *named barrier* (break *barrier-joined-before*) without
     having to use a *drop* operation.

 * When a thread ends, it does **not** implicitly *drop* any *named barrier objects*
   (barrier IDs ``[0, 16]``) it has *joined*.
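
 The interaction between *join*, the *NULL named barrier object* and *wait* can be sketched with
 a per-thread model. This Python snippet is hypothetical and simplified: the
 ``NamedBarrierJoinState`` class and its methods are illustrative names, only barrier IDs
 ``[0, 16]`` are modeled, and the sketch assumes that the most recent *join* alone decides which
 *named barrier object* a *wait* targets.

 .. code-block:: python

    NULL_NAMED_BARRIER = 0
    NAMED_BARRIER_IDS = range(0, 17)  # barrier IDs [0, 16]


    class NamedBarrierJoinState:
        """Tracks which named barrier a single thread currently has joined."""

        def __init__(self):
            self.joined_id = None  # no named barrier has been joined yet

        def join(self, barrier_id):
            if barrier_id not in NAMED_BARRIER_IDS:
                raise ValueError("not a named barrier ID")
            # Barriers in [0, 16] are barrier-mutually-exclusive, so a newer
            # join replaces the older one, including a join on the NULL barrier.
            self.joined_id = barrier_id

        def wait_target(self, operand_id):
            """Return the ID actually waited on; the operand is ignored."""
            if operand_id not in NAMED_BARRIER_IDS:
                raise ValueError("not a named barrier ID")
            if self.joined_id is None:
                raise RuntimeError("undefined behavior: wait without a prior join")
            return self.joined_id


    state = NamedBarrierJoinState()
    state.join(3)
    target_while_joined = state.wait_target(5)   # still waits on barrier 3
    state.join(NULL_NAMED_BARRIER)               # "unjoin" barrier 3
    target_after_unjoin = state.wait_target(3)   # NULL barrier: wait is a no-op

 A result of ``NULL_NAMED_BARRIER`` stands for a *wait* that is a no-op, since all operations on
 the *NULL named barrier object* other than *join* do nothing.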
	.. _amdgpu-execution-synchronization:

	================================
	AMDGPU Execution Synchronization
	================================

	.. contents::
	:local:

	.. _amdgpu-execution-synchronization-barriers:

	This document covers different ways of synchronizing execution of threads on AMD GPUs.

	.. note::

	This document is not exhaustive. There may be more ways of synchronizing execution
	that are not covered by this document.

	Barriers
	========

	This section covers execution synchronization using barrier-style primitives.

	.. _amdgpu-execution-synchronization-barriers-execution-model:

	Execution Model
	---------------

	This section contains a formal execution model that can be used to model the behavior of
	barriers on AMDGPU targets.

	.. note::

	The barrier execution model is experimental and subject to change.

	Threads can synchronize execution by performing barrier operations on barrier objects as described below:

	* Each barrier object has the following state:

	* An unsigned positive integer expected count: counts the number of arrive operations
	expected for this barrier object.
	* An unsigned non-negative integer arrive count: counts the number of arrive operations
	already performed on this barrier object.

	* The initial value of arrive count is zero.
	* When an operation causes arrive count to be equal to expected count, the barrier is completed,
	and the arrive count is reset to zero.

	* Barrier objects exist within a scope (see :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`),
	and each instance of a barrier object can only be accessed by threads in the same scope instance.
	* Barrier-mutually-exclusive is a symmetric relation between barrier objects that share resources
	in a way that restricts how a thread can use them at the same time.
	* Barrier operations are performed on barrier objects. A barrier operation is a dynamic instance
	of one of the following:

	* Barrier init

	* Barrier init takes an additional unsigned positive integer argument k.
	* Sets the expected count of the barrier object to k.
	* Resets the arrive count of the barrier object to zero.

	* Barrier join.

	* Allow the thread that executes the operation to wait on a barrier object.

	* Barrier drop.

	* Decrements expected count of the barrier object by one.

	* Barrier arrive.

	* Increments the arrive count of the barrier object by one.
	* If supported, an additional argument to arrive can also update the expected count of the
	barrier object before the arrive count is incremented;
	the new expected count cannot be less than or equal to the arrive count,
	otherwise the behavior is undefined.

	* Barrier wait.

	* Introduces execution dependencies between threads; this operation depends on
	other barrier operations to complete.

	* Barrier modification operations are barrier operations that modify the barrier object state:

	* Barrier init.
	* Barrier drop.
	* Barrier arrive.

	* Thread-barrier-order<BO> is the subset of program-order that only
	relates barrier operations performed on a barrier object ``BO``.
	* All barrier modification operations on a barrier object ``BO`` occur in a strict total order called
	barrier-modification-order<BO>; it is the order in which ``BO`` observes barrier
	operations that change its state. For any valid barrier-modification-order<BO>, the
	following must be true:

	* Let ``A`` and ``B`` be two barrier modification operations where ``A -> B`` in
	thread-barrier-order<BO>, then ``A -> B`` is also in barrier-modification-order<BO>.
	* The first element in barrier-modification-order<BO> is always a barrier init, otherwise
	the behavior is undefined.

	* barrier-participates-in relates barrier operations to the barrier waits that depend on them
	to complete. A barrier operation ``X`` barrier-participates-in a barrier wait ``W``
	if and only if all of the following is true:

	* ``X`` and ``W`` are both performed on the same barrier object ``BO``.
	* ``X`` is a barrier arrive or drop operation.
	* ``X`` does not barrier-participate-in another distinct barrier wait ``W'`` in the same thread as ``W``.
	* ``W -> X`` not in thread-barrier-order<BO>.
	* All dependent constraint and relations are satisfied as well. [0]_

	* For the set ``S`` consisting of all barrier operations that barrier-participate-in a barrier wait ``W`` for some
	barrier object ``BO``:

	* The elements of ``S`` all exist in a continuous, uninterrupted interval of barrier-modification-order<BO>.
	* The arrive count of ``BO`` is zero before the first operation of ``S`` in barrier-modification-order<BO>.
	* The arrive count and expected count of ``BO`` are equal after the last operation of ``S`` in
	barrier-modification-order<BO>. The arrive count and expected count of ``BO`` cannot
	equal at any other point in ``S``.

	* A barrier join ``J`` is barrier-joined-before a barrier operation ``X`` if and only if all
	of the following is true:

	* ``J -> X`` in thread-barrier-order<BO>.
	* ``X`` is not a barrier join.
	* There is no barrier join or drop ``JD`` where ``J -> JD -> X`` in thread-barrier-order<BO>.
	* There is no barrier join ``J'`` on a distinct barrier object ``BO'`` such that ``J -> J' -> X`` in
	program-order, and ``BO`` barrier-mutually-exclusive ``BO'``.

	* A barrier operation ``A`` barrier-executes-before another barrier operation ``B`` if any of the
	following is true:

	* ``A -> B`` in program-order.
	* ``A -> B`` in barrier-participates-in.
	* ``A`` barrier-executes-before some barrier operation ``X``, and ``X``
	barrier-executes-before ``B``.

	* Barrier-executes-before is consistent with barrier-modification-order<BO>
	for every barrier object ``BO``.
	* For every barrier drop ``D`` performed on a barrier object ``BO``:

	* There is a barrier join ``J`` such that ``J -> D`` in barrier-joined-before;
	otherwise, the behavior is undefined.
	* ``D`` cannot cause the expected count of ``BO`` to become negative; otherwise, the behavior is undefined.

	* For every pair of barrier arrive ``A`` and barrier drop ``D`` performed on a barrier object
	``BO``, such that ``A -> D`` in thread-barrier-order<BO>, one of the following must be true:

	* ``A`` does not barrier-participates-in any barrier wait.
	* ``A`` barrier-participates-in at least one barrier wait ``W``
	such that ``W -> D`` in barrier-executes-before.

	* For every barrier wait ``W`` performed on a barrier object ``BO``:

	* There is a barrier join ``J`` such that ``J -> W`` in barrier-joined-before, and
	``J`` must barrier-executes-before at least one operation ``X`` that
	barrier-participates-in ``W``; otherwise, the behavior is undefined.

	* barrier-phase-with is a symmetric relation over barrier operations defined as the
	  transitive closure of the union of barrier-participates-in and its inverse relation.
	* For every barrier operation ``A`` that barrier-participates-in a barrier wait ``W`` on a barrier object ``BO``:

	  * There is no barrier operation ``X`` on ``BO`` such that ``A -> X -> W`` in
	    barrier-executes-before, and ``X`` barrier-phase-with a non-empty set of operations
	    that does not include ``W``.

	.. note::

	   Barriers only synchronize execution and do not affect the visibility of memory operations between threads.
	   Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
	   to determine how to synchronize memory operations through barrier-executes-before.


	.. [0] The definition of barrier-participates-in (in its current state) is non-deterministic and
	   will be improved in the future. Within a valid execution, there may be multiple ways
	   to build barrier-participates-in; however, there is only one way to build it that also satisfies all
	   other relations and constraints that depend on barrier-participates-in and relations derived from it.
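The definition of barrier-executes-before above amounts to a transitive closure over program-order and barrier-participates-in. The following is a minimal Python sketch of that construction, not part of the model itself. The operation labels (``A0``, ``W0``, ...) and the function name are hypothetical, and both base relations are assumed to be given as sets of pairs.

```python
from itertools import product


def barrier_executes_before(program_order, participates_in):
    """Transitive closure of program-order united with barrier-participates-in.

    Both arguments are sets of (a, b) pairs over hypothetical operation labels.
    """
    closure = set(program_order) | set(participates_in)
    changed = True
    while changed:
        changed = False
        # Chain every pair (a, x) with every pair (x, b) until nothing new appears.
        for (a, x), (y, b) in product(tuple(closure), repeat=2):
            if x == y and (a, b) not in closure:
                closure.add((a, b))
                changed = True
    return closure


# Thread 0: arrive A0, wait W0, then a later barrier operation B0.
# Thread 1: arrive A1, wait W1.
program_order = {("A0", "W0"), ("W0", "B0"), ("A1", "W1")}
# Assumed for this example: both arrives participate in both waits.
participates_in = {("A0", "W0"), ("A0", "W1"), ("A1", "W0"), ("A1", "W1")}

beb = barrier_executes_before(program_order, participates_in)
```

In this example ``("A1", "B0")`` ends up in the closure through ``A1 -> W0 -> B0``, which is how an arrive in one thread becomes ordered before a later operation in another thread.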

	Informational Notes
	~~~~~~~~~~~~~~~~~~~

	Informally, we can deduce from the above formal model that execution barriers behave as follows:

	* Barrier-executes-before relates the dynamic instances of operations from different threads together.
	  For example, if ``A -> B`` in barrier-executes-before, then the execution of ``A`` must complete
	  before the execution of ``B`` can complete.

	  * This property can also be combined with program-order. For example, consider two (non-barrier) operations
	    ``X`` and ``Y`` where ``X -> A`` and ``B -> Y`` in program-order; then we know that the execution
	    of ``X`` completes before the execution of ``Y`` does.

	* Barriers do not complete "out-of-thin-air"; a barrier wait ``W`` cannot depend on a barrier operation
	  ``X`` to complete if ``W -> X`` in barrier-executes-before.
	* It is undefined behavior to operate on an uninitialized barrier object.
	* It is undefined behavior for a barrier wait to never complete.
	* It is not mandatory to drop a barrier after joining it.
	* A thread may not arrive and then drop a barrier object unless the barrier completes before the
	  barrier drop. Incrementing the arrive count and decrementing the expected count directly
	  after may cause undefined behavior.
	* Joining a barrier is only useful if the thread will wait on that same barrier object later.
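The counter behavior behind the arrive-then-drop restriction can be illustrated with a small Python sketch. It tracks only the *expected count* and *arrive count* of one barrier object. It does not model threads, scopes, or the ordering relations, and the class and method names are hypothetical. It also assumes that a drop can trigger completion in the same way an arrive does, which is one reading of "an operation" in the model.

```python
class BarrierObject:
    """Counters of a single barrier object (illustrative model only)."""

    def __init__(self, expected_count):
        if expected_count <= 0:
            raise ValueError("expected count must be positive")
        self.expected_count = expected_count
        self.arrive_count = 0        # initial value is zero
        self.completed_phases = 0    # bookkeeping for this sketch only

    def _maybe_complete(self):
        # Completion: arrive count equals expected count, then reset to zero.
        if self.expected_count > 0 and self.arrive_count == self.expected_count:
            self.arrive_count = 0
            self.completed_phases += 1
            return True
        return False

    def arrive(self):
        self.arrive_count += 1
        return self._maybe_complete()

    def drop(self):
        if self.expected_count == 0:
            # The model calls this undefined behavior; the sketch just refuses.
            raise RuntimeError("drop would make the expected count negative")
        self.expected_count -= 1
        return self._maybe_complete()


# Normal use: two arrivals complete a barrier that expects two.
normal = BarrierObject(2)
first = normal.arrive()
second = normal.arrive()

# Hazard: one thread arrives and then drops right away. The counters now
# match even though the other thread never arrived.
hazard = BarrierObject(2)
hazard.arrive()
completed_early = hazard.drop()
```

In the hazard case the barrier appears complete after a single thread has touched it, which is why the model requires the barrier to complete before the drop.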

	Barrier Implementations on AMDGPU Targets
	-----------------------------------------

	``s_barrier``
	~~~~~~~~~~~~~

	``s_barrier`` instructions are the primary barrier implementation on AMD GPUs.

	``s_barrier`` instructions can only be used to synchronize threads at a wavefront granularity.
	``s_barrier`` instructions are convergent within a wave, and thus can only be performed
	in wave-uniform control flow.

	The ``s_barrier`` family of instructions is available in some form on all GFX targets,
	and has evolved over time. The sub-sections below cover the capabilities offered by every major
	iteration of this feature separately.

	GFX6-11
	+++++++

	Targets from GFX6 through GFX11 inclusive do not have the "split barrier" feature.
	The barrier arrive and barrier wait operations cannot be performed independently
	using ``s_barrier``.

	There is only one workgroup barrier object of ``workgroup`` scope that is implicitly used
	by all ``s_barrier`` instructions.

	The following code sequences can be used to implement the barrier operations defined by the
	:ref:`execution synchronization model<amdgpu-execution-synchronization-barriers-execution-model>` using
	``s_barrier`` on GFX6 through GFX11:

	.. table:: s_barrier GFX6-11
	   :name: amdgpu-execution-synchronization-barriers-sbarrier-gfx6-11
	   :widths: 15 15 70

	   ===================== ====================== ===========================================================
	   Barrier Operation(s)  Barrier Object         AMDGPU Machine Code
	   ===================== ====================== ===========================================================
	   Init, Join and Drop
	   --------------------------------------------------------------------------------------------------------
	   init                  - Workgroup barrier    Automatically initialized by the hardware when a workgroup
	                                                is launched. The expected count of this barrier is set
	                                                to the number of waves in the workgroup.

	   join                  - Workgroup barrier    Any thread launched within a workgroup automatically joins
	                                                this barrier object.

	   drop                  - Workgroup barrier    When a thread ends, it automatically drops this barrier
	                                                object if it had previously joined it.

	   Arrive and Wait
	   --------------------------------------------------------------------------------------------------------
	   arrive then wait      - Workgroup barrier    | BackOffBarrier
	                                                |   ``s_barrier``
	                                                | No BackOffBarrier
	                                                |   ``s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)``
	                                                |   ``s_waitcnt_vscnt null, 0x0``
	                                                |   ``s_barrier``

	                                                - If the target does not have the BackOffBarrier feature,
	                                                  then there cannot be any outstanding memory operations
	                                                  before issuing the ``s_barrier`` instruction.
	                                                - The waitcnts can independently be moved earlier, or
	                                                  removed entirely as long as the associated
	                                                  counter remains at zero before issuing the
	                                                  ``s_barrier`` instruction.
	                                                - The ``s_barrier`` instruction cannot complete
	                                                  before all waves of the workgroup have launched.

	   arrive                - Workgroup barrier    Not available separately, see arrive then wait

	   wait                  - Workgroup barrier    Not available separately, see arrive then wait
	   ===================== ====================== ===========================================================
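On these targets a single ``s_barrier`` acts as a combined arrive then wait on the one workgroup barrier. As a host-side analogy only (this is not AMDGPU code), Python's ``threading.Barrier.wait`` has the same combined shape. In the sketch below the wave count and all variable names are hypothetical. Unlike ``s_barrier``, the Python barrier also makes the earlier writes visible, so the sketch illustrates execution ordering only.

```python
import threading

NUM_WAVES = 4  # hypothetical number of waves in the workgroup

# Expected count is the number of waves, mirroring the hardware-initialized barrier.
workgroup_barrier = threading.Barrier(NUM_WAVES)
phase1_done = [False] * NUM_WAVES
observed_all_done = [None] * NUM_WAVES


def wave(wave_id):
    phase1_done[wave_id] = True                    # work before the barrier
    workgroup_barrier.wait()                       # arrive then wait in one step
    observed_all_done[wave_id] = all(phase1_done)  # work after the barrier


threads = [threading.Thread(target=wave, args=(i,)) for i in range(NUM_WAVES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every wave observes that all other waves finished their pre-barrier work, because no wave can leave the barrier before all of them have arrived.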

	GFX12
	+++++

	GFX12 targets have the split-barrier feature, and also allow ``s_barrier`` instructions to use
	one of multiple barrier objects available per workgroup. ``s_barrier`` instructions use the
	barrier ID operand to determine the barrier object they operate on.

	GFX12.5 additionally introduces new barrier objects that offer more flexibility for synchronizing the execution
	of a subset of waves of a workgroup, or synchronizing execution across workgroups within a workgroup cluster, via
	``s_barrier``.
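With the split-barrier feature, arriving and waiting are separate steps, so a wave can do unrelated work between them. The following Python sketch is a host-side analogy under stated assumptions, not AMDGPU code. The ``SplitBarrier`` class is hypothetical, and the phase token returned by ``arrive`` is an artifact of this sketch; the hardware wait instruction takes no such token.

```python
import threading


class SplitBarrier:
    """Host-side analogy of a barrier with separate arrive and wait steps."""

    def __init__(self, expected_count):
        self._cond = threading.Condition()
        self._expected = expected_count
        self._arrived = 0
        self._phase = 0

    def arrive(self):
        """Non-blocking: record the arrival and return the current phase."""
        with self._cond:
            phase = self._phase
            self._arrived += 1
            if self._arrived == self._expected:
                # Barrier completes: reset the arrive count and wake waiters.
                self._arrived = 0
                self._phase += 1
                self._cond.notify_all()
            return phase

    def wait(self, phase):
        """Block until the phase in which the caller arrived has completed."""
        with self._cond:
            self._cond.wait_for(lambda: self._phase > phase)


NUM_WAVES = 4  # hypothetical
barrier = SplitBarrier(NUM_WAVES)
arrived_flags = [False] * NUM_WAVES
saw_all_arrived = [False] * NUM_WAVES


def wave(wave_id):
    arrived_flags[wave_id] = True
    token = barrier.arrive()        # signal arrival without blocking
    _ = sum(range(100))             # independent work overlaps with other waves
    barrier.wait(token)             # block only when the result is needed
    saw_all_arrived[wave_id] = all(arrived_flags)


threads = [threading.Thread(target=wave, args=(i,)) for i in range(NUM_WAVES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```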

	.. note::

	   Check :ref:`the table below<amdgpu-execution-synchronization-barriers-sbarrier-ids-gfx12>` to determine
	   which barrier IDs are available to ``s_barrier`` instructions on a given target.

	The following code sequences can be used to implement the barrier operations defined by the
	:ref:`execution synchronization model<amdgpu-execution-synchronization-barriers-execution-model>` using
	``s_barrier`` on GFX12.0 and up:

	.. table:: s_barrier GFX12
	   :name: amdgpu-execution-synchronization-barriers-sbarrier-gfx2
	   :widths: 15 15 70

	   ===================== =========================== ===========================================================
	   Barrier Operation(s)  Barrier ID                  AMDGPU Machine Code
	   ===================== =========================== ===========================================================
	   Init, Join and Drop
	   -------------------------------------------------------------------------------------------------------------
	   init                  - ``-2``, ``-1``            Automatically initialized by the hardware when a workgroup
	                                                     is launched. The expected count of this barrier is set
	                                                     to the number of waves in the workgroup.

	   init                  - ``-4``, ``-3``            Automatically initialized by the hardware when a workgroup
	                                                     is launched as part of a workgroup cluster.
	                                                     The expected count of this barrier is set to the number
	                                                     of workgroups in the workgroup cluster.

	   init                  - ``0``                     Automatically initialized by the hardware and always
	                                                     available. This barrier object is opaque and immutable
	                                                     as all operations other than barrier join are no-ops.

	   init                  - ``[1, 16]``               | ``s_barrier_init <N>``

	                                                     - ``<N>`` is an immediate constant, or stored in the lower
	                                                       half of ``m0``.
	                                                     - The value to set as the expected count of the barrier
	                                                       is stored in the upper half of ``m0``.

	   join                  - ``-2``, ``-1``            Any thread launched within a workgroup automatically joins
	                                                     this barrier object.

	   join                  - ``-4``, ``-3``            Any thread launched within a workgroup cluster
	                                                     automatically joins this barrier object.

	   join                  - ``0``                     | ``s_barrier_join <N>``
	                         - ``[1, 16]``
	                                                     - ``<N>`` is an immediate constant, or stored in the lower
	                                                       half of ``m0``.

	   drop                  - ``0``                     | ``s_barrier_leave``
	                         - ``[1, 16]``
	                                                     - ``s_barrier_leave`` takes no operand. It can only be used
	                                                       to drop a barrier object ``BO`` if ``BO`` was
	                                                       previously joined using ``s_barrier_join``.
	                                                     - Drops the barrier object ``BO`` if and only if
	                                                       there is a barrier join ``J`` such that ``J`` is
	                                                       barrier-joined-before this barrier
	                                                       drop operation.

	   drop                  - ``-2``, ``-1``            When a thread ends, it automatically drops this barrier
	                         - ``-4``, ``-3``            object if it had previously joined it.

	   Arrive and Wait
	   -------------------------------------------------------------------------------------------------------------

	   arrive                - ``-4``, ``-3``            | ``s_barrier_signal <N>``
	                         - ``-2``, ``-1``            | Or
	                         - ``0``                     | ``s_barrier_signal_isfirst <N>``
	                         - ``[1, 16]``
	                                                     - ``<N>`` is an immediate constant, or stored in bits ``[4:0]`` of ``m0``.
	                                                     - The ``_isfirst`` variant sets ``SCC=1`` if this wave is the first
	                                                       to signal the barrier, otherwise ``SCC=0``.
	                                                     - For barrier objects ``[1, 16]``: When using ``m0`` as an operand,
	                                                       if there is a non-zero value contained in the bits ``[22:16]`` of ``m0``,
	                                                       the expected count of the barrier object is set to that value before
	                                                       the arrive count of the barrier object is incremented.
	                                                       The new expected count value must be greater than or equal to the
	                                                       arrive count, otherwise the behavior is undefined.
	                                                     - For barrier objects ``-4`` and ``-3``
	                                                       (``cluster`` barriers): only one wave
	                                                       per workgroup may arrive at the barrier on behalf of
	                                                       its entire workgroup. However, any wave within the workgroup
	                                                       cluster can then wait on this barrier object.
	                                                     - This is a no-op on the NULL named barrier object
	                                                       (barrier object ``0``).

	   wait                  - ``-4``, ``-3``            | ``s_barrier_wait <N>``
	                         - ``-2``, ``-1``
	                         - ``0``                     - ``<N>`` is an immediate constant.
	                         - ``[1, 16]``               - For barrier objects ``-2`` and ``-1``: This instruction
	                                                       cannot complete before all waves of the
	                                                       workgroup have launched.
	                                                     - For barrier objects ``-4`` and ``-3`` (``cluster`` barriers):
	                                                       This instruction cannot complete before all waves of the
	                                                       workgroup cluster have launched.
	                                                     - This is a no-op on the NULL named barrier object
	                                                       (barrier object ``0``).
	                                                     - For named barrier objects, this instruction always waits on the
	                                                       last named barrier object that the thread has joined, even
	                                                       if it is different from the barrier object passed to the
	                                                       instruction.
	   ===================== =========================== ===========================================================
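The ``m0`` layout described for ``s_barrier_signal`` on named barriers can be sketched as plain bit packing. The helper names below are hypothetical and the sketch covers only barrier IDs ``[1, 16]``. It leaves every other bit of ``m0`` at zero, which is an assumption made for illustration rather than a documented requirement.

```python
BARRIER_ID_MASK = 0x1F        # bits [4:0] of m0
EXPECTED_COUNT_SHIFT = 16     # bits [22:16] of m0
EXPECTED_COUNT_MASK = 0x7F    # seven bits wide


def pack_m0_for_signal(barrier_id, new_expected_count=0):
    """Build an m0 value for signaling a named barrier (IDs 1 to 16).

    A zero new_expected_count leaves the expected count unchanged.
    """
    if not 1 <= barrier_id <= 16:
        raise ValueError("this sketch only handles named barriers 1..16")
    if not 0 <= new_expected_count <= EXPECTED_COUNT_MASK:
        raise ValueError("expected count must fit in bits [22:16]")
    return (new_expected_count << EXPECTED_COUNT_SHIFT) | barrier_id


def unpack_m0_for_signal(m0):
    """Return (barrier_id, new_expected_count) from an m0 value."""
    barrier_id = m0 & BARRIER_ID_MASK
    new_expected_count = (m0 >> EXPECTED_COUNT_SHIFT) & EXPECTED_COUNT_MASK
    return barrier_id, new_expected_count


m0_value = pack_m0_for_signal(3, new_expected_count=8)
```

Here ``m0_value`` is ``0x80003``: barrier ID 3 in the low bits and a new expected count of 8 starting at bit 16.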


	The following barrier IDs are available:

	.. table:: s_barrier IDs GFX12
	   :name: amdgpu-execution-synchronization-barriers-sbarrier-ids-gfx12
	   :widths: 15 15 15 55

	   =============== ============== ============ ==============================================================
	   Barrier ID      Scope          Availability Description
	   =============== ============== ============ ==============================================================
	   ``-4``          ``cluster``    GFX12.5      Cluster trap barrier; cluster barrier object for use by
	                                               all workgroups of a workgroup cluster. Dedicated for the trap
	                                               handler and only available in privileged execution mode
	                                               (not accessible by the shader).

	   ``-3``          ``cluster``    GFX12.5      Cluster user barrier; cluster barrier object for use by
	                                               all workgroups of a workgroup cluster.

	   ``-2``          ``workgroup``  GFX12 (all)  Workgroup trap barrier, dedicated for the trap handler and
	                                               only available in privileged execution mode
	                                               (not accessible by the shader).

	   ``-1``          ``workgroup``  GFX12 (all)  Workgroup barrier.

	   ``0``           ``workgroup``  GFX12.5      NULL named barrier object. Barrier-mutually-exclusive with
	                                               barriers ``[1, 16]``.

	   ``[1, 16]``     ``workgroup``  GFX12.5      Named barrier object. All barrier objects in this range are
	                                               barrier-mutually-exclusive with other barriers in ``[0, 16]``.
	   =============== ============== ============ ==============================================================


	Informally, we can note that:

	* All operations on the NULL named barrier object other than join are no-ops.

	* As the NULL named barrier object (barrier ID ``0``) is barrier-mutually-exclusive with all other
	  named barrier objects (barrier IDs ``[1, 16]``), a thread can use a join on the NULL
	  barrier as a way to "unjoin" a named barrier (break barrier-joined-before) without
	  having to use a drop operation.

	* When a thread ends, it does not implicitly drop any named barrier objects
	  (barrier IDs ``[0, 16]``) it has joined.
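The notes above can be summarized with a small Python model of which named barrier a single thread is currently joined to. This is one reading of the tables and has not been checked against hardware. The class and function names are hypothetical, and the sketch covers only barrier IDs ``[0, 16]``.

```python
NULL_BARRIER = 0
NAMED_BARRIER_IDS = range(0, 17)  # IDs 0..16, including the NULL named barrier


def barrier_mutually_exclusive(id_a, id_b):
    """True if two distinct barrier IDs are both in the [0, 16] range."""
    return (
        id_a != id_b
        and id_a in NAMED_BARRIER_IDS
        and id_b in NAMED_BARRIER_IDS
    )


class ThreadNamedBarrierState:
    """Tracks the last named barrier joined by one thread (illustrative only)."""

    def __init__(self):
        self.joined = None  # no named barrier joined yet

    def join(self, barrier_id):
        if barrier_id not in NAMED_BARRIER_IDS:
            raise ValueError("this sketch only handles barrier IDs 0..16")
        # Mutual exclusivity: a new join replaces any earlier named join.
        self.joined = barrier_id

    def wait_target(self, operand_id):
        """Barrier a wait acts on, or None when the wait is a no-op.

        The operand is deliberately ignored: the wait follows the last
        named barrier joined, even if the operand names another one.
        """
        if self.joined is None:
            raise RuntimeError("waiting without a prior join is undefined")
        if self.joined == NULL_BARRIER:
            return None
        return self.joined


state = ThreadNamedBarrierState()
state.join(3)
target_while_joined = state.wait_target(5)   # operand 5, but 3 was joined last
state.join(NULL_BARRIER)                     # "unjoin" without a drop
target_after_unjoin = state.wait_target(3)
```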