 ==============================
 User Guide for AMDGPU Back-end
 ==============================

 Introduction
 ============

 The AMDGPU back-end provides ISA code generation for AMD GPUs, from the R600
 family through the current Volcanic Islands (GCN Gen 3).

 Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
 for additional documentation.

 Conventions
 ===========

 Address Spaces
 --------------

 The AMDGPU back-end uses the following address space mapping:

    ================== =================== ==============
    LLVM Address Space DWARF Address Space Memory Space
    ================== =================== ==============
    0                  1                   Private
    1                  N/A                 Global
    2                  N/A                 Constant
    3                  2                   Local
    4                  N/A                 Generic (Flat)
    5                  N/A                 Region
    ================== =================== ==============

 The terminology in the table, aside from the region memory space, is from the
 OpenCL standard.

 LLVM Address Space is used throughout LLVM (for example, in LLVM IR). DWARF
 Address Space is emitted in DWARF and is used by tools such as debuggers and
 profilers.
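
 As a quick reference, the mapping above can be written as a lookup table. The
 following is a minimal Python sketch, not part of LLVM; the names
 ``ADDRESS_SPACES`` and ``dwarf_address_space`` are hypothetical and only mirror
 the table:

 .. code-block:: python

   # Hypothetical lookup table mirroring the address space table above.
   # Key: LLVM address space number.
   # Value: (DWARF address space, or None where the table says N/A,
   #         memory space name).
   ADDRESS_SPACES = {
       0: (1, "Private"),
       1: (None, "Global"),
       2: (None, "Constant"),
       3: (2, "Local"),
       4: (None, "Generic (Flat)"),
       5: (None, "Region"),
   }


   def dwarf_address_space(llvm_address_space):
       """Return the DWARF address space, or None when the table lists N/A."""
       return ADDRESS_SPACES[llvm_address_space][0]

 With this table, ``dwarf_address_space(3)`` yields the DWARF address space for
 local memory, and ``dwarf_address_space(1)`` yields ``None`` because global
 memory has no DWARF address space in the table.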

 Trap Handler ABI
 ----------------
 The OS element of the target triple controls the trap handler behavior.

 HSA OS
 ^^^^^^
 For code objects generated by the AMDGPU back-end for the HSA OS, the runtime
 installs a trap handler that supports the s_trap instruction with the following
 usage:

  +--------------+-------------+-------------------+----------------------------+
  |Usage         |Code Sequence|Trap Handler Inputs|Description                 |
  +==============+=============+===================+============================+
  |reserved      |s_trap 0x00  |                   |Reserved by hardware.       |
  +--------------+-------------+-------------------+----------------------------+
  |HSA debugtrap |s_trap 0x01  |SGPR0-1: queue_ptr |Reserved for HSA debugtrap  |
  |(arg)         |             |VGPR0: arg         |intrinsic (not implemented).|
  +--------------+-------------+-------------------+----------------------------+
  |llvm.trap     |s_trap 0x02  |SGPR0-1: queue_ptr |Causes dispatch to be       |
  |              |             |                   |terminated and its          |
  |              |             |                   |associated queue put into   |
  |              |             |                   |the error state.            |
  +--------------+-------------+-------------------+----------------------------+
  |llvm.debugtrap| s_trap 0x03 |SGPR0-1: queue_ptr |If debugger not installed   |
  |              |             |                   |handled same as llvm.trap.  |
  +--------------+-------------+-------------------+----------------------------+
  |debugger      |s_trap 0x07  |                   |Reserved for debugger       |
  |breakpoint    |             |                   |breakpoints.                |
  +--------------+-------------+-------------------+----------------------------+
  |debugger      |s_trap 0x08  |                   |Reserved for debugger.      |
  +--------------+-------------+-------------------+----------------------------+
  |debugger      |s_trap 0xfe  |                   |Reserved for debugger.      |
  +--------------+-------------+-------------------+----------------------------+
  |debugger      |s_trap 0xff  |                   |Reserved for debugger.      |
  +--------------+-------------+-------------------+----------------------------+

 Non-HSA OS
 ^^^^^^^^^^
 For code objects generated by the AMDGPU back-end for a non-HSA OS, the runtime
 does not install a trap handler. The llvm.trap and llvm.debugtrap intrinsics
 are handled as follows:

    =============== ============= ===============================================
    Usage           Code Sequence Description
    =============== ============= ===============================================
    llvm.trap       s_endpgm      Causes wavefront to be terminated.
    llvm.debugtrap  s_nop         No operation. Compiler warning generated that
                                  there is no trap handler installed.
    =============== ============= ===============================================

 Assembler
 =========

 The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
 development. It supports the Southern Islands, Sea Islands, and Volcanic
 Islands ISAs.

 This document describes the general syntax for instructions and operands. For
 more information about instructions, their semantics, and supported
 combinations of operands, refer to one of the Instruction Set Architecture
 manuals.

 An instruction has the following syntax (register operands are
 normally comma-separated while extra operands are space-separated):

 *<opcode> <register_operand0>, ... <extra_operand0> ...*


 Operands
 --------

 The following syntax for register operands is supported:

 * SGPR registers: s0, ... or s[0], ...
 * VGPR registers: v0, ... or v[0], ...
 * TTMP registers: ttmp0, ... or ttmp[0], ...
 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
 * Register index expressions: v[2*2], s[1-1:2-1]
 * 'off' indicates that an operand is not enabled
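
 To make the basic forms concrete, the following Python sketch classifies a
 single register, a bracketed index, and a register range. It is an
 illustration only and not how the LLVM-MC assembler parses operands; the
 function name ``parse_register`` is hypothetical, and special registers,
 register lists, and index expressions such as ``v[2*2]`` are deliberately not
 handled:

 .. code-block:: python

   import re

   # Register name followed by a plain index (s0), a bracketed index (s[0]),
   # or a bracketed range (s[2:3]).
   _REGISTER_RE = re.compile(r"(s|v|ttmp)(?:(\d+)|\[(\d+)(?::(\d+))?\])")


   def parse_register(text):
       """Return (kind, first, last) for a register operand, or None."""
       match = _REGISTER_RE.fullmatch(text)
       if match is None:
           return None
       kind, plain, first, last = match.groups()
       if plain is not None:
           # Plain form such as s0: a single register.
           first_index = last_index = int(plain)
       else:
           # Bracketed form: s[0] is one register, s[2:3] is a range.
           first_index = int(first)
           last_index = first_index if last is None else int(last)
       if last_index < first_index:
           return None
       return kind, first_index, last_index

 Under these assumptions, ``parse_register("s[2:3]")`` returns
 ``("s", 2, 3)`` and ``parse_register("ttmp0")`` returns ``("ttmp", 0, 0)``.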

 The following extra operands are supported:

 * offset, offset0, offset1
 * idxen, offen bits
 * glc, slc, tfe bits
 * waitcnt: integer or combination of counter values
 * VOP3 modifiers:

   - abs (\| \|), neg (\-)

 * DPP modifiers:

   - row_shl, row_shr, row_ror, row_rol
   - row_mirror, row_half_mirror, row_bcast
   - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
   - row_mask, bank_mask, bound_ctrl

 * SDWA modifiers:

   - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
   - abs, neg, sext

 DS Instruction Examples
 -----------------------

 .. code-block:: nasm

   ds_add_u32 v2, v4 offset:16
   ds_write_src2_b64 v2 offset0:4 offset1:8
   ds_cmpst_f32 v2, v4, v6
   ds_min_rtn_f64 v[8:9], v2, v[4:5]


 For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.

 FLAT Instruction Examples
 --------------------------

 .. code-block:: nasm

   flat_load_dword v1, v[3:4]
   flat_store_dwordx3 v[3:4], v[5:7]
   flat_atomic_swap v1, v[3:4], v5 glc
   flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
   flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

 For the full list of supported instructions, refer to "FLAT instructions" in the ISA manual.

 MUBUF Instruction Examples
 ---------------------------

 .. code-block:: nasm

   buffer_load_dword v1, off, s[4:7], s1
   buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
   buffer_store_format_xy v[1:2], off, s[4:7], s1
   buffer_wbinvl1
   buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

 For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.

 SMRD/SMEM Instruction Examples
 -------------------------------

 .. code-block:: nasm

   s_load_dword s1, s[2:3], 0xfc
   s_load_dwordx8 s[8:15], s[2:3], s4
   s_load_dwordx16 s[88:103], s[2:3], s4
   s_dcache_inv_vol
   s_memtime s[4:5]

 For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA manual.

 SOP1 Instruction Examples
 --------------------------

 .. code-block:: nasm

   s_mov_b32 s1, s2
   s_mov_b64 s[0:1], 0x80000000
   s_cmov_b32 s1, 200
   s_wqm_b64 s[2:3], s[4:5]
   s_bcnt0_i32_b64 s1, s[2:3]
   s_swappc_b64 s[2:3], s[4:5]
   s_cbranch_join s[4:5]

 For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA manual.

 SOP2 Instruction Examples
 -------------------------

 .. code-block:: nasm

   s_add_u32 s1, s2, s3
   s_and_b64 s[2:3], s[4:5], s[6:7]
   s_cselect_b32 s1, s2, s3
   s_andn2_b32 s2, s4, s6
   s_lshr_b64 s[2:3], s[4:5], s6
   s_ashr_i32 s2, s4, s6
   s_bfm_b64 s[2:3], s4, s6
   s_bfe_i64 s[2:3], s[4:5], s6
   s_cbranch_g_fork s[4:5], s[6:7]

 For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA manual.

 SOPC Instruction Examples
 --------------------------

 .. code-block:: nasm

   s_cmp_eq_i32 s1, s2
   s_bitcmp1_b32 s1, s2
   s_bitcmp0_b64 s[2:3], s4
   s_setvskip s3, s5

 For the full list of supported instructions, refer to "SOPC Instructions" in the ISA manual.

 SOPP Instruction Examples
 --------------------------

 .. code-block:: nasm

   s_barrier
   s_nop 2
   s_endpgm
   s_waitcnt 0 ; Wait for all counters to be 0
   s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
   s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
   s_sethalt 9
   s_sleep 10
   s_sendmsg 0x1
   s_sendmsg sendmsg(MSG_INTERRUPT)
   s_trap 1

 For the full list of supported instructions, refer to "SOPP Instructions" in the ISA manual.

 Unless otherwise mentioned, little verification is performed on the operands
 of SOPP instructions, so it is up to the programmer to be familiar with the
 range of acceptable values.

 Vector ALU Instruction Examples
 -------------------------------

 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
 the assembler automatically selects the optimal encoding based on the operands.
 To force a specific encoding, add a suffix to the instruction opcode:

 * _e32 for 32-bit VOP1/VOP2/VOPC
 * _e64 for 64-bit VOP3
 * _dpp for VOP_DPP
 * _sdwa for VOP_SDWA
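
 The suffix convention can be summarized as a small lookup. The Python sketch
 below is illustrative only and does not reflect the assembler's
 implementation; ``ENCODING_SUFFIXES`` and ``forced_encoding`` are hypothetical
 names:

 .. code-block:: python

   # Hypothetical table of the opcode suffixes listed above.
   ENCODING_SUFFIXES = {
       "_e32": "32-bit VOP1/VOP2/VOPC",
       "_e64": "64-bit VOP3",
       "_dpp": "VOP_DPP",
       "_sdwa": "VOP_SDWA",
   }


   def forced_encoding(opcode):
       """Return the encoding forced by the opcode suffix, or None.

       None means no suffix is present, so the assembler chooses the
       encoding itself.
       """
       for suffix, encoding in ENCODING_SUFFIXES.items():
           if opcode.endswith(suffix):
               return encoding
       return None

 For instance, ``forced_encoding("v_mov_b32_e32")`` reports the 32-bit
 encoding, while ``forced_encoding("v_mov_b32")`` returns ``None``.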

 VOP1/VOP2/VOP3/VOPC examples:

 .. code-block:: nasm

   v_mov_b32 v1, v2
   v_mov_b32_e32 v1, v2
   v_nop
   v_cvt_f64_i32_e32 v[1:2], v2
   v_floor_f32_e32 v1, v2
   v_bfrev_b32_e32 v1, v2
   v_add_f32_e32 v1, v2, v3
   v_mul_i32_i24_e64 v1, v2, 3
   v_mul_i32_i24_e32 v1, -3, v3
   v_mul_i32_i24_e32 v1, -100, v3
   v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
   v_max_f16_e32 v1, v2, v3

 VOP_DPP examples:

 .. code-block:: nasm

   v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
   v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
   v_mov_b32 v0, v0 wave_shl:1
   v_mov_b32 v0, v0 row_mirror
   v_mov_b32 v0, v0 row_bcast:31
   v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
   v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
   v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

 VOP_SDWA examples:

 .. code-block:: nasm

   v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
   v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
   v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
   v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
   v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

 For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA manual.

 HSA Code Object Directives
 --------------------------

 The AMDGPU ABI defines auxiliary data in the output code object. In assembly
 source, this data can be specified with assembler directives.

 .hsa_code_object_version major, minor
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 *major* and *minor* are integers that specify the version of the HSA code
 object that will be generated by the assembler.

 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 *major*, *minor*, and *stepping* are all integers that describe the instruction
 set architecture (ISA) version of the assembly program.

 *vendor* and *arch* are quoted strings.  *vendor* should always be equal to
 "AMD" and *arch* should always be equal to "AMDGPU".

 By default, the assembler will derive the ISA version, *vendor*, and *arch*
 from the value of the -mcpu option that is passed to the assembler.

 .amdgpu_hsa_kernel (name)
 ^^^^^^^^^^^^^^^^^^^^^^^^^

 This directive specifies that the symbol with the given name is a kernel entry
 point (label) and that the object should contain a corresponding symbol of type
 STT_AMDGPU_HSA_KERNEL.

 .amd_kernel_code_t
 ^^^^^^^^^^^^^^^^^^

 This directive marks the beginning of a list of key / value pairs that are used
 to specify the amd_kernel_code_t object that will be emitted by the assembler.
 The list must be terminated by the *.end_amd_kernel_code_t* directive.  For
 any amd_kernel_code_t values that are unspecified, a default value is used.
 The default value for all keys is 0, with the following exceptions:

 - *kernel_code_version_major* defaults to 1.
 - *machine_kind* defaults to 1.
 - *machine_version_major*, *machine_version_minor*, and
   *machine_version_stepping* are derived from the value of the -mcpu option
   that is passed to the assembler.
 - *kernel_code_entry_byte_offset* defaults to 256.
 - *wavefront_size* defaults to 6.
 - *kernarg_segment_alignment*, *group_segment_alignment*, and
   *private_segment_alignment* default to 4.  Note that alignments are specified
   as a power of two, so a value of **n** means an alignment of 2^ **n**.
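
 The power-of-two convention for the alignment keys can be checked with a short
 calculation. This Python sketch is only an illustration; the function name
 ``alignment_from_field`` is hypothetical:

 .. code-block:: python

   def alignment_from_field(n):
       """Convert an alignment key value n into the alignment 2**n."""
       if n < 0:
           raise ValueError("alignment exponent must be non-negative")
       return 1 << n


   # The default value of 4 for the *_segment_alignment keys therefore
   # corresponds to an alignment of 2**4.
   DEFAULT_ALIGNMENT = alignment_from_field(4)

 Here ``DEFAULT_ALIGNMENT`` evaluates to 16.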

 The *.amd_kernel_code_t* directive must be placed immediately after the
 function label and before any instructions.

 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
 the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
 test/CodeGen/AMDGPU/hsa.s.

 Here is an example of a minimal amd_kernel_code_t specification:

 .. code-block:: none

    .hsa_code_object_version 1,0
    .hsa_code_object_isa

    .hsatext
    .globl  hello_world
    .p2align 8
    .amdgpu_hsa_kernel hello_world

    hello_world:

      .amd_kernel_code_t
          enable_sgpr_kernarg_segment_ptr = 1
          is_ptr64 = 1
          compute_pgm_rsrc1_vgprs = 0
          compute_pgm_rsrc1_sgprs = 0
          compute_pgm_rsrc2_user_sgpr = 2
          kernarg_segment_byte_size = 8
          wavefront_sgpr_count = 2
          workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
    .Lfunc_end0:
         .size   hello_world, .Lfunc_end0-hello_world