| ***************************************************** |
| Support for AArch64 Scalable Matrix Extension in LLVM |
| ***************************************************** |
| |
| .. contents:: |
| :local: |
| |
| 1. Introduction |
| =============== |
| |
| The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of |
| attributes for users to control PSTATE.SM and PSTATE.ZA. |
| The :ref:`AArch64 SME ABI<aarch64_sme_abi>` describes the requirements for |
| calls between functions when at least one of those functions uses PSTATE.SM or |
| PSTATE.ZA. |
| |
| This document describes how the SME ACLE attributes map to LLVM IR |
| attributes and how LLVM lowers these attributes to implement the rules and |
| requirements of the ABI. |
| |
| Below we describe the LLVM IR attributes and their relation to the C/C++ |
| level ACLE attributes: |
| |
| ``aarch64_pstate_sm_enabled`` |
| is used for functions with ``__attribute__((arm_streaming))`` |
| |
| ``aarch64_pstate_sm_compatible`` |
| is used for functions with ``__attribute__((arm_streaming_compatible))`` |
| |
| ``aarch64_pstate_sm_body`` |
| is used for functions with ``__attribute__((arm_locally_streaming))`` and is |
| only valid on function definitions (not declarations) |
| |
| ``aarch64_pstate_za_new`` |
| is used for functions with ``__attribute__((arm_new_za))`` |
| |
| ``aarch64_pstate_za_shared`` |
| is used for functions with ``__attribute__((arm_shared_za))`` |
| |
| ``aarch64_pstate_za_preserved`` |
| is used for functions with ``__attribute__((arm_preserves_za))`` |
| |
| ``aarch64_expanded_pstate_za`` |
| is used for functions with ``__attribute__((arm_new_za))`` |
| |
| Clang must ensure that the above attributes are added both to the |
| function's declaration/definition as well as to their call-sites. This is |
| important for calls to attributed function pointers, where there is no |
| definition or declaration available. |
| |
| |
| 2. Handling PSTATE.SM |
| ===================== |
| |
| When changing PSTATE.SM the execution of FP/vector operations may be transferred |
| to another processing element. This has three important implications: |
| |
| * The runtime SVE vector length may change. |
| |
| * The contents of FP/AdvSIMD/SVE registers are zeroed. |
| |
| * The set of allowable instructions changes. |
| |
| This leads to certain restrictions on IR and optimizations. For example, it |
| is undefined behaviour to share vector-length dependent state between functions |
| that may operate with different values for PSTATE.SM. Front-ends must honour |
| these restrictions when generating LLVM IR. |
| |
| Even though the runtime SVE vector length may change, for the purpose of LLVM IR |
| and almost all parts of CodeGen we can assume that the runtime value for |
| ``vscale`` does not. If we let the compiler insert the appropriate ``smstart`` |
| and ``smstop`` instructions around call boundaries, then the effects on SVE |
| state can be mitigated. By limiting the state changes to a very brief window |
| around the call we can control how the operations are scheduled and how live |
| values remain preserved between state transitions. |
| |
| In order to control PSTATE.SM at this level of granularity, we use function and |
| callsite attributes rather than intrinsics. |
| |
| |
| Restrictions on attributes |
| -------------------------- |
| |
| * It is undefined behaviour to pass or return (pointers to) scalable vector |
| objects to/from functions which may use a different SVE vector length. |
| This includes functions with a non-streaming interface, but marked with |
| ``aarch64_pstate_sm_body``. |
| |
| * It is not allowed for a function to be decorated with both |
| ``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``. |
| |
| * It is not allowed for a function to be decorated with both |
| ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_preserved``. |
| |
| * It is not allowed for a function to be decorated with both |
| ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_shared``. |
| |
| These restrictions also apply in the higher level SME ACLE, which means we can |
| emit diagnostics in Clang to signal users about incorrect behaviour. |
| |
| |
| Compiler inserted streaming-mode changes |
| ---------------------------------------- |
| |
| The table below describes the transitions in PSTATE.SM the compiler has to |
| account for when doing calls between functions with different attributes. |
| In this table, we use the following abbreviations: |
| |
| ``N`` |
| functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on |
| return) |
| |
| ``S`` |
| functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1 |
| on return) |
| |
| ``SC`` |
| functions with a Streaming-Compatible interface (PSTATE.SM can be |
| either 0 or 1 on entry, and is unchanged on return). |
| |
| Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this |
| table because for the caller the attribute is synonymous to 'streaming', and |
| for the callee it is merely an implementation detail that is explicitly not |
| exposed to the caller. |
| |
| .. table:: Combinations of calls for functions with different attributes |
| |
| ==== ==== =============================== ============================== ============================== |
| From To Before call After call After exception |
| ==== ==== =============================== ============================== ============================== |
| N N |
| N S SMSTART SMSTOP |
| N SC |
| S N SMSTOP SMSTART SMSTART |
| S S SMSTART |
| S SC SMSTART |
| SC N If PSTATE.SM before call is 1, If PSTATE.SM before call is 1, If PSTATE.SM before call is 1, |
| then SMSTOP then SMSTART then SMSTART |
| SC S If PSTATE.SM before call is 0, If PSTATE.SM before call is 0, If PSTATE.SM before call is 1, |
| then SMSTART then SMSTOP then SMSTART |
| SC SC If PSTATE.SM before call is 1, |
| then SMSTART |
| ==== ==== =============================== ============================== ============================== |
| |
| |
| Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit |
| the ``smstart`` and ``smstop`` instructions before register allocation, so that |
| the register allocator can spill/reload registers around the mode change. |
| |
| The compiler should also have sufficient information on which operations are |
| part of the call/function's arguments/result and which operations are part of |
| the function's body, so that it can place the mode changes in exactly the right |
| position. The suitable place to do this seems to be SelectionDAG, where it lowers |
| the call's arguments/return values to implement the specified calling convention. |
| SelectionDAG provides Chains and Glue to specify the order of operations and give |
| preliminary control over the instruction's scheduling. |
| |
| |
| Example of preserving state |
| --------------------------- |
| |
| When passing and returning a ``float`` value to/from a function |
| that has a streaming interface from a function that has a normal interface, the |
| call-site will need to ensure that the argument/result registers are preserved |
| and that no other code is scheduled in between the ``smstart/smstop`` and the call. |
| |
| .. code-block:: llvm |
| |
| define float @foo(float %f) nounwind { |
| %res = call float @bar(float %f) "aarch64_pstate_sm_enabled" |
| ret float %res |
| } |
| |
| declare float @bar(float) "aarch64_pstate_sm_enabled" |
| |
| The program needs to preserve the value of the floating point argument and |
| return value in register ``s0``: |
| |
| .. code-block:: none |
| |
| foo: // @foo |
| // %bb.0: |
| stp d15, d14, [sp, #-80]! // 16-byte Folded Spill |
| stp d13, d12, [sp, #16] // 16-byte Folded Spill |
| stp d11, d10, [sp, #32] // 16-byte Folded Spill |
| stp d9, d8, [sp, #48] // 16-byte Folded Spill |
| str x30, [sp, #64] // 8-byte Folded Spill |
| str s0, [sp, #76] // 4-byte Folded Spill |
| smstart sm |
| ldr s0, [sp, #76] // 4-byte Folded Reload |
| bl bar |
| str s0, [sp, #76] // 4-byte Folded Spill |
| smstop sm |
| ldp d9, d8, [sp, #48] // 16-byte Folded Reload |
| ldp d11, d10, [sp, #32] // 16-byte Folded Reload |
| ldp d13, d12, [sp, #16] // 16-byte Folded Reload |
| ldr s0, [sp, #76] // 4-byte Folded Reload |
| ldr x30, [sp, #64] // 8-byte Folded Reload |
| ldp d15, d14, [sp], #80 // 16-byte Folded Reload |
| ret |
| |
| Setting the correct register masks on the ISD nodes and inserting the |
| ``smstart/smstop`` in the right places should ensure this is done correctly. |
| |
| |
| Instruction Selection Nodes |
| --------------------------- |
| |
| .. code-block:: none |
| |
| AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask] |
| AArch64ISD::SMSTOP Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask] |
| |
| The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState`` operand for |
| the case of a conditional SMSTART/SMSTOP. The instruction will only be executed |
| if CurrentState != ExpectedState. |
| |
| When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time |
| (i.e. they are both constants) then an unconditional ``smstart/smstop`` |
| instruction is emitted. Otherwise the node is matched to a Pseudo instruction |
| which expands to a compare/branch and a ``smstart/smstop``. This is necessary to |
| implement transitions from ``SC -> N`` and ``SC -> S``. |
| |
| |
| Unchained Function calls |
| ------------------------ |
| When a function with "``aarch64_pstate_sm_enabled``" calls a function that is not |
| streaming compatible, the compiler has to insert a SMSTOP before the call and |
| insert a SMSTOP after the call. |
| |
| If the function that is called is an intrinsic with no side-effects which in |
| turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to |
| ``@llvm.cos()`` is not part of any Chain; it can be scheduled freely. |
| |
| Lowering of a Callsite creates a small chain of nodes which: |
| |
| - starts a call sequence |
| |
| - copies input values from virtual registers to physical registers specified by |
| the ABI |
| |
| - executes a branch-and-link |
| |
| - stops the call sequence |
| |
| - copies the output values from their physical registers to virtual registers |
| |
| When the callsite's Chain is not used, only the result value from the chained |
| sequence is used, but the Chain itself is discarded. |
| |
| The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real |
| values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't |
| used, these nodes are not considered for scheduling and are |
| removed from the DAG. In order to prevent these nodes |
| from being removed, we need a way to ensure the results from the |
| ``CopyFromReg`` can only be **used after** the ``SMSTART/SMSTOP`` has been |
| executed. |
| |
| We can use a CopyToReg -> CopyFromReg sequence for this, which moves the |
| value to/from a virtual register and chains these nodes with the |
| SMSTART/SMSTOP to make them part of the expression that calculates |
| the result value. The resulting COPY nodes are removed by the register |
| allocator. |
| |
| The example below shows how this is used in a DAG that does not link |
| together the result by a Chain, but rather by a value: |
| |
| .. code-block:: none |
| |
| t0: ch,glue = AArch64ISD::SMSTOP ... |
| t1: ch,glue = ISD::CALL .... |
| t2: res,ch,glue = CopyFromReg t1, ... |
| t3: ch,glue = AArch64ISD::SMSTART t2:1, .... <- this is now part of the expression that returns the result value. |
| t4: ch = CopyToReg t3, Register:f64 %vreg, t2 |
| t5: res,ch = CopyFromReg t4, Register:f64 %vreg |
| t6: res = FADD t5, t9 |
| |
| We also need this for locally streaming functions, where an ``SMSTART`` needs to |
| be inserted into the DAG at the start of the function. |
| |
| Functions with __attribute__((arm_locally_streaming)) |
| ----------------------------------------------------- |
| |
| If a function is marked as ``arm_locally_streaming``, then the runtime SVE |
| vector length in the prologue/epilogue may be different from the vector length |
| in the function's body. This happens because we invoke smstart after setting up |
| the stack-frame and similarly invoke smstop before deallocating the stack-frame. |
| |
| To ensure we use the correct SVE vector length to allocate the locals with, we |
| can use the streaming vector-length to allocate the stack-slots through the |
| ``ADDSVL`` instruction, even when the CPU is not yet in streaming mode. |
| |
| This only works for locals and not callee-save slots, since LLVM doesn't support |
| mixing two different scalable vector lengths in one stack frame. That means that the |
| case where a function is marked ``arm_locally_streaming`` and needs to spill SVE |
| callee-saves in the prologue is currently unsupported. However, it is unlikely |
| for this to happen without user intervention, because ``arm_locally_streaming`` |
| functions cannot take or return vector-length-dependent values. This would otherwise |
| require forcing both the SVE PCS using '``aarch64_sve_pcs``' combined with using |
| ``arm_locally_streaming`` in order to encounter this problem. This combination |
| can be prevented in Clang through emitting a diagnostic. |
| |
| |
| An example of how the prologue/epilogue would look for a function that is |
| attributed with ``arm_locally_streaming``: |
| |
| .. code-block:: c++ |
| |
| #define N 64 |
| |
| void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *); |
| |
| // Use a float argument type, to check the value isn't clobbered by smstart. |
| // Use a float return type to check the value isn't clobbered by smstop. |
| float __attribute__((noinline, arm_locally_streaming)) foo(float arg) { |
| // Create local for SVE vector to check local is created with correct |
| // size when not yet in streaming mode (ADDSVL). |
| float array[N]; |
| svfloat32_t vector; |
| |
| some_use(&vector); |
| svst1_f32(svptrue_b32(), &array[0], vector); |
| return array[N - 1] + arg; |
| } |
| |
| should use ADDSVL for allocating the stack space and should avoid clobbering |
| the return/argument values. |
| |
| .. code-block:: none |
| |
| _Z3foof: // @_Z3foof |
| // %bb.0: // %entry |
| stp d15, d14, [sp, #-96]! // 16-byte Folded Spill |
| stp d13, d12, [sp, #16] // 16-byte Folded Spill |
| stp d11, d10, [sp, #32] // 16-byte Folded Spill |
| stp d9, d8, [sp, #48] // 16-byte Folded Spill |
| stp x29, x30, [sp, #64] // 16-byte Folded Spill |
| add x29, sp, #64 |
| str x28, [sp, #80] // 8-byte Folded Spill |
| addsvl sp, sp, #-1 |
| sub sp, sp, #256 |
| str s0, [x29, #28] // 4-byte Folded Spill |
| smstart sm |
| sub x0, x29, #64 |
| addsvl x0, x0, #-1 |
| bl _Z10some_usePu13__SVFloat32_t |
| sub x8, x29, #64 |
| ptrue p0.s |
| ld1w { z0.s }, p0/z, [x8, #-1, mul vl] |
| ldr s1, [x29, #28] // 4-byte Folded Reload |
| st1w { z0.s }, p0, [sp] |
| ldr s0, [sp, #252] |
| fadd s0, s0, s1 |
| str s0, [x29, #28] // 4-byte Folded Spill |
| smstop sm |
| ldr s0, [x29, #28] // 4-byte Folded Reload |
| addsvl sp, sp, #1 |
| add sp, sp, #256 |
| ldp x29, x30, [sp, #64] // 16-byte Folded Reload |
| ldp d9, d8, [sp, #48] // 16-byte Folded Reload |
| ldp d11, d10, [sp, #32] // 16-byte Folded Reload |
| ldp d13, d12, [sp, #16] // 16-byte Folded Reload |
| ldr x28, [sp, #80] // 8-byte Folded Reload |
| ldp d15, d14, [sp], #96 // 16-byte Folded Reload |
| ret |
| |
| |
| Preventing the use of illegal instructions in Streaming Mode |
| ------------------------------------------------------------ |
| |
| * When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2 |
| instructions and most AdvSIMD/NEON instructions are invalid. |
| |
| * When executing a program in normal mode (PSTATE.SM=0), a subset of SME |
| instructions are invalid. |
| |
| * Streaming-compatible functions must only use instructions that are valid when |
| either PSTATE.SM=0 or PSTATE.SM=1. |
| |
| The value of PSTATE.SM is not controlled by the feature flags, but rather by the |
| function attributes. This means that we can compile for '``+sme``' and the compiler |
| will code-generate any instructions, even if they are not legal under the requested |
| streaming mode. The compiler needs to use the function attributes to ensure the |
| compiler doesn't do transformations under the assumption that certain operations |
| are available at runtime. |
| |
| We made a conscious choice not to model this with feature flags, because we |
| still want to support inline-asm in either mode (with the user placing |
| smstart/smstop manually), and this became rather complicated to implement at the |
| individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_ |
| and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in |
| TableGen. |
| |
| As a first step, this means we'll disable vectorization (LoopVectorize/SLP) |
| entirely when the a function has either of the ``aarch64_pstate_sm_enabled``, |
| ``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes, |
| in order to avoid the use of vector instructions. |
| |
| Later on we'll aim to relax these restrictions to enable scalable |
| auto-vectorization with a subset of streaming-compatible instructions, but that |
| requires changes to the CostModel, Legalization and SelectionDAG lowering. |
| |
| We will also emit diagnostics in Clang to prevent the use of |
| non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a |
| function is decorated with the streaming mode attributes. |
| |
| |
| Other things to consider |
| ------------------------ |
| |
| * Inlining must be disabled when the call-site needs to toggle PSTATE.SM or |
| when the callee's function body is executed in a different streaming mode than |
| its caller. This is needed because function calls are the boundaries for |
| streaming mode changes. |
| |
| * Tail call optimization must be disabled when the call-site needs to toggle |
| PSTATE.SM, such that the caller can restore the original value of PSTATE.SM. |
| |
| |
| 3. Handling PSTATE.ZA |
| ===================== |
| |
| In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector |
| length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe |
| to toggle PSTATE.ZA using intrinsics. This also makes it simpler to setup a |
| lazy-save mechanism for calls to private-ZA functions (i.e. functions that may |
| either directly or indirectly clobber ZA state). |
| |
| For the purpose of handling functions marked with ``aarch64_pstate_za_new``, |
| we have introduced a new LLVM IR pass (SMEABIPass) that is run just before |
| SelectionDAG. Any such functions dealt with by this pass are marked with |
| ``aarch64_expanded_pstate_za``. |
| |
| Setting up a lazy-save |
| ---------------------- |
| |
| Committing a lazy-save |
| ---------------------- |
| |
| Exception handling and ZA |
| ------------------------- |
| |
| 4. Types |
| ======== |
| |
| AArch64 Predicate-as-Counter Type |
| --------------------------------- |
| |
| :Overview: |
| |
| The predicate-as-counter type represents the type of a predicate-as-counter |
| value held in a AArch64 SVE predicate register. Such a value contains |
| information about the number of active lanes, the element width and a bit that |
| tells whether the generated mask should be inverted. ACLE intrinsics should be |
| used to move the predicate-as-counter value to/from a predicate vector. |
| |
| There are certain limitations on the type: |
| |
| * The type can be used for function parameters and return values. |
| |
| * The supported LLVM operations on this type are limited to ``load``, ``store``, |
| ``phi``, ``select`` and ``alloca`` instructions. |
| |
| The predicate-as-counter type is a scalable type. |
| |
| :Syntax: |
| |
| :: |
| |
| target("aarch64.svcount") |
| |
| |
| |
| 5. References |
| ============= |
| |
| .. _aarch64_sme_acle: |
| |
| 1. `SME ACLE Pull-request <https://github.com/ARM-software/acle/pull/188>`__ |
| |
| .. _aarch64_sme_abi: |
| |
| 2. `SME ABI Pull-request <https://github.com/ARM-software/abi-aa/pull/123>`__ |