| .. _amdgpu-async-operations: |
| |
| =============================== |
| AMDGPU Asynchronous Operations |
| =============================== |
| |
| .. contents:: |
| :local: |
| |
| Introduction |
| ============ |
| |
| Asynchronous operations are memory transfers (usually between the global memory |
| and LDS) that are completed independently at an unspecified scope. A thread that |
| requests one or more asynchronous transfers can use *async marks* to track |
| their completion. The thread waits for each mark to be *completed*, which |
| indicates that requests initiated in program order before this mark have also |
| completed. |
| |
| Operations |
| ========== |
| |
| Memory Accesses |
| --------------- |
| |
| LDS DMA Operations |
| ^^^^^^^^^^^^^^^^^^ |
| |
| .. code-block:: llvm |
| |
| ; "Legacy" LDS DMA operations |
| void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst) |
| void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst) |
| void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst) |
| void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst) |
| void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst) |
| void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst) |
| |
| Request an async operation that copies the specified number of bytes from the |
| global/buffer pointer ``%src`` to the LDS pointer ``%dst``. |
| |
| .. note:: |
| |
| The above listing is *merely representative*. The actual function signatures |
| are identical to their non-async variants, and supported only on the |
| corresponding architectures (GFX9 and GFX10). |
| |
| Async Mark Operations |
| --------------------- |
| |
| An *async mark* in the abstract machine tracks all the async operations that |
| are program ordered before that mark. A mark M is said to be *completed* |
| only when all async operations program ordered before M are reported by the |
| implementation as having finished, and it is said to be *outstanding* otherwise. |
| |
| Thus we have the following sufficient condition: |
| |
| An async operation X is *completed* at a program point P if there exists a |
| mark M such that X is program ordered before M, M is program ordered before |
| P, and M is completed. X is said to be *outstanding* at P otherwise. |
| |
| The abstract machine maintains a sequence of *async marks* during the |
| execution of a function body, which excludes any marks produced by calls to |
| other functions encountered in the currently executing function. |
| |
| |
| ``@llvm.amdgcn.asyncmark()`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| When executed, inserts an async mark in the sequence associated with the |
| currently executing function body. |
| |
| ``@llvm.amdgcn.wait.asyncmark(i16 %N)`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Waits until there are at most N outstanding marks in the sequence associated |
| with the currently executing function body. |
| |
| Memory Consistency Model |
| ======================== |
| |
| Each asynchronous operation consists of a non-atomic read on the source and a |
| non-atomic write on the destination. Async "LDS DMA" intrinsics result in async |
| accesses that guarantee visibility relative to other memory operations as |
| follows: |
| |
| An asynchronous operation `A` program ordered before an overlapping memory |
| operation `X` happens-before `X` only if `A` is completed before `X`. |
| |
| A memory operation `X` program ordered before an overlapping asynchronous |
| operation `A` happens-before `A`. |
| |
| .. note:: |
| |
| The *only if* in the above wording implies that unlike the default LLVM |
| memory model, certain program order edges are not automatically included in |
| ``happens-before``. |
| |
| Examples |
| ======== |
| |
| Uneven blocks of async transfers |
| -------------------------------- |
| |
| .. code-block:: c++ |
| |
| void foo(global int *g, local int *l) { |
| // first block |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| asyncmark(); |
| |
| // second block; longer |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| asyncmark(); |
| |
| // third block; shorter |
| async_load_to_lds(l, g); |
| async_load_to_lds(l, g); |
| asyncmark(); |
| |
| // Wait for first block |
| wait.asyncmark(2); |
| } |
| |
| Software pipeline |
| ----------------- |
| |
| .. code-block:: c++ |
| |
| void foo(global int *g, local int *l) { |
| // first block |
| asyncmark(); |
| |
| // second block |
| asyncmark(); |
| |
| // third block |
| asyncmark(); |
| |
| for (;;) { |
| wait.asyncmark(2); |
| // use data |
| |
| // next block |
| asyncmark(); |
| } |
| |
| // flush one block |
| wait.asyncmark(2); |
| |
| // flush one more block |
| wait.asyncmark(1); |
| |
| // flush last block |
| wait.asyncmark(0); |
| } |
| |
| Ordinary function call |
| ---------------------- |
| |
| .. code-block:: c++ |
| |
| extern void bar(); // may or may not make async calls |
| |
| void foo(global int *g, local int *l) { |
| // first block |
| asyncmark(); |
| |
| // second block |
| asyncmark(); |
| |
| // function call |
| bar(); |
| |
| // third block |
| asyncmark(); |
| |
| wait.asyncmark(1); // will wait for at least the second block, possibly including bar() |
| wait.asyncmark(0); // will wait for third block, including bar() |
| } |
| |
| Implementation notes |
| ==================== |
| |
| [This section is informational.] |
| |
| Optimization |
| ------------ |
| |
| The implementation may eliminate async mark/wait intrinsics in the following cases: |
| |
| 1. An ``asyncmark`` operation which is not included in the wait count of a later |
| wait operation in the current function. In particular, an ``asyncmark`` which |
| is not post-dominated by any ``wait.asyncmark``. |
| 2. A ``wait.asyncmark`` whose wait count is more than the outstanding async |
| marks at that point. In particular, a ``wait.asyncmark`` that is not |
| dominated by any ``asyncmark``. |
| |
| In general, at a function call, if the caller uses sufficient waits to track |
| its own async operations, the actions performed by the callee cannot affect |
| correctness. But inlining such a call may result in redundant waits. |
| |
| .. code-block:: c++ |
| |
| void foo() { |
| asyncmark(); // A |
| } |
| |
| void bar() { |
| asyncmark(); // B |
| asyncmark(); // C |
| foo(); |
| wait.asyncmark(1); |
| } |
| |
| Before inlining, the ``wait.asyncmark`` waits for mark B to be completed. |
| |
| .. code-block:: c++ |
| |
| void foo() { |
| } |
| |
| void bar() { |
| asyncmark(); // B |
| asyncmark(); // C |
| asyncmark(); // A from call to foo() |
| wait.asyncmark(1); |
| } |
| |
| After inlining, the asyncmark-wait now waits for mark C to complete, which is |
| longer than necessary. Ideally, the optimizer should have eliminated mark A in |
| the body of foo() itself. |