.. _amdgpu-async-operations:
===============================
AMDGPU Asynchronous Operations
===============================
.. contents::
   :local:

Introduction
============
Asynchronous operations are memory transfers (usually between the global memory
and LDS) that are completed independently at an unspecified scope. A thread that
requests one or more asynchronous transfers can use *async marks* to track
their completion. The thread waits for a mark to be *completed*, which
indicates that all requests initiated in program order before that mark have
also completed.
Operations
==========
Memory Accesses
---------------
LDS DMA Operations
^^^^^^^^^^^^^^^^^^
.. code-block:: llvm

   ; "Legacy" LDS DMA operations
   void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst)

Request an async operation that copies the specified number of bytes from the
global/buffer pointer ``%src`` to the LDS pointer ``%dst``.
.. note::

   The above listing is *merely representative*. The actual function signatures
   are identical to those of their non-async variants, and the intrinsics are
   supported only on the corresponding architectures (GFX9 and GFX10).

Async Mark Operations
---------------------
An *async mark* in the abstract machine tracks all the async operations that
are program ordered before that mark. A mark M is said to be *completed*
only when all async operations program ordered before M are reported by the
implementation as having finished, and it is said to be *outstanding* otherwise.
Thus we have the following sufficient condition:

   An async operation X is *completed* at a program point P if there exists a
   mark M such that X is program ordered before M, M is program ordered before
   P, and M is completed. X is said to be *outstanding* at P otherwise.

The abstract machine maintains a sequence of *async marks* during the
execution of a function body. This sequence excludes any marks produced by
calls to other functions made from the currently executing function.
``@llvm.amdgcn.asyncmark()``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When executed, inserts an async mark in the sequence associated with the
currently executing function body.
``@llvm.amdgcn.wait.asyncmark(i16 %N)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Waits until there are at most ``%N`` outstanding marks in the sequence
associated with the currently executing function body.
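
The mark and wait semantics above can be pictured with a small host-side model.
This is only a sketch under stated assumptions, not a description of how the
hardware or the compiler tracks marks: the ``MarkSequence`` type and its members
are hypothetical names introduced for illustration. The model relies on one
consequence of the definition of a completed mark: if a mark is completed, every
earlier mark is completed too, so a wait retires marks oldest first. It records
only what a wait guarantees; in a real execution a mark may complete earlier.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <deque>

   // Hypothetical host-side model of the per-function async mark sequence.
   struct MarkSequence {
     std::deque<int> outstanding; // ids of marks not known to be completed, oldest first
     int next_id = 0;

     // Models asyncmark(): appends a new outstanding mark and returns its id.
     int asyncmark() {
       outstanding.push_back(next_id);
       return next_id++;
     }

     // Models wait.asyncmark(n): retires the oldest marks until at most n remain.
     void wait_asyncmark(std::size_t n) {
       while (outstanding.size() > n)
         outstanding.pop_front();
     }

     // True if an earlier wait guarantees that mark `id` is completed.
     bool is_completed(int id) const {
       assert(id >= 0 && id < next_id && "mark must have been inserted");
       for (int m : outstanding)
         if (m == id)
           return false;
       return true;
     }
   };

In this model, three calls to ``asyncmark()`` followed by ``wait_asyncmark(2)``
leave only the first mark reported as completed, and a wait whose count is not
smaller than the number of outstanding marks changes nothing.
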
Memory Consistency Model
========================
Each asynchronous operation consists of a non-atomic read on the source and a
non-atomic write on the destination. Async "LDS DMA" intrinsics result in async
accesses that guarantee visibility relative to other memory operations as
follows:

   An asynchronous operation `A` program ordered before an overlapping memory
   operation `X` happens-before `X` only if `A` is completed before `X`.

   A memory operation `X` program ordered before an overlapping asynchronous
   operation `A` happens-before `A`.

.. note::

   The *only if* in the above wording implies that unlike the default LLVM
   memory model, certain program order edges are not automatically included in
   ``happens-before``.

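
The first rule can be illustrated by replaying a single thread's program-ordered
trace and asking, at each ordinary memory access, whether every earlier async
operation is known to be completed. The following is a simplified sketch, not
part of the memory model itself: ``Kind``, ``Event`` and
``accesses_ordered_after_async`` are hypothetical names, every access is assumed
to overlap every async operation, and completion is inferred only from waits.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <deque>
   #include <vector>

   // Hypothetical event kinds for one thread's program-ordered trace.
   enum class Kind { AsyncOp, Mark, Wait, Access };

   struct Event {
     Kind kind;
     std::size_t count; // used only by Wait: the N in wait.asyncmark(N)
   };

   // For each Access in the trace, reports whether every earlier AsyncOp is
   // known to be completed at that point. Only then is the access ordered
   // after those operations by the first rule above.
   std::vector<bool> accesses_ordered_after_async(const std::vector<Event> &trace) {
     std::deque<std::size_t> marks; // per outstanding mark: ops issued before it
     std::size_t issued = 0;        // async operations requested so far
     std::size_t completed = 0;     // async operations covered by completed marks
     std::vector<bool> result;
     for (const Event &e : trace) {
       switch (e.kind) {
       case Kind::AsyncOp:
         ++issued;
         break;
       case Kind::Mark:
         marks.push_back(issued);
         break;
       case Kind::Wait:
         // Retire the oldest marks until at most e.count remain outstanding.
         while (marks.size() > e.count) {
           completed = marks.front();
           marks.pop_front();
         }
         assert(completed <= issued);
         break;
       case Kind::Access:
         result.push_back(completed == issued);
         break;
       }
     }
     return result;
   }

An access placed between an async operation and the wait that covers it is
reported as unordered, which corresponds to the program order edge that the
*only if* wording leaves out of ``happens-before``.
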
Examples
========
Uneven blocks of async transfers
--------------------------------
.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // second block; longer
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // third block; shorter
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // Wait for first block
     wait.asyncmark(2);
   }

Software pipeline
-----------------
.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     asyncmark();
     // second block
     asyncmark();
     // third block
     asyncmark();

     for (;;) {
       wait.asyncmark(2);
       // use data

       // next block
       asyncmark();
     }

     // flush one block
     wait.asyncmark(2);
     // flush one more block
     wait.asyncmark(1);
     // flush last block
     wait.asyncmark(0);
   }

Ordinary function call
----------------------
.. code-block:: c++

   extern void bar(); // may or may not make async calls

   void foo(global int *g, local int *l) {
     // first block
     asyncmark();
     // second block
     asyncmark();

     // function call
     bar();

     // third block
     asyncmark();

     wait.asyncmark(1); // will wait for at least the second block, possibly including bar()
     wait.asyncmark(0); // will wait for third block, including bar()
   }

Implementation notes
====================
[This section is informational.]
Optimization
------------
The implementation may eliminate async mark/wait intrinsics in the following cases:

1. An ``asyncmark`` operation which is not included in the wait count of a later
   wait operation in the current function. In particular, an ``asyncmark`` which
   is not post-dominated by any ``wait.asyncmark``.
2. A ``wait.asyncmark`` whose wait count is at least the number of outstanding
   async marks at that point. In particular, a ``wait.asyncmark`` that is not
   dominated by any ``asyncmark``.

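
For straight-line code, the two cases above can be sketched as a pair of linear
scans. This is an illustrative model only and not the compiler's actual pass:
``Op``, ``Instr`` and ``find_removable`` are hypothetical names, control flow and
calls are ignored, and only the "in particular" form of the first case is
detected. A wait is treated as redundant when its count is at least the number
of marks that can still be outstanding, because the condition "at most N
outstanding marks" then already holds.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Hypothetical straight-line function body made only of marks and waits.
   enum class Op { Mark, Wait };

   struct Instr {
     Op op;
     std::size_t count; // used only by Wait: the N in wait.asyncmark(N)
   };

   // Returns one flag per instruction: true if this sketch considers it removable.
   std::vector<bool> find_removable(const std::vector<Instr> &body) {
     std::vector<bool> removable(body.size(), false);

     // Forward scan: track an upper bound on the outstanding marks. A wait
     // whose count is not smaller than that bound cannot block.
     std::size_t outstanding = 0;
     for (std::size_t i = 0; i < body.size(); ++i) {
       if (body[i].op == Op::Mark) {
         ++outstanding;
       } else {
         assert(body[i].op == Op::Wait);
         if (body[i].count >= outstanding)
           removable[i] = true;
         else
           outstanding = body[i].count;
       }
     }

     // Backward scan: marks with no later wait in the body are never waited on.
     for (std::size_t i = body.size(); i-- > 0;) {
       if (body[i].op == Op::Wait)
         break;
       removable[i] = true;
     }
     return removable;
   }

The forward scan is deliberately conservative: it counts every mark as
outstanding until a wait retires it, even though a mark may complete earlier at
run time.
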
In general, at a function call, if the caller uses sufficient waits to track
its own async operations, the actions performed by the callee cannot affect
correctness. But inlining such a call may result in redundant waits.

.. code-block:: c++

   void foo() {
     asyncmark(); // A
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     foo();
     wait.asyncmark(1);
   }

Before inlining, the ``wait.asyncmark`` waits for mark B to be completed.

.. code-block:: c++

   void foo() {
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     asyncmark(); // A from call to foo()
     wait.asyncmark(1);
   }

After inlining, the ``wait.asyncmark`` waits for mark C to be completed as
well, which is longer than necessary. Ideally, the optimizer should have
eliminated mark A in the body of ``foo()`` itself.