| .. _openmp_runtimes: |
| |
| LLVM/OpenMP Runtimes |
| ==================== |
| |
| There are four distinct types of LLVM/OpenMP runtimes |
| |
| LLVM/OpenMP Host Runtime (``libomp``) |
| ------------------------------------- |
| |
| An `early (2015) design document <https://openmp.llvm.org/Reference.pdf>`_ for |
| the LLVM/OpenMP host runtime, aka. `libomp.so`, is available as a `pdf |
| <https://openmp.llvm.org/Reference.pdf>`_. |
| |
| |
| LLVM/OpenMP Target Host Runtime (``libomptarget``) |
| -------------------------------------------------- |
| |
| .. _libopenmptarget_environment_vars: |
| |
| Environment Variables |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| ``libomptarget`` uses environment variables to control different features of the |
| library at runtime. This allows the user to obtain useful runtime information as |
| well as enable or disable certain features. A full list of supported environment |
| variables is defined below. |
| |
| * ``LIBOMPTARGET_DEBUG=<Num>`` |
| * ``LIBOMPTARGET_PROFILE=<Filename>`` |
| * ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD=<Num>`` |
| * ``LIBOMPTARGET_INFO=<Num>`` |
| |
| LIBOMPTARGET_DEBUG |
| """""""""""""""""" |
| |
| ``LIBOMPTARGET_DEBUG`` controls whether or not debugging information will be |
| displayed. This feature is only availible if ``libomptarget`` was built with |
| ``-DOMPTARGET_DEBUG``. The debugging output provided is intended for use by |
| ``libomptarget`` developers. More user-friendly output is presented when using |
| ``LIBOMPTARGET_INFO``. |
| |
| LIBOMPTARGET_PROFILE |
| """""""""""""""""""" |
| ``LIBOMPTARGET_PROFILE`` allows ``libomptarget`` to generate time profile output |
| similar to Clang's ``-ftime-trace`` option. This generates a JSON file based on |
| `Chrome Tracing`_ that can be viewed with ``chrome://tracing`` or the |
| `Speedscope App`_. Building this feature depends on the `LLVM Support Library`_ |
| for time trace output. Using this library is enabled by default when building |
| using the CMake option ``OPENMP_ENABLE_LIBOMPTARGET_PROFILING``. The output will |
| be saved to the filename specified by the environment variable. For multi-threaded |
| applications, profiling in ``libomp`` is also needed. Setting the CMake option |
| ``OPENMP_ENABLE_LIBOMP_PROFILING=ON`` to enable the feature. Note that this will |
| turn ``libomp`` into a C++ library. |
| |
| .. _`Chrome Tracing`: https://www.chromium.org/developers/how-tos/trace-event-profiling-tool |
| |
| .. _`Speedscope App`: https://www.speedscope.app/ |
| |
| .. _`LLVM Support Library`: https://llvm.org/docs/SupportLibrary.html |
| |
| LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD |
| """"""""""""""""""""""""""""""""""""" |
| |
| ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`` sets the threshold size for which the |
| ``libomptarget`` memory manager will handle the allocation. Any allocations |
| larger than this threshold will not use the memory manager and be freed after |
| the device kernel exits. The default threshold value is ``8KB``. If |
| ``LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`` is set to ``0`` the memory manager |
| will be completely disabled. |
| |
| LIBOMPTARGET_INFO |
| """"""""""""""""" |
| |
| ``LIBOMPTARGET_INFO`` allows the user to request different types of runtime |
| information from ``libomptarget``. ``LIBOMPTARGET_INFO`` uses a 32-bit field to |
| enable or disable different types of information. This includes information |
| about data-mappings and kernel execution. It is recommended to build your |
| application with debugging information enabled, this will enable filenames and |
| variable declarations in the information messages. OpenMP Debugging information |
| is enabled at any level of debugging so a full debug runtime is not required. |
| For minimal debugging information compile with `-gline-tables-only`, or compile |
| with `-g` for full debug information. A full list of flags supported by |
| ``LIBOMPTARGET_INFO`` is given below. |
| |
| * Print all data arguments upon entering an OpenMP device kernel: ``0x01`` |
| * Indicate when a mapped address already exists in the device mapping table: |
| ``0x02`` |
| * Dump the contents of the device pointer map at kernel exit: ``0x04`` |
| * Indicate when an entry is changed in the device mapping table: ``0x08`` |
| * Print OpenMP kernel information from device plugins: ``0x10`` |
| |
| Any combination of these flags can be used by setting the appropriate bits. For |
| example, to enable printing all data active in an OpenMP target region along |
| with ``CUDA`` information, run the following ``bash`` command. |
| |
| .. code-block:: console |
| |
| $ env LIBOMPTARGET_INFO=$((1 << 0x1 | 1 << 0x10)) ./your-application |
| |
| Or, to enable every flag run with every bit set. |
| |
| .. code-block:: console |
| |
| $ env LIBOMPTARGET_INFO=-1 ./your-application |
| |
| For example, given a small application implementing the ``ZAXPY`` BLAS routine, |
| ``Libomptarget`` can provide useful information about data mappings and thread |
| usages. |
| |
| .. code-block:: c++ |
| |
| #include <complex> |
| |
| using complex = std::complex<double>; |
| |
| void zaxpy(complex *X, complex *Y, complex D, std::size_t N) { |
| #pragma omp target teams distribute parallel for |
| for (std::size_t i = 0; i < N; ++i) |
| Y[i] = D * X[i] + Y[i]; |
| } |
| |
| int main() { |
| const std::size_t N = 1024; |
| complex X[N], Y[N], D; |
| #pragma omp target data map(to:X[0 : N]) map(tofrom:Y[0 : N]) |
| zaxpy(X, Y, D, N); |
| } |
| |
| Compiling this code targeting ``nvptx64`` with all information enabled will |
| provide the following output from the runtime library. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only zaxpy.cpp -o zaxpy |
| $ env LIBOMPTARGET_INFO=-1 ./zaxpy |
| |
| .. code-block:: text |
| |
| Info: Device supports up to 65536 CUDA blocks and 1024 threads with a warp size of 32 |
| Info: Entering OpenMP data region at zaxpy.cpp:14:1 with 2 arguments: |
| Info: to(X[0:N])[16384] |
| Info: tofrom(Y[0:N])[16384] |
| Info: Creating new map entry with HstPtrBegin=0x00007fff963f4000, |
| TgtPtrBegin=0x00007fff963f4000, Size=16384, Name=X[0:N] |
| Info: Creating new map entry with HstPtrBegin=0x00007fff963f8000, |
| TgtPtrBegin=0x00007fff963f00000, Size=16384, Name=Y[0:N] |
| Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:14:1: |
| Info: Host Ptr Target Ptr Size (B) RefCount Declaration |
| Info: 0x00007fff963f4000 0x00007fd225004000 16384 1 Y[0:N] at zaxpy.cpp:13:17 |
| Info: 0x00007fff963f8000 0x00007fd225000000 16384 1 X[0:N] at zaxpy.cpp:13:11 |
| Info: Entering OpenMP kernel at zaxpy.cpp:6:1 with 4 arguments: |
| Info: firstprivate(N)[8] (implicit) |
| Info: use_address(Y)[0] (implicit) |
| Info: tofrom(D)[16] (implicit) |
| Info: use_address(X)[0] (implicit) |
| Info: Mapping exists (implicit) with HstPtrBegin=0x00007ffe37d8be80, |
| TgtPtrBegin=0x00007f90ff004000, Size=0, updated RefCount=2, Name=Y |
| Info: Creating new map entry with HstPtrBegin=0x00007fff963f33ff0, |
| TgtPtrBegin=0x00007fd225003ff0, Size=16, Name=D |
| Info: Mapping exists (implicit) with HstPtrBegin=0x00007ffe37d8fe80, |
| TgtPtrBegin=0x00007f90ff000000, Size=0, updated RefCount=2, Name=X |
| Info: Launching kernel __omp_offloading_fd02_c2c4ac1a__Z5daxpyPNSt3__17complexIdEES2_S1_m_l6 |
| with 8 blocks and 128 threads in SPMD mode |
| Info: Removing map entry with HstPtrBegin=0x00007fff963f33ff0, |
| TgtPtrBegin=0x00007fd225003ff0, Size=16, Name=D |
| Info: OpenMP Host-Device pointer mappings after block at zaxpy.cpp:6:1: |
| Info: Host Ptr Target Ptr Size (B) RefCount Declaration |
| Info: 0x00007fff963f4000 0x00007fd225004000 16384 1 Y[0:N] at zaxpy.cpp:13:17 |
| Info: 0x00007fff963f8000 0x00007fd225000000 16384 1 X[0:N] at zaxpy.cpp:13:11 |
| Info: Exiting OpenMP data region at zaxpy.cpp:14:1 with 2 arguments: |
| Info: to(X[0:N])[16384] |
| Info: tofrom(Y[0:N])[16384] |
| Info: Removing map entry with HstPtrBegin=0x00007fff963f4000, |
| TgtPtrBegin=0x00007fff963f4000, Size=16384, Name=X[0:N] |
| Info: Removing map entry with HstPtrBegin=0x00007fff963f8000, |
| TgtPtrBegin=0x00007fff963f00000, Size=16384, Name=Y[0:N] |
| |
| From this information, we can see the OpenMP kernel being launched on the CUDA |
| device with enough threads and blocks for all ``1024`` iterations of the loop in |
| simplified :doc:`SPMD Mode <Offloading>`. The information from the OpenMP data |
| region shows the two arrays ``X`` and ``Y`` being copied from the host to the |
| device. This creates an entry in the host-device mapping table associating the |
| host pointers to the newly created device data. The data mappings in the OpenMP |
| device kernel show the default mappings being used for all the variables used |
| implicitly on the device. Because ``X`` and ``Y`` are already mapped in the |
| device's table, no new entries are created. Additionally, the default mapping |
| shows that ``D`` will be copied back from the device once the OpenMP device |
| kernel region ends even though it isn't written to. Finally, at the end of the |
| OpenMP data region the entries for ``X`` and ``Y`` are removed from the table. |
| |
| .. _libopenmptarget_errors: |
| |
| Errors: |
| ^^^^^^^ |
| |
| ``libomptarget`` provides error messages when the program fails inside the |
| OpenMP target region. Common causes of failure could be an invalid pointer |
| access, running out of device memory, or trying to offload when the device is |
| busy. If the application was built with debugging symbols the error messages |
| will additionally provide the source location of the OpenMP target region. |
| |
| For example, consider the following code that implements a simple parallel |
| reduction on the GPU. This code has a bug that causes it to fail in the |
| offloading region. |
| |
| .. code-block:: c++ |
| |
| #include <cstdio> |
| |
| double sum(double *A, std::size_t N) { |
| double sum = 0.0; |
| #pragma omp target teams distribute parallel for reduction(+:sum) |
| for (int i = 0; i < N; ++i) |
| sum += A[i]; |
| |
| return sum; |
| } |
| |
| int main() { |
| const int N = 1024; |
| double A[N]; |
| sum(A, N); |
| } |
| |
| If this code is compiled and run, there will be an error message indicating what is |
| going wrong. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum |
| $ ./sum |
| |
| .. code-block:: text |
| |
| CUDA error: an illegal memory access was encountered |
| Libomptarget error: Copying data from device failed. |
| Libomptarget error: Call to targetDataEnd failed, abort target. |
| Libomptarget error: Failed to process data after launching the kernel. |
| Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings. |
| sum.cpp:5:1: Libomptarget error 1: failure of target construct while offloading is mandatory |
| |
| This shows that there is an illegal memory access occuring inside the OpenMP |
| target region once execution has moved to the CUDA device, suggesting a |
| segmentation fault. This then causes a chain reaction of failures in |
| ``libomptarget``. Another message suggests using the ``LIBOMPTARGET_INFO`` |
| environment variable as described in :ref:`libopenmptarget_environment_vars`. If |
| we do this it will print the sate of the host-target pointer mappings at the |
| time of failure. |
| |
| .. code-block:: console |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O3 -gline-tables-only sum.cpp -o sum |
| $ env LIBOMPTARGET_INFO=4 ./sum |
| |
| .. code-block:: text |
| |
| info: OpenMP Host-Device pointer mappings after block at sum.cpp:5:1: |
| info: Host Ptr Target Ptr Size (B) RefCount Declaration |
| info: 0x00007ffc058280f8 0x00007f4186600000 8 1 sum at sum.cpp:4:10 |
| |
| This tells us that the only data mapped between the host and the device is the |
| ``sum`` variable that will be copied back from the device once the reduction has |
| ended. There is no entry mapping the host array ``A`` to the device. In this |
| situation, the compiler cannot determine the size of the array at compile time |
| so it will simply assume that the pointer is mapped on the device already by |
| default. The solution is to add an explicit map clause in the target region. |
| |
| .. code-block:: c++ |
| |
| double sum(double *A, std::size_t N) { |
| double sum = 0.0; |
| #pragma omp target teams distribute parallel for reduction(+:sum) map(to:A[0 : N]) |
| for (int i = 0; i < N; ++i) |
| sum += A[i]; |
| |
| return sum; |
| } |
| |
| .. toctree:: |
| :hidden: |
| :maxdepth: 1 |
| |
| Offloading |
| |
| LLVM/OpenMP Target Host Runtime Plugins (``libomptarget.rtl.XXXX``) |
| ------------------------------------------------------------------- |
| |
| .. _device_runtime: |
| |
| |
| .. _remote_offloading_plugin: |
| |
| Remote Offloading Plugin: |
| ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The remote offloading plugin permits the execution of OpenMP target regions |
| on devices in remote hosts in addition to the devices connected to the local |
| host. All target devices on the remote host will be exposed to the |
| application as if they were local devices, that is, the remote host CPU or |
| its GPUs can be offloaded to with the appropriate device number. If the |
| server is running on the same host, each device may be identified twice: |
| once through the device plugins and once through the device plugins that the |
| server application has access to. |
| |
| This plugin consists of ``libomptarget.rtl.rpc.so`` and |
| ``openmp-offloading-server`` which should be running on the (remote) host. The |
| server application does not have to be running on a remote host, and can |
| instead be used on the same host in order to debug memory mapping during offloading. |
| These are implemented via gRPC/protobuf so these libraries are required to |
| build and use this plugin. The server must also have access to the necessary |
| target-specific plugins in order to perform the offloading. |
| |
| Due to the experimental nature of this plugin, the CMake variable |
| ``LIBOMPTARGET_ENABLE_EXPERIMENTAL_REMOTE_PLUGIN`` must be set in order to |
| build this plugin. For example, the rpc plugin is not designed to be |
| thread-safe, the server cannot concurrently handle offloading from multiple |
| applications at once (it is synchronous) and will terminate after a single |
| execution. Note that ``openmp-offloading-server`` is unable to |
| remote offload onto a remote host itself and will error out if this is attempted. |
| |
| Remote offloading is configured via environment variables at runtime of the OpenMP application: |
| * ``LIBOMPTARGET_RPC_ADDRESS=<Address>:<Port>`` |
| * ``LIBOMPTARGET_RPC_ALLOCATOR_MAX=<NumBytes>`` |
| * ``LIBOMPTARGET_BLOCK_SIZE=<NumBytes>`` |
| * ``LIBOMPTARGET_RPC_LATENCY=<Seconds>`` |
| |
| LIBOMPTARGET_RPC_ADDRESS |
| """""""""""""""""""""""" |
| The address and port at which the server is running. This needs to be set for |
| the server and the application, the default is ``0.0.0.0:50051``. A single |
| OpenMP executable can offload onto multiple remote hosts by setting this to |
| comma-seperated values of the addresses. |
| |
| LIBOMPTARGET_RPC_ALLOCATOR_MAX |
| """""""""""""""""""""""""""""" |
| After allocating this size, the protobuf allocator will clear. This can be set for both endpoints. |
| |
| LIBOMPTARGET_BLOCK_SIZE |
| """"""""""""""""""""""" |
| This is the maximum size of a single message while streaming data transfers between the two endpoints and can be set for both endpoints. |
| |
| LIBOMPTARGET_RPC_LATENCY |
| """""""""""""""""""""""" |
| This is the maximum amount of time the client will wait for a response from the server. |
| |
| LLVM/OpenMP Target Device Runtime (``libomptarget-ARCH-SUBARCH.bc``) |
| -------------------------------------------------------------------- |
| |