| =========================== |
| OpenMP 16.0.0 Release Notes |
| =========================== |
| |
| |
| .. warning:: |
| These are in-progress notes for the upcoming LLVM 16.0.0 release. |
| Release notes for previous releases can be found on |
| `the Download Page <https://releases.llvm.org/download.html>`_. |
| |
| |
| Introduction |
| ============ |
| |
| This document contains the release notes for the OpenMP runtime, release 16.0.0. |
| Here we describe the status of OpenMP, including major improvements |
| from the previous release. All OpenMP releases may be downloaded |
| from the `LLVM releases web site <https://llvm.org/releases/>`_. |
| |
| Non-comprehensive list of changes in this release |
| ================================================= |
| |
| * OpenMP target offloading is no longer supported on 32-bit Linux systems. |
| ``libomptarget`` and its plugins are not built on 32-bit systems. |
| |
| * The OpenMP target offloading plugins have been re-implemented as the NextGen |
| plugins. They share an internal unified interface that implements the common |
| behavior of all the plugins. Generic optimizations or features can therefore |
| be implemented once, in the plugin interface, and every plugin picks them up |
| with no additional effort. The new plugins also behave more consistently, |
| which simplifies debugging. The NextGen plugins include the NVIDIA CUDA, the |
| AMDGPU and the GenericELF64bit plugins. They are enabled by default and |
| replace the original ones. They can be disabled by setting the environment |
| variable ``LIBOMPTARGET_NEXTGEN_PLUGINS`` to ``false`` (default: ``true``). |
| |
| * Added support for building the OpenMP runtime for Windows on AArch64 and ARM |
| with MinGW-based toolchains. |
| |
| * Made the OpenMP runtime tests run successfully on Windows. |
| |
| * Improved performance and internalization when compiling in LTO mode using |
| ``-foffload-lto``. |
| |
| * Created the ``nvptx-arch`` and ``amdgpu-arch`` tools to query the user's |
| installed GPUs. |
| |
| * Removed ``CLANG_OPENMP_NVPTX_DEFAULT_ARCH`` in favor of using the new |
| ``nvptx-arch`` tool. |
| |
| * Added support for ``--offload-arch=native`` which queries the user's locally |
| available GPU architectures. Now ``-fopenmp --offload-arch=native`` is |
| sufficient to target all of the user's GPUs. |
| |
| * Added ``-fopenmp-target-jit`` to enable JIT support. Only basic JIT |
| functionality is supported in this release. A couple of JIT-related |
| environment variables were added; they are documented on the |
| `LLVM/OpenMP runtimes page <https://openmp.llvm.org/design/Runtimes.html#libomptarget-jit-opt-level>`_. |
| |
| * OpenMP now supports ``-Xarch_host`` to pass compiler arguments only to the |
| host compilation. |
| |
| * Improved ``clang-format`` when used on OpenMP offloading applications. |
| |
| * The ``f16`` suffix is now supported when compiling OpenMP programs if the |
| target supports it. |
| |
| * Python 3 is now required to run the OpenMP LIT tests. |
| |
| * Fixed a number of bugs and regressions. |
| |
| * Improved host thread utilization on target nowait regions. Target tasks are |
| now continuously re-enqueued by the OpenMP runtime until their device-side |
| operations are completed, unblocking the host thread to execute other tasks. |
| |
| * Re-enqueueing of target tasks can be controlled on a per-thread basis using |
| exponential backoff counting. ``OMPTARGET_QUERY_COUNT_THRESHOLD`` defines how |
| many target tasks must be re-enqueued before the thread starts blocking on the |
| device operations (defaults to 10). ``OMPTARGET_QUERY_COUNT_MAX`` defines the |
| maximum value for the per-thread re-enqueue counter (defaults to 5). |
| ``OMPTARGET_QUERY_COUNT_BACKOFF_FACTOR`` defines the decrement factor applied |
| to the counter when a target task is completed (defaults to 0.5). |
| |
| * GPU dynamic shared memory (also known as local data share, LDS) can now be |
| allocated per kernel via the ``ompx_dyn_cgroup_mem(<Bytes>)`` clause. For an |
| example, see https://openmp.llvm.org/design/Runtimes.html#dynamic-shared-memory. |
| |
| * OpenMP-Opt (run as part of O1/O2/O3) now lowers GPU resource usage more |
| effectively and improves performance. |
| |
| * Added support for record-and-replay of individual OpenMP offload kernels. |
| When recording is enabled in the host OpenMP target runtime library, it |
| stores the device image, device memory state, and kernel launch information |
| for each kernel. The newly added command-line tool ``llvm-omp-kernel-replay`` |
| replays kernel execution. The following environment variables control |
| recording and replaying: |
| |
|   * ``LIBOMPTARGET_RECORDING=<0|1>``: ``0`` disables recording (default), |
|     ``1`` enables recording. |
|   * ``LIBOMPTARGET_RR_DEVMEM_SIZE=<integer in bytes>``: amount of device |
|     memory to pre-allocate for storing/loading when recording/replaying |
|     (default: 64GB). |
|   * ``LIBOMPTARGET_RR_SAVE_OUTPUT=<0|1>``: ``0`` disables saving device memory |
|     after kernel execution (default), ``1`` enables it (used for verification |
|     with ``llvm-omp-kernel-replay``). |
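
To illustrate the ``target nowait`` item above, the following is a minimal sketch of a deferred target task that overlaps with independent host work. It uses only standard OpenMP directives. The function name ``offload_scale_check``, the ``MAX_N`` bound and the workload are hypothetical and chosen purely for illustration; they are not part of the runtime.

```c
#include <assert.h>

#define MAX_N 4096 /* hypothetical upper bound for this sketch */

/* Scales the integers 0..n-1 by `factor` in a deferred target task while the
 * host thread sums the unscaled values. Returns device_sum - factor * host_sum,
 * which is 0 when the device and host results agree. */
long offload_scale_check(int n, int factor) {
  static int data[MAX_N];
  long host_sum = 0, device_sum = 0;

  assert(n > 0 && n <= MAX_N);
  for (int i = 0; i < n; ++i)
    data[i] = i;

#pragma omp parallel
#pragma omp single
  {
    /* Deferred target task: the encountering thread does not wait for it. */
#pragma omp target teams distribute parallel for map(tofrom : data[0 : n]) nowait
    for (int i = 0; i < n; ++i)
      data[i] *= factor;

    /* Independent host work that may overlap with the device operations. */
    for (int i = 0; i < n; ++i)
      host_sum += i;

    /* Wait for the target task before reading the mapped data. */
#pragma omp taskwait
  }

  for (int i = 0; i < n; ++i)
    device_sum += data[i];
  return device_sum - (long)factor * host_sum;
}
```

Calling ``offload_scale_check(1024, 3)`` from a ``main`` function should return 0 whether the region runs on a device or falls back to the host. Based on the options listed above, building with ``-fopenmp --offload-arch=native`` should be enough to target the locally installed GPUs. How long the host thread keeps re-enqueueing the target task before it blocks depends on the ``OMPTARGET_QUERY_COUNT_*`` variables described above.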