===========================
OpenMP 16.0.0 Release Notes
===========================
.. warning::
   These are in-progress notes for the upcoming LLVM 16.0.0 release.
   Release notes for previous releases can be found on
   `the Download Page <>`_.

This document contains the release notes for the OpenMP runtime, release 16.0.0.
Here we describe the status of OpenMP, including major improvements
from the previous release. All OpenMP releases may be downloaded
from the `LLVM releases web site <>`_.

Non-comprehensive list of changes in this release
=================================================
* OpenMP target offloading is no longer supported on 32-bit Linux systems.
``libomptarget`` and its plugins are not built on 32-bit systems.
* OpenMP target offloading plugins have been re-implemented as the NextGen
plugins. They share an internal unified interface that implements the common
behavior of all the plugins. This way, generic optimizations or features can
be implemented once, in the plugin interface, and every plugin picks them up
with no additional effort. The new plugins also behave more consistently with
one another, which simplifies debugging. The NextGen module includes the
NVIDIA CUDA, the AMDGPU and the GenericELF64bit plugins. These NextGen plugins
are enabled by default and replace the original ones. The new plugins can be
disabled by setting the environment variable ``LIBOMPTARGET_NEXTGEN_PLUGINS``
to ``false`` (default: ``true``).
* Added support for building the OpenMP runtime for Windows on AArch64 and ARM
with MinGW-based toolchains.
* Made the OpenMP runtime tests run successfully on Windows.
* Improved performance and internalization when compiling in LTO mode using
* Created the ``nvptx-arch`` and ``amdgpu-arch`` tools to query the user's
installed GPUs.
* Removed ``CLANG_OPENMP_NVPTX_DEFAULT_ARCH`` in favor of using the new
``nvptx-arch`` tool.
* Added support for ``--offload-arch=native`` which queries the user's locally
available GPU architectures. Now ``-fopenmp --offload-arch=native`` is
sufficient to target all of the user's GPUs.
* Added ``-fopenmp-target-jit`` to enable JIT support. Only basic JIT
functionality is supported in this release. A couple of JIT-related
environment variables were added; they are described on the
`LLVM/OpenMP runtimes page <>`_.
* OpenMP now supports ``-Xarch_host`` to pass compiler arguments only to the
host compilation.
* Improved ``clang-format`` when used on OpenMP offloading applications.
* The ``f16`` suffix is supported when compiling OpenMP programs if the target
supports it.
* Python 3 is now required to run the OpenMP LIT tests.
* Fixed a number of bugs and regressions.
* Improved host thread utilization on target nowait regions. Target tasks are
now continuously re-enqueued by the OpenMP runtime until their device-side
operations are completed, unblocking the host thread to execute other tasks.
* Re-enqueueing of target tasks can be controlled on a per-thread basis using
exponential backoff counting. ``OMPTARGET_QUERY_COUNT_THRESHOLD`` defines how
many target tasks must be re-enqueued before the thread starts blocking on the
device operations (defaults to 10). ``OMPTARGET_QUERY_COUNT_MAX`` defines the
maximum value for the per-thread re-enqueue counter (defaults to 5).
``OMPTARGET_QUERY_COUNT_BACKOFF_FACTOR`` defines the decrement factor applied
to the counter when a target task is completed (defaults to 0.5).
* GPU dynamic shared memory (also known as local data share, or LDS) can now be allocated
per kernel via the ``ompx_dyn_cgroup_mem(<Bytes>)`` clause. For an example,
* OpenMP-Opt (run as part of O1/O2/O3) now lowers GPU resource usage more
effectively and improves performance.
* Added support for record-and-replay functionality for individual OpenMP
offload kernels. Enabling recording in the host OpenMP target runtime library
stores, for each kernel, the device image, the device memory state, and the
kernel launch information. The newly added command-line tool
``llvm-omp-kernel-replay`` replays kernel execution. The following environment
variables control recording and replaying:

  * ``LIBOMPTARGET_RECORDING=<0|1>``: ``0`` disables recording (default),
    ``1`` enables recording.
  * ``LIBOMPTARGET_RR_DEVMEM_SIZE=<integer in bytes>``: amount of device
    memory to pre-allocate for storing/loading when recording/replaying
    (default: 64GB).
  * ``LIBOMPTARGET_RR_SAVE_OUTPUT=<0|1>``: ``0`` disables saving device memory
    after kernel execution (default), ``1`` enables saving device memory after
    kernel execution (used for verification with ``llvm-omp-kernel-replay``).
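
As a rough illustration of the record-and-replay variables listed above, the
following Python sketch builds the environment for a recording run of an
offloading application. The helper names ``recording_env`` and ``run_recorded``
are hypothetical and not part of LLVM; only the three ``LIBOMPTARGET_*``
variable names and their ``0``/``1`` values are taken from these notes.

```python
import os
import subprocess


def recording_env(base=None, devmem_bytes=None, save_output=False):
    """Return a copy of `base` (default: os.environ) with recording enabled.

    Hypothetical helper: it only sets the environment variables named in
    the release notes and does not talk to the OpenMP runtime itself.
    """
    env = dict(os.environ if base is None else base)
    # 1 enables recording; the runtime default is 0 (disabled).
    env["LIBOMPTARGET_RECORDING"] = "1"
    if devmem_bytes is not None:
        if devmem_bytes <= 0:
            raise ValueError("devmem_bytes must be positive")
        # Size in bytes of the device memory pre-allocated for record/replay.
        env["LIBOMPTARGET_RR_DEVMEM_SIZE"] = str(int(devmem_bytes))
    # 1 also saves device memory after kernel execution, for verification.
    env["LIBOMPTARGET_RR_SAVE_OUTPUT"] = "1" if save_output else "0"
    return env


def run_recorded(command, **kwargs):
    """Run `command` (a list of strings) with recording enabled.

    Assumption: `command` is an OpenMP offloading binary built by the user.
    """
    return subprocess.run(command, env=recording_env(**kwargs), check=True)


if __name__ == "__main__":
    # Show only the variables this sketch sets, starting from an empty base.
    env = recording_env(base={}, devmem_bytes=4 * 1024**3, save_output=True)
    for name in sorted(env):
        print(f"{name}={env[name]}")
```

Leaving ``devmem_bytes`` unset keeps the runtime's default pre-allocation
size. How the recorded data is then handed to ``llvm-omp-kernel-replay`` is
not covered by this sketch.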