==========================
OpenMP-Aware Optimizations
==========================

LLVM, since `version 11 <https://releases.llvm.org/download.html#11.0.0>`_ (12
Oct 2020), supports an :ref:`OpenMP-aware optimization pass <OpenMPOpt>`. This
optimization pass will attempt to optimize the module with OpenMP-specific
domain knowledge. This pass is enabled by default at high optimization levels
(O2 / O3) if compiling with OpenMP support enabled.
|  |  | 
.. _OpenMPOpt:

OpenMPOpt
=========

.. contents::
   :local:
   :depth: 1
|  |  | 
OpenMPOpt contains several OpenMP-aware optimizations. This pass is run early on
the entire Module and later on the entire call graph. Most optimizations done by
OpenMPOpt support remarks, which can be enabled by compiling with the following
flags:

.. code-block:: console

   $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt
|  |  | 
OpenMP Runtime Call Deduplication
---------------------------------

The OpenMP runtime library contains several functions used to implement features
of the OpenMP standard. Several of these runtime calls return a result that is
constant within a parallel region. Replacing invariant code with a single
reference is a common optimization, but here the compiler only sees an opaque
call into the runtime library. To get around this, OpenMPOpt maintains a list of
OpenMP runtime functions whose results are invariant and manually deduplicates
calls to them.
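
The effect of this deduplication can be sketched in plain C++. Here
``runtime_query`` is a hypothetical stand-in for an opaque runtime call whose
result is invariant within a parallel region (the real calls are runtime entry
points such as ``__kmpc_global_thread_num``); the counter only exists to make
the number of calls observable:

.. code-block:: c++

   #include <cassert>

   static int call_count = 0;

   // Hypothetical stand-in for an opaque OpenMP runtime call whose
   // result is invariant within a parallel region.
   static int runtime_query() { ++call_count; return 4; }

   // Before deduplication: the compiler sees two opaque calls.
   static int before() { return runtime_query() + runtime_query(); }

   // After deduplication: OpenMPOpt knows the call is invariant, so a
   // single result is computed and reused.
   static int after() {
     int n = runtime_query();
     return n + n;
   }

   int main() {
     call_count = 0;
     assert(before() == 8);
     assert(call_count == 2);  // two runtime calls before the pass

     call_count = 0;
     assert(after() == 8);
     assert(call_count == 1);  // one runtime call after deduplication
   }

Both versions compute the same value; the transformation only removes the
redundant call, which matters because each runtime call is otherwise opaque to
the optimizer.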
|  |  | 
Globalization
-------------

The OpenMP standard requires that data can be shared between different threads.
This requirement poses a unique challenge when offloading to GPU accelerators.
Data cannot be shared between the threads in a GPU by default; to share it, the
data must be placed in global or shared memory. This must be done every time a
variable may potentially be shared in order to produce a correct OpenMP
program. Unfortunately, this has significant performance implications and is
not needed in the majority of cases. For example, when Clang generates code for
the offloading region below, it will see that the variable ``x`` escapes and is
potentially shared. This requires globalizing the variable, which means it
cannot reside in registers on the device.
|  |  | 
.. code-block:: c++

   void use(void *) { }

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }
|  |  | 
In many cases, this transformation is not actually necessary but still carries
a significant performance penalty. Because of this, OpenMPOpt can perform an
inter-procedural optimization: it scans each known use of the globalized
variable and determines whether it is potentially captured and shared by
another thread. If it is not actually captured, it can safely be moved back to
fast register memory.
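
The distinction the capture analysis draws can be illustrated with a small
hypothetical sketch. Here ``escaped`` models a location another thread could
load the pointer from; the function names are placeholders, not OpenMPOpt API:

.. code-block:: c++

   #include <cassert>

   // Models memory another thread could read a pointer from.
   static int *escaped = nullptr;

   // Only reads and writes through the pointer without storing it, so
   // the pointee is not captured and can stay in fast register memory.
   static void no_capture(int *p) { *p += 1; }

   // Stores the pointer where another thread could load it, so the
   // pointee is captured and must remain globalized.
   static void captures(int *p) { escaped = p; }

   int main() {
     int x = 41;
     no_capture(&x);    // x need not be globalized
     assert(x == 42 && escaped == nullptr);

     int y = 7;
     captures(&y);      // y is potentially shared; must stay globalized
     assert(escaped == &y);
   }

In the earlier example, if the body of ``use`` is visible and does not store
``&x`` anywhere, the analysis can prove ``x`` is never shared.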
|  |  | 
Another case is memory that is intentionally shared between the threads, but
shared from one thread to all the others. Such variables can be moved to shared
memory at compile time without needing to go through the runtime library. This
allows users to confidently declare shared memory on the device without needing
to use custom OpenMP allocators or rely on the runtime.

.. code-block:: c++

   static void share(void *);

   static void foo() {
     int x[64];
   #pragma omp parallel
     share(x);
   }

   int main() {
   #pragma omp target
     foo();
   }
|  |  | 
These optimizations can have very large performance implications. Both rely
heavily on inter-procedural analysis. Because of this, offloading applications
should ideally be contained in a single translation unit, and functions should
not be externally visible unless needed. If remarks are enabled, OpenMPOpt will
inform the user of any globalization calls that remain; these should be treated
as defects in the program.
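
Putting this together, a possible invocation for such a single-translation-unit
application might look like the following, where the source file name and the
offload target triple are placeholders to adapt for your system:

.. code-block:: console

   $ clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
       -Rpass=openmp-opt -Rpass-missed=openmp-opt app.cpp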
|  |  | 
Resources
=========

- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
- 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
- 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30