==========================
OpenMP-Aware Optimizations
==========================

LLVM, since `version 11 <https://releases.llvm.org/download.html#11.0.0>`_ (12
Oct 2020), supports an :ref:`OpenMP-aware optimization pass <OpenMPOpt>`. This
optimization pass will attempt to optimize the module with OpenMP-specific
domain knowledge. This pass is enabled by default at high optimization levels
(O2 / O3) if compiling with OpenMP support enabled.
|  |  | 
.. _OpenMPOpt:

OpenMPOpt
=========

.. contents::
   :local:
   :depth: 1
|  |  | 
OpenMPOpt contains several OpenMP-aware optimizations. This pass is run early on
the entire Module and later on the entire call graph. Most optimizations done by
OpenMPOpt support remarks, which can be enabled by compiling with the following
flags:

.. code-block:: console

   $ clang -Rpass=openmp-opt -Rpass-missed=openmp-opt -Rpass-analysis=openmp-opt
|  |  | 
OpenMP Runtime Call Deduplication
---------------------------------

The OpenMP runtime library contains several functions used to implement features
of the OpenMP standard. Several of these runtime calls return a result that is
constant within a parallel region. Replacing invariant code with a single
reference is a common optimization, but here the compiler only sees an opaque
call into the runtime library. To get around this, OpenMPOpt maintains a list of
OpenMP runtime functions whose results are invariant and manually deduplicates
calls to them.
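
The effect of this deduplication can be sketched in plain C++. Here
``runtime_query`` is a hypothetical stand-in for an opaque runtime call whose
result is invariant within a parallel region (the real calls are runtime entry
points such as ``__kmpc_global_thread_num``); the counter only exists to make
the number of calls observable:

.. code-block:: c++

   #include <cassert>

   static int call_count = 0;

   // Hypothetical stand-in for an opaque OpenMP runtime call whose
   // result is invariant within a parallel region.
   static int runtime_query() { ++call_count; return 4; }

   // Before deduplication: the compiler sees two opaque calls.
   static int before() { return runtime_query() + runtime_query(); }

   // After deduplication: OpenMPOpt knows the call is invariant, so a
   // single result is computed and reused.
   static int after() {
     int n = runtime_query();
     return n + n;
   }

   int main() {
     call_count = 0;
     assert(before() == 8);
     assert(call_count == 2);  // two runtime calls before the pass

     call_count = 0;
     assert(after() == 8);
     assert(call_count == 1);  // one runtime call after deduplication
   }

Both versions compute the same value; the transformation only removes the
redundant call, which matters because each runtime call is otherwise opaque to
the optimizer.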
|  |  | 
Globalization
-------------

The OpenMP standard requires that data can be shared between different threads.
This requirement poses a unique challenge when offloading to GPU accelerators.
Data cannot be shared between the threads in a GPU by default; to share it, the
data must be placed in global or shared memory. This must be done every time a
variable may potentially be shared in order to produce a correct OpenMP
program. Unfortunately, this has significant performance implications and is
not needed in the majority of cases. For example, when Clang generates code for
the offloading region below, it will see that the variable ``x`` escapes and is
potentially shared. This requires globalizing the variable, which means it
cannot reside in registers on the device.
|  |  | 
.. code-block:: c++

   void use(void *) { }

   void foo() {
     int x;
     use(&x);
   }

   int main() {
   #pragma omp target parallel
     foo();
   }
|  |  | 
In many cases, this transformation is not actually necessary but still carries
a significant performance penalty. Because of this, OpenMPOpt can perform an
inter-procedural optimization: it scans each known use of the globalized
variable and determines whether it is potentially captured and shared by
another thread. If it is not actually captured, it can safely be moved back to
fast register memory.
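
The distinction the capture analysis draws can be illustrated with a small
hypothetical sketch. Here ``escaped`` models a location another thread could
load the pointer from; the function names are placeholders, not OpenMPOpt API:

.. code-block:: c++

   #include <cassert>

   // Models memory another thread could read a pointer from.
   static int *escaped = nullptr;

   // Only reads and writes through the pointer without storing it, so
   // the pointee is not captured and can stay in fast register memory.
   static void no_capture(int *p) { *p += 1; }

   // Stores the pointer where another thread could load it, so the
   // pointee is captured and must remain globalized.
   static void captures(int *p) { escaped = p; }

   int main() {
     int x = 41;
     no_capture(&x);    // x need not be globalized
     assert(x == 42 && escaped == nullptr);

     int y = 7;
     captures(&y);      // y is potentially shared; must stay globalized
     assert(escaped == &y);
   }

In the earlier example, if the body of ``use`` is visible and does not store
``&x`` anywhere, the analysis can prove ``x`` is never shared.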
|  |  | 
Another case is memory that is intentionally shared between the threads, but
shared from one thread to all the others. Such variables can be moved to shared
memory at compile time without needing to go through the runtime library. This
allows users to confidently declare shared memory on the device without needing
to use custom OpenMP allocators or rely on the runtime.

.. code-block:: c++

   static void share(void *);

   static void foo() {
     int x[64];
   #pragma omp parallel
     share(x);
   }

   int main() {
   #pragma omp target
     foo();
   }
|  |  | 
These optimizations can have very large performance implications. Both rely
heavily on inter-procedural analysis. Because of this, offloading applications
should ideally be contained in a single translation unit, and functions should
not be externally visible unless needed. If remarks are enabled, OpenMPOpt will
inform the user of any globalization calls that remain; these should be treated
as defects in the program.
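
Putting this together, a possible invocation for such a single-translation-unit
application might look like the following, where the source file name and the
offload target triple are placeholders to adapt for your system:

.. code-block:: console

   $ clang++ -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
       -Rpass=openmp-opt -Rpass-missed=openmp-opt app.cpp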
|  |  | 
Resources
=========

- 2021 OpenMP Webinar: "A Compiler's View of OpenMP" https://youtu.be/eIMpgez61r4
- 2020 LLVM Developers’ Meeting: "(OpenMP) Parallelism-Aware Optimizations" https://youtu.be/gtxWkeLCxmU
- 2019 EuroLLVM Developers’ Meeting: "Compiler Optimizations for (OpenMP) Target Offloading to GPUs" https://youtu.be/3AbS82C3X30