 .. _omp111:

 Replaced globalized variable with X bytes of shared memory. [OMP111]
 ====================================================================

 This optimization occurs when a globalized variable's data is shared between
 multiple threads but requires a constant amount of memory that can be
 determined at compile time. This is the case when only a single thread creates
 the memory, which is then shared with every thread. The memory can then be
 placed in a static buffer of shared memory on the device. This optimization
 allows users to declare shared memory on the device without using OpenMP's
 custom allocators.

 Globalization occurs when a pointer to a thread-local variable escapes the
 current scope. If a single thread is known to be responsible for creating and
 sharing the data, it can instead be mapped directly to the device's shared
 memory. Checking whether only a single thread can execute an instruction
 requires that the parent functions have internal linkage. Otherwise, an
 external caller could invalidate this analysis by having multiple threads call
 that function. For this reason, the optimization pass makes internal copies of
 each function, but it is still recommended to mark functions as internal using
 keywords like ``static`` whenever possible.
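
 As a minimal sketch of this recommendation, the hypothetical helper
 ``fill_tile`` below is marked ``static`` so that it has internal linkage and
 all of its callers are visible within the translation unit. The names
 ``fill_tile``, ``sum_tiles``, and ``TILE`` are illustrative assumptions, and
 whether the remark is actually emitted depends on the compiler version and
 target.

 .. code-block:: c++

   constexpr int TILE = 64; // hypothetical tile size, known at compile time

   // Internal linkage: no caller outside this translation unit can invoke
   // fill_tile from a context the compiler has not analyzed.
   static void fill_tile(double *tile, const double *x, int base) {
   #pragma omp parallel for
     for (int i = 0; i < TILE; ++i)
       tile[i] = x[base + i];
   }

   // Assumes n is a multiple of TILE.
   double sum_tiles(const double *x, int n) {
     double total = 0.0;
   #pragma omp target teams distribute map(to : x[0:n]) \
       map(tofrom : total) reduction(+ : total)
     for (int i0 = 0; i0 < n; i0 += TILE) {
       // The address of tile escapes into fill_tile and its parallel region,
       // so tile is a candidate for globalization.
       double tile[TILE];
       fill_tile(tile, x, i0);

       double partial = 0.0;
       for (int i = 0; i < TILE; ++i)
         partial += tile[i];
       total += partial;
     }
     return total;
   }

 Without ``static``, ``fill_tile`` could be called from another translation
 unit, which is the situation described above.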

 Example
 -------

 This optimization should apply to any variable declared in an OpenMP target
 region that is then shared with every thread in a parallel region. This allows
 the user to declare shared memory without using custom allocators. A simple
 stencil calculation shows how this can be used. ``MC``, ``NC``, and ``dX`` are
 assumed to be compile-time constants defined elsewhere.

 .. code-block:: c++

   void stencil(int M, int N, double *X, double *Y) {
   #pragma omp target teams distribute collapse(2) \
     map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
     for (int i0 = 0; i0 < M; i0 += MC) {
       for (int j0 = 0; j0 < N; j0 += NC) {
         double sX[MC][NC];

   #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
         for (int i1 = 0; i1 < MC; ++i1)
           for (int j1 = 0; j1 < NC; ++j1)
             sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];

   #pragma omp parallel for collapse(2) shared(sX) default(firstprivate)
         for (int i1 = 1; i1 < MC - 1; ++i1)
           for (int j1 = 1; j1 < NC - 1; ++j1)
             Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
                                           sX[i1][j1 + 1] + sX[i1][j1 - 1] +
                                           -4.0 * sX[i1][j1]) / (dX * dX);
       }
     }
   }

 .. code-block:: console


   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
   omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
       double sX[MC][NC];
              ^
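
 The byte count in the remark is the size of the replaced array. The example
 does not state the tile dimensions, so the sketch below assumes
 ``MC = NC = 32`` and an 8-byte ``double``, which is one combination consistent
 with the 8192 bytes reported. ``tileBytes`` is a hypothetical helper.

 .. code-block:: c++

   #include <cstddef>

   constexpr int MC = 32; // assumed tile height, not given in the example
   constexpr int NC = 32; // assumed tile width, not given in the example

   // Size of the team-local buffer that the remark refers to.
   constexpr std::size_t tileBytes() { return sizeof(double[MC][NC]); }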

 The default data-sharing attribute for variables captured in an OpenMP parallel
 region is ``shared``. Sharing a variable means taking a pointer to it, which
 ultimately results in globalization: the variable is mapped to shared memory
 when it could have been placed in registers. To avoid this, make sure each
 variable that can be copied into the region is marked ``firstprivate``, either
 explicitly or using the OpenMP 5.1 feature ``default(firstprivate)``.
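
 A minimal sketch of the explicit form, using hypothetical names
 (``scale_rows``, ``ROW``, ``factor``): only the buffer ``row``, which every
 thread must see, is ``shared``, while the scalars and the pointer are copied
 into each thread with ``firstprivate``.

 .. code-block:: c++

   constexpr int ROW = 128; // hypothetical row length

   // Scales each row by scale / 2 and reverses it. Assumes n is a multiple
   // of ROW.
   void scale_rows(double *data, int n, double scale) {
   #pragma omp target teams distribute map(tofrom : data[0:n])
     for (int r = 0; r < n; r += ROW) {
       double factor = scale * 0.5; // scalar: copy it instead of sharing it
       double row[ROW];             // buffer: must be visible to all threads

   #pragma omp parallel for shared(row) firstprivate(factor, r, data)
       for (int i = 0; i < ROW; ++i)
         row[i] = data[r + i] * factor;

       // Each thread reads an element that another thread may have written,
       // which is why row has to be shared.
   #pragma omp parallel for shared(row) firstprivate(r, data)
       for (int i = 0; i < ROW; ++i)
         data[r + i] = row[ROW - 1 - i];
     }
   }

 If ``factor`` were left with the default ``shared`` attribute, it would be
 captured by pointer as described above.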

 Diagnostic Scope
 ----------------

 OpenMP target offloading optimization remark.