| .. _omp111: |
| |
| Replaced globalized variable with X bytes of shared memory. [OMP111] |
| ==================================================================== |
| |
| This optimization occurs when a globalized variable's data is shared between |
| multiple threads, but requires a constant amount of memory that can be |
| determined at compile time. This is the case when only a single thread creates |
| the memory and is then shared between every thread. The memory can then be |
| pushed to a static buffer of shared memory on the device. This optimization |
| allows users to declare shared memory on the device without using OpenMP's |
| custom allocators. |
| |
| Globalization occurs when a pointer to a thread-local variable escapes the |
| current scope. If a single thread is known to be responsible for creating and |
| sharing the data it can instead be mapped directly to the device's shared |
| memory. Checking if only a single thread can execute an instruction requires |
| that the parent functions have internal linkage. Otherwise, an external caller |
| could invalidate this analysis but having multiple threads call that function. |
| The optimization pass will make internal copies of each function to use for this |
| reason, but it is still recommended to mark them as internal using keywords like |
| ``static`` whenever possible. |
| |
| Example |
| ------- |
| |
| This optimization should apply to any variable declared in an OpenMP target |
| region that is then shared with every thread in a parallel region. This allows |
| the user to declare shared memory without using custom allocators. A simple |
| stencil calculation shows how this can be used. |
| |
| .. code-block:: c++ |
| |
| void stencil(int M, int N, double *X, double *Y) { |
| #pragma omp target teams distribute collapse(2) \ |
| map(to : X [0:M * N]) map(tofrom : Y [0:M * N]) |
| for (int i0 = 0; i0 < M; i0 += MC) { |
| for (int j0 = 0; j0 < N; j0 += NC) { |
| double sX[MC][NC]; |
| |
| #pragma omp parallel for collapse(2) shared(sX) default(firstprivate) |
| for (int i1 = 0; i1 < MC; ++i1) |
| for (int j1 = 0; j1 < NC; ++j1) |
| sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)]; |
| |
| #pragma omp parallel for collapse(2) shared(sX) default(firstprivate) |
| for (int i1 = 1; i1 < MC - 1; ++i1) |
| for (int j1 = 1; j1 < NC - 1; ++j1) |
| Y[(i0 + i1) * N + j0 * j1] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] + |
| sX[i1][j1 + 1] + sX[i1][j1 - 1] + |
| -4.0 * sX[i1][j1]) / (dX * dX); |
| } |
| } |
| } |
| |
| .. code-block:: console |
| |
| |
| $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp |
| omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111] |
| double sX[MC][NC]; |
| ^ |
| |
| The default mapping for variables captured in an OpenMP parallel region is |
| ``shared``. This means taking a pointer to the object which will ultimately |
| result in globalization that will be mapped to shared memory when it could have |
| been placed in registers. To avoid this, make sure each variable that can be |
| copied into the region is marked ``firstprivate`` either explicitly or using the |
| OpenMP 5.1 feature ``default(firstprivate)``. |
| |
| Diagnostic Scope |
| ---------------- |
| |
| OpenMP target offloading optimization remark. |