A variant form of Fortran's primary looping construct was added to the Fortran 2008 language standard with the apparent intent of enabling more effective automatic parallel execution of code written in the standard language without the use of non-standard directives. Spelled
DO CONCURRENT, the construct takes a rectilinear iteration space specification like
FORALL and allows us to write a multidimensional loop nest construct with a single
DO CONCURRENT statement and a single terminating
END DO statement.
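As a minimal sketch of such a single-statement loop nest (the module name LOOP_NEST_DEMO and the subroutine ADD_2D are hypothetical, chosen only for illustration), a two-dimensional iteration space might be written like this:

```fortran
MODULE LOOP_NEST_DEMO
  IMPLICIT NONE
CONTAINS
  ! Hypothetical helper: stores the element-wise sum of A and B in C.
  ! Assumes A, B, and C all have the same shape.
  SUBROUTINE ADD_2D(A, B, C)
    REAL, INTENT(IN)  :: A(:,:), B(:,:)
    REAL, INTENT(OUT) :: C(:,:)
    INTEGER :: I, J
    ! One statement describes the whole rectilinear iteration space;
    ! an ordinary loop nest would need two DO/END DO pairs.
    DO CONCURRENT (I=1:SIZE(C,1), J=1:SIZE(C,2))
      C(I,J) = A(I,J) + B(I,J)
    END DO
  END SUBROUTINE
END MODULE
```

Each iteration here defines a distinct element of C and reads nothing that another iteration defines, so the loop stays well within the restrictions of the construct.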
Within the body of a
DO CONCURRENT loop the program must respect a long list of restrictions on its use of Fortran language features. Actions that obviously can't be executed in parallel or that don't allow all iterations to execute are prohibited. These include:
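Control flow statements that would prevent the loop nest from executing all of its iterations: RETURN, EXIT, and any GOTO or CYCLE that leaves the construct.
Image control statements, including ALLOCATE and DEALLOCATE of a coarray.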
In return for accepting these restrictions, a
DO CONCURRENT might compile into code that exploits the parallel features of the target machine to run the iterations of the
DO CONCURRENT construct. One needn't necessarily require OpenACC or OpenMP directives.
But it turns out that these rules, though necessary for safe parallel execution, are not sufficient. One may write conforming
DO CONCURRENT constructs that cannot be safely parallelized by a compiler; worse, one may write conforming
DO CONCURRENT constructs whose parallelizability a compiler cannot determine even in principle -- forcing a conforming compiler to assume the worst and generate sequential code.
The Fortran language standard does not actually define
DO CONCURRENT as a concurrent construct, or even as a construct that imposes sufficient requirements on the programmer to allow for parallel execution.
DO CONCURRENT is instead defined as executing the iterations of the loop in some arbitrary order (see subclause 11.1.7.4.3 paragraph 3).
A DO CONCURRENT construct cannot modify an object in one iteration and expect to be able to read it in another, or read it in one before it gets modified by another -- there's no way to synchronize inter-iteration communication with critical sections or atomics.
But a conforming DO CONCURRENT construct can modify an object in multiple iterations of the loop so long as it only reads from that object after having modified it earlier in the same iteration. (See 11.1.7.5 paragraph 4 for the details.)
For example:

    DO CONCURRENT (J=1:N)
      TMP = A(J) + B(J)
      C(J) = TMP
    END DO
    ! And TMP is undefined afterwards

The scalar variable
TMP is used in this loop in a way that conforms to the standard, as every use of
TMP follows a definition that appears earlier in the same iteration.
The idea, of course, is that a parallelizing compiler isn't required to use the same word of memory to hold the value of
TMP; for parallel execution,
TMP can be localized. This means that the loop can be internally rewritten as if it had been

    DO CONCURRENT (J=1:N)
      BLOCK
        REAL :: TMP
        TMP = A(J) + B(J)
        C(J) = TMP
      END BLOCK
    END DO

and thus any risk of data flow between the iterations is removed.
The automatic localization rules of
DO CONCURRENT that allow usage like
TMP above are not limited to simple local scalar variables. They also apply to arbitrary variables, and thus may apply in cases that a compiler cannot determine exactly due to the presence of indexing, indirection, and interprocedural data flow.
Let's see why this turns out to be a problem.

    DO CONCURRENT (J=1:N)
      T(IX(J)) = A(J) + B(J)
      C(J) = T(IY(J))
    END DO

This loop conforms to the standard language if, whenever IX(J) equals IY(J') for any distinct pair of iterations J and J', the load is reading a value stored earlier in the same iteration -- so IX(J')==IY(J'), and hence IX(J)==IX(J') too, in this example. Otherwise, a load in one iteration might depend on a store in another.
When all values of
IX(J) are distinct, and the program conforms to the restrictions of
DO CONCURRENT, a compiler can parallelize the construct easily without applying localization to
T(...). And when some values of
IX(J) are duplicates, a compiler can parallelize the loop by forwarding the stored value to the load in those iterations. But at compilation time, there's no way to distinguish these cases in general, and a conservative implementation has to assume the worst and run the loop's iterations serially. (Or compare IX(J) with IY(J) at runtime and forward the stored value conditionally, which adds overhead and quickly becomes impractical in loops with multiple loads and stores.)
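To make the conformance condition concrete, here is a rough sketch of a brute-force check over the two index vectors. The module INDEX_CHECK and the function INDICES_CONFORM are hypothetical names invented for this illustration; the quadratic scan is meant only to spell out the rule, not to suggest something a compiler or runtime would actually do.

```fortran
MODULE INDEX_CHECK
  IMPLICIT NONE
CONTAINS
  ! Hypothetical helper: .TRUE. when no load of T(IY(JP)) can observe a
  ! store made by a different iteration J without iteration JP having
  ! first stored to that same element itself.
  ! Assumes IX and IY have the same length.
  PURE LOGICAL FUNCTION INDICES_CONFORM(IX, IY)
    INTEGER, INTENT(IN) :: IX(:), IY(:)
    INTEGER :: J, JP
    INDICES_CONFORM = .TRUE.
    DO J = 1, SIZE(IX)
      DO JP = 1, SIZE(IX)
        IF (J == JP) CYCLE
        ! Iteration J stores to the element that iteration JP loads ...
        IF (IX(J) == IY(JP)) THEN
          ! ... so JP must have stored to that element before loading it.
          IF (IX(JP) /= IY(JP)) THEN
            INDICES_CONFORM = .FALSE.
            RETURN
          END IF
        END IF
      END DO
    END DO
  END FUNCTION
END MODULE
```

A check like this depends on the run-time contents of IX and IY, which is exactly the information a compiler lacks when it decides how to translate the loop.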
In this loop,

    TYPE :: T
      REAL, POINTER :: P
    END TYPE
    TYPE(T) :: T1(N), T2(N)
    DO CONCURRENT (J=1:N)
      T1(J)%P = A(J) + B(J)
      C(J) = T2(J)%P
    END DO

we have the same kind of ambiguity from the compiler's perspective. Are the targets of the pointers used for the stores all distinct from the targets of the pointers used for the loads? The programmer may know that they are so, but a compiler cannot; and there is no syntax by which one can stipulate that they are so.
Here's another case:

    MODULE M
      REAL :: T
    END MODULE
    ...
    USE M
    INTERFACE
      PURE REAL FUNCTION F(X)
        REAL, INTENT(IN) :: X
      END FUNCTION
    END INTERFACE
    DO CONCURRENT (J=1:N)
      T = A(J) + B(J)
      D(J) = F(A(J)) + T
    END DO

T is obviously meant to be localized. However, a compiler can't be sure that the pure function F doesn't read from T; if it does, there wouldn't be a practical way to convey the localized copy to it.
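The following sketch shows one way that hazard could arise. The names M, T, and F come from the example above, but the body of F and the wrapper subroutine COMPUTE are assumptions made up for illustration. The point is only that PURE forbids F from modifying T, not from reading it.

```fortran
MODULE M
  IMPLICIT NONE
  REAL :: T = 0.0
CONTAINS
  ! Hypothetical body for F: a pure function may not define the module
  ! variable T, but nothing stops it from referencing T.
  PURE REAL FUNCTION F(X)
    REAL, INTENT(IN) :: X
    F = X * T
  END FUNCTION

  ! Hypothetical wrapper around the loop from the text.
  ! Assumes A, B, and D have the same length.
  SUBROUTINE COMPUTE(A, B, D)
    REAL, INTENT(IN)  :: A(:), B(:)
    REAL, INTENT(OUT) :: D(:)
    INTEGER :: J
    DO CONCURRENT (J=1:SIZE(D))
      T = A(J) + B(J)        ! defined in every iteration
      D(J) = F(A(J)) + T     ! F reads T after this iteration's definition
    END DO
  END SUBROUTINE
END MODULE
```

Every reference to T, including the one hidden inside F, follows a definition in the same iteration. If a compiler gave each iteration a private copy of T, the call to F would still see the module variable rather than that copy.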
In summary, standard Fortran defines
DO CONCURRENT as a serial construct with a sheaf of constraints that we assume are intended to enable straightforward parallelization without all of the complexity of defining threading models or shared memory semantics, with the addition of an automatic localization rule that provides convenient temporary objects without requiring the use of nested
ASSOCIATE constructs. But the language allows ambiguous cases in which a compiler can neither prove that automatic localization is required for a given object in every iteration nor prove that it is not required in any iteration.
The Fortran 2018 standard added “locality specifiers” to the
DO CONCURRENT statement. These allow one to define some variable names as being LOCAL or SHARED, overriding the automatic localization rule so that it applies only in the remaining cases of “unspecified” locality.
LOCAL variables are those that can be defined by more than one iteration but are referenced only after having been defined earlier in the same iteration.
SHARED variables are those that, if defined in any iteration, are not defined or referenced in any other iteration.
(There is also a
LOCAL_INIT specifier that is not relevant to the problem at hand, and a
DEFAULT(NONE) specifier that requires a locality specifier be present for every variable mentioned in the
DO CONCURRENT construct.)
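As an illustration only, the earlier TMP loop might be written with explicit locality as follows. The module LOCALITY_DEMO and the subroutine ADD_WITH_LOCAL_TMP are hypothetical names, and the sketch assumes a compiler that implements the Fortran 2018 locality specifiers; not all compilers do.

```fortran
MODULE LOCALITY_DEMO
  IMPLICIT NONE
CONTAINS
  ! Hypothetical helper: C = A + B through a per-iteration temporary.
  ! Assumes A, B, and C have the same length.
  SUBROUTINE ADD_WITH_LOCAL_TMP(A, B, C)
    REAL, INTENT(IN)  :: A(:), B(:)
    REAL, INTENT(OUT) :: C(:)
    REAL :: TMP
    INTEGER :: J
    ! DEFAULT(NONE) requires a locality for every variable used in the body:
    ! TMP is private to each iteration, while A, B, and C are shared.
    DO CONCURRENT (J=1:SIZE(C)) DEFAULT(NONE) LOCAL(TMP) SHARED(A, B, C)
      TMP = A(J) + B(J)
      C(J) = TMP
    END DO
  END SUBROUTINE
END MODULE
```

Here the specifiers state outright what the automatic rule would otherwise leave the compiler to infer about TMP.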
These locality specifiers can help resolve some otherwise ambiguous cases of localization, but they're not a complete solution to the problems described above.
First, the specifiers allow explicit localization of objects (like the scalar T in MODULE M above) that are not local variables of the subprogram.
DO CONCURRENT still allows a pure procedure called from the loop to reference
T, and so explicit localization just confirms the worst-case assumptions about interprocedural data flow within an iteration that a compiler must make anyway.
Second, the specifiers allow arbitrary variables to be localized, not just scalars. One may localize a million-element array of derived type with allocatable components to be created in each iteration, for example. (It is not clear whether localized objects are finalized; probably not.)
Third, as Fortran uses context to distinguish references to pointers from (de)references to their targets, it's not clear whether
LOCAL(PTR) localizes a pointer, its target, or both.
Fourth, the specifiers can be applied only to variable names, not to any designator with subscripts or component references. One may have defined a derived type to hold a representation of a sparse matrix, using
ALLOCATABLE components to store its packed data and indexing structures, but a program cannot localize some parts of it and share the rest. (Perhaps one may wrap
ASSOCIATE constructs around the
DO CONCURRENT construct; the interaction between locality specifiers and construct entities is not clearly defined in the language.)
In the example above that defines
T(IX(J)) and reads from
T(IY(J)), the locality specifiers can't be used to share those elements of
T() that are modified at most once and localize the cases where
IX(J) is a duplicate and IY(J)==IX(J).
Last, when a loop both defines and references many shared objects, including potential references to globally accessible objects in called procedures, one may need to name all of them in a SHARED specifier.
These problems have been presented to the J3 Fortran language standard committee. Their responses in recent e-mail discussions did not include an intent to address them in future standards or corrigenda. The most effective-looking response -- which was essentially “just use
DEFAULT(SHARED) to disable all automatic localization” -- is not a viable option, since the language does not include such a specifier!
Programmers writing DO CONCURRENT loops that are safely parallelizable need an effective means to convey to compilers that those compilers do not have to assume only the weaker stipulations required by today's
DO CONCURRENT without having to write verbose and error-prone locality specifiers (when those would suffice). Specifically, an easy means is required that stipulates that localization should apply at most only to the obvious cases of local non-pointer non-allocatable scalars.
In the LLVM Fortran compiler project (a/k/a “flang”, “f18”) we considered several solutions to this problem.
Add syntax (e.g., DO CONCURRENT() DEFAULT(PARALLEL)) by which one can inform the compiler that it should localize only the obvious cases of simple local scalars. Such syntax seems unlikely to ever be standardized, so its usage would be nonportable.
DO CONCURRENT loops in the source would continue to be portable to other compilers.
DO CONCURRENT loops that perhaps should never have been possible to write in a conforming program in the first place. Actual parallel
DO CONCURRENT constructs would produce parallel code for users who would otherwise be surprised to learn about these problems in the language. But this option could lead to non-standard behavior for codes that depend, accidentally or not, on non-parallelizable implicit localization.
None of these options is without a fairly obvious disadvantage. The best option seems to be the one that assumes that users who write
DO CONCURRENT constructs are doing so with the intent to write parallel code.
As of August 2020, we observe that the GNU Fortran compiler (10.1) does not yet implement the Fortran 2018 locality clauses, but will parallelize some
DO CONCURRENT constructs without ambiguous data dependences when the automatic parallelization option is enabled.
The Intel Fortran compiler supports the new locality clauses and will parallelize some
DO CONCURRENT constructs when the automatic parallelization option is enabled. When OpenMP is enabled, ifort reports that all
DO CONCURRENT constructs are parallelized, but they seem to execute in a serial fashion when data flow hazards are present.