blob: bd1008af86f6b4ea61c47e933eab9a89cfa5f831 [file] [log] [blame] [view]
peter klauslerc2bccd62020-08-25 09:43:16 -07001<!--===- docs/DoConcurrent.md
2
3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4 See https://llvm.org/LICENSE.txt for license information.
5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6
7-->
8
9# `DO CONCURRENT` isn't necessarily concurrent
10
cor3ntinb7ff0322023-09-25 14:02:39 +020011```{contents}
12---
13local:
14---
peter klauslerc2bccd62020-08-25 09:43:16 -070015```
16
17A variant form of Fortran's primary looping construct was
18added to the Fortran 2008 language standard with the apparent
19intent of enabling more effective automatic parallel execution of code
20written in the standard language without the use of
21non-standard directives.
22Spelled `DO CONCURRENT`, the construct takes a rectilinear iteration
23space specification like `FORALL` and allows us to write
24a multidimensional loop nest construct with a single `DO CONCURRENT`
25statement and a single terminating `END DO` statement.
26
27Within the body of a `DO CONCURRENT` loop the program must respect
28a long list of restrictions on its use of Fortran language features.
29Actions that obviously can't be executed in parallel or that
30don't allow all iterations to execute are prohibited.
31These include:
32* Control flow statements that would prevent the loop nest from
33 executing all its iterations: `RETURN`, `EXIT`, and any
34 `GOTO` or `CYCLE` that leaves the construct.
35* Image control statements: `STOP`, `SYNC`, `LOCK`/`UNLOCK`, `EVENT`,
36 and `ALLOCATE`/`DEALLOCATE` of a coarray.
37* Calling a procedure that is not `PURE`.
38* Deallocation of any polymorphic entity, as that could cause
39 an impure FINAL subroutine to be called.
40* Messing with the IEEE floating-point control and status flags.
41* Accepting some restrictions on data flow between iterations
42 (i.e., none) and on liveness of modified objects after the loop.
43 (The details are spelled out later.)
44
45In return for accepting these restrictions, a `DO CONCURRENT` might
46compile into code that exploits the parallel features of the target
47machine to run the iterations of the `DO CONCURRENT` construct.
48One needn't necessarily require OpenACC or OpenMP directives.
49
50But it turns out that these rules, though *necessary* for safe parallel
51execution, are not *sufficient*.
52One may write conforming `DO CONCURRENT` constructs that cannot
53be safely parallelized by a compiler; worse, one may write conforming
54`DO CONCURRENT` constructs whose parallelizability a compiler cannot
55determine even in principle -- forcing a conforming compiler to
56assume the worst and generate sequential code.
57
58## Localization
59
60The Fortran language standard does not actually define `DO CONCURRENT` as a
61concurrent construct, or even as a construct that imposes sufficient
62requirements on the programmer to allow for parallel execution.
63`DO CONCURRENT` is instead defined as executing the iterations
64of the loop in some arbitrary order (see subclause 11.1.7.4.3 paragraph 3).
65
66A `DO CONCURRENT` construct cannot modify an object in one iteration
67and expect to be able to read it in another, or read it in one before it gets
68modified by another -- there's no way to synchronize inter-iteration
69communication with critical sections or atomics.
70
71But a conforming `DO CONCURRENT` construct *can* modify an object in
72multiple iterations of the loop so long as its only reads from that
73object *after* having modified it earler in the *same* iteration.
74(See 11.1.7.5 paragraph 4 for the details.)
75
76For example:
77
78```
79 DO CONCURRENT (J=1:N)
80 TMP = A(J) + B(J)
81 C(J) = TMP
82 END DO
83 ! And TMP is undefined afterwards
84```
85
86The scalar variable `TMP` is used in this loop in a way that conforms
87to the standard, as every use of `TMP` follows a definition that appears
88earlier in the same iteration.
89
90The idea, of course, is that a parallelizing compiler isn't required to
91use the same word of memory to hold the value of `TMP`;
92for parallel execution, `TMP` can be _localized_.
93This means that the loop can be internally rewritten as if it had been
94```
95 DO CONCURRENT (J=1:N)
96 BLOCK
97 REAL :: TMP
98 TMP = A(J) + B(J)
99 C(J) = TMP
100 END BLOCK
101 END DO
102```
103and thus any risk of data flow between the iterations is removed.
104
105## The identification problem
106
107The automatic localization rules of `DO CONCURRENT` that allow
108usage like `TMP` above are not limited to simple local scalar
109variables.
110They also apply to arbitrary variables, and thus may apply
111in cases that a compiler cannot determine exactly due to
112the presence of indexing, indirection, and interprocedural data flow.
113
114Let's see why this turns out to be a problem.
115
116Examples:
117```
118 DO CONCURRENT (J=1:N)
119 T(IX(J)) = A(J) + B(J)
120 C(J) = T(IY(J))
121 END DO
122```
123This loop conforms to the standard language if,
124whenever `IX(J)` equals `IY(J')` for any distinct pair of iterations
125`J` and `J'`,
126then the load must be reading a value stored earlier in the
127same iteration -- so `IX(J')==IY(J')`, and hence `IX(J)==IX(J')` too,
128in this example.
129Otherwise, a load in one iteration might depend on a store
130in another.
131
132When all values of `IX(J)` are distinct, and the program conforms
133to the restrictions of `DO CONCURRENT`, a compiler can parallelize
134the construct easily without applying localization to `T(...)`.
135And when some values of `IX(J)` are duplicates, a compiler can parallelize
136the loop by forwarding the stored value to the load in those
137iterations.
138But at compilation time, there's _no way to distinguish_ these
139cases in general, and a conservative implementation has to assume
140the worst and run the loop's iterations serially.
141(Or compare `IX(J)` with `IY(J)` at runtime and forward the
142stored value conditionally, which adds overhead and becomes
143quickly impractical in loops with multiple loads and stores.)
144
145In
146```
147 TYPE :: T
148 REAL, POINTER :: P
149 END TYPE
150 TYPE(T) :: T1(N), T2(N)
151 DO CONCURRENT (J=1:N)
152 T1(J)%P = A(J) + B(J)
153 C(J) = T2(J)%P
154 END DO
155```
156we have the same kind of ambiguity from the compiler's perspective.
157Are the targets of the pointers used for the stores all distinct
158from the targets of the pointers used for the loads?
159The programmer may know that they are so, but a compiler
160cannot; and there is no syntax by which one can stipulate
161that they are so.
162
163## The global variable localization problem
164
165Here's another case:
166```
167 MODULE M
168 REAL :: T
169 END MODULE
170 ...
171 USE M
172 INTERFACE
173 PURE REAL FUNCTION F(X)
174 REAL, INTENT(IN) :: X
175 END FUNCTION
176 END INTERFACE
177 DO CONCURRENT (J=1:N)
178 T = A(J) + B(J)
179 D(J) = F(A(J)) + T
180 END DO
181```
182The variable `T` is obviously meant to be localized.
183However, a compiler can't be sure that the pure function `F`
184doesn't read from `T`; if it does, there wouldn't be a
185practical way to convey the localized copy to it.
186
187In summary, standard Fortran defines `DO CONCURRENT` as a serial
188construct with a sheaf of constraints that we assume are intended
189to enable straightforward parallelization without
190all of the complexity of defining threading models or shared memory semantics,
191with the addition of an automatic localization rule that provides
192convenient temporaries objects without requiring the use of nested
193`BLOCK` or `ASSOCIATE` constructs.
194But the language allows ambiguous cases in which a compiler can neither
1951. prove that automatic localization *is* required for a given
196 object in every iteration, nor
1971. prove that automatic localization *isn't* required in any iteration.
198
199## Locality specifiers
200
201The Fortran 2018 standard added "locality specifiers" to the
202`DO CONCURRENT` statement.
203These allow one to define some variable names as being `LOCAL` or
204`SHARED`, overriding the automatic localization rule so that it
205applies only in the remaining cases of "unspecified" locality.
206
207`LOCAL` variables are those that can be defined by more than one
208iteration but are referenced only after having been defined
209earlier in the same iteration.
210`SHARED` variables are those that, if defined in
211any iteration, are not defined or referenced in any other iteration.
212
213(There is also a `LOCAL_INIT` specifier that is not relevant to the
214problem at hand, and a `DEFAULT(NONE)` specifier that requires a
215locality specifier be present for every variable mentioned in the
216`DO CONCURRENT` construct.)
217
218These locality specifiers can help resolve some otherwise ambiguous
219cases of localization, but they're not a complete solution to the problems
220described above.
221
222First, the specifiers allow explicit localization of objects
223(like the scalar `T` in `MODULE M` above) that are not local variables
224of the subprogram.
225`DO CONCURRENT` still allows a pure procedure called from the loop
226to reference `T`, and so explicit localization just confirms the
227worst-case assumptions about interprocedural data flow
228within an iteration that a compiler must make anyway.
229
230Second, the specifiers allow arbitary variables to be localized,
231not just scalars.
232One may localize a million-element array of derived type
233with allocatable components to be created in each iteration,
234for example.
235(It is not clear whether localized objects are finalized;
236probably not.)
237
238Third, as Fortran uses context to distinguish references to
239pointers from (de)references to their targets, it's not clear
240whether `LOCAL(PTR)` localizes a pointer, its target, or both.
241
242Fourth, the specifiers can be applied only to variable _names_,
243not to any designator with subscripts or component references.
244One may have defined a derived type to hold a representation
245of a sparse matrix, using `ALLOCATABLE` components to store its
246packed data and indexing structures, but a program cannot localize
247some parts of it and share the rest.
248(Perhaps one may wrap `ASSOCIATE` constructs around the
249`DO CONCURRENT` construct;
250the interaction between locality specifiers and construct entities is
251not clearly defined in the language.)
252
253In the example above that defines `T(IX(J))` and reads from `T(IY(J))`,
254the locality specifiers can't be used to share those elements of `T()`
255that are modified at most once and localize the cases where
256`IX(J)` is a duplicate and `IY(J)==IX(J)`.
257
258Last, when a loop both defines and references many shared objects,
259including potential references to globally accessible object
260in called procedures, one may need to name all of them in a `SHARED`
261specifier.
262
263## What to do now
264
265These problems have been presented to the J3 Fortran language
266standard committee.
267Their responses in
268recent [e-mail discussions](https://mailman.j3-fortran.org/pipermail/j3/2020-July/thread.html)
269did not include an intent to address them in future standards or corrigenda.
270The most effective-looking response -- which was essentially "just use
271`DEFAULT(SHARED)` to disable all automatic localization" -- is not an
272viable option, since the language does not include such a specifier!
273
274Programmers writing `DO CONCURRENT` loops that are safely parallelizable
275need an effective means to convey to compilers that those compilers
276do not have to assume only the weaker stipulations required by
277today's `DO CONCURRENT` without having to write verbose and
278error-prone locality specifiers (when those would suffice).
279Specifically, an easy means is required that stipulates that localization
280should apply at most only to the obvious cases of local non-pointer
281non-allocatable scalars.
282
283In the LLVM Fortran compiler project (a/k/a "flang", "f18") we considered
284several solutions to this problem.
2851. Add syntax (e.g., `DO PARALLEL` or `DO CONCURRENT() DEFAULT(PARALLEL)`)
286 by which one can inform the compiler that it should localize only
287 the obvious cases of simple local scalars.
288 Such syntax seems unlikely to ever be standardized, so its usage
289 would be nonportable.
2901. Add a command-line option &/or a source directive to stipulate
291 the stronger guarantees. Obvious non-parallelizable usage in the construct
292 would elicit a stern warning. The `DO CONCURRENT` loops in the source
293 would continue to be portable to other compilers.
2941. Assume that these stronger conditions hold by default, and add a command-line
295 option &/or a source directive to "opt out" back to the weaker
296 requirements of the standard language
297 in the event that the program contains one of those inherently
298 non-parallelizable `DO CONCURRENT` loops that perhaps should never have
299 been possible to write in a conforming program in the first place.
300 Actual parallel `DO CONCURRENT` constructs would produce parallel
301 code for users who would otherwise be surprised to learn about these
302 problems in the language.
303 But this option could lead to non-standard behavior for codes that depend,
304 accidentally or not, on non-parallelizable implicit localization.
3051. Accept the standard as it exists, do the best job of automatic
306 parallelization that can be done, and refer dissatisfied users to J3.
307 This would be avoiding the problem.
308
309None of these options is without a fairly obvious disadvantage.
310The best option seems to be the one that assumes that users who write
311`DO CONCURRENT` constructs are doing so with the intent to write parallel code.
312
313## Other precedents
314
315As of August 2020, we observe that the GNU Fortran compiler (10.1) does not
316yet implement the Fortran 2018 locality clauses, but will parallelize some
317`DO CONCURRENT` constructs without ambiguous data dependences when the automatic
318parallelization option is enabled.
319
320The Intel Fortran compiler supports the new locality clauses and will parallelize
321some `DO CONCURRENT` constructs when automatic parallelization option is enabled.
322When OpenMP is enabled, ifort reports that all `DO CONCURRENT` constructs are
323parallelized, but they seem to execute in a serial fashion when data flow
324hazards are present.