Move dynamic stack allocations outside do concurrent
#227
Conversation
Dynamic stack allocations are those allocations where memory is allocated on the stack (i.e. inside a function or procedure) with a size not known at compile time. Even though this is legal at both compile time and run time for the CPU, on the GPU this is not supported. Therefore, this PR removes an instance of such allocations (this is the only instance I came across so far while compiling for the GPU).
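For concreteness, a minimal sketch of such an allocation (hypothetical names, not the code touched by this PR): an automatic array whose extent depends on a run-time value occupies stack storage whose size is unknown at compile time.

```fortran
subroutine dynamic_stack_sketch(n)
  implicit none
  integer, intent(in) :: n
  real :: scratch(n)   ! automatic array: stack storage sized at run time
  scratch = 0.
end subroutine dynamic_stack_sketch
```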
Hi @ergawy, thanks for the PR. @ergawy said:

> Even though this is legal at both compile time and run time for the CPU, on the GPU this is not supported.

When you say "on the GPU this is not supported", can you please clarify exactly what you mean? AFAIK this code is permitted by the Fortran standard, so are you actually stating that the AMD fork of flang (I'm guessing that's the compiler we're talking about here?) lacks support for this language feature when offloading to GPU? Also, can you please clarify whether the compiler involved supports the F18 `local` locality specifier?

CC: @rouson
```diff
@@ -934,9 +937,6 @@
    iteration: &
      block
        real a(maxval(self%nodes_), input_layer:output_layer) ! Activations
        real z(size(b,1),size(b,2)), delta(size(b,1),size(b,2))
```
I'm concerned the proposed transformation might lead to incorrect behavior.
All three of these variables need to be semantically "local" to each concurrent iteration, because each concurrent iteration independently references and defines (reads and writes) these temporary variables as part of their execution.
This version of the code is the one for compilers lacking the F18 `local` locality specifier we'd prefer to use, and we worked around that missing feature by instead deliberately declaring/allocating these local variables inside the body of the `do concurrent` iteration. By declaring them within the scope of the DO CONCURRENT iteration block, this unambiguously ensures the storage for these three variables is private to each iteration (the semantics this code requires for correctness).
F23 11.1.7.5 (emphasis added):

> A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it *is a different entity in each iteration, similar to LOCAL locality*.
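For illustration, a minimal sketch of the two forms; the names (`a`, `b`, `locality_sketch`) are simplified placeholders rather than the actual variables in this PR:

```fortran
program locality_sketch
  implicit none
  integer :: i
  real :: a(8)   ! host-scope variable given LOCAL locality below

  ! Preferred form (F2018 locality specifiers), when the compiler supports it:
  do concurrent (i = 1:4) local(a)
    a = real(i)              ! each iteration defines its own private copy
  end do

  ! Workaround for compilers lacking LOCAL: declare the temporaries inside a
  ! block in the loop body, making them construct entities of each iteration.
  do concurrent (i = 1:4)
    iteration: block
      real :: b(8)           ! private to each iteration per the rule quoted above
      b = real(i)
    end block iteration
  end do
end program locality_sketch
```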
By moving the declaration outside `do concurrent` as proposed here, these variables effectively gain "unspecified locality" and instead become subject to the following semantic rules in F23 11.1.7.5 (inherited from F08 8.1.6.7):
> If a variable has unspecified locality,
> • if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates;
> [...]
>
> NOTE 3
> The restrictions on the statements in a DO CONCURRENT construct are designed to ensure there are no data dependencies between iterations of the loop.
This wording is admittedly less definitive, but I read this to imply that these variables still need to behave as if each concurrent iteration has its own private copy, nearly identical to the LOCAL specifier (ignoring differences in variable contents at loop entry/exit that are irrelevant in this code).
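For illustration, a minimal sketch (hypothetical names, not the actual PR code) of the hoisted pattern and the behavior these rules still require:

```fortran
program unspecified_locality_sketch
  implicit none
  integer :: i
  real :: results(4)
  real :: tmp(8)             ! declared outside the loop: unspecified locality
  do concurrent (i = 1:4)
    tmp = real(i)            ! fully defined in this iteration before any
                             ! reference, so this conforms to the rules above
    results(i) = sum(tmp)    ! correct only if iterations never observe each
                             ! other's writes to tmp (i.e., tmp acts private)
  end do
  print *, results           ! tmp itself is undefined here: it was defined
                             ! by more than one iteration
end program unspecified_locality_sketch
```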
TL;DR: I don't understand why this transformation "helps" the compiler in question. In particular, if this transformation causes the compiler to emit code where concurrent iterations all share the same copy of these three variables, that would break the correctness of this code. This procedure semantically requires each concurrent iteration to have a private/local copy of these three variables.
@bonachea Thanks for the detailed reply and sorry for the late response. I will reply this week, just busy with some other stuff atm.
Dynamic stack allocations

To explain dynamic stack allocations, we can ignore `do concurrent` and GPU offloading for a moment and look at this example:

```fortran
program test_dyanmic_stack_alloc_1
  implicit none
  integer :: n
  block
    real arr(n)
    arr(55) = 42
  end block
end program
```

Note that the size of `arr` is not known at compile time (it depends on the run-time value of `n`), so `arr` is a dynamic stack allocation.
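For contrast, a sketch (not from the PR) where the array size is a compile-time constant, so the allocation is an ordinary static stack allocation:

```fortran
program test_static_stack_alloc
  implicit none
  block
    real arr(100)   ! size known at compile time: no dynamic alloca needed
    arr(55) = 42.
  end block
end program
```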
Dynamic stack allocations (an OpenMP example)

Now, if we still ignore `do concurrent` for a moment and use an OpenMP `target` region directly (the `do concurrent` lines are left commented out):

```fortran
subroutine default_real_learn
  type(real), allocatable :: inputs(:)
  !$omp target
  !iterate_through_batch: &
  !do concurrent (pair = 1:10)
  iteration: &
  block
    real a(maxval(ubound(inputs)))
    a(1:30) = inputs
  end block iteration
  !end do iterate_through_batch
  !$omp end target
end subroutine default_real_learn
```

Here, we have a dynamic stack allocation (the array `a`) inside an OpenMP `target` region. If you compile this for the GPU with:

```
/path/to/bin/flang -fopenmp --offload-arch=gfx90a /tmp/test.f90 -o /tmp/test.o
```

you hit the following error:

```
ld.lld: error: <unknown>:0:0: in function __omp_offloading_fc00_600202_default_real_learn__l4 void (ptr, ptr): unsupported dynamic alloca
```

Note that I produced the above error with the latest upstream flang at the time of writing (hash …).

I hope that answers at least some of your questions @bonachea and helps us move the discussion forward.
@ergawy although it doesn't matter for this discussion, the above example is invalid code because [...]
Hi @ergawy - Thanks for your response. I already understood (or suspected) everything in your response. However, you have not addressed the key question regarding your PR'd code change.

My specific concern: if hoisting the dynamic array allocation out of the `do concurrent` construct causes the compiler to emit code where all concurrent iterations share a single copy of these variables, that breaks the correctness of this procedure.

Will you please show us the OpenMP code generated for the loop in question before and after your proposed change?