
Move dynamic stack allocations outside do concurrent #227

Closed

Conversation

@ergawy commented Nov 21, 2024

Dynamic stack allocations are allocations where memory is allocated on the stack (i.e. inside a function or procedure) with a size not known at compile time.

Even though this is legal at both compile time and run time for the CPU, it is not supported on the GPU.

Therefore, this PR removes an instance of such allocations (this is the only instance I came across so far while compiling for the GPU).

@rouson requested a review from @bonachea on November 22, 2024
@bonachea (Member)

Hi @ergawy, thanks for the PR.

@ergawy said:

Dynamic stack allocations are allocations where memory is allocated on the stack (i.e. inside a function or procedure) with a size not known at compile time.
Even though this is legal at both compile time and run time for the CPU, it is not supported on the GPU.

When you say "on the GPU this is not supported", can you please clarify exactly what you mean? AFAIK this code is permitted by the Fortran standard, so are you actually stating the AMD fork of flang (I'm guessing that's the compiler we're talking about here?) lacks support for this language feature when offloading to GPU?

Also, can you please clarify whether the compiler involved supports the F18 LOCAL locality spec for DO CONCURRENT? That's the feature we'd actually prefer to be using for the variables in question (see the version of the code activated by -DF2018_LOCALITY); your PR is modifying the "last resort" F08 version of the code for compilers lacking all locality specifiers.
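
For reference, here is a minimal sketch of the locality-specifier form we would prefer (the names m, p, q, and mini_batch_size are illustrative, not the actual code):

subroutine learn_sketch(m, p, q, mini_batch_size)
  implicit none
  integer, intent(in) :: m, p, q, mini_batch_size
  real :: a(m), z(p,q), delta(p,q) ! automatic arrays declared in the enclosing scope
  integer :: pair

  ! F2018 LOCAL locality: each iteration gets its own uninitialized copy
  do concurrent (pair = 1:mini_batch_size) local(a, z, delta)
    a = 0.; z = 0.; delta = 0.
  end do
end subroutine learn_sketch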

CC: @rouson

@@ -934,9 +937,6 @@

iteration: &
block

real a(maxval(self%nodes_), input_layer:output_layer) ! Activations
real z(size(b,1),size(b,2)), delta(size(b,1),size(b,2))
@bonachea (Member)

I'm concerned the proposed transformation might lead to incorrect behavior.

All three of these variables need to be semantically "local" to each concurrent iteration, because each concurrent iteration independently references and defines (reads and writes) these temporary variables as part of their execution.

This version of the code is the one for compilers lacking the F18 LOCAL locality specifier we'd prefer to use, and we worked around that missing feature by instead deliberately declaring/allocating these local variables inside the body of the do concurrent iteration. By declaring them within the scope of the DO CONCURRENT iteration block, this unambiguously ensures the storage for these three variables is private to each iteration (the semantic this code requires for correctness).

F23 11.1.7.5: (emphasis added)

A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality.

By moving the declaration outside do concurrent as proposed here, these variables effectively gain "unspecified locality", and instead become subject to the following semantic rules in F23 11.1.7.5 (inherited from F08 8.1.6.7):

If a variable has unspecified locality,
• if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates;
[...]
NOTE 3
The restrictions on the statements in a DO CONCURRENT construct are designed to ensure there are no data dependencies between iterations of the loop.

This wording is admittedly less definitive, but I read this to imply that these variables still need to behave as if each concurrent iteration has its own private copy, nearly identical to the LOCAL specifier (ignoring differences in variable contents at loop entry/exit that are irrelevant in this code).

TL;DR: I don't understand why this transformation "helps" the compiler in question. In particular, if this transformation causes the compiler to emit code where concurrent iterations all share the same copy of these three variables, that would break the correctness of this code. This procedure semantically requires each concurrent iteration to have a private/local copy of these three variables.
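
To make the hazard concrete, here is a hand-written sketch (hypothetical names) of the pattern that becomes incorrect if a hoisted temporary ends up with a single shared copy:

subroutine race_sketch(n, out)
  implicit none
  integer, intent(in) :: n
  real, intent(out) :: out(n)
  real :: tmp(n) ! hoisted out of the loop: unspecified locality
  integer :: i

  do concurrent (i = 1:n)
    tmp = real(i)     ! tmp is defined by more than one iteration,
    out(i) = sum(tmp) ! violating the unspecified-locality rule quoted above
  end do
end subroutine race_sketch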

@ergawy (Author) commented Nov 27, 2024

@bonachea Thanks for the detailed reply and sorry for the late response. I will reply this week, just busy with some other stuff atm.

@ergawy (Author) commented Dec 16, 2024

Dynamic stack allocations

To explain dynamic stack allocations, we can ignore do concurrent, OpenMP, and GPUs for now. Consider the following input:

program test_dynamic_stack_alloc_1
  implicit none
  integer :: n

  block
      real arr(n)
      arr(55) = 42
  end block
end program

Note that the size of arr is not known at compile time, yet arr is not allocated on the heap (using something like allocate or malloc). Hence, this is a dynamic array that is allocated on the stack of its parent function/procedure/subroutine.
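
For contrast, a heap-allocated equivalent (a minimal sketch) would use an allocatable array instead:

program test_heap_alloc
  implicit none
  integer :: n
  real, allocatable :: arr(:)

  n = 100
  allocate(arr(n)) ! heap allocation; no dynamically sized stack frame
  arr(55) = 42
  deallocate(arr)
end program test_heap_alloc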

Dynamic stack allocations (an OpenMP example)

Now, still ignoring do concurrent, let's look at OpenMP proper (which is what do concurrent maps to at the moment). Consider the following example, reduced from one of the do concurrent loops in neural_network_s.F90:

subroutine default_real_learn
  type(real), allocatable :: inputs(:)

  !$omp target
  !iterate_through_batch: &
  !do concurrent (pair = 1:10)
    iteration: &
    block
      real a(maxval(ubound(inputs)))

      a(1:30) = inputs
    end block iteration
  !end do iterate_through_batch
  !$omp end target
end subroutine default_real_learn

Here, we have a target (i.e. GPU region) that contains a dynamic stack allocation for an array a. If you try to compile this example with flang:

/path/to/bin/flang -fopenmp --offload-arch=gfx90a /tmp/test.f90 -o /tmp/test.o

You hit the following error:

ld.lld: error: <unknown>:0:0: in function __omp_offloading_fc00_600202_default_real_learn__l4 void (ptr, ptr): unsupported dynamic alloca

Note that I produced the above error with the latest upstream flang at the time of writing (hash f65a21a4ecc2e712c700c59842b6b9a1d2a9a060). So this is not specific to our fork. This error is produced by the following line.

I hope that answers at least some of your question @bonachea and helps us move the discussion forward.

@rouson (Contributor) commented Dec 16, 2024

@ergawy although it doesn't matter for this discussion, the above example is invalid code because n is used but never defined. More importantly, our use of block is a workaround for compilers that don't support locality specifiers for do concurrent. Rather than settling on what to do with block, it would be even better to support locality specifiers, including Fortran 2023 reduce locality. Is this something you're able to help with?
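
For concreteness, a minimal sketch of the Fortran 2023 reduce locality (illustrative names):

subroutine reduce_sketch(n, x, total)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: x(n)
  real, intent(out) :: total
  integer :: i

  total = 0.
  do concurrent (i = 1:n) reduce(+: total)
    total = total + x(i) ! per-iteration partial sums combined by the compiler
  end do
end subroutine reduce_sketch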

@bonachea (Member)

Hi @ergawy - Thanks for your response.

I already understood (or suspected) everything in your response. However, you have not addressed the key question regarding your PR'd code change:

if this transformation causes the compiler to emit code where concurrent iterations all share the same copy of these three variables, that would break the correctness of this code. This procedure semantically requires each concurrent iteration to have a private/local copy of these three variables.

My specific concern: if hoisting the dynamic array allocation out of the omp target section into host code causes all iterations running on the GPU to share the same copy of the three temporary array variables, then your proposed change could be an incorrect transformation.

Will you please show us the OpenMP code generated for the loop in question before and after your proposed change?
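
For illustration only (hand-written, not actual compiler output), the privatization a correct OpenMP mapping must preserve looks roughly like this:

subroutine privatized_sketch(m, p, q, mini_batch_size)
  implicit none
  integer, intent(in) :: m, p, q, mini_batch_size
  real :: a(m), z(p,q), delta(p,q)
  integer :: pair

  !$omp target teams distribute parallel do private(a, z, delta)
  do pair = 1, mini_batch_size
    a = 0.; z = 0.; delta = 0. ! each iteration works on its own private copies
  end do
  !$omp end target teams distribute parallel do
end subroutine privatized_sketch

Note that privatizing variably sized arrays on the device raises the same dynamic-allocation question, which is why seeing the generated code matters here.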

@rouson closed this Dec 18, 2024
@bonachea (Member)

For the record, @ergawy @rouson and myself discussed this PR today over Zoom and reached consensus that this PR was not the right resolution to this problem.
