
Move dynamic stack allocations outside do concurrent #227

Closed

Conversation

@ergawy commented Nov 21, 2024

Dynamic stack allocations are allocations where memory is allocated on the stack (i.e. inside a function or procedure) with a size not known at compile time.

Even though this is legal at both compile time and run time for the CPU, it is not supported on the GPU.

Therefore, this PR removes an instance of such allocations (this is the only instance I came across so far while compiling for the GPU).

@rouson requested a review from @bonachea on November 22, 2024
@bonachea (Member)

Hi @ergawy, thanks for the PR.

@ergawy said:

Dynamic stack allocations are allocations where memory is allocated on the stack (i.e. inside a function or procedure) with a size not known at compile time.
Even though this is legal at both compile time and run time for the CPU, it is not supported on the GPU.

When you say "on the GPU this is not supported", can you please clarify exactly what you mean? AFAIK this code is permitted by the Fortran standard, so are you actually stating the AMD fork of flang (I'm guessing that's the compiler we're talking about here?) lacks support for this language feature when offloading to GPU?

Also, can you please clarify whether the compiler involved supports the F18 LOCAL locality spec for DO CONCURRENT? That's the feature we'd actually prefer to be using for the variables in question (see the version of the code activated by -DF2018_LOCALITY); your PR is modifying the "last resort" F08 version of the code for compilers lacking all locality specifiers.
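
For reference, here is a minimal sketch of the locality-specifier form we would prefer (the names m, p, q, and mini_batch_size are illustrative, not the actual code):

subroutine learn_sketch(m, p, q, mini_batch_size)
  implicit none
  integer, intent(in) :: m, p, q, mini_batch_size
  real :: a(m), z(p,q), delta(p,q) ! automatic arrays declared in the enclosing scope
  integer :: pair

  ! F2018 LOCAL locality: each iteration gets its own uninitialized copy
  do concurrent (pair = 1:mini_batch_size) local(a, z, delta)
    a = 0.; z = 0.; delta = 0.
  end do
end subroutine learn_sketch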

CC: @rouson

@@ -934,9 +937,6 @@

iteration: &
block

real a(maxval(self%nodes_), input_layer:output_layer) ! Activations
real z(size(b,1),size(b,2)), delta(size(b,1),size(b,2))
@bonachea (Member)

I'm concerned the proposed transformation might lead to incorrect behavior.

All three of these variables need to be semantically "local" to each concurrent iteration, because each concurrent iteration independently references and defines (reads and writes) these temporary variables as part of their execution.

This version of the code is the one for compilers lacking the F18 LOCAL locality specifier we'd prefer to use, and we worked around that missing feature by instead deliberately declaring/allocating these local variables inside the body of the do concurrent iteration. By declaring them within the scope of the DO CONCURRENT iteration block, this unambiguously ensures the storage for these three variables is private to each iteration (the semantic this code requires for correctness).

F23 11.1.7.5: (emphasis added)

A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality.

By moving the declaration outside do concurrent as proposed here, these variables effectively gain "unspecified locality", and instead become subject to the following semantic rules in F23 11.1.7.5 (inherited from F08 8.1.6.7):

If a variable has unspecified locality,
• if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates;
[...]
NOTE 3
The restrictions on the statements in a DO CONCURRENT construct are designed to ensure there are no data dependencies between iterations of the loop.

This wording is admittedly less definitive, but I read this to imply that these variables still need to behave as if each concurrent iteration has its own private copy, nearly identical to the LOCAL specifier (ignoring differences in variable contents at loop entry/exit that are irrelevant in this code).

TL;DR: I don't understand why this transformation "helps" the compiler in question. In particular, if this transformation causes the compiler to emit code where concurrent iterations all share the same copy of these three variables, that would break the correctness of this code. This procedure semantically requires each concurrent iteration to have a private/local copy of these three variables.
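
To make the hazard concrete, here is a hand-written sketch (hypothetical names) of the pattern that becomes incorrect if a hoisted temporary ends up with a single shared copy:

subroutine race_sketch(n, out)
  implicit none
  integer, intent(in) :: n
  real, intent(out) :: out(n)
  real :: tmp(n) ! hoisted out of the loop: unspecified locality
  integer :: i

  do concurrent (i = 1:n)
    tmp = real(i)     ! tmp is defined by more than one iteration,
    out(i) = sum(tmp) ! violating the unspecified-locality rule quoted above
  end do
end subroutine race_sketch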

@ergawy (Author) commented Nov 27, 2024

@bonachea Thanks for the detailed reply and sorry for the late response. I will reply this week, just busy with some other stuff atm.

@ergawy (Author) commented Dec 16, 2024

Dynamic stack allocations

To explain dynamic stack allocations, we can ignore do concurrent, OpenMP, and GPUs for now. Consider the following input:

program test_dynamic_stack_alloc_1
  implicit none
  integer :: n

  block
      real arr(n)
      arr(55) = 42
  end block
end program

Note that the size of arr is not known at compile time, yet arr is not allocated on the heap (using something like allocate or malloc). Hence, this is a dynamic array that is allocated on the stack of its parent function/procedure/subroutine.
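
For contrast, a heap-allocated equivalent (a minimal sketch) would use an allocatable array instead:

program test_heap_alloc
  implicit none
  integer :: n
  real, allocatable :: arr(:)

  n = 100
  allocate(arr(n)) ! heap allocation; no dynamically sized stack frame
  arr(55) = 42
  deallocate(arr)
end program test_heap_alloc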

Dynamic stack allocations (an OpenMP example)

Now, still ignoring do concurrent, let's look at OpenMP proper (which is what do concurrent maps to at the moment). Consider the following example, reduced from one of the do concurrent loops in neural_network_s.F90:

subroutine default_real_learn
  type(real), allocatable :: inputs(:)

  !$omp target
  !iterate_through_batch: &
  !do concurrent (pair = 1:10)
    iteration: &
    block
      real a(maxval(ubound(inputs)))

      a(1:30) = inputs
    end block iteration
  !end do iterate_through_batch
  !$omp end target
end subroutine default_real_learn

Here, we have a target (i.e. GPU region) that contains a dynamic stack allocation for an array a. If you try to compile this example with flang:

/path/to/bin/flang -fopenmp --offload-arch=gfx90a /tmp/test.f90 -o /tmp/test.o

You hit the following error:

ld.lld: error: <unknown>:0:0: in function __omp_offloading_fc00_600202_default_real_learn__l4 void (ptr, ptr): unsupported dynamic alloca

Note that I produced the above error with the latest upstream flang at the time of writing (hash f65a21a4ecc2e712c700c59842b6b9a1d2a9a060). So this is not specific to our fork. This error is produced by the following line.

I hope that answers at least some of your question @bonachea and helps us move the discussion forward.

@rouson (Contributor) commented Dec 16, 2024

@ergawy although it doesn't matter for this discussion, the above example is invalid code because n is used but never defined. More importantly, our use of block is a workaround for compilers that don't support locality specifiers for do concurrent. Rather than settling on what to do with block, it would be even better to support locality specifiers, including Fortran 2023 reduce locality. Is this something you're able to help with?
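
For concreteness, a minimal sketch of the Fortran 2023 reduce locality (illustrative names):

subroutine reduce_sketch(n, x, total)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: x(n)
  real, intent(out) :: total
  integer :: i

  total = 0.
  do concurrent (i = 1:n) reduce(+: total)
    total = total + x(i) ! per-iteration partial sums combined by the compiler
  end do
end subroutine reduce_sketch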

@bonachea (Member)

Hi @ergawy - Thanks for your response.

I already understood (or suspected) everything in your response. However, you have not addressed the key question regarding your PR'd code change:

if this transformation causes the compiler to emit code where concurrent iterations all share the same copy of these three variables, that would break the correctness of this code. This procedure semantically requires each concurrent iteration to have a private/local copy of these three variables.

My specific concern: if hoisting the dynamic array allocation out of the omp target section into host code causes all iterations running on the GPU to share the same copy of the three temporary array variables, then your proposed change could be an incorrect transformation.

Will you please show us the OpenMP code generated for the loop in question before and after your proposed change?
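
For illustration only (hand-written, not actual compiler output), the privatization a correct OpenMP mapping must preserve looks roughly like this:

subroutine privatized_sketch(m, p, q, mini_batch_size)
  implicit none
  integer, intent(in) :: m, p, q, mini_batch_size
  real :: a(m), z(p,q), delta(p,q)
  integer :: pair

  !$omp target teams distribute parallel do private(a, z, delta)
  do pair = 1, mini_batch_size
    a = 0.; z = 0.; delta = 0. ! each iteration works on its own private copies
  end do
  !$omp end target teams distribute parallel do
end subroutine privatized_sketch

Note that privatizing variably sized arrays on the device raises the same dynamic-allocation question, which is why seeing the generated code matters here.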

@rouson closed this Dec 18, 2024
@bonachea (Member)

For the record, @ergawy @rouson and myself discussed this PR today over Zoom and reached consensus that this PR was not the right resolution to this problem.
