Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xo_res initialization on CPU for OpenACC version and on GPU for CUDAFortran #153

Open
bellenlau opened this issue Nov 13, 2024 · 4 comments

Comments

@bellenlau
Copy link

bellenlau commented Nov 13, 2024

Line 69 in X_irredux_residuals.F fills Xo_res with zeros

  • on the GPU for CUDAFortran version (as being a device variable)
  • on the CPU for OpenACC version (as being a host variable)

The time taken for this operation on the CPU and on the GPU differs significantly and becomes evident when distributing the simulation in the OpenACC version, as being done on the CPU and not on the GPU.

In the following pictures, a small test running on 4 gpus is traced with nsys; nvtx range labelled "issue3" wraps line 68-69.

OpenACC small test

openacc-case

CUDAFortran small test

cudaf-case

The time becomes detrimental when running larger systems maximally distributed (e.g. GrCo-7k on 16 nodes in the following picture). This simulation takes 3 minutes for CUDAFortran version, but overcomes the walltime for OpenACC version.

image

If zeroing Xo_res is actually needed, a possible fix is kernels construct: bellenlau@8efaf61, but I am not sure if a counterpart in OpenMP offload exists.

With kernels, small test on 4 gpus, OpenACC version:

openacc-kernels-fix

The data: issue.tar.gz
software stack: nvhpc/23.1, no present clauses, openmpi/4.1.4 on Leonardo

@bellenlau bellenlau changed the title X0_res initialization on CPU for OpenACC version and on GPU for CUDAFortran Xo_res initialization on CPU for OpenACC version and on GPU for CUDAFortran Nov 13, 2024
@sangallidavide
Copy link
Member

Thanks Laura. Very accurate.

In the tech-gpu branch
https://github.com/yambo-code/yambo/blob/tech-gpu/src/pol_function/X_irredux_residuals.F#L69
I see this

 Xo_res    = cZERO
 call devxlib_memset_d(Xo_res,cZERO)

In cudaf, the "on GPU" zeroing is done via devxlib in the second line, while the "on CPU" zeroing is done via fortran n te first line.

I understand from your fix

 !DEV_ACC kernels
 Xo_res    = cZERO
 !DEV_ACC end kernels
 call devxlib_memset_d(Xo_res,cZERO)

that that Xo_res = cZERO is not an "on CPU" operation in the openacc case.

Indeed, as Laura is asking, @andrea-ferretti , is Xo_res = cZERO needed at all?
From the logic I see, it should be a device only variable

Alternatively, could this

 call devxlib_memset_h(Xo_res,cZERO)
 call devxlib_memset_d(Xo_res,cZERO)

be an alternative which works also in OpenMP?

We need to discuss the logic of OpenACC and devxlib in yambo ...

@bellenlau
Copy link
Author

Hello Davide,

from what I see from the profiler the zeroing of Xo_res in CUDAFortran is done not once but twice on the GPU; firstly because Xo_res = 0 is actually a GPU operation in CUDAFortran as Xo_res is a device variable (the offload is automatically managed by CUDAFortran) and once due to devixlib. In OpenACC this automatic offload does not exist because Xo_res is a host variable, and thus the first Xo_res does the operation on the CPU and devixlib does the operation on the GPU.

If the zeroing on cpu is needed, than using both devxlib for the host and then for the device should fix, so that all versions do the same. However I expect this to slow down the simulation as in the OpenACC trace when the system is well distributed among MPI tasks... I hope the zeroing on CPU is actually not needed :)

@sangallidavide
Copy link
Member

Ok. Clear.

Yeah, if zeroing on CPU is not done in CUDAF, I'd say it is not needed in OpenACC neither.

@andrea-ferretti
Copy link
Member

andrea-ferretti commented Nov 13, 2024

Dear all,
thanks for the comments. Indeed, I think Xo_res is only needed on the device, we should just drop the first line in
[ Xo_res = cZERO
call devxlib_memset_d(Xo_res,cZERO)]

I am also wondering whether Xo_res in X_irredux.F (in CPU parts) should not be replaced by Xo_res_p (which would point to CPU workspace in that case)... we may have issues with pointer slicing, though (this may be the reason why the pointer is not used throughtout).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants