Skip to content
This repository has been archived by the owner on Mar 20, 2023. It is now read-only.

Integrate changes from NERSC GPU hackathon. #713

Merged
merged 31 commits into from
Dec 23, 2021
Merged

Integrate changes from NERSC GPU hackathon. #713

merged 31 commits into from
Dec 23, 2021

Conversation

olupton
Copy link
Contributor

@olupton olupton commented Dec 13, 2021

Description
During the NERSC GPU hackathon we used hackathon_main as our "base" branch.
Now the hackathon is over, we should merge our developments into master, after re-enabling the full test suite and fixing any issues we delayed addressing during the hackathon.

See also:
neuronsimulator/gpuhackathon#4

Summary of hackathon developments:

  • Support OpenMP target offload when NMODL and GPU support are enabled.
  • Use sensible defaults for the --nwarp parameter, improving the performance of the Hines solver with --cell-permute=2 on GPU.
  • Use Boost memory pool, if Boost is available, to reduce the number of independent CUDA unified memory allocations used for Random123 stream objects. This speeds up initialisation of models using Random123, and also makes it feasible to use NSight Compute on models using Random123 and for NSight Systems to profile initialisation.
  • Use -cuda when compiling with NVHPC and OpenACC or OpenMP, as recommended on the NVIDIA forums.
  • Do not compile for compute capability 6.0 by default, as this is not supported by NVHPC with OpenMP target offload.
  • Add new GitLab CI tests so we test CoreNEURON + NMODL with both OpenACC and OpenMP.
  • Add CUDA runtime header search path explicitly, so we don't rely on it being implicit in our NVHPC localrc.

TODO:

Use certain branches for the SimulationStack CI
CI_BRANCHES:NEURON_BRANCH=master,NMODL_BRANCH=master,

olupton and others added 11 commits November 23, 2021 09:11
* Disable cmake-format and clang-format checks.
* Disable GitLab CI except for NMODL + GPU.
* Add a hackathon-specific argument for benchmarks.
* Add a reference comparison for channel-benchmark.
* create build/benchmark folder before trying to use it
* run nrnivmodl-core in parallel than serially (too slow)
* Add memory pool for Random123 streams.
   This speeds up initialisation when running on GPU.
* Make Boost optional.
This was a silly bug in #702.
* Simplify unified memory logic.
* Pass -mp=gpu when we pass -acc
* Pass -gpu=lineinfo for better debug information.
* Pass -Minfo=accel,mp for better compile time diagnostics.
* Add nrn_pragma_{acc,omp} macros for single-source Open{ACC,MP} support.
* Call omp_set_default_device.
* Drop cc60 because of OpenMP offload incompatibility.
* Add --gpu to test.
* Default (BB5-valid) CORENRN_EXTERNAL_BENCHMARK_DATA.
* Remove cuda_add_library.
* Don't print number of GPUs when quiet.
* Set OMP_NUM_THREADS=1 for lfp_test.
* Update NMODL to emit nrn_pragma{acc,omp} macros.

Co-authored-by: Pramod Kumbhar <[email protected]>
* Add wrapper functions for using OpenMP or OpenACC API
* Add -mp=gpu in order to link gpu runtime with tests as well
* Avoid copying VecPlay members twice otherwise association fails with OpenMP
     * IvocVect members t_ and y_ were copied twice
     * only discon_indices_ is pointer and hence that
        needs to be copied
@bbpbuildbot
Copy link
Collaborator

@bbpbuildbot
Copy link
Collaborator

@bbpbuildbot
Copy link
Collaborator

* Use #pragma omp instead of runtime API in `cnrn_target_{copyin,delete}`
* Fix `VecPlayContinuous::discon_indices_` device transfer.
* Name `cnrn_target_` wrappers more consistently.

Co-authored-by: Olli Lupton <[email protected]>
@bbpbuildbot
Copy link
Collaborator

We prefer selective host-to-device updates.
@bbpbuildbot
Copy link
Collaborator

Code fixes for XLC and Clang execution without build system changes.
This mainly adds missing OpenMP pragmas and makes cnrn_target_
wrappers visible to NMODL.
@bbpbuildbot
Copy link
Collaborator

omp_get_mapped_ptr was added in OpenMP 5.1 and is not widely supported.

With this change then calling cnrn_target_deviceptr on a pointer that is not
present on the device is a hard error instead of returning nullptr, so avoid
calling it for artificial cells.
@bbpbuildbot
Copy link
Collaborator

* Set nwarp to very big number for optimal parallelization and improve a bit grid config of CUDA solve_interleaved2
@bbpbuildbot
Copy link
Collaborator

olupton and others added 2 commits December 17, 2021 14:53
* Re-enable GitLab CI.
* Add NMODL + OpenACC test.
* Restore {clang,cmake}-format checks.
* Prefer OpenACC with MOD2C.
* Do not enable OpenACC in NMODL + OpenMP mode.
* Convert more #pragma acc to nrn_pragma_acc(...).
* Call cudaSetDevice in OpenMP mode.

Co-authored-by: Ioannis Magkanaris <[email protected]>
CMake/OpenAccHelper.cmake Outdated Show resolved Hide resolved
Presumably this was working before because our nvhpc localrc files
accidentally included CUDA include directories before
BlueBrain/spack#1392.
* Compile NVHPC+Open{ACC,MP} with -cuda.
* Pull in NMODL+Eigen fixes to make this work.
@olupton olupton marked this pull request as ready for review December 22, 2021 09:39
Copy link
Contributor

@iomaganaris iomaganaris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
Just one question

coreneuron/network/partrans.cpp Show resolved Hide resolved
@olupton olupton requested a review from pramodk December 22, 2021 12:03
Copy link
Collaborator

@pramodk pramodk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Overall all changes look fine to me. Added one/two comments for clarification.

coreneuron/CMakeLists.txt Show resolved Hide resolved
coreneuron/permute/cellorder.cu Show resolved Hide resolved
coreneuron/sim/solve_core.cpp Show resolved Hide resolved
coreneuron/network/partrans.cpp Show resolved Hide resolved
coreneuron/network/partrans.cpp Outdated Show resolved Hide resolved
@olupton olupton closed this Dec 22, 2021
@olupton olupton reopened this Dec 22, 2021
@olupton olupton closed this Dec 22, 2021
@olupton olupton reopened this Dec 22, 2021
@olupton olupton merged commit 423ae6c into master Dec 23, 2021
pramodk pushed a commit to neuronsimulator/nrn that referenced this pull request Nov 2, 2022
Summary of changes:
 - Support OpenMP target offload when NMODL and GPU support are enabled.
   (BlueBrain/CoreNeuron#693, BlueBrain/CoreNeuron#704, BlueBrain/CoreNeuron#705, BlueBrain/CoreNeuron#707, BlueBrain/CoreNeuron#708, BlueBrain/CoreNeuron#716, BlueBrain/CoreNeuron#719)
 - Use sensible defaults for the --nwarp parameter, improving the performance
   of the Hines solver with --cell-permute=2 on GPU. (BlueBrain/CoreNeuron#700, BlueBrain/CoreNeuron#710, BlueBrain/CoreNeuron#718)
 - Use a Boost memory pool, if Boost is available, to reduce the number of
   independent CUDA unified memory allocations used for Random123 stream
   objects. This speeds up initialisation of models using Random123, and also
   makes it feasible to use NSight Compute on models using Random123 and for
   NSight Systems to profile initialisation. (BlueBrain/CoreNeuron#702, BlueBrain/CoreNeuron#703)
 - Use -cuda when compiling with NVHPC and OpenACC or OpenMP, as recommended
   on the NVIDIA forums. (BlueBrain/CoreNeuron#721)
 - Do not compile for compute capability 6.0 by default, as this is not
   supported by NVHPC with OpenMP target offload.
 - Add new GitLab CI tests so we test CoreNEURON + NMODL with both OpenACC and
   OpenMP. (BlueBrain/CoreNeuron#698, BlueBrain/CoreNeuron#717)
 - Add CUDA runtime header search path explicitly, so we don't rely on it being
   implicit in our NVHPC localrc.
 - Cleanup unused code. (BlueBrain/CoreNeuron#711)

Co-authored-by: Pramod Kumbhar <[email protected]>
Co-authored-by: Ioannis Magkanaris <[email protected]>
Co-authored-by: Christos Kotsalos <[email protected]>
Co-authored-by: Nicolas Cornu <[email protected]>

CoreNEURON Repo SHA: BlueBrain/CoreNeuron@423ae6c
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants