test_multiplication_hermitian fails on hohgant #808

Open
rasolca opened this issue Feb 23, 2023 · 7 comments

@rasolca (Collaborator) commented Feb 23, 2023

Reported by @RMeli in #603, but moved here as it is reproducible:

I get the following failure on hohgant when running the RANK_6 tests:

57: Test command: /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian
57: Working Directory: /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication
57: Test timeout computed to be: 1500
57: Running main() from gtest_mpipika_main.cpp
57: [==========] Running 16 tests from 8 test suites.
57: [----------] Global test environment set-up.
57: [----------] 2 tests from HermitianMultiplicationTestMC/0, where TypeParam = float
57: [ RUN      ] HermitianMultiplicationTestMC/0.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/0.CorrectnessLocal (297 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/0.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/0.CorrectnessDistributed (13 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/0 (313 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/1, where TypeParam = double
57: [ RUN      ] HermitianMultiplicationTestMC/1.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/1.CorrectnessLocal (7 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/1.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/1.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/1 (18 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/2, where TypeParam = std::complex<float>
57: [ RUN      ] HermitianMultiplicationTestMC/2.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/2.CorrectnessLocal (2 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/2.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/2.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/2 (14 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/3, where TypeParam = std::complex<double>
57: [ RUN      ] HermitianMultiplicationTestMC/3.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/3.CorrectnessLocal (2 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/3.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/3.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/3 (13 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestGPU/0, where TypeParam = float
57: [ RUN      ] HermitianMultiplicationTestGPU/0.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestGPU/0.CorrectnessLocal (481 ms)
57: [ RUN      ] HermitianMultiplicationTestGPU/0.CorrectnessDistributed
24/27 Test #57: test_multiplication_hermitian .....***Exception: SegFault  3.74 sec

@msimberg has seen this before on hohgant.

The code is compiled using the following procedure (from @rasolca):

srun --pty --partition=nvgpu --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash

export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config

cd $SCRATCH/DLA-Future
mkdir build && cd build

spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda concretize -f
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda install --only=dependencies
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

cmake .. -DCMAKE_BUILD_TYPE=Release -DDLAF_WITH_MKL=1 -DDLAF_MPI_PRESET=slurm -DMPIEXEC_NUMCORES=128 -DCMAKE_CUDA_ARCHITECTURES=80 -DDLAF_WITH_CUDA=1

nice -n 19 make -j 32
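
For iterating on just the failing case, only the affected test can be rebuilt and located (a sketch; it assumes the CMake target has the same name as the test executable in the log above, which is the usual per-executable target layout):

# Rebuild only the Hermitian multiplication test target.
nice -n 19 make -j 32 test_multiplication_hermitian
# List the ctest entries that match it without running them;
# the actual runs go through srun + gpu2ranks as in the run script below.
ctest -N -R test_multiplication_hermitian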

To run the tests I use the following (put together by scraping bits from existing scripts):

squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

sbatch test.sh

#!/bin/bash
#SBATCH --partition=nvgpu
#SBATCH --nodes=2
#SBATCH --hint=multithread
#SBATCH --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs

# Before SBATCHing this script:
#   squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
#   export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
#   spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env -- bash

export MPICH_MAX_THREAD_SAFETY=multiple
export MIMALLOC_EAGER_COMMIT_DELAY=0
export MIMALLOC_LARGE_OS_PAGES=1

hostname
nvidia-smi
mpichversion

# gpu2ranks
cat > gpu2ranks <<EOF
#!/bin/bash
# Restrict visible GPUs when using multiple ranks per node with slurm.

set -eu

export CUDA_VISIBLE_DEVICES=\$SLURM_LOCALID

eval "\$@"
EOF
chmod +x gpu2ranks

# DLA-Future needs to be compiled with -DDLAF_CI_RUNNER_USES_MPIRUN=on
srun -u -n 1 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_1
srun -u -n 2 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_2
srun -u -n 4 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_4
#srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
#    ./gpu2ranks ctest -L RANK_6
srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -V -L RANK_6
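
As a quick sanity check of the GPU binding (a sketch, not part of test.sh; it assumes nvidia-smi is available inside the uenv), the wrapper can be run with a trivial command to confirm that each rank only sees one device:

# Each rank should report exactly one GPU, selected via SLURM_LOCALID in gpu2ranks.
srun -u -n 4 ./gpu2ranks nvidia-smi -L
srun -u -n 4 ./gpu2ranks 'echo "rank $SLURM_PROCID -> CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'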

@rasolca added the Type:Bug (Something isn't working) and Priority:High labels on Feb 23, 2023
@rasolca self-assigned this on Feb 23, 2023
@rasolca (Collaborator, Author) commented Feb 23, 2023

@msimberg I have a trace (attached: trace.txt), but I'm not sure what could be wrong...
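
For reference, one way such a per-rank backtrace can be collected (a sketch, not necessarily how trace.txt was produced; it reuses the binary path from the ctest log and the gpu2ranks wrapper from the run script above):

# Run the failing test under gdb in batch mode and print a backtrace when it crashes.
srun -u -n 6 ./gpu2ranks \
    gdb -batch -ex run -ex bt --args \
    /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian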

@msimberg (Collaborator) commented Feb 23, 2023

Thanks @rasolca for checking. The only thing that stands out is that the stack trace is relatively long, but not impossibly so. It may be that it's running out of stack space; then again, the way the trace ends could point to something completely different. I'm really not sure, but it may at least be worth trying --pika:ini=pika.stacks.small_size=0x10000 to see what happens, although this is a very long shot... And I don't really see what the Hermitian matrix multiplication would be doing differently compared to other algorithms to trigger something like that.
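
For concreteness, the suggestion could be tried like this (a sketch; it assumes the test binary accepts pika runtime options on its command line, and reuses the binary path and gpu2ranks wrapper from above):

# Re-run only the failing test with the small-stack size set to 0x10000 (64 KiB).
srun -u -n 6 ./gpu2ranks \
    /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian \
    --pika:ini=pika.stacks.small_size=0x10000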

@rasolca (Collaborator, Author) commented Feb 23, 2023

Nope, it is not the solution... 😞

@rasolca (Collaborator, Author) commented Apr 11, 2023

A failure of the Hermitian multiplication happened in CI as well (daint MC):
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/4093987502

(Note: not reproducible)

@rasolca (Collaborator, Author) commented May 16, 2023

@msimberg Was this solved by #838?

@msimberg (Collaborator) commented:
I haven't tried again since the NVIDIA nodes were returned to hohgant, but I'll try to do so this week. @RMeli or @albestro, have either of you run tests on hohgant since the NVIDIA nodes came back?

@msimberg (Collaborator) commented:
This is still a problem, but I don't know what's causing it.
