test_multiplication_hermitian fails on hohgant #808

Open
rasolca opened this issue Feb 23, 2023 · 7 comments

@rasolca (Collaborator) commented Feb 23, 2023

Reported by @RMeli in #603, but moved here as it is reproducible:

I get the following failure on hohgant when running the RANK_6 tests:

57: Test command: /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian
57: Working Directory: /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication
57: Test timeout computed to be: 1500
57: Running main() from gtest_mpipika_main.cpp
57: [==========] Running 16 tests from 8 test suites.
57: [----------] Global test environment set-up.
57: [----------] 2 tests from HermitianMultiplicationTestMC/0, where TypeParam = float
57: [ RUN      ] HermitianMultiplicationTestMC/0.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/0.CorrectnessLocal (297 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/0.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/0.CorrectnessDistributed (13 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/0 (313 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/1, where TypeParam = double
57: [ RUN      ] HermitianMultiplicationTestMC/1.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/1.CorrectnessLocal (7 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/1.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/1.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/1 (18 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/2, where TypeParam = std::complex<float>
57: [ RUN      ] HermitianMultiplicationTestMC/2.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/2.CorrectnessLocal (2 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/2.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/2.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/2 (14 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/3, where TypeParam = std::complex<double>
57: [ RUN      ] HermitianMultiplicationTestMC/3.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/3.CorrectnessLocal (2 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/3.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/3.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/3 (13 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestGPU/0, where TypeParam = float
57: [ RUN      ] HermitianMultiplicationTestGPU/0.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestGPU/0.CorrectnessLocal (481 ms)
57: [ RUN      ] HermitianMultiplicationTestGPU/0.CorrectnessDistributed
24/27 Test #57: test_multiplication_hermitian .....***Exception: SegFault  3.74 sec

@msimberg has seen this before on hohgant.

The code is compiled using the following procedure (from @rasolca):

srun --pty --partition=nvgpu --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash

export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config

cd $SCRATCH/DLA-Future
mkdir build && cd build

spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda concretize -f
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda install --only=dependencies
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

cmake .. -DCMAKE_BUILD_TYPE=Release -DDLAF_WITH_MKL=1 -DDLAF_MPI_PRESET=slurm -DMPIEXEC_NUMCORES=128 -DCMAKE_CUDA_ARCHITECTURES=80 -DDLAF_WITH_CUDA=1

nice -n 19 make -j 32
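
For iterating on just the failing case, only the affected test can be rebuilt and located (a sketch; it assumes the CMake target has the same name as the test executable in the log above, which is the usual per-executable target layout):

# Rebuild only the Hermitian multiplication test target.
nice -n 19 make -j 32 test_multiplication_hermitian
# List the ctest entries that match it without running them;
# the actual runs go through srun + gpu2ranks as in the run script below.
ctest -N -R test_multiplication_hermitian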

To run the tests I use the following (put together by scraping bits from existing scripts):

squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

sbatch test.sh

#!/bin/bash
#SBATCH --partition=nvgpu
#SBATCH --nodes=2
#SBATCH --hint=multithread
#SBATCH --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs

# Before SBATCHing this script:
#   squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
#   export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
#   spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env -- bash

export MPICH_MAX_THREAD_SAFETY=multiple
export MIMALLOC_EAGER_COMMIT_DELAY=0
export MIMALLOC_LARGE_OS_PAGES=1

hostname
nvidia-smi
mpichversion

# gpu2ranks
cat > gpu2ranks <<EOF
#!/bin/bash
# Restrict visible GPUs when using multiple ranks per node with slurm.

set -eu

export CUDA_VISIBLE_DEVICES=\$SLURM_LOCALID

eval "\$@"
EOF
chmod +x gpu2ranks

# DLA-Future needs to be compiled with -DDLAF_CI_RUNNER_USES_MPIRUN=on
srun -u -n 1 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_1
srun -u -n 2 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_2
srun -u -n 4 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_4
#srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
#    ./gpu2ranks ctest -L RANK_6
srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -V -L RANK_6
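
As a quick sanity check of the GPU binding (a sketch, not part of test.sh; it assumes nvidia-smi is available inside the uenv), the wrapper can be run with a trivial command to confirm that each rank only sees one device:

# Each rank should report exactly one GPU, selected via SLURM_LOCALID in gpu2ranks.
srun -u -n 4 ./gpu2ranks nvidia-smi -L
srun -u -n 4 ./gpu2ranks 'echo "rank $SLURM_PROCID -> CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'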

@rasolca added the Type:Bug (Something isn't working) and Priority:High labels on Feb 23, 2023
@rasolca self-assigned this on Feb 23, 2023
@rasolca (Collaborator, Author) commented Feb 23, 2023

@msimberg I have a trace (attached: trace.txt), but I'm not sure what could be wrong...
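
For reference, one way such a per-rank backtrace can be collected (a sketch, not necessarily how trace.txt was produced; it reuses the binary path from the ctest log and the gpu2ranks wrapper from the run script above):

# Run the failing test under gdb in batch mode and print a backtrace when it crashes.
srun -u -n 6 ./gpu2ranks \
    gdb -batch -ex run -ex bt --args \
    /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian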

@msimberg (Collaborator) commented Feb 23, 2023

Thanks @rasolca for checking. The only thing that stands out is that the stack trace is relatively long, but not impossibly so. It may be that it's running out of stack space; then again, the way the trace ends could point to something completely different. I'm really not sure, but it may at least be worth trying --pika:ini=pika.stacks.small_size=0x10000 to see what happens, although this is a very long shot... And I don't really see what the Hermitian matrix multiplication would be doing differently compared to other algorithms to trigger something like that.
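
For concreteness, the suggestion could be tried like this (a sketch; it assumes the test binary accepts pika runtime options on its command line, and reuses the binary path and gpu2ranks wrapper from above):

# Re-run only the failing test with the small-stack size set to 0x10000 (64 KiB).
srun -u -n 6 ./gpu2ranks \
    /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian \
    --pika:ini=pika.stacks.small_size=0x10000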

@rasolca (Collaborator, Author) commented Feb 23, 2023

Nope, it is not the solution... 😞

@rasolca (Collaborator, Author) commented Apr 11, 2023

A failure of the Hermitian multiplication happened in CI as well (daint MC):
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/4093987502

(Note: not reproducible)

@rasolca (Collaborator, Author) commented May 16, 2023

@msimberg Was this solved by #838?

@msimberg (Collaborator) commented:
I haven't tried again since the NVIDIA nodes were returned to hohgant, but I'll try to do so this week. @RMeli or @albestro, have either of you run tests on hohgant since the NVIDIA nodes came back?

@msimberg (Collaborator) commented:
This is still a problem, but I don't know what's causing it.
