Unreproducible test failures #603

Open
rasolca opened this issue Jul 12, 2022 · 27 comments
Labels
Priority:Low · Type:Bug (Something isn't working)

Comments

@rasolca
Collaborator

rasolca commented Jul 12, 2022

@albestro @teonnik @msimberg
I created this issue to collect non-reproducible test failures, i.e. failures that occur so rarely that it is almost impossible to debug them.

Please add any case of a non-reproducible test failure from any system (local workstations included), as well as CI failures.

@rasolca added the Type:Bug (Something isn't working) label on Jul 12, 2022
@rasolca
Collaborator Author

rasolca commented Jul 12, 2022

EDIT: might be solved by #691

CI failure of test_gen_eigensolver

46: [ RUN      ] GenEigensolverTestGPU/2.CorrectnessLocal
46: /DLA-Future/test/unit/eigensolver/test_gen_eigensolver.cpp:126: Failure
46: Failed
46: Error at index ((1, 0)): expected (0.0227803,0.0631979) == (-0.00522555,0.0293542) (Relative diff: 0.653912 > 3.05176e-05, Absolute diff: 0.0439287 > 3.05176e-05)
46: 
46: [  FAILED  ] GenEigensolverTestGPU/2.CorrectnessLocal, where TypeParam = std::complex<float> (56 ms)

Full output of the error:
https://gitlab.com/cscs-ci/eth-cscs/DLA-Future/-/jobs/2709470723

Rerun passed:
https://gitlab.com/cscs-ci/eth-cscs/DLA-Future/-/jobs/2709574032
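
For reference, the check that fails above is a near-equality comparison of matrix elements that reports a failure only when both the relative and the absolute difference exceed the tolerance. A minimal sketch of such a check (a hypothetical helper, not DLA-Future's actual test code):

#include <algorithm>
#include <cmath>
#include <complex>

// Hypothetical near-equality check mirroring the message above: an element is
// flagged only when BOTH the relative and the absolute difference exceed the
// tolerance (3.05176e-05, i.e. roughly 2^-15, for the float case in the log).
template <typename T>
bool near_equal(std::complex<T> expected, std::complex<T> actual, T tol) {
  const T abs_diff = std::abs(expected - actual);
  const T rel_diff = abs_diff / std::max(std::abs(expected), std::abs(actual));
  return abs_diff <= tol || rel_diff <= tol;
}

// With the values from the log, abs_diff ~ 0.0439 and rel_diff ~ 0.654, both
// far above 3.05176e-05, so element (1, 0) is reported as a failure.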

@msimberg
Collaborator

msimberg commented Nov 2, 2022

This could have been triggered by #643, right? I don't mind keeping it open until we're more sure, but it seems like that same failure could lead to similar failures in the full pipeline.

@rasolca
Collaborator Author

rasolca commented Nov 3, 2022

It can; however, this issue is meant to collect any strange unreproducible failures so that we can keep track of them.

@aurianer changed the title from "Unreproducible test failure" to "Unreproducible test failures" on Nov 3, 2022
@rasolca
Collaborator Author

rasolca commented Dec 5, 2022

CI deadlock:
https://gitlab.com/cscs-ci/eth-cscs/DLA-Future/-/jobs/3416102369
might be related to #729

@msimberg
Collaborator

Precision error on test_eigensolver: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/3465819735#L1355. (Ran on #668, but unlikely to be caused by it.)

@msimberg
Collaborator

A couple more (again test_eigensolver and test_gen_eigensolver): https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/3477987284.

@RMeli
Member

RMeli commented Feb 23, 2023

I get the following failure on hohgant when running the RANK_6 tests:

57: Test command: /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication/test_multiplication_hermitian
57: Working Directory: /scratch/e1000/rmeli/git/DLA-Future/build2/test/unit/multiplication
57: Test timeout computed to be: 1500
57: Running main() from gtest_mpipika_main.cpp
57: [==========] Running 16 tests from 8 test suites.
57: [----------] Global test environment set-up.
57: [----------] 2 tests from HermitianMultiplicationTestMC/0, where TypeParam = float
57: [ RUN      ] HermitianMultiplicationTestMC/0.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/0.CorrectnessLocal (297 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/0.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/0.CorrectnessDistributed (13 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/0 (313 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/1, where TypeParam = double
57: [ RUN      ] HermitianMultiplicationTestMC/1.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/1.CorrectnessLocal (7 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/1.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/1.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/1 (18 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/2, where TypeParam = std::complex<float>
57: [ RUN      ] HermitianMultiplicationTestMC/2.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/2.CorrectnessLocal (2 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/2.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/2.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/2 (14 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestMC/3, where TypeParam = std::complex<double>
57: [ RUN      ] HermitianMultiplicationTestMC/3.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestMC/3.CorrectnessLocal (2 ms)
57: [ RUN      ] HermitianMultiplicationTestMC/3.CorrectnessDistributed
57: [       OK ] HermitianMultiplicationTestMC/3.CorrectnessDistributed (10 ms)
57: [----------] 2 tests from HermitianMultiplicationTestMC/3 (13 ms total)
57:
57: [----------] 2 tests from HermitianMultiplicationTestGPU/0, where TypeParam = float
57: [ RUN      ] HermitianMultiplicationTestGPU/0.CorrectnessLocal
57: [       OK ] HermitianMultiplicationTestGPU/0.CorrectnessLocal (481 ms)
57: [ RUN      ] HermitianMultiplicationTestGPU/0.CorrectnessDistributed
24/27 Test #57: test_multiplication_hermitian .....***Exception: SegFault  3.74 sec

@msimberg has seen this before on hohgant.


The code is compiled using the following procedure (from @rasolca):

srun --pty --partition=nvgpu --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash

export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config

cd $SCRATCH/DLA-Future
mkdir build && cd build

spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda concretize -f
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda install --only=dependencies
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

cmake .. -DCMAKE_BUILD_TYPE=Release -DDLAF_WITH_MKL=1 -DDLAF_MPI_PRESET=slurm -DMPIEXEC_NUMCORES=128 -DCMAKE_CUDA_ARCHITECTURES=80 -DDLAF_WITH_CUDA=1

nice -n 19 make -j 32

To run the tests I use the following (put together by scraping bits from existing scripts):

squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

sbatch test.sh
#!/bin/bash
#SBATCH --partition=nvgpu
#SBATCH --nodes=2
#SBATCH --hint=multithread
#SBATCH --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs

# Before SBATCHing this script:
#   squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
#   export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
#   spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash

export MPICH_MAX_THREAD_SAFETY=multiple
export MIMALLOC_EAGER_COMMIT_DELAY=0
export MIMALLOC_LARGE_OS_PAGES=1

hostname
nvidia-smi
mpichversion

# gpu2ranks
cat > gpu2ranks <<EOF
#!/bin/bash
# Restrict visible GPUs when using multiple ranks per node with slurm.

set -eu

export CUDA_VISIBLE_DEVICES=\$SLURM_LOCALID

eval "\$@"
EOF
chmod +x gpu2ranks

# DLA-Future needs to be compiled with -DDLAF_CI_RUNNER_USES_MPIRUN=on
srun -u -n 1 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_1
srun -u -n 2 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_2
srun -u -n 4 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -L RANK_4
#srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
#    ./gpu2ranks ctest -L RANK_6
srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
    ./gpu2ranks ctest -V -L RANK_6

@rasolca
Collaborator Author

rasolca commented Feb 23, 2023

@RMeli how often does the problem appear?

@RMeli
Member

RMeli commented Feb 23, 2023

@rasolca Every time so far, but I only tried a few times before asking @msimberg if he ever saw the same issue.

@rasolca
Collaborator Author

rasolca commented Feb 23, 2023

@RMeli I'll move it to a separate issue since it is reproducible.
This issue is meant to collect unreproducible errors so that we can monitor what is failing and help with debugging...

@rasolca
Collaborator Author

rasolca commented Feb 28, 2023

band_to_tridiagonal segfaulted once on eiger. Tested with repeat 200 and then repeat 1000, and the issue was not reproducible.

@msimberg
Collaborator

Possible hang in the tridiagonal eigensolver tests: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5543375695#L2169. However, this may also be caused by external problems.

@msimberg
Collaborator

Another possible hang, this time in the hermitian multiplication test: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5621888281#L2326. This one is on #993 and may be caused by the changes there, but it seems unlikely.

@rasolca
Collaborator Author

rasolca commented Dec 12, 2023

@rasolca
Collaborator Author

rasolca commented Dec 12, 2023

@rasolca
Collaborator Author

rasolca commented Dec 12, 2023

@msimberg
Collaborator

pika-org/pika#976 seems to fix issues we've seen in the tridiagonal eigensolver test.

@albestro
Collaborator

@albestro
Collaborator

@RMeli
Member

RMeli commented Jun 21, 2024

@msimberg
Collaborator

Segfault in test_multiplication_hermitian (after tests have run): https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/7643366038#L2553. This one has the following backtrace:

60: Backtrace:
60: /root/DLA-Future.bundle/usr/lib/libmimalloc.so.2(__libc_free+0x39)[0x7ffff6e0d2d9]
60: /root/DLA-Future.bundle/usr/lib/libgtest.so.1.13.0(_ZN7testing8internal11ThreadLocalISt6vectorINS0_9TraceInfoESaIS3_EEE11ValueHolderD0Ev+0x4e)[0x7ffff7120f6e]
60: /lib/x86_64-linux-gnu/libc.so.6(+0x91691)[0x7ffff6355691]
60: /lib/x86_64-linux-gnu/libc.so.6(+0x9494a)[0x7ffff635894a]
60: /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7ffff63ea850]

Maybe indicating we still have some gtest unsafety (switching threads or similar) together with mimalloc funny business?
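
For what it's worth, here is a minimal sketch (a made-up test, not DLA-Future code) of the mechanism the backtrace hints at: gtest keeps a per-thread trace stack in the ThreadLocal<std::vector<TraceInfo>> shown above, so a traced assertion issued from a worker thread lazily creates a ValueHolder on that thread, and the holder is freed at thread exit through whatever allocator is interposed at that point (mimalloc in the CI image):

#include <gtest/gtest.h>

#include <thread>

// Made-up test illustrating the suspected pattern, not the actual failing one.
TEST(ThreadLocalSketch, TraceOnWorkerThread) {
  std::thread worker([] {
    // SCOPED_TRACE pushes onto gtest's per-thread trace stack; the first
    // access from this thread creates the ThreadLocal ValueHolder, which is
    // destroyed (and freed) when the worker thread exits.
    SCOPED_TRACE("assertion issued from a worker thread");
    EXPECT_EQ(1 + 1, 2);
  });
  worker.join();
}

With pika worker threads the equivalent free would happen when those threads exit at shutdown, which would match the crash showing up only after the tests have run; that is speculation consistent with the backtrace, not something verified.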

@msimberg
Collaborator

@RMeli
Member

RMeli commented Nov 20, 2024

#1216 (comment)

@albestro
Collaborator

albestro commented Dec 2, 2024

test_reduce failed: #1218 (comment)

@rasolca
Collaborator Author

rasolca commented Dec 6, 2024

@msimberg
Collaborator

@msimberg
Collaborator
