Unreproducible test failures #603
EDIT: might be solved by #691. CI failure of test_gen_eigensolver
Full output of the error: Rerun passed:
This could have been triggered by #643, right? I don't mind keeping it open until we're more sure, but it seems like that same failure could lead to similar failures in the full pipeline.
It can; however, this issue is meant to collect any strange unreproducible failures, to keep track of them.
CI deadlock:
Precision error on
A couple more (again
I get the following failure on
@msimberg has seen this before on
The code is compiled using the following procedure (from @rasolca):

```
srun --pty --partition=nvgpu --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
cd $SCRATCH/DLA-Future
mkdir build && cd build
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda concretize -f
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda install --only=dependencies
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash
cmake .. -DCMAKE_BUILD_TYPE=Release -DDLAF_WITH_MKL=1 -DDLAF_MPI_PRESET=slurm -DMPIEXEC_NUMCORES=128 -DCMAKE_CUDA_ARCHITECTURES=80 -DDLAF_WITH_CUDA=1
nice -n 19 make -j 32
```

To run the tests I use the following (built scraping things from

```
squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env dla-future -- bash
sbatch test.sh
```

test.sh:

```
#!/bin/bash
#SBATCH --partition=nvgpu
#SBATCH --nodes=2
#SBATCH --hint=multithread
#SBATCH --uenv-file=/scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs
# Before SBATCHing this script:
# squashfs-run /scratch/e1000/rmeli/squashfs/dlaf-mkl-cuda.squashfs bash
# export SPACK_SYSTEM_CONFIG_PATH=/user-environment/config
# spack -e /scratch/e1000/rmeli/git/my-spack/envs/dlaf-mkl-cuda build-env -- bash
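# Presumed intent of the exports below (not documented in the original script):
# MPICH_MAX_THREAD_SAFETY=multiple asks MPICH to provide MPI_THREAD_MULTIPLE,
# MIMALLOC_EAGER_COMMIT_DELAY=0 turns off mimalloc's delayed eager commit of segments,
# and MIMALLOC_LARGE_OS_PAGES=1 lets mimalloc use large (2 MiB) OS pages.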
export MPICH_MAX_THREAD_SAFETY=multiple
export MIMALLOC_EAGER_COMMIT_DELAY=0
export MIMALLOC_LARGE_OS_PAGES=1
hostname
nvidia-smi
mpichversion
# gpu2ranks
cat > gpu2ranks <<EOF
#!/bin/bash
# Restrict visible GPUs when using multiple ranks per node with slurm.
set -eu
export CUDA_VISIBLE_DEVICES=\$SLURM_LOCALID
eval "\$@"
EOF
chmod +x gpu2ranks
# DLA-Future needs to be compiled with -DDLAF_CI_RUNNER_USES_MPIRUN=on
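# Note on --cpu-bind below: mask_cpu takes one hexadecimal CPU mask per task on
# the node, assigned in order of SLURM_LOCALID, so each rank is pinned to its own
# block of cores/hyperthreads; the masks were presumably chosen for this node's layout.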
srun -u -n 1 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
./gpu2ranks ctest -L RANK_1
srun -u -n 2 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
./gpu2ranks ctest -L RANK_2
srun -u -n 4 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
./gpu2ranks ctest -L RANK_4
#srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
# ./gpu2ranks ctest -L RANK_6
srun -u -n 6 --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff \
./gpu2ranks ctest -V -L RANK_6
```
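As a quick sanity check of the gpu2ranks wrapper, each local rank should end up with a different CUDA_VISIBLE_DEVICES value. The command below is only an illustration (it assumes the same allocation and working directory as the script above):

```
# Each rank should print its own SLURM_LOCALID as CUDA_VISIBLE_DEVICES.
srun -u -n 4 ./gpu2ranks 'echo "rank $SLURM_PROCID (local $SLURM_LOCALID) -> CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```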
@RMeli how often does the problem appear?
@RMeli I'll move it to a separate issue as it is reproducible.
Possible hang in the tridiagonal eigensolver tests: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5543375695#L2169. However, this may also be caused by external problems.
Another possible hang, this time in the hermitian multiplication test: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5621888281#L2326. This one is on #993 and may be caused by the changes there, but it seems unlikely.
Hermitian multiplication again:
Tridiagonal eigensolver hang:
Tridiagonal eigensolver segfault:
pika-org/pika#976 seems to fix issues we've seen in the tridiagonal eigensolver test.
HermitianMultiplicationTestMC
HermitianMultiplicationTestMC
Segfault in
Maybe this indicates we still have some gtest thread-unsafety (switching threads or similar) together with some mimalloc funny business?
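One hypothetical way to probe the mimalloc half of that hypothesis (a sketch only: it reuses the srun/gpu2ranks setup from the earlier comment, and the test regex and repeat count are made up) would be to stress-rerun the suspect test with the mimalloc tweaks from test.sh removed:

```
# Rerun the suspect test until it fails, without the mimalloc environment tweaks
# (ctest --repeat until-fail requires CMake >= 3.17).
env -u MIMALLOC_EAGER_COMMIT_DELAY -u MIMALLOC_LARGE_OS_PAGES \
  srun -u -n 2 ./gpu2ranks ctest --output-on-failure --repeat until-fail:50 -R HermitianMultiplicationTestMC
```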
#1218 (comment) |
test_reduce again. GPU pipelines were not modified by #1192. |
Another |
@albestro @teonnik @msimberg
I created this issue to collect non-reproducible test failures, i.e. failures that occur so rarely that it is almost impossible to debug them.
Please add any case of a non-reproducible test failure from any system (local workstations included), as well as CI failures.