Unexpected bus error encountered in worker #1648

Open
albertz opened this issue Nov 14, 2024 · 2 comments

albertz commented Nov 14, 2024

I have gotten this twice now, at the end of a successful training run:

...
Uname: uname_result(system='Linux', node='w23g0002.hpc.itc.rwth-aachen.de', release='4.18.0-553.22.1.el8_10.x86_64', version='#1 SMP Wed Sep 25 09:20:43 UTC 2024', machine='x86_64')
Load: (0.17, 0.24, 0.21)
[2024-11-13 11:07:57,866] INFO: ------------------------------------------------------------
[2024-11-13 11:07:57,866] INFO: Starting subtask for arg id: 0 args: []
[2024-11-13 11:07:57,866] INFO: ------------------------------------------------------------
[2024-11-13 11:07:57,893] INFO: Run time: 0:00:00 CPU: 124.30% RSS: 81MB VMS: 1.50GB
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
RETURNN train proc manager starting up, version 1.20241113.092146+git.55cdd2f8
Most recent trained model epoch: 0 file: None
Run RETURNN...
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
Running in managed mode.
RETURNN starting up, version 1.20241113.092146+git.55cdd2f8, date/time 2024-11-13-11-07-58 (UTC+0100), pid 162050, cwd /rwthfs/rz/cluster/hpcwork/az668407/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.qGCKnKppHvEu/work, Python /home/az668407/work/py-envs/py3.12-torch2.5/bin/python
RETURNN command line options: ['/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.qGCKnKppHvEu/output/returnn.config']
Hostname: w23g0002.hpc.itc.rwth-aachen.de
...
PyTorch: 2.5.0+cu124 (32f585d9346e316e554c8d9bf7548af9f62141fc) (<site-package> in /home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch)
MKL_EXAMPLES=/cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/sapphirerapids/software/imkl/2022.1.0/mkl/2022.1.0/examples
CUDA_VISIBLE_DEVICES=0
NCCL_DEBUG=INFO
OMP_NUM_THREADS=24
MKL_NUM_THREADS=24
CUDA_VISIBLE_DEVICES is set to '0'.
Available CUDA devices:
  1/1: cuda:0
       name: NVIDIA H100
       total_memory: 93.0GB
       capability: 9.0
       device_index: 0
Train data:
  input: 10240 x 1
  output: {'data': [10240, 1]}
  LmDataset, sequences: unknown, frames: unknown
Using device: cuda ('gpu' in config)
Using gpu device 0: NVIDIA H100
Total GPU 0 memory 93.0GB, free 92.5GB
Using autocast (automatic mixed precision (AMP)) with dtype torch.bfloat16
...
Starting training at epoch 1, global train step 0
start epoch 1 global train step 0 with effective learning rate 1e-05 ...
...
...
Epoch 100 evaluation: dev: ce 3.670 ce:exp 39.242 fer 0.679 devtrain: ce 3.773 ce:exp 43.510 fer 0.691
Memory usage (cuda): alloc cur 1.3GB alloc peak 2.2GB reserved cur 10.9GB reserved peak 10.9GB
We have stored models for epochs [10, 20, 40, ..., 98, 99, 100] and keep epochs [10, 20, 40, 80, 96, 97, 98, 99, 100].
We will delete the models of epochs [95].
Deleted 308.7MB.
Finished training at epoch 100, global train step 2090917
elapsed: 32:35:58.6757
Quitting
Cache manager: not used, use local file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.5ad18raRAWhr/output/out.ogg.zip (discard further messages)
Reading sequence list for MetaDataset 'devtrain' from sub-dataset 'devtrain_ogg_zip'
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
^@Fatal Python error: Bus error

Thread 0x0000145a9d6ff700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000145ab03e4700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x0000145ab09e5700 (most recent call first):
  <no Python frame>

Thread 0x0000145ab0be6700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Current thread 0x0000145dc89a8240 (most recent call first):
  File "/home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73 in handler
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/multi_proc_non_daemonic_spawn.py", line 215 in __call__

Extension modules: h5py._errors, h5py.defs, h5py._objects, h5py.h5, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, kiwisolver._cext, sentencepiece._sentencepiece (total: 54)
Run ['/home/az668407/work/py-envs/py3.12-torch2.5/bin/python', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.qGCKnKppHvEu/output/returnn.config']
RETURNN runtime: 32:36:08
RETURNN return code: -7
Most recent trained model epoch: 100 file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.qGCKnKppHvEu/output/models/epoch.100
Finished training, but got some error?
Total RETURNN num starts: 1
Total RETURNN runtime: 32:36:08
[2024-11-14 19:44:06,510] ERROR: Executed command failed:
[2024-11-14 19:44:06,510] ERROR: Cmd: ['/home/az668407/work/py-envs/py3.12-torch2.5/bin/python', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.qGCKnKppHvEu/output/returnn.config']
[2024-11-14 19:44:06,510] ERROR: Args: (249, ['/home/az668407/work/py-envs/py3.12-torch2.5/bin/python', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.qGCKnKppHvEu/output/returnn.config'])
[2024-11-14 19:44:06,510] ERROR: Return-Code: 249
[2024-11-14 19:44:06,512] INFO: Max resources: Run time: 32:36:08 CPU: 124.3% RSS: 13.58GB VMS: 97.96GB

This is at the RWTH ITC. I have never seen this before.


albertz commented Nov 15, 2024

Strangely, I now get this very frequently (always at the RWTH ITC). Nothing has really changed in my setup.

...ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
^@Fatal Python error: Bus error

Thread 0x00007f062fffe700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f064ffff700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f0695bd8700 (most recent call first):
  <no Python frame>

Thread 0x00007f0700e6a700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Current thread 0x00007f0def8e5240 (most recent call first):
  File "/home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73 in handler
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/multi_proc_non_daemonic_spawn.py", line 219 in __call__

Extension modules: h5py._errors, h5py.defs, h5py._objects, h5py.h5, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, kiwisolver._cext, sentencepiece._sentencepiece (total: 54)
Cache manager: not used, use local file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.5ad18raRAWhr/output/out.ogg.zip (discard further messages)
Cache manager: not used, use local file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/oggzip/BlissToOggZipJob.RvwLniNrgMit/output/out.ogg.zip (discard further messages)
Run ['/home/az668407/work/py-envs/py3.12-torch2.5/bin/python', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.zNqdeO16DII0/output/returnn.config']
RETURNN runtime: 35:20:44
RETURNN return code: -7
Most recent trained model epoch: 100 file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.zNqdeO16DII0/output/models/epoch.100
Finished training, but got some error?
Total RETURNN num starts: 1
Total RETURNN runtime: 35:20:44
MEMORY: total (main 91987, 2024-11-15, 02:12:28, 132 procs): pss=30.2GB uss=29.7GB


albertz commented Nov 15, 2024

Note: searching for this error gives many results, but they do not really seem to apply to my case:

- Many solutions are about increasing the SHM size in Docker. That does not apply to me, as I don't use Docker here.
- Others are about multi-GPU training, where the error only happens with an increased number of GPUs, but I only use single-GPU training here.
- In most of those reports, it also seems to occur at a random point during training, whereas for me it happens only at the very end, during shutdown, probably in some atexit handler or so.
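For reference, a minimal sketch (hypothetical, not part of my RETURNN config) of the kind of check and workaround those reports usually suggest: it prints how much space the node's /dev/shm actually provides, and switches PyTorch's tensor-sharing strategy to "file_system", which avoids going through shm file descriptors in the DataLoader workers.

```python
# Hypothetical diagnostic sketch, not taken from the setup above.
# It only illustrates the usual SHM-related checks/workarounds.
import shutil
import torch.multiprocessing as mp

# How much shared memory does the node actually have? (standard tmpfs mount on Linux)
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f}GB used={used / 2**30:.1f}GB free={free / 2**30:.1f}GB")

# "file_system" shares tensors via files in a temp dir instead of shm file
# descriptors; it is a common workaround when DataLoader workers exhaust shm.
mp.set_sharing_strategy("file_system")
```

Whether that would help here is unclear, though, since in my case the error only shows up during shutdown.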
