
Stuck on training #63

Open

aaghawaheed opened this issue Dec 26, 2022 · 2 comments

Comments

aaghawaheed commented Dec 26, 2022

[screenshot: console output of the stalled training run]

When I start training on multiple GPUs, the run gets stuck; you can see this in the screenshot above.

aaghawaheed (Author) commented Dec 26, 2022

sh configs/r50_motr_train.sh
/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 0): env://
| distributed init (rank 1): env://

[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809347 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1809351 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 118400) of binary: /home/user/anaconda3/envs/motr2/bin/python3
Traceback (most recent call last):
File "/home/user/anaconda3/envs/motr2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/user/anaconda3/envs/motr2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in
main()
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user/anaconda3/envs/motr2/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2022-12-26_18:00:48
host : user-System-Product-Name
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 118401)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 118401

Root Cause (first observed failure):
[0]:
time : 2022-12-26_18:00:48
host : user-System-Product-Name
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 118400)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 118400
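
Two details in this log may be worth acting on while debugging: the launcher warning means the local rank should be read from the LOCAL_RANK environment variable (torchrun sets it, and --use_env does the same), and the Timeout(ms)=1800000 lines show the NCCL watchdog aborting the very first ALLREDUCE after the default 30-minute collective timeout. A minimal sketch of both adjustments follows (illustrative only, not MOTR's actual main.py; raising the timeout only postpones the abort, it does not fix whatever keeps the first collective from completing):

```python
# Sketch: read the local rank the torchrun way and give the NCCL watchdog a longer
# timeout while debugging. Names here are illustrative, not MOTR's real code.
import datetime
import os

import torch
import torch.distributed as dist


def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun / launch --use_env
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=datetime.timedelta(hours=2),     # default 30 min = the 1800000 ms above
    )
    return local_rank
```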

aaghawaheed (Author) commented:
The process is stuck at torch.distributed.barrier().
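
To see whether the hang is specific to this training script or affects any NCCL collective on this machine, a small standalone test can help (a sketch, assuming 2 local GPUs; save it as e.g. nccl_sanity.py and run it with torchrun --nproc_per_node=2 nccl_sanity.py):

```python
# Hypothetical standalone NCCL check. If this also hangs at all_reduce()/barrier(),
# the problem is in the NCCL / driver / GPU setup rather than in the training code.
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)   # the first collective -- where the watchdog timed out above
    dist.barrier()       # where the training process appears to be stuck
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```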

Here is my env information

PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090

Nvidia driver version: 525.60.11
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.4
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0
[pip3] torchvision==0.14.0
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.23.4 py310hd5efca6_0
[conda] numpy-base 1.23.4 py310h8e6c178_0
[conda] pytorch 1.13.0 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h67b0de4_1 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchaudio 0.13.0 py310_cu117 pytorch
[conda] torchvision 0.14.0 py310_cu117 pytorch
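
Possibly relevant to this hardware: hangs in the very first NCCL collective have been reported on some multi-RTX-4090 machines when peer-to-peer transfers are attempted, since these GPUs do not support GPU-to-GPU P2P. A hedged thing to try is disabling P2P and turning on NCCL's debug output before the processes initialize, either by exporting the variables in the shell before running configs/r50_motr_train.sh or early in the script:

```python
# Hypothetical workaround/diagnostic, set before dist.init_process_group() is called.
# NCCL_P2P_DISABLE=1 is a commonly reported workaround for first-collective hangs on
# multi-RTX-4090 machines; NCCL_DEBUG=INFO makes NCCL log which transport it selects.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")
os.environ.setdefault("NCCL_DEBUG", "INFO")
```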
