running error #18

Open
stu-hu opened this issue Oct 13, 2024 · 1 comment

stu-hu commented Oct 13, 2024

Hello, I am running the code on a server with four A800 GPUs and keep running into errors. Could you please suggest a solution? Thank you very much. This is my error log.

WARNING:__main__:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 54, in main_worker
shutil.copytree('./lib', os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 568, in copytree
return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 467, in _copytree
os.makedirs(dst, exist_ok=dirs_exist_ok)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'output/rope3d/lib/'
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 52, in main_worker
shutil.rmtree(os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 734, in rmtree
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
_rmtree_safe_fd(fd, path, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 667, in _rmtree_safe_fd
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 52, in main_worker
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 667, in _rmtree_safe_fd
shutil.rmtree(os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 734, in rmtree
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
_rmtree_safe_fd(fd, path, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 673, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
onerror(os.rmdir, fullname, sys.exc_info())
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 671, in _rmtree_safe_fd
os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'losses'
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'focal_loss.cpython-39.pyc'
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 54, in main_worker
shutil.copytree('./lib', os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 568, in copytree
return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 522, in _copytree
raise Error(errors)
shutil.Error: [('./lib/losses/loss_function.py', 'output/rope3d/lib/losses/loss_function.py', "[Errno 2] No such file or directory: 'output/rope3d/lib/losses/loss_function.py'")]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 89337) of binary: /202308540003/download/anaconda3/envs/cen/bin/python
Traceback (most recent call last):
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

lib/train_val.py FAILED

Failures:
[1]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 89338)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 89339)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 89340)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 89337)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
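
Reading the traceback above: all four ranks appear to reach the `shutil.rmtree` / `shutil.copytree` calls at lines 52 and 54 of `train_val.py` at the same time and operate on the same `output/rope3d/lib/` directory, which would explain the mix of `FileExistsError`, `Directory not empty` and `No such file or directory`. A minimal sketch of one possible workaround, assuming the copy only needs to happen once per run, is to let rank 0 alone do it. The helper name `snapshot_source` is hypothetical and is not part of the repository.

```python
import os
import shutil


def snapshot_source(src: str, dst: str, rank: int) -> bool:
    """Copy src to dst from rank 0 only; every other rank does nothing.

    Returns True if this process performed the copy.
    """
    if rank != 0:
        return False
    # ignore_errors=True keeps this from failing when dst does not exist yet.
    shutil.rmtree(dst, ignore_errors=True)
    # Bytecode caches are not needed in the snapshot, so leave them out.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("__pycache__", "*.pyc"))
    return True
```

In `main_worker` this might be called as `snapshot_source('./lib', os.path.join(cfg['trainer']['log_dir'], 'lib/'), args.local_rank)`, in place of the two `shutil` calls. The other ranks do not wait for the copy here; if they ever need to read the snapshot, a synchronization point after `dist.init_process_group` would be required as well.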


stu-hu commented Oct 13, 2024

WARNING:__main__:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 68, in main_worker
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
dist.init_process_group(backend='nccl',
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 68, in main_worker
dist.init_process_group(backend='nccl',
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 200, in _tcp_rendezvous_handler
store, rank, world_size = next(rendezvous_iterator)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 200, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 68, in main_worker
dist.init_process_group(backend='nccl',
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 200, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 68, in main_worker
dist.init_process_group(backend='nccl',
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 900, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 200, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 90175) of binary: /202308540003/download/anaconda3/envs/cen/bin/python
Traceback (most recent call last):
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

lib/train_val.py FAILED

Failures:
[1]:
time : 2024-10-13_20:10:28
host : 17hbh40qmr0ll-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 90176)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-13_20:10:28
host : 17hbh40qmr0ll-0
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 90177)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-13_20:10:28
host : 17hbh40qmr0ll-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 90178)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-10-13_20:10:28
host : 17hbh40qmr0ll-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 90175)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
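
In this second log every rank gives up after 1800 s trying to connect to `(127.0.0.1, 1222)`, which suggests that nothing was accepting connections on that port. The log does not show which process, if any, was supposed to listen there, so the following is only a diagnostic sketch: a standard-library check of whether a TCP port is reachable, which can be run on the server while the job is starting. The function name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Calling `port_is_open("127.0.0.1", 1222)` with the address from the log and getting `False` while the workers are waiting would be consistent with the timeout above. It would then be worth comparing the address passed to `dist.init_process_group` at line 68 of `train_val.py` with the master address and port the launcher was started with.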
