running error #18
Comments
WARNING:__main__: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. [E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 1222).
Hello, I am running the code on a server with four A800 GPUs, and it fails with the errors below. Could you please suggest a solution? Thank you very much. This is my error log:
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 54, in main_worker
shutil.copytree('./lib', os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 568, in copytree
return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 467, in _copytree
os.makedirs(dst, exist_ok=dirs_exist_ok)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'output/rope3d/lib/'
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 52, in main_worker
shutil.rmtree(os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 734, in rmtree
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
_rmtree_safe_fd(fd, path, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 667, in _rmtree_safe_fd
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 52, in main_worker
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 667, in _rmtree_safe_fd
shutil.rmtree(os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 734, in rmtree
_rmtree_safe_fd(dirfd, fullname, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 690, in _rmtree_safe_fd
_rmtree_safe_fd(fd, path, onerror)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 673, in _rmtree_safe_fd
onerror(os.unlink, fullname, sys.exc_info())
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 688, in _rmtree_safe_fd
onerror(os.rmdir, fullname, sys.exc_info())
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 671, in _rmtree_safe_fd
os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'losses'
os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'focal_loss.cpython-39.pyc'
Traceback (most recent call last):
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 141, in <module>
main_worker(args.local_rank,args.nprocs, args)
File "/202308540003/Camera/MonoUNI/lib/train_val.py", line 54, in main_worker
shutil.copytree('./lib', os.path.join(cfg['trainer']['log_dir'], 'lib/'))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 568, in copytree
return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/shutil.py", line 522, in _copytree
raise Error(errors)
shutil.Error: [('./lib/losses/loss_function.py', 'output/rope3d/lib/losses/loss_function.py', "[Errno 2] No such file or directory: 'output/rope3d/lib/losses/loss_function.py'")]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 89337) of binary: /202308540003/download/anaconda3/envs/cen/bin/python
Traceback (most recent call last):
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 798, in <module>
main()
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/202308540003/download/anaconda3/envs/cen/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
lib/train_val.py FAILED
Failures:
[1]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 89338)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 89339)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 89340)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-10-13_19:30:40
host : 17hbh40qmr0ll-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 89337)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
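Reading the log above, the four ranks appear to run the `shutil.rmtree` and `shutil.copytree` calls at lines 52 and 54 of `train_val.py` at the same time against the same `output/rope3d/lib/` directory, which would explain the mix of `FileExistsError`, `FileNotFoundError`, and `Directory not empty` errors. The following is a minimal sketch of one possible workaround, assuming the copy is only a source snapshot for logging and can safely be done by a single process. The helper name `snapshot_lib` and its parameters are hypothetical and are not part of MonoUNI.

```python
import os
import shutil


def snapshot_lib(src_dir, log_dir, local_rank):
    """Copy src_dir into <log_dir>/lib from one process only.

    Hypothetical helper (an assumption, not MonoUNI code): it restricts
    the rmtree/copytree pair to local_rank 0 so that several processes
    never delete and recreate the same directory concurrently.
    """
    dst = os.path.join(log_dir, 'lib')
    if local_rank != 0:
        # Other ranks skip the copy so they cannot race with rank 0.
        return None
    # Remove a stale snapshot if one exists, then copy a fresh one.
    shutil.rmtree(dst, ignore_errors=True)
    shutil.copytree(src_dir, dst)
    return dst
```

In `main_worker` this might be called as `snapshot_lib('./lib', cfg['trainer']['log_dir'], local_rank)` in place of the two `shutil` calls. If other ranks need to read the copied files afterwards, a synchronization point such as `torch.distributed.barrier()` after the process group is initialized would probably also be needed; that part is not shown here.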