RuntimeError: Address already in use #180

Open
noparkee opened this issue Jun 17, 2022 · 0 comments

noparkee commented Jun 17, 2022

I tried to run this model to evaluate dummy folders at the same time on one GPU (an A100 with 80 GB of memory).

Evaluating one folder works well. However, if I start a second evaluation while the first one is still running, an error appears.
The problem seems to be in the torch.distributed package.
When I googled it, people said that changing the port number solves this problem.
Do you know how to change the port number in this code?
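
For reference, the traceback in this report fails in `_env_rendezvous_handler` while creating a `TCPStore`, which suggests that both runs try to bind the same master port. `torch.distributed.launch` accepts a `--master_port` option (default 29500), so one possible workaround is to give each run its own port. The sketch below is only an illustration under that assumption: the helper names `find_free_port` and `build_launch_command` are hypothetical and not part of this repository, and the script arguments are placeholders.

```python
import socket
import sys


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def build_launch_command(script_args, nproc_per_node=1):
    """Build a torch.distributed.launch command that uses its own master port."""
    port = find_free_port()
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_port={port}",  # overrides the default port 29500
        "train.py",
        *script_args,
    ]


if __name__ == "__main__":
    # Placeholder arguments; substitute the real evaluation flags.
    cmd = build_launch_command(["--eval", "folder"])
    print(" ".join(cmd))
    # To actually start the run: subprocess.run(cmd, check=True)
```

Note that the port is only known to be free at the moment it is probed, so another process could still claim it before the launcher binds it. Passing a fixed, distinct `--master_port` value to each run by hand avoids that race.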

error message

None
Global Rank: 0 Local Rank: 0
Killing subprocess 659577
Traceback (most recent call last):
  File "train.py", line 299, in <module>
    torch.distributed.init_process_group(backend='nccl',
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--dataset', 'cityscapes', '--cv', '0', '--syncbn', '--apex', '--fp16', '--bs_val', '1', '--eval', 'folder', '--eval_folder', '/workspace/lyft_trainval_images', '--dump_assets', '--dump_all_images', '--n_scales', '0.5,1.0,2.0', '--snapshot', 'large_asset_dir/seg_weights/cityscapes_ocrnet.HRNet_Mscale_outstanding-turtle.pth', '--arch', 'ocrnet.HRNet_Mscale', '--result_dir', 'logs/dump_folder/frisky-serval_2022.06.17_17.15']' returned non-zero exit status 1.