set device_id in torch's init_process_group #7266
base: master
Conversation
deepspeed/comm/torch.py (Outdated)

 torch.distributed.init_process_group(backend,
                                      timeout=timeout,
                                      init_method=init_method,
                                      rank=rank,
-                                     world_size=world_size)
+                                     world_size=world_size,
+                                     device_id=torch.device('cuda', local_rank))
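For context, here is a minimal, self-contained sketch of the pattern this diff enables, written outside DeepSpeed's wrapper; it assumes the usual torchrun-style environment variables and a torch version recent enough to accept the device_id keyword:

```python
# Sketch only (not DeepSpeed's actual code path): bind the default process group
# to the device that corresponds to LOCAL_RANK.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / the launcher
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
    timeout=timedelta(minutes=30),
    device_id=device,  # tells torch which device this rank owns
)
```

With device_id supplied, torch.distributed knows up front which device each rank owns rather than inferring it at the first collective.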
@stas00 - the hard-coded cuda here will cause failures on non-CUDA backends like HPU (not sure why the tests didn't run, but I ran them manually here: https://github.com/deepspeedai/DeepSpeed/actions/runs/14886572284/job/41807642413)
aha, thank you so much for seeing the big picture, @loadams

so we need something like:

device = torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu')

or should I just add device_id only if torch.cuda.is_available() and do nothing otherwise? I mean, I don't know what device to use in the case of HPU if it's not cpu.
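To make the second option concrete, here is a hedged sketch of passing device_id only when CUDA is available; the helper name is illustrative, not the actual patch:

```python
# Illustrative helper: forward device_id only when CUDA is present, so
# non-CUDA backends (e.g. HPU) keep the previous behavior unchanged.
import torch
import torch.distributed as dist


def init_process_group_with_optional_device(backend, timeout, init_method, rank, world_size, local_rank):
    kwargs = dict(
        backend=backend,
        timeout=timeout,
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )
    if torch.cuda.is_available():
        # Only in the CUDA case do we know which device_id to bind.
        kwargs["device_id"] = torch.device("cuda", local_rank)
    dist.init_process_group(**kwargs)
```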
Could we use get_accelerator() here?
whatever works - could you please show what you have in mind specifically for filling out device_id=torch.device('cuda', local_rank)?
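For illustration only, a hypothetical get_accelerator()-based variant might look like the sketch below; it assumes the accelerator abstraction's is_available() and device_name(index) behave as described in the comments, and it is not the code that was actually proposed:

```python
# Hypothetical sketch: derive device_id from DeepSpeed's accelerator abstraction
# instead of hard-coding 'cuda'. Assumes device_name(local_rank) returns a
# device string such as 'cuda:3' or 'hpu:3'.
import torch
from deepspeed.accelerator import get_accelerator


def device_id_for(local_rank):
    accel = get_accelerator()
    if not accel.is_available():
        return None  # caller would then omit device_id entirely
    return torch.device(accel.device_name(local_rank))
```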
Nope, still randomly deadlocks:
Thread 474480 (idle): "MainThread"
broadcast (torch/distributed/distributed_c10d.py:2772)
wrapper (torch/distributed/c10d_logger.py:81)
broadcast (deepspeed/comm/torch.py:216)
broadcast (deepspeed/comm/comm.py:224)
log_wrapper (deepspeed/comm/comm.py:117)
_zero_init_param (deepspeed/runtime/zero/partition_parameters.py:1054)
_post_init_method (deepspeed/runtime/zero/partition_parameters.py:1099)
wrapper (deepspeed/runtime/zero/partition_parameters.py:521)
__init__ (transformers/models/llama/modeling_llama.py:166)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
__init__ (transformers/models/llama/modeling_llama.py:297)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
<listcomp> (transformers/models/llama/modeling_llama.py:477)
__init__ (transformers/models/llama/modeling_llama.py:477)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
__init__ (transformers/models/llama/modeling_llama.py:740)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
from_pretrained (transformers/modeling_utils.py:4340)
_wrapper (transformers/modeling_utils.py:279)
from_pretrained (transformers/models/auto/auto_factory.py:571)
from_pretrained (liger_kernel/transformers/auto_model.py:38)
create_model (arctic_training/model/liger_factory.py:45)
wrapper (arctic_training/callback/mixin.py:45)
__call__ (arctic_training/model/factory.py:68)
__init__ (arctic_training/trainer/trainer.py:228)
wrapper (arctic_training/callback/mixin.py:45)
run_script (arctic_training/cli.py:108)
<module> (arctic_training_run:8)
Thread 476034 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Process 474481: /usr/bin/python -u /home/yak/.local/bin/arctic_training_run --local_rank=6 --mode train --config run-dp1-sp8.yml
Python v3.10.12 (/usr/bin/python3.10)
Thread 474481 (active): "MainThread"
wrapped_fn (deepspeed/runtime/zero/partition_parameters.py:240)
_compute_default_rope_parameters (transformers/modeling_rope_utils.py:130)
__init__ (transformers/models/llama/modeling_llama.py:106)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
__init__ (transformers/models/llama/modeling_llama.py:480)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
__init__ (transformers/models/llama/modeling_llama.py:740)
wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
from_pretrained (transformers/modeling_utils.py:4340)
_wrapper (transformers/modeling_utils.py:279)
from_pretrained (transformers/models/auto/auto_factory.py:571)
from_pretrained (liger_kernel/transformers/auto_model.py:38)
create_model (arctic_training/model/liger_factory.py:45)
wrapper (arctic_training/callback/mixin.py:45)
__call__ (arctic_training/model/factory.py:68)
__init__ (arctic_training/trainer/trainer.py:228)
wrapper (arctic_training/callback/mixin.py:45)
run_script (arctic_training/cli.py:108)
<module> (arctic_training_run:8)
Thread 476031 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:607)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1016)
_bootstrap (threading.py:973)
Process 474482: /usr/bin/python -u /home/yak/.local/bin/arctic_training_run --local_rank=7 --mode train --config run-dp1-sp8.yml
Python v3.10.12 (/usr/bin/python3.10)
so let's definitely not merge this - I'm asking the PyTorch folks for help.
Good point. Yes, let's align with HF Transformers. Thanks.

@loadams, can you help with this?

@sfc-gh-truwase @stas00 - yes, I think we should do something like this. A minimum of 2.1 would be good. Agreed that 2.3 might be a bit rushed, but let me check what CUDA/GPU versions that implies as well.
Further investigation shows that the deadlocks start at torch>=2.7.0 - it's difficult to debug since the deadlocks aren't always reproducible, but they usually happen after 3-6 re-runs.
So I'm switching this to draft for now, as we can't commit it until we know how to work around this. I'm actively pursuing it with the PyTorch devs.

A seemingly related issue is: modded-nanogpt flaky NCCL hang starting 3/30 nightly
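Not part of this PR, but one hypothetical stopgap while the deadlock is being investigated would be to gate the device_id argument on the torch version; a minimal sketch, assuming 2.7.0 as the cutoff reported above:

```python
# Hypothetical workaround sketch: only return a device_id on torch versions
# where the device_id-related deadlock has not been observed (< 2.7.0).
import torch
from packaging import version


def maybe_device_id(local_rank):
    if torch.cuda.is_available() and version.parse(torch.__version__) < version.parse("2.7.0"):
        return torch.device("cuda", local_rank)
    return None  # omit device_id from the init_process_group kwargs
```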
Signed-off-by: Stas Bekman <[email protected]>
This PR overcomes this issue when using any torch.distributed calls w/ deepspeed, by setting device_id to the correct device corresponding to the LOCAL_RANK env var.

Update: discovered torch.dist deadlocks with torch>=2.7.0 when using the device_id arg - switching to draft for now as we can't commit this until we know how to work around this.