
set device_id in torch's init_process_group #7266


Draft
wants to merge 7 commits into base: master

Conversation

@stas00 (Collaborator) commented Apr 30, 2025

This PR addresses the following issue that arises when making any torch.distributed call with DeepSpeed:

[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 
to perform barrier as devices used by this process are currently unknown. This can
 potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
 barrier() to force use of a particular device, or call init_process_group() with a device_id.

It does so by setting device_id to the device corresponding to the LOCAL_RANK env var.
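
For context, a minimal sketch of the idea (an illustration, not the exact patch): read the LOCAL_RANK env var set by the launcher and build the torch.device that gets passed as device_id.

```python
# Hedged illustration only: derive the device from LOCAL_RANK (set by the
# launcher) and hand it to init_process_group via the device_id argument.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device_id = torch.device("cuda", local_rank)
# torch.distributed.init_process_group(..., device_id=device_id)
```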


Update: discovered that torch.distributed deadlocks with torch>=2.7.0 when the device_id arg is used. Switching to draft for now, as we can't commit this until we know how to work around it.

@stas00 stas00 requested a review from GuanhuaWang as a code owner April 30, 2025 19:09
@stas00 stas00 requested review from loadams and removed request for GuanhuaWang April 30, 2025 19:09
@stas00 (Collaborator, Author) commented May 6, 2025

@loadams?

@loadams (Collaborator) commented May 7, 2025

> @loadams?

Sorry @stas00, I missed this and will review today.

torch.distributed.init_process_group(backend,
                                     timeout=timeout,
                                     init_method=init_method,
                                     rank=rank,
-                                    world_size=world_size)
+                                    world_size=world_size,
+                                    device_id=torch.device('cuda', local_rank))
@loadams (Collaborator):

@stas00 - the cuda here will cause failures on non-cuda backends like HPU (not sure why the tests didn't run, but ran manually here: https://github.com/deepspeedai/DeepSpeed/actions/runs/14886572284/job/41807642413)

@stas00 (Collaborator, Author), May 7, 2025:

aha, thank you so much for seeing the big picture, @loadams

so we need something like:

device = torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu')

or should I add device_id only if torch.cuda.is_available() and do nothing otherwise? I don't know what device to use in the HPU case if it's not cpu.
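
Roughly, the second option would look like the following sketch (just an illustration, assuming backend, timeout, init_method, rank, world_size and local_rank are already in scope):

```python
# Sketch of the "only pass device_id when CUDA is available" option;
# all other kwargs stay exactly as they are today.
kwargs = dict(timeout=timeout, init_method=init_method, rank=rank, world_size=world_size)
if torch.cuda.is_available():
    kwargs["device_id"] = torch.device("cuda", local_rank)
torch.distributed.init_process_group(backend, **kwargs)
```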

Collaborator:

Could we use get_accelerator() here?

@stas00 (Collaborator, Author):

Whatever works. Could you please show what you have in mind specifically for filling in:

device_id=torch.device('cuda', local_rank)?

Collaborator:

I think what is needed is get_accelerator().device(local_rank).

For cuda this maps to torch.cuda.device(device_index)

@stas00, does that work?
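
For illustration, a sketch of how that wiring could look, going through the accelerator abstraction to build the torch.device that init_process_group's device_id expects (assuming get_accelerator().device_name(local_rank) returns a device string such as 'cuda:3' or 'hpu:3'; this is not the committed patch):

```python
# Hedged sketch: build a backend-agnostic torch.device for device_id via the
# DeepSpeed accelerator abstraction instead of hard-coding 'cuda'.
import os
import torch
from deepspeed.accelerator import get_accelerator

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device_id = torch.device(get_accelerator().device_name(local_rank))
# torch.distributed.init_process_group(backend, ..., device_id=device_id)
```

The sketch wraps the accelerator's device name in torch.device rather than passing the object returned by get_accelerator().device(...), since on CUDA the latter maps to the torch.cuda.device context manager as noted above.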

@stas00 (Collaborator, Author):

Nope, still randomly deadlocks:

Thread 474480 (idle): "MainThread"
    broadcast (torch/distributed/distributed_c10d.py:2772)
    wrapper (torch/distributed/c10d_logger.py:81)
    broadcast (deepspeed/comm/torch.py:216)
    broadcast (deepspeed/comm/comm.py:224)
    log_wrapper (deepspeed/comm/comm.py:117)
    _zero_init_param (deepspeed/runtime/zero/partition_parameters.py:1054)
    _post_init_method (deepspeed/runtime/zero/partition_parameters.py:1099)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:521)
    __init__ (transformers/models/llama/modeling_llama.py:166)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:297)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    <listcomp> (transformers/models/llama/modeling_llama.py:477)
    __init__ (transformers/models/llama/modeling_llama.py:477)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:740)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    from_pretrained (transformers/modeling_utils.py:4340)
    _wrapper (transformers/modeling_utils.py:279)
    from_pretrained (transformers/models/auto/auto_factory.py:571)
    from_pretrained (liger_kernel/transformers/auto_model.py:38)
    create_model (arctic_training/model/liger_factory.py:45)
    wrapper (arctic_training/callback/mixin.py:45)
    __call__ (arctic_training/model/factory.py:68)
    __init__ (arctic_training/trainer/trainer.py:228)
    wrapper (arctic_training/callback/mixin.py:45)
    run_script (arctic_training/cli.py:108)
    <module> (arctic_training_run:8)
Thread 476034 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 474481: /usr/bin/python -u /home/yak/.local/bin/arctic_training_run --local_rank=6 --mode train --config run-dp1-sp8.yml
Python v3.10.12 (/usr/bin/python3.10)

Thread 474481 (active): "MainThread"
    wrapped_fn (deepspeed/runtime/zero/partition_parameters.py:240)
    _compute_default_rope_parameters (transformers/modeling_rope_utils.py:130)
    __init__ (transformers/models/llama/modeling_llama.py:106)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:480)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:740)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    from_pretrained (transformers/modeling_utils.py:4340)
    _wrapper (transformers/modeling_utils.py:279)
    from_pretrained (transformers/models/auto/auto_factory.py:571)
    from_pretrained (liger_kernel/transformers/auto_model.py:38)
    create_model (arctic_training/model/liger_factory.py:45)
    wrapper (arctic_training/callback/mixin.py:45)
    __call__ (arctic_training/model/factory.py:68)
    __init__ (arctic_training/trainer/trainer.py:228)
    wrapper (arctic_training/callback/mixin.py:45)
    run_script (arctic_training/cli.py:108)
    <module> (arctic_training_run:8)
Thread 476031 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 474482: /usr/bin/python -u /home/yak/.local/bin/arctic_training_run --local_rank=7 --mode train --config run-dp1-sp8.yml
Python v3.10.12 (/usr/bin/python3.10)

@stas00 (Collaborator, Author):

so definitely let's not merge this - asking pytorch folks for help.

Collaborator:

Good point. Yes, let's align with HF Transformers. Thanks.

@loadams can you help with this?

@loadams: @sfc-gh-truwase @stas00 - yes, I think we should do something like this. Min 2.1 would be good. Agreed 2.3 might be a bit rushed, but let me check what CUDA/GPU versions that implies as well.

@stas00 (Collaborator, Author):

Further investigation shows the deadlocks start with torch>=2.7.0. It's difficult to debug since they aren't always reproducible, but they usually happen within 3-6 re-runs.

@stas00 (Collaborator, Author):

So I'm switching to draft for now, as we can't commit this until we know how to work around it. I'm actively pursuing this with the pytorch devs.

A seemingly related issue is: modded-nanogpt flaky NCCL hang starting 3/30 nightly

stas00 added 2 commits May 15, 2025 11:10
Signed-off-by: Stas Bekman <[email protected]>
@stas00 stas00 marked this pull request as draft May 15, 2025 22:15