[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs #6772

traincheck-team · 2024-11-20T23:05:43Z

Description

We experienced wrong GPU placement when doing MoE with ZeRO Stage 3. We use module.id to control which expert to be loaded onto which GPU for finegrained controlm and we find out that module.id got corrupted after deepspeed.initialize.

Suspected Root Cause

DeepSpeed uses .id in ZeRO Stage 3 optimization to manage states, as seen in runtime/zero/parameter_offload.py:L271.

This practice is very brittle in that:

id is an overly generic attribute name, might get easilly collided with some user-defined attributes.
There's no special check on .id attribute before setting it, this allows for accidental overwrites of the attribute, causing hard-to-diagnose problems.

In the specific bug we've encountered (bug.py provided below), each expert module is identified by the .id attribute, but during initialization, the .id is overwritten by the _register_hooks_recursively function in deepspeed/runtime/zero/stage3.py, leading to a mess on expert-GPU placement.

To reproduce

The following code in ZeRO Stage 3 is responsible for overwriting the .id attribute:

Install deepspeed 0.15.4
run bug.py using deepspeed --num_gpus=2 bug.py (num_gpus argument here doesn't matter, use 1 if you don't have multigpu nodes.)

import torch
import deepspeed
from torch.nn import Module, Linear

# Define a simple expert module
class Expert(Module):
    def __init__(self, id):
        super().__init__()
        self.id = id  # ID for custom GPU placement
        self.fc = Linear(128, 128)

    def forward(self, x):
        return self.fc(x)

# Create a model with 60 experts
class MoEModel(Module):
    def __init__(self):
        super().__init__()
        self.experts = torch.nn.ModuleList([Expert(i) for i in range(60)])
    def forward(self, x, expert_id):
        return self.experts[expert_id](x)

# Helper function to log expert ids
def log_expert_ids(model, rank):
    loaded_experts = [e.id for e in model.experts]

def main():
    deepspeed.init_distributed()
    rank = torch.distributed.get_rank()

    # Create model
    model = MoEModel()
    log_expert_ids(model, rank)  # prints 0, 1, 2, .., 59

    # Configure DeepSpeed
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=torch.optim.Adam(model.parameters(), lr=3e-5),
        config={
            "train_micro_batch_size_per_gpu": 1,
            "gradient_accumulation_steps": 1,
            "steps_per_print": 1,
            "zero_optimization": {"stage": 3,}
        }
    )

    # print model ids again after deepspeed.initialize
    log_expert_ids(model, rank)  # prints 0, 2, 4, 6, ...

    # if you do a deepspeed.intialize here again, you will see the id itself completely messed up.

    dummy_input = torch.randn(1, 128).cuda(rank)
    for expert_id in range(60):
        model_engine(dummy_input, expert_id=expert_id)

if __name__ == "__main__":
    main()

We print ids of all experts twice, one before deepspeed.initialize and one after that. Observe that the first print gives 0, 1, 2, ..., 59 while the second one gives 2, 4, 6, 8, .., 120

In this code, module.id is set to a value based on a counter (my_count), which conflicts with user-defined .id attributes used for expert placement.

Bug Significance

This bug can significantly affect model behavior when expert modules are incorrectly placed across GPUs, leading to incorrect training outcomes or potential crashes. Ensuring that internal DeepSpeed modifications do not overwrite user-defined attributes is crucial for stability and expected functionality.

Even if user-side conflicts are not in your scope, deepspeed itself can accidently modify these attributes as well. For example, you can reproduce the same problem by calling deepspeed.initialize multiple times.

Thus, we argue for two fixes / engineering practices for this issue.

Expected Behavior / Suggested Fix

Use a Specific Attribute for Internal IDs: Instead of overwriting .id, use a more specific attribute name such as _deepspeed_id to avoid conflicts with user-defined attributes.
Restrict Attribute Modification: Modify the __setattr__ method to only allow setting fields that have not been previously set, preventing unintentional overwrites of user-defined attributes.
**Forbid Duplicated deepspeed.initialize: We observe a lot of issue with accidental duplicate calls to deepspeed.initialize. Thus we suggest to forbid duplicate calls by recording the models / optimizers that have already been inited, as mentioned in [BUG] [Fix-Suggested] KeyError in stage_1_and_2.py Due to Optimizer-Model Parameter Mismatch #6770 .

ds_report output

Click to Show

collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xxx/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/xxx/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.15.4, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.3
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 31.24 GB

I will be more than happy to contribute to the two suggested fixes, let me know what you think!

The text was updated successfully, but these errors were encountered:

tjruwase · 2024-11-22T14:25:43Z

@traincheck-team, thanks for providing detailed report and suggesting reasonable solutions. We will work on this immediately.

tjruwase · 2024-12-10T16:39:05Z

@traincheck-team, please see #6847. I simplified your repro into a simple unit test. Please advise if this simplification is missing crucial aspects of this issue. Thanks!

traincheck-team · 2024-12-10T16:47:50Z

Thanks for the fast response! @tjruwase
The test and implementation look good, and it guards against DeepSpeed modifying user-defined attributes.
I think the missing piece is to guard against DeepSpeed itself corrupting that ds_id attribute due to repeated initialization, but that should be handled in a separate PR that we've agreed to implement in #6770. We will try to ship that PR soon.

tjruwase · 2024-12-10T17:20:06Z

Thanks for the quick review.

I think the missing piece is to guard against DeepSpeed itself corrupting that ds_id attribute due to repeated initialization, but that should be handled in a separate PR that we've agreed to implement in #6770. We will try to ship that PR soon.

Yes, for this problem, I think your PR is best solution as discussed. Thanks!

Fix #6772 --------- Co-authored-by: Logan Adams <[email protected]>

Fix #6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]>

Fix deepspeedai#6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: siqi <[email protected]>

Fix deepspeedai#6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Bruno Magalhaes <[email protected]>

traincheck-team added bug Something isn't working training labels Nov 20, 2024

tjruwase self-assigned this Nov 22, 2024

tjruwase mentioned this issue Dec 10, 2024

Use ds-specific module id to avoid conflicts #6847

Merged

This was referenced Dec 10, 2024

Fix: ensure deepspeed.initialize can only be called once. #6848

Closed

Fix: forbid repeated deepspeed.initialize on training objects #6874

Open

github-merge-queue bot pushed a commit that referenced this issue Jan 31, 2025

Use ds-specific module id to avoid conflicts (#6847)

029e0a3

Fix #6772 --------- Co-authored-by: Logan Adams <[email protected]>

tjruwase closed this as completed in #6847 Jan 31, 2025

tjruwase added a commit that referenced this issue Feb 6, 2025

Use ds-specific module id to avoid conflicts (#6847)

b0b0132

Fix #6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]>

siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this issue Feb 7, 2025

Use ds-specific module id to avoid conflicts (deepspeedai#6847)

c97ba7f

Fix deepspeedai#6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: siqi <[email protected]>

bm-synth pushed a commit to bm-synth/DeepSpeed that referenced this issue Feb 8, 2025

Use ds-specific module id to avoid conflicts (deepspeedai#6847)

d6f9e6d

Fix deepspeedai#6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Bruno Magalhaes <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs #6772

[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs #6772

traincheck-team commented Nov 20, 2024

tjruwase commented Nov 22, 2024

tjruwase commented Dec 10, 2024

traincheck-team commented Dec 10, 2024 •

edited

Loading

tjruwase commented Dec 10, 2024

[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs #6772

[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs #6772

Comments

traincheck-team commented Nov 20, 2024

Description

Suspected Root Cause

To reproduce

Bug Significance

Expected Behavior / Suggested Fix

ds_report output

tjruwase commented Nov 22, 2024

tjruwase commented Dec 10, 2024

traincheck-team commented Dec 10, 2024 • edited Loading

tjruwase commented Dec 10, 2024

traincheck-team commented Dec 10, 2024 •

edited

Loading