How to make allreduce fully-overlapped in ZeRO-2 #7009

Open
2012zzhao opened this issue Feb 6, 2025 · 1 comment

Comments

@2012zzhao commented Feb 6, 2025

[TARGET]
I am trying to overlap inter-worker communication with backward computation as much as possible, in order to boost the performance of ZeRO-2.

[ISSUE]
To my understanding of ZeRO-2, reduce-scatter communication can be overlapped with backward computation; this is supported by issuing the communication asynchronously on a separate CUDA stream (a minimal sketch of this pattern is included below).

FYI:
related parameters in DS_CONFIG
related designs in runtime.zero.stage_1_and_2.py
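
For reference, this is the general pattern I mean by stream-based overlap. It is only a minimal PyTorch sketch meant to run under torchrun, not DeepSpeed's actual code; reduce_bucket_async is a hypothetical helper name.

#------ stream-overlap sketch (not DeepSpeed code) ------#

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

comm_stream = torch.cuda.Stream()

def reduce_bucket_async(bucket: torch.Tensor) -> None:
    # The communication stream must wait for the kernels that produced
    # `bucket` on the computation stream ...
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(bucket)
    # ... but the computation stream does not need to wait here; it only has
    # to be synchronized before the optimizer reads the reduced gradients.
    # record_stream keeps the bucket's memory alive until the collective
    # queued on comm_stream has finished.
    bucket.record_stream(comm_stream)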

Communication overlapping did work with the setup below, but not to a satisfying degree: when profiling the actual communications, I found that only part of the allreduce was covered by computation. To my knowledge, the allreduce in ZeRO-2 has no dependency between adjacent nn modules (model parallelism was not turned on), so full overlap should be achievable in theory.

#------ ZeRO-2 DS_CONFIG ------#

{
  "train_batch_size": 65536,
  "gradient_accumulation_steps": 1,
  "optimizer": {
      "type": "Adam",
      "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e6
  }
}
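
For completeness, this is roughly how the config above is wired into training. It is only a sketch with a tiny stand-in model and random data; my real workload is much larger and launched with the deepspeed launcher.

#------ training sketch using the config above ------#

import torch
import deepspeed

# Tiny stand-in model; the real model is much larger.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # the ZeRO-2 config shown above
)

for _ in range(10):
    # fp16 is enabled in the config, so inputs are cast to half;
    # the batch dimension here is arbitrary for the sketch.
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)  # reduce-scatter / allreduce is launched during backward
    engine.step()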

[BUG]
When diving into runtime.zero.stage_1_and_2.py, I found that there is a mutual wait between the computation stream and the allreduce stream before gradients are bucketed and communicated. Since the CUDA device always returns False from get_accelerator().resolves_data_dependency(), the computation stream (get_accelerator().current_stream()) and the communication stream (self.reduction_stream) always have to wait for each other until the two are synchronized.
I wonder why the computation stream has to wait until the reduction stream has finished; this prevents the subsequent backward computation from overlapping with the allreduce of the previous gradients. I believe the performance of ZeRO-2 could be further boosted if this restriction were relaxed without conflicting with other setups.
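
To make the point concrete, here is a paraphrase of the synchronization I am describing (a minimal sketch, not a verbatim copy of stage_1_and_2.py):

#------ cross-stream wait sketch (paraphrase, not the actual source) ------#

import torch

comp_stream = torch.cuda.current_stream()
reduction_stream = torch.cuda.Stream()

# Necessary: the reduction must not start before the gradients it reads
# have been produced on the computation stream.
reduction_stream.wait_stream(comp_stream)

# This is the wait I am questioning: it stalls the next backward kernels
# until the previous bucket's communication has finished, which matches the
# EVENT_WAIT gap I see in the trace.
comp_stream.wait_stream(reduction_stream)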

@2012zzhao (Author)

Due to personal issues I am not able to upload the profiling data here, but I will try my best to describe the situation I found in the profiling and tracing data:

Computation stream view: there is always an EVENT_WAIT operation just after the ReduceSum operation of the previous nn module and just before the backward computation of the next nn module.

XCCL view: there is always an allReduce that is not overlapped at all, running during the same window as the EVENT_WAIT, while the following allReduce can be covered by the next backward computation (I guess this is thanks to the bucketing mechanism). This pattern repeats periodically and can be observed at each junction between nn modules.
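
For anyone who wants to reproduce a similar stream-level timeline on CUDA devices, a torch.profiler capture along these lines should show the same gap. This is only a sketch; the dummy step below stands in for one DeepSpeed forward/backward/step.

#------ profiling sketch (dummy step, for illustration only) ------#

import torch
from torch.profiler import ProfilerActivity, profile, schedule

x = torch.randn(4096, 4096, device="cuda")

def train_step():
    # Placeholder for engine(batch), engine.backward(loss), engine.step().
    return x @ x

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=lambda p: p.export_chrome_trace(f"trace_step{p.step_num}.json"),
) as prof:
    for _ in range(6):
        train_step()
        prof.step()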
