How to make allreduce fully-overlapped in ZeRO-2 #7009

Open
2012zzhao opened this issue Feb 6, 2025 · 1 comment

Comments

@2012zzhao commented Feb 6, 2025

[TARGET]
I am trying to overlap inter-worker communication with backward computation as much as possible, in order to boost the performance of ZeRO-2.

[ISSUE]
To my understanding of ZeRO-2, reduce-scatter communication can be overlapped with backward computation; this is supported by issuing the communication asynchronously on a separate CUDA stream (a minimal sketch of this pattern is included below).

FYI:
related parameters in DS_CONFIG
related designs in runtime.zero.stage_1_and_2.py
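
For reference, this is the general pattern I mean by stream-based overlap. It is only a minimal PyTorch sketch meant to run under torchrun, not DeepSpeed's actual code; reduce_bucket_async is a hypothetical helper name.

#------ stream-overlap sketch (not DeepSpeed code) ------#

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

comm_stream = torch.cuda.Stream()

def reduce_bucket_async(bucket: torch.Tensor) -> None:
    # The communication stream must wait for the kernels that produced
    # `bucket` on the computation stream ...
    comm_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(comm_stream):
        dist.all_reduce(bucket)
    # ... but the computation stream does not need to wait here; it only has
    # to be synchronized before the optimizer reads the reduced gradients.
    # record_stream keeps the bucket's memory alive until the collective
    # queued on comm_stream has finished.
    bucket.record_stream(comm_stream)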

Communication overlapping did work with the setup below, but not to a satisfying degree: when profiling the actual communications, I found that only part of the allreduce was covered by computation. To my knowledge, the allreduce in ZeRO-2 has no dependency between adjacent nn modules (model parallelism was not turned on), so full overlap should be achievable in theory.

#------ ZeRO-2 DS_CONFIG ------#

{
  "train_batch_size": 65536,
  "gradient_accumulation_steps": 1,
  "optimizer": {
      "type": "Adam",
      "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e6
  }
}
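
For completeness, this is roughly how the config above is wired into training. It is only a sketch with a tiny stand-in model and random data; my real workload is much larger and launched with the deepspeed launcher.

#------ training sketch using the config above ------#

import torch
import deepspeed

# Tiny stand-in model; the real model is much larger.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # the ZeRO-2 config shown above
)

for _ in range(10):
    # fp16 is enabled in the config, so inputs are cast to half;
    # the batch dimension here is arbitrary for the sketch.
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)  # reduce-scatter / allreduce is launched during backward
    engine.step()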

[BUG]
When diving into runtime.zero.stage_1_and_2.py, I found that there is a mutual wait between the computation stream and the allreduce stream before gradients are bucketed and communicated. Since the CUDA device always returns False from get_accelerator().resolves_data_dependency(), the computation stream (get_accelerator().current_stream()) and the communication stream (self.reduction_stream) always have to wait for each other until the two are synchronized.
I wonder why the computation stream has to wait until the reduction stream has finished; this prevents the subsequent backward computation from overlapping with the allreduce of the previous gradients. I believe the performance of ZeRO-2 could be further boosted if this restriction were relaxed without conflicting with other setups.
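
To make the point concrete, here is a paraphrase of the synchronization I am describing (a minimal sketch, not a verbatim copy of stage_1_and_2.py):

#------ cross-stream wait sketch (paraphrase, not the actual source) ------#

import torch

comp_stream = torch.cuda.current_stream()
reduction_stream = torch.cuda.Stream()

# Necessary: the reduction must not start before the gradients it reads
# have been produced on the computation stream.
reduction_stream.wait_stream(comp_stream)

# This is the wait I am questioning: it stalls the next backward kernels
# until the previous bucket's communication has finished, which matches the
# EVENT_WAIT gap I see in the trace.
comp_stream.wait_stream(reduction_stream)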

@2012zzhao (Author)

Due to personal issues I am not able to upload the profiling data here, but I will try my best to describe the situation I found in the profiling and tracing data:

Computation stream view: there is always an EVENT_WAIT operation just after the ReduceSum operation of the previous nn module and just before the backward computation of the next nn module.

XCCL view: there is always an allReduce that is not overlapped at all, running during the same window as the EVENT_WAIT, while the following allReduce can be covered by the next backward computation (I guess this is thanks to the bucketing mechanism). This pattern repeats periodically and can be observed at each junction between nn modules.
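
For anyone who wants to reproduce a similar stream-level timeline on CUDA devices, a torch.profiler capture along these lines should show the same gap. This is only a sketch; the dummy step below stands in for one DeepSpeed forward/backward/step.

#------ profiling sketch (dummy step, for illustration only) ------#

import torch
from torch.profiler import ProfilerActivity, profile, schedule

x = torch.randn(4096, 4096, device="cuda")

def train_step():
    # Placeholder for engine(batch), engine.backward(loss), engine.step().
    return x @ x

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=lambda p: p.export_chrome_trace(f"trace_step{p.step_num}.json"),
) as prof:
    for _ in range(6):
        train_step()
        prof.step()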
