[TARGET]
I am trying to overlap inter-worker communication as much as possible in order to boost the performance of ZeRO-2.
[ISSUE]
According to my understanding of ZeRO-2, reduce-scatter communication can be overlapped with backward computation, which is supported by running the reduction on a separate CUDA stream asynchronously.
F.Y.I.:
related parameters in DS_CONFIG
related designs in runtime.zero.stage_1_and_2.py
Communication overlapping did work under the following setup, but not in a satisfying way: when profiling the actual communication I found that only part of the allreduce was covered by computation. To my knowledge, the allreduce in ZeRO-2 has no dependencies between adjacent nn modules (model parallelism was not turned on), so full overlap should be achievable in theory.
#------ ZeRO-2 DS_CONFIG ------#
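A minimal sketch of the ZeRO-2 settings that are relevant to overlapping (the values below are placeholders for illustration, not my exact configuration):

```python
# Sketch of a ZeRO-2 config with communication overlapping enabled.
# The values are placeholders, not my actual settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,        # reduce gradients on a side stream during backward
        "contiguous_gradients": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,   # how many elements are bucketed per reduction
        "allgather_bucket_size": 5e8,
    },
}
```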
[BUG]
When diving into runtime.zero.stage_1_and_2.py, I found that there is a lock between the computation stream and the allreduce stream before bucketing and gradient communication. Since a CUDA device always returns False from get_accelerator().resolves_data_dependency(), the computation stream (get_accelerator().current_stream()) and the communication stream (self.reduction_stream) always have to wait for each other until they are synchronized.
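In standalone PyTorch terms, the locking pattern I am describing looks roughly like the sketch below (illustrative only, not the actual DeepSpeed code; it assumes torch.distributed is already initialized and a CUDA device is selected):

```python
import torch
import torch.distributed as dist

# Illustrative sketch of the cross-stream waits described above -- NOT the
# actual DeepSpeed code. Assumes dist.init_process_group() has been called
# and the default device is a CUDA GPU.
comp_stream = torch.cuda.current_stream()
reduction_stream = torch.cuda.Stream()

def reduce_bucket(bucket_grads):
    # Necessary direction: the reduction stream must wait until backward has
    # actually produced the gradients in this bucket.
    reduction_stream.wait_stream(comp_stream)
    with torch.cuda.stream(reduction_stream):
        for grad in bucket_grads:
            dist.all_reduce(grad)
    # The wait I am questioning: the computation stream also waits on the
    # reduction stream, so the next module's backward cannot start until this
    # bucket's allreduce has finished -- the two streams end up serialized.
    comp_stream.wait_stream(reduction_stream)
```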
I wonder why the computation stream has to wait until the reduction stream has finished, which prevents the following backward computation from overlapping with the allreduce of the previous gradients in runtime.zero.stage_1_and_2.py. I believe the performance of ZeRO-2 could be boosted further if this restriction were relaxed without conflicting with other setups.
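What I would hope for instead is a one-way dependency: record an event on the reduction stream and only wait on it where the reduced gradients are actually consumed, e.g. right before the optimizer step. Again, this is only a sketch of the idea (it assumes the bucket buffers are not immediately reused by the next backward pass), not a patch:

```python
import torch
import torch.distributed as dist

# Sketch of the relaxed, one-way dependency I have in mind (idea only, not a
# patch). Assumes the bucket buffers are not reused by the next backward pass
# while the reduction is still in flight.
comp_stream = torch.cuda.current_stream()
reduction_stream = torch.cuda.Stream()
bucket_done = torch.cuda.Event()

def reduce_bucket_relaxed(bucket_grads):
    reduction_stream.wait_stream(comp_stream)  # still needed: grads must exist
    with torch.cuda.stream(reduction_stream):
        for grad in bucket_grads:
            dist.all_reduce(grad)
        bucket_done.record(reduction_stream)
    # No comp_stream.wait_stream(reduction_stream) here, so the next backward
    # can overlap with this bucket's allreduce.

def before_optimizer_step():
    # Block only where the reduced gradients are actually needed.
    torch.cuda.current_stream().wait_event(bucket_done)
```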
Due to personal issues I am not able to upload the profiling data here, but I will try my best to illustrate the situation I found in the profiling and tracing data:
Computation stream view: there is always an EVENT_WAIT operation just after the ReduceSum of the previous nn module and just before the backward computation of the next nn module.
XCCL view: there is always an allReduce that is not overlapped at all during the same window as the EVENT_WAIT, while the following allReduce can be covered by the next backward computation (I guess this is thanks to the bucketing mechanism). This pattern happens periodically and can be observed at each junction between nn modules.