DDP degrades the performance #8
Comments
I have the same problem. When running this code on 8 x V100 (16 GB), I also got degraded results.
Thanks for pointing out the issue. I remember I didn't have this issue when I tried DDP. I will check on this soon.
I just found out that DDP works well with full fine-tuning but works worse with parameter-efficient transfer learning methods. I will further investigate this issue soon.
If DDP does not work well, can I reproduce the results on a single A100 (40 GB) by reducing the batch size? Judging from @prote376's results, that also seems not to work well. How should I address this problem? Thanks!
I think reducing the batch size should work, but the learning rate might need to be reduced accordingly. The performance drop in @prote376's experiments may still come from the multi-GPU problem, not the batch size.
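For reference, a minimal sketch of the usual linear-scaling heuristic being suggested here (the numbers are placeholders, not values from this repo):

# Hypothetical numbers; only the ratio matters. If the effective batch size is
# halved, scale the learning rate by the same factor as a starting point,
# then re-tune if needed.
base_lr = 1e-4           # learning rate tuned for the original effective batch size
base_batch_size = 500    # original effective batch size (placeholder)
new_batch_size = 250     # reduced effective batch size (placeholder)
scaled_lr = base_lr * new_batch_size / base_batch_size   # -> 5e-5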
The problem may come from the DDP setup. The PyTorch DDP notes say about the backward pass that "after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same". However, I printed the grads and found that they were not the same across GPUs. My guess is that the gradient synchronization is missing (or something else causes the desynchronization), and the way the DDP model is used could be the source: the official usage is to call the DDP wrapper itself (e.g. ddp_model(batch)), whereas this repo calls self.model.module.train_step(batch), which bypasses DDP's forward and therefore its gradient-averaging hooks.
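For reference, a minimal sketch (not code from this repo) of how the mismatch can be checked after loss.backward(), assuming torch.distributed is already initialized:

import torch
import torch.distributed as dist

def check_grad_sync(model, tol=1e-6):
    # Gather each parameter's gradient from every rank and compare against rank 0's copy.
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.detach().clone()
        gathered = [torch.zeros_like(g) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, g)
        if dist.get_rank() == 0:
            max_diff = max((gi - gathered[0]).abs().max().item() for gi in gathered)
            if max_diff > tol:
                print(f"grad mismatch on {name}: max diff {max_diff:.3e}")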
@ylsung Any update regarding this issue?
You're right.
@JieShibo Hi, could you elaborate more on how to manually synchronize the gradients and address the synchronization problem? Thank you so much!
@nbasyl The gist is to copy DDP's forward but call train_step on the wrapped module, so the reducer hooks that average gradients across ranks are still installed:

import logging

import torch
from torch.distributed.algorithms.join import (
    Join,
    Joinable,
    JoinHook,
)
from torch.distributed.utils import (
    _verify_param_shape_across_processes,
    _sync_module_states,
    _to_kwargs,
)
from torch.nn.parallel.distributed import (
    _find_tensors,
    _tree_flatten_with_rref,
    _DDPSink,
    _tree_unflatten_with_rref,
)

# Module-level logger used by the copied forward body.
logger = logging.getLogger(__name__)


# A copy of DistributedDataParallel.forward, except that the wrapped module's
# train_step is called instead of its __call__.
def ddp_forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function(
        "DistributedDataParallel.forward"
    ):
        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            assert self.logger is not None
            self.logger.set_runtime_stats_and_log()
            self.num_iterations += 1
            self.reducer.prepare_for_forward()

        work = Join.notify_join_context(self)
        if work:
            self.reducer._set_forward_pass_work_handle(
                work, self._divide_by_initial_world_size  # type: ignore[arg-type]
            )

        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
            logger.info(
                "Reducer buckets have been rebuilt in this iteration."
            )
            self._has_rebuilt_buckets = True

        if self._check_sync_bufs_pre_fwd():
            self._sync_buffers()

        if self._join_config.enable:
            self._check_global_requires_backward_grad_sync(
                is_joined_rank=False
            )

        module_to_run = (
            self._replicated_tensor_module
            if self._use_replicated_tensor_module
            else self.module
        )

        if self.device_ids:
            inputs, kwargs = _to_kwargs(
                inputs,
                kwargs,
                self.device_ids[0],
                self.use_side_stream_for_tensor_copies,
            )
            with self._inside_ddp_forward():
                # The only change from the stock forward: call train_step.
                output = module_to_run.train_step(*inputs[0], **kwargs[0])  # type: ignore[index]
        else:
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs, **kwargs)

        if self._check_sync_bufs_post_fwd():
            self._sync_buffers()

        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            self.require_forward_param_sync = True
            if self.find_unused_parameters and not self.static_graph:
                self.reducer.prepare_for_backward(
                    list(_find_tensors(output))
                )
            else:
                self.reducer.prepare_for_backward([])
        else:
            self.require_forward_param_sync = False

    if (self.find_unused_parameters and not self.static_graph) or (
        self.static_graph and self.num_iterations == 1
    ):
        state_dict = {
            "static_graph": self.static_graph,
            "num_iterations": self.num_iterations,
        }
        (
            output_tensor_list,
            treespec,
            output_is_rref,
        ) = _tree_flatten_with_rref(output)
        output_placeholders = [None for _ in range(len(output_tensor_list))]
        # Leave tensors without a grad_fn untouched; they require no grad.
        for i, output in enumerate(output_tensor_list):
            if torch.is_tensor(output) and output.grad_fn is None:
                output_placeholders[i] = output
        passthrough_tensor_list = _DDPSink.apply(
            self.reducer,
            state_dict,
            *output_tensor_list,
        )
        for i in range(len(output_placeholders)):
            if output_placeholders[i] is None:
                output_placeholders[i] = passthrough_tensor_list[i]
        output = _tree_unflatten_with_rref(
            output_placeholders, treespec, output_is_rref
        )
    return output
Then, in VL_adapter/VL-T5/src/multitask.py (lines 284 to 294 at commit 545fcbb), replace self.model.module.train_step(batch) with ddp_forward(self.model, batch).
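To make the change concrete, here is a minimal usage sketch of my own (not code from the repo); base_model, loader, optimizer and local_rank are placeholders, the process group is assumed to be initialized, and train_step is assumed to return a dict with a 'loss' entry as in VL-T5:

import torch
from torch.nn.parallel import DistributedDataParallel

# Hypothetical setup: base_model implements train_step(batch) and returns a
# dict containing 'loss'; local_rank, loader and optimizer are assumed to exist.
model = DistributedDataParallel(base_model, device_ids=[local_rank])

for batch in loader:
    # Call through the DDP wrapper via ddp_forward instead of
    # model.module.train_step(batch), so gradients are averaged across ranks.
    results = ddp_forward(model, batch)
    loss = results['loss']
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()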
Thank you for sharing this code!
I am testing your code for multitask video with BART on 24GB GPUs.
To run your code on 24GB GPUs, I used the command below to enable DDP (batch size: 50 -> 25).
bash scripts/video/single_adapter.sh 2
However, it showed worse results than on a single 48GB GPU.
When I increased the number of GPUs, the performance got even worse.
Because the model doesn't have BatchNorm, I thought the performance should be similar.
Have you tried DDP? Or do you have any intuition about the problem?