Pipeline parallelism requires special gradient clipping implementation #313

Open
@le1nux

Description

The current gradient clipper implementation only works with DP, FSDP, and HSDP.
With PP, each pipeline stage holds only its own subset of the parameters, so the gradient norms of the different stages would not be accumulated across stages and the resulting global norm used for clipping would be wrong.

See:
https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.clip_grad_norm_

https://github.com/pytorch/torchtitan/blob/b291ad662493b63d25b038a30a915082d3617baf/torchtitan/distributed/utils.py#L256-L259
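For illustration, below is a minimal sketch of what a PP-aware clipper could look like, following the torchtitan approach linked above: each rank computes the norm over the gradients of its own stage, the norms are reduced across the pipeline process group, and every stage then scales its gradients by the same clip coefficient. The helper name `clip_grad_norm_pp` and the `pp_group` argument are hypothetical, and the sketch assumes plain per-rank parameters (no FSDP/DTensor sharding on top of PP).

```python
import math

import torch
import torch.distributed as dist


def clip_grad_norm_pp(parameters, max_norm, pp_group, norm_type=2.0):
    """Hypothetical sketch: gradient clipping when each pipeline stage
    owns a disjoint shard of the model's parameters.

    The total norm is reduced across the PP process group before the
    clip coefficient is computed, so all stages scale consistently.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)

    # Norm over the parameters owned by this pipeline stage only.
    local_norms = torch.stack(
        [torch.linalg.vector_norm(g.detach(), norm_type) for g in grads]
    )

    if math.isinf(norm_type):
        # Infinity norm: take the max across stages.
        total_norm = local_norms.max()
        dist.all_reduce(total_norm, op=dist.ReduceOp.MAX, group=pp_group)
    else:
        # Finite p-norm: sum the p-th powers across stages, then take
        # the p-th root, mirroring the torchtitan utility linked above.
        total_norm = local_norms.pow(norm_type).sum()
        dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_group)
        total_norm = total_norm.pow(1.0 / norm_type)

    # Scale all gradients on this stage by the shared clip coefficient.
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    for g in grads:
        g.mul_(clip_coef_clamped)
    return total_norm
```

In a combined FSDP+PP setup the local norm computation itself would additionally have to be shard-aware (e.g. via FSDP's `clip_grad_norm_` or DTensor-based norms), so this sketch only addresses the missing reduction over pipeline stages.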

Labels: bug (Something isn't working)