Pipeline parallelism requires special gradient clipping implementation #313

Open
@le1nux

Description

The current gradient clipper implementation only works with DP, FSDP, and HSDP.
With PP, each pipeline stage holds only its own subset of the parameters, so the gradient norms of the different stages would not be accumulated across stages and the resulting global norm used for clipping would be wrong.

See:
https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.clip_grad_norm_

https://github.com/pytorch/torchtitan/blob/b291ad662493b63d25b038a30a915082d3617baf/torchtitan/distributed/utils.py#L256-L259
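For illustration, below is a minimal sketch of what a PP-aware clipper could look like, following the torchtitan approach linked above: each rank computes the norm over the gradients of its own stage, the norms are reduced across the pipeline process group, and every stage then scales its gradients by the same clip coefficient. The helper name `clip_grad_norm_pp` and the `pp_group` argument are hypothetical, and the sketch assumes plain per-rank parameters (no FSDP/DTensor sharding on top of PP).

```python
import math

import torch
import torch.distributed as dist


def clip_grad_norm_pp(parameters, max_norm, pp_group, norm_type=2.0):
    """Hypothetical sketch: gradient clipping when each pipeline stage
    owns a disjoint shard of the model's parameters.

    The total norm is reduced across the PP process group before the
    clip coefficient is computed, so all stages scale consistently.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)

    # Norm over the parameters owned by this pipeline stage only.
    local_norms = torch.stack(
        [torch.linalg.vector_norm(g.detach(), norm_type) for g in grads]
    )

    if math.isinf(norm_type):
        # Infinity norm: take the max across stages.
        total_norm = local_norms.max()
        dist.all_reduce(total_norm, op=dist.ReduceOp.MAX, group=pp_group)
    else:
        # Finite p-norm: sum the p-th powers across stages, then take
        # the p-th root, mirroring the torchtitan utility linked above.
        total_norm = local_norms.pow(norm_type).sum()
        dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_group)
        total_norm = total_norm.pow(1.0 / norm_type)

    # Scale all gradients on this stage by the shared clip coefficient.
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    for g in grads:
        g.mul_(clip_coef_clamped)
    return total_norm
```

In a combined FSDP+PP setup the local norm computation itself would additionally have to be shard-aware (e.g. via FSDP's `clip_grad_norm_` or DTensor-based norms), so this sketch only addresses the missing reduction over pipeline stages.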

Labels: bug (Something isn't working)