Description
🚀 Feature
Improve coverage of the PyTorch collective operations so that native distributed code is more likely to work without any user modification. Ideally, collective ops would work both in LazyTensor mode and in compiled mode, but many do not have an upstream Dynamo path.
The following are in scope.
Collectives that operate on objects instead of tensors are out of scope.
Pitch
All Gather Into Tensor
Gathers tensors into a single output tensor. This is like the already-implemented all_gather, but it outputs a single tensor instead of a list of tensors. It appears in FSDP (example) and other places.
The current implementation wraps all_gather, and it fails in some cases because it was not set up to support both stacking and concatenation of the input tensors. This zip, which is meant for a list of tensors, happens to work in some cases by splitting the tensor along the 0th dimension.
This should be a simple refactor: isolate the common logic, but let stacking, concatenation, or copying to individual tensors happen only in the correct cases.
Done in #9332
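For reference, a minimal sketch of the two output shapes the wrapper needs to handle, written against plain torch.distributed for illustration rather than the actual PyTorch/XLA code; all_gather_into_tensor_ref is a hypothetical name.

```python
import torch
import torch.distributed as dist

def all_gather_into_tensor_ref(output: torch.Tensor, input: torch.Tensor, group=None):
    """Sketch of all_gather_into_tensor semantics built on top of all_gather."""
    world_size = dist.get_world_size(group)
    shards = [torch.empty_like(input) for _ in range(world_size)]
    dist.all_gather(shards, input, group=group)
    if output.shape == (world_size, *input.shape):
        # Stacking case: the output adds a new leading dimension.
        torch.stack(shards, dim=0, out=output)
    else:
        # Concatenation case: shards are concatenated along dim 0.
        torch.cat(shards, dim=0, out=output)
    return output
```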
Broadcast
This has already been solved in PR 7956 and PyTorch PR 135171, but those PRs were not merged and should be revived. Broadcast is used in DDP.
Scatter
Scatters a list of tensors across processes. It is used by sharding code. Scatter does not have a corresponding XLA op, but we could implement it similarly to how broadcast is implemented: perform a reduce_scatter after multiplying the tensors on all non-source devices by 0.
Done in #9365
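A minimal sketch of that approach, again using plain torch.distributed for illustration; scatter_via_reduce_scatter is a hypothetical name.

```python
import torch
import torch.distributed as dist

def scatter_via_reduce_scatter(output: torch.Tensor, scatter_list, src: int, group=None):
    """Sketch: scatter expressed as a SUM reduce_scatter with zeroed non-source inputs."""
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    if rank == src:
        inputs = list(scatter_list)
    else:
        # Non-source ranks contribute zeros, so the SUM reduction leaves each
        # rank with exactly the shard the source intended for it.
        inputs = [torch.zeros_like(output) for _ in range(world_size)]
    dist.reduce_scatter(output, inputs, op=dist.ReduceOp.SUM, group=group)
    return output
```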
Reduce Scatter
Reduces a list of tensors, then scatters the result across processes. This is like the already-implemented reduce_scatter_tensor, but it acts on a list of tensors instead of a single tensor. It is not used in the torch.distributed code but may be necessary to implement scatter.
Reduce Scatter works in LazyTensor mode, but it fails when compiled because there is no Dynamo mapping. There is not even a reduce_scatter functional collective, but reduce_scatter_tensor_coalesced might be usable. Once implemented, we would bind it in pt/xla (the analogous binding for reduce_scatter_tensor is here). After implementing the binding, the logic can be shared with the existing reduce_scatter_tensor, and the underlying XLA op can accept a tuple of arrays, so the rest should be straightforward.
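To illustrate the intended semantics (not the proposed lowering), the list form can be expressed in terms of the already-implemented reduce_scatter_tensor by concatenating the per-rank inputs; reduce_scatter_ref is a hypothetical name.

```python
import torch
import torch.distributed as dist

def reduce_scatter_ref(output: torch.Tensor, input_list, op=dist.ReduceOp.SUM, group=None):
    """Sketch: rank i ends up with the reduction across ranks of everyone's input_list[i]."""
    flat_input = torch.cat([t.reshape(-1) for t in input_list])
    flat_output = output.new_empty(output.numel())
    # Chunk i of flat_input is input_list[i], so reduce_scatter_tensor hands
    # rank i the reduced chunk i.
    dist.reduce_scatter_tensor(flat_output, flat_input, op=op, group=group)
    output.copy_(flat_output.view_as(output))
    return output
```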
Send and Recv
Send and receive are implemented but do not work. As seen in #8074, the XLA ops are missing "_xla_send_recv_source_target_pairs". Those fields get set here. XLA's Send and Recv aren't meant to be called directly. Instead the user is expected to use CollectivePermute and specify every source-target pair. If there are no cycles then the HLO is decomposed into Send and Recv ops.
We can replace send/recv with calls to xm.collective_permute.
Done in #9373
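A minimal sketch of what a point-to-point send could look like on top of collective_permute; send_via_collective_permute is a hypothetical name, and the real integration would live in the torch_xla backend rather than user code.

```python
import torch
import torch_xla.core.xla_model as xm

def send_via_collective_permute(tensor: torch.Tensor, src: int, dst: int) -> torch.Tensor:
    # A single (source, target) pair moves the tensor from src to dst; ranks
    # that are not a target in any pair receive zeros.
    return xm.collective_permute(tensor, pairs=[[src, dst]])
```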
Gather
This is not used in the torch.distributed code but was requested in #9069. It could be implemented using all_gather and only keeping the result on the dst rank.
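A minimal sketch of that approach, written against plain torch.distributed for illustration; gather_via_all_gather is a hypothetical name.

```python
import torch
import torch.distributed as dist

def gather_via_all_gather(tensor: torch.Tensor, gather_list, dst: int, group=None):
    """Sketch: gather built on all_gather, keeping the result only on dst."""
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor, group=group)
    # Every rank pays for the full all_gather, but only dst keeps the result.
    if dist.get_rank(group) == dst:
        for out, shard in zip(gather_list, gathered):
            out.copy_(shard)
```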
Reduce
This is not used in the torch.distributed code. It could be implemented using all_reduce and only keeping the result on the dst rank.
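A minimal sketch, analogous to the gather sketch above; reduce_via_all_reduce is a hypothetical name.

```python
import torch
import torch.distributed as dist

def reduce_via_all_reduce(tensor: torch.Tensor, dst: int, op=dist.ReduceOp.SUM, group=None):
    """Sketch: reduce built on all_reduce, keeping the reduced value only on dst."""
    original = tensor.clone()
    dist.all_reduce(tensor, op=op, group=group)
    if dist.get_rank(group) != dst:
        # Only the destination rank keeps the reduced value; the other ranks
        # restore their original tensors.
        tensor.copy_(original)
    return tensor
```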
All to All
This is not used in the torch.distributed code. Since all_to_all_single is implemented, we could probably stack the tensors, run all_to_all_single, and then chunk the result.
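A minimal sketch of that stacking/chunking approach, assuming all tensors in the lists share one shape; all_to_all_via_single is a hypothetical name.

```python
import torch
import torch.distributed as dist

def all_to_all_via_single(output_list, input_list, group=None):
    """Sketch: list-based all_to_all expressed via all_to_all_single."""
    # Stack the per-rank inputs into one tensor with a leading world_size dim,
    # so chunk i along dim 0 is exactly input_list[i].
    stacked_input = torch.stack(input_list, dim=0)
    stacked_output = torch.empty_like(stacked_input)
    dist.all_to_all_single(stacked_output, stacked_input, group=group)
    # Chunk the result back into the caller's output list.
    for out, chunk in zip(output_list, stacked_output.unbind(dim=0)):
        out.copy_(chunk)
    return output_list
```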