Support torch.distributed.scatter collective #9365

bfolie · 2025-06-16T17:01:53Z

XLA doesn't have a distributed Scatter op, but we can put dummy tensor lists on the non-source ranks and use reduce_scatter instead.
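To illustrate the idea, here is a minimal single-process sketch in plain Python. It is not the code from this PR. It assumes the dummy tensors are zeros and that the reduction is a sum, so each rank ends up with exactly the source rank's chunk. The names `reduce_scatter_sum` and `scatter_via_reduce_scatter` are hypothetical helpers for this illustration, and lists of floats stand in for tensors.

```python
from typing import List

Tensor = List[float]  # stand-in for a real tensor (assumption for this sketch)


def reduce_scatter_sum(inputs_per_rank: List[List[Tensor]]) -> List[Tensor]:
    """Simulate a SUM reduce_scatter.

    inputs_per_rank[r][i] is the i-th input tensor held by rank r.
    Rank i receives the elementwise sum of every rank's i-th tensor.
    """
    world_size = len(inputs_per_rank)
    outputs = []
    for i in range(world_size):
        chunks = [inputs_per_rank[r][i] for r in range(world_size)]
        outputs.append([sum(vals) for vals in zip(*chunks)])
    return outputs


def scatter_via_reduce_scatter(scatter_list: List[Tensor], src: int,
                               world_size: int) -> List[Tensor]:
    """Emulate scatter from rank `src` using only reduce_scatter.

    The source rank contributes the real tensors; every other rank
    contributes zero-filled dummies of the same shape (an assumption),
    so the sum leaves the source's data unchanged.
    """
    assert len(scatter_list) == world_size, "one tensor per rank is required"
    inputs_per_rank = []
    for rank in range(world_size):
        if rank == src:
            inputs_per_rank.append(scatter_list)
        else:
            inputs_per_rank.append([[0.0] * len(t) for t in scatter_list])
    return reduce_scatter_sum(inputs_per_rank)


# Rank 1 scatters three chunks; rank i should receive chunk i.
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
received = scatter_via_reduce_scatter(chunks, src=1, world_size=3)
assert received == chunks
```

The cost of this approach is that every rank participates in a full reduction even though only one rank holds real data, which is the trade-off for not having a native Scatter op.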

…314-scatter

test/test_torch_distributed_xla_backend.py

pgmoka

Mostly questions, and a request for extra documentation.

torch_xla/distributed/xla_backend.py

test/test_torch_distributed_xla_backend.py

test/pjrt/test_collective_ops_tpu.py

pgmoka

Follow-up seems good. Let me know if you have any questions on https://github.com/pytorch/xla/pull/9365/files#r2151351304.

Otherwise, LGTM

pgmoka

Follow-up seems good. Let me know if you have any questions on https://github.com/pytorch/xla/pull/9365/files#r2151351304.

Otherwise, LGTM.

One minor thing: I believe the failing tests are due to flakiness. Can you confirm?

bfolie · 2025-06-24T15:26:28Z

One minor thing: I believe the failing tests are due to flakiness. Can you confirm?

The TPU test failure is a known flake and was cleared up by re-running.

The Torchprime e2e test failure is probably real, but because PRs from forks aren't exercising that test, it's very difficult to tell where the failure comes from.

bfolie added 2 commits June 16, 2025 16:04

write first draft of scatter implementation and test

af92456

format, fix small issues

4940e19

bfolie requested review from bhavya01 and pgmoka June 16, 2025 17:02

bfolie added 3 commits June 16, 2025 17:10

generalize implementation to work on longer lists

fcf2803

remove scatter from unimplemented list

04032e8

Merge branch 'master' of https://github.com/pytorch/xla into bfolie/9…

2154841

…314-scatter

bfolie commented Jun 17, 2025

test/test_torch_distributed_xla_backend.py

remove extra blank line

25ffd2f

bfolie mentioned this pull request Jun 17, 2025

[RFC] Improved coverage for native distributed collective operations #9315

Open

pgmoka reviewed Jun 17, 2025

test/pjrt/test_collective_ops_tpu.py

improved documentation

448bd65

bfolie requested review from pgmoka and ghpvnist June 18, 2025 17:32

ghpvnist approved these changes Jun 20, 2025

test/pjrt/test_collective_ops_tpu.py

pgmoka approved these changes Jun 20, 2025


pgmoka reviewed Jun 20, 2025


Consolidate if-else

e8db116

bfolie merged commit 01db65d into master Jun 24, 2025
23 of 24 checks passed
