Improve assume_pure docs and tests #9038

Merged 4 commits from yifeit/call-jax-mesh into master on Apr 26, 2025
Conversation

tengyifei (Collaborator) commented on Apr 25, 2025:

Beefed up the assume_pure tests and updated the docs to mention that mark_sharding is supported, thanks to qihqi@'s #8989.

Also updated yapf in the dev image to match CI.
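
A minimal sketch of the pattern the new tests exercise (not copied from the PR's test file; the mesh shape, axis name, and tensor sizes are made up): a sharding annotation placed inside a function wrapped with assume_pure.

```python
import numpy as np
import torch
import torch_xla
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
from torch_xla.experimental.assume_pure import assume_pure

xr.use_spmd()
num_devices = xr.global_runtime_device_count()
# One-dimensional mesh over all devices; the 'model' axis name is illustrative.
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('model',))

@assume_pure
def sharded_mul(x):
  # mark_sharding inside an assume_pure region is what qihqi@'s #8989 enabled.
  xs.mark_sharding(x, mesh, ('model', None))
  return x * 2

x = torch.ones(num_devices, 4, device=torch_xla.device())
print(sharded_mul(x))
```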

tengyifei marked this pull request as ready for review on April 25, 2025 at 01:49
tengyifei requested a review from qihqi on April 25, 2025 at 01:49
qihqi (Collaborator) commented on Apr 25, 2025:

I think what's happening with ValueError: torch_xla device ID [1 2 3] not found in available JAX devices is:

torch_xla.devices() has 4 devices but jax.devices() only has 1. One possibility is that jax[cuda] was not installed, so jax.devices() returns a single device, and that is the CPU device.

Last time I tried to add the install, I hit a different error. I am OK with disabling the test for CUDA until later.
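
A quick, hedged way to check this hypothesis (assuming the standard jax and torch_xla.runtime APIs): compare the device count each runtime sees. If jax[cuda] is missing, jax.devices() falls back to a single CPU device while torch_xla still reports every GPU, which would trip the device-ID lookup.

```python
import jax
import torch_xla.runtime as xr

torch_xla_count = xr.global_runtime_device_count()
jax_devices = jax.devices()
if torch_xla_count != len(jax_devices):
  # Mismatch is consistent with JAX lacking a CUDA backend.
  print(f'Mismatch: torch_xla sees {torch_xla_count} devices, '
        f'but JAX sees {jax_devices}')
```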

tengyifei (Collaborator, Author) commented:

> I think what's happening with ValueError: torch_xla device ID [1 2 3] not found in available JAX devices is [...]

That is a great point!

This made me realize that if we land this PR as-is, not only will call_jax not work on GPUs, it also won't work under multi-slice TPUs. That in turn means shard_as won't work, and scan won't work either, blocking training of models such as Llama 3.1 405B.

I think I'll split out the call_jax part of this PR into a separate one, and unfortunately that one can't be landed unless we fix the PJRT client sharing between PyTorch/XLA and JAX.

tengyifei force-pushed the yifeit/call-jax-mesh branch from 8e682cd to d83ffca on April 25, 2025 at 18:42
tengyifei changed the title from "[call_jax] Bridge the torch_xla and JAX mesh and improve assume_pure" to "Improve assume_pure docs and tests" on Apr 25, 2025
tengyifei enabled auto-merge (squash) on April 25, 2025 at 19:39
Commit message:

> Now we can run a JAX SPMD function that accesses the ambient SPMD mesh from xb.call_jax.
>
> Fixes #8972.
>
> Also I beefed up the assume_pure tests and updated the docs to mention that mark_sharding is supported thanks to qihqi@'s #8989.
tengyifei force-pushed the yifeit/call-jax-mesh branch from 8b110a4 to debf063 on April 25, 2025 at 22:34
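
For context, a minimal sketch of the behavior the commit message above describes; per the discussion, this mesh bridging was split out of this PR. The axis name, shapes, and use of a bare PartitionSpec (which only resolves against an ambient mesh) are illustrative, not taken from the branch.

```python
import jax
from jax.sharding import PartitionSpec as P
import torch
import torch_xla
import torch_xla.core.xla_builder as xb

def jax_fn(x):
  # A bare PartitionSpec has no explicit mesh, so JAX must look up the
  # ambient SPMD mesh, which the branch bridges from torch_xla.
  x = jax.lax.with_sharding_constraint(x, P('model'))
  return x + 1

x = torch.ones(8, 8, device=torch_xla.device())
out = xb.call_jax(jax_fn, (x,))
```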
tengyifei (Collaborator, Author) commented:

Draft for registering the JAX contextual mesh in call_jax: #9043 (informational only)

tengyifei merged commit 7d681a9 into master on Apr 26, 2025. 23 of 24 checks passed.