
Commit caa40cc

refactor. (#9040)
1 parent 047de24 commit caa40cc

22 files changed: +177, -143 lines

docs/source/contribute/bazel.md (+1, -1)

@@ -1,4 +1,4 @@
-# Bazel in Pytorch/XLA
+# Building with Bazel
 
 [Bazel](https://bazel.build/) is a free software tool used for the
 automation of building and testing software.

docs/source/contribute/codegen_migration.md (+1, -1)

@@ -1,4 +1,4 @@
-# Codegen migration Guide
+# Codegen Migration Guide
 
 As PyTorch/XLA migrates to the LTC (Lazy Tensor Core), we need to clean
 up the existing stub code (which spans over 6+ files) that were used to

docs/source/contribute/configure-environment.md (+1, -1)

@@ -1,4 +1,4 @@
-# Configure a development environment
+# Configure A Development Environment
 
 The goal of this guide is to set up an interactive development
 environment on a Cloud TPU with PyTorch/XLA installed. If this is your

docs/source/contribute/op_lowering.md (+1, -1)

@@ -1,4 +1,4 @@
-# OP Lowering Guide
+# Op Lowering Guide
 
 PyTorch wraps the C++ ATen tensor library that offers a wide range of
 operations implemented on GPU and CPU. Pytorch/XLA is a PyTorch

docs/source/contribute/plugins.md (+1, -1)

@@ -1,7 +1,7 @@
 # Custom Hardware Plugins
 
 PyTorch/XLA supports custom hardware through OpenXLA's PJRT C API. The
-PyTorch/XLA team direclty supports plugins for Cloud TPU (`libtpu`) and
+PyTorch/XLA team directly supports plugins for Cloud TPU (`libtpu`) and
 GPU ([OpenXLA](https://github.com/openxla/xla/tree/main/xla/pjrt/gpu)).
 The same plugins may also be used by JAX and TF.

docs/source/features/scan.md (+16, -17)

@@ -1,26 +1,26 @@
-# Guide for using `scan` and `scan_layers`
+# Optimizing Repeated Layers with `scan` and `scan_layers`
 
 This is a guide for using `scan` and `scan_layers` in PyTorch/XLA.
 
 ## When should you use this
 
-You should consider using [`scan_layers`][scan_layers] if you have a model with
+Consider using [`scan_layers`][scan_layers] if you have a model with
 many homogenous (same shape, same logic) layers, for example LLMs. These models
 can be slow to compile. `scan_layers` is a drop-in replacement for a for loop over
 homogenous layers, such as a bunch of decoder layers. `scan_layers` traces the
 first layer and reuses the compiled result for all subsequent layers, significantly
 reducing the model compile time.
 
 [`scan`][scan] on the other hand is a lower level higher-order-op modeled after
-[`jax.lax.scan`][jax-lax-scan]. Its primary purpose is to help implement
-`scan_layers` under the hood. However, you may find it useful if you would like
-to program some sort of loop logic where the loop itself has a first-class
-representation in the compiler (specifically, an XLA `While` op).
+[`jax.lax.scan`][jax-lax-scan]. Its primary purpose is to implement
+`scan_layers` under the hood. However, you may find it useful
+to program loop logic where the loop itself has a first-class
+representation in the compiler (specifically, the XLA `while` op).
 
 ## `scan_layers` example
 
 Typically, a transformer model passes the input embedding through a sequence of
-homogenous decoder layers like the following:
+homogenous decoder layers:
 
 ```python
 def run_decoder_layers(self, hidden_states):

@@ -31,7 +31,7 @@ def run_decoder_layers(self, hidden_states):
 
 When this function is lowered into an HLO graph, the for loop is unrolled into a
 flat list of operations, resulting in long compile times. To reduce compile
-times, you can replace the for loop with a call to `scan_layers`, as shown in
+times, replace the for loop with `scan_layers`, as shown in
 [`decoder_with_scan.py`][decoder_with_scan]:
 
 ```python

@@ -61,7 +61,7 @@ def scan(
   ...
 ```
 
-You can use it to loop over the leading dimension of tensors efficiently. If `xs`
+Use it to loop over the leading dimension of tensors efficiently. If `xs`
 is a single tensor, this function is roughly equal to the following Python code:
 
 ```python

@@ -74,8 +74,8 @@ def scan(fn, init, xs):
   return carry, torch.stack(ys, dim=0)
 ```
 
-Under the hood, `scan` is implemented much more efficiently by lowering the loop
-into an XLA `While` operation. This ensures that only one iteration of the loop
+Under the hood, `scan` is implemented efficiently by lowering the loop
+into an XLA `while` operation. This ensures that only one iteration of the loop
 is compiled by XLA.
 
 [`scan_examples.py`][scan_examples] contains some example code showing how to use

@@ -114,19 +114,18 @@ Means over time: tensor([[1.0000],
 The functions/modules passed to `scan` and `scan_layers` must be AOTAutograd
 traceable. In particular, as of PyTorch/XLA 2.6, `scan` and `scan_layers` cannot
 trace functions with custom Pallas kernels. That means if your decoder uses,
-for example flash attention, then it's incompatible with `scan`. We are working on
-[supporting this important use case][flash-attn-issue] in nightly and the next
-releases.
+for example flash attention, then it is incompatible with `scan`. We are working on
+[supporting this important use case][flash-attn-issue].
 
 ### AOTAutograd overhead
 
 Because `scan` uses AOTAutograd to figure out the backward pass of the input
-function/module on every iteration, it's easy to become tracing bound compared to
+function/module on every iteration, it is easy to become tracing-bound compared to
 a for loop implementation. In fact, the `train_decoder_only_base.py` example runs
 slower under `scan` than with for loop as of PyTorch/XLA 2.6 due to this overhead.
 We are working on [improving tracing speed][retracing-issue]. This is less of a
 problem when your model is very large or has many layers, which are the situations
-you would want to use `scan` anyways.
+you would want to use `scan`.
 
 ## Compile time experiments
 

@@ -180,7 +179,7 @@ Metric: CompileTime
   99%=18s995ms301.667us
 ```
 
-We can see that the maximum compile time dropped from `1m03s` to `19s` by
+The maximum compile time dropped from `1m03s` to `19s` by
 switching to `scan_layers`.
 
 ## References
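The guide above defers the full example to `decoder_with_scan.py`. As a rough orientation only, here is a minimal sketch of the pattern being described, assuming `scan_layers` is importable from `torch_xla.experimental.scan_layers` and accepts the sequence of homogeneous modules plus the input tensor; `TinyDecoderLayer` is an invented stand-in, not part of the library.

```python
# Minimal sketch (not the repository's decoder_with_scan.py): replace an
# unrolled for loop over identical layers with a single scan_layers call.
import torch
import torch.nn as nn
import torch_xla
from torch_xla.experimental.scan_layers import scan_layers  # assumed import path


class TinyDecoderLayer(nn.Module):
    """Invented homogeneous layer: every instance has the same shape and logic."""

    def __init__(self, hidden: int):
        super().__init__()
        self.linear = nn.Linear(hidden, hidden)

    def forward(self, hidden_states):
        return torch.relu(self.linear(hidden_states))


device = torch_xla.device()  # recent releases; older code uses xm.xla_device()
layers = nn.ModuleList([TinyDecoderLayer(128) for _ in range(8)]).to(device)
hidden_states = torch.randn(4, 16, 128, device=device)

# For-loop version: the loop is unrolled into the HLO graph, so every layer
# is traced and compiled.
out = hidden_states
for layer in layers:
    out = layer(out)

# scan_layers version: only the first layer is traced; the compiled body is
# reused for the remaining layers through an XLA while loop.
out = scan_layers(layers, hidden_states)
```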

docs/source/features/distop.md renamed to docs/source/features/torch_distributed.md (+5, -2)

@@ -1,6 +1,9 @@
-# Support of Torch Distributed API in PyTorch/XLA
-Before the 2.5 release, PyTorch/XLA only supported collective ops through our custom API call `torch_xla.core.xla_model.*`. In the 2.5 release, we adopt `torch.distributed.*` in PyTorch/XLA for both Dynamo and non-Dynamo cases.
+# Support for Torch Distributed
+
+Before the 2.5 release, PyTorch/XLA only supported collective ops through the custom API call `torch_xla.core.xla_model.*`. In the 2.5 release, we adopted `torch.distributed.*` in PyTorch/XLA for both Dynamo and non-Dynamo cases.
+
 ## Collective ops lowering
+
 ### Collective ops lowering stack
 After introducing the [traceable collective communication APIs](https://github.com/pytorch/pytorch/issues/93173), dynamo can support the collective ops with reimplementing lowering in PyTorch/XLA. The collective op is only traceable through `torch.ops._c10d_functional` call. Below is the figure that shows how the collective op, `all_reduce` in this case, is lowered between torch and torch_xla:
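For orientation, a hedged sketch of the usage this renamed page covers: calling a `torch.distributed` collective on XLA tensors after registering the `xla` process group backend. The launch and init pattern below follows the PyTorch/XLA examples as I understand them; the page itself documents the exact supported invocation.

```python
# Sketch only: torch.distributed all_reduce on XLA devices (non-Dynamo path).
import torch
import torch.distributed as dist
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" process group backend


def _mp_fn(index: int):
    # The xla:// init method derives rank and world size from the XLA runtime.
    dist.init_process_group("xla", init_method="xla://")
    device = xm.xla_device()

    # Each rank contributes its own value; all_reduce sums them in place.
    value = torch.ones(4, device=device) * dist.get_rank()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    xm.mark_step()  # flush the traced graph so the collective executes
    print(f"rank {dist.get_rank()}: {value.cpu()}")


if __name__ == "__main__":
    torch_xla.launch(_mp_fn)
```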

docs/source/index.rst (+70, -34)

@@ -2,72 +2,108 @@
 
 PyTorch/XLA documentation
 ===================================
-PyTorch/XLA is a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs.
+``torch_xla`` is a Python package that implements \
+`XLA <https://openxla.org/xla>`_ as a backend for PyTorch.
+
++------------------------------------------------+------------------------------------------------+------------------------------------------------+
+| **Familiar APIs**                              | **High Performance**                           | **Cost Efficient**                             |
+|                                                |                                                |                                                |
+| Create and train PyTorch models on TPUs,       | Scale training jobs across thousands of        | TPU hardware and the XLA compiler are optimized|
+| with only minimal changes required.            | TPU cores while maintaining high MFU.          | for cost-efficient training and inference.     |
++------------------------------------------------+------------------------------------------------+------------------------------------------------+
+
+Getting Started
+---------------
+
+Install with pip.
+
+.. code-block:: sh
+
+   pip install torch torch_xla[tpu]
+
+Verify the installation:
+
+.. code-block:: sh
+
+   python -c "import torch_xla; print(torch_xla.__version__)"
+   python -c "import torch; import torch_xla; print(torch.tensor(1.0, device='xla').device)"
+
+Tutorials
+---------
 
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: Learn about Pytorch/XLA
+   :caption: Learn the Basics
 
-   learn/xla-overview
    learn/pytorch-on-xla-devices
-   learn/api-guide
-   learn/dynamic_shape
-   learn/eager
-   learn/pjrt
-   learn/troubleshoot
+   learn/xla-overview
 
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: Learn about accelerators
+   :caption: Distributed Training on TPU
 
    accelerators/tpu
-   accelerators/gpu
+   perf/spmd_basic
+   perf/spmd_advanced
+   perf/spmd_distributed_checkpoint
+   features/torch_distributed
+   perf/ddp
+   perf/fsdp_collectives
+   perf/fsdp_spmd
 
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: Run ML workloads with Pytorch/XLA
+   :caption: Advanced Techniques
 
-   workloads/kubernetes
+   features/pallas
+   features/stablehlo
+   perf/amp
+   learn/dynamic_shape
+   perf/dynamo
+   perf/quantized_ops
+   features/scan
+   perf/fori_loop
+   perf/assume_pure
 
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: PyTorch/XLA features
+   :caption: Troubleshooting
 
-   features/pallas.md
-   features/stablehlo.md
-   features/triton.md
-   features/scan.md
+   learn/troubleshoot
+   learn/eager
+   notes/source_of_recompilation
+   perf/recompilation
 
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: Improve Pytorch/XLA workload performance
+   :caption: Training on GPU
 
-   perf/amp
-   perf/spmd_basic
-   perf/spmd_advanced
-   perf/spmd_distributed_checkpoint
+   accelerators/gpu
+   features/triton
    perf/spmd_gpu
-   perf/ddp
-   perf/dynamo
-   perf/fori_loop
-   perf/fsdp
-   perf/fsdpv2
-   perf/quantized_ops
-   perf/recompilation
-
+
 .. toctree::
    :glob:
    :maxdepth: 1
-   :caption: Contribute to Pytorch/XLA
+   :caption: Contributing
 
+   contribute/bazel
    contribute/configure-environment
-   contribute/codegen_migration
+   contribute/cpp_debugger
    contribute/op_lowering
+   contribute/codegen_migration
    contribute/plugins
-   contribute/bazel
-   contribute/recompilation
+
+API Reference
+-------------
+
+.. toctree::
+   :glob:
+   :maxdepth: 2
+
+   learn/api-guide
File renamed without changes.

docs/source/learn/dynamic_shape.md (+17, -14)

@@ -1,30 +1,33 @@
-# Dynamic shape
+# Dynamic Shapes
 
-Dynamic shape refers to the variable nature of a tensor shape where its shape depends on the value of another upstream tensor. For example:
-```
+Dynamic shapes means a tensor's shape depends on the value of another tensor. For example:
+```python
 >>> import torch, torch_xla
 >>> in_tensor = torch.randint(low=0, high=2, size=(5,5), device='xla:0')
 >>> out_tensor = torch.nonzero(in_tensor)
 ```
-the shape of `out_tensor` depends on the value of `in_tensor` and is bounded by the shape of `in_tensor`. In other words, if you do
-```
+
+The shape of `out_tensor` depends on the value of `in_tensor` and is bounded by the shape of `in_tensor`. In other words, if you do
+
+```python
 >>> print(out_tensor.shape)
 torch.Size([<=25, 2])
 ```
-you can see the first dimension depends on the value of `in_tensor` and its maximum value is 25. We call the first dimension the dynamic dimension. The second dimension does not depend on any upstream tensors so we call it the static dimension.
+the first dimension depends on the value of `in_tensor` and its maximum value is 25. We call the first dimension the dynamic dimension. The second dimension does not depend on any upstream tensors so we call it the static dimension.
 
 Dynamic shape can be further categorized into bounded dynamic shape and unbounded dynamic shape.
-- bounded dynamic shape: refers to a shape whose dynamic dimensions are bounded by static values. It works for accelerators that require static memory allocation (e.g. TPU).
-- unbounded dynamic shape: refers to a shape whose dynamic dimensions can be infinitely large. It works for accelerators that don’t require static memory allocation (e.g. GPU).
+- Bounded dynamic shape: refers to a shape whose dynamic dimensions are bounded by static values. It works for accelerators that require static memory allocation (e.g. TPU).
+- Unbounded dynamic shape: refers to a shape whose dynamic dimensions can be infinitely large. It works for accelerators that don’t require static memory allocation (e.g. GPU).
 
 Today, only the bounded dynamic shape is supported and it is in the experimental phase.
 
 ## Bounded dynamic shape
 
 Currently, we support multi-layer perceptron models (MLP) with dynamic size input on TPU.
 
-This feature is controlled by a flag `XLA_EXPERIMENTAL="nonzero:masked_select"`. To run a model with the feature enabled, you can do:
-```
+This feature is controlled by a flag `XLA_EXPERIMENTAL="nonzero:masked_select"`. To run a model with the feature enabled, launch Python with the following environment variable:
+
+```sh
 XLA_EXPERIMENTAL="nonzero:masked_select:masked_scatter" python your_scripts.py
 ```

@@ -40,8 +43,8 @@ Here are some numbers we get when we run the MLP model for 100 iterations:
 
 One of the motivations of the dynamic shape is to reduce the number of excessive recompilation when the shape keeps changing between iterations. From the figure above, you can see the number of compilations reduced by half which results in the drop of the training time.
 
-To try it out, run
-```
+To try it:
+
+```sh
 XLA_EXPERIMENTAL="nonzero:masked_select" PJRT_DEVICE=TPU python3 pytorch/xla/test/ds/test_dynamic_shape_models.py TestDynamicShapeModels.test_backward_pass_with_dynamic_input
-```
-For more details on how we plan to expand the dynamic shape support on PyTorch/XLA in the future, feel free to review our [RFC](https://github.com/pytorch/xla/issues/3884).
+```
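The flag above also names `masked_select`; as a hedged companion to the `nonzero` example in this file, the sketch below shows the analogous value-dependent op. The bounded shape noted in the comment is the expected behavior under the experimental flag, not verified output.

```python
# Sketch: run with XLA_EXPERIMENTAL="nonzero:masked_select" set, as described above.
import torch
import torch_xla

t = torch.randn(5, 5, device='xla:0')
positive = torch.masked_select(t, t > 0)  # length depends on the values in t
print(positive.shape)                     # dynamic first dimension, bounded by 25
```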

docs/source/learn/troubleshoot.md (+1, -1)

@@ -1,4 +1,4 @@
-# Troubleshoot
+# Troubleshooting Basics
 
 Note that the information in this section is subject to be removed in
 future releases of the *PyTorch/XLA* software, since many of them are

docs/source/learn/xla-overview.md (+1, -1)

@@ -1,4 +1,4 @@
-# Pytorch/XLA overview
+# Pytorch/XLA Overview
 
 This section provides a brief overview of the basic details of PyTorch
 XLA, which should help readers better understand the required

docs/source/perf/assume_pure.md (+1, -1)

@@ -1,4 +1,4 @@
-# Use `@assume_pure` to speed up lazy tensor tracing
+# Speed Up Tracing with `@assume_pure`
 
 This document explains how to use `torch_xla.experimental.assume_pure` to
 eliminate lazy tensor tracing overhead. See [this blog post][lazy-tensor] for a
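A minimal hedged sketch of the decorator usage this retitled page describes, assuming `assume_pure` is importable from the `torch_xla.experimental.assume_pure` module named in the diff; the page itself has the authoritative example.

```python
# Sketch: mark a side-effect-free function as pure so its trace can be reused.
import torch
import torch_xla
from torch_xla.experimental.assume_pure import assume_pure  # assumed import path


@assume_pure
def gelu_mlp(x, w1, w2):
    # Pure: the output depends only on the inputs, with no side effects.
    return torch.nn.functional.gelu(x @ w1) @ w2
```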

docs/source/perf/ddp.md (+1, -1)

@@ -1,4 +1,4 @@
-# How to do DistributedDataParallel(DDP)
+# Distributed Data Parallel (DDP)
 
 This document shows how to use torch.nn.parallel.DistributedDataParallel
 in xla, and further describes its difference against the native xla data
