This repository has been archived by the owner on Aug 7, 2024. It is now read-only.

Commit: Merge branch 'pytorch-labs:main' into fsdp2
weifengpy authored Jul 17, 2024
2 parents b5cad8d + 38c02fe commit a6b8913
Showing 1 changed file with 18 additions and 15 deletions.
README.md (33 changes: 18 additions & 15 deletions)
@@ -2,11 +2,12 @@

This is an early version of a library for accelerating training with float8 in native PyTorch
according to the recipes laid out in https://arxiv.org/pdf/2209.05433.pdf.
-The codebase strives to stay small, easily hackable, and debuggable with native PyTorch tooling.
-``torch.compile`` is supported out of the box. With ``torch.compile`` on, initial results show
+The codebase strives to stay small, easily hackable, debuggable with native PyTorch tooling,
+and composable with key systems such as autograd, ``torch.compile`` and distributed.
+With ``torch.compile`` on, initial results show
throughput speedups of up to 1.2x on small scale (8 GPUs) LLaMa pretraining jobs.

-:warning: <em>See the [feature tracker](https://github.com/pytorch-labs/float8_experimental/issues/187) for upcoming features. Key features such as weight cast recomputation in backward and large scale distributed support are not ready yet.</em>
+:warning: <em>See the [feature tracker](https://github.com/pytorch-labs/float8_experimental/issues/187) for upcoming features.</em>

:warning: <em>Backwards compatibility is not guaranteed at this point. The codebase is in active development and
will change rapidly.</em>
@@ -25,7 +26,7 @@ pip install -e .
pip install -e ".[dev]"
```

-# User API
+# Single GPU User API

We provide two per-tensor scaling strategies: dynamic and delayed. See https://arxiv.org/pdf/2209.05433.pdf, Section 4.3 for more details. These strategies are configurable separately for activations (`x`), weights (`w`) and gradients (`dL_dY`).
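
To make the single-GPU flow concrete, here is a minimal sketch of converting a model and training with the dynamic scaling strategy. It assumes a `swap_linear_with_float8_linear` helper in `float8_experimental.float8_linear_utils`; the exact import path and call signature may differ in this version of the codebase, so treat it as an illustration rather than the library's documented example.

```python
# Minimal sketch (not the library's documented example): the import path and
# call signature of swap_linear_with_float8_linear are assumed and may differ
# in this version of float8_experimental.
import torch
import torch.nn as nn
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear

m = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 2048)).cuda()

# Replace every nn.Linear with its float8 counterpart. This sketch uses dynamic
# scaling; delayed scaling additionally requires syncing amax/scale history once
# per iteration, as in the training loop shown in the diff below.
swap_linear_with_float8_linear(m)

# torch.compile composes with the converted model.
m = torch.compile(m)

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
for _ in range(10):
    y = m(torch.randn(16, 2048, device="cuda"))
    y.sum().backward()
    optimizer.step()
    optimizer.zero_grad()
```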

@@ -113,30 +114,32 @@ for _ in range(N_ITER):
optimizer.step()
```

-# 🧭 Code Organization
+# Multi GPU User API

-* `float8_experimental/float8_linear.py`
-  - `Float8Linear` (main user facing entry point for Float8Linear)
-* `float8_experimental/float8_tensor.py`
-  - `Float8Tensor`, which allows `Float8Linear` to abide by the `x.dtype == x.grad.dtype` restriction
-  - `ScaledMMConfig` defines the semantics for matmul in the forward and backwards pass
+We compose with the `DTensor` based [distributed APIs](https://pytorch.org/docs/stable/distributed.tensor.parallel.html),
+such as FSDP, TP and SP. Please see the [torchtitan](https://github.com/pytorch/torchtitan) repository for e2e examples
+on using `float8_experimental` in a distributed setting.
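
As a rough illustration of how the converted model composes with the distributed APIs, a hedged FSDP sketch follows. The float8 swap helper is the same assumed name as in the sketch above, and torchtitan remains the reference for end-to-end usage.

```python
# Hedged sketch: compose a float8-converted model with PyTorch FSDP.
# Launch with e.g. `torchrun --nproc_per_node=2 train_fsdp_sketch.py`
# (script name is hypothetical). The swap helper name/signature is assumed.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

m = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 2048)).cuda()

# Swap to float8 first, then wrap with FSDP so the float8 linears are sharded.
swap_linear_with_float8_linear(m)
m = FSDP(m, use_orig_params=True)

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
for _ in range(10):
    y = m(torch.randn(16, 2048, device="cuda"))
    y.sum().backward()
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```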

# Testing

```bash
# run single-GPU unit tests
pytest test/test_base.py

# run a single-GPU integration test on SAM
pytest test/test_sam.py

# run single-GPU compile tests
pytest test/test_compile.py

# run single-GPU numerics integration tests
pytest test/test_numerics_integration.py

# run a two-GPU integration test on FSDP
./test/test_fsdp.sh

-# run integration tests for TP/SP (outdated)
-./test/test_tp.sh
+# run integration tests on the DTensor TP/SP integration
+./test/test_dtensor.sh

# run integration tests on the FSDP2 integration
python test/test_fsdp2/test_fsdp2_eager.py

# run all of these tests
./test/test_everything.sh
