
Commit 8536908

Separate pitch, quick start and internals in README.md (#1742)
* Revamp main README.md
* up
* push
* push
* Update README.md
1 parent 9113926 commit 8536908

4 files changed: +240 -247 lines changed


README.md

+24 -246
@@ -19,6 +19,13 @@ This repository still contains:

# TorchDynamo

+> TorchDynamo makes it easy to experiment with different compiler backends to make PyTorch code faster with a single line decorator `torch._dynamo.optimize()`
+
+TorchDynamo supports arbitrary PyTorch code, control flow, mutation and dynamic shapes.
+
+You can follow our nightly benchmarks [here](https://github.com/pytorch/torchdynamo/issues/681)
+
TorchDynamo is a Python-level JIT compiler designed to make unmodified
PyTorch programs faster. TorchDynamo hooks into the frame evaluation API
in CPython ([PEP 523]) to dynamically modify Python bytecode right before
@@ -90,279 +97,50 @@ cd tools/dynamo
python verify_dynamo.py
```

-## Usage Example
+## Getting started

-Here is a basic example of how to use TorchDynamo. One can decorate a function
-or a method using `torch._dynamo.optimize` to enable TorchDynamo optimization.
+Here is a basic example of how to use TorchDynamo. You can decorate a function
+or a method using `torch._dynamo.optimize()`, pass in the name of a compiler, e.g. `inductor`, and your code will run faster.

```py
-from typing import List
-import torch
-import torch._dynamo as dynamo
-
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    print("my_compiler() called with FX graph:")
-    gm.graph.print_tabular()
-    return gm.forward  # return a python callable
-
-@dynamo.optimize(my_compiler)
+@dynamo.optimize("inductor")
def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b
-
-fn(torch.randn(10), torch.randn(10))
-```
-
-Running the above example produces this output
-
-```
-my_compiler() called with FX graph:
-opcode         name    target                                                   args        kwargs
--------------  ------  -------------------------------------------------------  ----------  --------
-placeholder    x       x                                                        ()          {}
-placeholder    y       y                                                        ()          {}
-call_function  cos     <built-in method cos of type object at 0x7f1a894649a8>   (x,)        {}
-call_function  sin     <built-in method sin of type object at 0x7f1a894649a8>   (y,)        {}
-call_function  add     <built-in function add>                                  (cos, sin)  {}
-output         output  output                                                   ((add,),)   {}
-```
-
-This works for `torch.nn.Module` as well as shown below
-
-```py
-import torch
-import torch._dynamo as dynamo
-
-class MockModule(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.relu = torch.nn.ReLU()
-
-    def forward(self, x):
-        return self.relu(torch.cos(x))
-
-mod = MockModule()
-optimized_mod = dynamo.optimize(my_compiler)(mod)
-optimized_mod(torch.randn(10))
-```
-
-In the above examples, TorchDynamo uses a custom compiler (also referred to as
-backend in the rest of the doc) `my_compiler` that just prints the Fx
-GraphModule extracted by TorchDynamo's bytecode analysis, and returns the
-`forward` callable. One could write new compilers in a similar fashion.
-
-Let's take a look at one more example with control flow.
-```py
-from typing import List
-import torch
-import torch._dynamo as dynamo
-
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    print("my_compiler() called with FX graph:")
-    gm.graph.print_tabular()
-    return gm.forward  # return a python callable
-
-@dynamo.optimize(my_compiler)
-def toy_example(a, b):
-    x = a / (torch.abs(a) + 1)
-    if b.sum() < 0:
-        b = b * -1
-    return x * b
-
-for _ in range(100):
-    toy_example(torch.randn(10), torch.randn(10))
```

-Running this example produces the following output:
-```
-my_compiler() called with FX graph:
-opcode         name     target                                                   args              kwargs
--------------  -------  -------------------------------------------------------  ----------------  --------
-placeholder    a        a                                                        ()                {}
-placeholder    b        b                                                        ()                {}
-call_function  abs_1    <built-in method abs of type object at 0x7f8d259298a0>   (a,)              {}
-call_function  add      <built-in function add>                                  (abs_1, 1)        {}
-call_function  truediv  <built-in function truediv>                              (a, add)          {}
-call_method    sum_1    sum                                                      (b,)              {}
-call_function  lt       <built-in function lt>                                   (sum_1, 0)        {}
-output         output   output                                                   ((truediv, lt),)  {}
-
-my_compiler() called with FX graph:
-opcode         name    target                   args         kwargs
--------------  ------  -----------------------  -----------  --------
-placeholder    b       b                        ()           {}
-placeholder    x       x                        ()           {}
-call_function  mul     <built-in function mul>  (b, -1)      {}
-call_function  mul_1   <built-in function mul>  (x, mul)     {}
-output         output  output                   ((mul_1,),)  {}
-
-my_compiler() called with FX graph:
-opcode         name    target                   args       kwargs
--------------  ------  -----------------------  ---------  --------
-placeholder    b       b                        ()         {}
-placeholder    x       x                        ()         {}
-call_function  mul     <built-in function mul>  (x, b)     {}
-output         output  output                   ((mul,),)  {}
-```
+It's also easy to define your own compiler backends in pure Python: see [custom backend](./documentation/custom-backend.md).
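As an aside for readers of this commit: the shortened quick-start above no longer shows its imports, so a self-contained version might look like the sketch below (illustrative, not part of the diff; it assumes the `import torch._dynamo as dynamo` alias used by the example this commit removes).

```py
# Self-contained version of the new quick-start snippet (illustrative sketch).
import torch
import torch._dynamo as dynamo  # provides the dynamo.optimize() decorator

@dynamo.optimize("inductor")  # pass a backend by name; see "Existing Backends" below
def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b

# The first call triggers graph capture and compilation; later calls reuse the compiled graph.
print(fn(torch.randn(10), torch.randn(10)))
```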

-Note that the order of the last two graphs is nondeterministic depending
-on which one is encountered first by the just-in-time compiler.

### Existing Backends

-TorchDynamo has a growing list of backends, which can be found in [backends.py]
-or `torchdynamo.list_backends()`. Note many backends require installing
-additional packages. Some of the most commonly used backends are
+TorchDynamo has a growing list of backends, which can be found in [backends.py](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/optimizations/backends.py)
+or `torchdynamo.list_backends()`, each of which may have its own optional dependencies.
+
+Some of the most commonly used backends are:

-Debugging backends:
+**Debugging backends**:
* `dynamo.optimize("eager")` - Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues.
* `dynamo.optimize("aot_eager")` - Uses AotAutograd with no compiler, i.e., just using PyTorch eager for AotAutograd's extracted forward and backward graphs. This is useful for debugging, and unlikely to give speedups.

-Training & inference backends:
-* `dynamo.optimize("inductor")` - Uses TorchInductor backend with AotAutograd and cudagraphs. [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747)
+**Training & inference backends**:
+* `dynamo.optimize("inductor")` - Uses the TorchInductor backend with AotAutograd and cudagraphs, leveraging code-generated Triton kernels. [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747)
* `dynamo.optimize("nvfuser")` - nvFuser with TorchScript. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutograd. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757)

-Inference-only backends:
+**Inference-only backends**:
* `dynamo.optimize("ofi")` - Uses Torchscript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html)
* `dynamo.optimize("fx2trt")` - Uses Nvidia TensorRT for inference optimizations. [Read more](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst)
* `dynamo.optimize("onnxrt")` - Uses ONNXRT for inference on CPU/GPU. [Read more](https://onnxruntime.ai/)
* `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch)

-### Training and AotAutograd
-
-Torchdynamo supports training, using AotAutograd to capture backwards:
-* the .forward() graph and optimizer.step() is captured by torchdynamo's python evalframe frontend
-* for each segment of .forward() that torchdynamo captures, it uses AotAutograd to generate a backward graph segment
-* each pair of forward, backward graph are (optionally) min-cut partitioned to save the minimal state between forward/backward
-* the forward, backward pairs are wrapped in autograd.function modules
-* usercode calling .backward() still triggers eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, also running any non-compiled eager ops' .backward() functions
-
-Current limitations:
-* DDP and FSDP, which rely on autograd 'hooks' firing between backward ops to schedule communications ops, may be pessimized by having all communication ops scheduled _after_ whole compiled regions of backwards ops (WIP to fix this)
-
-Example
-```py
-model = ...
-optimizer = ...
-
-@dynamo.optimize("inductor")
-def training_iter_fn(...):
-    outputs = model(...)
-    loss = outputs.loss
-    loss.backward()
-    optimizer.step()
-    optimizer.zero_grad()
-    return loss
-
-for _ in range(100):
-    loss = training_iter_fn(...)
-```
-For more details, you can follow our [E2E model training benchmark](./benchmarks/training_loss.py) to onboard your own model training and evaluation. It runs the popular [hugging face Bert model](https://huggingface.co/docs/transformers/training) on the [Yelp Reviews dataset](https://huggingface.co/datasets/yelp_review_full). It also prints out whether the loss converged and the performance speedup compared to native PyTorch at the end.
-
-
-## Troubleshooting
-See [Troubleshooting](./documentation/TROUBLESHOOTING.md).
-
-## Adding Backends
-
-One could replace `my_compiler()` in the examples above with something that generates faster
-code, for example one using [optimize_for_inference]:
-```py
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    scripted = torch.jit.trace(gm, example_inputs)
-    return torch.jit.optimize_for_inference(scripted)
-```
-
-TorchDynamo also includes many backends, which can be found in
-[backends.py] or `torchdynamo.list_backends()`. Note many backends
-require installing additional packages. You can combine these backends
-together with code like:
-```py
-from torch._dynamo.optimizations import BACKENDS
-
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    trt_compiled = BACKENDS["tensorrt"](gm, example_inputs)
-    if trt_compiled is not None:
-        return trt_compiled
-    # first backend failed, try something else...
-
-    cudagraphs_compiled = BACKENDS["cudagraphs"](gm, example_inputs)
-    if cudagraphs_compiled is not None:
-        return cudagraphs_compiled
-
-    return gm.forward
-```
-
-[optimize_for_inference]: https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html
-[backends.py]: https://github.com/pytorch/torchdynamo/blob/main/torchdynamo/optimizations/backends.py
-
-## Guards
-
-TorchDynamo operates just-in-time and specializes graphs based on dynamic
-properties. For example, the first graph above has the following guards:
-```
-GUARDS:
- - local 'a' TENSOR_MATCH
- - local 'b' TENSOR_MATCH
- - global 'torch' FUNCTION_MATCH
-```
-
-If any of those guards fail, the graph will be recaptured and recompiled.
-The interesting guard type there is `TENSOR_MATCH`, which checks the
-following torch.Tensor properties:
-- Python class of the tensor (tensor subclassing, etc)
-- dtype
-- device
-- requires_grad
-- dispatch_key (with thread-local includes/excludes applied)
-- ndim
-- sizes* (optional)
-- strides* (optional)
-
-*For sizes/strides you can disable this specialization by setting:
-```py
-torch._dynamo.config.dynamic_shapes = True
-```
-
-The full specialization mode allows the backend compiler to assume
-an entirely static graph. Unfortunately, most backends require this.
-Operators which return dynamic shapes will trigger a graph break when
-not in dynamic shape mode.
-
-## Run Mode / Quiescence Guarantee
-
-In some cases, you may not want unexpected compiles after a program
-has warmed up. For example, if you are serving production traffic in a
-latency critical application. For this, TorchDynamo provides an alternate
-mode where prior compiled graphs are used, but no new ones are generated:
-```py
-frozen_toy_example = dynamo.run(toy_example)
-frozen_toy_example(torch.randn(10), torch.randn(10))
-```
-
-## Single Whole-Program Graph Mode
-
-In some cases, you may want to ensure there are no graph breaks in your
-program to debug performance issues. You can turn graph breaks into
-errors by setting
-`nopython=True`:
-```py
-@dynamo.optimize(my_compiler, nopython=True)
-def toy_example(a, b):
-```
+## Next steps
+* [Troubleshooting](./documentation/TROUBLESHOOTING.md)
+* [FAQ](./documentation/FAQ.md)
+* [Add your own backend](./documentation/custom-backend.md)

-Which will trigger the following error in the example program above:
-```py
-Traceback (most recent call last):
-    ...
-torch._dynamo.exc.Unsupported: generic_jump TensorVariable()
-Processing original code:
-  File "example.py", line 7, in toy_example
-    if b.sum() < 0:
-```
## License

TorchDynamo has a BSD-style license, as found in the LICENSE file.
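Since this commit moves the custom-backend walkthrough out of the README and into `documentation/custom-backend.md`, here is a condensed sketch of the pattern the removed text described: the printing `my_compiler` backend (illustrative, not part of the diff).

```py
from typing import List
import torch
import torch._dynamo as dynamo

def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # A backend is just a callable: it receives the captured FX graph plus
    # example inputs and returns a Python callable to run in its place.
    print("my_compiler() called with FX graph:")
    gm.graph.print_tabular()
    return gm.forward  # return a python callable

@dynamo.optimize(my_compiler)
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # data-dependent branch: TorchDynamo breaks the graph here
        b = b * -1
    return x * b

toy_example(torch.randn(10), torch.randn(10))
```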

documentation/DeeperDive.md

+36 -1
@@ -1,4 +1,39 @@
-## TorchDynamo Deeper Dive
+# TorchDynamo Deeper Dive
+
+## What is a guard?
+
+TorchDynamo operates just-in-time and specializes graphs based on dynamic
+properties. For example, the first graph above has the following guards:
+```
+GUARDS:
+ - local 'a' TENSOR_MATCH
+ - local 'b' TENSOR_MATCH
+ - global 'torch' FUNCTION_MATCH
+```
+
+If any of those guards fail, the graph will be recaptured and recompiled.
+The interesting guard type there is `TENSOR_MATCH`, which checks the
+following torch.Tensor properties:
+- Python class of the tensor (tensor subclassing, etc)
+- dtype
+- device
+- requires_grad
+- dispatch_key (with thread-local includes/excludes applied)
+- ndim
+- sizes* (optional)
+- strides* (optional)
+
+*For sizes/strides you can disable this specialization by setting:
+```py
+torch._dynamo.config.dynamic_shapes = True
+```
+
+The full specialization mode allows the backend compiler to assume
+an entirely static graph. Unfortunately, most backends require this.
+Operators which return dynamic shapes will trigger a graph break when
+not in dynamic shape mode.
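To make the guard behavior concrete, a rough sketch of guard-driven recompilation follows (illustrative, not from the diff; it uses a trivial printing backend in the style of the README's removed `my_compiler` example):

```py
import torch
import torch._dynamo as dynamo

def noisy_backend(gm, example_inputs):
    # Called once per (re)compilation, so it shows when a failed guard forces a new graph.
    print(f"compiling a graph with {len(example_inputs)} input(s)")
    return gm.forward

@dynamo.optimize(noisy_backend)
def f(x):
    return x * 2

f(torch.randn(4))                        # first call: capture and compile
f(torch.randn(4))                        # guards hold, cached graph is reused
f(torch.randn(4, dtype=torch.float64))   # TENSOR_MATCH on dtype fails, so it recompiles
```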
+
+## What is dynamo doing?

If you want to understand better what TorchDynamo is doing, you can set:
```py

documentation/FAQ.md

+20
@@ -5,6 +5,15 @@ Below is the TorchDynamo compiler stack.

At a high level, the TorchDynamo stack consists of a graph capture from Python code using dynamo and a backend compiler. In this example the backend compiler consists of backward graph tracing using AOTAutograd and graph lowering using TorchInductor. There are of course many more compilers available here: https://github.com/pytorch/torchdynamo/blob/0b8aaf340dad4777a080ef24bf09623f1aa6f3dd/README.md#existing-backend, but for this document we will focus on inductor as a motivating example.

+Torchdynamo supports training, using AotAutograd to capture backwards:
+1. the `.forward()` graph and `optimizer.step()` are captured by torchdynamo's python evalframe frontend
+2. for each segment of `.forward()` that torchdynamo captures, it uses AotAutograd to generate a backward graph segment
+3. each pair of forward and backward graphs is (optionally) min-cut partitioned to save the minimal state between forward/backward
+4. the forward, backward pairs are wrapped in autograd.function modules
+5. usercode calling `.backward()` still triggers eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, also running any non-compiled eager ops' .backward() functions
+
+Current limitations:
+* DDP and FSDP, which rely on autograd 'hooks' firing between backward ops to schedule communications ops, may be pessimized by having all communication ops scheduled _after_ whole compiled regions of backwards ops (WIP to fix this)
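A minimal sketch of what such a compiled training step can look like (adapted from the training example this commit removes from the README; the linear model, loss, and data here are placeholders):

```py
import torch
import torch._dynamo as dynamo

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

@dynamo.optimize("inductor")  # or "aot_eager" to exercise the same path without a real compiler
def training_iter_fn(batch, target):
    output = model(batch)
    loss = torch.nn.functional.mse_loss(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

for _ in range(100):
    loss = training_iter_fn(torch.randn(8, 10), torch.randn(8, 1))
```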

## Why is my code crashing?

@@ -69,6 +78,17 @@ print(prof.report())

Many of the reasons for graph breaks and excessive recompilation will be fixed with upcoming support for [tracing dynamic tensor shapes](https://docs.google.com/document/d/1QJB-GOnbv-9PygGlOMXwiO9K6vVNm8sNg_olixJ9koc/edit?usp=sharing), more careful choices for guards and better tuned heuristics.

+### Why are you recompiling in production?
+
+In some cases, you may not want unexpected compiles after a program
+has warmed up. For example, if you are serving production traffic in a
+latency critical application. For this, TorchDynamo provides an alternate
+mode where prior compiled graphs are used, but no new ones are generated:
+```py
+frozen_toy_example = dynamo.run(toy_example)
+frozen_toy_example(torch.randn(10), torch.randn(10))
+```
+
## Why am I not seeing speedups?

### Graph Breaks
