
Commit 8536908

Separate pitch, quick start and internals in README.md (#1742)
* Revamp main README.md
* up
* push
* push
* Update README.md
1 parent 9113926 commit 8536908

4 files changed: +240 -247 lines changed


README.md

+24 -246
@@ -19,6 +19,13 @@ This repository still contains:

# TorchDynamo

+> TorchDynamo makes it easy to experiment with different compiler backends to make PyTorch code faster with a single line decorator `torch._dynamo.optimize()`
+
+TorchDynamo supports arbitrary PyTorch code, control flow, mutation and dynamic shapes.
+
+You can follow our nightly benchmarks [here](https://github.com/pytorch/torchdynamo/issues/681)
+
TorchDynamo is a Python-level JIT compiler designed to make unmodified
PyTorch programs faster. TorchDynamo hooks into the frame evaluation API
in CPython ([PEP 523]) to dynamically modify Python bytecode right before
@@ -90,279 +97,50 @@ cd tools/dynamo
python verify_dynamo.py
```

-## Usage Example
+## Getting started

-Here is a basic example of how to use TorchDynamo. One can decorate a function
-or a method using `torch._dynamo.optimize` to enable TorchDynamo optimization.
+Here is a basic example of how to use TorchDynamo. You can decorate a function
+or a method using `torch._dynamo.optimize()`, pass in the name of a compiler, e.g. `inductor`, and your code will run faster.

```py
-from typing import List
-import torch
-import torch._dynamo as dynamo
-
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    print("my_compiler() called with FX graph:")
-    gm.graph.print_tabular()
-    return gm.forward  # return a python callable
-
-@dynamo.optimize(my_compiler)
+@dynamo.optimize("inductor")
def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b
-
-fn(torch.randn(10), torch.randn(10))
-```
-
-Running the above example produces this output
-
-```
-my_compiler() called with FX graph:
-opcode         name    target                                                   args        kwargs
--------------  ------  -------------------------------------------------------  ----------  --------
-placeholder    x       x                                                        ()          {}
-placeholder    y       y                                                        ()          {}
-call_function  cos     <built-in method cos of type object at 0x7f1a894649a8>   (x,)        {}
-call_function  sin     <built-in method sin of type object at 0x7f1a894649a8>   (y,)        {}
-call_function  add     <built-in function add>                                  (cos, sin)  {}
-output         output  output                                                   ((add,),)   {}
-```
-
-This works for `torch.nn.Module` as well as shown below
-
-```py
-import torch
-import torch._dynamo as dynamo
-
-class MockModule(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.relu = torch.nn.ReLU()
-
-    def forward(self, x):
-        return self.relu(torch.cos(x))
-
-mod = MockModule()
-optimized_mod = dynamo.optimize(my_compiler)(mod)
-optimized_mod(torch.randn(10))
-```
-
-In the above examples, TorchDynamo uses a custom compiler (also referred to as
-backend in the rest of the doc) `my_compiler` that just prints the Fx
-GraphModule extracted by TorchDynamo's bytecode analysis, and returns the
-`forward` callable. One could write new compilers in a similar fashion.
-
-Let's take a look at one more example with control flow.
-```py
-from typing import List
-import torch
-import torch._dynamo as dynamo
-
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    print("my_compiler() called with FX graph:")
-    gm.graph.print_tabular()
-    return gm.forward  # return a python callable
-
-@dynamo.optimize(my_compiler)
-def toy_example(a, b):
-    x = a / (torch.abs(a) + 1)
-    if b.sum() < 0:
-        b = b * -1
-    return x * b
-
-for _ in range(100):
-    toy_example(torch.randn(10), torch.randn(10))
```

-Running this example produces the following output:
-```
-my_compiler() called with FX graph:
-opcode         name     target                                                   args              kwargs
--------------  -------  -------------------------------------------------------  ----------------  --------
-placeholder    a        a                                                        ()                {}
-placeholder    b        b                                                        ()                {}
-call_function  abs_1    <built-in method abs of type object at 0x7f8d259298a0>   (a,)              {}
-call_function  add      <built-in function add>                                  (abs_1, 1)        {}
-call_function  truediv  <built-in function truediv>                              (a, add)          {}
-call_method    sum_1    sum                                                      (b,)              {}
-call_function  lt       <built-in function lt>                                   (sum_1, 0)        {}
-output         output   output                                                   ((truediv, lt),)  {}
-
-my_compiler() called with FX graph:
-opcode         name    target                   args         kwargs
--------------  ------  -----------------------  -----------  --------
-placeholder    b       b                        ()           {}
-placeholder    x       x                        ()           {}
-call_function  mul     <built-in function mul>  (b, -1)      {}
-call_function  mul_1   <built-in function mul>  (x, mul)     {}
-output         output  output                   ((mul_1,),)  {}
-
-my_compiler() called with FX graph:
-opcode         name    target                   args       kwargs
--------------  ------  -----------------------  ---------  --------
-placeholder    b       b                        ()         {}
-placeholder    x       x                        ()         {}
-call_function  mul     <built-in function mul>  (x, b)     {}
-output         output  output                   ((mul,),)  {}
-```
+It's also easy to define your own compiler backends in pure Python: see [custom backend](./documentation/custom-backend.md).
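As an aside for readers of this commit: the shortened quick-start above no longer shows its imports, so a self-contained version might look like the sketch below (illustrative, not part of the diff; it assumes the `import torch._dynamo as dynamo` alias used by the example this commit removes).

```py
# Self-contained version of the new quick-start snippet (illustrative sketch).
import torch
import torch._dynamo as dynamo  # provides the dynamo.optimize() decorator

@dynamo.optimize("inductor")  # pass a backend by name; see "Existing Backends" below
def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b

# The first call triggers graph capture and compilation; later calls reuse the compiled graph.
print(fn(torch.randn(10), torch.randn(10)))
```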

-Note that the order of the last two graphs is nondeterministic depending
-on which one is encountered first by the just-in-time compiler.

### Existing Backends

-TorchDynamo has a growing list of backends, which can be found in [backends.py]
-or `torchdynamo.list_backends()`. Note many backends require installing
-additional packages. Some of the most commonly used backends are
+TorchDynamo has a growing list of backends, which can be found in [backends.py](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/optimizations/backends.py)
+or `torchdynamo.list_backends()`, each of which may have its own optional dependencies.
+
+Some of the most commonly used backends are:

-Debugging backends:
+**Debugging backends**:
* `dynamo.optimize("eager")` - Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues.
* `dynamo.optimize("aot_eager")` - Uses AotAutograd with no compiler, i.e., just using PyTorch eager for AotAutograd's extracted forward and backward graphs. This is useful for debugging, and unlikely to give speedups.

-Training & inference backends:
-* `dynamo.optimize("inductor")` - Uses TorchInductor backend with AotAutograd and cudagraphs. [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747)
+**Training & inference backends**:
+* `dynamo.optimize("inductor")` - Uses the TorchInductor backend with AotAutograd and cudagraphs, leveraging code-generated Triton kernels. [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747)
* `dynamo.optimize("nvfuser")` - nvFuser with TorchScript. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutograd. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757)

-Inference-only backends:
+**Inference-only backends**:
* `dynamo.optimize("ofi")` - Uses Torchscript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html)
* `dynamo.optimize("fx2trt")` - Uses Nvidia TensorRT for inference optimizations. [Read more](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst)
* `dynamo.optimize("onnxrt")` - Uses ONNXRT for inference on CPU/GPU. [Read more](https://onnxruntime.ai/)
* `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch)

-### Training and AotAutograd
-
-Torchdynamo supports training, using AotAutograd to capture backwards:
-* the .forward() graph and optimizer.step() is captured by torchdynamo's python evalframe frontend
-* for each segment of .forward() that torchdynamo captures, it uses AotAutograd to generate a backward graph segment
-* each pair of forward, backward graph are (optionally) min-cut partitioned to save the minimal state between forward/backward
-* the forward, backward pairs are wrapped in autograd.function modules
-* usercode calling .backward() still triggers eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, also running any non-compiled eager ops' .backward() functions
-
-Current limitations:
-* DDP and FSDP, which rely on autograd 'hooks' firing between backward ops to schedule communications ops, may be pessimized by having all communication ops scheduled _after_ whole compiled regions of backwards ops (WIP to fix this)
-
-Example
-```py
-model = ...
-optimizer = ...
-
-@dynamo.optimize("inductor")
-def training_iter_fn(...):
-    outputs = model(...)
-    loss = outputs.loss
-    loss.backward()
-    optimizer.step()
-    optimizer.zero_grad()
-    return loss
-
-for _ in range(100):
-    loss = training_iter_fn(...)
-```
-For more details, you can follow our [E2E model training benchmark](./benchmarks/training_loss.py) to onboard your own model training and evaluation. It runs the popular [hugging face Bert model](https://huggingface.co/docs/transformers/training) on the [Yelp Reviews dataset](https://huggingface.co/datasets/yelp_review_full). It also prints out whether the loss converged and the performance speedup compared to native PyTorch at the end.
-
-
-## Troubleshooting
-See [Troubleshooting](./documentation/TROUBLESHOOTING.md).
-
-## Adding Backends
-
-One could replace `my_compiler()` in the examples above with something that generates faster
-code, for example one using [optimize_for_inference]:
-```py
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    scripted = torch.jit.trace(gm, example_inputs)
-    return torch.jit.optimize_for_inference(scripted)
-```
-
-TorchDynamo also includes many backends, which can be found in
-[backends.py] or `torchdynamo.list_backends()`. Note many backends
-require installing additional packages. You can combine these backends
-together with code like:
-```py
-from torch._dynamo.optimizations import BACKENDS
-
-def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
-    trt_compiled = BACKENDS["tensorrt"](gm, example_inputs)
-    if trt_compiled is not None:
-        return trt_compiled
-    # first backend failed, try something else...
-
-    cudagraphs_compiled = BACKENDS["cudagraphs"](gm, example_inputs)
-    if cudagraphs_compiled is not None:
-        return cudagraphs_compiled
-
-    return gm.forward
-```
-
-[optimize_for_inference]: https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html
-[backends.py]: https://github.com/pytorch/torchdynamo/blob/main/torchdynamo/optimizations/backends.py
-
-## Guards
-
-TorchDynamo operates just-in-time and specializes graphs based on dynamic
-properties. For example, the first graph above has the following guards:
-```
-GUARDS:
- - local 'a' TENSOR_MATCH
- - local 'b' TENSOR_MATCH
- - global 'torch' FUNCTION_MATCH
-```
-
-If any of those guards fail, the graph will be recaptured and recompiled.
-The interesting guard type there is `TENSOR_MATCH`, which checks the
-following torch.Tensor properties:
-- Python class of the tensor (tensor subclassing, etc)
-- dtype
-- device
-- requires_grad
-- dispatch_key (with thread-local includes/excludes applied)
-- ndim
-- sizes* (optional)
-- strides* (optional)
-
-*For sizes/strides you can disable this specialization by setting:
-```py
-torch._dynamo.config.dynamic_shapes = True
-```
-
-The full specialization mode allows the backend compiler to assume
-an entirely static graph. Unfortunately, most backends require this.
-Operators which return dynamic shapes will trigger a graph break when
-not in dynamic shape mode.
-
-## Run Mode / Quiescence Guarantee
-
-In some cases, you may not want unexpected compiles after a program
-has warmed up. For example, if you are serving production traffic in a
-latency critical application. For this, TorchDynamo provides an alternate
-mode where prior compiled graphs are used, but no new ones are generated:
-```py
-frozen_toy_example = dynamo.run(toy_example)
-frozen_toy_example(torch.randn(10), torch.randn(10))
-```
-
-## Single Whole-Program Graph Mode
-
-In some cases, you may want to ensure there are no graph breaks in your
-program to debug performance issues. You can turn graph breaks into
-errors by setting
-`nopython=True`:
-```py
-@dynamo.optimize(my_compiler, nopython=True)
-def toy_example(a, b):
-```
+## Next steps
+* [Troubleshooting](./documentation/TROUBLESHOOTING.md)
+* [FAQ](./documentation/FAQ.md)
+* [Add your own backend](./documentation/custom-backend.md)

-Which will trigger the following error in the example program above:
-```py
-Traceback (most recent call last):
-    ...
-torch._dynamo.exc.Unsupported: generic_jump TensorVariable()
-Processing original code:
-  File "example.py", line 7, in toy_example
-    if b.sum() < 0:
-```
## License

TorchDynamo has a BSD-style license, as found in the LICENSE file.
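Since this commit moves the custom-backend walkthrough out of the README and into `documentation/custom-backend.md`, here is a condensed sketch of the pattern the removed text described: the printing `my_compiler` backend (illustrative, not part of the diff).

```py
from typing import List
import torch
import torch._dynamo as dynamo

def my_compiler(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
    # A backend is just a callable: it receives the captured FX graph plus
    # example inputs and returns a Python callable to run in its place.
    print("my_compiler() called with FX graph:")
    gm.graph.print_tabular()
    return gm.forward  # return a python callable

@dynamo.optimize(my_compiler)
def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    if b.sum() < 0:  # data-dependent branch: TorchDynamo breaks the graph here
        b = b * -1
    return x * b

toy_example(torch.randn(10), torch.randn(10))
```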

documentation/DeeperDive.md

+36 -1
@@ -1,4 +1,39 @@
-## TorchDynamo Deeper Dive
+# TorchDynamo Deeper Dive
+
+## What is a guard?
+
+TorchDynamo operates just-in-time and specializes graphs based on dynamic
+properties. For example, the first graph above has the following guards:
+```
+GUARDS:
+ - local 'a' TENSOR_MATCH
+ - local 'b' TENSOR_MATCH
+ - global 'torch' FUNCTION_MATCH
+```
+
+If any of those guards fail, the graph will be recaptured and recompiled.
+The interesting guard type there is `TENSOR_MATCH`, which checks the
+following torch.Tensor properties:
+- Python class of the tensor (tensor subclassing, etc)
+- dtype
+- device
+- requires_grad
+- dispatch_key (with thread-local includes/excludes applied)
+- ndim
+- sizes* (optional)
+- strides* (optional)
+
+*For sizes/strides you can disable this specialization by setting:
+```py
+torch._dynamo.config.dynamic_shapes = True
+```
+
+The full specialization mode allows the backend compiler to assume
+an entirely static graph. Unfortunately, most backends require this.
+Operators which return dynamic shapes will trigger a graph break when
+not in dynamic shape mode.
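To make the guard behavior concrete, a rough sketch of guard-driven recompilation follows (illustrative, not from the diff; it uses a trivial printing backend in the style of the README's removed `my_compiler` example):

```py
import torch
import torch._dynamo as dynamo

def noisy_backend(gm, example_inputs):
    # Called once per (re)compilation, so it shows when a failed guard forces a new graph.
    print(f"compiling a graph with {len(example_inputs)} input(s)")
    return gm.forward

@dynamo.optimize(noisy_backend)
def f(x):
    return x * 2

f(torch.randn(4))                        # first call: capture and compile
f(torch.randn(4))                        # guards hold, cached graph is reused
f(torch.randn(4, dtype=torch.float64))   # TENSOR_MATCH on dtype fails, so it recompiles
```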
+
+## What is dynamo doing?

If you want to understand better what TorchDynamo is doing, you can set:
```py

documentation/FAQ.md

+20
@@ -5,6 +5,15 @@ Below is the TorchDynamo compiler stack.

At a high level, the TorchDynamo stack consists of a graph capture from Python code using dynamo and a backend compiler. In this example the backend compiler consists of backward graph tracing using AOTAutograd and graph lowering using TorchInductor. There are of course many more compilers available here: https://github.com/pytorch/torchdynamo/blob/0b8aaf340dad4777a080ef24bf09623f1aa6f3dd/README.md#existing-backend, but for this document we will focus on inductor as a motivating example.

+Torchdynamo supports training, using AotAutograd to capture backwards:
+1. the `.forward()` graph and `optimizer.step()` are captured by torchdynamo's python evalframe frontend
+2. for each segment of `.forward()` that torchdynamo captures, it uses AotAutograd to generate a backward graph segment
+3. each pair of forward and backward graphs is (optionally) min-cut partitioned to save the minimal state between forward/backward
+4. the forward, backward pairs are wrapped in autograd.function modules
+5. usercode calling `.backward()` still triggers eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, also running any non-compiled eager ops' .backward() functions
+
+Current limitations:
+* DDP and FSDP, which rely on autograd 'hooks' firing between backward ops to schedule communications ops, may be pessimized by having all communication ops scheduled _after_ whole compiled regions of backwards ops (WIP to fix this)
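A minimal sketch of what such a compiled training step can look like (adapted from the training example this commit removes from the README; the linear model, loss, and data here are placeholders):

```py
import torch
import torch._dynamo as dynamo

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

@dynamo.optimize("inductor")  # or "aot_eager" to exercise the same path without a real compiler
def training_iter_fn(batch, target):
    output = model(batch)
    loss = torch.nn.functional.mse_loss(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

for _ in range(100):
    loss = training_iter_fn(torch.randn(8, 10), torch.randn(8, 1))
```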

## Why is my code crashing?

@@ -69,6 +78,17 @@ print(prof.report())

Many of the reasons for graph breaks and excessive recompilation will be fixed with upcoming support for [tracing dynamic tensor shapes](https://docs.google.com/document/d/1QJB-GOnbv-9PygGlOMXwiO9K6vVNm8sNg_olixJ9koc/edit?usp=sharing), more careful choices for guards and better tuned heuristics.

+### Why are you recompiling in production?
+
+In some cases, you may not want unexpected compiles after a program
+has warmed up. For example, if you are serving production traffic in a
+latency critical application. For this, TorchDynamo provides an alternate
+mode where prior compiled graphs are used, but no new ones are generated:
+```py
+frozen_toy_example = dynamo.run(toy_example)
+frozen_toy_example(torch.randn(10), torch.randn(10))
+```
+
## Why am I not seeing speedups?

### Graph Breaks
