
[WIP] Thunderfx report #1636

Draft · wants to merge 7 commits into main
Conversation

kiya00 (Collaborator) commented Jan 10, 2025

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

1. Adds the thunderfx_report API

thunderfx_report(
    fn: Callable,
    *args,
    compile_kwargs: dict = None,
    folder_path: str | PathLike = "/tmp/thunderfx_report",
    check_consistency: bool = True,
    check_benchmark: bool = True,
    save_consistency_inputs: bool = False,
    save_benchmark_inputs: bool = False,
    **kwargs,
)

It:

  1. prints the exception string if ThunderFX fails and saves the repro scripts when possible;
  2. if check_consistency=True, checks the results of thunder.jit (with compile_kwargs) against eager; the scripts are saved in the consistency folder, and the inputs are saved in .pt files when save_consistency_inputs=True (otherwise random tensors are used);
  3. similarly, if check_benchmark=True, saves the benchmark scripts in the benchmark folder, with the option to save the input tensors;
  4. prints the report to the screen.

Example
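A minimal invocation sketch (the import path is an assumption based on the generated scripts below, which import from thunder.dynamo.report; fn and its input here are placeholders):

import torch
from thunder.dynamo.report import thunderfx_report  # assumed location of the API

def fn(x):
    return x.sin().cos()

x = torch.randn(4, 4, device="cuda", requires_grad=True)
thunderfx_report(fn, x, check_consistency=True, check_benchmark=True)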

The saved file structure (when save_benchmark_inputs=True):

.
├── benchmark
│   ├── __pycache__
│   │   ├── graph0_thunder_0.cpython-312-pytest-8.1.1.pyc
│   │   └── graph1_thunder_0.cpython-312-pytest-8.1.1.pyc
│   ├── graph0_thunder_0.json
│   ├── graph0_thunder_0.py
│   ├── graph0_thunder_0_inputs.pt
│   ├── graph1_thunder_0.json
│   ├── graph1_thunder_0.py
│   └── graph1_thunder_0_inputs.pt
├── consistency
│   ├── graph0_thunder_0.py
│   └── graph1_thunder_0.py
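The saved *_inputs.pt files should be loadable with torch.load for manual inspection (a sketch; it assumes the inputs were serialized as a list of tensors):

import torch

inputs = torch.load("benchmark/graph0_thunder_0_inputs.pt")
print([t.shape for t in inputs])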

The report is displayed on the screen as follows (there are two FX graphs: graph0_thunder_0 and graph1_thunder_0):

Content of the report
The input callable can be successfully executed by ThunderFX.

Verifying consistency between Thunder and Torch eager ...
[graph0_thunder_0] Consistency check succeeded
[graph1_thunder_0] Consistency check succeeded

Analyzing performance through benchmarking, this might take a moment...
graph0_thunder_0:
Running 8 items in this shard
........                                                                 [100%]

----------------------------------------------------------------------------- benchmark 'compute_type=ComputeType.TRAINING_BACKWARD': 4 tests ------------------------------------------------------------------------------
Name (time in us)                                        Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph0_thunder_0[backward-eager]                34.6431 (1.0)       53.1871 (1.0)       35.8379 (1.0)      3.1075 (1.0)       35.0285 (1.0)      0.4629 (1.0)       122;172       27.9034 (1.0)        2408          12
test_graph0_thunder_0[backward-torch_inductor]       65.1505 (1.88)      97.6453 (1.84)      67.9273 (1.90)     4.2301 (1.36)      66.8875 (1.91)     1.2630 (2.73)       88;110       14.7216 (0.53)       1534          10
test_graph0_thunder_0[backward-thunder_cugraph]     133.6019 (3.86)     211.0944 (3.97)     138.7169 (3.87)     6.9738 (2.24)     136.7520 (3.90)     2.3469 (5.07)        66;79        7.2089 (0.26)        743          10
test_graph0_thunder_0[backward-thunder]             134.0730 (3.87)     188.0126 (3.53)     138.3093 (3.86)     5.5063 (1.77)     136.6975 (3.90)     2.2269 (4.81)        57;66        7.2302 (0.26)        746          10
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_FORWARD': 4 tests -----------------------------------------------------------------------------
Name (time in us)                                       Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph0_thunder_0[forward-eager]                 4.2473 (1.0)        6.4680 (1.0)        4.3972 (1.0)      0.3725 (1.0)        4.3045 (1.0)      0.0299 (1.0)       119;180      227.4189 (1.0)        2351         100
test_graph0_thunder_0[forward-torch_inductor]       19.3764 (4.56)      31.3543 (4.85)      20.2809 (4.61)     1.9849 (5.33)      19.8094 (4.60)     0.2820 (9.44)      120;185       49.3076 (0.22)       2573          20
test_graph0_thunder_0[forward-thunder]             123.7120 (29.13)    159.8947 (24.72)    128.0359 (29.12)    4.5992 (12.35)    126.7161 (29.44)    2.0274 (67.85)       60;82        7.8103 (0.03)        808          10
test_graph0_thunder_0[forward-thunder_cugraph]     128.7621 (30.32)    163.7167 (25.31)    132.5799 (30.15)    4.5898 (12.32)    131.4400 (30.54)    1.8199 (60.91)       44;47        7.5426 (0.03)        776          10
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
8 passed, 54393 warnings in 50.37s


graph1_thunder_0:
Running 8 items in this shard
........                                                                 [100%]

------------------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_BACKWARD': 4 tests ------------------------------------------------------------------------------
Name (time in us)                                        Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph1_thunder_0[backward-eager]                51.0826 (1.0)       91.2683 (1.0)       53.0721 (1.0)      3.9710 (1.02)     52.1101 (1.0)      0.6900 (1.0)        99;149       18.8423 (1.0)        1936          10
test_graph1_thunder_0[backward-torch_inductor]       79.6944 (1.56)     105.2492 (1.15)      82.7173 (1.56)     3.8760 (1.0)      81.7223 (1.57)     1.1479 (1.66)        69;83       12.0894 (0.64)       1262          10
test_graph1_thunder_0[backward-thunder_cugraph]     149.4829 (2.93)     230.7002 (2.53)     154.4036 (2.91)     7.6522 (1.97)    152.0009 (2.92)     2.9549 (4.28)        67;74        6.4765 (0.34)        670          10
test_graph1_thunder_0[backward-thunder]             161.7689 (3.17)     514.5594 (5.64)     171.0046 (3.22)    17.8343 (4.60)    166.0554 (3.19)     4.3313 (6.28)      147;180        5.8478 (0.31)       2063           3
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_FORWARD': 4 tests -----------------------------------------------------------------------------
Name (time in us)                                       Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph1_thunder_0[forward-eager]                 7.1465 (1.0)        9.6177 (1.0)        7.3252 (1.0)      0.3800 (1.0)        7.2330 (1.0)      0.0507 (1.0)        69;120      136.5153 (1.0)        1398         100
test_graph1_thunder_0[forward-torch_inductor]       18.9999 (2.66)      35.8237 (3.72)      20.1437 (2.75)     2.3191 (6.10)      19.4813 (2.69)     0.4085 (8.06)      162;247       49.6434 (0.36)       2622          20
test_graph1_thunder_0[forward-thunder]             129.5229 (18.12)    199.6013 (20.75)    134.8926 (18.41)    5.5045 (14.49)    133.7800 (18.50)    3.8628 (76.20)       56;52        7.4133 (0.05)        771          10
test_graph1_thunder_0[forward-thunder_cugraph]     132.3270 (18.52)    220.7174 (22.95)    137.8833 (18.82)    6.5722 (17.30)    136.0469 (18.81)    3.3962 (67.00)       59;62        7.2525 (0.05)        755          10
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
8 passed, 52474 warnings in 51.08s


Max allocated memory usage:
graph0_thunder_0[forward-thunder]: 9.765625e-07 GB
graph0_thunder_0[forward-torch_inductor]: 9.765625e-07 GB
graph0_thunder_0[forward-eager]: 9.765625e-07 GB
graph0_thunder_0[forward-thunder_cugraph]: 9.765625e-07 GB


graph0_thunder_0[backward-thunder]: 2.9296875e-06 GB
graph0_thunder_0[backward-torch_inductor]: 2.9296875e-06 GB
graph0_thunder_0[backward-eager]: 2.9296875e-06 GB
graph0_thunder_0[backward-thunder_cugraph]: 2.9296875e-06 GB


graph1_thunder_0[forward-thunder]: 1.46484375e-06 GB
graph1_thunder_0[forward-torch_inductor]: 1.46484375e-06 GB
graph1_thunder_0[forward-thunder_cugraph]: 1.46484375e-06 GB
graph1_thunder_0[forward-eager]: 1.953125e-06 GB


graph1_thunder_0[backward-thunder]: 3.90625e-06 GB
graph1_thunder_0[backward-torch_inductor]: 3.90625e-06 GB
graph1_thunder_0[backward-thunder_cugraph]: 3.90625e-06 GB
graph1_thunder_0[backward-eager]: 4.39453125e-06 GB

2. Modifies the reproducer script to accept command-line options

save_reproducer_to_folder(self, reproducer_folder: str | PathLike, use_pytest_benchmark: bool = False, save_input_tensor: bool = False) can be used to save a benchmark/reproducer script with random or dumped (.pt) input tensors.
The reproducer script exposes configurable options; the defaults are --check_consistency=False --compute_type=forward --executor=thunder.
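For example, a generated script can then be driven like this (the command is taken from the discussion below; paths assume the folder layout shown above):

python repro/graph0_thunder_0.py --check_consistency=True --compute_type=forward --executor=torch_inductor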

An example repro script:
"""
Environment information gathered from `torch.utils.collect_env.get_pretty_env_info()`:
PyTorch version: 2.6.0a0+ecf3bae40a.nvInternal
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-44-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.61
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX 6000 Ada Generation
GPU 1: NVIDIA RTX 6000 Ada Generation

Nvidia driver version: 545.29.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.7.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
BIOS Vendor ID:                     Advanced Micro Devices, Inc.
Model name:                         AMD Ryzen 9 7950X 16-Core Processor
BIOS Model name:                    AMD Ryzen 9 7950X 16-Core Processor             Unknown CPU @ 4.5GHz
BIOS CPU family:                    107
CPU family:                         25
Model:                              97
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           2
CPU(s) scaling MHz:                 63%
CPU max MHz:                        5881.0000
CPU min MHz:                        400.0000
BogoMIPS:                           8999.83
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           16 MiB (16 instances)
L3 cache:                           64 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cudnn-frontend==1.9.0
[pip3] optree==0.14.0
[pip3] optree==0.14.0
[pip3] pytorch-lightning==2.5.0.post0
[pip3] pytorch-triton==3.1.0+cf34004b8.internal
[pip3] torch==2.6.0a0+ecf3bae40a.nvinternal
[pip3] torchmetrics==1.6.1
[pip3] torchvision==0.20.0a0
[conda] Could not collect

Versions of Thunder related libraries:
lightning-thunder==0.2.0.dev0
nvfuser==0.2.24+git062dd50

Split Information:
The original graph is not split, and is entirely run by Thunder.
"""

from math import inf
from math import nan
NoneType = type(None)
import torch
from torch import device
import torch.fx._pytree as fx_pytree
import torch.utils._pytree as pytree
from functools import partial
import thunder
from thunder.transforms.cudagraph import CUDAGraphTransform
from thunder.dev_utils.nvtx_profile_transform import NvtxProfileTransform

# NOTE: The reproducer function has already been processed by TorchDynamo.
# If we let it go through TorchDynamo again, it could be segmented further.
# To avoid this, we directly use Inductor here.
# See issue https://github.com/Lightning-AI/lightning-thunder/issues/1521
def torch_inductor(fn, inputs):
    from torch._inductor import compile as inductor_compile
    from torch.fx import symbolic_trace

    fx_graph = symbolic_trace(fn)
    return inductor_compile(fx_graph, inputs)


# Executors selectable via --executor; "eager" maps to None, meaning run the module as-is.
bench_executors_dict = {}
bench_executors_dict["thunder"] = partial(thunder.jit)
bench_executors_dict["torch_inductor"] = torch_inductor
bench_executors_dict["eager"] = None
bench_executors_dict["thunder_cugraph"] = partial(thunder.jit, transform=CUDAGraphTransform())


import argparse

parser = argparse.ArgumentParser(description="Script for executing an FX graph with specified configurations.")

parser.add_argument(
    "--check_consistency",
    # NOTE: type=bool would treat any non-empty string (even "False") as True,
    # so parse the boolean flag value explicitly.
    type=lambda s: s.lower() in ("true", "1", "yes"),
    default=False,
    help="Whether to check consistency (default: False)"
)
parser.add_argument(
    "--compute_type",
    type=str,
    choices=["forward", "forward+backward"],
    default="forward",
    help="Type of computation to perform (forward, forward+backward)"
)
parser.add_argument(
    "--executor",
    type=str,
    choices=["torch_inductor", "thunder", "eager"],
    default="thunder",
    help="Type of executors (thunder, torch_inductor, or eager)"
)

args = parser.parse_args()
ex_name = args.executor
compute_type = args.compute_type
check_acc = args.check_consistency

def test_graph0_thunder_0():
    class DynamoModule(torch.nn.Module):
      def forward(self, l_x_ : torch.Tensor):
          y = l_x_.sin();  l_x_ = None
          return y

    inputs = [
        torch.testing.make_tensor((4, 4), dtype=torch.float32, device='cuda:0', requires_grad=True, low=-1.35869300365448, high=1.8803045749664307).as_strided((4, 4), (4, 1)),

    ]

    mod = DynamoModule()
    from thunder.dynamo.report import run_repro

    result = run_repro(bench_executors_dict, ex_name, mod, compute_type, *inputs)
    if check_acc:
        eager_result = run_repro(bench_executors_dict, "eager", mod, compute_type, *inputs)
        for (compute_t, eager_v), (_, cur_v) in zip(eager_result.items(), result.items()):
            torch.testing.assert_close(eager_v, cur_v, msg=lambda e : f'{compute_t}: {e}')


test_graph0_thunder_0()

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

kiya00 (Collaborator, Author) commented Jan 22, 2025

Hi @mruberry @IvanYashchuk @kshitij12345, I've prepared an initial version of the report, which you can find in the PR description. Could you let me know if this aligns with the direction you had in mind? I'd also appreciate any suggestions you have regarding the report's content. Thank you!

mruberry (Collaborator) commented:
fyi @riccardofelluga, too, since he's had some experience with benchmarking

mruberry (Collaborator) commented:
Super cool!

A few notes and questions:

  • What does graph0_thunder_0 mean? Is it the first FX graph (graph0) and then the first segment that thunder creates?
  • NoneType can be imported from the types module: from types import NoneType
  • If benchmarking, can the executors to be used be passed to the function? Like if I wanted to benchmark eager and torch.compile, could I just pass those two to thunderfx_report, maybe like: executors=(('eager', lambda x: x), ('torchcompile', torch.compile))?

For this PR, let's focus on clarifying these small issues and then merge the thunderfx_report function so it can be used.

For the next PR, I think it'd be useful to create an FXReport object. What I'm thinking is:

  • Create a new function called fx_report (not Thunder-specific)
  • Return a FXReport object from fx_report
  • The FXReport object has a sequence of FXGraphReport objects, one per fx graph in the original program
  • Each FXGraphReport has a method to write a reproducer script
    • The method should be able to easily select whether the reproducer uses PyTorch eager, torch.compile, or thunderfx (with some supported options)
    • For total control, the method should allow writing an arbitrary executor string and allow for extending what's imported with arbitrary import strings, in case someone wants to use a very special version of thunderfx or torch.compile or PyTorch eager
  • Each FXGraphReport should allow its FXGraph's inputs to be queried (they can just be FakeTensors for now)

So someone could do something like:

report: FXReport = fx_report(foo)

fxgr: FXGraphReport
for fxgr in report.graph_reports:
  print(fxgr.inputs)
  fxgr.write_eager_reproducer(filename0)
  fxgr.write_torchcompile_reproducer(filename1)
  fxgr.write_thunder_reproducer(filename2)
  fxgr.write_reproducer(filename3, executor_string="my_crazy_executor", import_strings=("import my_crazy_module", "from my_crazy_module import my_crazy_executor"))

And that would generate four reproducers for each of the FX graphs in the program, each one using the different requested executor.

Writing the reproducers directly like this may not be super interesting in isolation, but once that's working we can add more of the data that thunderfx_report makes available, and that will become really interesting really fast. It would also be great if the existing reporting functionality, like thunderfx_report, could be updated to use FXGraphReport and FXReport objects internally for consistency. Using modular Python objects for the report data and analysis should make it easier for us to develop and for others to extend.

What are your thoughts, @kiya00?

kiya00 (Collaborator, Author) commented Jan 24, 2025

Hi @mruberry, thank you so much for your detailed suggestion!

> What does graph0_thunder_0 mean? Is it the first FX graph (graph0) and then the first segment that thunder creates?

Yes, that's the naming, corresponding to the names saved in subgraph_info.

> NoneType can be imported from the types module: from types import NoneType

This part comes from the custom_builtins that torch uses (custom_builtins = "\n".join([v.import_str for v in _custom_builtins.values()])); I noticed it's emitted together with the FX graph code string in Torch, although I haven't seen anything other than device appear in FX graph code yet.

from math import inf
from math import nan
NoneType = type(None)
import torch
from torch import device
import torch.fx._pytree as fx_pytree
import torch.utils._pytree as pytree
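For reference, that preamble can be reproduced from torch's internal table (an internal, version-dependent API, so treat this as illustrative only):

from torch.fx.graph import _custom_builtins

print("\n".join(v.import_str for v in _custom_builtins.values()))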

> If benchmarking, can the executors to be used be passed to the function? Like if I wanted to benchmark eager and torch.compile, could I just pass those two to thunderfx_report, maybe like: executors=(('eager', lambda x: x), ('torchcompile', torch.compile))?

It depends on whether we can convert arbitrary executor code into a corresponding string. I'm not very familiar with that; is it possible to simply use repr or inspect to recover the code string from an object? Currently the possible executors are fixed in a dict I created, and the user can pass in a name to select which one to run.
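A quick illustration of why repr/inspect is unreliable here (inspect.getsource only works for objects backed by a source file, not for partials or other constructed callables):

import inspect
from functools import partial
import torch

print(inspect.getsource(torch.compile))  # works: a plain function defined in a source file
try:
    inspect.getsource(partial(torch.compile, mode="max-autotune"))
except TypeError as e:
    print(e)  # fails: partial objects are not source-backed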

That's a great suggestion about creating the FXReport/FXGraphReport objects!

> And that would generate four reproducers for each of the FX graphs in the program, each one using the different requested executor.
> Writing the reproducers directly like this may not be super interesting in isolation, but once that's working we can add more of the data that thunderfx_report makes available, and that will become really interesting really fast.

Regarding the generated scripts, in this PR I'm trying something like

save_reproducer_to_folder("repro", use_pytest_benchmark= False, save_input_tensor=False,)
save_reproducer_to_folder("benchmark", use_pytest_benchmark= True, save_input_tensor=False,)

The above saves one script per FX graph for each of repro/benchmark, and the script can be driven with different options, e.g. pytest benchmark/graph0_thunder_0.py --compute_type backward --executor thunder or python repro/graph0_thunder_0.py --check_consistency=True --compute_type=forward --executor=torch_inductor. So we can either generate one script per executor, which makes each script clearer and makes the separate fxgr.write_eager_reproducer / fxgr.write_torchcompile_reproducer / fxgr.write_thunder_reproducer APIs easier to use, or generate a single script with command-line options covering all the executors. It's a UX question and I think both are OK; which one do you prefer?
I think I got it wrong in the previous text: FXReport is not Thunder-specific but general. That means we don't get the FX graph from ThunderCompiler.subgraph_infos, but instead use the original Dynamo API to get the FX graph, apply the executors the user wants (user-defined ones are passed in as strings), and generate the script for each executor as you mentioned. Then the next step is to replace the Thunder-specific report with the FXReport API, and for the current PR we could keep the "one script with command-line options" format for now. Is that right?

riccardofelluga (Collaborator) left a comment
Just a pair of nits, but for the rest it looks good to me! One thing I would add is printing the output folder path so that it's easier for users to go and fish out the reproducer files.

On a more technical note, is there a way to make sure that the memory reports are accurate? I've tested it a bunch and they seem to be correct, though sometimes with repeated tests one might interfere with the other. Maybe we can add a little warning of some sort.
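One way to reduce that interference (a sketch, not what this PR implements): reset CUDA's peak-memory counter and synchronize around each run so every measurement starts from a clean slate:

import torch

def measure_peak_memory(fn, *args):
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # clear the running peak so earlier runs don't leak into this one
    result = fn(*args)
    torch.cuda.synchronize()
    return result, torch.cuda.max_memory_allocated() / 2**30  # peak GiB for this run only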

def save_reproducer_to_folder(self, reproducer_folder: str | PathLike, use_pytest_benchmark: bool = False):
def save_reproducer_to_folder(
    self,
    reproducer_folder: str | PathLike,
There does not seem to be any check to validate the folder string passed here; maybe it's better to add it here, or later in the reproducer function, since it's user-facing.
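A minimal validation sketch (a hypothetical helper; pathlib fails early with a readable error on unusable paths):

from pathlib import Path

def _validated_folder(reproducer_folder):
    folder = Path(reproducer_folder)
    folder.mkdir(parents=True, exist_ok=True)  # raises early (e.g. FileExistsError, PermissionError) if the path is unusable
    return folder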

bench_executors_dict["thunder_cugraph"]=partial(thunder.jit, transform=CUDAGraphTransform())
"""

COMMAND_LINE_ARGS = f"""
No need for the f prefix, as no variables are interpolated in the string.

Suggested change:
- COMMAND_LINE_ARGS = f"""
+ COMMAND_LINE_ARGS = """

Comment on lines +148 to +150
print(
    f"Failed to run the benchmarking script({benchmark_folder / file}), exception: {benchmark_result.stderr}"
)
Do we need to add a file cleanup in case of failure?

mruberry (Collaborator) commented:

> Hi @mruberry, thank you so much for your detailed suggestion!
>
> > What does graph0_thunder_0 mean? Is it the first FX graph (graph0) and then the first segment that thunder creates?
>
> Yes, that's the naming, corresponding to the names saved in subgraph_info.

Sounds good.

> > NoneType can be imported from the types module: from types import NoneType
>
> This part comes from the custom_builtins that torch uses [...]

No worries, that sounds great.

> > If benchmarking, can the executors to be used be passed to the function? Like if I wanted to benchmark eager and torch.compile, could I just pass those two to thunderfx_report?
>
> It depends on whether we can convert arbitrary executor code into a corresponding string [...] Currently the possible executors are fixed in a dict I created, and the user can pass in a name to select which one to run.

Great question! So it is, in general, difficult to convert a Python callable into a string representing how to call that callable. Using a fixed set of executors that you know how to print is the right thing to do. Passing in a name to select which to run is a good start. In the future we can maybe let developers specify their own strings, too (per my comment above).

> That's a great suggestion about creating the FXReport/FXGraphReport objects!
>
> [...] Then the next step is to replace the Thunder-specific report with the FXReport API, and for the current PR we could keep the "one script with command-line options" format for now. Is that right?

That's right!
