
[WIP] Thunderfx report #1636

Draft · wants to merge 7 commits into main
Conversation

kiya00 (Collaborator) commented Jan 10, 2025

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

1. Adds the thunderfx_report API

thunderfx_report(
    fn: Callable,
    *args,
    compile_kwargs: dict = None,
    folder_path: str | PathLike = "/tmp/thunderfx_report",
    check_consistency: bool = True,
    check_benchmark: bool = True,
    save_consistency_inputs: bool = False,
    save_benchmark_inputs: bool = False,
    **kwargs,
)

It:

  1. prints the exception string if ThunderFX fails and saves the repro scripts when possible;
  2. if check_consistency=True, checks the results of thunder.jit (with compile_kwargs) against eager; the scripts are saved in the consistency folder, and the inputs are saved in .pt files when save_consistency_inputs=True (otherwise random tensors are used);
  3. similarly, if check_benchmark=True, saves the benchmark scripts in the benchmark folder, with the option to save the input tensors;
  4. prints the report to the screen.

Example
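A minimal invocation sketch (the import path is an assumption based on the generated scripts below, which import from thunder.dynamo.report; fn and its input here are placeholders):

import torch
from thunder.dynamo.report import thunderfx_report  # assumed location of the API

def fn(x):
    return x.sin().cos()

x = torch.randn(4, 4, device="cuda", requires_grad=True)
thunderfx_report(fn, x, check_consistency=True, check_benchmark=True)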

The saved file structure (when save_benchmark_inputs=True):

.
├── benchmark
│   ├── __pycache__
│   │   ├── graph0_thunder_0.cpython-312-pytest-8.1.1.pyc
│   │   └── graph1_thunder_0.cpython-312-pytest-8.1.1.pyc
│   ├── graph0_thunder_0.json
│   ├── graph0_thunder_0.py
│   ├── graph0_thunder_0_inputs.pt
│   ├── graph1_thunder_0.json
│   ├── graph1_thunder_0.py
│   └── graph1_thunder_0_inputs.pt
├── consistency
│   ├── graph0_thunder_0.py
│   └── graph1_thunder_0.py
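The saved *_inputs.pt files should be loadable with torch.load for manual inspection (a sketch; it assumes the inputs were serialized as a list of tensors):

import torch

inputs = torch.load("benchmark/graph0_thunder_0_inputs.pt")
print([t.shape for t in inputs])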

The report is displayed on the screen as follows (there are two FX graphs: graph0_thunder_0 and graph1_thunder_0):

Content of the report
The input callable can be successfully executed by ThunderFX.

Verifying consistency between Thunder and Torch eager ...
[graph0_thunder_0] Consistency check succeeded
[graph1_thunder_0] Consistency check succeeded

Analyzing performance through benchmarking, this might take a moment...
graph0_thunder_0:
Running 8 items in this shard
........                                                                 [100%]

----------------------------------------------------------------------------- benchmark 'compute_type=ComputeType.TRAINING_BACKWARD': 4 tests ------------------------------------------------------------------------------
Name (time in us)                                        Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph0_thunder_0[backward-eager]                34.6431 (1.0)       53.1871 (1.0)       35.8379 (1.0)      3.1075 (1.0)       35.0285 (1.0)      0.4629 (1.0)       122;172       27.9034 (1.0)        2408          12
test_graph0_thunder_0[backward-torch_inductor]       65.1505 (1.88)      97.6453 (1.84)      67.9273 (1.90)     4.2301 (1.36)      66.8875 (1.91)     1.2630 (2.73)       88;110       14.7216 (0.53)       1534          10
test_graph0_thunder_0[backward-thunder_cugraph]     133.6019 (3.86)     211.0944 (3.97)     138.7169 (3.87)     6.9738 (2.24)     136.7520 (3.90)     2.3469 (5.07)        66;79        7.2089 (0.26)        743          10
test_graph0_thunder_0[backward-thunder]             134.0730 (3.87)     188.0126 (3.53)     138.3093 (3.86)     5.5063 (1.77)     136.6975 (3.90)     2.2269 (4.81)        57;66        7.2302 (0.26)        746          10
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_FORWARD': 4 tests -----------------------------------------------------------------------------
Name (time in us)                                       Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph0_thunder_0[forward-eager]                 4.2473 (1.0)        6.4680 (1.0)        4.3972 (1.0)      0.3725 (1.0)        4.3045 (1.0)      0.0299 (1.0)       119;180      227.4189 (1.0)        2351         100
test_graph0_thunder_0[forward-torch_inductor]       19.3764 (4.56)      31.3543 (4.85)      20.2809 (4.61)     1.9849 (5.33)      19.8094 (4.60)     0.2820 (9.44)      120;185       49.3076 (0.22)       2573          20
test_graph0_thunder_0[forward-thunder]             123.7120 (29.13)    159.8947 (24.72)    128.0359 (29.12)    4.5992 (12.35)    126.7161 (29.44)    2.0274 (67.85)       60;82        7.8103 (0.03)        808          10
test_graph0_thunder_0[forward-thunder_cugraph]     128.7621 (30.32)    163.7167 (25.31)    132.5799 (30.15)    4.5898 (12.32)    131.4400 (30.54)    1.8199 (60.91)       44;47        7.5426 (0.03)        776          10
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
8 passed, 54393 warnings in 50.37s


graph1_thunder_0:
Running 8 items in this shard
........                                                                 [100%]

------------------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_BACKWARD': 4 tests ------------------------------------------------------------------------------
Name (time in us)                                        Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph1_thunder_0[backward-eager]                51.0826 (1.0)       91.2683 (1.0)       53.0721 (1.0)      3.9710 (1.02)     52.1101 (1.0)      0.6900 (1.0)        99;149       18.8423 (1.0)        1936          10
test_graph1_thunder_0[backward-torch_inductor]       79.6944 (1.56)     105.2492 (1.15)      82.7173 (1.56)     3.8760 (1.0)      81.7223 (1.57)     1.1479 (1.66)        69;83       12.0894 (0.64)       1262          10
test_graph1_thunder_0[backward-thunder_cugraph]     149.4829 (2.93)     230.7002 (2.53)     154.4036 (2.91)     7.6522 (1.97)    152.0009 (2.92)     2.9549 (4.28)        67;74        6.4765 (0.34)        670          10
test_graph1_thunder_0[backward-thunder]             161.7689 (3.17)     514.5594 (5.64)     171.0046 (3.22)    17.8343 (4.60)    166.0554 (3.19)     4.3313 (6.28)      147;180        5.8478 (0.31)       2063           3
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------ benchmark 'compute_type=ComputeType.TRAINING_FORWARD': 4 tests -----------------------------------------------------------------------------
Name (time in us)                                       Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_graph1_thunder_0[forward-eager]                 7.1465 (1.0)        9.6177 (1.0)        7.3252 (1.0)      0.3800 (1.0)        7.2330 (1.0)      0.0507 (1.0)        69;120      136.5153 (1.0)        1398         100
test_graph1_thunder_0[forward-torch_inductor]       18.9999 (2.66)      35.8237 (3.72)      20.1437 (2.75)     2.3191 (6.10)      19.4813 (2.69)     0.4085 (8.06)      162;247       49.6434 (0.36)       2622          20
test_graph1_thunder_0[forward-thunder]             129.5229 (18.12)    199.6013 (20.75)    134.8926 (18.41)    5.5045 (14.49)    133.7800 (18.50)    3.8628 (76.20)       56;52        7.4133 (0.05)        771          10
test_graph1_thunder_0[forward-thunder_cugraph]     132.3270 (18.52)    220.7174 (22.95)    137.8833 (18.82)    6.5722 (17.30)    136.0469 (18.81)    3.3962 (67.00)       59;62        7.2525 (0.05)        755          10
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
8 passed, 52474 warnings in 51.08s


Max allocated memory usage:
graph0_thunder_0[forward-thunder]: 9.765625e-07 GB
graph0_thunder_0[forward-torch_inductor]: 9.765625e-07 GB
graph0_thunder_0[forward-eager]: 9.765625e-07 GB
graph0_thunder_0[forward-thunder_cugraph]: 9.765625e-07 GB


graph0_thunder_0[backward-thunder]: 2.9296875e-06 GB
graph0_thunder_0[backward-torch_inductor]: 2.9296875e-06 GB
graph0_thunder_0[backward-eager]: 2.9296875e-06 GB
graph0_thunder_0[backward-thunder_cugraph]: 2.9296875e-06 GB


graph1_thunder_0[forward-thunder]: 1.46484375e-06 GB
graph1_thunder_0[forward-torch_inductor]: 1.46484375e-06 GB
graph1_thunder_0[forward-thunder_cugraph]: 1.46484375e-06 GB
graph1_thunder_0[forward-eager]: 1.953125e-06 GB


graph1_thunder_0[backward-thunder]: 3.90625e-06 GB
graph1_thunder_0[backward-torch_inductor]: 3.90625e-06 GB
graph1_thunder_0[backward-thunder_cugraph]: 3.90625e-06 GB
graph1_thunder_0[backward-eager]: 4.39453125e-06 GB

2. Modifies the reproducer script to accept command-line options

save_reproducer_to_folder(self, reproducer_folder: str | PathLike, use_pytest_benchmark: bool = False, save_input_tensor: bool = False) can be used to save a benchmark/reproducer script with random or dumped (.pt) input tensors.
The reproducer script exposes configurable options; the defaults are --check_consistency=False --compute_type=forward --executor=thunder.
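For example, a generated script can then be driven like this (the command is taken from the discussion below; paths assume the folder layout shown above):

python repro/graph0_thunder_0.py --check_consistency=True --compute_type=forward --executor=torch_inductor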

An example repro script:
"""
Environment information gathered from `torch.utils.collect_env.get_pretty_env_info()`:
PyTorch version: 2.6.0a0+ecf3bae40a.nvInternal
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-44-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.61
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA RTX 6000 Ada Generation
GPU 1: NVIDIA RTX 6000 Ada Generation

Nvidia driver version: 545.29.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.7.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
BIOS Vendor ID:                     Advanced Micro Devices, Inc.
Model name:                         AMD Ryzen 9 7950X 16-Core Processor
BIOS Model name:                    AMD Ryzen 9 7950X 16-Core Processor             Unknown CPU @ 4.5GHz
BIOS CPU family:                    107
CPU family:                         25
Model:                              97
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           2
CPU(s) scaling MHz:                 63%
CPU max MHz:                        5881.0000
CPU min MHz:                        400.0000
BogoMIPS:                           8999.83
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           16 MiB (16 instances)
L3 cache:                           64 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cudnn-frontend==1.9.0
[pip3] optree==0.14.0
[pip3] optree==0.14.0
[pip3] pytorch-lightning==2.5.0.post0
[pip3] pytorch-triton==3.1.0+cf34004b8.internal
[pip3] torch==2.6.0a0+ecf3bae40a.nvinternal
[pip3] torchmetrics==1.6.1
[pip3] torchvision==0.20.0a0
[conda] Could not collect

Versions of Thunder related libraries:
lightning-thunder==0.2.0.dev0
nvfuser==0.2.24+git062dd50

Split Information:
The original graph is not split, and is entirely run by Thunder.
"""

from math import inf
from math import nan
NoneType = type(None)
import torch
from torch import device
import torch.fx._pytree as fx_pytree
import torch.utils._pytree as pytree
from functools import partial
import thunder
from thunder.transforms.cudagraph import CUDAGraphTransform
from thunder.dev_utils.nvtx_profile_transform import NvtxProfileTransform

# NOTE: The reproducer function has already been processed by TorchDynamo.
# If we let it go through TorchDynamo again, it could be segmented further.
# To avoid this, we directly use Inductor here.
# See issue https://github.com/Lightning-AI/lightning-thunder/issues/1521
def torch_inductor(fn, inputs):
    from torch._inductor import compile as inductor_compile
    from torch.fx import symbolic_trace

    fx_graph = symbolic_trace(fn)
    return inductor_compile(fx_graph, inputs)


# Executors selectable via --executor; "eager" maps to None, meaning run the module as-is.
bench_executors_dict = {}
bench_executors_dict["thunder"] = partial(thunder.jit)
bench_executors_dict["torch_inductor"] = torch_inductor
bench_executors_dict["eager"] = None
bench_executors_dict["thunder_cugraph"] = partial(thunder.jit, transform=CUDAGraphTransform())


import argparse

parser = argparse.ArgumentParser(description="Script for executing an FX graph with specified configurations.")

parser.add_argument(
    "--check_consistency",
    # NOTE: type=bool would treat any non-empty string (even "False") as True,
    # so parse the boolean flag value explicitly.
    type=lambda s: s.lower() in ("true", "1", "yes"),
    default=False,
    help="Whether to check consistency (default: False)"
)
parser.add_argument(
    "--compute_type",
    type=str,
    choices=["forward", "forward+backward"],
    default="forward",
    help="Type of computation to perform (forward, forward+backward)"
)
parser.add_argument(
    "--executor",
    type=str,
    choices=["torch_inductor", "thunder", "eager"],
    default="thunder",
    help="Type of executors (thunder, torch_inductor, or eager)"
)

args = parser.parse_args()
ex_name = args.executor
compute_type = args.compute_type
check_acc = args.check_consistency

def test_graph0_thunder_0():
    class DynamoModule(torch.nn.Module):
      def forward(self, l_x_ : torch.Tensor):
          y = l_x_.sin();  l_x_ = None
          return y

    inputs = [
        torch.testing.make_tensor((4, 4), dtype=torch.float32, device='cuda:0', requires_grad=True, low=-1.35869300365448, high=1.8803045749664307).as_strided((4, 4), (4, 1)),

    ]

    mod = DynamoModule()
    from thunder.dynamo.report import run_repro

    result = run_repro(bench_executors_dict, ex_name, mod, compute_type, *inputs)
    if check_acc:
        eager_result = run_repro(bench_executors_dict, "eager", mod, compute_type, *inputs)
        for (compute_t, eager_v), (_, cur_v) in zip(eager_result.items(), result.items()):
            torch.testing.assert_close(eager_v, cur_v, msg=lambda e : f'{compute_t}: {e}')


test_graph0_thunder_0()

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

kiya00 (Collaborator, Author) commented Jan 22, 2025

Hi @mruberry @IvanYashchuk @kshitij12345, I've prepared an initial version of the report, which you can find in the PR description. Could you let me know if this aligns with the direction you had in mind? I'd also appreciate any suggestions you have regarding the report's content. Thank you!

mruberry (Collaborator) commented:
fyi @riccardofelluga, too, since he's had some experience with benchmarking

mruberry (Collaborator) commented:
Super cool!

A few notes and questions:

  • What does graph0_thunder_0 mean? Is it the first FX graph (graph0) and then the first segment that thunder creates?
  • NoneType can be imported from the types module: from types import NoneType
  • If benchmarking, can the executors to be used be passed to the function? Like if I wanted to benchmark eager and torch.compile, could I just pass those two to thunderfx_report, maybe like: executors=(('eager', lambda x: x), ('torchcompile', torch.compile))?

For this PR, let's focus on clarifying these small issues and then merge the thunderfx_report function so it can be used.

For the next PR, I think it'd be useful to create an FXReport object. What I'm thinking is:

  • Create a new function called fx_report (not Thunder-specific)
  • Return a FXReport object from fx_report
  • The FXReport object has a sequence of FXGraphReport objects, one per fx graph in the original program
  • Each FXGraphReport has a method to write a reproducer script
    • The method should be able to easily select whether the reproducer uses PyTorch eager, torch.compile, or thunderfx (with some supported options)
    • For total control, the method should allow writing an arbitrary executor string and allow for extending what's imported with arbitrary import strings, in case someone wants to use a very special version of thunderfx or torch.compile or PyTorch eager
  • Each FXGraphReport should allow its FXGraph's inputs to be queried (they can just be FakeTensors for now)

So someone could do something like:

report: FXReport = fx_report(foo)

fxgr: FXGraphReport
for fxgr in report.graph_reports:
  print(fxgr.inputs)
  fxgr.write_eager_reproducer(filename0)
  fxgr.write_torchcompile_reproducer(filename1)
  fxgr.write_thunder_reproducer(filename2)
  fxgr.write_reproducer(filename3, executor_string="my_crazy_executor", import_strings=("import my_crazy_module", "from my_crazy_module import my_crazy_executor"))

And that would generate four reproducers for each of the FX graphs in the program, each one using the different requested executor.

Writing the reproducers directly like this may not be super interesting in isolation, but once that's working we can add more of the data that thunderfx_report makes available, and that will become really interesting really fast. It would also be great if the existing reporting functionality, like thunderfx_report, could be updated to use FXGraphReport and FXReport objects internally for consistency. Using modular Python objects for the report data and analysis should make it easier for us to develop and for others to extend.

What are your thoughts, @kiya00?

kiya00 (Collaborator, Author) commented Jan 24, 2025

Hi @mruberry, thank you so much for your detailed suggestion!

> What does graph0_thunder_0 mean? Is it the first FX graph (graph0) and then the first segment that thunder creates?

Yes, that's the naming, corresponding to the names saved in subgraph_info.

> NoneType can be imported from the types module: from types import NoneType

This part comes from the custom_builtins that torch uses (custom_builtins = "\n".join([v.import_str for v in _custom_builtins.values()])); I noticed it's emitted together with the FX graph code string in Torch, although I haven't seen anything other than device appear in FX graph code yet.

from math import inf
from math import nan
NoneType = type(None)
import torch
from torch import device
import torch.fx._pytree as fx_pytree
import torch.utils._pytree as pytree
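For reference, that preamble can be reproduced from torch's internal table (an internal, version-dependent API, so treat this as illustrative only):

from torch.fx.graph import _custom_builtins

print("\n".join(v.import_str for v in _custom_builtins.values()))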

> If benchmarking, can the executors to be used be passed to the function? Like if I wanted to benchmark eager and torch.compile, could I just pass those two to thunderfx_report, maybe like: executors=(('eager', lambda x: x), ('torchcompile', torch.compile))?

It depends on whether we can convert arbitrary executor code into a corresponding string. I'm not very familiar with that; is it possible to simply use repr or inspect to recover the code string from an object? Currently the possible executors are fixed in a dict I created, and the user can pass in a name to select which one to run.
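A quick illustration of why repr/inspect is unreliable here (inspect.getsource only works for objects backed by a source file, not for partials or other constructed callables):

import inspect
from functools import partial
import torch

print(inspect.getsource(torch.compile))  # works: a plain function defined in a source file
try:
    inspect.getsource(partial(torch.compile, mode="max-autotune"))
except TypeError as e:
    print(e)  # fails: partial objects are not source-backed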

That's a great suggestion about creating the FXReport/FXGraphReport objects!

> And that would generate four reproducers for each of the FX graphs in the program, each one using the different requested executor.
> Writing the reproducers directly like this may not be super interesting in isolation, but once that's working we can add more of the data that thunderfx_report makes available, and that will become really interesting really fast.

Regarding the generated scripts, in this PR I'm trying something like

save_reproducer_to_folder("repro", use_pytest_benchmark= False, save_input_tensor=False,)
save_reproducer_to_folder("benchmark", use_pytest_benchmark= True, save_input_tensor=False,)

The above saves one script per FX graph for each of repro/benchmark, and the script can be driven with different options, e.g. pytest benchmark/graph0_thunder_0.py --compute_type backward --executor thunder or python repro/graph0_thunder_0.py --check_consistency=True --compute_type=forward --executor=torch_inductor. So we can either generate one script per executor, which makes each script clearer and makes the separate fxgr.write_eager_reproducer / fxgr.write_torchcompile_reproducer / fxgr.write_thunder_reproducer APIs easier to use, or generate a single script with command-line options covering all the executors. It's a UX question and I think both are OK; which one do you prefer?
I think I got it wrong in the previous text: FXReport is not Thunder-specific but general. That means we don't get the FX graph from ThunderCompiler.subgraph_infos, but instead use the original Dynamo API to get the FX graph, apply the executors the user wants (user-defined ones are passed in as strings), and generate the script for each executor as you mentioned. Then the next step is to replace the Thunder-specific report with the FXReport API, and for the current PR we could keep the "one script with command-line options" format for now. Is that right?

riccardofelluga (Collaborator) left a comment
Just a pair of nits, but for the rest it looks good to me! One thing I would add is printing the output folder path so that it's easier for users to go and fish out the reproducer files.

On a more technical note, is there a way to make sure that the memory reports are accurate? I've tested it a bunch and they seem to be correct, though sometimes with repeated tests one might interfere with the other. Maybe we can add a little warning of some sort.
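One way to reduce that interference (a sketch, not what this PR implements): reset CUDA's peak-memory counter and synchronize around each run so every measurement starts from a clean slate:

import torch

def measure_peak_memory(fn, *args):
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()  # clear the running peak so earlier runs don't leak into this one
    result = fn(*args)
    torch.cuda.synchronize()
    return result, torch.cuda.max_memory_allocated() / 2**30  # peak GiB for this run only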

def save_reproducer_to_folder(self, reproducer_folder: str | PathLike, use_pytest_benchmark: bool = False):
def save_reproducer_to_folder(
    self,
    reproducer_folder: str | PathLike,
There does not seem to be any check to validate the folder string passed here; maybe it's better to add it here, or later in the reproducer function, since it's user-facing.
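A minimal validation sketch (a hypothetical helper; pathlib fails early with a readable error on unusable paths):

from pathlib import Path

def _validated_folder(reproducer_folder):
    folder = Path(reproducer_folder)
    folder.mkdir(parents=True, exist_ok=True)  # raises early (e.g. FileExistsError, PermissionError) if the path is unusable
    return folder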

bench_executors_dict["thunder_cugraph"]=partial(thunder.jit, transform=CUDAGraphTransform())
"""

COMMAND_LINE_ARGS = f"""
No need for the f prefix, as no variables are interpolated in the string.

Suggested change:
- COMMAND_LINE_ARGS = f"""
+ COMMAND_LINE_ARGS = """

Comment on lines +148 to +150
print(
    f"Failed to run the benchmarking script({benchmark_folder / file}), exception: {benchmark_result.stderr}"
)
Do we need to add a file cleanup in case of failure?

mruberry (Collaborator) commented:

> Hi @mruberry, thank you so much for your detailed suggestion!
>
> > What does graph0_thunder_0 mean? Is it the first FX graph (graph0) and then the first segment that thunder creates?
>
> Yes, that's the naming, corresponding to the names saved in subgraph_info.

Sounds good.

> > NoneType can be imported from the types module: from types import NoneType
>
> This part comes from the custom_builtins that torch uses [...]

No worries, that sounds great.

> > If benchmarking, can the executors to be used be passed to the function? Like if I wanted to benchmark eager and torch.compile, could I just pass those two to thunderfx_report?
>
> It depends on whether we can convert arbitrary executor code into a corresponding string [...] Currently the possible executors are fixed in a dict I created, and the user can pass in a name to select which one to run.

Great question! So it is, in general, difficult to convert a Python callable into a string representing how to call that callable. Using a fixed set of executors that you know how to print is the right thing to do. Passing in a name to select which to run is a good start. In the future we can maybe let developers specify their own strings, too (per my comment above).

> That's a great suggestion about creating the FXReport/FXGraphReport objects!
>
> [...] Then the next step is to replace the Thunder-specific report with the FXReport API, and for the current PR we could keep the "one script with command-line options" format for now. Is that right?

That's right!
