
feat: lowering replace aten.full_like with aten.full #3077

Merged (14 commits) on Aug 21, 2024

Conversation

@chohk88 (Collaborator) commented on Aug 12, 2024

Description

  1. Lowering: Replaced aten.full_like with aten.full.
  2. Decomposition: For aten.scatter_add, replaced torch.zeros_like with torch.zeros (see the sketch after this list).
  3. Removed the expected_ops_param for the aten.scatter_add test case.
  4. Applied linting to the py/torch_tensorrt/dynamo/lowering/_decompositions.py file.
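To make item 2 concrete, here is a minimal sketch, in plain PyTorch, of how aten.scatter_add can be expressed using torch.zeros instead of torch.zeros_like. It is only an illustration of the idea; the function name scatter_add_sketch is hypothetical and this is not the decomposition registered in _decompositions.py:

import torch

def scatter_add_sketch(input_tensor: torch.Tensor, dim: int,
                       index: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    # Accumulate one slice of src at a time so that duplicate indices along `dim`
    # still sum correctly (plain torch.scatter keeps only one value per target).
    result = input_tensor
    for i in range(src.shape[dim]):
        index_slice = torch.select(index, dim, i).unsqueeze(dim)
        src_slice = torch.select(src, dim, i).unsqueeze(dim)
        # Zero buffer built with torch.zeros and an explicit shape/dtype/device,
        # mirroring the change away from torch.zeros_like described above.
        buffer = torch.zeros(input_tensor.shape, dtype=input_tensor.dtype,
                             device=input_tensor.device)
        result = result + torch.scatter(buffer, dim, index_slice, src_slice)
    return result

For example, scatter_add_sketch(torch.zeros(3, 3), 0, torch.tensor([[0, 1, 2]]), torch.ones(1, 3)) matches torch.zeros(3, 3).scatter_add(0, torch.tensor([[0, 1, 2]]), torch.ones(1, 3)).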

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that the relevant reviewers are notified

@chohk88 chohk88 requested a review from apbose August 12, 2024 17:33
@chohk88 chohk88 requested a review from peri044 August 12, 2024 17:33
@chohk88 chohk88 self-assigned this Aug 12, 2024
@github-actions bot added labels: component: tests, component: lowering, component: api [Python], component: dynamo (Aug 12, 2024)
@chohk88 (Collaborator, Author) commented on Aug 12, 2024

One concern is that the test case takes a significant amount of runtime, as shown below:

DEBUG: [Torch-TensorRT - Debug Build] - Torch-TensorRT TensorRT Engine:
  Name: _run_on_acc_0_engine
  Inputs: [
  ]
  Outputs: [
    id: 0
      name: output0
      shape: [3, 3]
      dtype: Float
  ]
  Device: Device(ID: 0, Name: NVIDIA GeForce RTX 3080 Ti, SM Capability: 8.6, Type: GPU)
  Hardware Compatibility: Disabled

DEBUG: [Torch-TensorRT - Debug Build] - Attempting to run engine (ID: _run_on_acc_0_engine); Hardware Compatible: 0
INFO: [Torch-TensorRT - Debug Build] - Execution profiling is enabled, find results here:
  Device selection profile: /tmp/_run_on_acc_0_engine_device_config_profile.trace
  Input packing profile: /tmp/_run_on_acc_0_engine_input_profile.trace
  Output packing profile: /tmp/_run_on_acc_0_engine_output_profile.trace
  TRT enqueue profile: /tmp/_run_on_acc_0_engine_enqueue_profile.trace
  Engine execution profile: /tmp/_run_on_acc_0_engine_engine_exectuion_profile.trace

DEBUG: [Torch-TensorRT - Debug Build] - Output Name: output0 Shape: [3, 3]
INFO: [Torch-TensorRT - Debug Build] - 
========== _run_on_acc_0_engine profile ==========
                                                   TensorRT layer name    Runtime, %  Invocations  Runtime, ms
                                                                output          0.0%            1         0.00
                   Reformatting CopyNode for Output Tensor 0 to output        100.0%            1         2.28
========== _run_on_acc_0_engine total runtime = 2.27635 ms ==========

INFO: [Torch-TensorRT - Debug Build] - The profiling verbosity was set to ProfilingVerbosity::kLAYER_NAMES_ONLY when the engine was built, so only the layer names will be returned. Rebuild the engine with ProfilingVerbosity::kDETAILED to get more verbose layer information.
..INFO:torch_tensorrt.dynamo.utils:Using Default Torch-TRT Runtime (as requested by user)
INFO:torch_tensorrt.dynamo.utils:Device not specified, using Torch default current device - cuda:0. If this is incorrect, please specify an input device, via the device keyword.
INFO:torch_tensorrt.dynamo.utils:Compilation Settings: CompilationSettings(enabled_precisions={<dtype.f32: 7>}, debug=False, workspace_size=0, min_block_size=1, torch_executed_ops=set(), pass_through_build_failures=True, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, make_refitable=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, timing_cache_path='/tmp/timing_cache.bin', lazy_engine_init=False)

I think we could optimize here:

return np.full(shape, fill_value)

# Extract arguments from full_like
input_tensor = node.args[0]
fill_value = node.args[1]
shape = list(input_tensor.meta["tensor_meta"].shape)
A Collaborator commented on the lines above:

Can you use the "val" key instead of "tensor_meta"? If "val" isn't available, then use tensor_meta.
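
A minimal sketch of that suggestion, assuming a standard torch.fx node (the helper name _infer_input_shape is hypothetical, not code from this PR):

# Illustrative only: prefer the value stored under node.meta["val"] and fall
# back to node.meta["tensor_meta"] when "val" is missing.
def _infer_input_shape(full_like_node):
    input_tensor = full_like_node.args[0]
    meta_val = input_tensor.meta.get("val", None)
    if meta_val is not None:
        return list(meta_val.shape)
    return list(input_tensor.meta["tensor_meta"].shape)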

@peri044 (Collaborator) commented on Aug 15, 2024

(Quoting @chohk88's comment above, including the debug log and the np.full snippet.)

Do you mean that the static case, which uses np.full, takes more time, or the dynamic one? For the static case, I'm not sure what control we have for optimization. Also, there would be variation between the debug and release builds, so you might want to try the release build for perf measurement.

@github-actions bot added labels: component: core, component: runtime (Aug 20, 2024)
@github-actions bot added labels: component: conversion, component: converters (Aug 20, 2024)
@peri044 merged commit 7d0f540 into main on Aug 21, 2024 (45 of 67 checks passed)
@peri044 (Collaborator) commented on Aug 21, 2024

@chohk88 Opened a new issue for the perf improvement you were mentioning: #3107
