
feat: lowering replace aten.full_like with aten.full #3077

Merged (14 commits) on Aug 21, 2024

Conversation

@chohk88 (Collaborator) commented on Aug 12, 2024

Description

  1. Lowering: Replaced aten.full_like with aten.full.
  2. Decomposition: For aten.scatter_add, replaced torch.zeros_like with torch.zeros (see the sketch after this list).
  3. Removed the expected_ops_param for the aten.scatter_add test case.
  4. Applied linting to the py/torch_tensorrt/dynamo/lowering/_decompositions.py file.
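To make item 2 concrete, here is a minimal sketch, in plain PyTorch, of how aten.scatter_add can be expressed using torch.zeros instead of torch.zeros_like. It is only an illustration of the idea; the function name scatter_add_sketch is hypothetical and this is not the decomposition registered in _decompositions.py:

import torch

def scatter_add_sketch(input_tensor: torch.Tensor, dim: int,
                       index: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    # Accumulate one slice of src at a time so that duplicate indices along `dim`
    # still sum correctly (plain torch.scatter keeps only one value per target).
    result = input_tensor
    for i in range(src.shape[dim]):
        index_slice = torch.select(index, dim, i).unsqueeze(dim)
        src_slice = torch.select(src, dim, i).unsqueeze(dim)
        # Zero buffer built with torch.zeros and an explicit shape/dtype/device,
        # mirroring the change away from torch.zeros_like described above.
        buffer = torch.zeros(input_tensor.shape, dtype=input_tensor.dtype,
                             device=input_tensor.device)
        result = result + torch.scatter(buffer, dim, index_slice, src_slice)
    return result

For example, scatter_add_sketch(torch.zeros(3, 3), 0, torch.tensor([[0, 1, 2]]), torch.ones(1, 3)) matches torch.zeros(3, 3).scatter_add(0, torch.tensor([[0, 1, 2]]), torch.ones(1, 3)).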

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that the relevant reviewers are notified

@chohk88 chohk88 requested a review from apbose August 12, 2024 17:33
@chohk88 chohk88 requested a review from peri044 August 12, 2024 17:33
@chohk88 chohk88 self-assigned this Aug 12, 2024
@github-actions bot added labels: component: tests, component: lowering, component: api [Python], component: dynamo (Aug 12, 2024)
@chohk88 (Collaborator, Author) commented on Aug 12, 2024

One concern is that the test case takes a significant amount of runtime, as shown below:

DEBUG: [Torch-TensorRT - Debug Build] - Torch-TensorRT TensorRT Engine:
  Name: _run_on_acc_0_engine
  Inputs: [
  ]
  Outputs: [
    id: 0
      name: output0
      shape: [3, 3]
      dtype: Float
  ]
  Device: Device(ID: 0, Name: NVIDIA GeForce RTX 3080 Ti, SM Capability: 8.6, Type: GPU)
  Hardware Compatibility: Disabled

DEBUG: [Torch-TensorRT - Debug Build] - Attempting to run engine (ID: _run_on_acc_0_engine); Hardware Compatible: 0
INFO: [Torch-TensorRT - Debug Build] - Execution profiling is enabled, find results here:
  Device selection profile: /tmp/_run_on_acc_0_engine_device_config_profile.trace
  Input packing profile: /tmp/_run_on_acc_0_engine_input_profile.trace
  Output packing profile: /tmp/_run_on_acc_0_engine_output_profile.trace
  TRT enqueue profile: /tmp/_run_on_acc_0_engine_enqueue_profile.trace
  Engine execution profile: /tmp/_run_on_acc_0_engine_engine_exectuion_profile.trace

DEBUG: [Torch-TensorRT - Debug Build] - Output Name: output0 Shape: [3, 3]
INFO: [Torch-TensorRT - Debug Build] - 
========== _run_on_acc_0_engine profile ==========
                                                   TensorRT layer name    Runtime, %  Invocations  Runtime, ms
                                                                output          0.0%            1         0.00
                   Reformatting CopyNode for Output Tensor 0 to output        100.0%            1         2.28
========== _run_on_acc_0_engine total runtime = 2.27635 ms ==========

INFO: [Torch-TensorRT - Debug Build] - The profiling verbosity was set to ProfilingVerbosity::kLAYER_NAMES_ONLY when the engine was built, so only the layer names will be returned. Rebuild the engine with ProfilingVerbosity::kDETAILED to get more verbose layer information.
..INFO:torch_tensorrt.dynamo.utils:Using Default Torch-TRT Runtime (as requested by user)
INFO:torch_tensorrt.dynamo.utils:Device not specified, using Torch default current device - cuda:0. If this is incorrect, please specify an input device, via the device keyword.
INFO:torch_tensorrt.dynamo.utils:Compilation Settings: CompilationSettings(enabled_precisions={<dtype.f32: 7>}, debug=False, workspace_size=0, min_block_size=1, torch_executed_ops=set(), pass_through_build_failures=True, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, make_refitable=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, timing_cache_path='/tmp/timing_cache.bin', lazy_engine_init=False)

I think we could optimize here:

return np.full(shape, fill_value)

# Extract arguments from full_like
input_tensor = node.args[0]
fill_value = node.args[1]
shape = list(input_tensor.meta["tensor_meta"].shape)
A Collaborator commented on the lines above:

Can you use the "val" key instead of "tensor_meta"? If "val" isn't available, then use tensor_meta.
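
A minimal sketch of that suggestion, assuming a standard torch.fx node (the helper name _infer_input_shape is hypothetical, not code from this PR):

# Illustrative only: prefer the value stored under node.meta["val"] and fall
# back to node.meta["tensor_meta"] when "val" is missing.
def _infer_input_shape(full_like_node):
    input_tensor = full_like_node.args[0]
    meta_val = input_tensor.meta.get("val", None)
    if meta_val is not None:
        return list(meta_val.shape)
    return list(input_tensor.meta["tensor_meta"].shape)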

@peri044 (Collaborator) commented on Aug 15, 2024

(Quoting @chohk88's comment above, including the debug log and the np.full snippet.)

Do you mean that the static case, which uses np.full, takes more time, or the dynamic one? For the static case, I'm not sure what control we have for optimization. Also, there would be variation between the debug and release builds, so you might want to try the release build for perf measurement.

@github-actions bot added labels: component: core, component: runtime (Aug 20, 2024)
@github-actions bot added labels: component: conversion, component: converters (Aug 20, 2024)
@peri044 merged commit 7d0f540 into main on Aug 21, 2024 (45 of 67 checks passed)
@peri044 (Collaborator) commented on Aug 21, 2024

@chohk88 Opened a new issue for the perf improvement you were mentioning: #3107
