fix: Destroy cuda graphs before setting weight streaming #3461

Merged
merged 3 commits into from
Apr 4, 2025

keehyuna
Collaborator

@keehyuna keehyuna commented Apr 2, 2025

Description

When CUDA graphs and weight streaming are used together, the CUDA graph was destroyed after the weight streaming setting was applied.
Setting weight streaming recreates the context and loads a new module, so destroying the CUDA graph through the old reference crashed the application. The fix is to move the CUDA graph reset before the weight streaming setting.
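
The ordering problem can be illustrated with a minimal pure-Python sketch. All names here (`DemoContext`, `DemoGraph`, `DemoModule`, and the two `apply_setting_*` helpers) are hypothetical stand-ins invented for illustration. They are not Torch-TensorRT or TensorRT classes, and the `RuntimeError` only models the crash described above.

```python
class DemoContext:
    """Hypothetical stand-in for an execution context."""

    def __init__(self) -> None:
        self.alive = True


class DemoGraph:
    """Hypothetical stand-in for a captured graph bound to one context."""

    def __init__(self, context: DemoContext) -> None:
        self.context = context

    def reset(self) -> None:
        # Releasing a graph whose context is already gone models the crash.
        if not self.context.alive:
            raise RuntimeError("graph reset after its context was destroyed")


class DemoModule:
    def __init__(self) -> None:
        self.context = DemoContext()
        self.graph = DemoGraph(self.context)

    def reset_captured_graph(self) -> None:
        if self.graph is not None:
            self.graph.reset()
            self.graph = None

    def recreate_context(self) -> None:
        # Models the weight streaming setting replacing the context.
        self.context.alive = False
        self.context = DemoContext()


def apply_setting_old_order(module: DemoModule) -> None:
    module.recreate_context()
    module.reset_captured_graph()  # stale reference: raises RuntimeError


def apply_setting_fixed_order(module: DemoModule) -> None:
    module.reset_captured_graph()  # graph released while its context is valid
    module.recreate_context()
```

In this sketch only the fixed order completes, because the graph is released while the context it was captured against still exists.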

The timing of __del__ is not entirely predictable in Python. The CUDA graph reset logic is moved from __del__ to a dedicated reset_cudagraph method, which is called when exiting the CudaGraphsTorchTensorRTModule context block.
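
As a rough sketch of why an explicit reset on context exit is more dependable than a finalizer, the pattern can be written with a plain context manager. `DemoGraphModule` and `demo_graph_scope` are hypothetical names for illustration only, and `object()` stands in for a captured graph. This is not the code from this PR.

```python
from contextlib import contextmanager


class DemoGraphModule:
    """Hypothetical module that owns a captured graph."""

    def __init__(self) -> None:
        self.graph = None

    def capture(self) -> None:
        self.graph = object()  # placeholder for a captured graph

    def reset_captured_graph(self) -> None:
        # Explicit, idempotent cleanup that does not rely on __del__.
        self.graph = None


@contextmanager
def demo_graph_scope(module: DemoGraphModule):
    module.capture()
    try:
        yield module
    finally:
        # Runs at a known point, even if the body raised.
        module.reset_captured_graph()
```

The `finally` clause gives the reset a deterministic point in the program, whereas `__del__` runs whenever the interpreter happens to finalize the object.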

Fixes #3460

Type of change

Please delete options that are not relevant and/or add your own.

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@keehyuna keehyuna self-assigned this Apr 2, 2025
@github-actions github-actions bot added the component: core, component: api [Python], component: runtime, and component: dynamo labels Apr 2, 2025
@github-actions github-actions bot requested a review from bowang007 April 2, 2025 15:24
@@ -103,9 +103,13 @@ def validate_input_shapes(self, inputs: Sequence[torch.Tensor]) -> bool:

         return False

-    def __del__(self) -> None:
+    def reset_captured_graph(self) -> None:
Collaborator

_reset_captured_graph

@narendasan narendasan added the cherry-pick label Apr 4, 2025
@narendasan
Collaborator

@zewenli98 please cherry pick this to 2.7

@narendasan narendasan added the needs-release-cherrypick label and removed the cherry-pick label Apr 4, 2025
@keehyuna keehyuna merged commit 297adef into pytorch:main Apr 4, 2025
76 checks passed
@zewenli98 zewenli98 mentioned this pull request Apr 4, 2025
7 tasks
zewenli98 added a commit that referenced this pull request Apr 5, 2025
Labels
cla signed, component: api [Python], component: core, component: dynamo, component: runtime, needs-release-cherrypick
Development

Successfully merging this pull request may close these issues.

🐛 [Bug] Weight streaming test fail
3 participants