TAPIR performance degradation with cudnn9 #99

Open
cdoersch opened this issue Jun 17, 2024 · 3 comments

@cdoersch
Collaborator

Running TAPIR internally with cudnn9 results in poor performance. It's unclear whether anyone external has encountered the same issue. Our teams have traced it to a broken cudnn9 convolution kernel, which is being tracked in the following bug at NVIDIA:

https://partners.nvidia.com/bug/viewbug/4705291

@bhack

bhack commented Jul 11, 2024

Any news on this? I don't know if we have visibility on this upstream ticket. Can you share something more about the performance? Is it about accuracy or speed?

@cdoersch
Collaborator Author

Same speed, catastrophic collapse in accuracy. If you look at the results, the failure will be obvious.

We suspect that CUDA is reading/writing memory that doesn't belong to the tensor it's supposed to be reading/writing, leading to garbage in the network.
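
For anyone wanting to check whether their own setup is affected, one option is to compare the same activations from a known-good run (for example CPU, or an older cuDNN) against the suspect run. The following is only a minimal sketch of that comparison, not the maintainers' procedure: the helper name `compare_outputs`, the tolerance `atol`, and the assumption that both outputs have been flattened to plain lists of floats are all hypothetical choices made for illustration.

```python
import math


def compare_outputs(reference, candidate, atol=1e-3):
    """Summarise how far `candidate` deviates from `reference`.

    Both arguments are flat sequences of floats of equal length, e.g. the
    same flattened activation from a known-good run and a suspect run.
    `reference` is assumed to contain only finite values.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    # Count NaN/inf entries in the suspect output separately.
    non_finite = sum(1 for c in candidate if not math.isfinite(c))
    # Measure the absolute difference only where the candidate is finite.
    diffs = [abs(r - c) for r, c in zip(reference, candidate) if math.isfinite(c)]
    max_abs_diff = max(diffs, default=0.0)
    return {
        "non_finite": non_finite,
        "max_abs_diff": max_abs_diff,
        "suspicious": non_finite > 0 or max_abs_diff > atol,
    }
```

If the failure is as severe as described above, `suspicious` should be set for almost any reasonable tolerance; small float differences between backends are expected, so `atol` is deliberately loose.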

@bhack

bhack commented Jul 11, 2024

Is it related to a specific cudnn9 version?
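
When reporting which version is affected, it helps to give the exact major.minor.patch rather than just "cudnn9". Frameworks typically expose the cuDNN version as a single packed integer (in PyTorch, for instance, `torch.backends.cudnn.version()` returns one). The sketch below decodes such an integer. It assumes the packing used in the cuDNN headers, `major*10000 + minor*100 + patch` for cuDNN 9 and later and `major*1000 + minor*100 + patch` for earlier releases; the function name `decode_cudnn_version` is hypothetical.

```python
def decode_cudnn_version(version):
    """Split a packed cuDNN version integer into (major, minor, patch)."""
    # Assumption: cuDNN 9+ packs the version as major*10000 + minor*100 + patch,
    # while cuDNN 8 and earlier use major*1000 + minor*100 + patch.
    if version >= 90000:
        major, rest = divmod(version, 10000)
    else:
        major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch
```

Under that assumption, `decode_cudnn_version(90100)` gives `(9, 1, 0)` and `decode_cudnn_version(8907)` gives `(8, 9, 7)`.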
