[Bug]: Stop breaking backwards compatibility or at least warn #1386
Comments
Hello @danielzgtg, thank you for flagging the need for clearer error messages with ROCm and library version mismatches. Your feedback is vital in refining our library's usability. Our team will investigate and refine the error notifications to offer guidance for resolving library version disparities. Additionally, we'll clarify any backward compatibility restrictions to assist users in navigating version conflicts more effectively. We'll keep you updated on our progress as we work to enhance the error messages. Your patience and any additional insights during this process are immensely valuable. Wasiq
@danielzgtg , |
That explains it. I spent the last week troubleshooting why ROCm suddenly stopped working, and it turns out to be a backwards compatibility issue. Quite frustrating.
@danielzgtg and @Trat8547, having said that, in general when a major version changes (we follow semantic versioning), API breakage is expected. Upon reviewing the release notes, we see breaking changes in HIP, and the appropriate notification is published here. Those changes could have contributed to the issue reported here.
Here: TensorLibrary.txt. I think the Your linked https://rocm.docs.amd.com/en/latest/about/release-notes.html#hip appears to only list API breaking changes. My issue is about ABI breaking changes. The problem is that the PyTorch ROCm is bundling This is why rebuilding PyTorch was a workaround for this problem. But I would rather not wait for the long PyTorch compile every time, and I also don't want the prepackaged PyTorch builds to contain the
Hi @danielzgtg, thanks for reporting this. In general, we recommend using the component versions listed in the components list: Component Versions to ensure compatibility, as this has been rigorously tested prior to release. Specifically for rocBLAS, it is highly recommended to use the Tensile version found in tensile_tag, as this is the official version supported.
Versions are completely managed by AMD's pip and apt repositories for Ubuntu on my computer. Anyway, AMD's PyTorch-ROCm has updated to ROCm 6.2. To avoid conflicts and duplicate shared libraries, I have just been building PyTorch-ROCm from source. You can close this issue if you believe that this won't happen again halfway through a future ROCm 7 rollout.
Hi @danielzgtg, The versioning using |
Can you please ensure that this becomes the case? AMD's docs for Ubuntu specify https://repo.radeon.com/rocm/apt/6.3/ . However, the PyTorch docs only have https://download.pytorch.org/whl/rocm6.2 ( If your release management were synchronized, everything would have been fine. But because this has happened across two pairs of versions, I am still requesting the addition of a mismatched version detection feature.
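To make the version skew described above concrete, here is a minimal sketch that pulls the ROCm version out of the two URLs quoted in this comment and compares them. The helper names are hypothetical, and the assumption that the version always appears as `X.Y` in the last path segment (optionally prefixed with `rocm`) is only inferred from these two URLs.

```python
import re

# The two URLs quoted in the comment above.
APT_REPO = "https://repo.radeon.com/rocm/apt/6.3/"
WHEEL_INDEX = "https://download.pytorch.org/whl/rocm6.2"


def rocm_version_from_url(url: str) -> tuple[int, int]:
    """Extract (major, minor) from the last path segment of a URL.

    Hypothetical helper: assumes the segment looks like '6.3' or 'rocm6.2'.
    """
    segment = url.rstrip("/").rsplit("/", 1)[-1]
    match = re.fullmatch(r"(?:rocm)?(\d+)\.(\d+)", segment)
    if match is None:
        raise ValueError(f"no ROCm version found in {url!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(apt_url: str, wheel_url: str) -> bool:
    """True when both URLs point at the same ROCm major.minor release."""
    return rocm_version_from_url(apt_url) == rocm_version_from_url(wheel_url)
```

Calling `versions_match(APT_REPO, WHEEL_INDEX)` compares the apt repository release with the wheel index release; a check like this could run before an upgrade, but it only inspects URLs and says nothing about what is actually installed.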
Hi @danielzgtg, I had a chat with the internal PyTorch team and they confirmed that the current cadence between ROCm and the corresponding PyTorch wheels release is around 2 weeks. If you are encountering incompatibility issues during these periods, rolling back or not upgrading your ROCm version temporarily should solve it.
I guess that means "workaround exists" and WONTFIX.
Describe the bug
rocBLAS 5.6 fails with a confusing error message when mixed with ROCm 6.0 libraries or TensileLibrary.
To Reproduce
Precise version of rocBLAS installed or rocBLAS commit hash if building from source.
Steps to reproduce the behavior:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.6
Expected behavior
I should not have to spend an hour debugging this, and only find the problem using gdb. rocBLAS 5.6 should either succeed or give a clear error message when loading the TensileLibrary from rocBLAS 6.0 or when loaded while mixed in with ROCm shared libraries.
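As an illustration of the kind of clear error asked for here, the following is a user-side sketch, not something rocBLAS or PyTorch is known to do. It compares the ROCm version a PyTorch build reports with the version of the system install and raises a readable message when they differ. It assumes `torch.version.hip` holds a string such as `5.6.31061-...` on ROCm builds and that the system version can be read from `/opt/rocm/.info/version`; both should be verified on the machine in question. The function names are hypothetical.

```python
from pathlib import Path


def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a string such as '5.6.31061-8c743ae5d'."""
    major, minor = version.strip().split(".")[:2]
    return int(major), int(minor)


def check_rocm_match(torch_hip_version: str, system_rocm_version: str) -> None:
    """Raise a readable error if the two versions differ in major or minor."""
    built = major_minor(torch_hip_version)
    installed = major_minor(system_rocm_version)
    if built != installed:
        raise RuntimeError(
            f"PyTorch was built against ROCm {built[0]}.{built[1]}, "
            f"but ROCm {installed[0]}.{installed[1]} is installed. "
            "Install matching versions or rebuild PyTorch from source."
        )


def check_installed(version_file: str = "/opt/rocm/.info/version") -> None:
    """Assumed wiring: the attribute and the file path are assumptions."""
    import torch  # imported lazily so the helpers above work without PyTorch

    hip = getattr(torch.version, "hip", None)
    if hip is None:
        raise RuntimeError("this PyTorch build was not compiled with ROCm/HIP")
    check_rocm_match(hip, Path(version_file).read_text())
```

Running `check_installed()` before the first GPU call would turn a mismatch into a one-line message instead of a gdb session. Comparing only major and minor is a design choice for this sketch; it does not prove ABI compatibility.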
Log-files
Environment
environment.txt
Workaround
Recompile PyTorch manually. This will ensure that it loads shared libraries from `/opt` instead of `venv`.
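To check which copy of a shared library a process actually loaded, one option on Linux is to read `/proc/self/maps`. The sketch below is a hypothetical helper, not part of any ROCm or PyTorch tooling; it assumes the standard six-column maps layout (address, permissions, offset, device, inode, pathname).

```python
import os


def loaded_libraries(maps_text: str, name_fragment: str) -> list[str]:
    """Return distinct mapped file paths whose file name contains name_fragment."""
    paths: list[str] = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        path = fields[5].strip()
        if name_fragment in os.path.basename(path) and path not in paths:
            paths.append(path)
    return paths
```

For example, after importing PyTorch and running one GPU operation, `loaded_libraries(open("/proc/self/maps").read(), "librocblas")` lists the rocBLAS files mapped into the process, which shows whether they come from `/opt` or from the virtual environment.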