Output issue on H100 with memory-efficient kernel #468
I also have the same issue with my protein. Here is what I see on H100 and A100 with the same MSAs, FASTA sequence, and input settings. Do you have a temporary fix for this issue, @psohani?
Thanks for the additional confirmation. The temporary fix would be to hard-code the criteria that bypass the custom kernel wherever it is called. Specifically, these changes should be sufficient:
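The exact diff isn't reproduced above, but the gist of the suggested change can be sketched as a small gating helper. This is a hypothetical illustration, not the maintainers' actual patch; the function name and the string check are assumptions, and the real change would live wherever the kernel is invoked:

```python
# Hypothetical sketch of the workaround described above: hard-code a check
# that bypasses the custom memory-efficient kernel on H100 GPUs, where its
# output is incorrect. All names here are illustrative, not OpenFold's.

def should_use_mem_efficient_kernel(device_name: str) -> bool:
    """Return False on devices where the custom kernel is known to misbehave."""
    # In PyTorch, device_name would typically come from
    # torch.cuda.get_device_name(); checking the name string is one
    # simple way to hard-code the bypass criteria.
    return "H100" not in device_name
```

When this returns False, the attention layer would fall back to one of the known-good paths (the default implementation or the DeepSpeed kernels).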
Please confirm whether this workaround resolves the issue on H100.
I just added this flag in the inference command and it worked (it just takes longer to run):
Hello!
I installed OpenFold with CUDA 12 following https://openfold.readthedocs.io/en/latest/Installation.html. You need to run `git clone -b pl_upgrades https://github.com/aqlaboratory/openfold.git`; see #462 (comment).
@abhinavb22 Thank you!
@abhinavb22 I tried to install from the "pl_upgrades" branch, but with the default environment.yml it installed, for example, numpy=2.*, which doesn't work with this version of OpenFold. pandas and pytorch-lightning also had incompatible newer versions, so I could not run OpenFold training. I changed:
I also tried with the default pytorch-lightning install (v2.4.0) but got the same error about 'dataloader_idx', and I don't know how to solve it. Could you share which package versions you have? I checked on both V100 and H100 and got the same error.
@vetmax7 I haven't seen this issue. Are the unit tests passing?
I believe I pinned numpy to 1.26. The following is the environment.yml file I used (it begins with name: openfold_cuda12):
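The full file is not reproduced in the thread, but a minimal fragment showing the numpy pin mentioned above might look like this (hypothetical; the other pinned packages and CUDA-related entries are omitted):

```yaml
# Hypothetical fragment of environment.yml: only the numpy pin from the
# comment above is shown; all other dependencies are omitted.
name: openfold_cuda12
dependencies:
  - numpy=1.26
```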
Hi all!
However, the main problem is not resolved. I still get:
UPD: in openfold/utils/logger.py
I cannot speak to the other failed tests you're seeing, but the one regarding precision appears to be the same as the one I'm getting here: #481. I agree that the part about CUTLASS appears to be missing from the wiki documentation. Exporting the env var helps pass additional tests and is what allowed me to get to the point I'm at now.
I noticed that only some of the tests were adapted for this branch. If I run only those, they pass.
Hello,
As per this line, when neither DeepSpeed nor LMA is selected, the custom memory-efficient kernel is used for the attention layer. When running inference with this option on an H100, the output appears to be completely random, so there may be a bug in this kernel that manifests only on the H100.
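The selection order described above (DeepSpeed first, then LMA, then the custom kernel as the default fallback) can be sketched roughly as follows. The flag and return names here are illustrative assumptions, not OpenFold's exact API:

```python
# Illustrative-only sketch of the attention-path selection described above.
# The custom memory-efficient kernel is the default fallback when neither
# alternative is selected, which is the path that misbehaves on H100.

def pick_attention_impl(use_deepspeed: bool, use_lma: bool) -> str:
    if use_deepspeed:
        return "deepspeed_evo_attention"
    if use_lma:
        return "low_memory_attention"
    # Neither alternative selected: fall through to the custom kernel.
    return "memory_efficient_kernel"
```

Under this ordering, simply enabling either DeepSpeed or LMA in the inference command keeps the buggy path from ever being reached.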
For reference, here are the unrelaxed predictions on A100 and H100, for the 5XQN sequence (left, green is A100 output; right, yellow is H100 output):
Both the above tests were run with the same pre-computed MSA alignment. In case it helps, we can also share the MSA used for this protein.
A possible workaround is to unconditionally disable the memory-efficient kernel, at the cost of increased memory usage. The other alternative is, of course, to enable the DeepSpeed kernels. We have tested both options and confirmed that their outputs are correct.
Please consider how this issue can be resolved; thanks!