
Output issue on H100 with memory-efficient kernel #468

Open · psohani opened this issue Jul 15, 2024 · 12 comments

@psohani

psohani commented Jul 15, 2024

Hello,

As per this line, when neither DeepSpeed nor LMA is selected, the custom memory-efficient kernel is used for the attention layer. When running inference with this option on an H100, the output appears to be completely random, so there could be a bug in this kernel that shows up only on the H100.
For reference, here are the unrelaxed predictions on A100 and H100, for the 5XQN sequence (left, green is A100 output; right, yellow is H100 output):
[Screenshot: unrelaxed 5XQN predictions, A100 output (green, left) vs. H100 output (yellow, right)]
Both the above tests were run with the same pre-computed MSA alignment. In case it helps, we can also share the MSA used for this protein.

A possible workaround is to unconditionally disable the memory-efficient kernel, at the cost of increased memory usage. The other alternative is, of course, to enable the DeepSpeed kernels. We have analyzed both options and confirmed that their outputs are correct.
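For context, here is a rough sketch of the selection precedence being described; the function and flag names are illustrative, not OpenFold's actual code:

    # Illustrative sketch of the attention-path precedence discussed above
    # (not OpenFold's actual code). DeepSpeed and LMA take priority; otherwise
    # the custom memory-efficient kernel is used, which is the path that
    # misbehaves on H100.
    def select_attention_path(use_deepspeed_evo_attention: bool, use_lma: bool) -> str:
        if use_deepspeed_evo_attention:
            return "deepspeed_evoformer_attention"
        if use_lma:
            return "low_memory_attention"
        return "custom_memory_efficient_kernel"  # suspect path on H100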

Please consider how this issue can be resolved; thanks!

@abhinavb22

I also have the same issue with my protein. Here is what I see on H100 and A100 with the same MSAs, FASTA sequence, and input settings.
[Screenshots: H100 vs. A100 predictions for the same inputs]

Do you have any temporary fix for this issue @psohani ?

@psohani
Author

psohani commented Jul 17, 2024

Thanks for the additional confirmation. The temporary fix is to simply hard-code the flags that bypass the custom kernel wherever it is called. Specifically, these changes should be sufficient:

  1. Add inplace_safe = False before this line: structure_module.py#L435
  2. Add use_memory_efficient_kernel = False before this line: primitives.py#L517

Please confirm if this workaround resolves the issue on H100.
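A minimal sketch of those two edits; only the two added assignments come from the steps above, and the surrounding code is not shown:

    # (1) In openfold/model/structure_module.py, add just above the line
    #     referenced as structure_module.py#L435:
    inplace_safe = False  # force the non-in-place code path

    # (2) In openfold/model/primitives.py, add just above the line
    #     referenced as primitives.py#L517:
    use_memory_efficient_kernel = False  # bypass the kernel suspected of misbehaving on H100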

@abhinavb22

I just added this flag to the inference command and it worked (it just takes longer to run):
--use_deepspeed_evoformer_attention
I will try your fix and see if that helps too. Thank you.

@vetmax7

vetmax7 commented Aug 19, 2024

Hello!
@psohani @abhinavb22, could you please explain how you used the H100 with OpenFold, given that by default it uses pytorch=1.12 and cuda=11.3.1, which do not support the H100 (sm_90)?

@abhinavb22

abhinavb22 commented Aug 20, 2024

Hello! @psohani @abhinavb22, could you please explain how you used the H100 with OpenFold, given that by default it uses pytorch=1.12 and cuda=11.3.1, which do not support the H100 (sm_90)?

I installed OpenFold with CUDA 12 by following https://openfold.readthedocs.io/en/latest/Installation.html. You need to run: git clone -b pl_upgrades https://github.com/aqlaboratory/openfold.git
and then follow the remaining steps.

refer: #462 (comment)

@vetmax7

vetmax7 commented Aug 21, 2024

@abhinavb22 Thank you!

@vetmax7

vetmax7 commented Aug 25, 2024

Hi @abhinavb22 @psohani

@abhinavb22 I tried to install from the "pl_upgrades" branch, but with the default environment.yml it installed, for example, numpy=2.*, which doesn't work with this version of OpenFold. pandas and pytorch-lightning were also pulled in at incompatible newer versions, so I could not run OpenFold training.

I changed:

  • pytorch-lightning=2.1.4
  • numpy=1.21.*
  • pandas=2.0.*

I also tried the default installed PL (v. 2.4.0), but got the same error about 'dataloader_idx'.

I now get an error about 'dataloader_idx' and don't know how to solve it. Could you share which package versions you have? (A small version-check snippet is sketched at the end of this comment.) I checked with both V100 and H100 and got the same error.

TypeError: PerformanceLoggingCallback.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx' [rank: 1] Child process with PID 3461284 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟

  File "train_openfold.py", line 703, in <module>
    main(args)
  File "train_openfold.py", line 452, in main
    trainer.fit(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 223, in advance
    call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
TypeError: PerformanceLoggingCallback.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'
[rank: 1] Child process with PID 3461284 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
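To make it easier to compare environments, here is a small, hypothetical version check (not part of OpenFold) that prints the packages discussed above:

    # Hypothetical helper, not part of OpenFold: print the versions relevant
    # to the pins discussed in this comment.
    import numpy
    import pandas
    import pytorch_lightning
    import torch

    print("numpy:", numpy.__version__)  # expect 1.x, not 2.x
    print("pandas:", pandas.__version__)
    print("pytorch-lightning:", pytorch_lightning.__version__)
    print("torch:", torch.__version__, "(CUDA", torch.version.cuda, ")")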

@RJ3

RJ3 commented Aug 26, 2024

@vetmax7 I haven't seen this issue. Are the unit tests working?

@abhinavb22

Hi @abhinavb22 @psohani

@abhinavb22 I tried to install from the "pl_upgrades" branch, but with the default environment.yml it installed, for example, numpy=2.*, which doesn't work with this version of OpenFold. […] Could you share which package versions you have? […]

I think I pinned numpy to 1.26. The following is the environment.yml file I used:

name: openfold_cuda12
channels:
  - conda-forge
  - bioconda
  - pytorch
  - nvidia
dependencies:
  - python=3.10
  - libgcc=7.2
  - setuptools=59.5.0
  - pip
  - openmm=7.7
  - pdbfixer
  - pytorch-lightning
  - biopython
  - numpy=1.26
  - pandas
  - PyYAML==5.4.1
  - requests
  - scipy
  - tqdm==4.62.2
  - typing-extensions
  - wandb
  - modelcif==0.7
  - awscli
  - ml-collections
  - mkl=2022.1
  - aria2
  - git
  - bioconda::hmmer
  - bioconda::hhsuite
  - bioconda::kalign2
  - pytorch::pytorch=2.1
  - pytorch::pytorch-cuda=12.1
  - pip:

@vetmax7

vetmax7 commented Aug 28, 2024

Hi all!
@RJ3
I fixed the problem with the tests. They were asking: "Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH".
So I downloaded CUTLASS and bound it into the container.
After that I got:

[2024-08-28 14:36:35,517] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
s.................Using /home/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/torch_extensions/py310_cu121/evoformer_attn/build.ninja...
Building extension module evoformer_attn...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module evoformer_attn...
..F.........s...s.sss.ss...FFsssssssss.sss....ssssss..s.s.s.ss.s......s.s..ss...ss.s.s....s........
======================================================================
FAIL: test_compare_model (tests.test_deepspeed_evo_attention.TestDeepSpeedKernel)
Run full model with and without using DeepSpeed Evoformer attention kernel
----------------------------------------------------------------------
Traceback (most recent call last):
  File "openfold/tests/test_deepspeed_evo_attention.py", line 334, in test_compare_model
    compare_utils.assert_mean_abs_diff_small(out_repro, out_repro_ds, eps)
  File "openfold/tests/compare_utils.py", line 139, in assert_mean_abs_diff_small
    _assert_abs_diff_small_base(torch.mean, expected, actual, eps)
  File "openfold/tests/compare_utils.py", line 131, in _assert_abs_diff_small_base
    torch.testing.assert_close(err, zero_tensor, atol=eps, rtol=rtol)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not close!

Expected 0.0 but got 8.250319480895996.
Absolute difference: 8.250319480895996 (up to 0.2 allowed)
Relative difference: inf (up to 1.3e-06 allowed)

======================================================================
FAIL: test_attention_core_backward (tests.test_kernels.TestAttentionCore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "openfold/tests/test_kernels.py", line 77, in test_attention_core_backward
    self.assertTrue(
AssertionError: tensor(False, device='cuda:0') is not true

======================================================================
FAIL: test_attention_core_forward (tests.test_kernels.TestAttentionCore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "openfold/tests/test_kernels.py", line 28, in test_attention_core_forward
    self.assertTrue(torch.max(torch.abs(out_repro - out_gt)) < consts.eps)
AssertionError: tensor(False, device='cuda:0') is not true

----------------------------------------------------------------------
Ran 117 tests in 24.608s

FAILED (failures=3, skipped=41)
Time to load evoformer_attn op: 0.10246014595031738 seconds

Test(s) failed. Make sure you've installed all Python dependencies.

However, the main problem is not resolved. I still get:

93.2 M    Trainable params
0         Non-trainable params
93.2 M    Total params
372.916   Total estimated model params size (MB)
4451      Modules in train mode
0         Modules in eval mode
/opt/conda/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:105: Total length of `list` across ranks is zero. Please make sure this was your intention.
TypeError: PerformanceLoggingCallback.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'
[rank: 1] Child process with PID 4104199 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟

UPD: in openfold/utils/logger.py I changed dataloader_idx to dataloader_idx=None. At least it runs now, but...
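For anyone hitting the same error, a minimal sketch of that change, assuming the PyTorch Lightning >= 2.0 hook signature; only the dataloader_idx=None default is the actual fix, and the method bodies are illustrative stand-ins for OpenFold's real logging code:

    import time

    from pytorch_lightning import Callback

    class PerformanceLoggingCallback(Callback):
        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=None):
            # PyTorch Lightning >= 2.0 calls this hook without dataloader_idx,
            # so the default avoids the missing-argument TypeError above.
            self._batch_start = time.perf_counter()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            # Illustrative body: report how long the batch took.
            print(f"batch {batch_idx} took {time.perf_counter() - self._batch_start:.3f}s")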

@RJ3

RJ3 commented Aug 28, 2024

I cannot speak to the other failed tests you're receiving, but the one regarding precision appears to be the same as the one I'm getting here: #481
It should pass sometimes.

I agree that the CUTLASS step appears to be missing from the wiki documentation. Exporting the env var helps additional tests pass and is what allowed me to get to the point I'm at now.

@vetmax7

vetmax7 commented Aug 29, 2024

I cannot speak to the other failed tests you're receiving, but the one regarding precision appears to be the same as the one I'm getting here: #481 […]

I noticed that only some tests were adapted for this branch. If I run only them, they work OK.
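If it helps, here is a hypothetical way to run just one adapted test module (module name taken from the log earlier in this thread) instead of the whole suite:

    # Hypothetical: run only the DeepSpeed Evoformer attention tests from the
    # repo root instead of the full test suite.
    import unittest

    suite = unittest.defaultTestLoader.loadTestsFromName("tests.test_deepspeed_evo_attention")
    unittest.TextTestRunner(verbosity=2).run(suite)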
