
Output issue on H100 with memory-efficient kernel #468

Open · psohani opened this issue Jul 15, 2024 · 12 comments

@psohani

psohani commented Jul 15, 2024

Hello,

As per this line, when neither DeepSpeed nor LMA is selected, the custom memory-efficient kernel is used for the attention layer. When running inference with this option on an H100, the output appears to be completely random, so there could be a bug in this kernel that shows up only on the H100.
For reference, here are the unrelaxed predictions on A100 and H100, for the 5XQN sequence (left, green is A100 output; right, yellow is H100 output):
[Screenshot: unrelaxed 5XQN predictions, A100 output (green, left) vs. H100 output (yellow, right)]
Both the above tests were run with the same pre-computed MSA alignment. In case it helps, we can also share the MSA used for this protein.

A possible workaround is to unconditionally disable the memory-efficient kernel, at the cost of increased memory usage. The other alternative is, of course, to enable the DeepSpeed kernels. We have analyzed both options and confirmed that their outputs are correct.
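For context, here is a rough sketch of the selection precedence being described; the function and flag names are illustrative, not OpenFold's actual code:

    # Illustrative sketch of the attention-path precedence discussed above
    # (not OpenFold's actual code). DeepSpeed and LMA take priority; otherwise
    # the custom memory-efficient kernel is used, which is the path that
    # misbehaves on H100.
    def select_attention_path(use_deepspeed_evo_attention: bool, use_lma: bool) -> str:
        if use_deepspeed_evo_attention:
            return "deepspeed_evoformer_attention"
        if use_lma:
            return "low_memory_attention"
        return "custom_memory_efficient_kernel"  # suspect path on H100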

Please consider how this issue can be resolved; thanks!

@abhinavb22

I also have the same issue with my protein. Here is what I see on H100 and A100 with the same MSAs, FASTA sequence, and input settings.
[Screenshots: H100 vs. A100 predictions for the same inputs]

Do you have any temporary fix for this issue @psohani ?

@psohani
Author

psohani commented Jul 17, 2024

Thanks for the additional confirmation. The temporary fix is to simply hard-code the flags that bypass the custom kernel wherever it is called. Specifically, these changes should be sufficient:

  1. Add inplace_safe = False before this line: structure_module.py#L435
  2. Add use_memory_efficient_kernel = False before this line: primitives.py#L517

Please confirm if this workaround resolves the issue on H100.
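A minimal sketch of those two edits; only the two added assignments come from the steps above, and the surrounding code is not shown:

    # (1) In openfold/model/structure_module.py, add just above the line
    #     referenced as structure_module.py#L435:
    inplace_safe = False  # force the non-in-place code path

    # (2) In openfold/model/primitives.py, add just above the line
    #     referenced as primitives.py#L517:
    use_memory_efficient_kernel = False  # bypass the kernel suspected of misbehaving on H100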

@abhinavb22

I just added this flag to the inference command and it worked (it just takes longer to run):
--use_deepspeed_evoformer_attention
I will try your fix and see if that helps too. Thank you.

@vetmax7

vetmax7 commented Aug 19, 2024

Hello!
@psohani @abhinavb22, could you please explain how you used the H100 with OpenFold, given that by default it uses pytorch=1.12 and cuda=11.3.1, which do not support the H100 (sm_90)?

@abhinavb22

abhinavb22 commented Aug 20, 2024

Hello! @psohani @abhinavb22, could you please explain how you used the H100 with OpenFold, given that by default it uses pytorch=1.12 and cuda=11.3.1, which do not support the H100 (sm_90)?

I installed OpenFold with CUDA 12 by following https://openfold.readthedocs.io/en/latest/Installation.html. You need to run: git clone -b pl_upgrades https://github.com/aqlaboratory/openfold.git
and then follow the remaining steps.

refer: #462 (comment)

@vetmax7

vetmax7 commented Aug 21, 2024

@abhinavb22 Thank you!

@vetmax7

vetmax7 commented Aug 25, 2024

Hi @abhinavb22 @psohani

@abhinavb22 I tried to install from the "pl_upgrades" branch, but with the default environment.yml it installed, for example, numpy=2.*, which doesn't work with this version of OpenFold. pandas and pytorch-lightning were also pulled in at incompatible newer versions, so I could not run OpenFold training.

I changed:

  • pytorch-lightning=2.1.4
  • numpy=1.21.*
  • pandas=2.0.*

I also tried the default installed PL (v. 2.4.0), but got the same error about 'dataloader_idx'.

I now get an error about 'dataloader_idx' and don't know how to solve it. Could you share which package versions you have? (A small version-check snippet is sketched at the end of this comment.) I checked with both V100 and H100 and got the same error.

TypeError: PerformanceLoggingCallback.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx' [rank: 1] Child process with PID 3461284 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟

  File "train_openfold.py", line 703, in <module>
    main(args)
  File "train_openfold.py", line 452, in main
    trainer.fit(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 223, in advance
    call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
TypeError: PerformanceLoggingCallback.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'
[rank: 1] Child process with PID 3461284 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
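To make it easier to compare environments, here is a small, hypothetical version check (not part of OpenFold) that prints the packages discussed above:

    # Hypothetical helper, not part of OpenFold: print the versions relevant
    # to the pins discussed in this comment.
    import numpy
    import pandas
    import pytorch_lightning
    import torch

    print("numpy:", numpy.__version__)  # expect 1.x, not 2.x
    print("pandas:", pandas.__version__)
    print("pytorch-lightning:", pytorch_lightning.__version__)
    print("torch:", torch.__version__, "(CUDA", torch.version.cuda, ")")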

@RJ3

RJ3 commented Aug 26, 2024

@vetmax7 I haven't seen this issue. Are the unit tests working?

@abhinavb22

Hi @abhinavb22 @psohani

@abhinavb22 I tried to install from the "pl_upgrades" branch, but with the default environment.yml it installed, for example, numpy=2.*, which doesn't work with this version of OpenFold. […] Could you share which package versions you have? […]

I think I pinned numpy to 1.26. The following is the environment.yml file I used:

name: openfold_cuda12
channels:
  - conda-forge
  - bioconda
  - pytorch
  - nvidia
dependencies:
  - python=3.10
  - libgcc=7.2
  - setuptools=59.5.0
  - pip
  - openmm=7.7
  - pdbfixer
  - pytorch-lightning
  - biopython
  - numpy=1.26
  - pandas
  - PyYAML==5.4.1
  - requests
  - scipy
  - tqdm==4.62.2
  - typing-extensions
  - wandb
  - modelcif==0.7
  - awscli
  - ml-collections
  - mkl=2022.1
  - aria2
  - git
  - bioconda::hmmer
  - bioconda::hhsuite
  - bioconda::kalign2
  - pytorch::pytorch=2.1
  - pytorch::pytorch-cuda=12.1
  - pip:

@vetmax7

vetmax7 commented Aug 28, 2024

Hi all!
@RJ3
I fixed the problem with the tests. They were asking: "Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH".
So I downloaded CUTLASS and bound it into the container.
After that I got:

[2024-08-28 14:36:35,517] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
s.................Using /home/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/torch_extensions/py310_cu121/evoformer_attn/build.ninja...
Building extension module evoformer_attn...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module evoformer_attn...
..F.........s...s.sss.ss...FFsssssssss.sss....ssssss..s.s.s.ss.s......s.s..ss...ss.s.s....s........
======================================================================
FAIL: test_compare_model (tests.test_deepspeed_evo_attention.TestDeepSpeedKernel)
Run full model with and without using DeepSpeed Evoformer attention kernel
----------------------------------------------------------------------
Traceback (most recent call last):
  File "openfold/tests/test_deepspeed_evo_attention.py", line 334, in test_compare_model
    compare_utils.assert_mean_abs_diff_small(out_repro, out_repro_ds, eps)
  File "openfold/tests/compare_utils.py", line 139, in assert_mean_abs_diff_small
    _assert_abs_diff_small_base(torch.mean, expected, actual, eps)
  File "openfold/tests/compare_utils.py", line 131, in _assert_abs_diff_small_base
    torch.testing.assert_close(err, zero_tensor, atol=eps, rtol=rtol)
  File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not close!

Expected 0.0 but got 8.250319480895996.
Absolute difference: 8.250319480895996 (up to 0.2 allowed)
Relative difference: inf (up to 1.3e-06 allowed)

======================================================================
FAIL: test_attention_core_backward (tests.test_kernels.TestAttentionCore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "openfold/tests/test_kernels.py", line 77, in test_attention_core_backward
    self.assertTrue(
AssertionError: tensor(False, device='cuda:0') is not true

======================================================================
FAIL: test_attention_core_forward (tests.test_kernels.TestAttentionCore)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "openfold/tests/test_kernels.py", line 28, in test_attention_core_forward
    self.assertTrue(torch.max(torch.abs(out_repro - out_gt)) < consts.eps)
AssertionError: tensor(False, device='cuda:0') is not true

----------------------------------------------------------------------
Ran 117 tests in 24.608s

FAILED (failures=3, skipped=41)
Time to load evoformer_attn op: 0.10246014595031738 seconds

Test(s) failed. Make sure you've installed all Python dependencies.

However, the main problem is not resolved. I still get:

93.2 M    Trainable params
0         Non-trainable params
93.2 M    Total params
372.916   Total estimated model params size (MB)
4451      Modules in train mode
0         Modules in eval mode
/opt/conda/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:105: Total length of `list` across ranks is zero. Please make sure this was your intention.
TypeError: PerformanceLoggingCallback.on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'
[rank: 1] Child process with PID 4104199 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟

UPD: in openfold/utils/logger.py I changed dataloader_idx to dataloader_idx=None. At least it runs now, but...
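For anyone hitting the same error, a minimal sketch of that change, assuming the PyTorch Lightning >= 2.0 hook signature; only the dataloader_idx=None default is the actual fix, and the method bodies are illustrative stand-ins for OpenFold's real logging code:

    import time

    from pytorch_lightning import Callback

    class PerformanceLoggingCallback(Callback):
        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=None):
            # PyTorch Lightning >= 2.0 calls this hook without dataloader_idx,
            # so the default avoids the missing-argument TypeError above.
            self._batch_start = time.perf_counter()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            # Illustrative body: report how long the batch took.
            print(f"batch {batch_idx} took {time.perf_counter() - self._batch_start:.3f}s")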

@RJ3

RJ3 commented Aug 28, 2024

I cannot speak to the other failed tests you're receiving, but the one regarding precision appears to be the same as the one I'm getting here: #481
It should pass sometimes.

I agree that the CUTLASS step appears to be missing from the wiki documentation. Exporting the env var helps additional tests pass and is what allowed me to get to the point I'm at now.

@vetmax7

vetmax7 commented Aug 29, 2024

I cannot speak to the other failed tests you're receiving, but the one regarding precision appears to be the same as the one I'm getting here: #481 […]

I noticed that only some tests were adapted for this branch. If I run only them, they work OK.
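If it helps, here is a hypothetical way to run just one adapted test module (module name taken from the log earlier in this thread) instead of the whole suite:

    # Hypothetical: run only the DeepSpeed Evoformer attention tests from the
    # repo root instead of the full test suite.
    import unittest

    suite = unittest.defaultTestLoader.loadTestsFromName("tests.test_deepspeed_evo_attention")
    unittest.TextTestRunner(verbosity=2).run(suite)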
