
run_vep_embeddings.py GPU memory allocation error #33

Closed
peterdfields opened this issue Jan 2, 2025 · 6 comments

@peterdfields

Hi,

I've been working through the Arabidopsis example as a test run, to understand the pipeline before trying it on my own system. However, I've run into a few errors that I was hoping to run by you.

The error I'm currently stuck on produces the following message:

Traceback (most recent call last):
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/gpn/ss/run_vep_embeddings.py", line 205, in <module>
    pred = run_vep(
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/gpn/ss/run_vep_embeddings.py", line 144, in run_vep
    return trainer.predict(test_dataset=variants).predictions
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/transformers/trainer.py", line 4128, in predict
    output = eval_loop(
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/transformers/trainer.py", line 4271, in evaluation_loop
    all_preds.add(logits)
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 322, in add
    self.tensors = nested_concat(self.tensors, tensors, padding_index=self.padding_index)
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 136, in nested_concat
    return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
  File "/home/fieldp/miniconda3/envs/gpn-arabidopsis/lib/python3.9/site-packages/transformers/trainer_pt_utils.py", line 94, in torch_pad_and_concatenate
    return torch.cat((tensor1, tensor2), dim=0)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 38.75 GiB. GPU 0 has a total capacity of 79.14 GiB of which 38.74 GiB is free. Including non-PyTorch memory, this process has 40.39 GiB memory in use. Of the allocated memory 39.02 GiB is allocated by PyTorch, and 48.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This error seems a bit strange given the large jump in memory allocation. I've tried reducing the batch size, but I keep hitting the error at around 62% completion regardless of the batch size specified. The modifications suggested in #13 no longer seem to exist in the relevant script, but perhaps I've missed something? You can see the full conda environment here: yaml. Exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True also makes no difference. Before I dig in further, I was wondering whether the solution is already obvious to you?

I also ran into the issue described in #27 and created a fork with a fix. It's a pretty small change, so I'm not sure whether I should open a pull request, given that you may want to keep a consistent structure in the repo.

Please let me know if any additional information would be helpful. Thank you for your time and assistance.

@gonzalobenegas
Collaborator

Hello, thank you for your interest, and I'm sorry you are running into this issue. run_vep_embeddings is not documented and I've only used it in an unpublished project. I think I know what the issue might be. This script calculates embedding similarity metrics, including per-hidden-dimension metrics, so it produces a fairly high-dimensional score per variant and can use a lot of memory (I've only used it to score a few thousand variants in my other project, so it was not an issue there). I don't recommend using this function as-is for large-scale variant scoring.
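
As a rough back-of-envelope illustration of why this blows up (the hidden size, metric count, and variant count below are assumptions chosen for illustration, not values read from the model or dataset):

# Back-of-envelope estimate; hidden_size and n_variants are assumptions,
# replace them with your actual model config and variant count.
hidden_size = 512                   # assumed hidden dimension
n_metrics = 3                       # e.g. Euclidean, cosine, inner product
floats_per_variant = n_metrics * (hidden_size + 1)  # per-dimension + overall
n_variants = 10_000_000             # assumed variant count
gib = floats_per_variant * n_variants * 4 / 2**30   # float32 bytes -> GiB
print(f"~{gib:.0f} GiB of predictions")  # ~57 GiB; torch.cat also holds old and new buffers at once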

Some useful low-dimensional scores which should work out of the box:

  • Log-likelihood ratio (run_vep.py)
  • Euclidean distance between embeddings (run_vep_embed_dist.py)

run_vep_embeddings is a bloated script that calculates Euclidean distance, cosine distance, and inner product between embeddings, using all hidden dimensions as well as individual dimensions. If you do want to calculate all of these for millions of variants without running out of memory, one strategy would be to move predictions from GPU to CPU more often (https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_accumulation_steps), or to save predictions to disk more often if the CPU itself doesn't have enough memory. A simple approach would be to split the variants into shards and run the script on each of them separately; see the sketch below.
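
A minimal sketch of both ideas, assuming a Hugging Face Trainer like the one run_vep_embeddings uses; model and variants are placeholders for whatever the script actually builds, and the batch size, step count, and shard count are made up:

from transformers import Trainer, TrainingArguments
import numpy as np

args = TrainingArguments(
    output_dir="tmp",
    per_device_eval_batch_size=512,  # placeholder value
    eval_accumulation_steps=8,       # offload accumulated predictions to CPU every 8 steps
)
trainer = Trainer(model=model, args=args)  # model: placeholder for the loaded GPN model

# Shard the variants (a datasets.Dataset) and score each shard separately,
# writing each result to disk so neither GPU nor CPU holds everything at once.
n_shards = 100
for i in range(n_shards):
    shard = variants.shard(num_shards=n_shards, index=i)
    preds = trainer.predict(test_dataset=shard).predictions
    np.save(f"preds_shard_{i:03d}.npy", preds)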

Let me know if this makes sense and I'm happy to talk more about documented and undocumented ideas :)
I wonder if you were just trying to run gpn.ss.get_embeddings to get something like the UMAP?

Regarding #27 I could certainly accept a pull request, thank you!

@peterdfields
Author

Hi @gonzalobenegas. Thank you for your reply! I think I might have misunderstood the structure of the worked example provided at https://github.com/songlab-cal/gpn/tree/main/analysis/arabidopsis. I was running the Snakemake pipeline to performance-test my resources on a smaller genome like A. thaliana, and was then going to adapt it to some of the different taxa I work on to have a closer look at the results. I just wanted to make sure it worked all the way through before I started to mess with individual steps.

Given your response, would I be correct in thinking that a different set of scripts, or the IPython notebooks, is needed to adapt the analysis described in your 2023 PNAS paper to a different set of inputs? I'm particularly interested in both the validation of genome feature annotation and the variant effect prediction aspects of the manuscript.

Sounds good about the pull request!

Thanks again for your help.

@gonzalobenegas
Collaborator

My bad. I see now that the Snakefile runs run_vep_embeddings by default, since it is in the all rule. It is not part of the 2023 paper and shouldn't be there; I'm sorry.

Some thoughts. If you just want to reproduce the 2023 paper exactly, I would go back to the release labeled 0.2 and perhaps edit rule all to specify which targets you are interested in: e.g. expand("output/variants/all/vep/{model}.parquet", model=models) for variant effect prediction, and expand("output/embedding/umap/{model}.parquet", model=models + ["kmers"]) for the embedding UMAP (sketched below).
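
A sketch of what the trimmed-down rule might look like; models is assumed to be defined earlier in the Snakefile, as in the release:

rule all:
    input:
        expand("output/variants/all/vep/{model}.parquet", model=models),
        expand("output/embedding/umap/{model}.parquet", model=models + ["kmers"]),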

However, I've made modifications to the training and inference code (the updated code is described in https://github.com/songlab-cal/gpn/blob/main/README.md), which I would recommend for new analyses. Some are related to performance, such as the use of torchrun and torch.compile, and some are slight changes to argument names for consistency. You can still use the code in the Snakefile as a guide, but you might need to rename arguments, e.g. per-device-batch-size -> per_device_batch_size.

Apologies again; my management of the old analysis code and the updated training scripts has been messy.

@gonzalobenegas
Collaborator

Another note: if you just want to run the Jupyter notebooks, you can download all the intermediate files (the "Intermediate files necessary for running notebooks" link), such as embeddings and variant scores.

@peterdfields
Author

@gonzalobenegas Thank you for the clarification; that is very helpful. I'll pull the 0.2 release and start working through things. I will go ahead and close this issue.

I do have a few questions about applying the human model to mouse variants, and also about the use of pangenomes as training inputs. Would it be possible to email you directly with these questions?

@gonzalobenegas
Collaborator

Sounds good, feel free to email me at [email protected]!
