Trouble following documentation #226
Comments
Hm. PDB70 should be downloaded by The input FASTA file should contain the sequences for which you compute alignments, and so isn't included by default in the downloaded data. If you just want a large database of any alignments, I recommend checking out OpenProteinSet, our database of 4.5 million precomputed MSAs, of which 400k also come with template hits and AF structure predictions: https://registry.opendata.aws/openfold/
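To illustrate the point about the input FASTA file, here is a minimal sketch of how one might write such a file from sequences of interest. The helper name `write_fasta`, the record identifiers and the example sequence are hypothetical placeholders, not part of OpenFold.

```python
from pathlib import Path


def write_fasta(records, path, width=80):
    """Write {identifier: sequence} pairs to a FASTA file.

    Hypothetical helper for illustration; it is not an OpenFold function.
    Whitespace inside each sequence is removed and letters are upper-cased.
    """
    lines = []
    for name, seq in records.items():
        seq = "".join(seq.split()).upper()
        if not seq:
            raise ValueError(f"empty sequence for {name!r}")
        lines.append(f">{name}")
        # Break long sequences into fixed-width lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    Path(path).write_text("\n".join(lines) + "\n")
    return path
```

Calling `write_fasta({"example_1": "MKTAYIAKQRQISFVKSHFSRQ"}, "input.fasta")` with your own identifiers and sequences would produce a file that can be passed as the first argument of the alignment script.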
Hi Gahdritz,
If you ran the download scripts, you probably already have the Protein Data Bank mmCIF files. If not, you can run The
This is what I see now:
  | Name  | Type      | Params
0 | model | AlphaFold | 93.2 M
@gahdritz, can you please help?
Starting from the beginning, I downloaded all the data available here: https://registry.opendata.aws/openfold/ Specifically, I downloaded the pdb, uniclust30 and uniclust30_overflow directories. I am just trying to test things out, so instead of attempting to train the model on all the data, I moved 1000 directories from each of the directories above (pdb, uniclust30 and uniclust30_overflow) to another location so that I have a smaller version of each.
Since then, I have been trying to run the training script. Following the documentation, I was able to run this: But only after downloading the .cif files corresponding to some entries in the pdb directory. I downloaded those cif files from here: s3://pdbsnapshots/20220103/pub/pdb/data/structures/all/mmCIF/
After generating the mmcif_cache.json file, I was able to run this: Now I am trying to run this: But I keep getting the sampling error (RuntimeError: cannot sample n_sample <= 0 samples).
Can you please point me in the right direction, keeping in mind that I am not familiar at all with OpenFold? Thank you so much.
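The error message suggests that a sampler was asked to draw zero samples. One possible cause (a guess, not a confirmed diagnosis) is that after subsetting, few or none of the alignment directories have a matching structure file. The sketch below counts how many alignment directories have both a .cif file and an entry in mmcif_cache.json. It assumes that alignment subdirectories are named like `1abc_A` (PDB ID, underscore, chain), that the .cif files are named by PDB ID, and that the cache is a JSON object keyed by PDB ID; the function names are hypothetical.

```python
import json
from pathlib import Path


def pdb_id(chain_name):
    """'1abc_A' -> '1abc' (assumed naming: PDB ID, underscore, chain ID)."""
    return chain_name.rsplit("_", 1)[0].lower()


def coverage_report(alignment_root, mmcif_dir, cache_json=None):
    """Count alignment directories that also have a structure file.

    Hypothetical helper; the directory layout it expects is an assumption.
    """
    chains = sorted(p.name for p in Path(alignment_root).iterdir() if p.is_dir())
    cif_ids = {p.stem.lower() for p in Path(mmcif_dir).glob("*.cif")}
    cache_ids = None
    if cache_json is not None:
        # Assumed: the cache is a JSON object whose keys are PDB IDs.
        cache_ids = {k.lower() for k in json.loads(Path(cache_json).read_text())}

    def usable(chain):
        pid = pdb_id(chain)
        return pid in cif_ids and (cache_ids is None or pid in cache_ids)

    missing = [c for c in chains if not usable(c)]
    return {
        "chains": len(chains),
        "usable": len(chains) - len(missing),
        "missing_structure": missing,
    }
```

If `usable` comes back as 0, or much smaller than expected, the subset of alignments and the subset of structures probably do not overlap, and it may help to pick the 1000 directories so that each has a downloaded .cif file.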
Hello! Were you able to find a solution?
Hello @RJ3
By the way, are you using the latest version from the main branch? Did you install OpenFold with the default environment.yml? Which versions of numpy, pandas and pytorch-lightning do you have?
Thanks,
@RJ3 Are you trying to install the CUDA 12 or the CUDA 11 version?
I'm trying CUDA 12 and the
@RJ3 I have:
@RJ3 were you able to figure out the out-of-memory error? I am trying to train on A100 (40GB) GPUs with a crop size of 384 and it's crashing with an out-of-memory error.
Hi. I am trying to follow the documentation to install and train the model.
I have successfully installed everything and have run the following commands so far, also successfully:
bash scripts/download_alphafold_dbs.sh data/
bash scripts/download_mmseqs_dbs.sh data/
bash scripts/prep_mmseqs_dbs.sh data/
In my data directory, I have the following:
total 407176420
drwxrwxr-x 2 ubuntu ubuntu 6144 Oct 3 19:16 bfd
drwxrwxr-x 2 ubuntu ubuntu 6144 Oct 3 14:05 colabfold
-rw-rw-r-- 1 ubuntu ubuntu 117965643010 Sep 30 21:20 colabfold_envdb_202108.tar.gz
drwxrwxr-x 2 ubuntu ubuntu 38912 Oct 1 19:03 mmseqs_dbs
drwxrwxr-x 5 ubuntu ubuntu 6144 Oct 1 18:45 tmp
drwxrwxr-x 2 ubuntu ubuntu 6144 Oct 3 15:03 uniref30
-rw-rw-r-- 1 ubuntu ubuntu 149491476480 Sep 30 16:23 uniref30_2103.tar
-rw-rw-r-- 1 ubuntu ubuntu 149491476480 Oct 1 09:47 uniref30_2103.tar.gz
I am now trying to run the training part but I feel I am missing the data I need. For instance, I thought I was going to be able to do this:
python3 scripts/precompute_alignments_mmseqs.py input.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    alignment_dir \
    ~/MMseqs2/build/bin/mmseqs \
    /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70
But I don't seem to have what I need to create the input.fasta file and I also don't have colabfold_envdb_202108_db and data/pdb70/pdb70.
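Before launching the command, a quick check of which inputs are actually on disk can narrow things down. The sketch below is a hypothetical helper, not part of OpenFold. It treats each database argument as a path prefix, since MMseqs2 and HH-suite databases consist of several files sharing a common name. That the environment database lives under data/mmseqs_dbs is an assumption.

```python
from pathlib import Path


def prefix_exists(prefix):
    """True if any entry in the prefix's parent directory starts with its name."""
    p = Path(prefix).expanduser()
    if not p.parent.is_dir():
        return False
    return any(child.name.startswith(p.name) for child in p.parent.iterdir())


def missing_inputs(prefixes):
    """Return the prefixes for which no matching file or directory was found."""
    return [p for p in prefixes if not prefix_exists(p)]


if __name__ == "__main__":
    # Paths taken from the command above; adjust them to your own layout.
    wanted = [
        "input.fasta",
        "data/mmseqs_dbs/uniref30_2103_db",
        "data/mmseqs_dbs/colabfold_envdb_202108_db",  # assumed location
        "data/pdb70/pdb70",
    ]
    for path in missing_inputs(wanted):
        print("missing:", path)
```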
Can someone kindly point me in the right direction? I am not a data scientist; I am a data engineer/IT person who wears many hats, so if I say something that doesn't make much sense in terms of models, etc., I apologize.
Thank you.