This repository has been archived by the owner on Aug 1, 2024. It is now read-only.

index out of bounds during zero-shot with msa1b #649

Open
Maxwell-downtown opened this issue Jan 12, 2024 · 0 comments

Comments

@Maxwell-downtown

When running zero-shot variant prediction with msa1b, using the code provided in examples/variant-prediction, I get the following error:
  File "predict.py", line 180, in <lambda>
    lambda row: label_row(
  File "predict.py", line 114, in label_row
    score = token_probs[0, 1 + idx, mt_encoded] - token_probs[0, 1 + idx, wt_encoded]
IndexError: index 216 is out of bounds for dimension 1 with size 216
The command I used is as follows:
python predict.py \
    --model-location esm_msa1b_t12_100M_UR50S \
    --sequence MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW \
    --dms-input ./data/BLAT_ECOLX_Ranganathan2015.csv \
    --mutation-col mutant \
    --dms-output ./data/BLAT_ECOLX_Ranganathan2015_labeled.csv \
    --offset-idx 1 \
    --scoring-strategy masked-marginals \
    --msa-path ./data/MSA/trial_BLAT.a2m
I use the entire 286-aa BLAT_ECOLX sequence as the input sequence, and all entries in my .a2m file have the same length. I also set --offset-idx to 1, but that does not help. I printed the shapes of batch_tokens and token_probs in predict.py and found that the dimension I believe corresponds to the protein sequence length is 216, whereas it should be 286 in this case.
I also tested other proteins of different lengths, but the dimensions never match. Am I misunderstanding the dimensions of token_probs?
In addition, running the demonstration command from examples/variant-prediction with the data provided in that directory results in this error:
RuntimeError: Received unaligned sequences for input to MSA, all sequence lengths must be equal.
Command:
python predict.py \
    --model-location esm_msa1b_t12_100M_UR50S \
    --sequence HPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW \
    --dms-input ./data/BLAT_ECOLX_Ranganathan2015.csv \
    --mutation-col mutant \
    --dms-output ./data/BLAT_ECOLX_Ranganathan2015_labeled.csv \
    --offset-idx 24 \
    --scoring-strategy masked-marginals \
    --msa-path ./data/BLAT_ECOLX_1_b0.5.a3m
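
Both errors come down to sequence lengths inside the alignment, so a quick way to narrow them down might be to tabulate those lengths directly. The sketch below is not taken from predict.py. It assumes, following the a3m convention, that lowercase residues plus "." and "*" mark insertions and are dropped before the MSA is tokenized. The helper names (`strip_insertions`, `parse_fasta`, `length_report`) are hypothetical and only for illustration.

```python
import string
from collections import Counter

# Assumption: a3m-style insertions are lowercase residues plus "." and "*".
_INSERTION_TABLE = str.maketrans("", "", string.ascii_lowercase + ".*")


def strip_insertions(seq: str) -> str:
    """Drop characters treated as insertions, keeping uppercase residues and '-'."""
    return seq.translate(_INSERTION_TABLE)


def parse_fasta(text: str):
    """Parse FASTA-style text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def length_report(text: str):
    """Count sequence lengths before and after stripping insertions."""
    raw, stripped = Counter(), Counter()
    for _, seq in parse_fasta(text):
        raw[len(seq)] += 1
        stripped[len(strip_insertions(seq))] += 1
    return dict(raw), dict(stripped)


if __name__ == "__main__":
    # Toy alignment; replace with the contents of the real .a2m/.a3m file,
    # e.g. open(path).read().
    toy = ">query\nMKvL-A\n>hit1\nMK.LTA\n"
    raw, stripped = length_report(toy)
    print("raw lengths:", raw)
    print("lengths after stripping insertions:", stripped)
```

If the stripped length comes out shorter than the `--sequence` length (for example 216 against 286), positions past the end of the stripped alignment would be out of range, which would be consistent with the IndexError above. If the stripped lengths are not all equal, that would be consistent with the "unaligned sequences" RuntimeError. Either outcome only holds under the insertion-stripping assumption stated above.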
