Inconsistent parsing ways for fasta ids in get_esm*_embedding.py and data_process.py. #1

Open
alchemistcai opened this issue Sep 11, 2024 · 1 comment

Comments

@alchemistcai
Contributor

alchemistcai commented Sep 11, 2024

FASTA ids are parsed differently in the functions get_esm_embedding.py>process_fasta, get_esm_if_embedding.py>embedding, data_process.py>prep_test_dataset, and utils.py>process_fasta_file:

# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])

# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs} 

# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]

# in utils.py>process_fasta_file
for i in range(0, len(lines), 3):  # hard-coded 3-line record layout, not robust
    id = lines[i].strip().replace(">", "")

I used get_esm*_embedding.py to generate embeddings (e.g. see.npy) from a FASTA file like:

>|see   # a header without "|", such as `>sea`, makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence

When I run inference.py, the id is parsed as |sea (not sea, as in the embedding scripts) and the script fails. I adjusted data_process.py to make it work.
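
The mismatch can be reproduced with plain string operations. This is only a minimal sketch: it assumes rec.id is the header text up to the first whitespace, without the leading ">", and the two helper names are hypothetical stand-ins for the expressions quoted above, not functions from the repo.

```python
def id_by_split(rec_id):
    # Mirrors get_esm*_embedding.py: rec.id.split("|")[1]
    return rec_id.split("|")[1]


def id_as_is(rec_id):
    # Mirrors data_process.py>prep_test_dataset: rec.id used unchanged
    return rec_id


print(id_by_split("|sea"))  # sea
print(id_as_is("|sea"))     # |sea

# A header without "|" has nothing at index 1 after the split.
try:
    id_by_split("sea")
except IndexError as exc:
    print("IndexError:", exc)  # IndexError: list index out of range
```

The same header therefore yields two different ids depending on which script reads it.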

I suggest:

  • the scripts above should all use the same strategy to parse ids
  • refactor the code so that every script calls a single parsing function, to keep them consistent
  • add an optional argument key that accepts a Callable, so users can decide how to parse rec.id, like the key argument of Python's list.sort(key=None).
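
A rough sketch of what such a shared function might look like. The name parse_fasta and its signature are hypothetical, not existing code in the repo, and it reads plain text lines rather than Biopython records so that it does not depend on a fixed number of lines per record.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def parse_fasta(
    lines: Iterable[str],
    key: Optional[Callable[[str], str]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs from FASTA lines.

    The raw id is the first whitespace-delimited token after ">".
    If key is given, it is applied to the raw id (hypothetical API).
    """
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if current_id is not None:
                yield current_id, "".join(chunks)
            tokens = line[1:].split()
            raw_id = tokens[0] if tokens else ""
            current_id = key(raw_id) if key else raw_id
            chunks = []
        elif current_id is not None:
            # Sequence may span several lines; collect all of them.
            chunks.append(line)
    if current_id is not None:
        yield current_id, "".join(chunks)


fasta = [">|see", "ACDE", "FGH", ">|sea", "KLMN"]

# Default: ids are kept as written in the header.
print(list(parse_fasta(fasta)))
# [('|see', 'ACDEFGH'), ('|sea', 'KLMN')]

# Caller-supplied key reproduces the split("|")[1] convention.
print(list(parse_fasta(fasta, key=lambda s: s.split("|")[1])))
# [('see', 'ACDEFGH'), ('sea', 'KLMN')]
```

With key left as None the ids stay untouched, so each script could opt into its own convention while sharing one parser.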
@empyriumz
Member

Hi @alchemistcai,

Thanks for testing the code and pointing out the inconsistencies. They resulted from handling different FASTA ID conventions while I developed the pipeline. I will refactor the code so that a single function parses the FASTA file consistently.
