Inconsistent parsing of FASTA IDs in `get_esm*_embedding.py` and `data_process.py`. #1

alchemistcai · 2024-09-11T11:14:26Z

In get_esm_embedding.py>process_fasta, get_esm_if_embedding.py>embedding, data_process.py>prep_test_dataset, and utils.py>process_fasta_file, FASTA IDs are parsed in different ways:

# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])

# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs} 

# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]

# in utils.py>process_fasta_file
for i in range(0, len(lines), 3):  # hard-coded 3-line record layout, not robust
    id = lines[i].strip().replace(">", "")

I use get_esm*_embedding.py to generate embeddings (see.npy) from a FASTA file like:

>|see   # a header without `|`, such as `>sea`, makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence

When I then run inference.py, the ID is parsed as |sea, which does not match the ID produced by get_esm*_embedding.py, and the script fails. I adjusted data_process.py to make it work.
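A minimal sketch of the mismatch, using plain strings in place of Biopython records. It assumes `rec.id` is the first whitespace-delimited token of the header line without the leading `>`. The helper names `embedding_id` and `inference_id` are hypothetical and only mirror the two expressions quoted above.

```python
# Plain strings stand in for Biopython's rec.id values read from the
# example file (assumption: rec.id is the first token of the header).
rec_ids = ["|see", "|sea"]


def embedding_id(rec_id: str) -> str:
    """Hypothetical helper mirroring rec.id.split("|")[1]."""
    return rec_id.split("|")[1]


def inference_id(rec_id: str) -> str:
    """Hypothetical helper mirroring the bare rec.id."""
    return rec_id


# IDs under which embeddings would be stored vs. IDs looked up later.
saved_keys = {embedding_id(r) for r in rec_ids}
lookup_keys = [inference_id(r) for r in rec_ids]

# Every lookup key still carries the leading "|", so none is found.
missing = [k for k in lookup_keys if k not in saved_keys]
print("saved:", sorted(saved_keys))
print("missing at lookup:", missing)


def fails_without_pipe(rec_id: str) -> bool:
    """Return True if the split-based strategy raises IndexError."""
    try:
        embedding_id(rec_id)
    except IndexError:
        return True
    return False


print("'sea' raises IndexError:", fails_without_pipe("sea"))
```

Both problems come from the same place: one strategy requires a `|` in the ID and keeps only the second field, while the other keeps the ID unchanged.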

I suggest:

Make the scripts above use the same strategy to parse IDs.
Refactor the code so that a single function does the parsing, to keep it consistent.
Add an optional argument key that accepts a Callable, letting users decide how to parse rec.id, like the key parameter of Python's list.sort(key=None).
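The suggestions could be combined roughly as follows. This is only a sketch under stated assumptions, not the repository's code: `read_fasta` and `default_key` are hypothetical names, the reader is standard-library only rather than Biopython-based, and it assumes ordinary FASTA in which a record is a `>` header followed by one or more sequence lines.

```python
from typing import Callable, Dict, Iterable, Optional


def default_key(header: str) -> str:
    """Assumed default: the first whitespace-delimited token of the header."""
    tokens = header.split()
    return tokens[0] if tokens else header


def read_fasta(
    lines: Iterable[str],
    key: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Map parsed IDs to sequences, with ID parsing delegated to `key`.

    `key` receives the header text without the leading ">" and returns
    the ID to use, in the spirit of list.sort(key=...).
    """
    parse_id = key if key is not None else default_key
    records: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = parse_id(line[1:])
            if current in records:
                raise ValueError(f"duplicate FASTA ID: {current!r}")
            records[current] = ""
        elif current is not None:
            # Sequence lines are concatenated, so records may span
            # any number of lines.
            records[current] += line
    return records


example = [">|see", "MKT", ">|sea", "MKV", "AAA"]

# Default behaviour keeps the ID as written in the header.
as_written = read_fasta(example)

# A caller who wants the second "|"-separated field passes a key.
second_field = read_fasta(example, key=lambda h: h.split()[0].split("|")[1])

print(as_written)
print(second_field)
```

If every script called one such function with the same `key`, the IDs used when saving embeddings and when loading them at inference time could not drift apart.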


empyriumz · 2024-09-11T14:32:42Z

Hi @alchemistcai,

Thanks for testing the code and pointing out inconsistencies. They resulted from handling different fasta ID conventions when I developed the pipeline. I will refactor the code to use a single function to parse the fasta file to be consistent.
