Bug Fix: Preserve full FASTA ID in alignment directory parsing to prevent truncation errors #485

Open
wants to merge 1 commit into
base: main

juliocesar-io

Background

When running inference with run_pretrained_openfold.py on precomputed alignments, the parse_fasta function extracts only part of the FASTA tag/ID that was used to name the alignments output folder. It cuts the ID at special characters such as hyphens (-) or periods (.), which are often used in FASTA IDs.

This causes inference to fail, because the partially extracted ID does not match the name of the alignments folder.

For example, if you have a FASTA file like this:

>my-fasta-sequence
AABBCC

Then, after running the precompute_alignments.py script, the following alignments are generated (as expected):

├── input
│   └── fasta_dir
│       └── my-fasta-sequence.fasta
├── output
│   ├── alignments
│   │   └── my-fasta-sequence
│   │       ├── bfd_uniclust_hits.a3m
│   │       ├── hhsearch_output.hhr
│   │       ├── mgnify_hits.sto
│   │       └── uniref90_hits.sto

However, when you run the run_pretrained_openfold.py script with the --use_precomputed_alignments flag, you will encounter the following error:

Traceback (most recent call last):
  File "/opt/openfold/run_pretrained_openfold.py", line 499, in <module>
    main(args)
  File "/opt/openfold/run_pretrained_openfold.py", line 299, in main
    feature_dict = generate_feature_dict(
  File "/opt/openfold/run_pretrained_openfold.py", line 151, in generate_feature_dict
    feature_dict = data_processor.process_fasta(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 883, in process_fasta
    hits = self._parse_template_hit_files(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 795, in _parse_template_hit_files
    for f in os.listdir(alignment_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/run_path/output/alignments/my'

Fix

The error occurs because parse_fasta truncates the ID, so the data pipeline looks for "my" instead of the expected "my-fasta-sequence". I have updated the parse_fasta function to fix this issue.

Previously, the regex split applied to each ID (re.split('\W|\|', t)) cut off parts of the ID. \W matches any non-word character, which includes hyphens and periods, so an ID such as "my-fasta-sequence" was reduced to "my". For the precomputed-alignments workflow to function correctly, the full ID must be preserved so that it matches the alignment folder name.
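To make the truncation concrete, here is a minimal sketch of what that regex does to the example ID. The helper name `truncated_tag` is hypothetical, and keeping only the first piece of the split is an assumption inferred from the "my" in the error message above, not a copy of the original code.

```python
import re


def truncated_tag(tag: str) -> str:
    """Hypothetical helper mimicking the old behaviour.

    Assumption: the old code kept only the first piece of the split,
    which is what the 'my' in the FileNotFoundError suggests.
    """
    # \W matches any character outside [A-Za-z0-9_], so every hyphen
    # or period becomes a split point.
    return re.split(r"\W|\|", tag)[0]


pieces = re.split(r"\W|\|", "my-fasta-sequence")
print(pieces)                              # ['my', 'fasta', 'sequence']
print(truncated_tag("my-fasta-sequence"))  # my
print(truncated_tag("seq.v2"))             # seq
```

Any ID made up only of letters, digits, and underscores passes through unchanged, which may be why the problem only shows up for IDs containing hyphens or periods.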

Changes:

  • Each entry is now split into the tag (header) and the sequence, while preserving the entire header.
  • The regex splitting that truncated the header has been removed, so the entire line after > is treated as the ID.
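The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name `parse_fasta_full_ids` is hypothetical, and it assumes the parser receives the whole FASTA file as one string and returns parallel lists of tags and sequences.

```python
from typing import List, Tuple


def parse_fasta_full_ids(data: str) -> Tuple[List[str], List[str]]:
    """Hypothetical sketch: parse FASTA text, keeping each full header line."""
    tags: List[str] = []
    seqs: List[str] = []
    for entry in data.split(">"):
        entry = entry.strip()
        if not entry:
            continue  # skip text before the first '>' and empty records
        # Split the record into its header line and the remaining sequence lines.
        header, _, body = entry.partition("\n")
        tags.append(header.strip())          # whole line after '>' is the ID
        seqs.append("".join(body.split()))   # join wrapped sequence lines
    return tags, seqs


tags, seqs = parse_fasta_full_ids(">my-fasta-sequence\nAABBCC\n")
print(tags)  # ['my-fasta-sequence']
print(seqs)  # ['AABBCC']
```

One consequence of treating the whole header as the ID is that headers containing spaces or path separators would also be used verbatim as directory names, so the FASTA IDs and the alignment folder names need to agree exactly.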
