Bug Fix: Preserve full FASTA ID in alignment directory parsing to prevent truncation errors #485

juliocesar-io · 2024-09-07T05:23:31Z

Background

When running inference with run_pretrained_openfold.py on precomputed alignments, the parse_fasta function extracts only part of the FASTA tag/ID that was used to name the alignments output folder. It cuts the ID at special characters such as hyphens (-) or periods (.), which are common in FASTA IDs.

This causes the inference to fail, as the partially extracted ID does not match the alignments folder.

For example, if you have a FASTA file like this:

>my-fasta-sequence
AABBCC

Then, after running the precompute_alignments.py script, the following alignments are generated (as expected):

├── input
│   └── fasta_dir
│       └── my-fasta-sequence.fasta
├── output
│   ├── alignments
│   │   └── my-fasta-sequence
│   │       ├── bfd_uniclust_hits.a3m
│   │       ├── hhsearch_output.hhr
│   │       ├── mgnify_hits.sto
│   │       └── uniref90_hits.sto

However, when you run the run_pretrained_openfold.py script with the --use_precomputed_alignments flag, you will encounter the following error:

Traceback (most recent call last):
  File "/opt/openfold/run_pretrained_openfold.py", line 499, in <module>
    main(args)
  File "/opt/openfold/run_pretrained_openfold.py", line 299, in main
    feature_dict = generate_feature_dict(
  File "/opt/openfold/run_pretrained_openfold.py", line 151, in generate_feature_dict
    feature_dict = data_processor.process_fasta(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 883, in process_fasta
    hits = self._parse_template_hit_files(
  File "/opt/openfold/openfold/data/data_pipeline.py", line 795, in _parse_template_hit_files
    for f in os.listdir(alignment_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/run_path/output/alignments/my'
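The mismatch can be caught before inference starts by checking that every parsed tag has a same-named directory under the alignments folder. The following is only a sketch: `missing_alignment_dirs` is a hypothetical helper, not part of OpenFold, and the temporary directory stands in for `output/alignments`.

```python
import os
import tempfile


def missing_alignment_dirs(tags, alignment_root):
    """Return the tags that have no same-named subdirectory in alignment_root.

    Hypothetical helper for illustration; not an OpenFold function.
    """
    return [
        tag for tag in tags
        if not os.path.isdir(os.path.join(alignment_root, tag))
    ]


# Stand-in for output/alignments containing one precomputed alignment folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "my-fasta-sequence"))
    missing_full = missing_alignment_dirs(["my-fasta-sequence"], root)
    missing_truncated = missing_alignment_dirs(["my"], root)

print(missing_full)       # []
print(missing_truncated)  # ['my']
```

With the full ID the folder is found. With the truncated ID "my" it is reported as missing, which corresponds to the FileNotFoundError above.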

Fix

The error occurs because parse_fasta truncates the ID, so the pipeline looks for a directory named "my" instead of the expected "my-fasta-sequence". I have updated the parse_fasta function to fix this issue.

Previously, the IDs were split with a regex (re.split('\W|\|', t)), which cut each ID off at the first non-word character or pipe. For the precomputed-alignments workflow to function correctly, the full ID must be preserved so that it matches the folder name.
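The truncation is easy to reproduce in isolation. This sketch assumes that the first piece of the split was kept as the tag, which is what the "my" in the traceback suggests. The second and third example IDs are made up for illustration.

```python
import re

# Pattern quoted in this PR: split on any non-word character or a pipe.
SPLIT_PATTERN = r"\W|\|"


def truncated_tag(tag):
    """Mimic the old behaviour, assuming the first split piece was kept."""
    return re.split(SPLIT_PATTERN, tag)[0]


for tag in ["my-fasta-sequence", "protein.v2", "seq_1"]:
    print(tag, "->", truncated_tag(tag))
# my-fasta-sequence -> my
# protein.v2 -> protein
# seq_1 -> seq_1
```

Underscores survive because `\W` treats them as word characters, which is why IDs such as `seq_1` never triggered the error.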

Changes:

Each entry is now split into the tag (header) and the sequence, while preserving the entire header.
The regex splitting that truncated the header has been removed, so the entire line after > is treated as the ID.
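A minimal sketch of the parsing described in these changes is shown here. It is not the exact code in the commit: `parse_fasta_full_id` is a hypothetical name, and the handling of blank entries and multi-line sequences is an assumption.

```python
def parse_fasta_full_id(data):
    """Parse FASTA text into (tags, seqs), keeping each header line whole.

    Illustrative only; the name and details are assumptions, not the
    function shipped in the PR.
    """
    tags, seqs = [], []
    for entry in data.split(">"):
        entry = entry.strip()
        if not entry:
            continue  # skip text before the first ">" and empty records
        # The first line is the header; everything after it is sequence.
        header, _, body = entry.partition("\n")
        tags.append(header.strip())
        # Join wrapped sequence lines and drop any whitespace.
        seqs.append("".join(body.split()))
    return tags, seqs


tags, seqs = parse_fasta_full_id(">my-fasta-sequence\nAAB\nBCC\n")
print(tags)  # ['my-fasta-sequence']
print(seqs)  # ['AABBCC']
```

Keeping the whole header also means that any spaces or path separators in it become part of the expected folder name, so IDs used with precomputed alignments are probably best kept to characters that are safe in directory names.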

fix bug in parse fasta

3887d40

juliocesar-io mentioned this pull request Sep 7, 2024

Inference error using precomputed alignments #422

Open
