Add use_custom_templates option #408

rostro36 · 2024-02-20T10:16:22Z

Introduction & motiviation

Dear OpenFold team,

Thank you very much for creating such a useful tool and give the community the opportunity to use, ablate and extend the implementation!

In my research, I look at the impact of templates and try to find new protocols centered around them. For some aspects of this, I found that localcolabfold was too restrictive and I would like to use some features, which OpenFold can offer. That is why I implemented the use_custom_templates option, which should be similar to the --custom-template-path argument of localcolabfold. I think this should be enough to have an implementation of AF2Rank (see tests below), touted by some their "favorite in the post-AF2 realm".

Changes

Basically, I fitted the template pipeline from thread_sequence to use it with run_pretrained_model.py

openfold/data/templates.py
- extend get_custom_templates to use multiple files
- create CustomHitFeaturizer to use this method
openfold/data/data_pipeline.py
- make sure that we always read the templates if CustomHitFeaturizer is used, no matter what alignment we found
run_pretrained_openfold.py
- implement use_custom_templates option
README.md
- update docs about option

Tests

With the same pre-computed one row MSA, I saw that the inclusion of the template increased the pLDDT from 40ish to 80ish. I have not tested multimers or ESM single sequence mode, but it should not interfere with those. Don't know how well it plays along though.

I did a deeper dive and I think I found that using the very simple 1:1 mapping without alignment allows to use the template without any sequence information, except for the size of the sequence. As an empty use_precomputed_alignments also does not pass any information, this setup can be used for AF2Rank. The only other necessary condition is to correctly form the template to all GLY and only backbone + C-Beta. I can donate a script for the creation of this template in about one month after a deadline if you want.

The only test I did is with the randomly chosen, but short 3NIR against the AF2Rank implementation in the ColabFold notebook.

Do you think the differences is within margin of error or should I have another look?
https://gist.github.com/rostro36/42113291b644388730b17d2aac1186be

End

Please let me know what tests and adjustments I can fulfill to add this feature or if I should merge it to another branch first.
Thank you for work!

jnwei · 2024-02-29T07:51:32Z

Hi @rostro36 , thank you very much for this contribution. This is a really nice option to have for running with templates, and we would like to work with you to add this to OpenFold.

We're preparing some testcases based on previous work that we have done with templates, especially on the single sequence model. Hopefully these testcases will be helpful for adjusting the code changes to work for all inference modes.

I did run some quick integration tests with your PR. It looks like monomer inference mode using pre-computed alignments without templates has a small issue? Here's a sample command:

python $OPENFOLD_DIR/run_pretrained_openfold.py \
            $INPUT_DIR/fasta_dir \
                $MMCIF_DIR \
            --use_precomputed_alignments $PRECOMPUTED_ALIGNMENTS \
            --model_device "cuda:0" \
            --output_dir $OUTPUT_DIR \
            --config_preset "model_1_ptm" \
            --jax_param_path $CHECKPOINT_PATH

And the corresponding error:

Traceback (most recent call last):
File "/eagle/projects/MLProtein/jnwei/openfold/run_pretrained_openfold.py", line 476, in <module>
  main(args)
File "/eagle/projects/MLProtein/jnwei/openfold/run_pretrained_openfold.py", line 283, in main
  feature_dict = generate_feature_dict(
File "/eagle/projects/MLProtein/jnwei/openfold/run_pretrained_openfold.py", line 141, in generate_feature_dict
  feature_dict = data_processor.process_fasta(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/data_pipeline.py", line 889, in process_fasta
  template_features = make_template_features(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/data_pipeline.py", line 57, in make_template_features
  templates_result = template_featurizer.get_templates(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/templates.py", line 1211, in get_templates
  return get_custom_template_features(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/templates.py", line 980, in get_custom_template_features
  curr_features, curr_warnings = _extract_template_features(
File "/lus/eagle/projects/MLProtein/jnwei/openfold/openfold/data/templates.py", line 672, in _extract_template_features
  templates_all_atom_positions[k] = all_atom_positions[template_index][0]

I haven't had a chance to look into this deeply, but maybe have another look at the logic in get_custom_templates for parsing multiple mmcif files?

rostro36 · 2024-03-02T16:57:28Z

Hey @jnwei ,
Thanks for running some tests and highlighting even more test cases.

Commit 10b6838 fixes always taking use_custom_templates.
Commit e9bacd8 makes the processing less dependent on the input sequence and instead take length of template
Commit 50f8617 fixes a mistake for taking many custom templates.

On my machine, the code now works for one or two custom templates with and without the use_custom_templates argument.

Let me know if/how I can help you with the further

sachinkadyan7 · 2024-03-19T02:09:48Z

README.md

@@ -174,7 +174,10 @@ where `data` is the same directory as in the previous step. If `jackhmmer`,
 `/usr/bin`, their `binary_path` command-line arguments can be dropped.
 If you've already computed alignments for the query, you have the option to 
 skip the expensive alignment computation here with 
-`--use_precomputed_alignments`.
+`--use_precomputed_alignments`. If you wish to use a specific template as input,


Can you move this information to a different subheading of it's own? Maybe called "Custom Templates"?
Do mention that it'll be the same template(s) that will be used if multiple separate sequences are passed for inference.

Thanks for your comments. The others are fulfilled without any question.
I am not sure if I understood you correctly about this one here though.

Now, I changed the README, so custom templates stands a bit more on it's own, but still in the monomer inference section.

Should I make a new section on the same level as monomer inference, multimer inference or soloseq?

openfold/data/templates.py

jnwei · 2024-03-20T04:16:23Z

Hi @rostro36 thanks for the fixes.

I finally had a chance to take a closer look at this PR and do some more testing. I had a couple more thoughts/requests:

Sequence alignments: As stated in your OP, this PR currently supports templates that are the same length as the query sequence for a 1:1 mapping, as per the AF2Rank experiment. However, it would be more generally useful if there was an option to support templates of other lengths/chains by first performing an alignment. I can give some more pointers and share my test setup if you're interested in this direction.
Since Soloseq is designed to accept templates to make structure predictions form a single sequence, it would be awesome if this PR also supports that mode. More instructions on how to run Soloseq can be found here.

Let me know what you think, and if you'd be interested in contributing further on either of these directions. Thanks!

rostro36 · 2024-03-20T13:12:59Z

Hey @jnwei , @sachinkadyan7
Thanks for the review. I hope the changes are to your liking.

I should have a look to get some models running again soon and there is another OpenFold contribution I want to look into, but if it is nothing too big, then I can try to help with 1. the sequence alignments.

jnwei · 2024-03-21T10:17:34Z

Leaving my notes here regarding the sequence alignments in case they are helpful later.

A good test would be to compare structure prediction for a query using the default route (using msa to find template matches), and the custom template search route. To this end, I wanted to find which templates were being used by normal inference.

You can find the template matches for a query sequence by adding a breakpoint after the template features dict are generated, and looking at the values for the key template_domain_names. For my sample query: PDB:1LCI (luciferase), I found that the top four templates that matched to this structure in our databases were 6hiv_BC, 6mfw_A, 5wmm_A, 6mfy_A, where the underscores represent the chains.

To create a sequence alignment between the query and the custom template, it might be possible to adapt this function / these lines to the custom_templates route.

openfold/openfold/data/templates.py

Lines 436 to 438 in ef0c9fa

    
           parsed_a3m = parsers.parse_a3m( 
        
               aligner.align([old_template_sequence, new_template_sequence]) 
        
           )

rostro36 · 2024-03-25T17:23:17Z

I added support for sequence alignments. 6mfw, 5wmm and 6mfy that you suggested did not have a good sequence similarity to 1lci and Kalign raised an error.

When supplying 3iep_A, 4g37_A and 5gz2_A (found with FoldSeek) with high sequence similarity, but not 100, everything worked and also did get high pLDDT scores in single sequence mode.

Removing the templates still worked, but gave a bad pLDDT in single sequence.

These checks look good to me so far. Let me know if there is something else to test.

rostro36 · 2024-07-10T00:20:47Z

Dear @jnwei, @sachinkadyan7

The manuscript that utilizes this code has been submitted. As there were no roadblocks I can find on this pull request, I would like to push for feedback again to have a cleaner and wrapped up project on my side.

Thank you

jnwei · 2024-07-16T14:54:24Z

Congratulations on the manuscript!

I would be happy to accept the PR at this time.

The only remaining suggestion would be regarding the documentation of your feature. Since your original submission, we changed the way that we handle documentation. If you would like to include a description of your feature in our new documentation, I would invite you to add a section to the Inference.md

rostro36 · 2024-07-18T01:25:12Z

Thank for the congratulations @jnwei , the work of all of you on the new release of OpenFold deserves even more praise. Well done!

I updated the README and Inference.md. The text is taken more or less from my README version. Inference.md only has the one addendum for the new option and some whitespaces artifacts, but otherwise the same text. I hope this is not a hindrance.

jnwei · 2024-07-18T08:29:28Z

LGTM, thanks for your contribution!

Added custom template folder

17f24bd

rostro36 added 3 commits March 1, 2024 10:08

Fix always is_custom_template.

10b6838

Less dependent on input sequnece, use template length instead

e9bacd8

Support for multiple custom templates.

50f8617

sachinkadyan7 reviewed Mar 19, 2024

View reviewed changes

Fixed documentation according to comments.

295d0d5

Added alignment method.

6b4f167

rostro36 added 3 commits July 18, 2024 03:01

Update documentation.

0dafd62

Change path of Inference.md

b38b607

Merge branch 'main' into main

7d22739

jnwei merged commit c48f850 into aqlaboratory:main Jul 18, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add use_custom_templates option #408

Add use_custom_templates option #408

rostro36 commented Feb 20, 2024

jnwei commented Feb 29, 2024

rostro36 commented Mar 2, 2024

sachinkadyan7 Mar 19, 2024

rostro36 Mar 20, 2024

jnwei commented Mar 20, 2024

rostro36 commented Mar 20, 2024

jnwei commented Mar 21, 2024

rostro36 commented Mar 25, 2024

rostro36 commented Jul 10, 2024

jnwei commented Jul 16, 2024

rostro36 commented Jul 18, 2024

jnwei commented Jul 18, 2024

Add use_custom_templates option #408

Add use_custom_templates option #408

Conversation

rostro36 commented Feb 20, 2024

Introduction & motiviation

Changes

Tests

End

jnwei commented Feb 29, 2024

rostro36 commented Mar 2, 2024

sachinkadyan7 Mar 19, 2024

Choose a reason for hiding this comment

rostro36 Mar 20, 2024

Choose a reason for hiding this comment

jnwei commented Mar 20, 2024

rostro36 commented Mar 20, 2024

jnwei commented Mar 21, 2024

rostro36 commented Mar 25, 2024

rostro36 commented Jul 10, 2024

jnwei commented Jul 16, 2024

rostro36 commented Jul 18, 2024

jnwei commented Jul 18, 2024