
ESMFold for multimer fails when using HuggingFace installation #656

Open
eliottpark opened this issue Feb 1, 2024 · 1 comment

@eliottpark

When attempting to use HuggingFace's ESMFold implementation for multimers with the ':' separator between chain sequences suggested in the README, I get a ValueError when submitting the input to the tokenizer. I'm fine with using the artificial glycine linker suggested by HuggingFace's tutorial, but would like clarification on whether the ':' separator approach suggested in this repository's README is valid. Thanks!
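For context, the ':' separator described in this repository's README appears to be handled by the esm package's own ESMFold API rather than by the HuggingFace tokenizer. A minimal sketch of that path, assuming the fair-esm package with the esmfold extras is installed and a CUDA GPU is available (chain1_seq and chain2_seq are the same plain amino-acid strings used in the reproduction below):

import torch
import esm  # pip install "fair-esm[esmfold]" plus the openfold dependencies listed in the README

# Load the original ESMFold model (not the HuggingFace port)
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Per the README, multimer prediction is done by separating chains with ':'
multimer = chain1_seq + ":" + chain2_seq

with torch.no_grad():
    pdb_string = model.infer_pdb(multimer)

with open("multimer.pdb", "w") as f:
    f.write(pdb_string)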

Reproduction steps
Install the Hugging Face transformers library via conda (version 4.24.0)

from transformers import EsmTokenizer, AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1") # Download tokenizer
# OR
tokenizer = EsmTokenizer.from_pretrained("facebook/esmfold_v1")  # Download alternative tokenizer

model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")  # Download model

seq = chain1_seq + ":" + chain2_seq  # Concatenate the two chain sequences (plain amino-acid strings) with the suggested ':' delimiter

inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False) # Tokenize seq

Expected behavior
I expect the tokenizer to be able to handle the ':' delimiter and not raise an error. I tried both tokenizers (AutoTokenizer and EsmTokenizer) and both yielded the same error.
As suggested by the ValueError message, I tried turning on padding and truncation, but the error remained the same.

Logs
File paths in the output are truncated for privacy reasons.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/.../site-packages/transformers/tokenization_utils_base.py:715, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714 if not is_tensor(value):
--> 715     tensor = as_tensor(value)
    717     # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
    718     # # at-least2d
    719     # if tensor.ndim > 2:
    720     #     tensor = tensor.squeeze(0)
    721     # elif tensor.ndim < 2:
    722     #     tensor = tensor[None, :]

RuntimeError: Could not infer dtype of NoneType

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
.../src/embed.ipynb Cell 139 line 1
----> 1 inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)

File .../site-packages/transformers/tokenization_utils_base.py:2488, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2486     if not self._in_target_context_manager:
   2487         self._switch_to_input_mode()
-> 2488     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2489 if text_target is not None:
...
    735             " expected)."
    736         )
    738 return self

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Additional context
Python version: 3.8
transformers version: 4.24.0

@tonyreina

https://github.com/tonyreina/antibody-affinity/blob/main/esmfold_multimer.ipynb

The HuggingFace model doesn't use the ":" separator. I've included a link to my notebook above that shows how to do multimer predictions. The hack is to include a poly-glycine linker between all chains, so a single sequence joined by Gs is passed to the model. The linker residues are then masked out when producing the PDB file. I suspect the ":" does the same thing under the hood.
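
For anyone looking for a runnable starting point, here is a minimal sketch of the glycine-linker approach, following the pattern used in the HuggingFace ESMFold example notebook; the placeholder chain sequences, the 25-residue linker length, and the 512 position-id offset are assumptions taken from that pattern rather than values required by the model.

import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1").eval()  # add .cuda() for realistic run times

chain_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequences; substitute your own chains
chain_b = "QLEERLGLIEVQAPILSRVGDGTQDNLSGAEKA"

linker = "G" * 25                      # poly-G linker between the chains
sequence = chain_a + linker + chain_b  # a single concatenated sequence is passed to the model

inputs = tokenizer([sequence], return_tensors="pt", add_special_tokens=False)

# Offset the position ids of the second chain so the model treats the chains
# as separate rather than as one continuous peptide.
position_ids = torch.arange(len(sequence), dtype=torch.long)
position_ids[len(chain_a) + len(linker):] += 512
inputs["position_ids"] = position_ids.unsqueeze(0)

with torch.no_grad():
    output = model(**inputs)

# Zero out the linker residues so they are excluded when the structure is written out.
linker_mask = torch.tensor(
    [1] * len(chain_a) + [0] * len(linker) + [1] * len(chain_b)
)[None, :, None]
output["atom37_atom_exists"] = output["atom37_atom_exists"] * linker_mask

The masked output can then be converted to PDB text; the notebook linked above shows one way to do that conversion.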
