When attempting to use HuggingFace's ESMFold implementation for multimers with the ':' separator between chain sequences suggested by the README, I get a ValueError when submitting the input to the tokenizer. I am OK with using the artificial glycine linker suggested by HuggingFace's tutorial, but would like clarification on whether the ':' separator approach suggested by this repository's README is valid. Thanks!
Reproduction steps
Install the HuggingFace transformers package via conda (version 4.24.0)
from transformers import EsmTokenizer, AutoTokenizer, EsmForProteinFolding
tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1") # Download tokenizer
# OR
tokenizer = EsmTokenizer.from_pretrained("facebook/esmfold_v1") # Download alternative tokenizer
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1") # Download model
chain1_seq = "MKTAYIAKQR"  # placeholder chain sequences for illustration; the error occurs with the actual chains as well
chain2_seq = "LEIFTNDQGK"
seq = chain1_seq + ":" + chain2_seq # Concatenate sequences with suggested delimiter
inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False) # Tokenize seq
Expected behavior
I expect the tokenizer to be able to handle the ':' delimiter and not raise an error. I tried both tokenizers (AutoTokenizer and EsmTokenizer) and both yielded the same error.
As suggested by the ValueError, I tried to turn on padding and truncation, but the output remained the same.
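For reference, that attempt was just the same call with the flags the error message suggests enabled, e.g.:
inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False, padding=True, truncation=True)  # same ValueError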
Logs
Filepaths in output are truncated for privacy reasons.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/.../site-packages/transformers/tokenization_utils_base.py:715, in BatchEncoding.convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714 if not is_tensor(value):
--> 715     tensor = as_tensor(value)
    717 # Removing this for now in favor of controlling the shape with `prepend_batch_axis`
    718 # # at-least2d
    719 # if tensor.ndim > 2:
    720 #     tensor = tensor.squeeze(0)
    721 # elif tensor.ndim < 2:
    722 #     tensor = tensor[None, :]

RuntimeError: Could not infer dtype of NoneType

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
.../src/embed.ipynb Cell 139 line 1
----> 1 inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)

File .../site-packages/transformers/tokenization_utils_base.py:2488, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2486 if not self._in_target_context_manager:
   2487     self._switch_to_input_mode()
-> 2488 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2489 if text_target is not None:
...
    735     " expected)."
    736 )
    738 return self

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Additional context
python version = 3.8
transformers version = 4.24.0
The HuggingFace model doesn't use the ":". I've included a link to my notebook that shows how to do multimer predictions. The hack is that you include a linker sequence of G between all chains. So a single sequence linked by Gs is passed to the model. This linker output is masked when producing the PDB file. I suspect the ":" does the same thing under the hood.
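For anyone else hitting this, here is a minimal sketch of that glycine-linker approach using the HuggingFace classes from the report above, adapted from the HuggingFace protein-folding tutorial. The chain sequences are hypothetical placeholders, and the 25-residue linker length and the 512 position-ID offset are illustrative choices, not requirements.

import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
model.eval()

chain1_seq = "MKTAYIAKQR"  # hypothetical placeholder chains
chain2_seq = "LEIFTNDQGK"
linker = "G" * 25          # poly-glycine linker joining the chains

seq = chain1_seq + linker + chain2_seq
inputs = tokenizer([seq], return_tensors="pt", add_special_tokens=False)

# Offset the position IDs of the second chain so the model does not treat
# the linker as a real covalent continuation of chain 1.
position_ids = torch.arange(len(seq), dtype=torch.long)
position_ids[len(chain1_seq) + len(linker):] += 512
inputs["position_ids"] = position_ids.unsqueeze(0)

with torch.no_grad():
    outputs = model(**inputs)

# Zero out the linker residues so the artificial glycines are dropped when
# the outputs are later converted to a PDB file.
linker_mask = torch.tensor(
    [1] * len(chain1_seq) + [0] * len(linker) + [1] * len(chain2_seq)
)[None, :, None]
outputs["atom37_atom_exists"] = outputs["atom37_atom_exists"] * linker_mask

Converting the masked outputs to a PDB (the tutorial defines a small helper for this on top of transformers' OpenFold utilities) then leaves the linker out of the written structure.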