Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042

Crista23 · 2024-10-05T21:15:42Z

My code is throwing the error below:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/net/scratch/user/miniconda3/envs/vllm_guidance/lib/python3.9/site-packages/guidance/models/transformers/_transformers.py:150: UserWarning: Could not build_byte tokens from the tokenizer by encoding token strings: Round-trip encoding of tokens [!] failed! Got [128000, 0]
 warnings.warn(
Warning: lexer error: too many states: 10406 >= 10000; stopping

I can see this error is thrown in the code here
https://github.com/guidance-ai/guidance/blob/main/guidance/models/transformers/_transformers.py#L233
and it looks like it's a tokenizer issue, however I am calling the guidance library without specifying a tokenizer

llm = models.Transformers(args.model_path, device_map="auto", trust_remote_code=True)

I am wondering how to fix this. Any advice appreciated, thanks!

The text was updated successfully, but these errors were encountered:

Harsha-Nori · 2024-10-06T05:34:53Z

Hi @Crista23, sorry you're dealing with this! Which version of the package are you using? Are you on our release candidate / installing from source?

Even if a tokenizer isn't explicitly specified, we do need one for guidance to work properly. For transformers based models, we try to load it automatically from the model config. However, sometimes this can act up, especially if there are new tokens added to a model's vocabulary via fine tuning (and not updated in the config...).

Are you using a public/oss model? Do you mind sharing the link to it so that we can try to debug it on our side?

Crista23 · 2024-10-06T16:45:11Z

HI @Harsha-Nori , thanks a lot for your answer! I have installed guidance --pre using pip and the version installed is 0.2.0rc1. I am using this in combination with publicly available models such as LLAMA-8B-Instruct instantiated in the code using

llm = models.Transformers(args.model_path, device_map="auto", trust_remote_code=True)

outputs = llm + prompt_eval(item[key])

It has worked for a couple examples until it crashed with this error: "Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping". It looks like a tokenizer issue and even though I tried to replace "!" with the empty string in the input it still fails.

I would appreciate your thoughts on how to fix this, thank you!

Crista23 · 2024-10-08T02:36:34Z

@Harsha-Nori Any thoughts? Sorry to ask again, it's a pressing issue.

Harsha-Nori · 2024-10-08T05:28:47Z

Hi @Crista23, I can't seem to replicate this with a llama-8B model :(. Could you share some more details about your code, including the exact huggingface model and/or details of the prompt_eval method?

The error message can happen if the grammar you're constraining against is particularly complex, but I can't seem to replicate it on my side :(. Happy to also collaborate via email if you can't share publicly.

hudson-ai · 2024-10-14T16:57:48Z

@Crista23 if you can't share details of your prompt, would you be able to share the full traceback? Thanks!

jtbuter · 2024-10-18T07:20:21Z

@Harsha-Nori @hudson-ai I get a similar warning when initializing the llama 8b instruct model with guidance 0.1.16 and transformers 4.45.2

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import guidance.models
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

llama3 = guidance.models.Transformers(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

The warning is the following
UserWarning: Could not build_byte tokens from the tokenizer by encoding token strings: Round-trip encoding of tokens [!] failed! Got [128000, 0]

Can it be because the tokenizer encodes ! as [128000, 0], which decodes to <|begin_of_text|>!? Therefore this check fails?

if len(encoded_str) != 1:
    raise ValueError(f"Round-trip encoding of tokens [{token}] failed! Got {encoded_str}")```

hudson-ai · 2024-10-18T16:27:24Z

@jtbuter thanks for the repro -- I am able to reproduce the warning with transformers 4.45.2 (interestingly, not with my previously installed 4.44.0 version). We have a few methods for converting tokens into a form that we need in order to support constrained decoding, and the warning here is just saying that our preferred approach is failing and falling back to an alternative approach. Will definitely look into what's going on under the hood here -- thank you for the suggestion on where to look. I think you have the right idea.

Are you experiencing any downstream problems after seeing this warning?

This being said, the lexer error: too many states: 10406 >= 10000; stopping that @Crista23 is seeing "shouldn't" be caused by this -- it seems to be that our parser is finding the particular grammar they are constraining against to be disagreeable for whatever reason. I've seen something similar happen in some grammars where the parse tree is exceptionally ambiguous.

@Crista23 are you able to share any details about the constraints you are using? I would love to see us (1) improve robustness and (2) provide more helpful exceptions and warnings. A concrete example of what's causing this would really help to that end.

jtbuter · 2024-10-21T05:31:27Z

Thank you for the reply, I was not experiencing any other problems after this warning

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042

Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042

Crista23 commented Oct 5, 2024

Harsha-Nori commented Oct 6, 2024

Crista23 commented Oct 6, 2024 •

edited

Loading

Crista23 commented Oct 8, 2024

Harsha-Nori commented Oct 8, 2024

hudson-ai commented Oct 14, 2024

jtbuter commented Oct 18, 2024 •

edited

Loading

hudson-ai commented Oct 18, 2024

jtbuter commented Oct 21, 2024

Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042

Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042

Comments

Crista23 commented Oct 5, 2024

Harsha-Nori commented Oct 6, 2024

Crista23 commented Oct 6, 2024 • edited Loading

Crista23 commented Oct 8, 2024

Harsha-Nori commented Oct 8, 2024

hudson-ai commented Oct 14, 2024

jtbuter commented Oct 18, 2024 • edited Loading

hudson-ai commented Oct 18, 2024

jtbuter commented Oct 21, 2024

Crista23 commented Oct 6, 2024 •

edited

Loading

jtbuter commented Oct 18, 2024 •

edited

Loading