Describe the issue as clearly as possible:
I use the T5 model with a CFG grammar to guide the output. However, the model's output contains token id 3 ("▁"). Since tokenizer.decode([3]) returns an empty string "" and not "▁", the method iter_valid_token_ids of the class CFGGuide (which checks whether the next token can be accepted with respect to the CFG grammar) sees an empty string instead of "▁" and rejects the token. As a consequence, the logits of all following tokens differ. In short, using outlines with a very simple grammar that accepts everything produces a different output than the same model without outlines.
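For illustration, the decoding behaviour at the root of the problem can be checked directly. This is a minimal sketch using the same t5-small tokenizer as in the reproduction below; convert_ids_to_tokens shows the underlying SentencePiece piece, while decode drops it:

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# id 3 is the bare SentencePiece whitespace piece "▁" in the t5-small vocabulary
print(tokenizer.convert_ids_to_tokens(3))   # prints: ▁
print(repr(tokenizer.decode([3])))          # prints: '' -- the lone "▁" decodes to an empty string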
There is a workaround: in fsm/guide.py, method _get_parser_state_token_applied, replace

if new_token_str == "":
    raise ValueError("empty next token")

with

if token_id != 3 and new_token_str == "":
    raise ValueError("empty next token")
For MT5, the "▁" token has the token id 259, so for MT5-based models 259 must be used instead of 3.
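A less brittle variant of the same workaround (my own sketch, not outlines code; the space_marker_id name is hypothetical) would be to look the id up from the tokenizer instead of hardcoding 3 or 259:

# Hypothetical: resolve the id of the bare "▁" piece for whatever tokenizer is in use
space_marker_id = tokenizer.convert_tokens_to_ids("▁")

# ...and inside _get_parser_state_token_applied the check would then read:
if token_id != space_marker_id and new_token_str == "":
    raise ValueError("empty next token")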
Steps/code to reproduce the bug:
import readline
from transformers import T5ForConditionalGeneration, T5Tokenizer
import outlines

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

# this grammar accepts everything
cfg_grammar = """start: NODE+
NODE: /./"""

olmodel = outlines.models.Transformers(model, tokenizer)
olgenerator = outlines.generate.cfg(olmodel, cfg_grammar)

def test(sentence):
    inputs = tokenizer(sentence, return_tensors="pt").input_ids
    print("inputs", inputs)
    # generation with the bare transformers model
    outs = model.generate(inputs)
    print(outs)
    for ids in outs:
        print(tokenizer.decode(ids, skip_special_tokens=True))
        for tokid in ids:
            print(tokid, "<%s>" % tokenizer.decode([tokid], skip_special_tokens=True), sep="\t")
    # generation through outlines with the CFG guide
    outs = olgenerator(sentence)
    print(outs)

t = "translate English to French: The cat has eaten the mouse"
test(t)
Expected result:
inputs tensor([[13959, 1566, 12, 2379, 10, 37, 1712, 65, 16929, 8,
8429, 1]])
tensor([[ 0, 312, 3582, 3, 9, 388, 4020, 50, 78, 459, 7, 1]])
Le chat a mangé la souris
tensor(0) <>
tensor(312) <Le>
tensor(3582) <chat>
tensor(3) <>
tensor(9) <a>
tensor(388) <man>
tensor(4020) <gé>
tensor(50) <la>
tensor(78) <so>
tensor(459) <uri>
tensor(7) <s>
tensor(1) <>
Le chat est passé de la souris