Question on speculative sampling #73
Replies: 2 comments
-
It's not quite how it works. Here's an example (assuming one token per word for simplicity): The draft model is fed the prompt
Now the full model is fed the whole draft in one forward pass, giving us all at once:
Next we sample from the full logits. Now, suppose the full model produces
So we rewind the draft model by two tokens and feed it the better token:
So all in all we've advanced two tokens with three passes of the draft model and one pass of the full model.
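The acceptance logic described above can be sketched in a few lines. This is a hypothetical greedy toy (real implementations such as exllamav2 sample from the logits rather than taking the argmax, and `draft_next`/`full_next_all` stand in for actual model forward passes):

```python
# Hypothetical sketch of greedy speculative decoding with toy lookup "models".

def draft_next(tokens):
    # Toy draft model: one pass per token; guesses wrong after "nice".
    table = {"The": "weather", "weather": "is", "is": "nice", "nice": "and"}
    return table[tokens[-1]]

def full_next_all(tokens):
    # Toy full model: a single "forward pass" returns a prediction for EVERY
    # position, i.e. the token it thinks should follow tokens[:i+1] for each i.
    table = {"The": "weather", "weather": "is", "is": "nice", "nice": "today"}
    return [table.get(t, "<eos>") for t in tokens]

def speculate(prompt, k=3):
    # Draft model extends the prompt by k tokens, one cheap pass each.
    draft = list(prompt)
    for _ in range(k):
        draft.append(draft_next(draft))
    # Full model checks the whole draft in one pass.
    preds = full_next_all(draft)
    # Accept draft tokens while they match the full model's predictions.
    accepted = list(prompt)
    for i in range(len(prompt), len(draft)):
        if draft[i] == preds[i - 1]:
            accepted.append(draft[i])
        else:
            # First mismatch: rewind, take the full model's token, stop.
            accepted.append(preds[i - 1])
            break
    else:
        # Every draft token accepted: the last logits row is a free bonus token.
        accepted.append(preds[-1])
    return accepted

print(speculate(["The", "weather"]))
# The draft "is nice and" is checked; "is" and "nice" are accepted,
# "and" is rejected in favour of the full model's "today".
```

The draft cost three cheap passes and the full model ran once, yet the output advanced by three tokens, which is where the speedup comes from.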
-
Ahh, I could not have wished for a better explanation! Thank you so so much! ^^
-
I am super glad about the speculative sampling implementation, thank you very much!
I don't quite understand this part: https://github.com/turboderp/exllamav2/blob/9385fefc00e34b0af04403ecf3dcbc89f25fe2b6/exllamav2/generator/speculative.py#L94C64-L94C64
If I understand correctly, the prompt+draft is forwarded through the base model here, which generates logits for the next future token. But how can you score the draft tokens using only the logits for a future token?
Let's say "The weather " is the prompt, "is nice " is the draft, and the base model will predict "today" next. The base model is passed "The weather is nice ", and the logits for the next token (potentially "today") are generated. How is it possible to score and validate the tokens "is nice" with only the logits for "today"?
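One way to see why this works: a causal language model's forward pass returns a row of logits at *every* input position, not only the last one, so forwarding prompt+draft scores each draft token against the logits at the position just before it. A toy numeric sketch (all names and the one-hot lookup are hypothetical, just to make the shapes concrete):

```python
# Hypothetical sketch: a causal LM over a length-n input returns an
# (n, vocab) matrix of logits, one row per position. Row i scores the
# token that should follow tokens[:i+1].

vocab = ["The", "weather", "is", "nice", "today"]

def toy_forward(tokens):
    # Stand-in for a real forward pass over prompt + draft: fake one-hot
    # logits from a lookup table, one row per input position.
    table = {"The": "weather", "weather": "is", "is": "nice", "nice": "today"}
    return [[10.0 if v == table[t] else 0.0 for v in vocab] for t in tokens]

prompt = ["The", "weather"]
draft = ["is", "nice"]
logits = toy_forward(prompt + draft)  # 4 rows: one per input position

# Each draft token is scored by the logits row at the preceding position:
for j, tok in enumerate(draft):
    row = logits[len(prompt) - 1 + j]
    predicted = vocab[row.index(max(row))]
    print(tok, "is scored against", predicted)

# Only the final row, logits[-1], corresponds to the future token "today",
# and it is used solely as the bonus prediction once the draft is accepted.
```

So "is" is validated by the logits after "The weather ", "nice" by the logits after "The weather is ", and the "today" logits never score the draft at all.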