Question on speculative sampling #73
Replies: 2 comments
-
It's not quite how it works. Here's an example (assuming one token per word for simplicity): The draft model is fed the prompt
Now the full model is fed the whole draft in one forward pass, giving us all at once:
Next we sample from the full logits. Now, suppose the full model produces
So we rewind the draft model by two tokens and feed it the better token:
So all in all we've advanced two tokens with three passes of the draft model and one pass of the full model.
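The acceptance logic described above can be sketched in a few lines. This is a hypothetical greedy toy (real implementations such as exllamav2 sample from the logits rather than taking the argmax, and `draft_next`/`full_next_all` stand in for actual model forward passes):

```python
# Hypothetical sketch of greedy speculative decoding with toy lookup "models".

def draft_next(tokens):
    # Toy draft model: one pass per token; guesses wrong after "nice".
    table = {"The": "weather", "weather": "is", "is": "nice", "nice": "and"}
    return table[tokens[-1]]

def full_next_all(tokens):
    # Toy full model: a single "forward pass" returns a prediction for EVERY
    # position, i.e. the token it thinks should follow tokens[:i+1] for each i.
    table = {"The": "weather", "weather": "is", "is": "nice", "nice": "today"}
    return [table.get(t, "<eos>") for t in tokens]

def speculate(prompt, k=3):
    # Draft model extends the prompt by k tokens, one cheap pass each.
    draft = list(prompt)
    for _ in range(k):
        draft.append(draft_next(draft))
    # Full model checks the whole draft in one pass.
    preds = full_next_all(draft)
    # Accept draft tokens while they match the full model's predictions.
    accepted = list(prompt)
    for i in range(len(prompt), len(draft)):
        if draft[i] == preds[i - 1]:
            accepted.append(draft[i])
        else:
            # First mismatch: rewind, take the full model's token, stop.
            accepted.append(preds[i - 1])
            break
    else:
        # Every draft token accepted: the last logits row is a free bonus token.
        accepted.append(preds[-1])
    return accepted

print(speculate(["The", "weather"]))
# The draft "is nice and" is checked; "is" and "nice" are accepted,
# "and" is rejected in favour of the full model's "today".
```

The draft cost three cheap passes and the full model ran once, yet the output advanced by three tokens, which is where the speedup comes from.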
-
Ahh, I could not have wished for a better explanation! Thank you so so much! ^^
-
I am super glad about the speculative sampling implementation, thank you very much!
I don't quite understand this part: https://github.com/turboderp/exllamav2/blob/9385fefc00e34b0af04403ecf3dcbc89f25fe2b6/exllamav2/generator/speculative.py#L94C64-L94C64
If I understand correctly, the prompt+draft is forwarded through the base model here, which generates logits for the next future token. But how can you score the draft tokens using only the logits for a future token?
Let's say "The weather " is the prompt, "is nice " is the draft, and the base model will predict "today" next. The base model is passed "The weather is nice ", and the logits for the next token (potentially "today") are generated. How is it possible to score and validate the tokens "is nice" with only the logits for "today"?
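One way to see why this works: a causal language model's forward pass returns a row of logits at *every* input position, not only the last one, so forwarding prompt+draft scores each draft token against the logits at the position just before it. A toy numeric sketch (all names and the one-hot lookup are hypothetical, just to make the shapes concrete):

```python
# Hypothetical sketch: a causal LM over a length-n input returns an
# (n, vocab) matrix of logits, one row per position. Row i scores the
# token that should follow tokens[:i+1].

vocab = ["The", "weather", "is", "nice", "today"]

def toy_forward(tokens):
    # Stand-in for a real forward pass over prompt + draft: fake one-hot
    # logits from a lookup table, one row per input position.
    table = {"The": "weather", "weather": "is", "is": "nice", "nice": "today"}
    return [[10.0 if v == table[t] else 0.0 for v in vocab] for t in tokens]

prompt = ["The", "weather"]
draft = ["is", "nice"]
logits = toy_forward(prompt + draft)  # 4 rows: one per input position

# Each draft token is scored by the logits row at the preceding position:
for j, tok in enumerate(draft):
    row = logits[len(prompt) - 1 + j]
    predicted = vocab[row.index(max(row))]
    print(tok, "is scored against", predicted)

# Only the final row, logits[-1], corresponds to the future token "today",
# and it is used solely as the bonus prediction once the draft is accepted.
```

So "is" is validated by the logits after "The weather ", "nice" by the logits after "The weather is ", and the "today" logits never score the draft at all.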