Introduces a new file, generation_algorithm.py, which implements a speculative sampling algorithm. The algorithm is integrated into the BaseOnsiteLLM class through a new speculative_sampling attribute.
The speculative sampling algorithm receives its parameters through generation_kw_args when the BaseOnsiteLLM class is initialized: draft_model_uri, plus two optional hyperparameters, k and scheduler, which control how many draft tokens are generated per iteration.
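A minimal initialization sketch under stated assumptions: the import path, the model_uri constructor argument, and the concrete values shown are hypothetical; only draft_model_uri, k, and scheduler come from this PR.

```python
# Hypothetical sketch -- the import path and constructor arguments other than
# generation_kw_args are assumptions, not the actual project API.
from llm import BaseOnsiteLLM  # assumed import location

model = BaseOnsiteLLM(
    model_uri="org/target-model",              # assumed existing argument
    generation_kw_args={
        "draft_model_uri": "org/draft-model",  # required to enable speculative sampling
        "k": 4,                                # optional: draft tokens proposed per iteration
        "scheduler": "constant",               # optional: adjusts k across iterations
    },
)
```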
The algorithm is invoked through the complete method of BaseOnsiteLLM when the speculative_sampling attribute is set, and it returns the newly generated token IDs. The method also takes an optional alignment parameter, which determines how closely the probabilities of the draft tokens must match those of the target model.
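Continuing the sketch above, a hypothetical call to complete; the prompt keyword and the exact signature beyond alignment are assumptions.

```python
# Hypothetical call sketch -- only the alignment keyword and the return value
# (newly generated token IDs) are described by this PR.
new_token_ids = model.complete(
    prompt="The capital of France is",
    alignment=1,  # default: require draft tokens to match the target distribution exactly
)
print(new_token_ids)
```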
With alignment set to 1 (perfect alignment, the default), the algorithm reproduces exactly the output the target model would have generated on its own. The implementation currently handles a batch size of 1, matching how the generate method of BaseOnsiteLLM is handled today.
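For context, a conceptual sketch of one speculative-sampling iteration for batch size 1. This is not the code in generation_algorithm.py; the treatment of alignment as a relaxation factor on the standard accept/reject test is an assumed interpretation of the parameter described above.

```python
import numpy as np

def speculative_step(target_probs, draft_probs, draft_tokens, alignment=1.0, rng=None):
    """Conceptual sketch of one speculative-sampling iteration (batch size 1).

    target_probs, draft_probs: arrays of shape (k, vocab_size) holding the
    target- and draft-model distributions at each of the k drafted positions.
    draft_tokens: the k token IDs proposed by the draft model.
    alignment: at 1.0 the standard accept/reject test is used, so the output
    distribution matches the target model exactly; smaller values (an assumed
    interpretation of the PR's parameter) relax the acceptance test.
    Returns the accepted token IDs, plus one correction token on rejection.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_t = target_probs[i, tok]
        p_d = draft_probs[i, tok]
        # Standard speculative sampling accepts with probability min(1, p_t / p_d);
        # alignment < 1 scales the draft probability down, making acceptance easier.
        if rng.random() < min(1.0, p_t / max(alignment * p_d, 1e-12)):
            accepted.append(int(tok))
        else:
            # On rejection, resample from the residual distribution
            # max(0, p_target - p_draft), renormalised, and stop this iteration.
            residual = np.clip(target_probs[i] - draft_probs[i], 0.0, None)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    # The full algorithm also samples one bonus token from the target model
    # when all k draft tokens are accepted; omitted here for brevity.
    return accepted
```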
fixes #367