
# Inference-Acceleration-using-speculative-decoding

This repo is inspired by the Google DeepMind paper [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318).

List the available command-line options with:

```bash
python generate.py -h
```

## Autoregressive Decoding

```bash
python main.py  --method autoregressive \
                --prompt "Describe how a neural network is trained using backpropagation, and explain the significance of each step." \
                --max_new_tokens 128 \
                --temperature 0.1
```
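For reference, the autoregressive baseline generates exactly one token per model call. A minimal sketch of that loop, assuming Hugging Face `transformers` (the model name is illustrative, not necessarily what `main.py` loads):

```python
# A minimal sketch of naive autoregressive decoding, assuming Hugging Face
# `transformers`; the model name is illustrative, not necessarily main.py's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def autoregressive_generate(prompt, max_new_tokens=128, temperature=0.1):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[:, -1, :]               # next-token logits only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # one token per model call
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```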

## Speculative Decoding

```bash
python main.py  --method speculative \
                --prompt "Describe how a neural network is trained using backpropagation, and explain the significance of each step." \
                --max_new_tokens 128 \
                --temperature 0.1
```
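The idea behind the speedup: a cheap draft model proposes K tokens, and the target model verifies all K in a single forward pass, accepting each drafted token with probability min(1, p/q). A minimal sketch of one such step, assuming two Hugging Face causal LMs that share a tokenizer; this illustrates the algorithm from the paper, not this repo's exact implementation:

```python
# A sketch of one speculative-sampling step (Chen et al., 2023), assuming
# `draft` and `target` are Hugging Face causal LMs sharing one tokenizer.
# This illustrates the algorithm, not the repo's exact code.
import torch

def speculative_step(target, draft, ids, K=4, temperature=0.1):
    # 1) The cheap draft model proposes K tokens autoregressively.
    draft_probs, draft_ids = [], ids
    for _ in range(K):
        logits = draft(draft_ids).logits[:, -1, :]
        q = torch.softmax(logits / temperature, dim=-1)
        draft_probs.append(q)
        draft_ids = torch.cat([draft_ids, torch.multinomial(q, 1)], dim=-1)

    # 2) The target model scores all K drafted positions in ONE forward pass.
    logits = target(draft_ids).logits[:, -K - 1:-1, :]
    p = torch.softmax(logits / temperature, dim=-1)

    # 3) Accept drafted token t with probability min(1, p_t/q_t); on the first
    #    rejection, resample from the residual max(0, p - q) and stop.
    n = ids.shape[1]
    for t in range(K):
        tok = draft_ids[0, n + t]
        ratio = p[0, t, tok] / draft_probs[t][0, tok]
        if torch.rand(()) < torch.clamp(ratio, max=1.0):
            continue  # token accepted, check the next one
        residual = torch.clamp(p[0, t] - draft_probs[t][0], min=0.0)
        fix = torch.multinomial(residual / residual.sum(), 1).view(1, 1)
        return torch.cat([ids, draft_ids[:, n:n + t], fix], dim=-1)
    return draft_ids  # all K accepted; a full implementation also samples a bonus token
```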

## How to improve the performance of speculative decoding

1. The draft model should be significantly smaller than the target model.
2. Both models should use the same tokenizer (see the sketch after this list).
3. Efficient batching can improve throughput, but it may introduce memory-management issues.
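For example, tips 1 and 2 can both be satisfied by pairing a small and a large model from the same family, so they share a tokenizer and vocabulary (model names below are illustrative, not necessarily the ones `main.py` uses):

```python
# One way to follow tips 1 and 2: draft with a much smaller model from the
# same family as the target so both share a tokenizer. Names are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # shared by both models
draft = AutoModelForCausalLM.from_pretrained("gpt2")        # ~124M parameters
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")    # ~1.5B parameters

# Matching vocabularies make draft and target distributions directly
# comparable, which the accept/reject rule above relies on.
assert draft.config.vocab_size == target.config.vocab_size
```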

## Advantages

1. Speculative sampling offers roughly a 1.5x - 3.0x speedup over naive autoregressive decoding.
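The exact speedup depends on the draft model's acceptance rate and on your hardware; a quick way to sanity-check it yourself is to wall-clock both methods, e.g. with the hypothetical sketches above:

```python
# A rough way to check the speedup on your own hardware, reusing the
# hypothetical generate function sketched above; not a rigorous benchmark.
import time

prompt = "Describe how a neural network is trained using backpropagation."

t0 = time.perf_counter()
autoregressive_generate(prompt)          # baseline sketch from above
t_auto = time.perf_counter() - t0
# Time the speculative path the same way, then report t_auto / t_spec.
```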

## Caveats

1. If the draft model is comparable in size to the target model, its per-step overhead can cancel out the speedup.
2. Using different tokenizers for the two models will drastically degrade performance, since their token sequences and vocabularies no longer align.
