# Speculative Sampling

Implementation of [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/pdf/2302.01318).
View the available options:

```bash
python generate.py -h
```
Run naive autoregressive decoding:

```bash
python main.py --method autoregressive \
--prompt "Describe how a neural network is trained using backpropagation, and explain the significance of each step." \
--max_new_tokens 128 \
--temperature 0.1
```
Run speculative sampling:

```bash
python main.py --method speculative \
--prompt "Describe how a neural network is trained using backpropagation, and explain the significance of each step." \
--max_new_tokens 128 \
--temperature 0.1
```
- The draft model must be significantly smaller than the target model.
- Both models should use the same tokenizer.
- Efficient batching can improve throughput, but may introduce memory-management issues.
- Speculative sampling offers a 1.5x - 3.0x speedup over naive autoregressive decoding.
- A draft model comparable in size to the target model adds enough drafting overhead to reduce, or even negate, the speedup.
- Using different tokenizers for the two models will drastically degrade performance.
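The accept/reject rule at the core of speculative sampling can be sketched as follows. This is a minimal, illustrative implementation (the function and variable names are hypothetical, not from this repo): a drafted token `x` is accepted with probability `min(1, p[x]/q[x])`, and on rejection a replacement is drawn from the normalized residual `max(0, p - q)`. This correction is what guarantees the output distribution exactly matches the target model's, regardless of draft quality.

```python
import random

def speculative_step(p, q, draft_token, rng=random.random):
    """One accept/reject step of speculative sampling.

    p: target-model distribution over the vocabulary (list of floats)
    q: draft-model distribution over the vocabulary
    draft_token: token index proposed by the draft model (sampled from q)

    Returns (token, accepted). Accept the draft token with probability
    min(1, p[x]/q[x]); on rejection, resample from the residual
    distribution max(0, p - q), renormalized.
    """
    if rng() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # Rejected: sample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    # If p == q everywhere the residual is all-zero; fall back to p.
    weights = [r / z for r in residual] if z > 0 else p
    r, acc = rng(), 0.0
    for tok, w in enumerate(weights):
        acc += w
        if r <= acc:
            return tok, False
    return len(weights) - 1, False
```

Sampling the draft token from `q` and then applying this step yields samples distributed exactly according to `p`, which is why the method speeds up decoding without changing the target model's output distribution.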