Please check lm_eval to evaluate base models.
If you use lm-evaluation-harness to evaluate generation-based tasks (e.g., GSM8K and MATH), make sure to apply the chat template (e.g., the Llama chat template) instead of the harness's default prompt format. The distilled models are trained only on instruction-tuning data, so evaluating them without the chat template produces invalid results and drastically lower scores.
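A minimal sketch of what this can look like through lm-evaluation-harness's Python API; the model path is a placeholder, and the `apply_chat_template` / `fewshot_as_multiturn` options are only available in recent harness releases, so check your installed version:

```python
import lm_eval

# Evaluate a distilled (instruction-tuned) model on a generation-based task.
# apply_chat_template wraps every request in the tokenizer's chat template
# (e.g., the Llama chat template) instead of the harness's default plain prompt.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/distilled-model,dtype=bfloat16",  # placeholder path
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
    apply_chat_template=True,   # required for the distilled / instruction-tuned models
    fewshot_as_multiturn=True,  # present few-shot examples as chat turns
)
print(results["results"]["gsm8k"])

# For base models, drop apply_chat_template / fewshot_as_multiturn and keep the
# harness's default prompt format.
```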
Please check alpaca_eval and mt_bench to evaluate chat models.
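For alpaca_eval, a hedged sketch of producing model outputs in the expected format (a JSON list of instruction/output/generator records); the dataset id, model path, generator name, and generation settings below are assumptions to adapt to your setup:

```python
import json
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/distilled-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)

# AlpacaEval instructions (dataset id taken from the alpaca_eval docs; verify it).
eval_set = load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

outputs = []
for example in eval_set:
    messages = [{"role": "user", "content": example["instruction"]}]
    # Apply the chat template, consistent with how the distilled models were trained.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    generated = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
    completion = tokenizer.decode(
        generated[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
    outputs.append(
        {"instruction": example["instruction"], "output": completion, "generator": "my-model"}
    )

with open("model_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

The resulting file can then be scored with the alpaca_eval CLI (something like `alpaca_eval --model_outputs model_outputs.json`; see its docs for the judge/annotator configuration). MT-Bench generation and judging follow FastChat's llm_judge pipeline.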
Please check zero_eval to evaluate chat models in the zero-shot setting.
Please check Needle In A Haystack to evaluate the long-context retrieval ability of chat models.
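The referenced repo has its own runner; purely as an illustration of the idea, here is a minimal self-contained sketch (placeholder model path, hypothetical needle and question) that splices a needle into a long haystack at a chosen depth and builds a chat-style request, with generation and scoring left to your chat-template pipeline:

```python
from transformers import AutoTokenizer

def build_niah_messages(tokenizer, filler, needle, question, context_tokens, depth_percent):
    """Build a needle-in-a-haystack request: a ~context_tokens-token haystack of
    filler text with `needle` spliced in at `depth_percent` of the context."""
    filler_ids = tokenizer(filler, add_special_tokens=False)["input_ids"]
    haystack_ids = (filler_ids * (context_tokens // len(filler_ids) + 1))[:context_tokens]
    pos = int(len(haystack_ids) * depth_percent)
    needle_ids = tokenizer(needle, add_special_tokens=False)["input_ids"]
    context = tokenizer.decode(haystack_ids[:pos] + needle_ids + haystack_ids[pos:])
    prompt = f"{context}\n\n{question} Answer using only the document above."
    return [{"role": "user", "content": prompt}]

tokenizer = AutoTokenizer.from_pretrained("path/to/distilled-model")  # placeholder path
needle = "The secret passphrase is 'blue harvest'."                   # hypothetical needle
question = "What is the secret passphrase?"
filler = "Grass is green. The sky is blue. The sun is bright. "

# Sweep context lengths and needle depths; pass each messages list through the
# chat template, generate, and score whether the reply contains the needle.
for context_tokens in (2048, 8192, 32768):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        messages = build_niah_messages(tokenizer, filler, needle, question,
                                       context_tokens, depth)
```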
To benchmark inference speed, please check speed.sh and adapt it to your setup.