Please check lm_eval to evaluate base models.
If you use lm-evaluation-harness to evaluate generation-based tasks (e.g., GSM8K and MATH), make sure to apply the chat template (e.g., the Llama chat template) instead of the harness's default prompt format. The distilled models are trained only on instruction-tuning data, so evaluating them without the chat template produces invalid results and drastically lower scores.
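A minimal sketch of what this can look like through lm-evaluation-harness's Python API; the model path is a placeholder, and the `apply_chat_template` / `fewshot_as_multiturn` options are only available in recent harness releases, so check your installed version:

```python
import lm_eval

# Evaluate a distilled (instruction-tuned) model on a generation-based task.
# apply_chat_template wraps every request in the tokenizer's chat template
# (e.g., the Llama chat template) instead of the harness's default plain prompt.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/distilled-model,dtype=bfloat16",  # placeholder path
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
    apply_chat_template=True,   # required for the distilled / instruction-tuned models
    fewshot_as_multiturn=True,  # present few-shot examples as chat turns
)
print(results["results"]["gsm8k"])

# For base models, drop apply_chat_template / fewshot_as_multiturn and keep the
# harness's default prompt format.
```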
Please check alpaca_eval and mt_bench to evaluate chat models.
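For alpaca_eval, a hedged sketch of producing model outputs in the expected format (a JSON list of instruction/output/generator records); the dataset id, model path, generator name, and generation settings below are assumptions to adapt to your setup:

```python
import json
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/distilled-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)

# AlpacaEval instructions (dataset id taken from the alpaca_eval docs; verify it).
eval_set = load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

outputs = []
for example in eval_set:
    messages = [{"role": "user", "content": example["instruction"]}]
    # Apply the chat template, consistent with how the distilled models were trained.
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    generated = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
    completion = tokenizer.decode(
        generated[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
    outputs.append(
        {"instruction": example["instruction"], "output": completion, "generator": "my-model"}
    )

with open("model_outputs.json", "w") as f:
    json.dump(outputs, f, indent=2)
```

The resulting file can then be scored with the alpaca_eval CLI (something like `alpaca_eval --model_outputs model_outputs.json`; see its docs for the judge/annotator configuration). MT-Bench generation and judging follow FastChat's llm_judge pipeline.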
Please check zero_eval to evaluate chat models in the zero-shot setting.
Please check Needle In A Haystack to evaluate the long-context retrieval ability of chat models.
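The referenced repo has its own runner; purely as an illustration of the idea, here is a minimal self-contained sketch (placeholder model path, hypothetical needle and question) that splices a needle into a long haystack at a chosen depth and builds a chat-style request, with generation and scoring left to your chat-template pipeline:

```python
from transformers import AutoTokenizer

def build_niah_messages(tokenizer, filler, needle, question, context_tokens, depth_percent):
    """Build a needle-in-a-haystack request: a ~context_tokens-token haystack of
    filler text with `needle` spliced in at `depth_percent` of the context."""
    filler_ids = tokenizer(filler, add_special_tokens=False)["input_ids"]
    haystack_ids = (filler_ids * (context_tokens // len(filler_ids) + 1))[:context_tokens]
    pos = int(len(haystack_ids) * depth_percent)
    needle_ids = tokenizer(needle, add_special_tokens=False)["input_ids"]
    context = tokenizer.decode(haystack_ids[:pos] + needle_ids + haystack_ids[pos:])
    prompt = f"{context}\n\n{question} Answer using only the document above."
    return [{"role": "user", "content": prompt}]

tokenizer = AutoTokenizer.from_pretrained("path/to/distilled-model")  # placeholder path
needle = "The secret passphrase is 'blue harvest'."                   # hypothetical needle
question = "What is the secret passphrase?"
filler = "Grass is green. The sky is blue. The sun is bright. "

# Sweep context lengths and needle depths; pass each messages list through the
# chat template, generate, and score whether the reply contains the needle.
for context_tokens in (2048, 8192, 32768):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        messages = build_niah_messages(tokenizer, filler, needle, question,
                                       context_tokens, depth)
```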
To benchmark inference speed, please check speed.sh and adapt it to your setup.