reproduce llama 3 evals #2557

Open

baberabb opened this issue Dec 10, 2024 · 6 comments
Labels
good first issue · validation

Comments

@baberabb (Contributor) commented Dec 10, 2024

Llama 3.{1,2,3} have most of their eval details released on HF, and some are also in this repo. It would be great if we could upstream them here. I'm adding some in #2556.

@baberabb added the good first issue and validation labels Dec 10, 2024
@baberabb pinned this issue Dec 10, 2024
@jon-tow unpinned this issue Jan 2, 2025
@jon-tow pinned this issue Jan 2, 2025
@baberabb (Contributor, Author) commented Jan 7, 2025

Added llama_arc_challenge (base) and mgsm_chat in #2556, and arc_challenge_chat and mmlu_llama in #2615.

@cjluo-nv (Contributor)

Hi @baberabb, do you plan to add the benchmark that can reproduce the GPQA result as well?

@baberabb (Contributor, Author) commented Jan 29, 2025

Added as llama_gpqa. Still have to test it out! Note: it looks like they use the main subset, not diamond.

@cjluo-nv (Contributor)

Also, @baberabb, do you know why max_gen_toks is set to 10 for mmlu_llama here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/llama3/instruct/mmlu/_continuation_template_yaml? It only needs to generate at most 2 tokens, right? (That would also make this test run faster.)

@cjluo-nv (Contributor) commented Feb 7, 2025

Also, @baberabb, do you know whether mmlu_llama with --fewshot_as_multiturn --apply_chat_template also works for non-Llama models? I tried it on Mixtral, and the result seems lower than the mmlu benchmark.
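
For context, the invocation being discussed can also be expressed through the Python API. This is a minimal sketch, assuming a recent lm-eval version where `simple_evaluate` exposes the chat-template flags; the Mixtral checkpoint name is a placeholder:

```python
# Sketch: run mmlu_llama with the chat-template options discussed above.
# Assumes a recent lm-eval (>= 0.4.3) where these kwargs exist.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    tasks=["mmlu_llama"],
    apply_chat_template=True,   # equivalent to --apply_chat_template
    fewshot_as_multiturn=True,  # equivalent to --fewshot_as_multiturn
    log_samples=True,           # equivalent to --log_samples, to inspect outputs
)
print(results["results"])
```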

@baberabb (Contributor, Author) commented Feb 7, 2025

> Also, @baberabb, do you know whether mmlu_llama with --fewshot_as_multiturn --apply_chat_template also works for non-Llama models? I tried it on Mixtral, and the result seems lower than the mmlu benchmark.

The prompt should work well with most instruct models, I think. Have you looked at the outputs (saved with --log_samples)? Right now it assumes the model will generate the answer letter right away, which might not be true for all models; we could make the answer extraction more flexible.
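
As a hypothetical sketch of what more flexible extraction could look like: this helper is not part of the harness, and the fallback patterns are assumptions about common answer phrasings:

```python
import re

def extract_answer_letter(generation: str) -> str | None:
    """Hypothetical fallback extraction for multiple-choice answers.

    First tries the strict case (the letter is generated right away),
    then falls back to scanning for phrases like "the answer is (B)".
    """
    # Strict case: generation starts with the letter, e.g. "B." or "(C) ...".
    m = re.match(r"\s*\(?([ABCD])\)?[.):\s]", generation + " ")
    if m:
        return m.group(1)
    # Fallback: look for "answer is (B)", "Answer: C", etc.
    m = re.search(r"answer(?:\s+is)?[:\s]+\(?([ABCD])\)?", generation, re.IGNORECASE)
    if m:
        return m.group(1)
    return None
```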

And the generation tokens are set to 10 because that's what's used in the Llama evals (they don't provide the stop tokens, though, so I added `.` here).
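
Putting those two details together, the generation settings described in this thread amount to something like the following sketch (reconstructed from the comments above, not copied from the linked file):

```yaml
# Sketch of the fields under discussion, per the comments above.
generation_kwargs:
  until:
    - "."           # stop sequence added here; Llama's evals don't specify one
  max_gen_toks: 10  # matches the Llama eval setup, though ~2 tokens would suffice
```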
