reproduce llama 3 evals #2557

Open

baberabb opened this issue Dec 10, 2024 · 6 comments
Labels
good first issue · validation

Comments

@baberabb (Contributor) commented Dec 10, 2024

Llama 3.{1,2,3} have most of their eval details released on HF, and some are also in this repo. It would be great if we could upstream them here. I'm adding some in #2556.

@baberabb added the good first issue and validation labels Dec 10, 2024
@baberabb pinned this issue Dec 10, 2024
@jon-tow unpinned this issue Jan 2, 2025
@jon-tow pinned this issue Jan 2, 2025
@baberabb (Contributor, Author) commented Jan 7, 2025

Added llama_arc_challenge (base) and mgsm_chat in #2556, and arc_challenge_chat and mmlu_llama in #2615.

@cjluo-nv (Contributor)

Hi @baberabb, do you plan to add the benchmark that can reproduce the GPQA result as well?

@baberabb (Contributor, Author) commented Jan 29, 2025

Added as llama_gpqa. Still have to test it out! Note: it looks like they use the main subset, not diamond.

@cjluo-nv (Contributor)

Also, @baberabb, do you know why max_gen_toks is set to 10 for mmlu_llama here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/llama3/instruct/mmlu/_continuation_template_yaml? It only needs to generate at most 2 tokens, right? (That would also make this test run faster.)

@cjluo-nv (Contributor) commented Feb 7, 2025

Also, @baberabb, do you know whether mmlu_llama with --fewshot_as_multiturn --apply_chat_template also works for non-Llama models? I tried it on Mixtral, and the result seems lower than the mmlu benchmark.
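
For context, the invocation being discussed can also be expressed through the Python API. This is a minimal sketch, assuming a recent lm-eval version where `simple_evaluate` exposes the chat-template flags; the Mixtral checkpoint name is a placeholder:

```python
# Sketch: run mmlu_llama with the chat-template options discussed above.
# Assumes a recent lm-eval (>= 0.4.3) where these kwargs exist.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    tasks=["mmlu_llama"],
    apply_chat_template=True,   # equivalent to --apply_chat_template
    fewshot_as_multiturn=True,  # equivalent to --fewshot_as_multiturn
    log_samples=True,           # equivalent to --log_samples, to inspect outputs
)
print(results["results"])
```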

@baberabb (Contributor, Author) commented Feb 7, 2025

> Also, @baberabb, do you know whether mmlu_llama with --fewshot_as_multiturn --apply_chat_template also works for non-Llama models? I tried it on Mixtral, and the result seems lower than the mmlu benchmark.

The prompt should work well with most instruct models, I think. Have you looked at the outputs (saved with --log_samples)? Right now it assumes the model will generate the answer letter right away, which might not be true for all models; we could make the answer extraction more flexible.
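
As a hypothetical sketch of what more flexible extraction could look like: this helper is not part of the harness, and the fallback patterns are assumptions about common answer phrasings:

```python
import re

def extract_answer_letter(generation: str) -> str | None:
    """Hypothetical fallback extraction for multiple-choice answers.

    First tries the strict case (the letter is generated right away),
    then falls back to scanning for phrases like "the answer is (B)".
    """
    # Strict case: generation starts with the letter, e.g. "B." or "(C) ...".
    m = re.match(r"\s*\(?([ABCD])\)?[.):\s]", generation + " ")
    if m:
        return m.group(1)
    # Fallback: look for "answer is (B)", "Answer: C", etc.
    m = re.search(r"answer(?:\s+is)?[:\s]+\(?([ABCD])\)?", generation, re.IGNORECASE)
    if m:
        return m.group(1)
    return None
```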

And the generation tokens are set to 10 because that's what's used in the Llama evals (they don't provide the stop tokens, though, so I added `.` here).
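
Putting those two details together, the generation settings described in this thread amount to something like the following sketch (reconstructed from the comments above, not copied from the linked file):

```yaml
# Sketch of the fields under discussion, per the comments above.
generation_kwargs:
  until:
    - "."           # stop sequence added here; Llama's evals don't specify one
  max_gen_toks: 10  # matches the Llama eval setup, though ~2 tokens would suffice
```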
