
Add support for generative answering of multiple_choice tasks #2601

Open · pasky wants to merge 4 commits into main

Conversation

@pasky (Contributor) commented Dec 29, 2024

As things stand, multiple_choice tasks cannot be evaluated with popular API models that support only generate_until, not loglikelihood.

This PR introduces a flag that allows these tasks to still be evaluated, via an emulation mode of sorts that simply asks the model to generate the answer. The abcd approach and most of the prompt are borrowed from openai/evals.

@pasky (Contributor, Author) commented Dec 30, 2024

For reference, some tinyBenchmarks results with Claude:

```shell
lm_eval --tasks tinyArc --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```

local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------|------:|------|-----:|--------|---|-----:|---|------|
|tinyArc|      0|none  |    25|acc_norm|↑  |0.8819|±  |   N/A|

```shell
lm_eval --tasks tinyHellaswag --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```

local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|tinyHellaswag|      0|none  |    10|acc_norm|↑  |0.8283|±  |   N/A|

```shell
lm_eval --tasks tinyMMLU --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```

local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks  |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|--------|------:|------|-----:|--------|---|----:|---|------|
|tinyMMLU|      0|none  |     0|acc_norm|↑  |0.789|±  |   N/A|

Comment on lines +444 to +445

```python
doc_system_instruction += " "
if multiple_choice_generate == "abcd":
```

Reviewer (Contributor):

May I suggest not hardcoding these. What if doc_system_instruction is supposed to use some other delimiter? What if the set of choices is not four letters, not these four letters, or not letters at all? This framework supports external tasks and already has multiple forks, so there may be (I say "may" rather than "are" because I haven't gone looking for proof) multiple choice tasks set up differently than "abcd".

@pasky (Contributor, Author) replied:

Just to be clear, "abcd" is only a (would-be) user-friendly name for the feature; the letters aren't actually derived from the value.

Maybe the name is just confusing, but more modes can be added easily.

Comment on lines +446 to +448

```python
    doc_system_instruction += "Please include \"ANSWER: <letter>\" in your response with the letter of the correct last answer."
else:
    doc_system_instruction += "Please answer with the letter of the correct last answer."
```

Reviewer (Contributor):

What about the non-English tasks that are already in this repo?

@pasky (Contributor, Author) replied:

It's a great point. This approach follows the openai/evals philosophy, which hardcodes these strings. Maybe there could be a way to override the instructions task by task, or to require users to provide them at invocation time. It depends on the overall philosophy of what belongs in the task definition versus in the harness, which I'm not sure I completely understand from the outside at first look.

@FarisHijazi commented:

Any plans to merge this? How can I help? We really need this feature.

@baberabb (Contributor) commented:

Hi! Sorry for the delay in reviewing. I generally think this is a great PR, and it was badly needed. I'll try to look at it in more detail this week!

@pasky (Contributor, Author) commented Jan 20, 2025

Hi, I just want to say I'm sorry, but I had to move on to other things and I'm not sure I can keep working on this PR.

If I could go back in time, I'm not 100% sure I'd do it this way again. I started with the mindset of evaluating multiple_choice tasks with generative chat completions without touching the tasks at all. In reality, I realized I needed to touch a lot of tasks anyway.

I think that if I did this again, I'd look harder at whether I could make this more task-config oriented: mainly, allow multiple_choice task definitions to easily specify how they should be evaluated via chat-completion generation, then explicitly add that specification to various common tasks.
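To make the task-config-oriented idea concrete, a per-task specification might look something like the sketch below. None of these keys exist in the harness today; they are purely illustrative of where the hardcoded prompt pieces could move.

```yaml
# Hypothetical task config fragment; keys under multiple_choice_generate
# are invented for illustration and are not part of the harness.
task: tinyArc
output_type: multiple_choice
multiple_choice_generate:
  style: letters                 # how choices are labeled: letters, numbers, verbatim
  instruction: 'Please include "ANSWER: <letter>" in your response.'
  answer_regex: 'ANSWER:\s*([A-Z])'
```

Per-task `instruction` and `answer_regex` would also address the non-English concern raised above, since each task could carry prompt text in its own language.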

@pasky (Contributor, Author) commented Jan 20, 2025

(Apology 2: I accidentally pushed the commits from #2599 here too (already merged), making the changeset needlessly big.)
