
Add support for generative answering of multiple_choice tasks #2601

Open · pasky wants to merge 4 commits into main

Conversation

@pasky (Contributor) commented Dec 29, 2024

As things stand, multiple_choice tasks cannot be evaluated with popular API models that support only generate_until, not loglikelihood.

This PR introduces a flag that allows these tasks to still be evaluated, via an emulation mode of sorts that simply asks the model to generate the answer. The abcd approach and most of the prompt are borrowed from openai/evals.

@pasky (Contributor, Author) commented Dec 30, 2024

For reference, some tinyBenchmarks results with Claude:

```shell
lm_eval --tasks tinyArc --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```

local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------|------:|------|-----:|--------|---|-----:|---|------|
|tinyArc|      0|none  |    25|acc_norm|↑  |0.8819|±  |   N/A|

```shell
lm_eval --tasks tinyHellaswag --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```

local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|tinyHellaswag|      0|none  |    10|acc_norm|↑  |0.8283|±  |   N/A|

```shell
lm_eval --tasks tinyMMLU --model local-chat-completions --model_args model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False --gen_kwargs until='qwelrjh',max_tokens=512 --log_samples --output_path results0 --apply_chat_template --fewshot_as_multiturn --multiple_choice_generate abcd --hf_hub_log_args hub_repo_name=eval-claude-3-5-sonnet-20241022,push_results_to_hub,push_samples_to_hub,public_repo
```

local-chat-completions (model=claude-3-5-sonnet-20241022,base_url=http://localhost:4000/v1/chat/completions,num_concurrent=4,max_retries=3,tokenized_requests=False), gen_kwargs: (until=qwelrjh,max_tokens=512), limit: None, num_fewshot: None, batch_size: 1
| Tasks  |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|--------|------:|------|-----:|--------|---|----:|---|------|
|tinyMMLU|      0|none  |     0|acc_norm|↑  |0.789|±  |   N/A|

Comment on lines +444 to +445

```python
doc_system_instruction += " "
if multiple_choice_generate == "abcd":
```

Reviewer (Contributor):

May I suggest not hardcoding these. What if doc_system_instruction is supposed to use some other delimiter? What if the set of choices is not four letters, not these four letters, or not letters at all? This framework supports external tasks and already has multiple forks, so there may be (I say "may" rather than "are" because I haven't gone looking for proof) multiple choice tasks set up differently than "abcd".

@pasky (Contributor, Author) replied:

Just to be clear, "abcd" is only a (would-be) user-friendly name for the feature; the letters aren't actually derived from the value.

Maybe the name is just confusing, but more modes can be added easily.

Comment on lines +446 to +448

```python
    doc_system_instruction += "Please include \"ANSWER: <letter>\" in your response with the letter of the correct last answer."
else:
    doc_system_instruction += "Please answer with the letter of the correct last answer."
```

Reviewer (Contributor):

What about the non-English tasks that are already in this repo?

@pasky (Contributor, Author) replied:

It's a great point. This approach follows the openai/evals philosophy, which hardcodes these strings. Maybe there could be a way to override the instructions task by task, or to require users to provide them at invocation time. It depends on the overall philosophy of what belongs in the task definition versus in the harness, which I'm not sure I completely understand from the outside at first look.

@FarisHijazi commented:

Any plans to merge this? How can I help? We really need this feature.

@baberabb (Contributor) commented:

Hi! Sorry for the delay in reviewing. I generally think this is a great PR, and it was badly needed. I'll try to look at it in more detail this week!

@pasky (Contributor, Author) commented Jan 20, 2025

Hi, I just want to say I'm sorry, but I had to move on to other things and I'm not sure I can keep working on this PR.

If I could go back in time, I'm not 100% sure I'd do it this way again. I started with the mindset of evaluating multiple_choice tasks with generative chat completions without touching the tasks at all. In reality, I realized I needed to touch a lot of tasks anyway.

I think that if I did this again, I'd look harder at whether I could make this more task-config oriented: mainly, allow multiple_choice task definitions to easily specify how they should be evaluated via chat-completion generation, then explicitly add that specification to various common tasks.
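To make the task-config-oriented idea concrete, a per-task specification might look something like the sketch below. None of these keys exist in the harness today; they are purely illustrative of where the hardcoded prompt pieces could move.

```yaml
# Hypothetical task config fragment; keys under multiple_choice_generate
# are invented for illustration and are not part of the harness.
task: tinyArc
output_type: multiple_choice
multiple_choice_generate:
  style: letters                 # how choices are labeled: letters, numbers, verbatim
  instruction: 'Please include "ANSWER: <letter>" in your response.'
  answer_regex: 'ANSWER:\s*([A-Z])'
```

Per-task `instruction` and `answer_regex` would also address the non-English concern raised above, since each task could carry prompt text in its own language.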

@pasky (Contributor, Author) commented Jan 20, 2025

(Apology 2: I accidentally pushed the commits from #2599 here too (already merged), making the changeset needlessly big.)
