Add support for generative answering of multiple_choice tasks #2601
Conversation
For reference, some tinyBenchmark results with Claude: [results table omitted]
```python
doc_system_instruction += " "
if multiple_choice_generate == "abcd":
```
May I suggest not hardcoding these? What if doc_system_instruction is supposed to be delimited with some other delimiter? What if the set of choices is not 4 letters, not these 4 letters, or not letters at all? This framework supports external tasks and also has multiple forks already, so there may be (I am not saying "are" because I have no intention to google for proof of this idea) multiple choice tasks set up differently than "abcd".
Just to be clear, "abcd" is just a (wannabe) user-friendly name for the feature; the letters aren't actually directly derived from the value.
Maybe the name is just confusing, but more modes can be added easily.
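For illustration, a minimal sketch of the direction the reviewer suggests — deriving the labels and delimiter from the task itself instead of hardcoding a four-letter set (the function name and signature are hypothetical, not part of the harness):

```python
def build_choice_prompt(choices, labels=None, delimiter="\n"):
    """Enumerate a task's own options instead of assuming four letters.

    With no explicit labels, fall back to A, B, C, ... sized to the
    actual number of choices, so tasks with a different choice count,
    different letters, or non-letter labels can supply their own.
    """
    if labels is None:
        labels = [chr(ord("A") + i) for i in range(len(choices))]
    lines = [f"{label}. {choice}" for label, choice in zip(labels, choices)]
    return delimiter.join(lines), labels
```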
```python
    doc_system_instruction += "Please include \"ANSWER: <letter>\" in your response with the letter of the correct last answer."
else:
    doc_system_instruction += "Please answer with the letter of the correct last answer."
```
What about non-English tasks that are already in this repo?
It's a great point. This approach follows the openai/evals philosophy, which hardcodes the strings. Maybe there can be a way to override these instructions task by task, or to force users to provide them during invocation. It depends on the overall philosophy of what is defined in the task and what in the harness, which I'm not sure I completely understand from the outside at first look.
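A rough sketch of what a task-by-task override could look like, assuming a hypothetical `generation_instruction` config key (not an existing harness option):

```python
DEFAULT_INSTRUCTION = "Please answer with the letter of the correct answer."

def resolve_instruction(task_config, cli_override=None):
    """Pick the generation instruction from the most specific source:
    CLI invocation > per-task config > hardcoded English default.
    A non-English task could then ship its own instruction in its config.
    """
    if cli_override is not None:
        return cli_override
    # "generation_instruction" is a hypothetical config key for illustration
    return task_config.get("generation_instruction", DEFAULT_INSTRUCTION)
```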
Any plans to merge this? How can I help? We really need this feature.
Hi! Sorry for the delay in reviewing. I generally think this is a great PR, and it was badly needed. I'll try looking at it in more detail this week!
Hi, I just want to share that I'm sorry, but I had to move on to other things and I'm not sure I can keep working on this PR. If I could go back in time, I'm not 100% sure I'd do it this way again. I started with the mindset that I wanted to evaluate multiple_choice tasks with a generative chat completion without touching them at all. In reality, I realized I need to touch a lot of tasks anyway. If I did this again, I'd look harder at whether I could make this more task-config oriented and mainly allow multiple_choice task definitions to easily specify how they should be evaluated by chat completion generations, then explicitly add that specification to various common tasks.
(Apology 2: I accidentally pushed the commits from #2599 here too (already merged), making the changeset needlessly big.)
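For illustration, the task-config-oriented direction described above might look roughly like this — a per-task spec with hypothetical field names, not anything in the harness today:

```python
from dataclasses import dataclass, field

@dataclass
class GenerativeMCSpec:
    """Hypothetical per-task description of how a multiple_choice task
    should be evaluated via chat-completion generation instead of logprobs."""
    labels: list = field(default_factory=lambda: ["A", "B", "C", "D"])
    instruction: str = 'Please include "ANSWER: <letter>" in your response.'
    answer_regex: str = r"ANSWER:\s*([A-Z])"
```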
As they are, multiple_choice tasks cannot be evaluated with popular API models that support only generate_until, not logprobs.
This PR introduces a flag that allows these tasks to still be evaluated, via an emulation mode of sorts that simply asks the model to generate the answer. The abcd approach and most of the prompt are borrowed from openai/evals.
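For context, the core of such an emulation mode is a parsing step over the generated text; a minimal sketch of that step (illustrative only, not the PR's exact code):

```python
import re

ANSWER_RE = re.compile(r"ANSWER:\s*([A-Z])", re.IGNORECASE)

def score_generated_answer(generation, gold_index):
    """Map an 'ANSWER: <letter>' marker in the model's output back to a
    choice index (A -> 0, B -> 1, ...) and compare it to the gold index."""
    match = ANSWER_RE.search(generation)
    if match is None:
        return 0.0  # treat an unparseable response as incorrect
    picked = ord(match.group(1).upper()) - ord("A")
    return 1.0 if picked == gold_index else 0.0
```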