Add test and train sets to in-loop oe-eval (for ladder work) #748

Merged
merged 4 commits into from
Nov 19, 2024

liujch1998
Contributor

Standardizing what we should eval for the ladder work:

  • 10 tasks from OLMES
  • val sets, and (when applicable) test and/or train sets
  • No subsampling
  • rc_5shot, mc_5shot, and their bpb versions (for MMLU, I kept an rc_var version)
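
For readers unfamiliar with the bpb variants listed above, here is a minimal sketch of how bits-per-byte is commonly derived from a continuation's summed log-likelihood. The function name and arguments are hypothetical and are not taken from this PR; the in-loop implementation may differ in its details.

```python
import math


def bits_per_byte(sum_logprob_nats: float, continuation: str) -> float:
    """Convert a summed log-likelihood (in nats) into bits per UTF-8 byte.

    Hypothetical helper for illustration; not part of the PR.
    """
    num_bytes = len(continuation.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("continuation must be non-empty")
    # Log-probs are <= 0, so negate; divide by ln(2) to go from nats to bits,
    # then normalize by the byte length of the continuation.
    return -sum_logprob_nats / (num_bytes * math.log(2))
```

Lower is better under this convention, and the byte normalization is what makes the number comparable across tokenizers.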

Stats:
[Image: Screenshot 2024-11-18 at 15 23 03]

@liujch1998 liujch1998 marked this pull request as ready for review November 18, 2024 23:23
"arc_easy_train_rc_5shot": (
    OEEvalTask,
    {"dataset_path": "arc_easy", "dataset_name": "train_rc_5shot", "metric_type": "len_norm"},
),  # this used to be acc
Contributor Author

Can we safely change acc to len_norm here? I want to match what arc_challenge is doing.
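
To make the question concrete, here is a minimal sketch of how the two metric types can disagree on the same example. It assumes that acc ranks answer choices by raw summed log-likelihood and that len_norm divides that sum by the continuation's length in characters; this reading is an assumption and has not been checked against the evaluator code. `pick_choice` is a hypothetical helper.

```python
def pick_choice(sum_logprobs, continuations, metric_type="acc"):
    """Return the index of the answer choice a metric would select.

    Hypothetical helper. Assumes "acc" uses raw summed log-likelihoods and
    "len_norm" divides each sum by the continuation length in characters.
    """
    if metric_type == "acc":
        scores = list(sum_logprobs)
    elif metric_type == "len_norm":
        if any(len(c) == 0 for c in continuations):
            raise ValueError("continuations must be non-empty")
        scores = [lp / len(c) for lp, c in zip(sum_logprobs, continuations)]
    else:
        raise ValueError(f"unsupported metric_type: {metric_type}")
    # Highest score wins; ties resolve to the earliest choice.
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["yes", "a much longer answer"]
logprobs = [-3.0, -6.0]
acc_pick = pick_choice(logprobs, choices, "acc")
len_norm_pick = pick_choice(logprobs, choices, "len_norm")
```

With these made-up numbers, `acc_pick` is 0 and `len_norm_pick` is 1, because length normalization stops penalizing the longer choice. Under this assumption, switching metric_type can change reported scores for tasks whose answer choices vary in length.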

Contributor

@OyvindTafjord OyvindTafjord left a comment

LGTM. I looked through the tasks and didn't spot anything suspicious. (The number of tasks is getting pretty unwieldy at this point; it would be good to streamline this somehow in the future.)
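
One possible direction for the streamlining mentioned above is to generate the registry entries instead of writing each one by hand. The sketch below is an illustration only: `make_entries` is a hypothetical helper, the `OEEvalTask` class is a placeholder stand-in for the real one, and the naming pattern is inferred from the single `arc_easy_train_rc_5shot` entry quoted in this PR, so other entries may not follow it.

```python
from itertools import product


class OEEvalTask:
    """Placeholder stand-in for the real task class, for illustration only."""


def make_entries(dataset_path, splits, formats, metric_type):
    """Build registry entries for every (split, format) combination.

    Hypothetical helper; the key pattern "<dataset_path>_<split>_<format>"
    is an assumption based on one entry shown in the PR.
    """
    entries = {}
    for split, fmt in product(splits, formats):
        dataset_name = f"{split}_{fmt}"
        entries[f"{dataset_path}_{dataset_name}"] = (
            OEEvalTask,
            {
                "dataset_path": dataset_path,
                "dataset_name": dataset_name,
                "metric_type": metric_type,
            },
        )
    return entries


arc_easy_entries = make_entries(
    "arc_easy", ["train", "test"], ["rc_5shot", "mc_5shot"], "len_norm"
)
```

A generator like this keeps the per-task differences (for example, a different metric_type per format) in one small table rather than spread across many near-identical literals.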

@liujch1998 liujch1998 merged commit 9c677c9 into main Nov 19, 2024
12 checks passed
@liujch1998 liujch1998 deleted the oeeval-ladder-testtrain branch November 19, 2024 00:26
2 participants