

Add prompt version 0.2.1 for JCommonsenseQA #104

Merged 4 commits into jp-stable on Oct 24, 2023

Conversation


@mkshing mkshing commented Oct 12, 2023

Background

In principle, "base" models (trained purely as language models, without a specific prompt format) should be evaluated with prompt version 0.2. However, it was reported that 0.3 outperformed 0.2, which was unexpected.

So we compared 0.2 and 0.3 on several models (thank you @mrorii!) and found that 0.3 increased scores for all base models on JCommonsenseQA and JNLI.

Summary

JCommonsenseQA is a question-answering task with 5 answer choices per question. In 0.2, the prompt looks like the following. (reference link)

質問と回答の選択肢を入力として受け取り、選択肢から回答を選択してください。なお、回答は選択肢の番号(例:0)でするものとします。

質問:街のことは?
選択肢:0.タウン,1.劇場,2.ホーム,3.ハウス,4.ニューヨークシティ
回答:

(In English: "Take a question and answer choices as input, and select an answer from the choices. The answer should be given as the number of the choice (e.g. 0). / Question: What relates to a city? / Choices: 0. town, 1. theater, 2. home, 3. house, 4. New York City / Answer:")

The prompt instructs the model to answer with the choice's index, but the scored targets are actually the choice texts themselves. So I assume this gap between instruction and scoring confused the models. (code)
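To make the gap concrete, here is a minimal sketch (not the harness's actual code) of how a 0.2-style prompt is built and how multiple-choice scoring typically works; the function name `build_prompt` and the scoring comment are illustrative assumptions.

```python
# Sketch of the 0.2-style JCommonsenseQA prompt and the instruction/target
# mismatch. `build_prompt` is a hypothetical helper, not harness code.

PROMPT_02 = (
    "質問と回答の選択肢を入力として受け取り、選択肢から回答を選択してください。"
    "なお、回答は選択肢の番号(例:0)でするものとします。\n\n"
    "質問:{question}\n"
    "選択肢:{choices}\n"
    "回答:"
)

def build_prompt(question: str, choices: list[str]) -> str:
    # Number the choices "0.…,1.…" as in the 0.2 prompt.
    numbered = ",".join(f"{i}.{c}" for i, c in enumerate(choices))
    return PROMPT_02.format(question=question, choices=numbered)

choices = ["タウン", "劇場", "ホーム", "ハウス", "ニューヨークシティ"]
prompt = build_prompt("街のことは?", choices)

# The instruction says "answer with the number (e.g. 0)", yet multiple-choice
# evaluation scores the log-likelihood of each *choice text* as the
# continuation of the prompt:
candidates = [prompt + c for c in choices]  # "…回答:タウン", "…回答:劇場", …
# A model that obeys the instruction and prefers "回答:0" gets no credit,
# because the bare index is never among the scored continuations.
```

This is the mismatch 0.2.1 is meant to remove: the instruction and the scored targets should agree on whether the answer is an index or a text.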

Solution

  • Introduced a new prompt version 0.2.1 for base models, which consistently outperformed 0.2 (and outperformed 0.3 in some cases).

| Model | # of shots | prompt version | acc |
|---|---|---|---|
| elyza/ELYZA-japanese-Llama-2-7b | 0 | 0.2 | 31.64 |
| elyza/ELYZA-japanese-Llama-2-7b | 0 | 0.3 | 38.96 |
| elyza/ELYZA-japanese-Llama-2-7b | 0 | 0.2.1 (NEW!) | 45.49 |
| matsuo-lab/weblab-10b | 0 | 0.2 | 23.32 |
| matsuo-lab/weblab-10b | 0 | 0.3 | 42.27 |
| matsuo-lab/weblab-10b | 0 | 0.2.1 (NEW!) | 25.47 |

\* all evals were performed with hf-causal-experimental and jcommonsenseqa-1.1-{prompt version}
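For reference, a reproduction command might look like the sketch below. This assumes the repo keeps the upstream EleutherAI lm-evaluation-harness CLI (`main.py` with `--model`/`--tasks`/`--num_fewshot` flags); check this repo's README for the exact invocation.

```shell
# Hypothetical invocation, assuming the upstream harness CLI is unchanged.
python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=elyza/ELYZA-japanese-Llama-2-7b \
    --tasks jcommonsenseqa-1.1-0.2.1 \
    --num_fewshot 0
```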

@mkshing mkshing self-assigned this Oct 12, 2023
@mkshing mkshing changed the title Add prompt version 0.21 for JCommonsenseQA Add prompt version 0.2.1 for JCommonsenseQA Oct 12, 2023
@mrorii mrorii left a comment

Sorry for the late review 🙇, LGTM 👍

Collaborator

@polm-stability polm-stability left a comment


One small change requested.

Also, this is in draft - are you actually still working on it, or is it ready?

Review comment on lm_eval/tasks/ja/jcommonsenseqa.py (outdated, resolved)
@mkshing mkshing marked this pull request as ready for review October 24, 2023 04:40
@mkshing mkshing requested a review from jon-tow as a code owner October 24, 2023 04:40
@mkshing mkshing removed the request for review from jon-tow October 24, 2023 04:45

mkshing commented Oct 24, 2023

Although 0.2.1 did not outperform 0.3 on every model, we confirmed that 0.2.1 is clearly better than 0.2 for base models, so I will merge this PR.

@mkshing mkshing merged commit 6af55ef into jp-stable Oct 24, 2023
1 check passed
@mkshing mkshing deleted the mkshing/jcomm-0-21 branch October 24, 2023 05:11