
Clarification of details concerning the creation of the base data generation #3

Open
steve2972 opened this issue Dec 31, 2024 · 0 comments

Happy New Year's Eve!

Thank you for your repository and congratulations on your paper 😊
I was trying to reproduce your results but got stuck while creating the base data.
I've been following the description provided in a previous question, #1.
However, I got a very different number of instances for the pairwise dataset using UltraFeedback.

Here's what I've tried:

  • After removing the flan subset, I created an evaluation instance for each criterion in ['helpfulness', 'honesty', 'instruction_following', 'truthfulness'], leading to 43k * 4 = 172k evaluation instances.
  • For each instance, I remove only the model responses that contain a rating of "N/A" for any one of the criteria. I assume this is needed to calculate the annotation_rating_average. This also means that each evaluation instance may contain a different number of model responses to sample from.

For example, if the model response scores look like this:

// For criteria: ['helpfulness', 'honesty', 'instruction_following', 'truthfulness']
[[3, "N/A", 5, 5], [3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]

I remove the model response containing "N/A" and leave the rest.

[[3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
  • For each instance, I sample between 1 and the remaining number of responses (see the sketch after this list).
  • This leaves 167k instances: I drop the 5k instances in which every model response contains at least one "N/A" rating. In other words, for those 5k instances an exact annotation_rating_average cannot be calculated for any of the model responses.
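For concreteness, here is a minimal sketch of how I build these instances. It assumes the openbmb/UltraFeedback record layout (a source field, a completions list, and per-criterion annotations with a Rating); the helper names and the exact sampling are my own, so please read it as an illustration of my procedure rather than your implementation.

```python
import random

CRITERIA = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def has_no_na(completion):
    """Keep a model response only if none of its four per-criterion ratings is 'N/A'."""
    return all(completion["annotations"][c]["Rating"] != "N/A" for c in CRITERIA)

def rating_average(completion):
    """Average of the four per-criterion ratings (only defined when no rating is 'N/A')."""
    return sum(int(completion["annotations"][c]["Rating"]) for c in CRITERIA) / len(CRITERIA)

def build_base_instances(records, seed=0):
    rng = random.Random(seed)
    instances = []
    for rec in records:
        if rec["source"].startswith("flan"):      # drop the flan subset
            continue
        completions = [c for c in rec["completions"] if has_no_na(c)]
        if not completions:                       # every response has an 'N/A' rating -> dropped
            continue
        for criterion in CRITERIA:                # one evaluation instance per criterion
            k = rng.randint(1, len(completions))  # sample 1..(remaining #) responses
            sampled = rng.sample(completions, k)
            instances.append({
                "instruction": rec["instruction"],
                "criterion": criterion,
                "responses": sampled,
                "annotation_rating_average": [rating_average(c) for c in sampled],
            })
    return instances
```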

However, a problem arises when creating the pairwise dataset.
If 3.1k is 40% of the evaluation instances that satisfy the conditions, this implies roughly 7.8k-8k evaluation instances fit for pairwise evaluation (3.1k / 0.40 ≈ 7.75k). Given that 164k is the total number of instances used (as per your paper), this means the total number of instances should be almost 170k.

Furthermore, when counting the actual number of instances by condition, here's what I get
(note: all conditions are combined with AND; the way I apply them is sketched below):

  • uses instruction_following: 41.8k instances (40% = 16.7k)
  • has 2 candidate responses: 10.7k instances (40% = 4.3k)
  • the 2 scores are different: 7.9k instances (40% = a little over 3.1k). This is the desired number of instances!
  • the higher scoring sample has high quality, with overall score >= 8.0: 3k instances (40% = 1.2k)
  • the higher scoring sample has annotation rating average >= 4.75: 2.2k instances (40% = 0.9k)

Here, the final count of 0.9k instances satisfying all the required conditions seems really small, even accounting for the randomness of the sampling.
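In case it helps to see the exact filter, this is roughly how I check the conditions when counting (a minimal sketch; criterion_score, overall_score, and annotation_rating_average are placeholder field names for my own intermediate data, not names taken from your code):

```python
def eligible_for_pairwise(instance):
    """Apply the pairwise conditions in the same order as the counts listed above."""
    if instance["criterion"] != "instruction_following":
        return False
    responses = instance["responses"]
    if len(responses) != 2:                                    # exactly 2 candidate responses
        return False
    first, second = responses
    if first["criterion_score"] == second["criterion_score"]:  # the 2 scores must differ
        return False
    better = max(responses, key=lambda r: r["criterion_score"])
    if float(better["overall_score"]) < 8.0:                   # higher-scoring response is high quality
        return False
    if better["annotation_rating_average"] < 4.75:             # and has a high annotation rating average
        return False
    return True

# e.g. pairwise_pool = [inst for inst in base_instances if eligible_for_pairwise(inst)]
```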

Is there something I'm doing wrong when constructing the dataset? I was hoping you could give me some pointers.
