
Clarification of details concerning the creation of the base data generation #3

Closed
@steve2972

Description


Happy New Year's Eve!

Thank you for your repository and congratulations on your paper 😊
I was trying to reproduce your results but got stuck while creating the base data.
I've been following the description provided in a previous question, #1.
However, I got a very different number of instances for the pairwise dataset built from Ultrafeedback.

Here's what I've tried:

  • After removing flan, I created an evaluation instance for each criterion in ['helpfulness', 'honesty', 'instruction_following', 'truthfulness'], leading to 43k * 4 = 172k evaluation instances.
  • For each instance, I remove only the model responses that contain a rating of "N/A" for any one of the criteria. I assume this is done so that the annotation_rating_average can be calculated. This also means that each evaluation instance may contain a different number of model responses to sample from.

For example, if the model response scores look like this:

```
// For criteria: ['helpfulness', 'honesty', 'instruction_following', 'truthfulness']
[[3, "N/A", 5, 5], [3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
```

I remove the model response containing "N/A" and leave the rest.

```
[[3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
```
  • For each instance, I sample between 1 and the remaining number of responses.
  • This leaves 167k instances, after dropping the 5k instances in which every model response contains at least one "N/A". In other words, there are 5k instances for which you can't calculate an exact annotation_rating_average for any of the model responses.
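For concreteness, here is a minimal sketch of the procedure above (not exact code; the field names "completions", "annotations", and "Rating" are my approximation of the Ultrafeedback schema and may not match your preprocessing):

```python
import random

CRITERIA = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def build_instances(example, rng):
    """Turn one Ultrafeedback example (flan sources already removed) into up to
    len(CRITERIA) evaluation instances, following the steps listed above."""
    # Keep only responses whose ratings are numeric for *all* criteria,
    # so that an exact annotation_rating_average can be computed.
    completions = [
        c for c in example["completions"]
        if all(c["annotations"][a]["Rating"] != "N/A" for a in CRITERIA)
    ]
    if not completions:
        return []  # every response has at least one "N/A" -> instance dropped

    instances = []
    for criterion in CRITERIA:
        # Sample between 1 and the remaining number of responses.
        k = rng.randint(1, len(completions))
        instances.append({"criterion": criterion, "responses": rng.sample(completions, k)})
    return instances

# Toy example mirroring the score matrix above.
example = {
    "completions": [
        {"annotations": {a: {"Rating": r} for a, r in zip(CRITERIA, row)}}
        for row in [[3, "N/A", 5, 5], [3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
    ]
}
print(len(build_instances(example, random.Random(0))))  # 4 evaluation instances
```

The toy example at the bottom mirrors the score matrix above: the first response is dropped because of its "N/A" rating, and the remaining three are available for sampling.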

However, a problem arises when creating the pairwise dataset.
If 3.1k is 40% of the total number of evaluation instances satisfying the conditions, that implies 7.8k-8k evaluation instances fit for pairwise evaluation in total. And if 164k is the total number of instances used (as per your paper), then the total number of instances used should come out to almost 170k.

Furthermore, when counting the actual number of instances condition by condition (all conditions combined with AND), here's what I get:

  • uses instruction_following: 41.8k instances (40% = 16.7k)
  • has 2 candidate responses: 10.7k instances (40% = 4.3k)
  • the 2 scores are different: 7.9k instances (40% = a little over 3.1k). This is the desired number of instances!
  • the higher scoring sample has high quality, with overall score >= 8.0: 3k instances (40% = 1.2k)
  • the higher scoring sample has annotation rating average >= 4.75: 2.2k instances (40% = 0.9k)

Here, the final count of 0.9k instances satisfying all the required conditions seems really small, even accounting for the randomness of sampling.
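To make the filtering explicit, here is a hedged sketch of how I understand the conditions to be applied (all ANDed). The field names "criterion", "responses", "rating_average", and "overall_score" are placeholders rather than your repository's actual keys:

```python
def passes_pairwise_conditions(inst):
    """Apply the five conditions above, all ANDed; field names are placeholders."""
    if inst["criterion"] != "instruction_following":
        return False                                   # uses instruction_following
    if len(inst["responses"]) != 2:
        return False                                   # has 2 candidate responses
    a, b = inst["responses"]
    if a["rating_average"] == b["rating_average"]:
        return False                                   # the 2 scores are different
    better = max((a, b), key=lambda r: r["rating_average"])
    if better["overall_score"] < 8.0:
        return False                                   # higher-scoring sample is high quality
    if better["rating_average"] < 4.75:
        return False                                   # its annotation rating average is high enough
    return True

# Toy instance that satisfies every condition:
toy = {
    "criterion": "instruction_following",
    "responses": [
        {"rating_average": 5.0, "overall_score": 9.0},
        {"rating_average": 3.5, "overall_score": 6.0},
    ],
}
print(passes_pairwise_conditions(toy))  # True
```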

Is there something I'm doing wrong when constructing the dataset? I was hoping you could give me some pointers.
