Happy new year's eve!
Thank you for your repository and congratulations on your paper 😊
I was trying to reproduce your results but got stuck while creating the base data.
I've been following the description provided in a previous question, #1.
However, I got a very different number of instances for the pairwise dataset using Ultrafeedback.
Here's what I've tried:
After removing flan, I created an evaluation instance for each of the criteria ['helpfulness', 'honesty', 'instruction_following', 'truthfulness'], leading to 43k * 4 = 172k evaluation instances.
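Concretely, this step of my pipeline looks roughly like the sketch below (simplified; the `openbmb/UltraFeedback` field names `source`, `instruction`, and `completions` are just how I'm reading the dataset, so please correct me if you loaded it differently):

```python
from datasets import load_dataset

CRITERIA = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

# Load UltraFeedback and drop the flan subsets (~43k instructions remain).
ds = load_dataset("openbmb/UltraFeedback", split="train")
ds = ds.filter(lambda ex: "flan" not in ex["source"])

# One evaluation instance per (instruction, criterion) pair -> 43k * 4 = 172k.
eval_instances = []
for ex in ds:
    for criterion in CRITERIA:
        eval_instances.append({
            "instruction": ex["instruction"],
            "criterion": criterion,
            "completions": list(ex["completions"]),
        })
```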
For each instance, I remove only those model responses that contain a rating of "N/A" for any one of the criteria. I assume this is done to be able to calculate the `annotation_rating_average`. This also means that each evaluation instance may have a different number of model responses to sample from.
For example, if one of an instance's model responses contains an "N/A" rating, I remove that response and leave the rest, so the remaining scores look like this:

```
[[3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
```
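In code, the removal and the `annotation_rating_average` computation look roughly like this, continuing the sketch above (reading the per-criterion scores from `annotations[criterion]["Rating"]` is my own assumption about the schema):

```python
def has_no_na(completion):
    # Keep a model response only if none of its four criterion ratings is "N/A".
    return all(completion["annotations"][c]["Rating"] != "N/A" for c in CRITERIA)

def annotation_rating_average(completion):
    # Mean of the four criterion ratings for a single model response.
    ratings = [float(completion["annotations"][c]["Rating"]) for c in CRITERIA]
    return sum(ratings) / len(ratings)

for inst in eval_instances:
    inst["completions"] = [c for c in inst["completions"] if has_no_na(c)]

# Instances where every response has an "N/A" somewhere end up empty and get dropped.
eval_instances = [inst for inst in eval_instances if inst["completions"]]
```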
For each instance, I then sample between 1 and the remaining number of responses.
This leaves 167k instances: I drop the ~5k instances in which every model response contains at least one "N/A", i.e. the 5k instances for which you can't compute an exact `annotation_rating_average` for any of the model responses.
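And the sampling step itself, roughly (the seed and the uniform `randint` choice are mine; I don't know what you used):

```python
import random

random.seed(42)  # my own choice, just so my counts are reproducible

for inst in eval_instances:
    n_remaining = len(inst["completions"])
    # Sample anywhere between 1 and the remaining number of responses.
    k = random.randint(1, n_remaining)
    inst["sampled_completions"] = random.sample(inst["completions"], k)
```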
However, a problem arises when creating the pairwise dataset.
If 3.1K is 40% of the total number of evaluation instances satisfying the conditions, that implies roughly 7.8k-8k evaluation instances fit for pairwise evaluation in total. And if 164k is the total number of instances used (as per your paper), the overall total should then be almost 170k.
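Just to make the arithmetic behind that explicit (the 40% is the pairwise sampling ratio I've been assuming):

```python
pairwise_reported = 3_100   # 3.1K pairwise instances
sampling_ratio = 0.40       # the 40% I am assuming
print(pairwise_reported / sampling_ratio)  # 7750.0 -> roughly 7.8k-8k qualifying instances
```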
Furthermore, when counting the actual number of instances by condition, here's what I get:
(note: all conditions are added using AND)
- `instruction_following`: 41.8k instances (40% = 16.7k)
- `score >= 8.0`: 3k instances (40% = 1.2k)
- `annotation rating average >= 4.75`: 2.2k instances (40% = 0.9k)

Here, the final count of 0.9k instances satisfying all the required conditions seems really small, even considering the randomness of sampling. (A sketch of how I'm applying these filters is below.)
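For reference, here is how I'm applying those conditions, continuing the sketch above (mapping `score >= 8.0` to the `overall_score` field, and checking the conditions against the top-rated sampled response, are both guesses on my side and may well be where I'm going wrong):

```python
def satisfies_conditions(inst):
    # All three conditions combined with AND. I check "score" and the rating average
    # on the highest-rated sampled response, which is a guess on my part.
    if inst["criterion"] != "instruction_following":
        return False
    best = max(inst["sampled_completions"], key=annotation_rating_average)
    return (
        float(best["overall_score"]) >= 8.0
        and annotation_rating_average(best) >= 4.75
    )

candidates = [inst for inst in eval_instances if satisfies_conditions(inst)]
print(len(candidates))  # I end up with ~2.2k here; 40% of that gives the ~0.9k above
```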
Is there something I'm doing wrong when constructing the dataset? I was hoping you could give me some pointers.