
Clarification of details concerning the creation of the base data generation #3

Closed
@steve2972

Description


Happy New Year's Eve!

Thank you for your repository and congratulations on your paper 😊
I was trying to reproduce your results but got stuck while creating the base data.
I've been following the description provided in a previous question, #1.
However, I got a very different number of instances for the pairwise dataset built from Ultrafeedback.

Here's what I've tried:

  • After removing flan, I created an evaluation instance for each criterion in ['helpfulness', 'honesty', 'instruction_following', 'truthfulness'], leading to 43k * 4 = 172k evaluation instances.
  • For each instance, I remove only the model responses that contain a rating of "N/A" for any one of the criteria. I assume this is done so that the annotation_rating_average can be calculated. This also means that each evaluation instance may contain a different number of model responses to sample from.

For example, if the model response scores look like this:

```
// For criteria: ['helpfulness', 'honesty', 'instruction_following', 'truthfulness']
[[3, "N/A", 5, 5], [3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
```

I remove the model response containing "N/A" and leave the rest.

```
[[3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
```
  • For each instance, I sample between 1 and the remaining number of responses.
  • This leaves 167k instances, after dropping the 5k instances in which every model response contains at least one "N/A". In other words, there are 5k instances for which you can't calculate an exact annotation_rating_average for any of the model responses.
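For concreteness, here is a minimal sketch of the procedure above (not exact code; the field names "completions", "annotations", and "Rating" are my approximation of the Ultrafeedback schema and may not match your preprocessing):

```python
import random

CRITERIA = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def build_instances(example, rng):
    """Turn one Ultrafeedback example (flan sources already removed) into up to
    len(CRITERIA) evaluation instances, following the steps listed above."""
    # Keep only responses whose ratings are numeric for *all* criteria,
    # so that an exact annotation_rating_average can be computed.
    completions = [
        c for c in example["completions"]
        if all(c["annotations"][a]["Rating"] != "N/A" for a in CRITERIA)
    ]
    if not completions:
        return []  # every response has at least one "N/A" -> instance dropped

    instances = []
    for criterion in CRITERIA:
        # Sample between 1 and the remaining number of responses.
        k = rng.randint(1, len(completions))
        instances.append({"criterion": criterion, "responses": rng.sample(completions, k)})
    return instances

# Toy example mirroring the score matrix above.
example = {
    "completions": [
        {"annotations": {a: {"Rating": r} for a, r in zip(CRITERIA, row)}}
        for row in [[3, "N/A", 5, 5], [3, 4, 4, 4], [5, 5, 5, 5], [5, 5, 5, 5]]
    ]
}
print(len(build_instances(example, random.Random(0))))  # 4 evaluation instances
```

The toy example at the bottom mirrors the score matrix above: the first response is dropped because of its "N/A" rating, and the remaining three are available for sampling.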

However, a problem arises when creating the pairwise dataset.
If 3.1k is 40% of the total number of evaluation instances satisfying the conditions, that implies 7.8k-8k evaluation instances fit for pairwise evaluation in total. And if 164k is the total number of instances used (as per your paper), then the total number of instances used should come out to almost 170k.

Furthermore, when counting the actual number of instances condition by condition (all conditions combined with AND), here's what I get:

  • uses instruction_following: 41.8k instances (40% = 16.7k)
  • has 2 candidate responses: 10.7k instances (40% = 4.3k)
  • the 2 scores are different: 7.9k instances (40% = a little over 3.1k). This is the desired number of instances!
  • the higher scoring sample has high quality, with overall score >= 8.0: 3k instances (40% = 1.2k)
  • the higher scoring sample has annotation rating average >= 4.75: 2.2k instances (40% = 0.9k)

Here, the final count of 0.9k instances satisfying all the required conditions seems really small, even accounting for the randomness of sampling.
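To make the filtering explicit, here is a hedged sketch of how I understand the conditions to be applied (all ANDed). The field names "criterion", "responses", "rating_average", and "overall_score" are placeholders rather than your repository's actual keys:

```python
def passes_pairwise_conditions(inst):
    """Apply the five conditions above, all ANDed; field names are placeholders."""
    if inst["criterion"] != "instruction_following":
        return False                                   # uses instruction_following
    if len(inst["responses"]) != 2:
        return False                                   # has 2 candidate responses
    a, b = inst["responses"]
    if a["rating_average"] == b["rating_average"]:
        return False                                   # the 2 scores are different
    better = max((a, b), key=lambda r: r["rating_average"])
    if better["overall_score"] < 8.0:
        return False                                   # higher-scoring sample is high quality
    if better["rating_average"] < 4.75:
        return False                                   # its annotation rating average is high enough
    return True

# Toy instance that satisfies every condition:
toy = {
    "criterion": "instruction_following",
    "responses": [
        {"rating_average": 5.0, "overall_score": 9.0},
        {"rating_average": 3.5, "overall_score": 6.0},
    ],
}
print(passes_pairwise_conditions(toy))  # True
```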

Is there something I'm doing wrong when constructing the dataset? I was hoping you could give me some pointers.
