
Clarification on dataset mixer #157

Open · deep-diver opened this issue Apr 18, 2024 · 5 comments

@deep-diver (Contributor)

From the README in /scripts:

datasets_mixer:
    dataset_1: 0.5  # Use 50% of the training examples
    dataset_2: 0.66 # Use 66% of the training examples
    dataset_3: 0.10 # Use 10% of the training examples
dataset_splits:
- train_xxx         # The training splits to mix
- test_xxx          # The test splits to mix

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There isn't an explanation of how each dataset contributes to the test_xxx split.

However, the actual implementation seems to search for the test_xxx split in all of the specified datasets:

if "train" in split:
raw_train_datasets.append(dataset)
elif "test" in split:
raw_val_datasets.append(dataset)
else:
raise ValueError(f"Split type {split} not recognized as one of test or train.")
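
For context, my reading of the surrounding loop is roughly this (a sketch with assumed variable names, not the exact source; the dataset ids are placeholders):

from datasets import load_dataset

dataset_mixer = {"dataset_1": 0.5, "dataset_2": 0.66, "dataset_3": 0.10}
dataset_splits = ["train_xxx", "test_xxx"]

raw_train_datasets, raw_val_datasets = [], []
for ds_name, frac in dataset_mixer.items():       # every dataset in the mixer...
    for split in dataset_splits:                  # ...is loaded for every listed split
        dataset = load_dataset(ds_name, split=split)
        if "train" in split:
            raw_train_datasets.append(dataset)    # presumably subsampled by frac later
        elif "test" in split:
            raw_val_datasets.append(dataset)      # appended as-is
        else:
            raise ValueError(f"Split type {split} not recognized as one of test or train.")

So test_xxx appears to be looked up in every dataset, not just one.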

Could you please explain the relationships between multiple datasets and splits?
Thank you.

@shabie commented Apr 21, 2024

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There isn't an explanation of how each dataset contributes to the test_xxx split.

Each dataset should have separate train and test splits. This is made clear in the docstring, where the expectation is that they start with train_ and test_ respectively. The percentages sample a fraction of the datapoints from the train split only. The corresponding test split is taken in full, since subsampling for validation seems pointless (unless validation is very expensive, in which case maybe).
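
In code, my mental model of the train-side subsampling is something like this (a sketch based on the docstring wording, not copied from the source):

from datasets import Dataset

train_ds = Dataset.from_dict({"text": [f"ex{i}" for i in range(100)]})  # toy stand-in
frac = 0.66
# Train splits are shuffled, then truncated to the mixer fraction...
train_subset = train_ds.shuffle(seed=42).select(range(int(frac * len(train_ds))))
print(len(train_subset))  # 66
# ...while the matching test split is concatenated in full; frac never touches it.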

If the confusion was that the data mixer automatically uses the "unused" part of the train split as a test dataset (the way sklearn lets us do), then no, that doesn't happen here. I like it because it keeps the test set from being mistakenly used for training when you merely change the percentages of the mix.

Anyhow, all of this is based on my understanding of the code. Hope it helps, and if I am wrong, please correct me :)

@deep-diver (Contributor, Author)

Thank you @shabie

I think it's common to have the test dataset in a single repo while the training data comes from multiple sources.

At least this is my use case. To work around it, I ended up merging multiple datasets into a single one myself, as in the sketch below. Just hoping it could be done in the alignment-handbook too.
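
For reference, the merge was essentially this (a sketch; the repo ids are placeholders for my own datasets):

from datasets import DatasetDict, concatenate_datasets, load_dataset

sources = ["org/train_source_a", "org/train_source_b"]  # placeholder repo ids
train = concatenate_datasets([load_dataset(s, split="train") for s in sources])
test = load_dataset("org/test_source", split="test")    # single test source

DatasetDict({"train": train, "test": test}).push_to_hub("org/merged-dataset")
# Then point the dataset mixer at the single merged repo.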

@JIElite commented Jun 11, 2024

If we assign a dataset a mix ratio of 0.0, what will happen to its test set?

  • Will it use the full test set for evaluation?
  • Or will it not use anything from that dataset?

@deep-diver (Contributor, Author)

@JIElite

AFAIK, the ratio doesn't have any impact on the test split.
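
If the subsampling works like the select-based sketch above, a 0.0 ratio just selects zero training rows, while the test split is appended untouched:

int(0.0 * 52000)  # -> 0 rows selected from that dataset's train split (52000 is an arbitrary size)
# the test split is still appended in full, so it is still used for evaluation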

@JIElite commented Jun 12, 2024

@deep-diver
Thanks for the reply.
So it will still use the test set for evaluation, right? Even if we set the mix ratio to 0.0.
