
Clarification on dataset mixer #157

Open · deep-diver opened this issue Apr 18, 2024 · 5 comments

@deep-diver (Contributor)

From the README in /scripts:

datasets_mixer:
    dataset_1: 0.5  # Use 50% of the training examples
    dataset_2: 0.66 # Use 66% of the training examples
    dataset_3: 0.10 # Use 10% of the training examples
dataset_splits:
- train_xxx         # The training splits to mix
- test_xxx          # The test splits to mix

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There isn't an explanation of how each dataset contributes to the test_xxx split.

However, the actual implementation seems to search for the test_xxx split in all of the specified datasets:

if "train" in split:
raw_train_datasets.append(dataset)
elif "test" in split:
raw_val_datasets.append(dataset)
else:
raise ValueError(f"Split type {split} not recognized as one of test or train.")
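
For context, my reading of the surrounding loop is roughly this (a sketch with assumed variable names, not the exact source; the dataset ids are placeholders):

from datasets import load_dataset

dataset_mixer = {"dataset_1": 0.5, "dataset_2": 0.66, "dataset_3": 0.10}
dataset_splits = ["train_xxx", "test_xxx"]

raw_train_datasets, raw_val_datasets = [], []
for ds_name, frac in dataset_mixer.items():       # every dataset in the mixer...
    for split in dataset_splits:                  # ...is loaded for every listed split
        dataset = load_dataset(ds_name, split=split)
        if "train" in split:
            raw_train_datasets.append(dataset)    # presumably subsampled by frac later
        elif "test" in split:
            raw_val_datasets.append(dataset)      # appended as-is
        else:
            raise ValueError(f"Split type {split} not recognized as one of test or train.")

So test_xxx appears to be looked up in every dataset, not just one.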

Could you please explain the relationships between multiple datasets and splits?
Thank you.

@shabie commented Apr 21, 2024

From the comments, it looks like ONLY training samples from dataset_1, dataset_2, and dataset_3 are considered. There isn't an explanation of how each dataset contributes to the test_xxx split.

Each dataset should have separate train and test splits. This is made clear in the docstring, where the expectation is that they start with train_ and test_ respectively. The percentages sample a fraction of the datapoints from the train split only. The corresponding test split is taken in full, since subsampling for validation seems pointless (unless validation is very expensive, in which case maybe).
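
In code, my mental model of the train-side subsampling is something like this (a sketch based on the docstring wording, not copied from the source):

from datasets import Dataset

train_ds = Dataset.from_dict({"text": [f"ex{i}" for i in range(100)]})  # toy stand-in
frac = 0.66
# Train splits are shuffled, then truncated to the mixer fraction...
train_subset = train_ds.shuffle(seed=42).select(range(int(frac * len(train_ds))))
print(len(train_subset))  # 66
# ...while the matching test split is concatenated in full; frac never touches it.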

If the confusion was that the data mixer automatically uses the "unused" part of the train split as a test dataset (the way sklearn lets us do), then no, that doesn't happen here. I like it because it keeps the test set from being mistakenly used for training when you merely change the percentages of the mix.

Anyhow, all of this is based on my understanding of the code. Hope it helps, and if I am wrong, please correct me :)

@deep-diver (Contributor, Author)

Thank you @shabie

I think it's common to have the test dataset in a single repo while the training data comes from multiple sources.

At least this is my use case. To work around it, I ended up merging multiple datasets into a single one myself, as in the sketch below. Just hoping it could be done in the alignment-handbook too.
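
For reference, the merge was essentially this (a sketch; the repo ids are placeholders for my own datasets):

from datasets import DatasetDict, concatenate_datasets, load_dataset

sources = ["org/train_source_a", "org/train_source_b"]  # placeholder repo ids
train = concatenate_datasets([load_dataset(s, split="train") for s in sources])
test = load_dataset("org/test_source", split="test")    # single test source

DatasetDict({"train": train, "test": test}).push_to_hub("org/merged-dataset")
# Then point the dataset mixer at the single merged repo.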

@JIElite commented Jun 11, 2024

If we assign a dataset a mix ratio of 0.0, what will happen to its test set?

  • Will it use the full test set for evaluation?
  • Or will it not use anything from that dataset?

@deep-diver (Contributor, Author)

@JIElite

AFAIK, the ratio doesn't have any impact on the test split.
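
If the subsampling works like the select-based sketch above, a 0.0 ratio just selects zero training rows, while the test split is appended untouched:

int(0.0 * 52000)  # -> 0 rows selected from that dataset's train split (52000 is an arbitrary size)
# the test split is still appended in full, so it is still used for evaluation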

@JIElite commented Jun 12, 2024

@deep-diver
Thanks for the reply.
So it will still use the test set for evaluation, right? Even if we set the mix ratio to 0.0.
