
How are multiple datasets loaded? #1330

Open
fxnie opened this issue Dec 17, 2024 · 4 comments
Labels
feature request New feature or request

Comments

@fxnie

fxnie commented Dec 17, 2024

local_setup.yml:
"data_path": "/workspace/mnt/cm-nfx/gpt-neox/data/mamba/mamba-2.8b/algorithmic_corpus/algorithmic_corpus_text_document,/workspace/mnt/cm-nfx/gpt-neox/data/mamba/mamba-2.8b/opencode_sft/opencode_sft_text_document,/workspace/mnt/cm-nfx/gpt-neox/data/mamba/mamba-2.8b/synthetic_code_snippet/synthetic_code_snippet_text_document,/workspace/mnt/cm-nfx/gpt-neox/data/mamba/mamba-2.8b/synthetic_qa/synthetic_qa_text_document",
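For context, when a single comma-separated `data_path` like this is used, the train/valid/test ratios are typically governed by a separate Megatron-style `split` setting rather than by the path list itself. A hedged sketch (the `split` key and its "train,valid,test" semantics are assumptions; check the neox_args for your gpt-neox version):

```yaml
# Hypothetical fragment: one data_path string, ratios set via "split".
# "split" is assumed to take Megatron-style "train,valid,test" weights.
"data_path": "/workspace/mnt/cm-nfx/gpt-neox/data/mamba/mamba-2.8b/algorithmic_corpus/algorithmic_corpus_text_document",
"split": "969,30,1"
```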

Another question: when setting the training set ratio, how do I reduce the ratios of the test set and validation set to 0?

@fxnie fxnie added the feature request New feature or request label Dec 17, 2024
@iPRET

iPRET commented Dec 18, 2024

Not sure if this helps, because I'm a user myself, not a maintainer, but:
I think a nice alternative to data splits is to define the training, validation, and test datasets explicitly. For example:

{
  "train_data_paths": ["/scratch/project_465001281/IP/tokenized/gpt_neox/en3_text_document", "/scratch/project_465001281/IP/tokenized/gpt_neox/lv3_text_document"],
  "valid_data_paths": ["/scratch/project_465001281/IP/tokenized/gpt_neox/en3_flores_text_document"],
  "test_data_paths": ["/scratch/project_465001281/IP/tokenized/gpt_neox/en3_flores_text_document"]
}

The example above also shows how to load two datasets in parallel ("train_data_paths").
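Building on that, I believe the blend between parallel datasets can be controlled with matching `*_data_weights` lists. A sketch, assuming a `train_data_weights` option exists in your gpt-neox version (verify against the neox_args before relying on it):

```json
{
  "train_data_paths": ["/scratch/project_465001281/IP/tokenized/gpt_neox/en3_text_document", "/scratch/project_465001281/IP/tokenized/gpt_neox/lv3_text_document"],
  "train_data_weights": [1.0, 1.0]
}
```

Equal weights sample both datasets in proportion to their size; unequal weights would up- or down-sample one relative to the other.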

@fxnie

fxnie commented Dec 19, 2024

How should I configure training to use only the training set, without the test and validation sets?

@iPRET

iPRET commented Dec 19, 2024

A workaround could be to generate a small dummy dataset, e.g. from a .jsonl file containing the line {"text": "eeeeeeeeee"}, point the valid/test paths at it, and set eval_interval to a large number if you really don't care about evaluating the model.
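The workaround above can be sketched in two steps: write the one-line dummy .jsonl, then tokenize it into the binary format the data paths expect. The preprocessing script name and flags below are assumptions (check `tools/` in your gpt-neox checkout for the exact script and its arguments):

```shell
# Write a one-line dummy .jsonl to stand in for the valid/test sets.
printf '%s\n' '{"text": "eeeeeeeeee"}' > dummy.jsonl
cat dummy.jsonl

# Then tokenize it with gpt-neox's preprocessing script
# (script path and flags may differ by version; treat this as a sketch):
#   python tools/datasets/preprocess_data.py --input dummy.jsonl \
#       --output-prefix dummy --tokenizer-type HFTokenizer \
#       --vocab-file tokenizer.json --append-eod
```

The resulting `dummy_text_document` prefix would then go into `valid_data_paths` and `test_data_paths`.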

@fxnie

fxnie commented Dec 25, 2024

Is there a way to load only the data from these two files, as you described? "/scratch/project_465001281/IP/tokenized/gpt_neox/en3_text_document" and "/scratch/project_465001281/IP/tokenized/gpt_neox/lv3_text_document"
