
Commit a7094aa

Merge pull request #177 from togethercomputer/fix-issue-47
Resolve Issue #47: Clarify training README
2 parents 6372379 + cc088bb


training/README.md

Lines changed: 11 additions & 4 deletions
```diff
@@ -1,6 +1,6 @@
 # OpenChatKit Training
 
-This directory contains code for training a chat model using OpenChatKit. The main training script is `finetune_GPT-NeoXT-Chat-Base-20B.sh`.
+This directory contains code for training a chat model using OpenChatKit. The main training script is `finetune_GPT-NeoXT-Chat-Base-20B.sh`.
 
 To customize training, make a copy of the script and modify the arguments.
 
```
```diff
@@ -26,12 +26,13 @@ The following arguments should be carefully set:
 - `--net-interface`: Network interface. Should be consistent with `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`.
 
 The following arguments can be tuned / changed:
-- `--train-log-backend `: How to log the training info. {print, loguru, wandb}.
+- `--train-log-backend `: How to log the training info. {print, loguru, wandb}.
 - `--optimizer`: Optimizer type. {adam, 8bit-adam} (8bit-adam requires `pip install bitsandbytes`)
 - `--load-pretrained-model`: Whether to load model weights. Usually `true`.
-- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training separate task names by `,`.
-  There is an optional sampling weight after each task name, separated by `:` (default is 1.0). Sampling weights will be normalized.
+- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training separate task names by `,`.
+  There is an optional sampling weight after each task name, separated by `:` (default is 1.0). Sampling weights will be normalized.
   E.g. it should be like `--task-name cot:0.1,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0`.
+  The number after the colon indicates the sampling weight for the task during training. For example, `cot:0.1` means the `cot` task will be sampled with a weight of 0.1.
 - `--checkpoint-path`: Path to save fine-tuned checkpoints.
 - `--checkpoint-steps`: Save ckpt every `checkpoint-steps`.
 - `--total-steps`: Total number of steps for training. (This counts all `gradient-accumulate-step`s.)
```
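The line added at new line 35 spells out the weighting rule. To make the normalization concrete, here is a minimal Python sketch of parse-and-normalize behavior consistent with that description; `parse_task_spec` is a hypothetical helper for illustration, not OpenChatKit's actual code:

```python
import random

def parse_task_spec(spec):
    """Split a --task-name value like 'cot:0.1,/path_task0.jsonl:1.0'
    into (task, weight) pairs; a task without ':' defaults to weight 1.0."""
    pairs = []
    for item in spec.split(","):
        name, _, weight = item.partition(":")
        pairs.append((name, float(weight) if weight else 1.0))
    # Normalize weights so they sum to 1, as the README says happens.
    total = sum(w for _, w in pairs)
    return [(name, w / total) for name, w in pairs]

pairs = parse_task_spec("cot:0.1,/path_task0.jsonl:1.0")
print(pairs)  # [('cot', 0.0909...), ('/path_task0.jsonl', 0.9090...)]

# Sample tasks in proportion to the normalized weights:
names, weights = zip(*pairs)
batch_tasks = random.choices(names, weights=weights, k=8)
```

With `cot:0.1,/path_task0.jsonl:1.0`, the normalized weights come out to roughly 0.09 and 0.91, so the `jsonl` task dominates sampling.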
```diff
@@ -48,3 +49,9 @@ The following arguments usually do not change:
 - `--fp16`: Flag to enable FP16 mixed precision training. Should always adding it for the current impl.
 - `--pp-mode`: always `gpipe`
 - `--profiling`: {no-profiling, tidy_profiling}. `tidy_profiling` will generate profile jsons.
+
+## Adding Your Own Data to the DATASETS
+
+To add your own data to the training process, you should create a `jsonl` file where each line is a JSON object representing a single training example. Once you have your `jsonl` file, you can include it in the `--task-name` argument with an appropriate sampling weight. For instance, if your file is located at `/path_to_your_data/your_data.jsonl` and you wish to give it a sampling weight of 0.5, you would add `/path_to_your_data/your_data.jsonl:0.5` to the `--task-name` argument.
+
+If you have any questions or need further assistance, please refer to the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repository or contact us through our [website](https://www.together.ai/contact).
```
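To make the new section concrete, here is a minimal sketch of producing such a `jsonl` file. The `"text"` field and the example records are assumptions for illustration; check OpenChatKit's data documentation for the exact schema the loader expects:

```python
import json

# Hypothetical training examples -- the "text" field follows the
# <human>/<bot> convention seen in OpenChatKit chat data, but the
# exact schema should be verified against the repo's docs.
examples = [
    {"text": "<human>: What is a jsonl file?\n<bot>: A file with one JSON object per line."},
    {"text": "<human>: How do I add my own data?\n<bot>: Write it as jsonl and pass it via --task-name."},
]

with open("your_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")  # one JSON object per line
```

The resulting file can then be referenced as `/path_to_your_data/your_data.jsonl:0.5` inside `--task-name`, exactly as the paragraph above describes.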
