This directory contains code for training a chat model using OpenChatKit. The main training script is `finetune_GPT-NeoXT-Chat-Base-20B.sh`.
To customize training, make a copy of the script and modify the arguments.
Environment vars that should be set:
export GLOO_SOCKET_IFNAME=lo # this interface should be consistent with `--net-interface`
export NCCL_SOCKET_IFNAME=lo # this interface should be consistent with `--net-interface`
export WANDB_NAME=gptj-test # wandb run name
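
A mismatch between these variables and `--net-interface` is easy to overlook, so a small pre-flight check may help. The sketch below is illustrative only: `check_interfaces` is a hypothetical helper, not part of OpenChatKit, and it assumes you pass in the same value you give to `--net-interface`.

```python
import os


def check_interfaces(net_interface, environ=None):
    """Return a list of problems; an empty list means the settings agree.

    Hypothetical helper: `net_interface` is the value you plan to pass
    to --net-interface. `environ` defaults to the current process environment.
    """
    environ = os.environ if environ is None else environ
    problems = []
    for var in ("GLOO_SOCKET_IFNAME", "NCCL_SOCKET_IFNAME"):
        value = environ.get(var)
        if value is None:
            problems.append(f"{var} is not set")
        elif value != net_interface:
            problems.append(
                f"{var}={value} does not match --net-interface {net_interface}"
            )
    return problems


# Example: report any mismatch for a local run on the loopback interface.
for problem in check_interfaces("lo"):
    print("warning:", problem)
```

Passing `environ` explicitly makes the check easy to try without touching the real environment.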
The following arguments should be carefully set:
- `--model-name`: The path to the model checkpoint, sharded by layers.
- `--tokenizer-name`: Usually the same as `--model-name`. You can also use a Hugging Face model name.
- `--model-type`: The model type. {gptj}. More model types will be added soon.
- `--num-layers`: Number of Transformer layers for each GPU. E.g., GPT-J has 28 layers, so if two GPUs form a pipeline, `--num-layers` should be 14.
- `--embedding-dim`: The hidden size of the model. For GPT-J-6B this is 4096. This is used to create buffers.
- `--dist-url`: URL of the rank 0 worker (master). It is the same for all workers and must be accessible by all of them. For local training (single machine, multiple GPUs), this can be something like `--dist-url tcp://127.0.0.1:7033`.
- `--world-size`: The total number of workers. `world-size == pipeline-group-size * data-group-size`.
- `--pipeline-group-size`: Number of GPU workers in each pipeline.
- `--data-group-size`: Number of data-parallel workers. Also the number of pipelines.
- `--net-interface`: Network interface. Should be consistent with `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`.
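
The relationship between `--world-size`, `--pipeline-group-size`, `--data-group-size`, and `--num-layers` described above can be worked out with a little arithmetic. The following is a minimal sketch; `plan_topology` is a hypothetical helper, and the requirement that layers divide evenly across pipeline stages is an assumption made here for simplicity, not something the training script is known to enforce.

```python
def plan_topology(total_layers, pipeline_group_size, data_group_size):
    """Derive --world-size and --num-layers from the parallelism layout.

    Hypothetical helper. Assumes the model's layers are split evenly
    across the GPUs of one pipeline.
    """
    if total_layers % pipeline_group_size != 0:
        raise ValueError("total_layers must be divisible by pipeline_group_size")
    return {
        # world-size == pipeline-group-size * data-group-size
        "world_size": pipeline_group_size * data_group_size,
        # layers held by each GPU in a pipeline
        "num_layers": total_layers // pipeline_group_size,
    }


# GPT-J has 28 layers; two GPUs form a single pipeline.
print(plan_topology(total_layers=28, pipeline_group_size=2, data_group_size=1))
# prints {'world_size': 2, 'num_layers': 14}
```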
The following arguments can be tuned / changed:
- `--train-log-backend`: How to log the training info. {print, loguru, wandb}.
- `--optimizer`: Optimizer type. {adam, 8bit-adam} (8bit-adam requires `pip install bitsandbytes`).
- `--load-pretrained-model`: Whether to load model weights. Usually `true`.
- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training, separate task names with `,`. Each task name may be followed by an optional sampling weight, separated by `:` (default is 1.0). Sampling weights will be normalized. E.g., `--task-name cot:0.1,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0`.
- `--checkpoint-path`: Path to save fine-tuned checkpoints.
- `--checkpoint-steps`: Save a checkpoint every `checkpoint-steps` steps.
- `--total-steps`: Total number of steps for training. (This counts all `gradient-accumulate-step`s.)
- `--warmup-steps`: Learning rate warmup steps.
- `--lr`: Learning rate.
- `--seq-length`: Sequence length.
- `--batch-size`: Batch size for each GPU device (for each gradient accumulation step).
- `--micro-batch-size`: Micro batch size for pipeline parallelism. 1 works fine.
- `--gradient-accumulate-step`: Accumulate gradients for several steps before updating parameters. This is another way to achieve large batch sizes when GPU memory is limited.
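
Two of the arguments above involve small calculations: the `--task-name` sampling weights and the interaction between `--total-steps`, `--batch-size`, and `--gradient-accumulate-step`. The sketch below illustrates both under stated assumptions. `parse_task_names` and `update_plan` are hypothetical helpers that follow the descriptions in this README; they are not the parser or scheduler the training code actually uses, and `update_plan` reflects one reading of "this counts all gradient-accumulate-steps".

```python
def parse_task_names(spec):
    """Parse 'name[:weight],name[:weight],...' into (name, normalized_weight) pairs.

    Hypothetical helper. A missing weight defaults to 1.0, and weights are
    normalized to sum to 1. Assumes task names and paths contain no ':'.
    """
    tasks = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, weight = item.rpartition(":")
        if sep:
            tasks.append((name, float(weight)))
        else:
            tasks.append((item, 1.0))  # no explicit weight given
    total = sum(weight for _, weight in tasks)
    return [(name, weight / total) for name, weight in tasks]


def update_plan(total_steps, gradient_accumulate_step, batch_size):
    """Relate step counts to parameter updates (an interpretation, not the trainer's logic)."""
    return {
        # total-steps counts every accumulation step, so updates are fewer
        "parameter_updates": total_steps // gradient_accumulate_step,
        # samples each GPU contributes to one parameter update
        "samples_per_update_per_gpu": batch_size * gradient_accumulate_step,
    }


for name, weight in parse_task_names("cot:0.1,/path_task0.jsonl:1.0"):
    print(f"{name}: {weight:.3f}")

print(update_plan(total_steps=1000, gradient_accumulate_step=4, batch_size=8))
```

Returning a list of pairs rather than a dictionary keeps repeated task names, as in the example argument above, as separate entries.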
The following arguments usually do not change:
- `--dp-backend`: {nccl, gloo}, default nccl.
- `--dp-mode`: {allreduce}.
- `--fp16`: Flag to enable FP16 mixed precision training. Should always be added for the current implementation.
- `--pp-mode`: Always `gpipe`.
- `--profiling`: {no-profiling, tidy_profiling}. `tidy_profiling` will generate profile JSON files.