We use the wiki103 (WikiText-103) data as an example, which is publicly available. It contains three text files: `train.txt`, `valid.txt` and `test.txt`. We use `train.txt` to train the model and `valid.txt` to evaluate the intermediate checkpoints. We first need to run `prepare_data.py` to tokenize these text files. We concatenate all documents into a single text and split it into lines of tokens, where each line has at most 510 tokens (2 positions are reserved for the special tokens `[CLS]` and `[SEP]`).
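For reference, the chunking step amounts to something like the sketch below. The tokenizer name and file paths are illustrative assumptions, not the actual interface of `prepare_data.py`.

```python
# Illustrative sketch of the chunking done during data preparation
# (assumed behavior, not the script's exact implementation).
from transformers import AutoTokenizer

MAX_TOKENS = 510  # 512 minus the two special tokens [CLS] and [SEP]

def chunk_file(input_path, output_path, tokenizer_name="microsoft/deberta-v3-base"):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    tokens = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            # Concatenate all documents into one token stream.
            tokens.extend(tokenizer.tokenize(line.strip()))
    with open(output_path, "w", encoding="utf-8") as out:
        # Split the stream into lines of at most 510 tokens each.
        for i in range(0, len(tokens), MAX_TOKENS):
            out.write(" ".join(tokens[i : i + MAX_TOKENS]) + "\n")

chunk_file("wiki103/train.txt", "wiki103/train.tokens")
```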
Run `mlm.sh` to train a BERT-like model that uses MLM as the pre-training task (a short sketch of the masking recipe follows the examples below). For example,
- `mlm.sh bert-base` will train a BERT base model which uses absolute position encoding
- `mlm.sh deberta-base` will train a DeBERTa base model which uses Disentangled Attention
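The sketch below shows the standard BERT-style MLM masking recipe (mask 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). It is for orientation only, not the implementation used by `mlm.sh`.

```python
# Minimal sketch of the standard MLM masking recipe (BERT-style).
# Not mlm.sh's actual implementation.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # replace with a random token
            # else: keep the original token unchanged
    return inputs, labels

inputs, labels = mask_tokens([5, 17, 42, 8], mask_id=4, vocab_size=128000)
```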
Run `rtd.sh` to train an ELECTRA-like model that uses RTD (replaced token detection) as the pre-training task (a short sketch of the objective follows the examples below). For example,
- `rtd.sh deberta-v3-xsmall` will train a DeBERTaV3 XSmall model with 9M backbone parameters (12 layers, 256 hidden size) plus 32M embedding parameters (128k vocabulary size)
- `rtd.sh deberta-v3-base` will train a DeBERTaV3 Base model with 81M backbone parameters (12 layers, 768 hidden size) plus 96M embedding parameters (128k vocabulary size)
- `rtd.sh deberta-v3-large` will train a DeBERTaV3 Large model with 288M backbone parameters (24 layers, 1024 hidden size) plus 128M embedding parameters (128k vocabulary size)
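For orientation, the RTD objective works as follows: a small generator fills in masked positions, and the discriminator predicts for every token whether it was replaced. The sketch below uses assumed `generator` and `discriminator` callables and shows only the objective, not the full DeBERTaV3 training loop implemented by `rtd.sh`.

```python
# Simplified sketch of the RTD (replaced token detection) objective.
# generator/discriminator are assumed callables returning logits tensors.
import torch

def rtd_step(generator, discriminator, input_ids, masked_ids, mask_positions):
    # The generator predicts tokens at the masked positions
    # (its own MLM loss is computed separately).
    gen_logits = generator(masked_ids)                          # [batch, seq, vocab]
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask_positions, sampled, input_ids)

    # The discriminator labels each token: 1 if it differs from the original, else 0.
    replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)          # [batch, seq]
    rtd_loss = torch.nn.functional.binary_cross_entropy_with_logits(disc_logits, replaced)
    return rtd_loss
```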
Run `rtd.sh deberta-v3-X-continue` to continue training DeBERTaV3-X models from the following checkpoints.
Please check the script to specify the initialization models of the generator and discriminator.
Model | Generator | Discriminator |
---|---|---|
DeBERTa-V3-XSmall | Hugging Face | Hugging Face |
DeBERTa-V3-Small | Hugging Face | Hugging Face |
DeBERTa-V3-Large | Hugging Face | Hugging Face |
To train with multiple nodes, you need to specify the following environment variables:
- `WORLD_SIZE` - Total number of nodes used for the training
- `MASTER_ADDR` - The IP address or host name of the master node
- `MASTER_PORT` - The port of the master node
- `RANK` - The rank of the current node

For example, to train a model with 2 nodes:
- On node0,
```bash
export WORLD_SIZE=2
export MASTER_ADDR=node0
export MASTER_PORT=7488
export RANK=0
./rtd.sh deberta-v3-xsmall
```
- On node1,
```bash
export WORLD_SIZE=2
export MASTER_ADDR=node0
export MASTER_PORT=7488
export RANK=1
./rtd.sh deberta-v3-xsmall
```
- `embedding_sharing` - The embedding sharing method:
  - GDES: Gradient-Disentangled Embedding Sharing
  - ES: Embedding Sharing
  - NES: No Embedding Sharing
- `relative_attention` - Whether to use relative attention
- `pos_att_type` - Relative position encoding type:
  - P2C: Position-to-content attention
  - C2P: Content-to-position attention
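For orientation, `relative_attention` and `pos_att_type` also surface as fields of the model configuration; the snippet below shows them on Hugging Face's `DebertaV2Config` as an illustrative assumption (the `embedding_sharing` choice is a pre-training option of `rtd.sh` and is not part of that config).

```python
# Illustrative only: how relative_attention and pos_att_type appear in a
# Hugging Face DebertaV2Config. Values shown are examples, not the settings
# used by rtd.sh or mlm.sh.
from transformers import DebertaV2Config

config = DebertaV2Config(
    hidden_size=768,
    num_hidden_layers=12,
    relative_attention=True,        # enable disentangled (relative) attention
    pos_att_type=["p2c", "c2p"],    # position-to-content and content-to-position terms
)
print(config.pos_att_type)
```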