Hack the nanoGPT to support mixture of experts.
pip install torch numpy transformers datasets tiktoken wandb tqdm
Dependencies:
- pytorch <3
- numpy <3
- transformers for huggingface transformers <3 (to load GPT-2 checkpoints)
- datasets for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
- tiktoken for OpenAI's fast BPE code <3
- wandb for optional logging <3
- tqdm for progress bars <3
To enable MoE training, modify a config such as config/train_shakespeare_char.py with:
# moe settings
use_moe = True
num_experts = 10
num_experts_per_tok = 2
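To make the two settings above concrete, here is a minimal pure-Python sketch of top-k expert routing for a single token. It assumes a Mixtral-style scheme (pick the `num_experts_per_tok` largest gate logits, then softmax over just those); whether this repository's model code does exactly this is an assumption, and `route_token`, `moe_forward`, and the toy experts are hypothetical names used only for illustration.

```python
import math

def route_token(gate_logits, num_experts_per_tok):
    """Return (expert indices, mixing weights) for one token."""
    # Indices of the k largest gate logits, best first.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i],
                 reverse=True)[:num_experts_per_tok]
    # Softmax over only the selected logits, so the weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def moe_forward(x, gate_logits, experts, num_experts_per_tok):
    """Weighted sum of the outputs of the selected experts only."""
    top, weights = route_token(gate_logits, num_experts_per_tok)
    return sum(w * experts[i](x) for i, w in zip(top, weights))

# Toy setup mirroring the config: 10 experts, 2 active per token.
num_experts = 10
num_experts_per_tok = 2
# Hypothetical experts: expert i just scales its input by (i + 1).
experts = [lambda x, s=i + 1: x * s for i in range(num_experts)]

# Hypothetical gate logits for one token (normally produced by a linear layer).
gate_logits = [0.0] * num_experts
gate_logits[3] = 2.0
gate_logits[7] = 1.0

top, weights = route_token(gate_logits, num_experts_per_tok)
out = moe_forward(1.0, gate_logits, experts, num_experts_per_tok)
```

Only the selected experts run for a given token, which is why the per-token compute grows with `num_experts_per_tok` rather than with `num_experts`.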
If you are not a deep learning professional and you just want to feel the magic and get your feet wet, the fastest way to get started is to train a character-level GPT on the works of Shakespeare. First, we download it as a single (1MB) file and turn it from raw text into one large stream of integers:
$ python data/shakespeare_char/prepare.py
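As a rough sketch of what "turning raw text into one large stream of integers" means at the character level, the snippet below builds a vocabulary from the unique characters and maps each character to its index. The sample string, the 90/10 split, and the 16-bit storage are assumptions for illustration; the actual prepare.py is not reproduced here.

```python
from array import array

# Stand-in for the downloaded Shakespeare file (assumption: any text works).
text = "First Citizen: Before we proceed any further, hear me speak."

# Vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)

# Assumed 90/10 train/validation split, stored as unsigned shorts
# (typically 16 bits, plenty for a character-level vocabulary).
split = int(0.9 * len(ids))
train_ids = array("H", ids[:split])
val_ids = array("H", ids[split:])

# train_ids.tobytes() and val_ids.tobytes() are the kind of payload
# a .bin file would hold.
```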
Running the script creates train.bin and val.bin in that data directory. Now it is time to train your GPT. The size of it very much depends on the computational resources of your system:
I have a GPU. Great, we can quickly train a baby GPT with the settings provided in the config/train_shakespeare_char.py config file:
$ python train.py config/train_shakespeare_char.py
If you peek inside it, you'll see that we're training a GPT with a context size of up to 256 characters, 384 feature channels, and it is a 6-layer Transformer with 6 heads in each layer. On one A100 GPU this training run takes about 3 minutes and the best validation loss is 1.4697. Based on the configuration, the model checkpoints are being written into the --out_dir directory out-shakespeare-char. So once the training finishes we can sample from the best model by pointing the sampling script at this directory:
$ python sample.py --out_dir=out-shakespeare-char
This generates a few samples, for example:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
DUKE VINCENTIO:
I thank your eyes against it.
DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?
DUKE VINCENTIO:
If you have done evils of all disposition
To end his power, the day of thrust for a common men
That I leave, to fight with over-liking
Hasting in a roseman.
lol ¯\_(ツ)_/¯. Not bad for a character-level model after 3 minutes of training on a GPU. Better results are quite likely obtainable by instead finetuning a pretrained GPT-2 model on this dataset (see finetuning section later).
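For a sense of scale, the back-of-the-envelope estimate below counts the weight matrices in the Transformer blocks for the 6-layer, 384-channel model described above, with and without the MoE settings (10 experts, 2 per token). It is a sketch under stated assumptions: each expert is taken to be a GPT-2-style MLP with 4x expansion, the router is a single linear layer, and embeddings, biases, and layer norms are ignored. The experts in this repository may be shaped differently, and the function names are hypothetical.

```python
def block_params(n_embd, num_experts=1):
    """Weights stored in one Transformer block (matrices only)."""
    attn = 4 * n_embd * n_embd          # q, k, v, and output projections
    mlp = 8 * n_embd * n_embd           # two matrices of shape n_embd x 4*n_embd
    router = n_embd * num_experts if num_experts > 1 else 0
    return attn + num_experts * mlp + router

def active_block_params(n_embd, num_experts, num_experts_per_tok):
    """Weights actually used for a single token in one MoE block."""
    attn = 4 * n_embd * n_embd
    mlp = 8 * n_embd * n_embd
    router = n_embd * num_experts
    return attn + num_experts_per_tok * mlp + router

n_layer, n_embd = 6, 384
num_experts, num_experts_per_tok = 10, 2

dense_total = n_layer * block_params(n_embd)
moe_total = n_layer * block_params(n_embd, num_experts)
moe_active = n_layer * active_block_params(n_embd, num_experts, num_experts_per_tok)

print(f"dense blocks:      {dense_total / 1e6:.1f}M parameters")
print(f"MoE blocks total:  {moe_total / 1e6:.1f}M parameters")
print(f"MoE blocks active: {moe_active / 1e6:.1f}M parameters per token")
```

Under these assumptions the stored parameter count grows with the number of experts, while the parameters touched per token grow only with the number of experts selected.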
I only have a macbook (or other cheap computer). No worries, we can still train a GPT but we want to dial things down a notch. I recommend getting the bleeding edge PyTorch nightly (select it here when installing) as it is currently quite likely to make your code more efficient. But even without it, a simple train run could look as follows:
$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
Here, since we are running on CPU instead of GPU we must set both --device=cpu and also turn off PyTorch 2.0 compile with --compile=False. Then when we evaluate we get a bit more noisy but faster estimate (--eval_iters=20, down from 200), our context size is only 64 characters instead of 256, and the batch size only 12 examples per iteration, not 64. We'll also use a much smaller Transformer (4 layers, 4 heads, 128 embedding size), and decrease the number of iterations to 2000 (and correspondingly usually decay the learning rate to around max_iters with --lr_decay_iters). Because our network is so small we also ease down on regularization (--dropout=0.0). This still runs in about 3 minutes, but gets us a loss of only 1.88 and therefore also worse samples, but it's still good fun:
$ python sample.py --out_dir=out-shakespeare-char --device=cpu
Generates samples like this:
GLEORKEN VINGHARD III:
Whell's the couse, the came light gacks,
And the for mought you in Aut fries the not high shee
bot thou the sought bechive in that to doth groan you,
No relving thee post mose the wear
Not bad for ~3 minutes on a CPU, for a hint of the right character gestalt. If you're willing to wait longer, feel free to tune the hyperparameters, increase the size of the network, the context length (--block_size), the length of training, etc.
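When changing the length of training, it helps to see why --lr_decay_iters is usually kept close to --max_iters. The sketch below assumes a linear warmup followed by cosine decay down to a floor; the function name, the default values, and the exact shape of the schedule are assumptions for illustration and may differ from what train.py actually does.

```python
import math

def get_lr(it, learning_rate=1e-3, min_lr=1e-4,
           warmup_iters=100, lr_decay_iters=2000):
    """Learning rate at iteration `it` (assumed warmup + cosine schedule)."""
    # 1) Linear warmup from 0 up to the peak learning rate.
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) Past the decay horizon, stay at the floor.
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, cosine-decay from the peak down to the floor.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

# Print the schedule at a few points of a 2000-iteration run.
for it in (0, 50, 100, 1000, 2000):
    print(it, get_lr(it))
```

If lr_decay_iters were much larger than max_iters, training would stop while the learning rate was still high; if it were much smaller, most of the run would sit at the floor.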
Finally, on Apple Silicon Macbooks and with a recent PyTorch version make sure to add --device=mps (short for "Metal Performance Shaders"); PyTorch then uses the on-chip GPU that can significantly accelerate training (2-3X) and allow you to use larger networks. See Issue 28 for more.
This implementation is based on:
NanoGPT: https://github.com/karpathy/nanoGPT
llama-mistral: https://github.com/dzhulgakov/llama-mistral