This repository aims to provide minimal code for generative models for educational purposes. The code depends only on PyTorch 2.0, with no Hugging Face Transformers dependency.
To begin with, I included code to train a 51M-parameter language model. I will add image generation and more features in the future.
This repository is tested on:
- Python 3.10.12
- Poetry 1.6.1
- NVIDIA V100 GPU
- CUDA 11.8
For the Python packages, please refer to pyproject.toml.
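Assuming the standard Poetry workflow, the dependencies listed there can be installed with:

```sh
poetry install
```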
I trained a 51M-parameter language model on 1B tokens from BookCorpus. Training took around 20 hours on a single V100 GPU, which cost around $50. The final model achieved a perplexity of 0.83.
To create a tokenizer, run:

```sh
poetry run python generative_ai/scripts/create_tokenizer.py
```
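As a sketch of what a tokenizer-creation script might do internally, here is a minimal BPE tokenizer built with the Hugging Face `tokenizers` package (a standalone library, distinct from Transformers); the file paths, vocabulary size, and special tokens are illustrative assumptions, not necessarily the repository's actual settings:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a byte-pair-encoding tokenizer on a raw text corpus.
# Paths, vocab size, and special tokens below are placeholders.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=8192, special_tokens=["[UNK]", "[EOS]"])
tokenizer.train(files=["data/bookcorpus.txt"], trainer=trainer)
tokenizer.save("generative_ai/artifacts/tokenizer.json")
```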
To launch training, run:

```sh
poetry run python generative_ai/scripts/train.py
```
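For orientation, the core of a next-token-prediction training loop in PyTorch looks roughly like the sketch below; `model`, `loader`, and the optimizer settings are illustrative assumptions, not the script's actual configuration:

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, device="cuda"):
    # Assumes a decoder-only model mapping token ids (batch, seq)
    # to logits (batch, seq, vocab). Hyperparameters are placeholders.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    model.train()
    for batch in loader:  # batch: (batch_size, seq_len + 1) token ids
        batch = batch.to(device)
        # Shift by one position: predict token t+1 from tokens up to t.
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = model(inputs)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
```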
To generate sentences with the pretrained model, run:

```sh
$ poetry run python generative_ai/scripts/generate.py --model generative_ai/artifacts/model.pt --prompt "life is about"
number of parameters: 50.98M
life is about romance , and love and adrenaline , at the same time .
```
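Under the hood, generation is plain autoregressive sampling. Below is a minimal sketch, assuming a model that returns logits of shape (batch, seq, vocab) and a `tokenizers`-style tokenizer; the actual script's interface may differ:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    # Encode the prompt and sample one token at a time.
    ids = torch.tensor([tokenizer.encode(prompt).ids], device="cuda")
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature  # last-position logits
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```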
`model.pt` can be obtained at Hugging Face Models.
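To fetch the checkpoint programmatically, one option is the `huggingface_hub` package; the repo id below is a placeholder, since the actual model repository is not named here:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id; substitute the actual Hugging Face Models repository.
path = hf_hub_download(repo_id="<user>/<repo>", filename="model.pt")
print(path)  # local path to the downloaded checkpoint
```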