This repo contains code for training and running autoregressive language models based on the transformer architecture. I mainly wrote this code to teach myself PyTorch and learn more about how large language models work. It should not be used for anything serious. It is heavily inspired by Andrej Karpathy's nanoGPT.
Current features:
- Basic decoder-only transformer architecture with learned positional embeddings in the style of GPT-2
- RMSNorm
- Gated feedforward layers
- Rotary Positional Embeddings (RoPE); a minimal sketch of these building blocks follows below
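
For reference, here is a minimal PyTorch sketch of what the RMSNorm, gated feedforward (SwiGLU-style), and rotary-embedding pieces typically look like. The class and function names (`RMSNorm`, `GatedFeedForward`, `apply_rope`) and the exact conventions (bias-free linear layers, the "rotate-half" RoPE variant) are illustrative assumptions and may not match the actual modules in this repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """RMSNorm: scale activations by their root-mean-square over the feature
    dimension, with a learned per-channel gain (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


class GatedFeedForward(nn.Module):
    """Gated feedforward (SwiGLU-style): the hidden activation is the
    elementwise product of a SiLU gate branch and a linear branch."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional embeddings (rotate-half variant): rotate channel pairs
    by a position-dependent angle so attention scores depend on relative offsets.
    x has shape (batch, seq_len, n_heads, head_dim) with an even head_dim."""
    _, t, _, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, t, 1, half)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

As a quick smoke test, `apply_rope(torch.randn(2, 16, 4, 64))` should return a tensor of the same shape, and `RMSNorm(64)(torch.randn(2, 16, 64))` should preserve shape as well.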
Planned:
- Hybrid attention/state-space architectures in the style of Jamba
- Mixture of Experts