In this repo, I built two types of language models: recurrent and transformer. I also have a pipeline for training and running hyperparameter (hparam) sweeps with W&B. Below the implementation notes, I give a brief intro to language models and the transformer architecture, then outline the key components of the implementation.
Note: I am actively working on and improving this repo.
git clone https://github.com/khajash/language-models.git
cd language-models
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
I am using WikiText-2 from `torchtext`. Below is an excerpt from the dataset. As you can see, it is already preprocessed, with rare words replaced by the `<unk>` token. The vocabulary has a total of 28,782 tokens.
= Valkyria Chronicles III =
Senjō no Valkyria 3 : Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Raven " .
- Create a JSON config file in `lmlib/configs` to specify the network parameters and learning rate schedule.
- Recurrent Network: `python train-recurrent.py --config ./configs/simple-gru-invsqrt.json`
  - Select different layer types, e.g. RNN, LSTM, GRU, under `cell_type` in the config file
- Transformer: `python train.py --config ./configs/simple-transformer-cosine.json`
- Command line configs include:
  - `seed`: Random seed. (int, default = 0)
  - `n_epochs`: Number of epochs to run the training. (int, default = 50)
  - `batch_size`: Batch size for mini-batch training. (int, default = 20)
  - `eval_batch_size`: Batch size for evaluation. (int, default = 20)
  - `seq_len`: Max length of a sequence. (int, default = 35)
  - `save_model`: Save the best model while training and the last model when done.
  - `dryrun`: Run in dryrun mode without wandb.
- Go to your WandB project in the web console. If you don't have a project already, create a new one.
- Select Sweeps > Create Sweep
- Copy the YAML sweep config from `sweep-bayes.yml` or write a custom one. The autogenerated YAML is not helpful in my setup. Press Initialize Sweep.
- Make sure the default params in `parsers.py` are correct, especially the base config file.
- Set `RUN_WANDB_SWEEP = True` at the beginning of the `train.py` file. This allows wandb to override hparams in the sweep; otherwise it will keep the default configuration.
- In the WandB sweep console, copy the launch agent command `wandb agent user/project/agentID` and run it in an open terminal. This will start your sweep. Do this on as many separate machines as you want for distributed tuning.
- Sinusoidal Positional Encoding - same as Vaswani et al. (2017)
  - In the PyTorch implementation, we take advantage of the log-exp trick to make the math a bit easier (a sketch appears in the Sinusoidal Positional Encoding section further down).
- Learning Rate Schedulers - configured in the JSON config file (a sketch of the warmup schedule follows this list)
  - StepLR
  - Inverse Square Root with Warm-up
  - Cosine with Warm-up
- Generating new text is currently only set up for the transformer architecture. It supports two decoding methods (sketched after this list):
  - Greedy Search - greedily selects the word with the highest probability at each step; this method tends to generate sequences that repeat words or subsequences multiple times
  - Sampling - randomly selects the word at each timestep based on its conditional probability distribution
- Generate text with `python generate_text.py --model path/to/model.pt --config ./configs/config-file.json --seq_len 20 --decoding sampling`
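For illustration, the inverse-square-root-with-warmup idea can be expressed with `torch.optim.lr_scheduler.LambdaLR`. This is a sketch of the schedule, not the repo's exact scheduler, and `warmup_steps=4000` is just an example value:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def inv_sqrt_warmup(optimizer, warmup_steps: int) -> LambdaLR:
    """Linear warmup to the base lr, then decay proportional to 1/sqrt(step)."""
    def lr_lambda(step: int) -> float:
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps          # linear warmup
        return (warmup_steps / step) ** 0.5     # inverse square root decay
    return LambdaLR(optimizer, lr_lambda)

# usage: call scheduler.step() after each optimizer update
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = inv_sqrt_warmup(optimizer, warmup_steps=4000)
```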
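The two decoding methods differ only in how the next token is chosen from the model's output distribution at each step. A minimal sketch; `decode_step` is a hypothetical helper, not the repo's API:

```python
import torch

@torch.no_grad()
def decode_step(logits: torch.Tensor, method: str = "greedy") -> torch.Tensor:
    """Choose the next token id from the last-position logits, shape (batch, vocab)."""
    if method == "greedy":
        return logits.argmax(dim=-1)                 # always take the most likely token
    probs = torch.softmax(logits, dim=-1)            # conditional distribution over the vocab
    return torch.multinomial(probs, num_samples=1).squeeze(-1)  # sample one token per sequence
```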
A language model estimates the probability distribution over a sequence of tokens, e.g. words. Given a previous set of tokens, it predicts the probability of the next token in the sequence.
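Written out, the joint probability of a sequence of tokens $x_1, \dots, x_T$ factorizes token by token (the standard autoregressive chain rule):

$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$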
In practice, a language model takes a sequence of tokens and feeds it through an embedding layer, a decoder model, and a softmax function to output probabilities over the vocabulary (vocab size = `ntoken`). The decoder model is typically either a recurrent model (RNN, LSTM, GRU, etc.) or a transformer. Recurrent models process each word in sequence, while a transformer can process the sequence in parallel using a mask. (A minimal sketch of this pipeline follows Fig. 1.)
Fig.1 High-Level Diagram of Language Model
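To make the pipeline concrete, here is a toy sketch with a GRU decoder; the module names and sizes are illustrative and not the repo's actual classes:

```python
import torch
import torch.nn as nn

class ToyLanguageModel(nn.Module):
    """Embedding -> decoder -> linear projection -> (log-)softmax over the vocabulary."""
    def __init__(self, ntoken: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(ntoken, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)  # could also be a transformer stack
        self.proj = nn.Linear(d_model, ntoken)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)        # (batch, seq_len) -> (batch, seq_len, d_model)
        x, _ = self.decoder(x)        # contextualized hidden states
        logits = self.proj(x)         # (batch, seq_len, ntoken)
        return torch.log_softmax(logits, dim=-1)
```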
The Transformer architecture used here is similar to that employed in Liu et al. (2018) and the original GPT (Radford et al., 2018). For a language model, we do not need the encoder-decoder architecture used for neural machine translation (NMT) in Vaswani et al. (2017); instead, we can use just a decoder network. This decoder block is similar to the encoder block in Vaswani et al. (2017), in that it consists of only two sublayers: Self-Attention and Feed-Forward. One key difference between the encoder block in Vaswani et al. (2017) and the decoder here is that we use Masked Self-Attention rather than unmasked attention. (A sketch of this block follows Fig. 2.)
Fig.2 Diagram of Transformer Decoder Language Model
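A sketch of one such decoder block, using PyTorch's `nn.MultiheadAttention` for the masked self-attention sublayer and post-norm residuals as in Vaswani et al. (2017); this is illustrative, not the repo's exact module:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention + feed-forward, each with a residual connection and layer norm."""
    def __init__(self, d_model: int, nhead: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)  # causal mask blocks future positions
        x = self.norm1(x + a)                           # residual + layer norm
        x = self.norm2(x + self.ff(x))                  # feed-forward sublayer
        return x
```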
Below, I'll dive into the three important components of the transformer architecture: positional encoding, scaled dot-product attention, and multi-head attention. Here are some of the key parameters we'll be using in this doc.
- $d_\text{model}$: dimension of the embeddings and of the layers within the model
- $d_\text{vocab}$: size of the vocabulary, listed as `ntoken` in the diagrams
We use positional encodings to inject information about relative or absolute position into the model. The encoding has the same dimension as the embeddings, $d_\text{model}$, so the two can simply be summed.
Sinusoidal Positional Encoding
For this method, we precompute a PE matrix and store it in a buffer (the formula and a sketch follow Fig. 3).
- Each row represents the encoding for a specific word at position $i$
- Each column represents a sinusoidal function at a different wavelength, alternating between sine and cosine; this is why there is banding in the upper dimensions, where the wavelength is much larger.
Fig.3 Positional Encoding Matrix
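For reference, the encoding from Vaswani et al. (2017), where $pos$ is the position and $i$ indexes the embedding dimension:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$

And a minimal sketch of the precomputation using the log-exp trick mentioned above (similar in spirit to the PyTorch transformer tutorial in the references; not the exact code in this repo, and it assumes an even $d_\text{model}$):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Precompute the (max_len, d_model) positional encoding matrix."""
    position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
    # log-exp trick: 1 / 10000^(2i/d_model) == exp(-(2i) * ln(10000) / d_model)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even columns: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd columns: cosine
    return pe  # in a module, store via register_buffer so it is saved but not trained
```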
Before looking at Multi-Head Attention, it's important to understand Scaled Dot-Product Attention.
Queries, Keys and Values
So, what are Queries, Keys, and Values? Intuitively, the query represents the token we are currently computing attention for, the keys represent the tokens it is compared against, and the values carry the content that gets combined according to the attention weights.
Operations
- MatMul - $QK^T$ - Calculate the alignment score between each query $Q$ and key $K$ to see how much the two word embeddings match
- Scale - $\frac{1}{\sqrt{d_k}}$ - Divide by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys, so the dot products don't grow too large; this keeps the softmax gradients stable and improves performance for larger models
- Mask - (optional) Mask out future positions
- Softmax - Apply the softmax function to obtain the weights for the values $V$
- MatMul - Apply the weights to the values $V$
Fig.4 Scaled Dot Product Attention. Diagram from Vaswani et al. (2017)
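Putting these steps together gives the attention function from Vaswani et al. (2017):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

And a minimal sketch with an optional causal mask (illustrative only, not the repo's implementation):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal: bool = False) -> torch.Tensor:
    """q, k, v: (batch, seq_len, d_k). Returns the weighted values, (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # MatMul + Scale
    if causal:                                           # Mask: block attention to future positions
        seq_len = q.size(-2)
        future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # Softmax over the key dimension
    return weights @ v                                   # MatMul with the values
```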
Rather than performing a single attention function with scaled dot-product attention, we linearly project $Q$, $K$, and $V$ multiple times and run the attention function on each projection in parallel, as outlined below.
Operations
- Linear - Linearly project $Q$, $K$, and $V$, each with its own set of weights. Do not use an activation function here.
  - This is where we project into different subspaces and learn alignments for different representations
- Scaled Dot-Product Attention - For each projected version, perform the scaled dot-product attention function in parallel
- Concat - Concatenate all of the scaled dot-product attention heads $(\text{head}_1, \dots, \text{head}_h)$
- Linear - Project the concatenated heads back to the original space to produce the final values
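In equation form (Vaswani et al., 2017):

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O, \qquad \text{head}_i = \text{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$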
Why Multi-head attention?
- Word representations encode many different characteristics of the word. A single Scaled Dot-Product Attention layer would only be able to query these characteristics in one shot, e.g. maybe it determines that a word is a verb but not that it is past tense.
- Multi-Head Attention applies multiple linear transformations to $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, allowing the model to apply many different projections of the word representations into different subspaces, each focusing on a subset of the word's characteristics.
- Vaswani et al. (2017) used $h=8$ parallel attention layers with $d_k = d_v = d_{\text{model}} / h = 64$.
  - Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
Notes on the original Transformer (Vaswani et al., 2017):

- Architecture
  - Encoder-Decoder Transformer model for Machine Translation
  - At the beginning of both the encoder and decoder is an embedding layer for the relevant vocabulary plus a sinusoidal positional encoding layer.
    - Embeddings are multiplied by $\sqrt{d_{\text{model}}}$
  - 512-dimensional states with 8 attention heads
  - Encoder
    - Stack of $N = 6$ identical layers consisting of two sublayers: self-attention and feed-forward network.
    - Around each sublayer is a residual connection, followed by layer normalization.
  - Decoder
    - Stack of $N = 6$ identical layers consisting of three sublayers: masked self-attention, encoder-decoder attention, and feed-forward network.
    - Around each sublayer is a residual connection, followed by layer normalization.
  - Position-wise Feed-Forward Networks
    - Two linear layers with a ReLU activation in between: $$\text{FFN}(x) = \text{max}(0, xW_1 + b_1)W_2 + b_2$$
    - Input and output dims: $d_{\text{model}}=512$; inner-layer dims: $d_{ff}=2048$
- Optimization
  - Adam optimizer with $\beta_1 = 0.9$, $\beta_2=0.98$, and $\epsilon=10^{-9}$
  - Linear warmup followed by inverse square root decay (the schedule from the paper is shown below)
- Used byte-pair encoding (BPE) with a target vocabulary of ~37,000 tokens
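For reference, the warmup-then-decay schedule from Vaswani et al. (2017), with $warmup\_steps = 4000$:

$$lr = d_{\text{model}}^{-0.5} \cdot \min\left(step^{-0.5},\ step \cdot warmup\_steps^{-1.5}\right)$$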
Notes on GPT (Radford et al., 2018):

- Architecture
  - 12-layer decoder-only transformer with masked self-attention heads (768-dimensional states and 12 attention heads)
  - Position-wise feed-forward networks with 3072-dimensional inner states
- Optimization
  - Adam optimizer with a max lr of 2.5e-4
  - lr scheduler: increased linearly from zero over the first 2000 updates, then annealed to 0 using a cosine schedule
  - 100 epochs using minibatches of 64 randomly sampled, contiguous sequences of 512 tokens
- Weight initialization of $N(0, 0.02)$ is sufficient because layer norm is used throughout
- Used byte-pair encoding (BPE) vocabulary with 40,000 merges
- Residual, embedding, and attention dropouts with a rate of 0.1 for regularization
- Modified version of L2 regularization with $w=0.01$ on all non-bias or gain weights
- GELU activation function
- Used learned position embeddings instead of sinusoidal
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn and TensorFlow. O’Reilly.
- Karpathy, A. (2023). MinGPT [Python]. https://github.com/karpathy/minGPT (Original work published 2020)
- Language Modeling with nn.Transformer and TorchText. (2022). PyTorch. https://pytorch.org/tutorials/beginner/transformer_tutorial.html
- Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., & Shazeer, N. (2018). Generating Wikipedia by Summarizing Long Sequences. https://doi.org/10.48550/arXiv.1801.10198
- Platen, P. (2020). How to generate text: Using different decoding methods for language generation with Transformers. Hugging Face. https://huggingface.co/blog/how-to-generate
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. http://arxiv.org/abs/1706.03762