Poetry sampled from a Recurrent Neural Network language model which was trained on a web-scraped dataset of 261 poems
i never write
for not hot or cold again
backwards of death by the other mistake we had
the smallest artists sit outside into the first
rocks steam and the women cry
in one night the trees without death again
i held work and one by one who not beside it for him with
or so much that still be today here
Try it for yourself!
- Install project dependencies with `pip install -r requirements.txt`
- Run `predict.py` to produce a novel sample from the language model. This sample will use weights that were pre-trained on an AWS EC2 p2.xlarge instance for 5 hours.
- Run `train.py` to further train the weights on your own computer.
The dataset of poems, `poems.txt`, is divided into sequences. The RNN is trained to take a sequence of words as input and predict the next word.
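For concreteness, here is a minimal sketch of how the corpus could be split into fixed-length input sequences and next-word targets. The sequence length and whitespace tokenization below are illustrative assumptions, not necessarily what `train.py` does.

```python
# Illustrative preprocessing sketch: split the tokenized corpus into
# fixed-length input sequences and next-word targets.
SEQ_LEN = 10  # assumed sequence length; the project's value may differ

with open("poems.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()  # naive whitespace tokenization

inputs, targets = [], []
for i in range(len(tokens) - SEQ_LEN):
    inputs.append(tokens[i:i + SEQ_LEN])   # a sequence of words...
    targets.append(tokens[i + SEQ_LEN])    # ...and the word that follows it

print(inputs[0], "->", targets[0])
```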
Recurrent neurons differ from regular neurons because they are able to take sequences as input. The long short-term memory (LSTM) recurrent unit was used in this network specifically because of its ability to capture long-term dependencies. It does this by training specific weights, called gates, which enable the unit to store information between inputs. These gates determine which data is important to store long-term (like the gender of a subject) and which data should be discarded (like the previously stored gender, once a new subject is introduced in the text).
Note: tests using the Gated Recurrent Unit (GRU) produced very similar results.
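As a rough illustration, a word-level LSTM language model can be assembled in a few lines of Keras. The framework and layer sizes below are assumptions for the sketch; the actual architecture is defined in `rap_models.py`.

```python
# Minimal sketch of a word-level LSTM language model in Keras.
# Vocabulary size, embedding dimension, and LSTM width are illustrative
# placeholders, not the values used by this project.
import tensorflow as tf

VOCAB_SIZE = 20000  # assumed vocabulary size
EMBED_DIM = 300     # Word2Vec embeddings are 300-dimensional
LSTM_UNITS = 256    # assumed hidden-state size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),         # word IDs -> embeddings
    tf.keras.layers.LSTM(LSTM_UNITS),                         # gates decide what to keep or forget
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # next-word probability distribution
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```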
Words are typically one-hot encoded or encoded with integer indices for input into neural networks. When humans read text, they already have a language model in their brains that provides contextual information about each word (such as how Beijing and Shanghai refer to similar locations). However, these simple encoding methods treat each word as an isolated entity, forcing the network to learn this semantic information on its own. Unfortunately, the network is bad at learning this information because it is being trained on an entirely separate task (predicting the next word in a sequence).
Encoding words with embeddings solves this issue. Word embeddings are dense vectors that encode a word's semantic meaning, and distances between word embeddings indicate semantic similarity. Embeddings are generated through a dedicated supervised learning task, such as Word2Vec or GloVe. Although the `Word2VecModel` class in `rap_models.py` does have functionality to train a Word2Vec model on the 261 poems in `poems.txt`, the specific weights used in the current iteration of the model were pulled from pre-trained embeddings.
I used Google's collection of 3 million vectors that were pretrained on a Google News corpus which contained billions of words. These embeddings are much higher quality than those trained only on the poetry dataset. You can find them here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
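The pre-trained vectors can be loaded with gensim and queried for similarity. The file name below assumes the standard `GoogleNews-vectors-negative300.bin` download from the link above.

```python
# Load the pre-trained Google News vectors and inspect word similarity.
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True  # assumed file name
)
print(w2v.similarity("Beijing", "Shanghai"))  # semantically close cities
print(w2v.similarity("Beijing", "banana"))    # unrelated words score much lower
```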
The corpus used to train the model consisted of over 20k words. Storing an individual Word2Vec embedding for every word occurrence requires a lot of storage space and slows down training (since each word appears many times in the corpus). The embedding layer (the first layer in the network) takes a word's integer index as input and outputs that word's Word2Vec embedding. This reduces the size of the training set by enabling each high-dimensional word vector in the corpus to be replaced with a single integer ID.
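One common way to wire this up is to build an embedding matrix indexed by integer word IDs and use it to initialise a frozen Keras embedding layer. The sketch below reuses `tokens` and `w2v` from the earlier snippets and is illustrative rather than the project's exact code.

```python
# Build an embedding matrix (word ID -> Word2Vec vector) and use it to
# initialise a frozen embedding layer. Names reused from earlier sketches.
import numpy as np
import tensorflow as tf

EMBED_DIM = 300
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}

embedding_matrix = np.zeros((len(vocab), EMBED_DIM))
for word, idx in word_to_id.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]  # copy the pre-trained vector

embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(vocab),
    output_dim=EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the Word2Vec vectors fixed during training
)
```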
Words that appear infrequently are encoded with the unknown-word token `<UKN>`. This encoding is worthwhile because when a word is rare, learning its meaning yields minimal improvement, while excluding it from the vocabulary improves efficiency. Furthermore, enabling the network to process unknown words allows any English sentence to be used as a seed, even if some of its words don't appear in the training set.
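A simple way to apply this is to replace every word below a frequency cutoff with the `<UKN>` token before building the vocabulary. The cutoff below is an assumed value, not the project's.

```python
# Replace infrequent words with the <UKN> token before building the vocabulary.
from collections import Counter

MIN_COUNT = 2  # assumed rarity threshold
counts = Counter(tokens)
tokens = [w if counts[w] >= MIN_COUNT else "<UKN>" for w in tokens]
```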
To generate a novel poem like the one at the start of this page, the network is first seeded with the zero vector. It then produces a probability distribution indicating the likelihood that each word in the vocabulary is the next word in the sequence. A word is sampled from this distribution, appended to the sequence, and the updated sequence is fed back into the network. At each timestep the network produces a new probability distribution over the vocabulary, indicating the most likely words to appear next. Sampling continues until the network produces the `<END>` token.
Note: The probability distribution has to be randomly sampled. If the most likely word is simply used, the network might produce sequences that look like this: "when the the the but and the the the the and the"
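Putting the generation loop and the sampling step together, a sketch might look like the following. It assumes a trained `model`, the `word_to_id` mapping and `SEQ_LEN` from the earlier snippets, an `<END>` token in the vocabulary, and a seed of all-zero IDs standing in for the zero vector.

```python
# Sketch of sampling a poem word-by-word from a trained model.
import numpy as np

id_to_word = {i: w for w, i in word_to_id.items()}

sequence = [0] * SEQ_LEN  # assumed stand-in for the zero-vector seed
poem = []
while True:
    probs = model.predict(np.array([sequence]), verbose=0)[0]
    probs = probs / probs.sum()                      # renormalise for sampling
    next_id = np.random.choice(len(probs), p=probs)  # sample, don't take the argmax
    word = id_to_word.get(next_id, "<UKN>")
    if word == "<END>":
        break
    poem.append(word)
    sequence = sequence[1:] + [next_id]              # feed the prediction back in

print(" ".join(poem))
```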
This project was inspired by Andrej Karpathy's blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ Andrej trained an RNN to read text character-by-character and predict the next character. I decided to iterate on this method by building a network that processes text word-by-word, which allowed me to incorporate Word2Vec embeddings and produce more coherent sentences.