A text generator that uses TensorFlow to train an LSTM model on a given text file and generates text in that style by predicting single characters for a desired text length. This code was created as a private project while learning machine learning concepts, so not everything is perfectly implemented or fully configurable, but it offers enough options to play around and experiment with different settings. Free ebooks to use as input datasets can be found, for example, at the awesome Project Gutenberg.
The architecture of the Neural Network (NN) used looks like this:
Caveat: The Dropout is not visualized here!
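For orientation, here is a minimal sketch of how such an architecture could be built with Keras. The layer sizes come from the example config below; the exact stacking (a single LSTM layer with the Dropout between it and the softmax output) is an assumption based on the description above, not the project's literal code:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_model(sequence_chars_length, vocab_size,
                lstm_units=256, dropout_probability=0.2):
    """Char-level LSTM: one sequence of chars in, a softmax over the next char out."""
    model = Sequential([
        # Input: sequence_chars_length timesteps, one feature (the char index) per step
        LSTM(lstm_units, input_shape=(sequence_chars_length, 1)),
        Dropout(dropout_probability),              # the Dropout not shown in the diagram
        Dense(vocab_size, activation="softmax"),   # probability for each char in the vocabulary
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```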
- Download this repository as .zip or via `git clone`
- Install the requirements: `pip install -r stable-req.txt` (the requirements were generated using `pip freeze`)
- Add a text file to `data` (don't forget to remove the header and footer, e.g. as used in Project Gutenberg files)
- Edit the `config.json` file accordingly
- Start the generator (with preprocessing, training and generation if it's your first start) using `python generator.py`
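For context, the flow behind `python generator.py` is driven by the three `exec_*` flags in `config.json`. The sketch below only illustrates that idea; the `run_*` helpers are placeholders standing in for the real classes in `preprocessing.py`, `training.py` and `generator.py`:

```python
import json

# Placeholder phase functions, purely for illustration.
def run_preprocessing(cfg): print("preprocessing", cfg["input_file"])
def run_training(cfg): print("training for", cfg["epochs_qty"], "epochs")
def run_generation(cfg): print("generating", cfg["text_chars_length"], "chars")

def main():
    # Load the central configuration
    with open("config.json") as f:
        config = json.load(f)

    # Each phase only runs if its exec_* flag is enabled in config.json
    if config["preprocessing"]["exec_preprocessing"]:
        run_preprocessing(config["preprocessing"])
    if config["training"]["exec_training"]:
        run_training(config["training"])
    if config["generation"]["exec_generation"]:
        run_generation(config["generation"])

if __name__ == "__main__":
    main()
```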
After your first training you can use the previously trained checkpoints to just generate text. To do so, edit the `config.json` file:
- Disable preprocessing (`exec_preprocessing` to `false`) and training (`exec_training` to `false`)
- Set the weights to use to the weights checkpoint (`load_weights_filename`) with the lowest loss X from epoch Y (`trainingCheckpoints/weights-ep_YY-loss_X.XXXX.hdf5`)
- Enable generation (`exec_generation` to `true`)
- Execute `python generator.py`
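If you prefer, the same switch can be scripted. This is only a hypothetical helper; the checkpoint filename is taken from the example config further below and may not match a file you actually have:

```python
import json

# Switch config.json to "generation only" mode after a finished training run.
with open("config.json") as f:
    config = json.load(f)

config["preprocessing"]["exec_preprocessing"] = False
config["training"]["exec_training"] = False
# Point to the checkpoint with the lowest loss (example filename).
config["training"]["load_weights_filename"] = \
    "trainingCheckpoints/weights-ep_10-loss_2.1012.hdf5"
config["generation"]["exec_generation"] = True

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```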
The most relevant parameters and settings are configurable via the `config.json` file. For most scenarios you don't have to edit the code itself, just the config.
- `preprocessing`: Parameters specific to the preprocessing phase (a simplified sketch of this phase follows the parameter list)
  - `exec_preprocessing`: Whether preprocessing should be executed or not
  - `input_file`: The file to use as input data
  - `sequence_chars_length`: The length of the text sequence to extract the patterns from (sequence -> predicted next char)
  - `checkpoints`:
    - `char2intDict_file`: The file holding the checkpoint for the dictionary that converts chars to integers
    - `int2charDict_file`: The file holding the checkpoint for the dictionary that converts integers to chars
    - `vocabulary_file`: The file holding the extracted vocabulary (unique chars)
    - `X_file`: The input matrix
    - `Y_file`: The output matrix (next char)
- `training`: Parameters specific to the training phase
  - `exec_training`: Whether training should be executed or not
  - `load_weights_filename`: If training should not be executed, previously trained weights are loaded from this file
  - `lstm_units`: Dimensionality of the output space
  - `dropout_probability`: The probability of a dropout, between 0 and 1
  - `epochs_qty`: The number of epochs to execute while training
  - `gradient_batch_size`: Number of samples per gradient update
  - `checkpoints`:
    - `foldername`: The folder where the weights should be stored
- `generation`: Parameters specific to the generation phase
  - `exec_generation`: Whether generation should be executed or not
  - `text_chars_length`: The length of the generated text in chars
  - `foldername`: The folder where the resulting generated text should be stored
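To make the preprocessing parameters more concrete, here is a rough sketch of what that phase produces: sliding windows of `sequence_chars_length` chars as input patterns and the following char as the target. Variable names and details are assumptions, not the exact code from `preprocessing.py`:

```python
import numpy as np

def preprocess(raw_text, sequence_chars_length=100):
    """Turn raw text into training patterns: a sequence of chars -> the next char."""
    vocabulary = sorted(set(raw_text))                       # unique chars
    char2intDict = {c: i for i, c in enumerate(vocabulary)}  # chars -> integers
    int2charDict = {i: c for i, c in enumerate(vocabulary)}  # integers -> chars

    X, Y = [], []
    for i in range(len(raw_text) - sequence_chars_length):
        seq_in = raw_text[i:i + sequence_chars_length]        # input sequence
        seq_out = raw_text[i + sequence_chars_length]         # char to predict
        X.append([char2intDict[c] for c in seq_in])
        Y.append(char2intDict[seq_out])

    # Shape for the LSTM: (samples, timesteps, features), scaled to [0, 1]
    X = np.reshape(X, (len(X), sequence_chars_length, 1)) / float(len(vocabulary))
    # One-hot encode the next char
    Y = np.eye(len(vocabulary))[Y]
    return X, Y, char2intDict, int2charDict, vocabulary
```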
- `generator.py`: The class that generates text; also contains the main function (a simplified sampling loop is sketched below)
- `preprocessing.py`: The class that encapsulates the preprocessing phase
- `training.py`: The class that encapsulates the training phase
- `filehelper.py`: The helper class that handles all file-specific tasks (e.g. reading/writing checkpoints, loading the config, etc.)
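Conceptually, the generation step predicts one character at a time and slides the input window forward by feeding the prediction back in. Below is a minimal sketch of such a sampling loop, assuming the model layout and the preprocessing outputs shown above; it is not the project's literal code:

```python
import numpy as np

def generate_text(model, X, int2charDict, vocab_size, text_chars_length=1000):
    """Generate text char by char, starting from a random seed sequence."""
    # Pick a random training pattern as the seed (undo the [0, 1] scaling)
    seed = np.rint(X[np.random.randint(len(X))] * vocab_size).astype(int)
    pattern = list(seed.flatten())
    result = []
    for _ in range(text_chars_length):
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(vocab_size)
        prediction = model.predict(x, verbose=0)
        index = int(np.argmax(prediction))   # most likely next char
        result.append(int2charDict[index])
        pattern.append(index)                # slide the window forward
        pattern = pattern[1:]
    return "".join(result)
```

In the real project the trained weights would first be restored, e.g. with something like `model.load_weights(load_weights_filename)`, before running such a loop.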
Using this configuration:
{
"preprocessing": {
"exec_preprocessing": true,
"input_file": "data/aliceInWonderland.txt",
"sequence_chars_length": 100,
"checkpoints": {
"char2intDict_file": "preprocessingCheckpoints/char2intDict",
"int2charDict_file": "preprocessingCheckpoints/int2charDict",
"vocabulary_file": "preprocessingCheckpoints/vocab",
"X_file": "preprocessingCheckpoints/X",
"Y_file": "preprocessingCheckpoints/Y"
}
},
"training": {
"exec_training": true,
"load_weights_filename": "trainingCheckpoints/weights-ep_10-loss_2.1012.hdf5",
"lstm_units": 256,
"dropout_probability": 0.2,
"epochs_qty": 10,
"gradient_batch_size": 128,
"checkpoints":{
"foldername": "trainingCheckpoints"
}
},
"generation": {
"exec_generation": true,
"text_chars_length": 1000,
"foldername": "result"
}
}
and a random seed, this output could be generated:
‘i dan a latter ’ said the date pirelee to aerce an an anine ‘in
the koot oo the woile ’ and the woile worhe to the worle to ber toee
ano the woole an anl the woole and toee
‘i can a latter ’ said the doyphon, ‘i wonld to toe toee ’
Obviously this is not perfect English and doesn't really make sense. But considering that the NN had no prior knowledge of languages, terms, or even the fact that quotation marks indicate that someone said something, it learned in a really short time (just 10 epochs) what the basic structure of text should look like.
If you tuned the parameters further, you would get far better results. But for a basic tutorial and a first hands-on project, I think this should be enough.