Sequence-to-sequence model with attention, implemented in Torch. The convolutional attentive encoder (Rush et al.), inspired by Bahdanau et al., is provided. Additionally, the encoder can be a bidirectional recurrent neural network (LSTM | GRU | RNN).
The model is implemented in Torch and requires the following packages:
- First, I pre-process the training data using the tokenizer of the Moses toolkit with the script `nmt-prep.sh` in the `script/` folder:

```
sh nmt-prep.sh
```
- Then, prepare the data with `data_prep.lua`, which transforms the data into tensors:

```
th data_prep.lua
```
- Now, start training the model:

```
th main.lua -learningRate 0.001 -optim 'adam' -dropout 0.2 -rnn 'lstm'
```

This trains a model with the convolutional attentive encoder and a 1-layer LSTM decoder with 256 hidden units.
- Given a trained model, use beam search to obtain the output. Greedy search is also provided for efficiency. To do this, run:

```
th evaluate.lua -search 'greedy' -batch_size 32
```
- I use the `multi-bleu.perl` script from Moses to compute BLEU. Given the model identifier and the GPU id, run:

```
sh nmt-eval.sh identifier gpuid
```
Note that the test dataset is grouped by length with `nmt-prep.lua`, so the predictions do not line up with the raw source text. For this reason, the line number of each source sequence in the raw text is recorded alongside the predictions. Before computing BLEU, use these line numbers to sort the predictions so that they match the gold sequences (see the sketch below).
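A minimal sketch of this reordering, assuming the predictions and their source line numbers end up in two parallel plain-text files (the file names `pred.lineno`, `pred.txt`, and `ref.txt` are hypothetical, not names produced by the scripts):

```
# Hypothetical layout: pred.lineno holds one source line number per prediction,
# pred.txt holds the predictions, ref.txt holds the gold target sequences.
paste pred.lineno pred.txt \
  | sort -n -k1,1 \
  | cut -f2- > pred.sorted.txt

# Score the reordered predictions with the Moses script.
perl multi-bleu.perl ref.txt < pred.sorted.txt
```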
When using the convolutional attentive encoder and a 1-layer decoder with 256 hidden units and dropout 0.2, the BLEU curves obtained with greedy search on the test dataset are shown below.
The options of `data_prep.lua` are listed below; an example invocation follows the list.

- `src_path`: path to the data pre-processed with `script/*-prep.sh`
- `dst_path`: path to store the dictionaries and datasets, which are transformed from raw text into tensor format
- `src_train, src_valid, src_test`: the names of the source sequence files
- `tgt_train, tgt_valid, tgt_test`: the names of the target sequence files
- `min_freq`: replace tokens occurring fewer than this many times with the `UNK` token
- `seed`: torch manual random number generator seed
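For example, a pre-processing run might look like the following; the option names are those listed above, while the paths, file names, and values are placeholders rather than the script's defaults:

```
th data_prep.lua -src_path data/prep -dst_path data/tensor \
  -src_train train.src -src_valid valid.src -src_test test.src \
  -tgt_train train.tgt -tgt_valid valid.tgt -tgt_test test.tgt \
  -min_freq 3 -seed 123
```

The sections that follow (Data, Model, Optimization, and Other options) describe the options of the training script `main.lua`; an example invocation combining them is given after those lists.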
Data options

- `data`: path to the training data, i.e., the output of `data_prep.lua`
- `src_dict, tgt_dict`: the dictionaries of the source and target sequences
- `thresh`: the minimum number of target sequences of the same length; length groups with fewer sequences than this value are not used for training
- `reverse`: if `true`, reverse the source sequences
- `shuff`: if `true`, shuffle the sequences
- `curriculum`: the number of epochs during which curriculum learning is performed; for these epochs the input is sorted by length
Model options

- `model`: if not empty, load the pre-trained model with this name
- `emb`: the dimension of the word embeddings
- `enc_rnn_size`: the size of the recurrent hidden states in the encoder
- `dec_rnn_size`: the size of the recurrent hidden states in the decoder
- `rnn`: the type of recurrent unit (`LSTM` | `GRU` | `RNN`)
- `nlayer`: the number of layers in the recurrent encoder and decoder
- `attn_net`: the type of attention network (`conv` | `mlp`); `conv` uses the convolutional attentive encoder, `mlp` uses a multi-layer perceptron
- `pool`: the pool size of the convolutional attentive encoder
Optimization options

- `optim`: the name of the optimization algorithm
- `dropout`: the dropout rate
- `learningRate`: the learning rate of the optimization algorithm
- `minLearningRate`: the minimum learning rate
- `shrink_factor`: decay the learning rate by this factor if the loss does not decrease below the product of the previous loss and `shrink_multiplier`; this only takes effect when `optim` is `sgd`
- `shrink_multiplier`: the shrink multiplier
- `anneal`: if `true`, anneal the learning rate
- `start_epoch`: the epoch at which the learning rate starts to anneal
- `saturate_epoch`: the number of epochs over which the learning rate anneals from its value at `start_epoch` down to `minLearningRate`
- `batch_size`: the size of a mini-batch
- `src_seq_len, tgt_seq_len`: the maximum length of the source/target sequences; longer sequences are truncated to this length
- `grad_clip`: clip gradients that are larger than this value
- `nepoch`: the maximum number of epochs
Other options

- `save`: the path to save the model
- `name`: the identifier of the training models; it is used as a substring of the saved model names
- `seed`: torch manual random number generator seed
- `cuda`: if `true`, use CUDA
- `gpu`: the ID of the GPU to use
- `nprint`: how often training information is printed
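Putting the training options together, a more fully specified run might look like this; the option names are the ones documented above, while the values and paths are illustrative placeholders rather than recommended settings:

```
th main.lua -data data/tensor -src_dict data/src.dict -tgt_dict data/tgt.dict \
  -attn_net 'conv' -rnn 'lstm' -nlayer 1 -emb 256 -enc_rnn_size 256 -dec_rnn_size 256 \
  -optim 'adam' -learningRate 0.001 -dropout 0.2 -batch_size 32 -nepoch 15 \
  -save checkpoints -name 'conv-lstm'
```

The remaining option lists (Data, Model, Search, and Other options) belong to the evaluation script `evaluate.lua`.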
Data options

- `data`: the path to the test datasets
- `src_dict, tgt_dict`: the dictionaries of the source and target sequences
- `thresh`: the minimum number of target sequences of the same length; length groups with fewer sequences than this value are skipped
- `reverse`: if `true`, reverse the source sequences
Model options

- `model`: the name of the model to test
Search options

- `search`: the search strategy (`'beam'` | `'greedy'`)
- `batch_size`: the size of a mini-batch when the `search` option is `'greedy'`
- `beam_size`: the size of the beam when the `search` option is `'beam'`
- `src_seq_len, tgt_seq_len`: the maximum length of the source/target sequences; longer sequences are truncated
- `strategy`: if `true`, the prediction is simply the top sequence of the beam the first time it ends with an `<EOS>` token; otherwise, the model considers all sequences generated so far that end with an `<EOS>` token and takes the top ones
- `nbest`: if `true`, output the n-best list when the `search` option is `'beam'`
Other options

- `save`: the path to save the model
- `output`: the path to save the predictions of the model
- `seed`: torch manual random number generator seed
- `cuda`: if `true`, use CUDA
- `gpu`: the ID of the GPU to use
- `nprint`: how often information is printed
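As an illustration, a beam-search decoding run could be launched as follows; the option names come from the lists above, while the model identifier, beam size, and output path are placeholders:

```
th evaluate.lua -model 'conv-lstm' -search 'beam' -beam_size 5 \
  -output predictions/test.out
```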
My implementation utilizes code from the following: