This Pytorch-based repository lets you train a full-precision or quantized bidirectional LSTM to perform OCR on the included dataset. A quantized trained model can be accelerated on the LSTM-PYNQ overlay found here: https://github.com/Xilinx/LSTM-PYNQ
An Nvidia GPU with a CUDA+CuDNN installation is suggested but not required; training with quantization is supported on CPUs as well.
- Python environment, including Numpy and Pillow (tested with Python 2.7, Pillow 4.2.1, Numpy 1.13.3).
- Pytorch (tested with version 0.3.1)
- pytorch-quantization (tested with master branch)
- Warp-CTC with Pytorch bindings (tested with commit aba791f)
- python-levenshtein (tested with version 0.12.0)
- TensorboardX (tested with version 1.1)
- Tensorboard (optional)
Assuming your OS is a recent Linux-based distribution, such as Ubuntu 16.04, you can follow these steps.
- Current repos have been tested against CUDA 9.0. Compatibility with earlier or later CUDA releases is not guaranteed.
- Follow the instructions here on how to install CUDA and get the meta-package cuda-libraries-dev-9-0.
- Get a complete Python distribution installer from Anaconda here.
- Run the installer in a bash shell with:
bash Anaconda2-5.1.0-Linux-x86_64.sh
and set everything to default. More information about the installation process can be found here.
- Assuming a CUDA 9.0 environment (replace cuda90 otherwise), install Pytorch 0.3.1 through conda (the Anaconda package manager) with:
conda install pytorch=0.3.1 torchvision cuda90 -c pytorch
More installer combinations are available here.
- Install a recent CMake version with:
conda install cmake
- Clone the repo:
git clone https://github.com/xilinx/pytorch-quantization
- Build the library by running:
python build.py
- Add the library to your Python path with:
export PYTHONPATH=$PYTHONPATH:/path/to/pytorch-quantization
where /path/to/pytorch-quantization is the full path to where you cloned the repo (with no trailing slashes, as shown above).
- Install the required conda dependencies with:
conda install -c anaconda pillow
- Install the required pip dependencies with:
pip install scikit-learn python-levenshtein tensorboardX
- Compile Warp-CTC with:
git clone https://github.com/SeanNaren/warp-ctc
cd warp-ctc && git checkout aba791f
mkdir build && cd build && cmake .. && make
- Then install Warp-CTC Pytorch bindings with:
export CUDA_HOME=/usr/local/cuda
cd warp-ctc/pytorch_binding
python setup.py install
- Finally, clone the actual repo:
git clone https://github.com/xilinx/pytorch-ocr
- And create a folder for experiments inside:
cd pytorch-ocr && mkdir experiments
Besides logging to stdout, the training script generates (through TensorboardX) a visual trace of loss and accuracy that can be visualized with Tensorboard.
To visualize it, first install tensorboard with:
conda install -c anaconda tensorboard
And then run it on the experiments folder, such as:
tensorboard --logdir=pytorch-ocr/experiments
The UI will default to port 6006, which must be open to incoming TCP connections.
Training supports a set of hyperparameters specified in a .json file, plus a set of command-line arguments for things such as I/O locations, choosing between training and evaluation, exporting weights, etc. Both are described in their respective sections below, where you will also find a few examples for different types of runs.
The supported architecture is composed of a single recurrent layer, a single fully connected layer, a batch normalization step (between the recurrent and the fully connected layers), and, respectively, a CTC decoder+loss layer for training and a greedy decoder for evaluation.
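As a point of reference, a minimal full-precision sketch of this topology in Pytorch is shown below. It assumes a bidirectional layer with CONCAT reduction, the default layer_size of 128 and target_height of 32, and a placeholder number of output classes; the class name OCRModel and the values chosen here are illustrative only, not the code in main.py.
import torch.nn as nn

class OCRModel(nn.Module):
    # Bidirectional LSTM -> batch norm -> fully connected, producing per-timestep
    # class scores that are fed to the CTC loss during training.
    def __init__(self, input_size=32, layer_size=128, num_classes=82):  # num_classes is a placeholder
        super(OCRModel, self).__init__()
        self.lstm = nn.LSTM(input_size, layer_size, bidirectional=True)
        self.bn = nn.BatchNorm1d(2 * layer_size)  # CONCAT reduction doubles the feature size
        self.fc = nn.Linear(2 * layer_size, num_classes)

    def forward(self, x):  # x: (seq_len, batch, input_size), with input_size = target_height
        out, _ = self.lstm(x)
        seq_len, batch, features = out.size()
        out = self.bn(out.view(seq_len * batch, features))
        return self.fc(out).view(seq_len, batch, -1)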
Experiment names follow a fixed convention. For an architecture such as QLSTM128_W2B8A4I32_FC_W32B32 we have:
- Q stands for quantized neuron.
- LSTM is the type of recurrent neuron.
- 128 is the number of neurons.
- W2 is the number of bits for weights.
- B8 is the number of bits for the recurrent bias.
- A4 is the number of bits for the recurrent activation (= hidden state).
- I32 is the number of bits for the internal activations (= sigmoid and tanh).
- FC_W32B32 is the number of bits allocated to the weights and bias of the fully connected layer.
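Mapped onto the hyperparameters documented below (this mapping is spelled out purely as an illustration of the naming scheme), the example name above corresponds to:
"neuron_type" : "QLSTM",
"layer_size" : 128,
"recurrent_weight_bit_width" : 2,
"recurrent_bias_bit_width" : 8,
"recurrent_activation_bit_width" : 4,
"internal_activation_bit_width" : 32,
"fc_weight_bit_width" : 32,
"fc_bias_bit_width" : 32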
Hyperparameters, arguments, output logs and a tensorboard trace are persisted to disk as a reference during training, unless --dry_run is specified.
Training supports resuming or retraining from a checkpoint with the argument -i (to specify the input checkpoint path) and the argument --pretrained_policy (to specify RESUME or RETRAIN).
Resuming ignores the default_trainer_params.json file and reads the training hyperparameters from within the checkpoint, unless a different .json file is specified with the -p argument.
Retraining ignores the hyperparameters found within the checkpoint, as well as the optimizer state, and reads the hyperparameters from the .json file, either the default one or one specified with the -p argument.
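For example, using a placeholder path for the checkpoint (the actual path depends on where your run saved it), resuming could look like:
python main.py -i /path/to/checkpoint --pretrained_policy RESUME
while retraining from the same checkpoint with an explicit set of hyperparameters could look like:
python main.py -i /path/to/checkpoint --pretrained_policy RETRAIN -p quantized_params/QLSTM128_W2B8A4I32_FC_W32B32.json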
Export an appropriate pretrained model to an HLS-friendly header with the argument --export, plus the optional arguments --simd_factor to specify a scaling factor for the unrolling within a neuron (1 is full unrolling, 2 is half unrolling, etc.) and --pe to specify the number of processing elements allocated to compute a neuron.
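For example, assuming --export is a simple switch and using placeholder values for the checkpoint path and the optional arguments, exporting a pretrained model could look like:
python main.py -i /path/to/checkpoint --export --simd_factor 1 --pe 1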
More arguments and their usage can be found by invoking the command-line help.
The supported hyperparameters with their default values are (taken from default_trainer_params.json, with added comments):
"random_seed": 123456, # Seed value to init all the randomness, to enforce reproducibility
"batch_size" : 32, # Training batch size
"num_workers" : 0, # CPU workers to prepare training batch, has to be 0 on Python 2.7
"layer_size" : 128, # Number of neurons in a direction of the recurrent neuron
"neuron_type" : "QLSTM", # Type of recurrent neurons, tested with LSTM (for Pytorch default backend), or QLSTM (for pytorch-quantization backend)
"target_height" : 32, # Height to which the dataset images are resized, translates to input size of the recurrent neuron
"epochs" : 4000, # Number of training epochs
"lr" : 1e-4, # Starting learning rate
"lr_schedule" : "FIXED", # Learning rate policy, allowed values are STEP/FIXED
"lr_step" : 40, # Step size in number of epochs for STEP lr policy
"lr_gamma" : 0.5, # Gamma value for STEP lr policy
"max_norm" : 400, # Max value for gradient clipping
"seq_to_random_threshold": 20, # Number epochs after which the training batches switch from being taken in increasing order of sequence length (where sequence length means image width for OCR) to being sampled randomly
"bidirectional" : true, # Enable bidirectional recurrent layer
"reduce_bidirectional": "CONCAT", # How to reduce the two output sequences coming out of a bidirectional (if enabled) recurrent layer, allowed values are SUM/CONCAT
"recurrent_bias_enabled": true, # Enable bias in reccurent layer
"checkpoint_interval": 10, # Internal in number of epochs after which a checkpoint of the model is saved
"recurrent_weight_bit_width": 32, # Number of bits to which the recurrent layer's weights are quantized
"recurrent_weight_quantization": "FP", # Quantization strategy for the recurrent layer's weights
"recurrent_bias_bit_width": 32, # Number of bits to which the recurrent layer's bias is quantized
"recurrent_bias_quantization": "FP", # Quantization strategy for the recurrent layer's bias
"recurrent_activation_bit_width": 32, # Number of bits to which the recurrent layer's activations (along both the recurrency and the output path) are quantized
"recurrent_activation_quantization": "FP", # Quantization strategy for the recurrent layer's activation
"internal_activation_bit_width": 32, # Number of bits to which the recurrent layer internal non-linearities (sigmoid and tanh) are quantized
"fc_weight_bit_width": 32, # Number of bits to which the fully connected layer's weights are quantized
"fc_weight_quantization": "FP", # Quantization strategy for the fully connected layer's weights
"fc_bias_bit_width": 32, # Number of bits to which the fully connected layer's bias is quantized
"fc_bias_quantization": "FP", # Quantization strategy for the fully connected layer's bias
"quantize_input": true, # Quantize the input according to the recurrent_activation bit width and quantization
"mask_padded": true, # Mask output values coming from padding of the input sequence
"prefused_bn_fc": false # Signal that batch norm and the fully connected layer have to be considered as fused already, so that the batch norm step is skipped.
Currently dropout is not implemented, which is why the best set of weights (w.r.t. validation accuracy) is tracked and saved to disk at every improvement (with a single epoch granularity).
Training a model that can be reproduced accurately in hardware requires at least one retraining step. The reason is that the batch norm coefficients have to be fused into the quantized fully connected layer, since batch norm is not implemented in hardware. The suggested strategy goes as follows (a command sketch is given after the two steps below):
- Train with quantized input, quantized recurrent weights/activations/bias, full-precision internal activations and full-precision fully connected layer, either from scratch or from a pretrained full precision model.
- Perform batch norm-fully connected fusion and retrain with quantized internal activations and quantized fully connected layer.
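A command sketch of this two-step flow follows; the .json file names and the checkpoint path are placeholders for configurations and outputs of your own runs, and how the batch norm-fully connected fusion itself is performed is outside the scope of this sketch:
python main.py -p /path/to/step1_quantized_recurrent_fp_fc.json
python main.py -i /path/to/step1_checkpoint --pretrained_policy RETRAIN -p /path/to/step2_fully_quantized_prefused.json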
To start training a full-precision model with default hyperparameters, simply run:
python main.py
To train a quantized QLSTM128_W2B8A4I32_FC_W32B32 model with the included hyperparameters, run:
python main.py -p quantized_params/QLSTM128_W2B8A4I32_FC_W32B32.json