When using the TensorFlow Python API methods above to implement model quantization, I encountered several problems with both post-training quantization and quantization-aware training. I also found online that some of these problems are very common; a lot of people were seeing the exact same errors as I was. After finally getting the code to run correctly, I decided to open this repository to share what I have learned.
This is a tutorial on model quantization with TensorFlow, with suggestions based on personal experience. The TensorFlow version used in the examples is 1.14, and the models are built with tf.keras. Since the TensorFlow team has said they are working on a new package, TensorFlow Model Optimization, which includes new implementations of model quantization per their roadmap, I will keep looking for their updates and merge them into this repository where possible.
The last modification here was on 10/30/2020. According to the roadmap, Post-training quantization for dynamic-range kernels (post-training weight quantization), Post-training quantization for (8b) fixed-point kernels (post-training integer quantization), and Quantization-aware training for (8b) fixed-point kernels and experimentation for <8b have all been launched.
You are welcome to comment with any issues, concerns, and suggestions, as well as anything regarding TensorFlow updates. If you find this repository useful, I would be grateful if you starred 🌟 it.
There is another quantization tool, QKeras, launched by a team from Google. It has a Keras-like interface and supports common CNN and RNN layers. It is a good tool for advanced quantization research, but it is not compatible with tf.lite for deployment.
Post Training Quantization for Hybrid Kernels now has a new official name: Post training quantization for dynamic-range kernels.
The TensorFlow Model Optimization package now contains a new tool to perform quantization-aware training, and here is the guide. By default, this new tool produces a quantization-aware trained model with hybrid kernels, where only the weights are stored as fixed-point values; the biases are stored in float and the model takes float input. Based on my experience, it only supports a subset of layers/operations. For example, batch normalization layers (fused and unfused) are not supported. However, I think (I did not try) one could write a customized layer that performs a simplified BN operation and integrate it with the model based on this comprehensive guide, which also covers some experimental features of the new tool.
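For reference, here is a minimal sketch of the new tool's entry point, assuming TF 2.x and the tensorflow-model-optimization package are installed (the toy two-layer model is only an illustration, not part of this repo):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy model for illustration only; replace with your own tf.keras model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])

# Wrap supported layers with quantization-aware wrappers. Unsupported layers
# (see the note above) will raise an error at this step.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer="adam", loss="mse")
# Train q_aware_model as usual, then convert it with tf.lite.TFLiteConverter.
```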
To train a fixed-point model, consider the quantization-aware training method here in this repo. It still works with TensorFlow 1.14, but it will not work with newer versions, due to the removal of tf.contrib and some changes to tf.lite.TFLiteConverter.
Efficiency at inference time is critical when deploying machine learning models to devices with limited resources, such as IoT edge nodes and mobile devices. Model quantization improves inference efficiency by converting the variables inside a model (usually float32) into data types with fewer bits (uint8, int8, float16, etc.), which helps overcome constraints on energy consumption, storage capacity, and computation power.
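As a toy illustration of the idea (plain NumPy, not a TensorFlow API), 8-bit affine quantization maps a float tensor onto 256 integer levels via a scale and a zero point:

```python
import numpy as np

# A float value x is mapped to an integer q via q = round(x / scale) + zero_point.
x = np.array([-1.0, 0.0, 0.5, 2.0], dtype=np.float32)
scale = (x.max() - x.min()) / 255.0          # spread the float range over 256 levels
zero_point = int(round(-x.min() / scale))    # the integer that represents float 0.0

q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
x_dequant = (q.astype(np.float32) - zero_point) * scale   # approximate recovery of x
```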
TensorFlow supports two levels of model quantization in general (see this link):
- Post-training quantization (quantization after training)
- Quantization-aware training (training with quantization)
In general, quantization in TensorFlow uses tf.lite.TFLiteConverter to convert a floating-point model to a tf.lite model. A tf.lite model contains some or only fixed-point values. Some parameters of tf.lite.TFLiteConverter can be tuned to indicate the expected quantization method.
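As a minimal sketch of this flow (TF 1.x API, assuming a tf.keras model saved to a hypothetical file model.h5), a plain float conversion with no quantization looks roughly like this:

```python
import tensorflow as tf

# Convert a saved tf.keras model to a float tf.lite model (no quantization yet).
converter = tf.lite.TFLiteConverter.from_keras_model_file("model.h5")
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```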
Post-training quantization directly converts a trained model into a hybrid or fully-quantized tf.lite model using tf.lite.TFLiteConverter, with some degradation in model accuracy. Post-training weight quantization only quantizes the model weights (convolution kernels, dense layer weights, etc.) to reduce model size and speed up computation by allowing hybrid operations (a mix of fixed- and floating-point math). Post-training integer quantization fully quantizes the model to support fixed-point-only hardware accelerators.
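Here is a hedged sketch of both post-training options, again assuming the hypothetical model.h5 and a TF version whose converter exposes optimizations and representative_dataset; the input shape in the generator is a placeholder to adapt to your own model:

```python
import numpy as np
import tensorflow as tf

# Post-training weight (dynamic-range) quantization: only the weights become
# fixed-point; activations stay in float at runtime.
converter = tf.lite.TFLiteConverter.from_keras_model_file("model.h5")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
hybrid_tflite_model = converter.convert()

# Post-training integer quantization: a representative dataset lets the converter
# calibrate the min/max ranges of activations. Shape (1, 28, 28, 1) is assumed.
def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 28, 28, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model_file("model.h5")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
int_tflite_model = converter.convert()
```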
Quantization-aware training trains a model that can be fully quantized by tf.lite.TFLiteConverter with minimal accuracy loss. It uses tf.contrib.quantize to rewrite the training/eval graph and add fake quantization nodes. Fake quantization nodes simulate the errors introduced by quantization during training, so that these errors can be calibrated over the rest of the training process. They also record the min/max values required by tf.lite.TFLiteConverter. (Why do min/max matter? See here.)
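A rough sketch of the graph-rewriting step (TF 1.x only, since tf.contrib was removed in TF 2.x; build_model is a hypothetical placeholder for your own model-building code):

```python
import tensorflow as tf

# Insert fake quantization nodes into the training graph after the model is
# built but before the training op is created.
train_graph = tf.Graph()
with train_graph.as_default():
    model = build_model()  # hypothetical: build your tf.keras model here
    tf.contrib.quantize.create_training_graph(input_graph=train_graph,
                                               quant_delay=0)
    # ... create the optimizer / train_op and run training as usual ...

# Rewrite the eval graph the same way before exporting for tf.lite conversion.
eval_graph = tf.Graph()
with eval_graph.as_default():
    model = build_model()
    tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)
    # ... restore the trained weights, then freeze/export the graph ...
```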
Compared with quantization-aware training, post-training quantization is simpler to use, and it only requires an already-trained floating-point model. Based on the roadmap cited above, while quantization-aware training is still expected for some models with strict accuracy requirements, the TensorFlow team expects it to become rare as they improve the post-training quantization tools until the accuracy loss is negligible.
Quoting from TensorFlow:
In summary, a user should use “hybrid” post training quantization when targeting simple CPU size and latency improvements. When targeting greater CPU improvements or fixed-point accelerators, they should use this integer post training quantization tool, potentially using quantization-aware training if accuracy of a model suffers.
Please go to each directory for details about the three model quantization tools.