Fisher-Aware Adaptive Mixed-Precision Ternary Hybrid Quantization
Abstract
Deploying deep neural networks on resource-limited platforms is difficult because of their reliance on computationally and memory-intensive 32-bit floating-point operations. Although binary and ternary quantization approaches offer substantial memory reduction, applying the same precision to all layers degrades model performance, since individual layers differ in their sensitivity to quantization. To address these concerns, the Fisher-Aware Ternary Hybrid (FATH) quantization technique is presented. It builds upon binary neural networks, introducing ternary quantization while allowing a different precision level for each layer. Using layer-wise Fisher Information estimates as a proxy for the Hessian trace, FATH assigns each layer an appropriate precision: 16-bit float, 4-bit float, or ternary quantization. In the ternary representation, weights are normalized using an absolute-mean scaling factor and a threshold, and only the values −1, 0, and +1 are allowed, which filters out less important features. To ensure training stability in the presence of discontinuous parameters, quantization-aware training is implemented with the straight-through estimator. In addition to weight quantization, activations are quantized to 8 bits. The approach was evaluated on a subset of the Food-101 dataset and largely preserved model performance despite a significant reduction in model size.
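The abstract does not give implementation details, so the following PyTorch sketch only illustrates the kind of machinery it describes: ternarization with an absolute-mean scale and threshold, a straight-through estimator for quantization-aware training, and a layer-wise squared-gradient (diagonal Fisher) score for ranking layer sensitivity. All function and class names are hypothetical, and the 0.7 threshold constant follows the common ternary-weight-network heuristic rather than anything stated in the paper.

```python
import torch
import torch.nn.functional as F

def ternary_quantize(w: torch.Tensor, delta_factor: float = 0.7) -> torch.Tensor:
    """Map a weight tensor to {-alpha, 0, +alpha} (hypothetical sketch).

    The threshold is proportional to the mean absolute weight, and the
    scaling factor alpha is the mean magnitude of the surviving weights.
    """
    delta = delta_factor * w.abs().mean()                 # absolute-mean threshold
    mask = (w.abs() > delta).float()                      # keep only salient weights
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    ternary = torch.sign(w) * mask                        # values in {-1, 0, +1}
    return alpha * ternary

class TernaryLinearSTE(torch.nn.Linear):
    """Linear layer for quantization-aware training: the forward pass uses
    ternarized weights, while the straight-through estimator passes gradients
    unchanged to the latent full-precision weights."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = ternary_quantize(self.weight)
        # STE: forward value equals w_q, backward gradient flows to self.weight.
        w_ste = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w_ste, self.bias)

def fisher_sensitivity(model: torch.nn.Module, loss: torch.Tensor) -> dict:
    """Per-parameter mean squared gradient as a diagonal Fisher estimate,
    a common proxy for the Hessian trace when ranking layers by
    quantization sensitivity (hypothetical helper)."""
    params = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, [p for _, p in params], retain_graph=True)
    return {n: g.pow(2).mean().item() for (n, _), g in zip(params, grads)}
```

In a mixed-precision scheme like the one described, the sensitivity scores returned by `fisher_sensitivity` would be sorted and the most sensitive layers kept at 16-bit or 4-bit float while the least sensitive ones are ternarized; the exact assignment rule is not specified in the abstract.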