Efficient Neural Networks through Systematic Quantization


At a glance

An important barrier to deploying deep neural networks is inference latency, a critical metric for autonomous driving. High latency means delayed recognition of a car, pedestrian, or barrier in front of an autonomous agent, leaving too little time to make an appropriate decision or maneuver. For this reason, practitioners have been forced to use very shallow networks with poor accuracy to avoid the latency problem. A promising approach to this problem is quantization, which shrinks the model so that it can fit on a fast custom accelerator or embedded hardware. However, current state-of-the-art quantization methods rely on ad hoc heuristics and incur the high computational cost of brute-force search. More importantly, the majority of these methods optimize only for model size and not for inference latency, which is actually more important than model size alone. Our goal is to systematically develop an end-to-end quantization framework that incorporates hardware design metrics, in particular latency and power consumption, along with an interpretable optimization routine, to determine the limit of quantization for a given model/dataset. We will consider model quantization for classification, object detection, and segmentation as our target problems.
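To make the core idea concrete, the sketch below shows the simplest form of quantization: uniform symmetric quantization of a weight tensor to low-bit integers. This is a generic illustration, not the proposed framework; the function names and the choice of NumPy are our own for exposition.

```python
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Uniform symmetric quantization of a float tensor to num_bits integers."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for int8
    scale = np.max(np.abs(w)) / qmax        # map the largest |w| to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the integer representation."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and check the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_uniform(w, num_bits=8)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step, scale / 2.
max_err = np.max(np.abs(w - w_hat))
```

Storing `q` (int8) instead of `w` (float32) reduces model size 4x; the harder questions this project targets are how such choices translate into latency and power on real hardware, and how many bits each layer can tolerate.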

Principal investigators / researchers
Kurt Keutzer
Michael Mahoney
Dr. Amir Gholami

Themes
Model Compression, Quantization, Stochastic Second-Order, Power Consumption, Latency