Authors: Jiong Gong, Haihao Shen, Guoming Zhang, Xiaoli Liu, Shane Li, Ge Jin, Niharika Maheshwari, Evarist Fomenko, Eden Segal
ArXiv: 1805.08691
Abstract URL: http://arxiv.org/abs/1805.08691v1
High-throughput and low-latency inference of deep neural networks is critical
for deploying deep learning applications. This paper presents the efficient
inference techniques of IntelCaffe, the first Intel-optimized deep learning
framework to support efficient 8-bit low-precision inference and model
optimization techniques for convolutional neural networks on Intel Xeon
Scalable Processors. The 8-bit optimized model is generated automatically from
the FP32 model through a calibration process, with no fine-tuning or retraining
required. We show that inference throughput and latency with ResNet-50,
Inception-v3, and SSD improve by 1.38X-2.9X and 1.35X-3X respectively, with
negligible accuracy loss relative to the IntelCaffe FP32 baseline, and by
56X-75X and 26X-37X over BVLC Caffe. All of these techniques have been
open-sourced in the IntelCaffe GitHub repository, and an artifact is provided
to reproduce the results on the Amazon AWS Cloud.
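
To make the calibration idea concrete, the following is a minimal Python/NumPy
sketch of symmetric per-tensor INT8 quantization with max-abs calibration. It
is an illustration under stated assumptions, not IntelCaffe's actual tool: the
function names (calibrate_scale, quantize, dequantize) are hypothetical, and
the paper's calibration process may select ranges differently (e.g. via
sampled activation statistics rather than a plain maximum).

import numpy as np

def calibrate_scale(fp32_batches):
    # Derive a per-tensor INT8 scale from sampled FP32 calibration data.
    # Max-abs ("naive") calibration: map [-max_abs, max_abs] onto int8.
    max_abs = max(np.abs(batch).max() for batch in fp32_batches)
    return 127.0 / max_abs

def quantize(tensor, scale):
    # Symmetric linear quantization of an FP32 tensor to INT8.
    return np.clip(np.round(tensor * scale), -128, 127).astype(np.int8)

def dequantize(q_tensor, scale):
    # Recover an FP32 approximation from the INT8 values.
    return q_tensor.astype(np.float32) / scale

# Toy usage: calibrate on a few sample batches, then quantize new data.
rng = np.random.default_rng(0)
calib_batches = [rng.normal(size=(8, 64)).astype(np.float32) for _ in range(4)]
scale = calibrate_scale(calib_batches)
x = rng.normal(size=(8, 64)).astype(np.float32)
x_q = quantize(x, scale)
print("max abs round-trip error:", np.abs(dequantize(x_q, scale) - x).max())

Because the scale is fixed after calibration, no fine-tuning or retraining of
the model weights is needed, which is the property the abstract highlights.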