Q8BERT: Quantized 8Bit BERT

lib:7dc3143a8a1abf90 (v1.0.0)

Authors: Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat
ArXiv: 1910.06188

Abstract URL: https://arxiv.org/abs/1910.06188v2

Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvement in many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT2 and Megatron suggests a trend toward large pre-trained Transformer models. Using these large models in production environments is a complex task that requires a large amount of compute, memory and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
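The $4\times$ figure follows from storing each 32-bit floating-point weight as an 8-bit integer. The sketch below illustrates that arithmetic with a simple symmetric linear quantizer in NumPy. It is an illustration only, not the authors' implementation: the function names (`quantize_int8`, `dequantize`, `fake_quantize`), the per-tensor max-based scale, and the 768x768 random matrix are assumptions made for this example. It also shows only the forward pass; in quantization-aware training the rounding step is typically paired with a gradient approximation such as the straight-through estimator, which is not modeled here.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float array to int8 (illustrative)."""
    max_abs = float(np.max(np.abs(x)))
    # The scale maps the largest magnitude onto 127; guard against all-zero input.
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to float32 using the same scale."""
    return q.astype(np.float32) / scale

def fake_quantize(x):
    """Quantize then dequantize, so the output stays in floating point
    but carries the rounding error of 8-bit storage."""
    q, scale = quantize_int8(x)
    return dequantize(q, scale)

# Hypothetical weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
weights = rng.standard_normal((768, 768)).astype(np.float32)

q_weights, scale = quantize_int8(weights)

# 4 bytes per float32 weight versus 1 byte per int8 weight.
compression = weights.nbytes / q_weights.nbytes

# Worst-case reconstruction error is about half a quantization step.
max_error = float(np.max(np.abs(weights - dequantize(q_weights, scale))))

print(f"compression ratio: {compression:.1f}x")
print(f"max abs error: {max_error:.5f}, half step: {0.5 / scale:.5f}")
```

The storage ratio is fixed by the data types, while the reconstruction error depends on the value range of each tensor, which is why the choice of scale matters for accuracy.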
