Authors: Anton Bakhtin, Arthur Szlam, Marc'Aurelio Ranzato, Edouard Grave
ArXiv: 1804.07705
Abstract URL: http://arxiv.org/abs/1804.07705v2
It is often the case that the best-performing language model is an ensemble of a neural language model with n-grams. In this work, we propose a method to improve how these two models are combined. By using a small network which predicts the mixture weight between the two models, we adapt their relative importance at each time step. Because the gating network is small, it trains quickly on small amounts of held-out data, and does not add overhead at scoring time. Our experiments carried out on the One Billion Word benchmark show a significant improvement over the state-of-the-art ensemble without retraining the basic modules.
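
To illustrate the kind of per-time-step gating the abstract describes, here is a minimal sketch in PyTorch. The choice of input features (the neural LM's hidden state), the single linear layer, and the name `GatingMixture` are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class GatingMixture(nn.Module):
    """Tiny gate that mixes a neural LM and an n-gram LM at each time step.

    Hypothetical sketch: the gate here is a single linear layer on the
    neural LM's hidden state; the paper's actual gating network may use
    different inputs and layers.
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden, log_p_neural, log_p_ngram):
        # hidden:        (batch, hidden_dim)  neural LM state at time t
        # log_p_neural:  (batch, vocab)       log-probs from the neural LM
        # log_p_ngram:   (batch, vocab)       log-probs from the n-gram LM
        g = torch.sigmoid(self.gate(hidden))  # (batch, 1) mixture weight in [0, 1]

        # Interpolate in probability space: p = g * p_neural + (1 - g) * p_ngram,
        # computed stably in log space via logsumexp.
        mix = torch.stack(
            [log_p_neural + torch.log(g), log_p_ngram + torch.log1p(-g)], dim=0
        )
        return torch.logsumexp(mix, dim=0)  # (batch, vocab) mixed log-probs
```

In this setup, only the gate's parameters would be trained (e.g., on held-out data, as the abstract suggests), with both base language models kept frozen, which is why training is fast and the combination adds essentially no cost at scoring time.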