Authors: Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
Where published: CVPR 2018
arXiv: 1707.07998
Abstract URL: http://arxiv.org/abs/1707.07998v3
Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of
117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of
the method, applying the same approach to VQA we obtain first place in the 2017
VQA Challenge.
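As a concrete illustration of the mechanism the abstract describes, the sketch below shows one common way to implement soft top-down attention over a fixed set of bottom-up region features. This is a minimal PyTorch sketch under stated assumptions, not the authors' released implementation: the class name, layer dimensions, and the choice of 36 regions per image are illustrative, with region features standing in for Faster R-CNN outputs and the query vector standing in for task context such as a caption-LSTM state or an encoded question.

```python
import torch
import torch.nn as nn

class TopDownAttention(nn.Module):
    """Soft top-down attention over bottom-up region features.

    Illustrative sketch: V holds per-region feature vectors from a
    detector (e.g. Faster R-CNN); h is a top-down context vector.
    Dimensions are assumptions, not the paper's exact configuration.
    """
    def __init__(self, region_dim=2048, query_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)
        self.proj_h = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, V, h):
        # V: (batch, k, region_dim) -- k salient regions per image
        # h: (batch, query_dim)     -- top-down context vector
        # Score each region against the context, then normalize.
        joint = torch.tanh(self.proj_v(V) + self.proj_h(h).unsqueeze(1))
        alpha = torch.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, k)
        # Attended image feature: attention-weighted sum of region features.
        v_hat = (alpha.unsqueeze(-1) * V).sum(dim=1)  # (batch, region_dim)
        return v_hat, alpha

# Usage with dummy tensors; 36 regions per image is one setting the
# paper reports for its fixed-region variant.
V = torch.randn(2, 36, 2048)
h = torch.randn(2, 512)
v_hat, alpha = TopDownAttention()(V, h)
print(v_hat.shape, alpha.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```

The design point the sketch makes explicit is the division of labor in the abstract: the detector fixes *where* attention can land (a small set of object-level regions rather than a uniform CNN grid), while the learned top-down scorer only decides *how much* weight each proposed region receives.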