Authors: Ramakrishna Vedantam, C. Lawrence Zitnick, Devi Parikh
Where published: CVPR 2015
ArXiv: 1411.5726
Abstract URL: http://arxiv.org/abs/1411.5726v2
Automatically describing an image with a sentence is a long-standing
challenge in computer vision and natural language processing. Due to recent
progress in object detection, attribute classification, action recognition,
etc., there is renewed interest in this area. However, evaluating the quality
of descriptions has proven to be challenging. We propose a novel paradigm for
evaluating image descriptions that uses human consensus. This paradigm consists
of three main parts: a new triplet-based method of collecting human annotations
to measure consensus, a new automated metric (CIDEr) that captures consensus,
and two new datasets, PASCAL-50S and ABSTRACT-50S, that contain 50 sentences
describing each image. Our simple metric captures human judgment of consensus
better than existing metrics across sentences generated by various sources. We
also evaluate five state-of-the-art image description approaches using this new
protocol and provide a benchmark for future comparisons. A version of CIDEr
named CIDEr-D is available as part of the MS COCO evaluation server to enable
systematic evaluation and benchmarking.
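To make the consensus idea concrete, below is a minimal sketch of a consensus-style metric in the spirit of CIDEr: a candidate caption is scored by its average TF-IDF weighted n-gram cosine similarity against a set of reference sentences. This is an illustrative simplification, not the paper's exact CIDEr definition (which averages over n-gram lengths 1 through 4, applies stemming, and computes IDF over the full corpus of images rather than a single image's references); all function names here are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def ngram_counts(sentence, n=1):
    """Count word n-grams in a lowercased, whitespace-tokenised sentence."""
    toks = sentence.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def tfidf_vector(counts, doc_freq, num_docs):
    """Term frequency weighted by inverse document frequency.

    Simplification: IDF is computed over this image's references only,
    so n-grams shared by every reference get zero weight.
    """
    total = sum(counts.values()) or 1
    return {g: (c / total) * log(num_docs / max(doc_freq.get(g, 0), 1))
            for g, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def consensus_score(candidate, references, n=1):
    """Average TF-IDF cosine similarity of candidate vs. each reference."""
    doc_freq = Counter()
    for ref in references:
        doc_freq.update(set(ngram_counts(ref, n)))
    num_docs = len(references)
    cand_vec = tfidf_vector(ngram_counts(candidate, n), doc_freq, num_docs)
    ref_vecs = [tfidf_vector(ngram_counts(r, n), doc_freq, num_docs)
                for r in references]
    return sum(cosine(cand_vec, rv) for rv in ref_vecs) / num_docs
```

The TF-IDF weighting is what encodes consensus: n-grams that many annotators use for an image are informative, while n-grams common across all references (or rare, idiosyncratic ones) contribute less, which is why datasets with many sentences per image (such as the 50 in PASCAL-50S and ABSTRACT-50S) make the measurement more stable.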