Authors: Peratham Wiriyathammabhum, Abhinav Shrivastava, Vlad I. Morariu, Larry S. Davis
Where published: WS 2019
ArXiv: 1904.03885
Abstract URL: http://arxiv.org/abs/1904.03885v1
This paper presents a new task, the grounding of spatio-temporal identifying
descriptions in videos. Previous work suggests potential bias in existing
datasets and emphasizes the need for a new data creation schema to better model
linguistic structure. We introduce a new data collection scheme based on
grammatical constraints for surface realization to enable us to investigate the
problem of grounding spatio-temporal identifying descriptions in videos. We
then propose a two-stream modular attention network that learns and grounds
spatio-temporal identifying descriptions based on appearance and motion. We
show that the motion modules help ground motion-related words and also improve
learning in the appearance modules, since the modular design resolves task
interference between modules. Finally, we pose a future challenge: building a
system robust enough to replace ground-truth visual annotations with an
automatic video object detector and automatic temporal event localization.
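The abstract does not spell out the two-stream architecture, but the core idea of modular attention over appearance and motion can be sketched as follows. This is a minimal illustration with random weights, not the paper's implementation: the dimensions, the per-module attention queries (`q_app`, `q_mot`), and the gating weights are all hypothetical stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: T words in the description, D-dim embeddings,
# R candidate spatio-temporal regions (tubes) to ground the description in.
T, D, R = 6, 16, 4

word_emb = rng.normal(size=(T, D))     # word embeddings of the description
appear_feat = rng.normal(size=(R, D))  # appearance features per candidate
motion_feat = rng.normal(size=(R, D))  # motion features per candidate

def module_score(words, feats, query):
    """One module: attend over words, then match the attended phrase
    embedding against each candidate's visual features."""
    attn = softmax(words @ query)      # (T,) word-attention weights
    phrase = attn @ words              # (D,) attended phrase embedding
    return feats @ phrase              # (R,) matching score per candidate

# Stand-ins for each module's learned attention parameters.
q_app = rng.normal(size=D)
q_mot = rng.normal(size=D)

s_app = module_score(word_emb, appear_feat, q_app)  # appearance stream
s_mot = module_score(word_emb, motion_feat, q_mot)  # motion stream

# Hypothetical gate combining the two streams; in a trained model this
# would be predicted from the sentence.
w = softmax(rng.normal(size=2))
scores = w[0] * s_app + w[1] * s_mot
pred = int(np.argmax(scores))  # index of the grounded candidate
```

Keeping the two streams as separate modules with their own word attention is what lets motion-related words be handled by the motion stream without interfering with appearance matching, which is the task-interference point the abstract makes.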