Authors: Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora
ArXiv: 1811.08890
Abstract URL: http://arxiv.org/abs/1811.08890v2
Abstract:
An increasing number of datasets contain multiple views, such as video, sound
and automatic captions. A basic challenge in representation learning is how to
leverage multiple views to learn better representations. This is further
complicated by the existence of a latent alignment between views, such as
between speech and its transcription, and by the multitude of choices for the
learning objective. We explore an advanced, correlation-based representation
learning method on a 4-way parallel, multimodal dataset, and assess the quality
of the learned representations on retrieval-based tasks. We show that the
proposed approach produces rich representations that capture most of the
information shared across views. Our best models for speech and textual
modalities achieve retrieval rates from 70.7% to 96.9% on open-domain,
user-generated instructional videos. This shows it is possible to learn
reliable representations across disparate, unaligned and noisy modalities, and
encourages using the proposed approach on larger datasets.
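The abstract describes correlation-based multiview representation learning evaluated with cross-view retrieval. As a rough illustration of that idea only, the sketch below fits a linear CCA (via scikit-learn, an assumed dependency) on two synthetic paired views and scores retrieval by cosine similarity in the shared space; the paper's actual deep model, features, and training objective are not reproduced here.

```python
# Minimal sketch, NOT the paper's method: linear CCA on two synthetic "views"
# followed by cross-view retrieval in the learned correlated space.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d1, d2, k = 500, 64, 48, 10

# A shared latent factor plus view-specific noise stands in for paired
# features (e.g. speech and text) describing the same video segment.
z = rng.normal(size=(n, k))
view_a = z @ rng.normal(size=(k, d1)) + 0.5 * rng.normal(size=(n, d1))
view_b = z @ rng.normal(size=(k, d2)) + 0.5 * rng.normal(size=(n, d2))

# Fit CCA to find projections that maximize correlation between the views.
cca = CCA(n_components=k)
proj_a, proj_b = cca.fit_transform(view_a, view_b)

# Cross-view retrieval: for each item in view A, pick the nearest item in
# view B by cosine similarity and check whether it is the true pair.
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sim = l2_normalize(proj_a) @ l2_normalize(proj_b).T
recall_at_1 = np.mean(sim.argmax(axis=1) == np.arange(n))
print(f"recall@1: {recall_at_1:.3f}")
```

The retrieval metric here (recall@1 over paired items) mirrors the kind of retrieval-based evaluation the abstract mentions, but the reported 70.7% to 96.9% rates come from the paper's models and dataset, not from this toy example.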