Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

lib:422bb62e3ec3d501 (v1.0.0)

Authors: Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, Amit K. Roy-Chowdhury
Where published: ICMR 2018 6
Artifact development version: GitHub

Abstract URL: https://dl.acm.org/citation.cfm?id=3206064

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there have been a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task, in contrast, has not been explored to its fullest extent. In this paper, we study how to effectively utilize the multimodal cues available in videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously utilizes multimodal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the joint embedding and propose a modified pairwise ranking loss for the retrieval task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves a significant performance gain over state-of-the-art approaches.
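
For readers unfamiliar with ranking losses in joint embedding spaces, the following is a minimal sketch of a generic bidirectional pairwise (hinge) ranking loss. It is not the modified loss proposed in the paper, which the abstract does not specify. The function names, the use of cosine similarity, and the margin value of 0.2 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(video_emb, text_emb, margin=0.2):
    """Generic bidirectional hinge ranking loss (illustrative, hypothetical).

    video_emb[i] and text_emb[i] are assumed to be a matching pair that
    already lives in the same joint embedding space. Every other item in
    the batch is treated as a negative example.
    """
    n = len(video_emb)
    # sim[i][j] is the similarity between video i and text j.
    sim = [[cosine(v, t) for t in text_emb] for v in video_emb]
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Video i as query: a wrong text j should score at least
            # `margin` below the matching text.
            loss += max(0.0, margin - pos + sim[i][j])
            # Text i as query: a wrong video j should score at least
            # `margin` below the matching video.
            loss += max(0.0, margin - pos + sim[j][i])
    return loss / n


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(pairwise_ranking_loss(videos, texts))
```

The loss is zero when every matching pair outscores all mismatched pairs by at least the margin, and it grows as mismatched pairs move closer together in the embedding space. In practice the embeddings would come from learned video and text encoders, and the loss would be computed with an automatic differentiation framework rather than plain Python lists.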
