Authors: Rachid Riad, Corentin Dancette, Julien Karadayi, Neil Zeghidour, Thomas Schatz, Emmanuel Dupoux
ArXiv: 1804.11297
Abstract URL: http://arxiv.org/abs/1804.11297v2
Recent studies have investigated Siamese network architectures for learning
invariant speech representations using same-different side information at the
word level. Here we systematically investigate an often-ignored component of
Siamese networks: the sampling procedure (how pairs of same vs. different
tokens are selected). We show that sampling strategies taking into account
Zipf's law, the distribution of speakers, and the proportions of same and
different pairs of words significantly impact the performance of the network.
In particular, we show that word frequency compression improves learning across
a large range of variations in the number of training pairs. This effect does
not apply to the same extent in the fully unsupervised setting, where the pairs
of same-different words are obtained by spoken term discovery. We apply these
results to pairs of words discovered using an unsupervised algorithm and show
an improvement over the state of the art in unsupervised representation
learning with Siamese networks.
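
To make the "word frequency compression" idea concrete, below is a minimal sketch of how a compressed sampling distribution for same-word pairs might look. The abstract does not specify the compression function, so the power-law exponent `alpha` and the helper names (`compressed_sampling_weights`, `sample_same_pair`) are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import Counter

def compressed_sampling_weights(word_tokens, alpha=0.5):
    """Per-type sampling weights that compress a Zipfian frequency
    distribution: weight(w) = count(w) ** alpha.

    alpha = 1.0 reproduces raw token frequencies (frequent words
    dominate the pairs); alpha = 0.0 samples word types uniformly;
    intermediate values flatten the distribution in between.
    NOTE: the exponent form is an assumption for illustration.
    """
    counts = Counter(word_tokens)
    return {w: c ** alpha for w, c in counts.items()}

def sample_same_pair(tokens_by_type, weights, rng=random):
    """Draw a 'same' pair: pick a word type under the compressed
    distribution, then two distinct tokens of that type."""
    # Only types with at least two tokens can yield a same pair.
    types = [w for w, toks in tokens_by_type.items() if len(toks) >= 2]
    probs = [weights[w] for w in types]
    word = rng.choices(types, weights=probs, k=1)[0]
    return tuple(rng.sample(tokens_by_type[word], 2))

# Hypothetical usage: tokens_by_type maps a word type to its acoustic
# token identifiers, e.g. {"the": ["the_01", "the_02", ...], ...}.
tokens_by_type = {
    "the": ["the_01", "the_02", "the_03", "the_04"],
    "dog": ["dog_01", "dog_02"],
    "ran": ["ran_01", "ran_02"],
}
all_tokens = [t.split("_")[0] for toks in tokens_by_type.values() for t in toks]
weights = compressed_sampling_weights(all_tokens, alpha=0.5)
print(sample_same_pair(tokens_by_type, weights))
```

With alpha below 1, rare words are drawn more often than their raw token frequency would dictate, which is one plausible reading of how compressing the Zipfian distribution exposes the network to a more balanced set of word types during training.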