Authors: Mahdi Khademian, Mohammad Mehdi Homayounpour
ArXiv: 1610.01367
Abstract URL: http://arxiv.org/abs/1610.01367v1
A Pascal challenge entitled monaural multi-talker speech recognition was
developed, targeting the problem of robust automatic speech recognition against
speech-like noises, which significantly degrade the performance of automatic
speech recognition systems. In this challenge, two competing speakers say a
simple command simultaneously, and the objective is to recognize the speech of
the target speaker. Surprisingly, during the challenge, a team from IBM
Research achieved performance better than that of human listeners on this task.
The method proposed by the IBM team consists of an intermediate speech
separation stage followed by single-talker speech recognition. This paper
reconsiders the task of this challenge based on gain-adapted factorial speech
processing models. It develops a joint-token-passing algorithm for direct
utterance decoding of both the target and masker speakers simultaneously.
Compared to the challenge winner, it uses maximum uncertainty during decoding,
which could not be exploited by the earlier two-phase method. It provides a
detailed derivation of inference on these models based on general inference
procedures for probabilistic graphical models. As another improvement, it uses
deep neural networks for joint speaker identification and gain estimation,
which makes these two steps simpler than before while producing competitive
results for them. The proposed method outperforms the past super-human results
and even the results recently achieved by Microsoft Research using deep neural
networks. It achieves a 5.5% absolute task-performance improvement over the
first super-human system and a 2.7% absolute improvement over its recent
competitor.
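
To make the joint-decoding idea concrete: instead of separating the mixture and then decoding one speaker, both speakers are decoded in a single pass over a factorial (paired) state space. The sketch below is a minimal, illustrative Python example of joint Viterbi decoding over pairs of speaker states, assuming toy Gaussian log-spectral models and a max-model interaction approximation; the state counts, parameters, and interaction function are placeholders for illustration only, not the authors' gain-adapted factorial models or their token-passing implementation.

```python
import numpy as np

# Illustrative sketch: joint Viterbi decoding over the Cartesian product of two
# speakers' HMM states. All parameters below are random placeholders, not the
# trained factorial models described in the paper.

rng = np.random.default_rng(0)

T, D = 50, 20          # frames, log-spectral dimension (toy sizes)
S1, S2 = 4, 4          # states per speaker (toy sizes)

# Toy per-state mean log-spectra for each speaker.
mu1 = rng.normal(size=(S1, D))
mu2 = rng.normal(size=(S2, D))

def random_trans(n):
    # Row-normalized transition matrix with a self-loop bias.
    A = rng.random((n, n)) + np.eye(n) * n
    return A / A.sum(axis=1, keepdims=True)

A1, A2 = random_trans(S1), random_trans(S2)

# Observed mixture log-spectra (random placeholder data).
y = rng.normal(size=(T, D))

def joint_loglik(y_t, m1, m2, sigma2=1.0):
    # Max-model approximation: the mixture log-spectrum is modeled as the
    # elementwise max of the two source log-spectra.
    pred = np.maximum(m1, m2)
    return -0.5 * np.sum((y_t - pred) ** 2) / sigma2

# Pairwise transition scores over joint states (s1, s2).
logA = np.log(A1)[:, None, :, None] + np.log(A2)[None, :, None, :]  # (S1,S2,S1,S2)
delta = np.full((T, S1, S2), -np.inf)
back = np.zeros((T, S1, S2, 2), dtype=int)

for i in range(S1):
    for j in range(S2):
        delta[0, i, j] = joint_loglik(y[0], mu1[i], mu2[j])

for t in range(1, T):
    for i in range(S1):
        for j in range(S2):
            scores = delta[t - 1] + logA[:, :, i, j]
            p, q = np.unravel_index(np.argmax(scores), scores.shape)
            delta[t, i, j] = scores[p, q] + joint_loglik(y[t], mu1[i], mu2[j])
            back[t, i, j] = (p, q)

# Backtrack the best joint path; both speakers are decoded simultaneously.
path = np.zeros((T, 2), dtype=int)
path[-1] = np.unravel_index(np.argmax(delta[-1]), (S1, S2))
for t in range(T - 1, 0, -1):
    path[t - 1] = back[t, path[t, 0], path[t, 1]]

print("joint state path (target, masker), first 10 frames:")
print(path[:10])
```

In this toy form the search is exhaustive over all state pairs, which is exactly what token-passing schemes such as the one developed in the paper are designed to organize efficiently for full recognition networks.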