Authors: Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi
ArXiv: 1804.02549
Abstract URL: http://arxiv.org/abs/1804.02549v1
Recent advances in speech synthesis suggest that limitations such as the
lossy nature of the amplitude spectrum with minimum phase approximation and the
over-smoothing effect in acoustic modeling can be overcome by using advanced
machine learning approaches. In this paper, we build a framework in which we
can fairly compare new vocoding and acoustic modeling techniques with
conventional approaches by means of a large-scale crowdsourced evaluation.
Results on acoustic models showed that generative adversarial networks and an
autoregressive (AR) model performed better than a conventional recurrent
network, and that the AR model performed best. Evaluation of vocoders using
the same AR acoustic model demonstrated that a WaveNet vocoder outperformed
classical source-filter-based vocoders. In particular, speech waveforms
generated by the combination of the AR acoustic model and the WaveNet vocoder
achieved a speech-quality score similar to that of vocoded speech.
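For readers unfamiliar with the WaveNet vocoder referenced above, the following is a minimal NumPy sketch of a dilated causal convolution, the building block that lets WaveNet-style models condition each waveform sample only on past samples. The filter values, layer count, and dilation schedule here are illustrative assumptions, not the configuration evaluated in the paper.

import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Causal 1-D convolution with dilation, as used in WaveNet-style models.

    x: input signal, shape (T,)
    w: filter taps, shape (K,); w[0] touches the most distant past sample
    dilation: spacing between taps; receptive field is (K - 1) * dilation + 1
    Returns y of shape (T,), where y[t] depends only on x[<= t] (causality).
    """
    T, K = len(x), len(w)
    # Left-pad so the output at time t never sees future samples.
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(T)
    for t in range(T):
        # Gather the K dilated taps ending at the current sample.
        taps = xp[t : t + pad + 1 : dilation]
        y[t] = np.dot(w, taps)
    return y

# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field
# exponentially, which is how such models capture long-range dependencies
# in raw waveforms. Filter taps below are placeholders for illustration.
x = np.random.randn(16)
h = x
for d in (1, 2, 4, 8):
    h = np.tanh(dilated_causal_conv1d(h, np.array([0.5, 0.5]), dilation=d))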