Thousands of Voices for HMM-based Speech Synthesis - Analysis and Application of TTS Systems Built on Various ASR Corpora

Yamagishi, Junichi; Usabaev, Bela; King, Simon; Watts, O.; Dines, J.; Tian, J.; Hu, R.; Guan, Y.; Oura, K.; Tokuda, K.; Karhila, R.; Kurimo, Mikko

doi:10.1109/TASL.2010.2045237

dc.contributor.author

Yamagishi, Junichi

en

dc.contributor.author

Usabaev, Bela

en

dc.contributor.author

King, Simon

en

dc.contributor.author

Watts, O.

en

dc.contributor.author

Dines, J.

en

dc.contributor.author

Tian, J.

en

dc.contributor.author

Hu, R.

en

dc.contributor.author

Guan, Y.

en

dc.contributor.author

Oura, K.

en

dc.contributor.author

Tokuda, K.

en

dc.contributor.author

Karhila, R.

en

dc.contributor.author

Kurimo, Mikko

en

dc.date.accessioned

2010-12-20T15:17:29Z

dc.date.available

2010-12-20T15:17:29Z

dc.date.issued

2010

dc.date.updated

2010-12-20T15:17:30Z

dc.description.abstract

In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an "average voice model" plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on "non-TTS" corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, GlobalPhone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

en

dc.extent.pageNumbers

984-1004

en

dc.identifier.doi

10.1109/TASL.2010.2045237

dc.identifier.uri

http://dx.doi.org/10.1109/TASL.2010.2045237

dc.identifier.uri

http://hdl.handle.net/1842/4551

dc.publisher

IEEE

en

dc.title

Thousands of Voices for HMM-based Speech Synthesis - Analysis and Application of TTS Systems Built on Various ASR Corpora

en

dc.type

Article

en

rps.issue

5

en

rps.title

IEEE Transactions on Audio, Speech, and Language Processing

en

rps.volume

18

en

Files

Original bundle

Name: 05431023.pdf
Size: 2.5 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

CSTR publications