Unsupervised learning for text-to-speech synthesis
dc.contributor.advisor
King, Simon
en
dc.contributor.advisor
Clark, Robert
en
dc.contributor.advisor
Yamagishi, Junichi
en
dc.contributor.author
Watts, Oliver Samuel
en
dc.contributor.sponsor
Engineering and Physical Sciences Research Council (EPSRC)
en
dc.date.accessioned
2013-10-22T14:32:59Z
dc.date.available
2013-10-22T14:32:59Z
dc.date.issued
2013-07-02
dc.description.abstract
This thesis introduces a general method for incorporating the distributional analysis
of textual and linguistic objects into text-to-speech (TTS) conversion systems.
Conventional TTS conversion uses intermediate layers of representation to bridge
the gap between text and speech. Collecting the annotated data needed to produce
these intermediate layers is a far from trivial task, possibly prohibitively so
for languages in which no such resources exist. Distributional analysis,
in contrast, proceeds in an unsupervised manner, and so enables systems to be
built from unannotated textual data. The method therefore aids
the building of systems for languages in which conventional linguistic resources
are scarce, but is not restricted to these languages.
The distributional analysis proposed here places the textual objects analysed
in a continuous-valued space, rather than specifying a hard categorisation of those
objects. This space is then partitioned during the training of acoustic models for
synthesis, so that the models generalise over objects' surface forms in a way that
is acoustically relevant.
The method is applied to three levels of textual analysis: to the characterisation
of sub-syllabic units, word units and utterances. Entire systems for three
languages (English, Finnish and Romanian) are built with no reliance on manually
labelled data or language-specific expertise. Results of a subjective evaluation
are presented.
en
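A minimal sketch of the kind of distributional analysis the abstract describes, assuming an LSA-style pipeline in Python (window-based co-occurrence counting followed by truncated SVD); the function word_features, its parameters, and the toy corpus are illustrative assumptions, not the thesis's exact procedure.

    # Sketch: continuous-valued word features from unannotated text alone.
    # Assumed pipeline: co-occurrence counts -> log damping -> truncated SVD.
    from collections import Counter

    import numpy as np

    def word_features(sentences, dim=3, window=1):
        """Embed each word in a continuous-valued space, no labels needed."""
        vocab = sorted({w for s in sentences for w in s})
        index = {w: i for i, w in enumerate(vocab)}
        counts = Counter()
        for s in sentences:
            for i, w in enumerate(s):
                # Count neighbours within the given window on either side.
                for j in range(max(0, i - window), min(len(s), i + window + 1)):
                    if j != i:
                        counts[(index[w], index[s[j]])] += 1
        m = np.zeros((len(vocab), len(vocab)))
        for (r, c), n in counts.items():
            m[r, c] = np.log1p(n)  # dampen raw counts
        # Rows of u[:, :dim] scaled by the singular values are the features.
        u, sv, _ = np.linalg.svd(m, full_matrices=False)
        return {w: u[index[w], :dim] * sv[:dim] for w in vocab}

    sents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
    feats = word_features(sents, dim=2)
    print(feats["cat"])  # continuous-valued feature vector, no annotation used

Note that in the approach the abstract describes, such continuous-valued features are not converted to hard categories in advance: the feature space is partitioned during acoustic model training (for example, by decision-tree questions that threshold feature dimensions), so the partition is chosen for its acoustic relevance.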
dc.identifier.uri
http://hdl.handle.net/1842/7982
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
O. Watts, J. Yamagishi, and S. King. The role of higher-level linguistic features in HMM-based speech synthesis. In Proc. Interspeech, pages 841-844, Makuhari, Japan, Sept. 2010.
en
dc.relation.hasversion
O. Watts, J. Yamagishi, and S. King. Letter-based speech synthesis. In Proc. Speech Synthesis Workshop 2010, pages 317-322, Nara, Japan, Sept. 2010.
en
dc.relation.hasversion
O. Watts, J. Yamagishi, and S. King. Unsupervised continuous-valued word features for phrase-break prediction without a part-of-speech tagger. In Proc. Interspeech, Florence, Italy, Aug. 2011.
en
dc.relation.hasversion
J. Yamagishi and O. Watts. The CSTR/EMIME HTS system for Blizzard Challenge 2010. In Proc. Blizzard Challenge 2010, Sept. 2010.
en
dc.subject
unsupervised learning
en
dc.subject
vector space model
en
dc.subject
speech synthesis
en
dc.subject
TTS
en
dc.subject
text-to-speech
en
dc.title
Unsupervised learning for text-to-speech synthesis
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Watts2013.pdf
- Size: 1.43 MB
- Format: Adobe Portable Document Format