Unsupervised learning for text-to-speech synthesis
dc.contributor.advisor
King, Simon
en
dc.contributor.advisor
Clark, Robert
en
dc.contributor.advisor
Yamagishi, Junichi
en
dc.contributor.author
Watts, Oliver Samuel
en
dc.contributor.sponsor
Engineering and Physical Sciences Research Council (EPSRC)
en
dc.date.accessioned
2013-10-22T14:32:59Z
dc.date.available
2013-10-22T14:32:59Z
dc.date.issued
2013-07-02
dc.description.abstract
This thesis introduces a general method for incorporating the distributional analysis
of textual and linguistic objects into text-to-speech (TTS) conversion systems.
Conventional TTS conversion uses intermediate layers of representation to bridge
the gap between text and speech. Collecting the annotated data needed to produce
these intermediate layers is a far from trivial task, possibly prohibitively so
for languages in which no such resources exist. Distributional analysis,
in contrast, proceeds in an unsupervised manner, and so enables systems to be
built from unannotated textual data. The method therefore aids
the building of systems for languages in which conventional linguistic resources
are scarce, but is not restricted to these languages.
The distributional analysis proposed here places the textual objects analysed
in a continuous-valued space, rather than specifying a hard categorisation of those
objects. This space is then partitioned during the training of acoustic models for
synthesis, so that the models generalise over objects' surface forms in a way that
is acoustically relevant.
The method is applied to three levels of textual analysis: to the characterisation
of sub-syllabic units, word units and utterances. Entire systems for three
languages (English, Finnish and Romanian) are built with no reliance on manually
labelled data or language-specific expertise. Results of a subjective evaluation
are presented.
en
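A minimal sketch of the kind of distributional analysis the abstract describes, assuming an LSA-style pipeline in Python (window-based co-occurrence counting followed by truncated SVD); the function word_features, its parameters, and the toy corpus are illustrative assumptions, not the thesis's exact procedure.

    # Sketch: continuous-valued word features from unannotated text alone.
    # Assumed pipeline: co-occurrence counts -> log damping -> truncated SVD.
    from collections import Counter

    import numpy as np

    def word_features(sentences, dim=3, window=1):
        """Embed each word in a continuous-valued space, no labels needed."""
        vocab = sorted({w for s in sentences for w in s})
        index = {w: i for i, w in enumerate(vocab)}
        counts = Counter()
        for s in sentences:
            for i, w in enumerate(s):
                # Count neighbours within the given window on either side.
                for j in range(max(0, i - window), min(len(s), i + window + 1)):
                    if j != i:
                        counts[(index[w], index[s[j]])] += 1
        m = np.zeros((len(vocab), len(vocab)))
        for (r, c), n in counts.items():
            m[r, c] = np.log1p(n)  # dampen raw counts
        # Rows of u[:, :dim] scaled by the singular values are the features.
        u, sv, _ = np.linalg.svd(m, full_matrices=False)
        return {w: u[index[w], :dim] * sv[:dim] for w in vocab}

    sents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
    feats = word_features(sents, dim=2)
    print(feats["cat"])  # continuous-valued feature vector, no annotation used

Note that in the approach the abstract describes, such continuous-valued features are not converted to hard categories in advance: the feature space is partitioned during acoustic model training (for example, by decision-tree questions that threshold feature dimensions), so the partition is chosen for its acoustic relevance.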
dc.identifier.uri
http://hdl.handle.net/1842/7982
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
O. Watts, J. Yamagishi, and S. King. The role of higher-level linguistic features in HMM-based speech synthesis. In Proc. Interspeech, pages 841-844, Makuhari, Japan, Sept. 2010.
en
dc.relation.hasversion
O. Watts, J. Yamagishi, and S. King. Letter-based speech synthesis. In Proc. Speech Synthesis Workshop 2010, pages 317-322, Nara, Japan, Sept. 2010.
en
dc.relation.hasversion
O. Watts, J. Yamagishi, and S. King. Unsupervised continuous-valued word features for phrase-break prediction without a part-of-speech tagger. In Proc. Interspeech, Florence, Italy, Aug. 2011.
en
dc.relation.hasversion
J. Yamagishi and O. Watts. The CSTR/EMIME HTS system for Blizzard Challenge 2010. In Proc. Blizzard Challenge 2010, Sept. 2010.
en
dc.subject
unsupervised learning
en
dc.subject
vector space model
en
dc.subject
speech synthesis
en
dc.subject
TTS
en
dc.subject
text-to-speech
en
dc.title
Unsupervised learning for text-to-speech synthesis
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Watts2013.pdf
- Size: 1.43 MB
- Format: Adobe Portable Document Format