Exploring discourse-level features for audiobook-based speech synthesis

Ribeiro, Manuel Sam

dc.contributor.advisor

King, Simon

en

dc.contributor.author

Ribeiro, Manuel Sam

en

dc.date.accessioned

2014-03-26T15:43:47Z

dc.date.available

2014-03-26T15:43:47Z

dc.date.issued

2013

dc.description.abstract

Audiobooks are a powerful source of rich information for speech synthesis. Recent work has focused on this type of data to improve synthetic speech along two essential dimensions: naturalness and expressiveness. In audiobooks, sentences are not spoken in isolation, as they are in traditional speech synthesis databases, which allows us to explore discourse-level effects in synthetic speech. Furthermore, audiobook readers often change their voices to impersonate certain characters or to convey particular emotions related to the text, which makes their speech more expressive. We should be aware, however, that audiobooks are found data. They were not specifically designed for speech synthesis, and may lack the coverage that carefully designed databases would have. Also, especially with freely available materials, audiobook recordings may not have been made under the best conditions, and readers may not be professional voice talents. Regardless, recent work has shown that audiobook data can be a valuable asset. This work conducts an exploratory analysis of audiobook data, focusing on discourse-level phenomena, with the intent of improving the naturalness and expressiveness of synthetic speech. Several topics in these two dimensions are explored, taking the written paragraph as the unit of discourse. In terms of naturalness, we begin by exploring acoustic effects within the paragraph and at its boundaries, focusing on intonational (F0-based) and durational cues in speech production. We continue with the prediction of pause duration from the large audiobook corpus. As for expressiveness, we explore Cluster Adaptive Training (CAT) interpolation weight vectors. We analyze their distributions and propose several text-based features that might help explain their variability. We then build univariate and multivariate regression trees to predict CAT weight vectors from the suggested textual features.
Given the exploratory nature of this work, some analyses and models are more successful than others, and some results inevitably lead to new hypotheses. We conclude with suggestions for future work for each of the observed topics.

en

dc.identifier.uri

http://hdl.handle.net/1842/8616

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.subject

speech synthesis

en

dc.subject

audiobook

en

dc.subject

discourse

en

dc.subject

prosody

en

dc.title

Exploring discourse-level features for audiobook-based speech synthesis

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Masters

en

dc.type.qualificationname

MSc Master of Science

en

dcterms.accessRights

RESTRICTED ACCESS

en

This item appears in the following Collection(s)

Linguistics and English Language Masters thesis collection