Edinburgh Research Archive

Exploring discourse-level features for audiobook-based speech synthesis

dc.contributor.advisor
King, Simon
en
dc.contributor.author
Ribeiro, Manuel Sam
en
dc.date.accessioned
2014-03-26T15:43:47Z
dc.date.available
2014-03-26T15:43:47Z
dc.date.issued
2013
dc.description.abstract
Audiobooks are a powerful source of rich information for speech synthesis. Recent work has focused on this type of data to improve synthetic speech along two essential dimensions: naturalness and expressiveness. In audiobooks, sentences are not spoken in isolation, as in traditional speech synthesis databases, which allows us to explore discourse-level effects in synthetic speech. Furthermore, audiobook readers often change their voices to impersonate certain characters or to convey particular emotions related to the text, essentially making their speech more expressive. We should be aware, however, that audiobooks are found data. They were not specifically designed for speech synthesis, and may lack the coverage that carefully designed databases would have. Also, especially with freely available materials, audiobook recordings may not have been performed in the best conditions and readers may not be professional voice talents. Regardless, recent work has shown that audiobook data can be a valuable asset. This work conducts an exploratory analysis of audiobook data, focusing on discourse-level phenomena, with the intent of improving the naturalness and expressiveness of synthetic speech. Several topics in these two dimensions are explored considering the written paragraph as a unit of discourse. In terms of naturalness, we begin by exploring acoustic effects within the paragraph and at its boundaries, focusing on intonational (F0-based) and durational cues in speech production. We continue with the prediction of pause duration from the large audiobook corpus. As for expressiveness, we explore Cluster Adaptive Training (CAT) interpolation weight vectors. We analyze their distributions and propose several text-based features that might help explain their variability. We then build univariate and multivariate regression trees in order to predict CAT weight vectors from the suggested textual features.
Given the exploratory nature of this work, some analyses and models are more successful than others, and some results inevitably lead to new hypotheses. We conclude with suggestions for future work for each of the observed topics.
en
dc.identifier.uri
http://hdl.handle.net/1842/8616
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.subject
speech synthesis
en
dc.subject
audiobook
en
dc.subject
discourse
en
dc.subject
prosody
en
dc.title
Exploring discourse-level features for audiobook-based speech synthesis
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Masters
en
dc.type.qualificationname
MSc Master of Science
en
dcterms.accessRights
RESTRICTED ACCESS
en
