Edinburgh Research Archive

Using local linguistic context for text-to-speech

dc.contributor.advisor
King, Simon
dc.contributor.advisor
Lai, Catherine
dc.contributor.author
Oplustil-Gallegos, Pilar
dc.contributor.sponsor
other
en
dc.date.accessioned
2023-03-15T16:02:40Z
dc.date.available
2023-03-15T16:02:40Z
dc.date.issued
2023-03-15
dc.description.abstract
Synthetic speech generated by state-of-the-art Text-to-Speech (TTS) models achieves unprecedented levels of naturalness. Training, inference and evaluation of TTS models have consistently been performed on isolated utterances stripped of contextual information, despite evidence from linguistics that context can affect speech. In this thesis, we hypothesize that we can further improve synthetic speech naturalness by leveraging local linguistic context, which we define as the utterance that immediately precedes another one, considering both its textual and acoustic contents, with a focus on the latter. The experimental work in this thesis is divided into three parts. In the first part, we develop and test a method to condition sequence-to-sequence TTS models on representations of the context utterance. Preliminary results conditioning on an acoustic representation show that it is possible to improve synthetic speech with our method when evaluating single utterances through listening tests. Next, we systematically compare different context representations, and we find significantly better naturalness scores when combining acoustic and textual representations from context to condition TTS systems. In the second part, we explore alternative methods to incorporate contextual information. We do not find improvements by conditioning inference only on context representations, or by augmenting the TTS input with features extracted from textual context. In the last part of this thesis, we analyse and evaluate the best method proposed in part one. We begin by testing our method on several challenging data sets of diverse nature, establishing its limitations. Subsequently, we evaluate our method by applying an in-context listening test design proposed in previous work. Unexpectedly, we see that ground-truth speech might not be considered more natural when listened to in-context than as isolated utterances, contrary to previous results.
We finish by proposing to apply local coherence models, trained on sequences of natural speech data, as an objective evaluation of synthetic speech in-context. Through this evaluation, we see that our method, using ground-truth acoustic context, provides improvements in-context only when trained with speech from a speaker with high predictability at the local linguistic context level, encoded through acoustic features alone.
en
dc.identifier.uri
https://hdl.handle.net/1842/40415
dc.identifier.uri
http://dx.doi.org/10.7488/era/3183
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Dubiel, M., Halvey, M., Oplustil-Gallegos, P., & King, S. (2020). Persuasive synthetic speech: Voice perception and user behaviour. In Proc. CUI.
en
dc.relation.hasversion
O’Mahony, J., Oplustil-Gallegos, P., Lai, C., & King, S. (2021). Factors affecting the evaluation of synthetic speech in context. In Proc. SSW.
en
dc.relation.hasversion
Oplustil-Gallegos, P., & King, S. (2020). Using previous acoustic context to improve text-to-speech synthesis. arXiv preprint.
en
dc.relation.hasversion
Oplustil-Gallegos, P., O’Mahony, J., & King, S. (2021). Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech. In Proc. SSW.
en
dc.relation.hasversion
Oplustil-Gallegos, P., Williams, J., Rownicka, J., & King, S. (2020). An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets. In Proc. Interspeech.
en
dc.relation.hasversion
Williams, J., Rownicka, J., Oplustil-Gallegos, P., & King, S. (2020). Comparison of speech representations for automatic quality estimation in multi-speaker Text-to-Speech synthesis. In Proc. Speaker Odyssey.
en
dc.subject
Text-to-Speech
en
dc.subject
contextual information
en
dc.subject
sequence-to-sequence TTS models
en
dc.subject
synthetic speech
en
dc.subject
context representations
en
dc.subject
local coherence models
en
dc.title
Using local linguistic context for text-to-speech
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Oplustil Gallegos2023.pdf
Size:
7.21 MB
Format:
Adobe Portable Document Format