Edinburgh Research Archive


Using local linguistic context for text-to-speech

View/Open
Oplustil Gallegos2023.pdf (7.205Mb)
Date
15/03/2023
Author
Oplustil-Gallegos, Pilar
Abstract
Synthetic speech generated by state-of-the-art Text-to-Speech (TTS) models achieves unprecedented levels of naturalness. Training, inference and evaluation of TTS models have consistently been performed on isolated utterances stripped of contextual information, despite evidence from linguistics that context can affect speech. In this thesis, we hypothesize that we can further improve synthetic speech naturalness by leveraging local linguistic context, which we define as the utterance that immediately precedes another one, considering both its textual and acoustic contents, with a focus on the latter. The experimental work in this thesis is divided into three parts. In the first part, we develop and test a method to condition sequence-to-sequence TTS models on representations of the context utterance. Preliminary results from conditioning on an acoustic representation show that our method can improve synthetic speech when single utterances are evaluated through listening tests. Next, we systematically compare different context representations, and we find significantly better naturalness scores when combining acoustic and textual representations from context to condition TTS systems. In the second part, we explore alternative methods to incorporate contextual information. We do not find improvements by conditioning inference only on context representations, or by augmenting the TTS input with features extracted from textual context. In the last part of this thesis, we analyse and evaluate the best method proposed in part one. We begin by testing our method on several challenging data sets of diverse nature, establishing its limitations. Subsequently, we evaluate our method by applying an in-context listening test design proposed in previous work. Unexpectedly, we see that ground-truth speech might not be considered more natural when listened to in context than when presented as isolated utterances, contrary to previous results.
We finish by proposing to apply local coherence models, trained on sequences of natural speech data, as an objective evaluation of synthetic speech in context. Through this evaluation, we see that our method, using ground-truth acoustic context, provides improvements in context only when trained with speech from a speaker with high predictability at the local linguistic context level, encoded through acoustic features alone.
URI
https://hdl.handle.net/1842/40415

http://dx.doi.org/10.7488/era/3183
Collections
  • Informatics thesis and dissertation collection
