Using local linguistic context for text-to-speech

Oplustil-Gallegos, Pilar


dc.contributor.advisor

King, Simon

dc.contributor.advisor

Lai, Catherine

dc.contributor.author

Oplustil-Gallegos, Pilar

dc.contributor.sponsor

other

en

dc.date.accessioned

2023-03-15T16:02:40Z

dc.date.available

2023-03-15T16:02:40Z

dc.date.issued

2023-03-15

dc.description.abstract

Synthetic speech generated by state-of-the-art Text-to-Speech (TTS) models achieves unprecedented levels of naturalness. Training, inference and evaluation of TTS models have consistently been performed on isolated utterances stripped of contextual information, despite evidence from linguistics that context can affect speech. In this thesis, we hypothesize that we can further improve synthetic speech naturalness by leveraging local linguistic context, which we define as the utterance that immediately precedes another one, considering both its textual and acoustic contents, with a focus on the latter. The experimental work in this thesis is divided into three parts. In the first part, we develop and test a method to condition sequence-to-sequence TTS models on representations of the context utterance. Preliminary results conditioning on an acoustic representation show that it is possible to improve synthetic speech with our method, when evaluating single utterances through listening tests. Next, we systematically compare different context representations, and we find significantly better naturalness scores when combining acoustic and textual representations from context to condition TTS systems. In the second part, we explore alternative methods to incorporate contextual information. We do not find improvements by conditioning inference only on context representations, or by augmenting the TTS input with features extracted from textual context. In the last part of this thesis, we analyse and evaluate the best method proposed in part one. We begin by testing our method on several challenging data sets of diverse nature, establishing its limitations. Subsequently, we evaluate our method by applying an in-context listening test design proposed in previous work. Unexpectedly, we see that ground-truth speech might not be considered more natural when listened to in-context than as isolated utterances, contrary to previous results.
We finish by proposing to apply local coherence models, trained on sequences of natural speech data, as an objective evaluation of synthetic speech in-context. Through this evaluation, we see that our method, using ground-truth acoustic context, provides improvements in-context only when trained with speech from a speaker with high predictability at the local linguistic context level, encoded through acoustic features alone.

en

dc.identifier.uri

https://hdl.handle.net/1842/40415

dc.identifier.uri

http://dx.doi.org/10.7488/era/3183

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Dubiel, M., Halvey, M., Oplustil-Gallegos, P., & King, S. (2020). Persuasive synthetic speech: Voice perception and user behaviour. In Proc. CUI.

en

dc.relation.hasversion

O’Mahony, J., Oplustil-Gallegos, P., Lai, C., & King, S. (2021). Factors affecting the evaluation of synthetic speech in context. In Proc. SSW.

en

dc.relation.hasversion

Oplustil-Gallegos, P., & King, S. (2020). Using previous acoustic context to improve text-to-speech synthesis. ArXiv preprint.

en

dc.relation.hasversion

Oplustil-Gallegos, P., O’Mahony, J., & King, S. (2021). Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech. In Proc. SSW.

en

dc.relation.hasversion

Oplustil-Gallegos, P., Williams, J., Rownicka, J., & King, S. (2020). An unsupervised method to select a speaker subset from large multi-speaker speech synthesis datasets. In Proc. Interspeech.

en

dc.relation.hasversion

Williams, J., Rownicka, J., Oplustil-Gallegos, P., & King, S. (2020). Comparison of speech representations for automatic quality estimation in multi-speaker Text-to-Speech synthesis. In Proc. Speaker Odyssey.

en

dc.subject

Text-to-Speech

en

dc.subject

contextual information

en

dc.subject

sequence-to-sequence TTS models

en

dc.subject

synthetic speech

en

dc.subject

context representations

en

dc.subject

local coherence models

en

dc.title

Using local linguistic context for text-to-speech

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle


Name: Oplustil Gallegos2023.pdf
Size: 7.21 MB
Format: Adobe Portable Document Format


This item appears in the following Collection(s)

Informatics thesis and dissertation collection