Control strategies for expressive text-to-speech
dc.contributor.advisor
King, Simon
dc.contributor.advisor
Goldwater, Sharon
dc.contributor.author
Sigurgeirsson, Atli Thor
dc.date.accessioned
2026-03-02T15:06:57Z
dc.date.issued
2026-01-05
dc.description.abstract
When we speak, we convey information beyond the choice of vocabulary—we enhance the message through prosodic augmentation. In addition to its pragmatic functions, such as clarifying intent and guiding the listener’s interpretation, prosody is also deployed in expressive speech to communicate emotions, attitudes, and speaking styles. The decision to speak expressively, and how to do so, is shaped by the context in which we speak. Often, multiple renditions may be suitable for a given situation, but choosing the wrong one can be perceived as inappropriate or lead to a misinterpretation of the speaker’s intent. Text-To-Speech (TTS) models aim to replicate speech production by mapping written language to speech.
Although natural speech can be expressed in various ways, typical TTS models learn a single, likely mapping based solely on the text they are prompted to speak. Through this process, TTS models suppress vocal patterns that deviate from the default speaking manner learned from the training data.
Therefore, TTS systems often fail to produce appropriate expressiveness, resulting in overly monotonous speech to the point of impacting overall naturalness.
Instead of relying on text alone to determine a suitable prosodic rendition, controllable TTS models enable users to influence this choice by augmenting various aspects of the speech generation process. In my thesis, I investigate a broad range of strategies for controlling expressive TTS systems: (1) reference-conditioning, in which the generated speech is guided via a speech reference sample; (2) Acoustic Feature Control (AFC), which involves annotators directly manipulating individual features predicted and used by the model; and (3) prompt-based control, where the rendition is described in a natural language text-based instruction. The control strategies I investigate differ regarding the conditioning signal they require and, consequently, how the user interacts with it. The choice of a control method involves trade-offs as the different approaches vary across key aspects of controllability: interpretability of modelled representations, responsiveness to the conditioning signal, specificity of control, and the usability of the chosen method.
Control methods that rely on learning latent representations of prosody, like reference-based models, fail to separate prosody from other factors in the reference utterance used to create the representation, such as the identity of the original speaker and the linguistic content. Resolving this entanglement is challenging due to the opaque nature of the representations. As a result, these methods are often tied to specific input conditions, while sampling new representations from the latent prosody space is unreliable, leading to reduced naturalness. In contrast, directly manipulating acoustic features provides a more interpretable form of control and can improve the perceived quality of prosody transfer.
However, this approach is resource-intensive and technically complex. Experiments also indicate that there is a limit to how complex the task can be if annotators are to improve the predicted rendition.
Describing the rendition using natural language offers more accessible control over a limited set of features.
Yet, such models typically exhibit uncontrollable variance for the same input text and instruction prompt. I demonstrate how this variability can be reduced by leveraging the model’s output distribution through fine-tuning.
Drawing on the differences and limitations of these three control methods, I provide recommendations for the appropriate use of each strategy and suggest ways to address the challenges associated with their implementation.
dc.identifier.uri
https://era.ed.ac.uk/handle/1842/44458
dc.identifier.uri
https://doi.org/10.7488/era/6975
dc.language.iso
en
dc.publisher
The University of Edinburgh
dc.relation.hasversion
Himanshu Maurya and Atli Thor Sigurgeirsson. A Human-In-The-Loop approach to improving cross-text prosody transfer. In Interspeech, 2024
dc.relation.hasversion
Gunnar Thor Örnólfsson, Atli Thor Sigurgeirsson, Anna Björk Nikulásdóttir, and Daniel Schnell. Talrómur 3 v0.1 (24.09), 2024. Clarin-IS
dc.relation.hasversion
Atli Thor Sigurgeirsson and Simon King. Do prosody transfer models transfer prosody? In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
dc.relation.hasversion
Atli Thor Sigurgeirsson and Simon King. Controllable speaking styles using a large language model. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
dc.relation.hasversion
Atli Thor Sigurgeirsson and Simon King. RepeaTTS: Towards feature discovery through repeated fine-tuning, 2025
dc.relation.hasversion
Atli Thor Sigurgeirsson and Eddie L Ungless. Just because we camp, doesn’t mean we should: The ethics of modelling queer voices. In Interspeech, 2024
dc.relation.hasversion
Atli Thor Sigurgeirsson, Þorsteinn Daði Gunnarsson, Gunnar Thor Örnólfsson, Eydís Huld Magnúsdóttir, Ragnheiður Kr. Þórhallsdóttir, Stefán Gunnlaugur Jónsson, and Jón Guðnason. Talrómur: A large Icelandic TTS corpus, 2021
dc.rights.license
Creative Commons: Attribution 4.0 International (CC-BY 4.0)
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
dc.subject
text-to-speech
dc.subject
TTS
dc.subject
controllable TTS system
dc.subject
changing values
dc.subject
prosodic rendition
dc.subject
reference-conditioning
dc.subject
AFC
dc.subject
acoustic feature control
dc.title
Control strategies for expressive text-to-speech
dc.type
Thesis
dc.type.qualificationlevel
Doctoral
dc.type.qualificationname
PhD Doctor of Philosophy
Files
Original bundle
- Name: SigurgeirssonAT_2025.pdf
- Size: 14.31 MB
- Format: Adobe Portable Document Format