Control strategies for expressive text-to-speech
Abstract
When we speak, we convey information beyond the choice of vocabulary—we enhance the message through prosodic augmentation. In addition to its pragmatic functions, such as clarifying intent and guiding the listener’s interpretation, prosody is also deployed in expressive speech to communicate emotions, attitudes, and speaking styles. The decision to speak expressively, and how to do so, is shaped by the context in which we speak. Often, multiple renditions may be suitable for a given situation, but choosing the wrong one can be perceived as inappropriate or lead to a misinterpretation of the speaker’s intent. Text-To-Speech (TTS) models aim to replicate speech production by mapping written language to speech.
Although natural speech can be expressed in various ways, typical TTS models learn a single, likely mapping based solely on the text they are prompted to speak. Through this process, TTS models suppress vocal patterns that deviate from the default speaking manner learned from the training data.
Therefore, TTS systems often fail to produce appropriate expressiveness, resulting in overly monotonous speech to the point of impacting overall naturalness.
Instead of relying on text alone to determine a suitable prosodic rendition, controllable TTS models enable users to influence this choice by augmenting various aspects of the speech generation process. In my thesis, I investigate a broad range of strategies for controlling expressive TTS systems: (1) reference-conditioning, in which the generated speech is guided via a speech reference sample; (2) Acoustic Feature Control (AFC), which involves annotators directly manipulating individual features predicted and used by the model; and (3) prompt-based control, where the rendition is described in a natural language text-based instruction. The control strategies I investigate differ regarding the conditioning signal they require and, consequently, how the user interacts with it. The choice of a control method involves trade-offs as the different approaches vary across key aspects of controllability: interpretability of modelled representations, responsiveness to the conditioning signal, specificity of control, and the usability of the chosen method.
Control methods that rely on learning latent representations of prosody, such as reference-based models, fail to separate prosody from other factors in the reference utterance used to create the representation, such as the identity of the original speaker and the linguistic content. Resolving this entanglement is challenging due to the opaque nature of the representations. As a result, these methods are often tied to specific input conditions, while sampling new representations from the latent prosody space is unreliable, leading to reduced naturalness. In contrast, directly manipulating acoustic features provides a more interpretable form of control and can improve the perceived quality of prosody transfer.
However, this approach is resource-intensive and technically complex. Experiments also indicate that there is a limit to how complex the task can be if annotators are to improve the predicted rendition.
Describing the rendition using natural language offers more accessible control over a limited set of features.
Yet, such models typically exhibit uncontrollable variance for the same input text and instruction prompt. I demonstrate how this variability can be reduced by leveraging the model’s output distribution through fine-tuning.
Drawing on the differences and limitations of these three control methods, I provide recommendations for the appropriate use of each strategy and suggest ways to address the challenges associated with their implementation.

