Fast and controllable speech generation
Authors
Webber, Jacob Josiah
Abstract
The work described in this thesis was completed shortly after the emergence of deep learning as the state of the art for speech generation tasks. In 2016, WaveNet showed that autoregressive generative modelling could produce high-quality speech waveforms, and in 2017 Tacotron demonstrated the power of cutting-edge sequence-to-sequence models (then recurrent encoder-decoder networks with attention) for speech synthesis.
However, while these models delivered a large improvement in output quality compared with the statistical parametric models that preceded them, they brought new problems of their own. Firstly, these models were extremely expensive in terms of the compute required at both training and synthesis time. Secondly, while they excelled at creating samples similar to those on which they were trained, the lack of interpretable intermediate features, such as F0 or durations, meant that fine-grained control over synthesis was no longer possible.
This thesis tackles these two problems in turn, presenting approaches that generate speech either quickly on low-end hardware or under the control of arbitrary parameters.
For the former, a range of techniques is presented that combine digital signal processing with machine learning to deliver high-quality audio at considerably less computational expense than state-of-the-art neural vocoders.
In order to enable controllable speech generation, I present Hider-Finder-Combiner: a system of adversarial information hiding used to derive disentangled representations of speech. The common thread uniting these approaches is a reliance on learned representations of speech.
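To illustrate the adversarial information-hiding idea, the following is a minimal sketch in the spirit of Hider-Finder-Combiner, assuming a PyTorch-style setup with illustrative module names (hider, finder, combiner), F0 as the controlled parameter, and a simple negative-MSE confusion term; the architecture, features and losses used in the thesis may differ.

```python
# Minimal sketch of adversarial information hiding (Hider-Finder-Combiner style).
# Dimensions, losses and the choice of F0 as the hidden parameter are
# illustrative assumptions, not the exact configuration from the thesis.
import torch
import torch.nn as nn

SPEC_DIM, HIDDEN_DIM = 80, 128    # e.g. mel-spectrogram frame -> hidden code

hider = nn.Sequential(            # removes F0 information from each frame
    nn.Linear(SPEC_DIM, HIDDEN_DIM), nn.ReLU(), nn.Linear(HIDDEN_DIM, HIDDEN_DIM))
finder = nn.Sequential(           # adversary: tries to recover F0 from the code
    nn.Linear(HIDDEN_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
combiner = nn.Sequential(         # rebuilds the frame from code + desired F0
    nn.Linear(HIDDEN_DIM + 1, HIDDEN_DIM), nn.ReLU(), nn.Linear(HIDDEN_DIM, SPEC_DIM))

opt_finder = torch.optim.Adam(finder.parameters(), lr=1e-4)
opt_hc = torch.optim.Adam(list(hider.parameters()) + list(combiner.parameters()), lr=1e-4)
mse = nn.MSELoss()

def train_step(spec, f0):
    """spec: (batch, SPEC_DIM) frames; f0: (batch, 1) normalised F0 values."""
    # 1) Finder step: learn to predict F0 from the (detached) hidden code.
    code = hider(spec).detach()
    finder_loss = mse(finder(code), f0)
    opt_finder.zero_grad(); finder_loss.backward(); opt_finder.step()

    # 2) Hider/Combiner step: reconstruct the frame from (code, f0) while
    #    pushing F0 information out of the code (maximise the finder's error).
    code = hider(spec)
    recon_loss = mse(combiner(torch.cat([code, f0], dim=-1)), spec)
    adv_loss = -mse(finder(code), f0)
    loss = recon_loss + 0.1 * adv_loss
    opt_hc.zero_grad(); loss.backward(); opt_hc.step()
    return recon_loss.item(), finder_loss.item()
```

At synthesis time, a modified F0 contour can be supplied to the combiner in place of the original, which is what gives the disentangled representation its value for prosody control.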
By developing the Hider-Finder-Combiner architecture, which was introduced in an MScRes thesis derived from the same project as this thesis, I show the power of learned representations to control speech prosody and to enable privacy-preserving uses of speech technology. Through Autovocoder, I show that representations of speech can be learned that are efficiently converted into audio waveforms.
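As a rough illustration of the Autovocoder idea, the sketch below trains an autoencoder over STFT frames whose decoder output is turned back into audio with a single inverse STFT rather than an expensive neural vocoder; the layer sizes, the real/imaginary parameterisation and the hyperparameters are illustrative assumptions rather than the configuration used in the thesis.

```python
# Minimal sketch of an autoencoder over STFT frames (Autovocoder style):
# the learned representation decodes to complex spectra that are inverted
# to a waveform with one cheap iSTFT. Dimensions and layers are illustrative.
import torch
import torch.nn as nn

N_FFT, HOP, LATENT = 1024, 256, 128
FREQ_BINS = N_FFT // 2 + 1            # real + imaginary parts per bin

encoder = nn.Sequential(nn.Linear(2 * FREQ_BINS, 512), nn.ReLU(),
                        nn.Linear(512, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 512), nn.ReLU(),
                        nn.Linear(512, 2 * FREQ_BINS))

def encode(wav):
    """wav: (batch, samples) -> latent frames (batch, frames, LATENT)."""
    spec = torch.stft(wav, N_FFT, HOP, return_complex=True)      # (b, bins, frames)
    feats = torch.cat([spec.real, spec.imag], dim=1).transpose(1, 2)
    return encoder(feats)

def decode(latents):
    """latents: (batch, frames, LATENT) -> waveform (batch, samples)."""
    out = decoder(latents).transpose(1, 2)                        # (b, 2*bins, frames)
    real, imag = out[:, :FREQ_BINS], out[:, FREQ_BINS:]
    spec = torch.complex(real, imag)
    return torch.istft(spec, N_FFT, HOP)                          # DSP inversion, no neural vocoder

# Training (not shown) would minimise a reconstruction loss, e.g. MSE between
# decode(encode(wav)) and wav, optionally with additional spectral terms.
```

Because waveform generation reduces to a linear-algebra decoder plus an inverse STFT, synthesis remains fast even on low-end hardware, which is the point the abstract makes about efficiently invertible learned representations.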
Together these streams of work show that learned representations, derived using deep learning techniques, can enable fast, controllable speech generation.