Fast and controllable speech generation

Webber, Jacob Josiah

dc.contributor.advisor

King, Simon

dc.contributor.advisor

Yamagishi, Junichi

dc.contributor.advisor

Nagarajan, Vijayanand

dc.contributor.advisor

Valentini-Botinhao, Cassia

dc.contributor.author

Webber, Jacob Josiah

dc.date.accessioned

2025-07-11T15:01:19Z

dc.date.available

2025-07-11T15:01:19Z

dc.date.issued

2025-07-11

dc.description.abstract

The work described in this thesis was completed shortly after the emergence of deep learning as the state of the art for speech generation tasks. In 2016, WaveNet showed that autoregressive generative modelling could generate high-quality speech waveforms, and in 2017 Tacotron showed the power of cutting-edge sequence-to-sequence models (then recurrent encoder-decoder networks with attention) for speech synthesis. However, while these models showed a large improvement in output quality compared with the statistical-parametric models that preceded them, they came with new problems of their own. First, these models were extremely expensive in terms of the compute required at both training and synthesis time. Second, while these models excelled at creating samples similar to those on which they were trained, the lack of interpretable intermediate features, such as F0 or durations, meant that it was no longer possible to have fine-grained control over synthesis. This thesis tackles these two problems in turn, presenting approaches that generate speech in a way that is fast on low-end hardware or controllable by arbitrary parameters. For the former, a range of techniques is presented that combine digital signal processing with machine learning to deliver high-quality audio at considerably less computational expense than state-of-the-art neural vocoders. To enable controllable speech generation, I present Hider-Finder-Combiner: a system of adversarial information hiding used to derive disentangled representations of speech. The common thread uniting these approaches is a reliance on learned representations of speech. By developing the Hider-Finder-Combiner architecture, which was introduced in an MScRes thesis derived from the same project as this thesis, I show the power of learned representations to control speech prosody and enable privacy-preserving uses of speech technology. Through Autovocoder, I show that representations of speech can be learned that are efficiently converted into audio waveforms. Together, these streams of work show that learned representations, derived using deep learning techniques, can enable fast, controllable speech generation.

en

dc.identifier.uri

https://hdl.handle.net/1842/43669

dc.identifier.uri

http://dx.doi.org/10.7488/era/6201

dc.language.iso

en

dc.publisher

The University of Edinburgh

en

dc.relation.hasversion

Webber, J. J., Perrotin, O., and King, S. (2020). Hider-finder-combiner: An adversarial architecture for general speech signal modification. In Proc. Interspeech 2020, pages 3206–3210

en

dc.relation.hasversion

Webber, J. J., Valentini-Botinhao, C., Williams, E., Henter, G. E., and King, S. (2023). Autovocoder: Fast waveform generation from a learned speech representation using differentiable digital signal processing. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5

en

dc.relation.hasversion

Webber, J. J., Watts, O., Henter, G. E., Williams, J., and King, S. (2024). Voice conversion-based privacy through adversarial information hiding. In Proc. 4th Symposium on Security and Privacy in Speech Communication

en

dc.rights.license

Attribution 4.0 International CC BY 4.0 Deed

en

dc.rights.uri

https://creativecommons.org/licenses/by/4.0/

en

dc.subject

speech synthesis

en

dc.subject

machine learning

en

dc.subject

GAN

en

dc.subject

audio

en

dc.subject

generative models

en

dc.subject

voice conversion

en

dc.subject

speech

en

dc.title

Fast and controllable speech generation

en

dc.type

Thesis or Dissertation

en

dc.type.qualificationlevel

Doctoral

en

dc.type.qualificationname

PhD Doctor of Philosophy

en

Files

Original bundle

Name: Webber2025.pdf
Size: 8.04 MB
Format: Adobe Portable Document Format
Description:

This item appears in the following Collection(s)

Informatics thesis and dissertation collection