Fast and controllable speech generation
dc.contributor.advisor
King, Simon
dc.contributor.advisor
Yamagishi, Junichi
dc.contributor.advisor
Nagarajan, Vijayanand
dc.contributor.advisor
Valentini-Botinhao, Cassia
dc.contributor.author
Webber, Jacob Josiah
dc.date.accessioned
2025-07-11T15:01:19Z
dc.date.available
2025-07-11T15:01:19Z
dc.date.issued
2025-07-11
dc.description.abstract
The work described in this thesis was completed shortly after the emergence of deep learning as the state of the art for speech generation tasks. In 2016, WaveNet showed that autoregressive generative modelling could generate high-quality speech waveforms, and in 2017 Tacotron showed the power of cutting-edge sequence-to-sequence models (then recurrent encoder-decoder networks with attention) for speech synthesis.
However, while these models showed a large improvement in output quality when compared with the statistical-parametric models that preceded them, they came with new problems of their own. First, these models were extremely expensive in terms of the compute required at both training and synthesis time. Second, while these models excelled at creating samples similar to those upon which they were trained, the lack of interpretable intermediate features, such as F0 or durations, meant that it was no longer possible to have fine-grained control over synthesis.
This thesis tackles these two problems in turn, presenting approaches which generate speech in a way that is fast on low-end hardware or controllable by arbitrary parameters.
For the former, I present a range of techniques that combine digital signal processing with machine learning to deliver high-quality audio at considerably lower computational expense than state-of-the-art neural vocoders.
In order to enable controllable speech generation, I present Hider-Finder-Combiner: a system of adversarial information hiding used to derive disentangled representations of speech. The common thread uniting these approaches is a reliance on learned representations of speech.
By developing the Hider-Finder-Combiner architecture, which was introduced in an MScRes thesis derived from the same project, I show the power of learned representations to control speech prosody and enable privacy-preserving uses of speech technology. Through Autovocoder, I show that it is possible to learn representations of speech that can be efficiently converted into audio waveforms.
Together these streams of work show that learned representations, derived using deep learning techniques, can enable fast, controllable speech generation.
en
dc.identifier.uri
https://hdl.handle.net/1842/43669
dc.identifier.uri
http://dx.doi.org/10.7488/era/6201
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Webber, J. J., Perrotin, O., and King, S. (2020). Hider-finder-combiner: An adversarial architecture for general speech signal modification. In Proc. Interspeech 2020, pages 3206–3210
en
dc.relation.hasversion
Webber, J. J., Valentini-Botinhao, C., Williams, E., Henter, G. E., and King, S. (2023). Autovocoder: Fast waveform generation from a learned speech representation using differentiable digital signal processing. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5
en
dc.relation.hasversion
Webber, J. J., Watts, O., Henter, G. E., Williams, J., and King, S. (2024). Voice conversion-based privacy through adversarial information hiding. In Proc. 4th Symposium on Security and Privacy in Speech Communication
en
dc.rights.license
Attribution 4.0 International (CC BY 4.0)
en
dc.rights.uri
https://creativecommons.org/licenses/by/4.0/
en
dc.subject
speech synthesis
en
dc.subject
machine learning
en
dc.subject
GAN
en
dc.subject
audio
en
dc.subject
generative models
en
dc.subject
voice conversion
en
dc.subject
speech
en
dc.title
Fast and controllable speech generation
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Webber2025.pdf
- Size: 8.04 MB
- Format: Adobe Portable Document Format