Controlling text-to-speech pronunciation using limited linguistic resources
dc.contributor.advisor
King, Simon
dc.contributor.advisor
Tang, Hao
dc.contributor.author
Fong, Jason
dc.date.accessioned
2024-10-29T12:31:22Z
dc.date.available
2024-10-29T12:31:22Z
dc.date.issued
2024-10-29
dc.description.abstract
Correct pronunciation is essential for high-quality text-to-speech (TTS) systems. To achieve this, most TTS systems rely on phonemes as an intermediate representation between input graphemes and output speech. Phonemes are generated using a pronunciation lexicon and a grapheme-to-phoneme (G2P) model. Both of these resources, however, are costly to create, and consequently are available for only a limited number of well-resourced languages.
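As a concrete illustration (not taken from the thesis), a conventional phoneme front-end can be sketched as a lexicon lookup with a G2P fallback for out-of-vocabulary words. All names below are hypothetical; a real lexicon holds tens of thousands of entries and the fallback is a trained G2P model rather than the placeholder used here.

```python
# Minimal sketch of a standard phoneme front-end: lexicon lookup first,
# grapheme-to-phoneme (G2P) model as fallback for unseen words.
from typing import Callable, Dict, List

def to_phonemes(
    word: str,
    lexicon: Dict[str, List[str]],
    g2p: Callable[[str], List[str]],
) -> List[str]:
    """Return a phoneme sequence for one word."""
    entry = lexicon.get(word.lower())
    if entry is not None:
        return entry          # known word: use the hand-built lexicon entry
    return g2p(word)          # out-of-vocabulary word: fall back to the G2P model

# Toy example; the lambda stands in for a trained G2P model.
lexicon = {"speech": ["S", "P", "IY", "CH"]}
print(to_phonemes("speech", lexicon, g2p=lambda w: list(w.upper())))
print(to_phonemes("zorp", lexicon, g2p=lambda w: list(w.upper())))
```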
In recent years, end-to-end TTS models, which can be learned entirely from scratch in a data-driven manner, have become more prevalent because they remove the need to create and maintain heuristically engineered modules. Although such models are often used with phoneme inputs, they also enable grapheme-input TTS. While this is a promising step towards creating TTS voices for under-resourced languages, a significant drawback is the loss of precise control over pronunciation.
The primary focus of this thesis is to address these limitations by developing novel methods that enable pronunciation control for grapheme-input TTS systems without depending on a large phoneme lexicon. I present two main contributions. First, I demonstrate that pronunciation control can be achieved using small phoneme-based pronunciation lexica. Second, I demonstrate that ground-truth speech exemplars of word pronunciations can be used to directly control the pronunciations of a TTS system, or to retrieve novel spellings that are correctly pronounced.
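A minimal sketch of the mixed grapheme/phoneme input idea explored in the publications listed below (e.g. representation mixing): most words remain graphemes, while a word whose pronunciation must be controlled is replaced by phonemes drawn from a small lexicon of corrections. The function name, boundary tags, and separator token are assumptions for illustration, not the thesis implementation.

```python
# Sketch of mixed grapheme/phoneme input for one-off pronunciation control.
from typing import Dict, List

def mixed_input(text: str, corrections: Dict[str, List[str]]) -> List[str]:
    """Tokenise text as graphemes, substituting phoneme spans for corrected words."""
    tokens: List[str] = []
    for word in text.split():
        if word.lower() in corrections:
            # Boundary tags (hypothetical) let the model distinguish symbol types.
            tokens += ["<phon>"] + corrections[word.lower()] + ["</phon>"]
        else:
            tokens += list(word)   # character-level graphemes
        tokens.append("<sp>")      # word-separator token (hypothetical)
    return tokens[:-1]

# Force the past-tense pronunciation of "read"; other words stay as graphemes.
print(mixed_input("I read that book", {"read": ["R", "EH", "D"]}))
```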
The methods presented in this thesis hold the potential to revolutionise TTS system development. By reducing the time and expense involved in creating and maintaining pronunciation resources, this research paves the way for high-quality TTS in languages that lack extensive pronunciation resources. Furthermore, these methods empower language communities by removing the need for linguistic expertise and enabling crowd participation, thereby advancing the universal accessibility of speech technologies across diverse languages worldwide.
en
dc.identifier.uri
https://hdl.handle.net/1842/42379
dc.identifier.uri
http://dx.doi.org/10.7488/era/5073
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Fong, J., Lyth, D., Henter, G. E., Tang, H., and King, S. (2022a). Speech audio corrector: Using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech. In Proc. Interspeech 2022, pages 1213–1217.
en
dc.relation.hasversion
Fong, J., Tang, H., and King, S. (2023). Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations. In 12th Speech Synthesis Workshop (SSW) 2023.
en
dc.relation.hasversion
Fong, J., Taylor, J., and King, S. (2020). Testing the limits of representation mixing for pronunciation correction in end-to-end speech synthesis. In Interspeech 2020, pages 4019–4023.
en
dc.relation.hasversion
Fong, J., Taylor, J., Richmond, K., and King, S. (2019a). A comparison between letters and phones as input to sequence-to-sequence models for speech synthesis. In 10th ISCA Speech Synthesis Workshop.
en
dc.relation.hasversion
Fong, J., Taylor, J., Richmond, K., and King, S. (2019b). Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data. In Interspeech 2019.
en
dc.relation.hasversion
Fong, J., Wang, Y., Agrawal, P., Manohar, V., Wu, J., Köhler, T., and He, Q. (2022). Towards zero-shot text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders. arXiv preprint arXiv:2210.16045.
en
dc.relation.hasversion
Fong, J., Williams, J., and King, S. (2021). Analysing temporal sensitivity of VQ-VAE sub-phone codebooks. In The 11th ISCA Speech Synthesis Workshop (SSW11), pages 227–231.
en
dc.relation.hasversion
Fong, J., Wu, J., Agrawal, P., Gibiansky, A., Koehler, T., and He, Q. Improving polyglot speech synthesis through multi-task and adversarial learning.
en
dc.relation.hasversion
Watts, O., Henter, G. E., Fong, J., and Valentini-Botinhao, C. (2019). Where do the improvements come from in sequence-to-sequence neural TTS? In 2019 ISCA Speech Synthesis Workshop (SSW), volume 10, pages 217–222.
en
dc.relation.hasversion
Williams, J., Fong, J., Cooper, E., and Yamagishi, J. (2021). Exploring disentanglement with multilingual and monolingual VQ-VAE. arXiv preprint arXiv:2105.01573.
en
dc.subject
text-to-speech
en
dc.subject
TTS systems
en
dc.subject
pronunciation control
en
dc.subject
phoneme-based pronunciation lexica
en
dc.title
Controlling text-to-speech pronunciation using limited linguistic resources
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Fong2024.pdf
- Size: 5.54 MB
- Format: Adobe Portable Document Format