Controlling text-to-speech pronunciation using limited linguistic resources
dc.contributor.advisor
King, Simon
dc.contributor.advisor
Tang, Hao
dc.contributor.author
Fong, Jason
dc.date.accessioned
2024-10-29T12:31:22Z
dc.date.available
2024-10-29T12:31:22Z
dc.date.issued
2024-10-29
dc.description.abstract
Correct pronunciation is essential for high-quality text-to-speech (TTS) systems. To achieve this, most TTS systems rely on phonemes as an intermediate representation between input graphemes and output speech. Phonemes are generated using a pronunciation lexicon and a grapheme-to-phoneme (G2P) model. Both of these resources, however, are costly to create, and consequently are available for only a limited number of well-resourced languages.
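As a concrete illustration (not taken from the thesis), a conventional phoneme front-end can be sketched as a lexicon lookup with a G2P fallback for out-of-vocabulary words. All names below are hypothetical; a real lexicon holds tens of thousands of entries and the fallback is a trained G2P model rather than the placeholder used here.

```python
# Minimal sketch of a standard phoneme front-end: lexicon lookup first,
# grapheme-to-phoneme (G2P) model as fallback for unseen words.
from typing import Callable, Dict, List

def to_phonemes(
    word: str,
    lexicon: Dict[str, List[str]],
    g2p: Callable[[str], List[str]],
) -> List[str]:
    """Return a phoneme sequence for one word."""
    entry = lexicon.get(word.lower())
    if entry is not None:
        return entry          # known word: use the hand-built lexicon entry
    return g2p(word)          # out-of-vocabulary word: fall back to the G2P model

# Toy example; the lambda stands in for a trained G2P model.
lexicon = {"speech": ["S", "P", "IY", "CH"]}
print(to_phonemes("speech", lexicon, g2p=lambda w: list(w.upper())))
print(to_phonemes("zorp", lexicon, g2p=lambda w: list(w.upper())))
```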
In recent years, end-to-end TTS models, which can be learned entirely from scratch in a data-driven manner, have become more prevalent because they remove the need to create and maintain heuristically engineered modules. Although such models are often used with phoneme inputs, they also enable grapheme-input TTS. While this is a promising step towards creating TTS voices for under-resourced languages, a significant drawback is the loss of precise control over pronunciation.
The primary focus of this thesis is to address these limitations by developing novel methods that enable pronunciation control for grapheme-input TTS systems without depending on a large phoneme lexicon. I present two main contributions. First, I demonstrate that pronunciation control can be achieved using small phoneme-based pronunciation lexica. Second, I demonstrate that ground-truth speech exemplars of word pronunciations can be used to directly control the pronunciations of a TTS system, or to retrieve novel spellings that are correctly pronounced.
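A minimal sketch of the mixed grapheme/phoneme input idea explored in the publications listed below (e.g. representation mixing): most words remain graphemes, while a word whose pronunciation must be controlled is replaced by phonemes drawn from a small lexicon of corrections. The function name, boundary tags, and separator token are assumptions for illustration, not the thesis implementation.

```python
# Sketch of mixed grapheme/phoneme input for one-off pronunciation control.
from typing import Dict, List

def mixed_input(text: str, corrections: Dict[str, List[str]]) -> List[str]:
    """Tokenise text as graphemes, substituting phoneme spans for corrected words."""
    tokens: List[str] = []
    for word in text.split():
        if word.lower() in corrections:
            # Boundary tags (hypothetical) let the model distinguish symbol types.
            tokens += ["<phon>"] + corrections[word.lower()] + ["</phon>"]
        else:
            tokens += list(word)   # character-level graphemes
        tokens.append("<sp>")      # word-separator token (hypothetical)
    return tokens[:-1]

# Force the past-tense pronunciation of "read"; other words stay as graphemes.
print(mixed_input("I read that book", {"read": ["R", "EH", "D"]}))
```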
The methods presented in this thesis hold the potential to revolutionise TTS system development. By reducing the time and expense involved in creating and maintaining pronunciation resources, this research paves the way for high-quality TTS in languages that lack extensive pronunciation resources. Furthermore, these methods empower language communities by removing the need for linguistic expertise and enabling crowd participation, thereby advancing the universal accessibility of speech technologies across diverse languages worldwide.
en
dc.identifier.uri
https://hdl.handle.net/1842/42379
dc.identifier.uri
http://dx.doi.org/10.7488/era/5073
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Fong, J., Lyth, D., Henter, G. E., Tang, H., and King, S. (2022a). Speech audio corrector: Using speech from non-target speakers for one-off correction of mispronunciations in grapheme-input text-to-speech. In Proc. Interspeech 2022, pages 1213–1217.
en
dc.relation.hasversion
Fong, J., Tang, H., and King, S. (2023). Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations. In 12th Speech Synthesis Workshop (SSW) 2023.
en
dc.relation.hasversion
Fong, J., Taylor, J., and King, S. (2020). Testing the limits of representation mixing for pronunciation correction in end-to-end speech synthesis. In Interspeech 2020, pages 4019–4023.
en
dc.relation.hasversion
Fong, J., Taylor, J., Richmond, K., and King, S. (2019a). A comparison between letters and phones as input to sequence-to-sequence models for speech synthesis. In 10th ISCA Speech Synthesis Workshop.
en
dc.relation.hasversion
Fong, J., Taylor, J., Richmond, K., and King, S. (2019b). Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data. In Interspeech 2019.
en
dc.relation.hasversion
Fong, J., Wang, Y., Agrawal, P., Manohar, V., Wu, J., Köhler, T., and He, Q. (2022). Towards zero-shot text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders. arXiv preprint arXiv:2210.16045.
en
dc.relation.hasversion
Fong, J., Williams, J., and King, S. (2021). Analysing temporal sensitivity of VQ-VAE sub-phone codebooks. In The 11th ISCA Speech Synthesis Workshop (SSW11), pages 227–231.
en
dc.relation.hasversion
Fong, J., Wu, J., Agrawal, P., Gibiansky, A., Koehler, T., and He, Q. Improving polyglot speech synthesis through multi-task and adversarial learning.
en
dc.relation.hasversion
Watts, O., Henter, G. E., Fong, J., and Valentini-Botinhao, C. (2019). Where do the improvements come from in sequence-to-sequence neural TTS? In 2019 ISCA Speech Synthesis Workshop (SSW), volume 10, pages 217–222.
en
dc.relation.hasversion
Williams, J., Fong, J., Cooper, E., and Yamagishi, J. (2021). Exploring disentanglement with multilingual and monolingual VQ-VAE. arXiv preprint arXiv:2105.01573.
en
dc.subject
text-to-speech
en
dc.subject
TTS systems
en
dc.subject
pronunciation control
en
dc.subject
phoneme-based pronunciation lexica
en
dc.title
Controlling text-to-speech pronunciation using limited linguistic resources
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Fong2024.pdf
- Size: 5.54 MB
- Format: Adobe Portable Document Format