Controlling text-to-speech pronunciation using limited linguistic resources

Fong, Jason

Files

Fong2024.pdf (5.54 MB)

Date

2024-10-29

Authors

Fong, Jason

Abstract

Correct pronunciation is essential for high-quality text-to-speech (TTS) systems. To achieve this, the majority of TTS systems rely on phonemes as an intermediate representation between input graphemes and output speech. Phonemes are generated by a pronunciation lexicon and grapheme-to-phoneme model. Both of these resources, however, are costly to create and consequently are only available for a limited number of well-resourced languages. In recent years, end-to-end TTS models, which can be learned entirely from scratch in a data-driven manner, have become more prevalent because they remove the need for creating and maintaining heuristically engineered modules. Although they are often used with phoneme inputs, another advantage of such models is that they enable grapheme-input TTS. While this is a promising step forward for creating TTS voices for under-resourced languages, a significant drawback is the resulting compromise in maintaining precise control over pronunciation. The primary focus of this thesis is to address these limitations by developing novel methods that enable pronunciation control for grapheme-input TTS systems, without depending on a large phoneme lexicon. I present two main contributions: Firstly, I demonstrate that pronunciation control can be achieved using small phoneme-based pronunciation lexica. Secondly, I demonstrate that ground-truth speech exemplars of word pronunciations can be used to directly control the pronunciations of a TTS system, or to retrieve novel spellings that are correctly pronounced. The methods presented in this thesis hold the potential to revolutionise the landscape of TTS system development. Through a reduction in the time and expenses involved in creating and maintaining pronunciation resources, this research paves the way for the implementation of high-quality TTS in languages that lack extensive pronunciation resources. 
Furthermore, these methodologies empower language communities by removing the necessity for linguistic expertise and enabling crowd participation, thereby advancing the universal accessibility of speech technologies across diverse languages worldwide.

URI

https://hdl.handle.net/1842/42379
http://dx.doi.org/10.7488/era/5073

This item appears in the following Collection(s)

Informatics thesis and dissertation collection