Edinburgh Research Archive

Sequence-to-sequence linguistic frontend modelling for text-to-speech and its improvement

Authors

Sun, Siqi

Abstract

With the development of deep learning techniques, the last decade has witnessed the transition of Text-to-Speech (TTS) from Statistical Parametric Speech Synthesis (SPSS) to neural TTS, and more recently to large-scale neural TTS. Loosely speaking, this evolution has been driven by the increased modelling capacity of the acoustic model. As acoustic models have become more powerful, the burden of contextualising the input text sequence (i.e., deriving and predicting various types of context from the text sequence), which was typically borne by the linguistic frontend, has largely shifted to the acoustic model. As a result, the linguistic frontend and its output pronunciation sequence have become increasingly simplified. Despite this, some recent studies have shown that for languages with irregular pronunciation patterns (e.g., English), pronunciation sequences remain an effective input representation for acoustic models, ensuring the pronunciation accuracy of synthesised speech. In other words, a high-quality explicit linguistic frontend is still a necessary component of (both conventional and large-scale) neural TTS for these languages. However, a conventional pipeline-based frontend is difficult to build and is vulnerable to compounding (i.e., accumulated) errors. Recently, Sequence-to-Sequence (Seq2Seq) frontends have emerged as a new paradigm for linguistic frontends, directly converting the text sequence to a corresponding pronunciation sequence at the sentence level. Owing to their unified modelling, Seq2Seq frontends greatly mitigate the drawbacks of pipeline-based frontends. Following this line of research, this thesis aims to facilitate the initial building and subsequent improvement of Seq2Seq frontends.

To overcome the lack of annotated pronunciation training targets for initialising a Seq2Seq frontend, we apply a bootstrapping method: a pre-existing pipeline-based frontend is used to generate pronunciation sequences for large amounts of unlabelled text, and the resulting pairs form the bootstrapping training dataset on which the Seq2Seq frontend is trained. The bootstrapped Seq2Seq frontend achieves impressive memorisation performance (99.9% word accuracy) and generalisation performance (82% word accuracy).

Nevertheless, the fixed lexical coverage of the bootstrapping training dataset (a consequence of the fixed-size pronunciation dictionary built into the pipeline-based frontend) is a major limitation of the bootstrapped Seq2Seq frontend. To overcome this limitation, we resort to easily accessible extra training sources that cover word types (and their pronunciation knowledge) not covered by the original bootstrapping training dataset. On this basis, three methods are proposed in this work: (i) a Forced-Alignment (FA)-based method that leverages transcribed speech audio as an extra training source; (ii) a Multi-Accent Bootstrapping (MAB) method that leverages bootstrapping training data of one or more accents other than the main accent we aim to model; and (iii) a Multi-Task Learning (MTL)-based method that also leverages transcribed speech audio, but has a simpler implementation than the FA-based method. The proposed methods are shown to be effective in acquiring novel pronunciation knowledge of previously uncovered word types from their corresponding extra training sources. Our experimental results show that, for these previously uncovered word types, the FA-based method improves word accuracy by more than 3% absolute, the MAB method by more than 12% absolute, and the MTL-based method by more than 3% absolute.
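For illustration only, the following minimal Python sketch shows the bootstrapping step described in the abstract under simplifying assumptions: the pipeline-based frontend is stood in for by a toy lexicon lookup, and all names (e.g., toy_pipeline_frontend, build_bootstrap_dataset) and the data format are hypothetical rather than taken from the thesis.

    # Hypothetical sketch of bootstrapping training data for a Seq2Seq frontend:
    # a pre-existing pipeline frontend (stubbed here as a toy lexicon lookup)
    # labels unlabelled text with pronunciation sequences, and the resulting
    # (text, pronunciation) pairs become the Seq2Seq frontend's training set.

    from typing import List, Tuple

    # Stand-in for a full pipeline-based frontend (text normalisation, POS
    # tagging, G2P, etc.). A real frontend would also handle homographs,
    # numbers, abbreviations, and out-of-dictionary words.
    TOY_LEXICON = {
        "the": "DH AH0", "cat": "K AE1 T", "sat": "S AE1 T",
        "on": "AA1 N", "mat": "M AE1 T",
    }

    def toy_pipeline_frontend(sentence: str) -> str:
        """Convert a sentence to a phoneme sequence via dictionary lookup."""
        phones = [TOY_LEXICON.get(w, "<unk>") for w in sentence.lower().split()]
        return " <wb> ".join(phones)  # <wb> marks word boundaries

    def build_bootstrap_dataset(unlabelled_text: List[str]) -> List[Tuple[str, str]]:
        """Pair each raw sentence with the pipeline frontend's output.

        The resulting sentence-level (character sequence, phoneme sequence)
        pairs serve as training targets for the Seq2Seq frontend.
        """
        return [(s, toy_pipeline_frontend(s)) for s in unlabelled_text]

    if __name__ == "__main__":
        corpus = ["the cat sat on the mat"]
        for text, prons in build_bootstrap_dataset(corpus):
            print("source:", list(text))      # character-level input sequence
            print("target:", prons.split())   # phoneme-level output sequence

In this sketch the Seq2Seq frontend itself is not trained; any standard encoder-decoder model could be fitted on the printed source/target pairs, which is the role the bootstrapping dataset plays in the thesis.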
