Sequence-to-sequence linguistic frontend modelling for text-to-speech and its improvement
Abstract
With the development of deep learning techniques, the last decade has witnessed the
transition of Text-to-Speech (TTS) from Statistical Parametric Speech Synthesis (SPSS)
to neural TTS, and more recently to large-scale neural TTS. Loosely speaking, this
evolution has been driven by the growing modelling capability of the acoustic model. As
acoustic models become increasingly powerful, the burden of contextualising the input
text sequence (i.e., deriving and predicting various types of context from the text
sequence), traditionally carried by the linguistic frontend, has largely shifted to
the acoustic model. As a result, the linguistic frontend and its output pronunciation
sequence have become increasingly simplified. Despite this, recent studies have shown
that for languages with irregular pronunciation patterns (e.g., English), pronunciation
sequences remain an effective input representation that helps acoustic models ensure the
pronunciation accuracy of synthesised speech. In other words, a high-quality explicit
linguistic frontend is still a necessary component of both conventional and large-scale
neural TTS for these languages. However, a conventional pipeline-based frontend is
difficult to build and is vulnerable to compounding errors (a.k.a. accumulated errors).
Recently, Sequence-to-Sequence (Seq2Seq) frontends have emerged as a new paradigm
for linguistic frontends, directly converting a text sequence into the corresponding
pronunciation sequence at the sentence level. Thanks to their unified modelling,
Seq2Seq frontends can greatly mitigate the drawbacks of pipeline-based frontends.
Following this line of research, this thesis aims to facilitate both the initial
building and the subsequent improvement of Seq2Seq frontends. To overcome the lack of
annotated pronunciation training targets for initialising a Seq2Seq frontend, we apply
a bootstrapping method: a pre-existing pipeline-based frontend generates pronunciation
sequences for large amounts of unlabelled text, and the resulting pairs form the
bootstrapping training dataset for the Seq2Seq frontend. The bootstrapped Seq2Seq
frontend achieves strong memorisation performance (99.9% word accuracy) and
generalisation performance (82% word accuracy).
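To make the bootstrapping step concrete, the following is a minimal sketch of how a pipeline-based frontend could be used to label unlabelled text and assemble the training pairs. The function pipeline_frontend and its toy lexicon are hypothetical stand-ins for a real rule- and lexicon-based frontend, not the system used in the thesis:

```python
# Sketch of the bootstrapping step: a pre-existing pipeline-based frontend
# labels unlabelled text with pronunciation sequences, and the resulting
# (text, pronunciation) pairs form the Seq2Seq training dataset.

from typing import List, Tuple

def pipeline_frontend(sentence: str) -> List[str]:
    """Hypothetical pipeline frontend: returns one phoneme string per word.

    A real frontend would run text normalisation, POS tagging, lexicon
    lookup and letter-to-sound rules; here we fake it with a toy lexicon.
    """
    toy_lexicon = {"the": "dh ax", "cat": "k ae t", "sat": "s ae t"}
    return [toy_lexicon.get(w.lower(), "<unk>") for w in sentence.split()]

def build_bootstrapping_dataset(corpus: List[str]) -> List[Tuple[str, str]]:
    """Label an unlabelled text corpus to create Seq2Seq training pairs.

    Each pair maps a raw sentence (source sequence) to a sentence-level
    pronunciation sequence (target sequence), keeping word boundaries.
    """
    pairs = []
    for sentence in corpus:
        phonemes = " | ".join(pipeline_frontend(sentence))  # "|" marks a word boundary
        pairs.append((sentence, phonemes))
    return pairs

if __name__ == "__main__":
    for src, tgt in build_bootstrapping_dataset(["the cat sat"]):
        print(f"{src!r} -> {tgt!r}")
        # 'the cat sat' -> 'dh ax | k ae t | s ae t'
```

The key property of this scheme is that the Seq2Seq frontend's lexical coverage is inherited entirely from the pipeline frontend's dictionary, which is precisely the limitation addressed next.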
Nevertheless, the fixed lexical coverage of the bootstrapping training dataset (a
consequence of the fixed-size pronunciation dictionary built into the pipeline-based
frontend) poses a major limitation on the bootstrapped Seq2Seq frontend. To overcome
this limitation, we turn to easily accessible extra training sources that cover word
types (and their pronunciation knowledge) not yet covered by the original bootstrapping
training dataset. On this basis, three methods are proposed in this work: (i) a
Forced-Alignment (FA)-based method that leverages transcribed speech audio as an extra
training source; (ii) a Multi-Accent Bootstrapping (MAB) method that leverages
bootstrapping training data from one or more accents other than the main accent being
modelled; and (iii) a Multi-Task Learning (MTL)-based method that also leverages
transcribed speech audio as an extra training source, but is simpler to implement than
the FA-based method. The proposed methods are shown to be effective in acquiring novel
pronunciation knowledge of previously uncovered word types from their corresponding
extra training sources. Our experimental results show that, for those previously
uncovered word types, the FA-based method improves word accuracy by more than 3%
absolute, the MAB method by more than 12% absolute, and the MTL-based method by more
than 3% absolute.
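As an illustration of the multi-task idea behind the MTL-based method, the sketch below trains a shared text encoder with two objectives: the main text-to-pronunciation loss on bootstrapped data, and an auxiliary loss whose targets are derived from transcribed speech audio. The module names, layer sizes, auxiliary target (a per-step regression stand-in) and loss weighting are all illustrative assumptions, not the thesis's exact architecture:

```python
# Illustrative multi-task training step: one shared encoder, two task heads.
# Assumed shapes/targets are placeholders; real auxiliary targets would come
# from transcribed speech audio.

import torch
import torch.nn as nn

class MultiTaskFrontend(nn.Module):
    def __init__(self, vocab_size=64, phone_vocab=48, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shared encoder over the input character sequence.
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Main head: one phoneme prediction per encoder step (a full Seq2Seq
        # frontend would instead use an autoregressive decoder).
        self.phone_head = nn.Linear(d_model, phone_vocab)
        # Auxiliary head: per-step targets derived from transcribed audio.
        self.aux_head = nn.Linear(d_model, 1)

    def forward(self, char_ids):
        h, _ = self.encoder(self.embed(char_ids))
        return self.phone_head(h), self.aux_head(h)

model = MultiTaskFrontend()
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step with random stand-in batches.
chars = torch.randint(0, 64, (2, 10))    # character ids
phones = torch.randint(0, 48, (2, 10))   # bootstrapped phoneme targets
acoustic = torch.randn(2, 10, 1)         # stand-in audio-derived targets

phone_logits, aux_pred = model(chars)
loss = ce(phone_logits.transpose(1, 2), phones) + 0.5 * mse(aux_pred, acoustic)
opt.zero_grad()
loss.backward()
opt.step()
```

Because both tasks share the encoder, gradients from the audio-derived objective can inject pronunciation knowledge for word types absent from the bootstrapping dictionary, without the separate alignment pipeline the FA-based method requires.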