Synthesising conversational speech using found data
Authors
O’Mahony, Johannah
Abstract
End-to-end speech synthesis models perform well when trained with clean read speech
data. Modelling conversational speech, the form of speech that we use every day, however,
is more challenging. First, we lack high-quality conversational datasets that are suitable for
training speech synthesis models. Second, conversational data is highly variable, containing
challenging spontaneous phenomena, such as overlapping speech and laughter. Third,
each conversational utterance is embedded in a communicative context, and there are many
contextual factors which must be accounted for. Finally, there is a significant gap in
our understanding of both speech perception and speech production in context, in the
fields of speech technology and speech science alike.
In this work, we address these issues in three parts. In Part 1, we examine three
factors that potentially affect the evaluation of speech synthesis output in context: the
task instructions, between-sentence textual dependency, and the prosodic realisation of
the utterances. We found that task instructions can affect ratings, and that presenting
speech in context narrows the gap in Mean Opinion Scores (MOS) between contextually
appropriate and inappropriate utterances. This suggests that MOS may not be sufficiently
sensitive for evaluating speech synthesis in context. We conclude that more targeted
evaluation is necessary to capture contextual effects.
In Part 2, we present two studies on improving conversational prosody using found
data and controllable synthesis. In the first study, we find that training a model on a
mixture of found conversational speech (questions and answers) and read speech improves
the realisation of questions, as measured by an increase in listener preferences for our
data-mixture model over a baseline trained only on read speech; for answers, no
significant difference between the systems was found. In the second study, we used
linguistically motivated word-level F0 representations based on Legendre polynomial
coefficients to condition a FastPitch model, allowing us to control the intonation of an
utterance. We found that conditioning a model on these representations increases the
similarity between the F0 contours of the system output and the target, compared with
both the baseline and a categorically conditioned model. The proposed representations
can then be used to explore patterns in conversational speech.
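The thesis does not reproduce its implementation in this abstract, but the core idea of summarising a word-level F0 contour with Legendre polynomial coefficients can be sketched as follows. This is a minimal illustration assuming NumPy; the function names, polynomial order, and sampling are illustrative choices, not the thesis's exact configuration.

```python
import numpy as np

def legendre_f0_coefficients(f0, order=4):
    """Fit a low-order Legendre polynomial to one word's F0 contour.

    f0: 1-D array of F0 values (e.g. in Hz) sampled across the word.
    Returns order+1 coefficients: coefficient 0 tracks the mean level,
    coefficient 1 the overall slope, and higher orders finer shape.
    """
    # Map the word's time axis onto [-1, 1], the natural domain
    # of the Legendre basis, regardless of word duration.
    x = np.linspace(-1.0, 1.0, len(f0))
    return np.polynomial.legendre.legfit(x, f0, deg=order)

def reconstruct_contour(coeffs, n_points):
    """Evaluate the fitted coefficients back into an F0 contour."""
    x = np.linspace(-1.0, 1.0, n_points)
    return np.polynomial.legendre.legval(x, coeffs)
```

A compact, fixed-length vector like this can condition a synthesis model (one vector per word) and, because the coefficients are interpretable, can also be clustered or edited to explore and control intonation patterns.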
In Part 3, we present two case studies investigating the impact of context on an
utterance’s prosodic realisation. In the first study, we used found data and our intonation
representations from Part 2 to explore prosodic variation on the discourse marker “well”.
Using clusters from the data exploration, we synthesised 20 different renditions of a
positive-polarity utterance, “well yes”, and a negative-polarity utterance, “well no”, and
performed a listening test to assess the degree of agreement perceived by listeners. We
found that the prosodic rendition of the utterance can affect the perceived agreement or
disagreement of the speaker, highlighting an example of the prosody-pragmatics interface.
In the second study, we used found data to explore turn-taking cues in conversation. We
found that conditioning a FastPitch speech synthesis model on turn-taking information
leads to perceptible differences in the turn-finality of an utterance as measured in subjective
listening tests. We showed that we can use speech synthesis to generate stimuli which reflect
the global trends in the training data and that this method can complement corpus research
in phonetics.