Edinburgh Research Archive

Synthesising conversational speech using found data

Authors

O’Mahony, Johannah

Abstract

End-to-end speech synthesis models perform well when trained with clean read speech data. Modelling conversational speech, the form of speech that we use every day, however, is more challenging. First, we lack high-quality conversational datasets that are suitable for training speech synthesis models. Second, conversational data is highly variable, containing challenging spontaneous phenomena, such as overlapping speech and laughter. Third, each conversational utterance is embedded in a communicative context, and there are many contextual factors which must be accounted for. Finally, there exists a significant knowledge gap with respect to our understanding of both speech perception and speech production in context, both in the fields of speech technology and speech science. In this work, we addressed these issues in three parts.

In Part 1, we examined three factors that potentially affect the evaluation of speech synthesis output in context, namely the task instructions, between-sentence textual dependency and the prosodic realisation of the utterances. We found that task instructions can affect ratings, and that presenting speech in context narrows the gap in Mean Opinion Scores (MOS) between the contextually appropriate utterance and the contextually inappropriate one. This suggests that MOS might not be sufficiently sensitive to evaluate speech synthesis in context. We conclude that more targeted evaluation is necessary to capture contextual effects.

In Part 2, we present two studies on improving conversational prosody using found data and controllable synthesis. In the first study, we find that training a model on a data mixture of found conversational speech (questions and answers) and read speech can improve the realisation of questions, as measured by an increase in preferences for our datamix model over the baseline, which was trained only on read speech. For answers, no significant difference between the systems was found.
In the second study, we used linguistically motivated word-level F0 representations based on Legendre polynomial coefficients to condition a FastPitch model, allowing us to control the intonation of an utterance. We found that conditioning a model on these representations increases the similarity of the F0 contours between the system output and the target output, relative to both the baseline and a categorically conditioned model. The proposed representations can then be used to explore patterns in conversational speech.

In Part 3, we present two case studies investigating the impact of context on an utterance's prosodic realisation. In the first study, we used found data and our intonation representations from Part 2 to explore prosodic variation in the discourse marker "well". Using clusters from the data exploration, we synthesised 20 different renditions of a positive-polarity utterance, "well yes", and a negative-polarity utterance, "well no", and performed a listening test to assess the degree of agreement perceived by listeners. We found that the prosodic rendition of the utterance can affect the perceived agreement or disagreement of the speaker, highlighting an example of the prosody-pragmatics interface. In the second study, we used found data to explore turn-taking cues in conversation. We found that conditioning a FastPitch speech synthesis model on turn-taking information leads to perceptible differences in the turn-finality of an utterance, as measured in subjective listening tests. We showed that we can use speech synthesis to generate stimuli which reflect the global trends in the training data, and that this method can complement corpus research in phonetics.
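To make the word-level F0 representation concrete, the sketch below fits low-order Legendre polynomial coefficients to an F0 contour sampled over a single word, then reconstructs the contour from those coefficients. This is a minimal illustration of the general technique, not the thesis's implementation: the polynomial order, contour values, and function names here are illustrative assumptions.

```python
import numpy as np

def f0_to_legendre(f0, order=4):
    """Fit Legendre coefficients to an F0 contour sampled over a word.

    Legendre polynomials are orthogonal on [-1, 1], so the word-internal
    time axis is mapped onto that interval before the least-squares fit.
    """
    t = np.linspace(-1.0, 1.0, len(f0))
    return np.polynomial.legendre.legfit(t, f0, order)

def legendre_to_f0(coeffs, n_frames):
    """Reconstruct an approximate F0 contour from the coefficients."""
    t = np.linspace(-1.0, 1.0, n_frames)
    return np.polynomial.legendre.legval(t, coeffs)

# Illustrative example: a rising contour over 50 frames, compressed
# into order + 1 = 5 coefficients that could condition a synthesiser.
f0 = 120.0 + 40.0 * np.linspace(0.0, 1.0, 50) ** 2
coeffs = f0_to_legendre(f0, order=4)
recon = legendre_to_f0(coeffs, 50)
```

A fixed-length coefficient vector like `coeffs` gives each word a compact, duration-independent intonation descriptor, which is what makes this kind of representation convenient both for conditioning a model and for clustering contours in exploratory corpus work.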
