Synthesis and evaluation of conversational characteristics in speech synthesis
Andersson, Johan Sebastian
Conventional synthetic voices can synthesise neutral read-aloud speech well. However, to make synthetic speech suitable for a wider range of applications, the voices need to express more than just word identity. We need to develop voices that can take part in a conversation and express, for example, agreement, disagreement, and hesitation in a natural and believable manner. In speech synthesis there are currently two dominant frameworks: unit selection and HMM-based speech synthesis. Both frameworks use recordings of human speech to build synthetic voices. Although the content of the recordings determines the segmental and prosodic phenomena that can be synthesised, surprisingly little research has been done on using the corpus to extend the limited behaviour of conventional synthetic voices. In this thesis we show how natural-sounding conversational characteristics can be added to both unit selection and HMM-based synthetic voices by adding speech from a spontaneous conversation to the voices. We recorded a spontaneous conversation, and by manually transcribing and selecting utterances we obtained approximately two thousand utterances from it. These conversational utterances were rich in conversational speech phenomena, but they lacked the general coverage that allows unit selection and HMM-based synthesis techniques to synthesise high-quality speech. We therefore investigated a number of blending approaches in the synthetic voices, in which the conversational utterances were augmented with conventional read-aloud speech. The synthetic voices that contained conversational speech were contrasted with conventional voices without conversational speech. The perceptual evaluations showed that listeners generally perceived the conversational voices as having a more conversational style than the conventional voices.
This conversational style was largely due to the conversational voices’ ability to synthesise utterances that contained conversational speech phenomena in a more natural manner than the conventional voices. Additionally, we conducted an experiment which showed that natural-sounding conversational characteristics in synthetic speech can convey pragmatic information about a topic to a listener, in our case an impression of certainty or uncertainty. We conclude that the limited behaviour of conventional synthetic voices can be enriched by using conversational speech in both unit selection and HMM-based speech synthesis.