Vocal Attractiveness Of Statistical Speech Synthesisers
Date
2011
Author
Andraszewicz, Sandra
Yamagishi, Junichi
King, Simon
Abstract
Our previous analysis of speaker-adaptive HMM-based speech synthesis methods suggested that there are two possible reasons why average voices can obtain higher subjective scores than any individual adapted voice: 1) model adaptation degrades speech quality proportionally to the distance ‘moved’ by the transforms, and 2) psychoacoustic effects relating to the attractiveness of the voice. This paper is a follow-on from that analysis and aims to separate these effects out. Our latest perceptual experiments focus on attractiveness, using average voices and speaker-dependent voices without model transformation, and show that using several speakers to create a voice improves smoothness (measured by Harmonics-to-Noise Ratio), reduces distance from the average voice in the log F0-F1 space of the final voice and hence makes it more attractive at the segmental level. However, this is weakened or overridden at supra-segmental or sentence levels.