Multidimensional scaling of listener responses to synthetic speech
Clark, Robert A J
The move to unit selection in speech synthesis has resulted in system improvements being made at subtle sub- and suprasegmental levels. Human perceptual evaluation of such subtle improvements requires highly focused perceptual attention to specific acoustic characteristics or cues. However, it is not well understood which acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may therefore be difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal while ignoring others that are perceptually more important to them. This paper describes a pilot study which aims to evaluate multidimensional scaling (MDS) as a possible method of determining which acoustic characteristics of synthetic speech influence listeners’ judgements of the naturalness of the speech. Using distance measures (either real or perceived distances), MDS techniques represent stimuli as points in an n-dimensional space. The space is configured so that similar stimuli are close together, while dissimilar stimuli are farther apart. Additionally, the dimensions of the space correspond to characteristics of the stimuli which influenced the perceived distances. Our results indicate that MDS techniques are likely to be a useful tool in understanding the complex psychoacoustic processes that listeners undergo when evaluating synthetic speech. This method has allowed us to identify a number of cues that appear to be particularly perceptually salient to listeners evaluating synthetic speech naturalness, namely prosodic cues (in terms of duration and/or intonation) and segmental or unit-level cues (in terms of appropriateness of units, or number of units).
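To make the idea of placing stimuli in a space from their pairwise distances concrete, the following is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. It is an illustration under stated assumptions, not the procedure used in the study: the abstract does not say which MDS variant was applied, and perceptual studies often use non-metric variants instead. The function name `classical_mds` and the dissimilarity values in `dissim` are hypothetical and invented for the example; they are not data from the paper.

```python
import numpy as np


def classical_mds(distances, n_dims=2):
    """Embed stimuli as points so that inter-point distances approximate `distances`.

    This is classical (Torgerson) MDS, chosen here only for illustration.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product (Gram) matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    # Eigendecomposition of the symmetric Gram matrix.
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the n_dims largest eigenvalues; clip small negatives caused by rounding.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical perceived dissimilarities between four synthetic stimuli
# (invented values, symmetric with a zero diagonal).
dissim = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

coords = classical_mds(dissim, n_dims=2)  # one row of coordinates per stimulus
```

Each row of `coords` is one stimulus. Stimuli judged similar end up close together, and the axes of the recovered space are what an analyst would then try to interpret in terms of acoustic cues. The orientation and sign of the axes are arbitrary, so only the relative positions carry meaning.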