Multidimensional scaling of listener responses to synthetic speech
Abstract
The move to unit-selection in speech synthesis has resulted
in system improvements being made at subtle sub- and suprasegmental
levels. Human perceptual evaluation of such subtle
improvements requires a highly sophisticated level of perceptual
attention to specific acoustic characteristics or cues. However,
it is not well understood what acoustic cues listeners attend
to by default when asked to evaluate synthetic speech. It may
therefore be difficult to design an evaluation method that
allows listeners to concentrate on only one dimension of the
signal while ignoring others that are perceptually more
important to them.
This paper describes a pilot study which aims to evaluate
multidimensional scaling (MDS) as a possible method of determining
what acoustic characteristics of synthetic speech influence
listeners’ judgements of the naturalness of the speech.
Using distance measures (either real or perceived distances),
MDS techniques represent stimuli as points in n-dimensional
space. The space is configured so that similar stimuli are close
together, while different stimuli are farther apart. Additionally,
the dimensions of the space correspond to characteristics of the
stimuli which influenced the perceived distances.
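
As a rough illustration of how distances become a spatial configuration, the following is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. The abstract does not say which MDS variant or software the study used, so the choice of classical MDS, the function name `classical_mds`, and the four-stimulus dissimilarity matrix are all assumptions made purely for illustration and are not data from the study.

```python
import numpy as np


def classical_mds(distances, n_dims=2):
    """Embed items in n_dims dimensions from a symmetric distance matrix.

    Illustrative helper (hypothetical name): classical/Torgerson MDS,
    which may differ from the MDS variant used in the study.
    """
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest n_dims.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Clip small negative eigenvalues that arise from rounding or
    # non-Euclidean (e.g. perceptual) dissimilarities.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Invented dissimilarities among four stimuli (not taken from the paper).
dissimilarity = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.0],
    [5.0, 4.0, 1.0, 0.0],
])

# Each row of coords is one stimulus placed in the 2-dimensional space.
coords = classical_mds(dissimilarity, n_dims=2)
```

In this toy case, stimuli that were rated as similar end up close together in `coords`. Interpreting what each axis means (for example, duration or number of units) is a separate analytic step that the code does not perform.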
Our results indicate that MDS techniques are likely to be a useful
tool for understanding the complex psychoacoustic processes
that listeners engage in when evaluating synthetic speech. This
method has allowed us to identify a number of cues that appear
to be particularly perceptually salient to listeners evaluating
synthetic speech naturalness, namely prosodic cues (in
terms of duration and/or intonation) and segmental or unit level
cues (in terms of appropriateness of units, or number of units).