Parameter tuning for unit selection speech synthesis
This project contributes to current research on the quality of speech synthesis through a perceptual experiment designed to find a better set of target cost weights for the Festival speech synthesis system. The experiment also clarifies which acoustic parameters listeners use when judging synthetic speech, and how important each parameter is. The project uses unit selection synthesis, which chooses units for concatenation using a series of target and join costs. Each cost is assigned a weight that indicates its importance in the overall cost. The project manipulates the target cost weights in order to find a set of values that better represents listeners' perception of the quality of the synthetic speech. Previous research shows that perceptual experiments are a common way of evaluating speech synthesis quality, so the project uses a listening experiment based on paired comparisons to reveal how listeners judge synthetic speech. The results were analysed using multidimensional scaling to show the structure of the data and to provide insight into the processes involved in speech perception. They showed that, when judging synthetic speech, participants attend to the position in phrase, position in syllable, and stress parameters. Participants also grouped the stimuli according to which of these parameters was given a weight of 1. In addition, a lack of weight on these parameters had more effect on the selection of units from the database than a large amount of weight. Analysis of the results showed that position in syllable was the most important parameter for high quality speech.
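To illustrate how target cost weights can change which unit is selected, the following is a minimal sketch in Python. It is not Festival's implementation: the function names, the feature values, and the normalised weighted-mismatch formula are assumptions made for illustration, and join costs and the search over whole unit sequences are left out. Only the three parameter names come from the abstract.

```python
from typing import Dict, List

# A unit (or target specification) described by categorical features.
# The feature names mirror the parameters named in the abstract; the
# values ("onset", "final", ...) are hypothetical.
Features = Dict[str, str]


def target_cost(target: Features, candidate: Features,
                weights: Dict[str, float]) -> float:
    """Weighted share of features on which the candidate mismatches the target.

    Assumed formula: sum of weights of mismatching features divided by the
    sum of all weights, giving a value between 0 (perfect match) and 1.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # no feature carries weight, so all candidates tie
    mismatch = sum(w for name, w in weights.items()
                   if target.get(name) != candidate.get(name))
    return mismatch / total_weight


def select_unit(target: Features, candidates: List[Features],
                weights: Dict[str, float]) -> Features:
    """Pick the candidate with the lowest target cost (join costs ignored)."""
    return min(candidates, key=lambda c: target_cost(target, c, weights))


target = {"stress": "stressed",
          "position_in_syllable": "onset",
          "position_in_phrase": "final"}
cand_a = {"stress": "stressed",
          "position_in_syllable": "coda",
          "position_in_phrase": "final"}
cand_b = {"stress": "unstressed",
          "position_in_syllable": "onset",
          "position_in_phrase": "final"}

# All weight on stress: cand_a matches the target on stress, so it is chosen.
stress_only = {"stress": 1.0, "position_in_syllable": 0.0,
               "position_in_phrase": 0.0}
# All weight on position in syllable: cand_b is chosen instead.
syllable_only = {"stress": 0.0, "position_in_syllable": 1.0,
                 "position_in_phrase": 0.0}

chosen_stress = select_unit(target, [cand_a, cand_b], stress_only)
chosen_syllable = select_unit(target, [cand_a, cand_b], syllable_only)
```

In this toy setting, moving the weight of 1 from one parameter to another changes the selected unit, which is the kind of manipulation the listening experiment compares.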