Join Cost for Unit Selection Speech Synthesis
State-of-the-art unit selection-based concatenative speech synthesis systems produce very high quality synthetic speech. This is due to a large speech database containing many instances of each speech unit, with a varied and natural distribution of prosodic and spectral characteristics. The join cost, which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from this large speech database. The ideal join cost is one that measures perceived discontinuity based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. In the first part of my research, I investigated various spectrally based distance measures for use in computing the join cost by designing a perceptual listening experiment. A variation on the usual perceptual test paradigm is proposed in this thesis: a wide range of join qualities is deliberately included in polysyllabic words. The test stimuli are obtained using a state-of-the-art unit-selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Three spectral features (Mel-frequency cepstral coefficients (MFCC), line spectral frequencies (LSF) and multiple centroid analysis (MCA) parameters) and various statistical distances (Euclidean, Kullback-Leibler and Mahalanobis) are used to obtain distance measures. Based on the correlations between perceptual scores and these spectral distances, I proposed new spectral distance measures which correlate well with human perception of concatenation discontinuities. The second part of my research concentrates on combining join cost computation and the smoothing operation, which is required to disguise joins, by learning an underlying representation from the acoustic signal. For this task, I have chosen linear dynamic models (LDMs), sometimes known as Kalman filters.
Three different initialisation schemes are used prior to Expectation-Maximisation (EM) in LDM training. Once the models are trained, the join cost is computed based on the error between model predictions and actual observations. Analytical measures are derived based on the shape of this error plot. These measures and initialisation schemes are compared by computing correlations using the perceptual data. The LDMs are also able to smooth the observations, which are then used to synthesise speech. To evaluate the LDM smoothing operation, another listening test is performed in which it is compared with the standard method (simple linear interpolation). In the third part of my research, I compared the best three join cost functions, chosen from the first and second parts, subjectively using a listening test. In this test, I also evaluated different smoothing methods: no smoothing, linear smoothing and smoothing achieved using LDMs.
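
As an illustration of the idea of a join cost based on the error between LDM predictions and actual observations, the following is a minimal sketch and not the implementation used in this thesis. It runs a standard Kalman filter over the frames of two concatenated units and reads off the one-step prediction error at the join. The function name `prediction_errors`, the hand-picked parameters `F`, `H`, `Q` and `R` (which would normally be learned by EM), the toy two-dimensional observations, and the choice of the error at the first frame after the join as the cost are all assumptions made for this example.

```python
import numpy as np


def prediction_errors(obs, F, H, Q, R, x0, P0):
    """Run a Kalman filter over obs (T x d) and return, for each frame,
    the Euclidean norm of the one-step prediction error (innovation)."""
    x, P = x0.copy(), P0.copy()
    identity = np.eye(len(x0))
    errors = []
    for y in obs:
        # Predict the next state and its covariance.
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Error between the predicted and the actual observation.
        e = y - H @ x_pred
        errors.append(np.linalg.norm(e))
        # Correct the state estimate using the observation.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x = x_pred + K @ e
        P = (identity - K @ H) @ P_pred
    return np.array(errors)


# Hypothetical model parameters (assumed here, not trained with EM).
F = np.eye(2)          # state transition
H = np.eye(2)          # observation matrix
Q = 0.01 * np.eye(2)   # state noise covariance
R = 0.1 * np.eye(2)    # observation noise covariance

# Toy "spectral" frames: a left unit followed by two candidate right units.
left = np.tile([1.0, 0.5], (10, 1))
right_smooth = np.tile([1.0, 0.5], (10, 1))   # continues the left unit
right_jump = np.tile([2.0, -0.5], (10, 1))    # spectral mismatch at the join

join_idx = len(left)  # index of the first frame after the join


def join_cost(left_unit, right_unit):
    """Prediction error at the first frame of the right-hand unit."""
    obs = np.vstack([left_unit, right_unit])
    errors = prediction_errors(obs, F, H, Q, R, x0=obs[0], P0=np.eye(2))
    return errors[join_idx]


cost_smooth = join_cost(left, right_smooth)
cost_jump = join_cost(left, right_jump)
print(cost_smooth, cost_jump)
```

In this toy setting the mismatched join produces a larger prediction error than the continuous one. The measures described above are derived from the shape of the whole error plot, so taking a single frame is only the simplest possible stand-in.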