Edinburgh Research Archive

Join Cost for Unit Selection Speech Synthesis

dc.contributor.advisor
King, Simon
en
dc.contributor.advisor
Taylor, Paul
en
dc.contributor.author
Vepa, Jithendra
en
dc.date.accessioned
2006-10-18T10:32:51Z
dc.date.available
2006-10-18T10:32:51Z
dc.date.issued
2004-07
dc.description.abstract
State-of-the-art unit selection-based concatenative speech systems produce very high quality synthetic speech. This is due to a large speech database containing many instances of each speech unit, with a varied and natural distribution of prosodic and spectral characteristics. The join cost, which measures how well two units can be joined together, is one of the main criteria for selecting appropriate units from this large speech database. The ideal join cost is one that measures perceived discontinuity based on easily measurable spectral properties of the units being joined, in order to ensure smooth and natural-sounding synthetic speech. During the first part of my research, I investigated various spectrally based distance measures for use in computing the join cost by designing a perceptual listening experiment. A variation on the usual perceptual test paradigm is proposed in this thesis: a wide range of join qualities is deliberately included in polysyllabic words. The test stimuli are obtained using a state-of-the-art unit-selection text-to-speech system: rVoice from Rhetorical Systems Ltd. Three spectral features (Mel-frequency cepstral coefficients (MFCC), line spectral frequencies (LSF) and multiple centroid analysis (MCA) parameters) and various statistical distances (Euclidean, Kullback-Leibler, Mahalanobis) are used to obtain distance measures. Based on the correlations between perceptual scores and these spectral distances, I proposed new spectral distance measures which correlate well with human perception of concatenation discontinuities. The second part of my research concentrates on combining join cost computation and the smoothing operation, which is required to disguise joins, by learning an underlying representation from the acoustic signal. To accomplish this task, I chose linear dynamic models (LDM), sometimes known as Kalman filters.
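As a rough illustration of the spectral distance measures named in the first part, the sketch below computes a Euclidean and a Mahalanobis distance between the feature frames on either side of a join. It is not the thesis's implementation: the function names, the three-dimensional frames, and the diagonal-covariance simplification of the Mahalanobis distance are all assumptions made for the example.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def diagonal_mahalanobis_distance(a, b, variances):
    """Mahalanobis distance under an assumed diagonal covariance matrix.

    `variances` holds the per-dimension variance of the feature; a full
    covariance matrix would be needed for the general form.
    """
    if not (len(a) == len(b) == len(variances)):
        raise ValueError("all inputs must have the same length")
    return math.sqrt(
        sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances))
    )


# Hypothetical feature frames (e.g. a few cepstral coefficients) taken
# from the last frame of the left unit and the first frame of the right.
left_frame = [1.0, 2.0, 3.0]
right_frame = [1.0, 4.0, 3.0]
variances = [1.0, 4.0, 1.0]

euclid = euclidean_distance(left_frame, right_frame)
mahal = diagonal_mahalanobis_distance(left_frame, right_frame, variances)
print(euclid, mahal)
```

A larger distance would be read as a more audible join; how well each measure tracks listeners' judgements is exactly what the perceptual experiment is described as testing.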
Three different initialisation schemes are used prior to Expectation-Maximisation (EM) in LDM training. Once the models are trained, the join cost is computed based on the error between model predictions and actual observations. Analytical measures are derived based on the shape of this error plot. These measures and initialisation schemes are compared by computing correlations using the perceptual data. The LDMs are also able to smooth the observations, which are then used to synthesise speech. To evaluate the LDM smoothing operation, another listening test is performed in which it is compared with the standard method (simple linear interpolation). In the third part of my research, I compared the best three join cost functions, chosen from the first and second parts, subjectively using a listening test. In this test, I also evaluated different smoothing methods: no smoothing, linear smoothing and smoothing achieved using LDMs.
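The idea of deriving a join cost from the error between model predictions and observations can be sketched with a scalar linear dynamic model. This is only an illustration under stated assumptions: the model is one-dimensional, its parameters (`f`, `h`, `q`, `r`) are picked by hand rather than trained with EM, and using the peak absolute error as the cost is a hypothetical stand-in for the analytical measures the abstract mentions but does not specify.

```python
def prediction_errors(observations, f, h, q, r, x0, p0):
    """One-step-ahead prediction errors of a scalar linear dynamic model.

    Assumed model (parameters are illustrative, not trained):
        x[t] = f * x[t-1] + w,  w ~ N(0, q)
        y[t] = h * x[t]   + v,  v ~ N(0, r)
    """
    x, p = x0, p0
    errors = []
    for y in observations:
        # Predict the state and its variance one step ahead.
        x_pred = f * x
        p_pred = f * p * f + q
        # Error between the actual observation and the model prediction.
        err = y - h * x_pred
        errors.append(err)
        # Standard Kalman update of the state estimate.
        s = h * p_pred * h + r
        k = p_pred * h / s
        x = x_pred + k * err
        p = (1.0 - k * h) * p_pred
    return errors


params = dict(f=1.0, h=1.0, q=0.01, r=0.1, x0=1.0, p0=1.0)

# Hypothetical one-dimensional feature tracks: one continuous, one with
# a discontinuity where two units are joined (at index 4).
smooth_track = [1.0] * 8
joined_track = [1.0] * 4 + [3.0] * 4

errors_smooth = prediction_errors(smooth_track, **params)
errors_joined = prediction_errors(joined_track, **params)

# Hypothetical cost: the peak absolute prediction error.
peak_smooth = max(abs(e) for e in errors_smooth)
peak_joined = max(abs(e) for e in errors_joined)
print(peak_smooth, peak_joined)
```

In this toy setup the error stays at zero for the continuous track and spikes at the join, which is the kind of error-plot shape a join cost could be built on.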
en
dc.format.extent
1842877 bytes
en
dc.format.mimetype
application/pdf
en
dc.identifier.uri
http://hdl.handle.net/1842/1452
dc.language.iso
en
dc.publisher
The University of Edinburgh. College of Science and Engineering. School of Informatics
en
dc.subject.other
unit selection
en
dc.subject.other
join cost
en
dc.subject.other
speech synthesis
en
dc.subject.other
polysyllabic words
en
dc.subject.other
line spectral frequencies
en
dc.subject.other
multiple centroid analysis
en
dc.subject.other
kalman filters
en
dc.title
Join Cost for Unit Selection Speech Synthesis
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
JithendraVepa_PhDthesis.pdf
Size:
1.76 MB
Format:
Adobe Portable Document Format

This item appears in the following Collection(s)