Optimising Join Cost Weights For Unit Selection Speech Synthesis
Item status: Restricted Access
Unit selection synthesis predominates today, but is not yet of a quality to rival natural speech. Unit selection can be inconsistent in quality, and one of the causes is the joins between units. Earlier research suggested that joins are perceived differently according to category. We investigated whether synthesis was perceived as more natural if join costs were calculated with reference to phonetic category. The join cost in the Festival multisyn synthesis system was extended beyond purely acoustic measures to categorise joins phonetically. Two methods were used to optimise the join subcosts for each category: a hand-tuned heuristic and an automated, data-centric approach. For this task the data-centric approach ultimately proved more suitable. Default synthesis was compared to the ‘optimised’ synthesis in a perceptual experiment. Results were mixed: some syntheses were perceived as better, some as worse, and for others participants expressed no preference. There was no significant overall preference for the optimised synthesis. The results indicated that our optimised join cost was not yet a good model. No attempt to optimise the Festival multisyn join cost had been made prior to this investigation. This suggests that further studies, varying the model and/or using more sophisticated optimisation methods, may yet produce synthesis that is perceived as more natural for any input text.
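The idea of weighting join subcosts by phonetic category can be illustrated with a minimal Python sketch. It is not the Festival multisyn implementation: the subcost names (`f0`, `energy`, `spectral`), the category labels, and all weight values are hypothetical and chosen only to show the shape of a category-dependent join cost.

```python
from typing import Dict

# Hypothetical weights: every name and number here is an assumption
# made for illustration, not a value from the study.
DEFAULT_WEIGHTS: Dict[str, float] = {"f0": 1.0, "energy": 1.0, "spectral": 1.0}
CATEGORY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "vowel": {"f0": 2.0, "energy": 1.0, "spectral": 1.5},
    "fricative": {"f0": 0.0, "energy": 1.0, "spectral": 0.5},
}


def join_cost(subcosts: Dict[str, float], category: str) -> float:
    """Return a weighted mean of the subcosts, using the weights of the
    join's phonetic category, or the default weights if the category is
    unknown."""
    weights = CATEGORY_WEIGHTS.get(category, DEFAULT_WEIGHTS)
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * subcosts[name] for name in weights)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # The same acoustic mismatch scores differently in different categories.
    mismatch = {"f0": 0.4, "energy": 0.2, "spectral": 0.6}
    for category in ("vowel", "fricative", "unknown"):
        print(category, round(join_cost(mismatch, category), 4))
```

In this sketch, optimising the join cost amounts to choosing the entries of `CATEGORY_WEIGHTS`, whether by hand or by an automated search over candidate values.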