The CSTR/EMIME HTS system for Blizzard Challenge 2010
Date
2010
Author
Yamagishi, Junichi
Watts, Oliver
Abstract
In the 2010 Blizzard Challenge, we focused on improving steps relating to feature extraction and labeling in the procedures
for training HMM-based speech synthesis systems. New auditory scales were used for spectral features and F0 representation.
We also adopted finer frequency bands, motivated by an auditory scale, for the aperiodicity measures, which determine
the level of noise in each band for mixed excitation. Furthermore, for tighter coupling of the HMM training and automatic labeling
processes, we have studied methods for stepwise bootstrap training. The listeners’ evaluation scores were much better than
those of the HTS-benchmark systems. More importantly, we can see some improvements even in speaker similarity, which has
been an acknowledged weakness of this method. In fact, speaker similarity is not a weak point of this method on
tasks using smaller databases. In terms of naturalness, the new systems outperformed or competed with unit selection systems
regardless of the size of the speech databases used and, moreover, competed with hybrid systems on smaller databases.