Roles of the Average Voice in Speaker-adaptive HMM-based Speech Synthesis
Date
2010
Author
Yamagishi, Junichi
Watts, Oliver
King, Simon
Usabaev, Bela
Abstract
In speaker-adaptive HMM-based speech synthesis, the synthetic speech of a few speakers sounds worse than that
of other speakers, even though every speaker has the same amount of adaptation data drawn from the same corpus. This paper investigates
these fluctuations in quality and finds that, as the mel-cepstral distance from the average voice becomes larger, MOS scores
generally become worse. Although the negative correlation obtained is not strong, it helps us improve the training and adaptation strategies for average voice models. Furthermore, we remark that this correlation is strongly linked to “vocal attractiveness.”
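
As a rough illustration of the analysis the abstract describes, the sketch below computes a mean mel-cepstral distance between two frame sequences and the Pearson correlation between per-speaker distances and MOS. It is not the authors' procedure: the function names, the assumption that frames are already time-aligned, the choice to skip the 0th (energy) coefficient, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import math


def mel_cepstral_distance(frames_a, frames_b):
    """Mean mel-cepstral distortion (dB) over frame pairs.

    Assumes both sequences are already time-aligned and equally long,
    and skips the 0th coefficient (energy), a common convention.
    """
    if not frames_a or len(frames_a) != len(frames_b):
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for a, b in zip(frames_a, frames_b):
        squared = sum((x - y) ** 2 for x, y in zip(a[1:], b[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(frames_a)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only; these are NOT results from the paper.
toy_distances = [4.0, 5.0, 6.0, 7.0]   # distance of each speaker from the average voice
toy_mos = [3.6, 3.4, 2.9, 3.0]         # listening-test score for each speaker

r = pearson(toy_distances, toy_mos)
print(f"correlation between distance and MOS: {r:.2f}")
```

A negative coefficient from such a calculation would correspond to the trend reported in the abstract, in which speakers farther from the average voice tend to receive lower scores.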