dc.contributor.advisor: Renals, Steve
dc.contributor.advisor: Richmond, Korin
dc.contributor.advisor: Yamagishi, Junichi
dc.contributor.author: Cabral, Joao P
dc.date.accessioned: 2011-05-24T13:07:20Z
dc.date.available: 2011-05-24T13:07:20Z
dc.date.issued: 2011
dc.identifier.uri: http://hdl.handle.net/1842/4877
dc.description.abstract: Parametric speech synthesis has received increased attention in recent years following the development of statistical HMM-based speech synthesis. However, the speech produced using this method still does not sound as natural as human speech, and there is limited parametric flexibility to replicate voice quality aspects, such as breathiness. The hypothesis of this thesis is that speech naturalness and voice quality can be more accurately replicated by an HMM-based speech synthesiser using an acoustic glottal source model, the Liljencrants-Fant (LF) model, to represent the source component of speech instead of the traditional impulse train. Two different analysis-synthesis methods were developed in this thesis in order to integrate the LF-model into a baseline HMM-based speech synthesiser, which is based on the popular HTS system and uses the STRAIGHT vocoder. The first method, called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model signal through a glottal post-filter to obtain the source signal and then generating speech by passing this source signal through the spectral envelope filter. The system which uses the GPF method (HTS-GPF system) is similar to the baseline system, but it uses a different source signal instead of the impulse train used by STRAIGHT. The second method, called Glottal Spectral Separation (GSS), generates speech by passing the LF-model signal through the vocal tract filter. The major advantage of the synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic properties of the LF-model parameters are automatically learnt by the HMMs. In this thesis, an initial perceptual experiment was conducted to compare the LF-model to the impulse train. The results showed that the LF-model was significantly better, both in terms of speech naturalness and replication of two basic voice qualities (breathy and tense).
In a second perceptual evaluation, the HTS-LF system was better than the baseline system, although the difference between the two had been expected to be larger. A third experiment was conducted to evaluate the HTS-GPF system and an improved HTS-LF system in terms of speech naturalness, voice similarity and intelligibility. The results showed that the HTS-GPF system performed similarly to the baseline. However, the HTS-LF system was significantly outperformed by the baseline. Finally, acoustic measurements were performed on the synthetic speech to investigate the speech distortion in the HTS-LF system. The results indicated that the most significant cause of speech distortion is a problem in replicating the rapid variations of the vocal tract filter parameters at transitions between voiced and unvoiced sounds. This problem encourages future work to further improve the system. [en]
dc.contributor.sponsor: Marie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568) [en]
dc.language.iso: en [en]
dc.publisher: The University of Edinburgh [en]
dc.relation.hasversion: Cabral, J. P. and Oliveira, L. C. (2005). Pitch-synchronous time-scaling for prosodic and voice quality transformations. In Proc. of INTERSPEECH, pages 1137–1140, Lisbon, Portugal. [en]
dc.subject: HMM-based speech synthesis [en]
dc.subject: glottal source modelling [en]
dc.subject: LF-model [en]
dc.title: HMM-based speech synthesis using an acoustic glottal source model [en]
dc.type: Thesis or Dissertation [en]
dc.type.qualificationlevel: Doctoral [en]
dc.type.qualificationname: PhD Doctor of Philosophy [en]

