Edinburgh Research Archive

HMM-based speech synthesis using an acoustic glottal source model

dc.contributor.advisor
Renals, Steve
en
dc.contributor.advisor
Richmond, Korin
en
dc.contributor.advisor
Yamagishi, Junichi
en
dc.contributor.author
Cabral, Joao P
en
dc.contributor.sponsor
Marie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568)
en
dc.date.accessioned
2011-05-24T13:07:20Z
dc.date.available
2011-05-24T13:07:20Z
dc.date.issued
2011
dc.description.abstract
Parametric speech synthesis has received increased attention in recent years following the development of statistical HMM-based speech synthesis. However, the speech produced using this method still does not sound as natural as human speech and there is limited parametric flexibility to replicate voice quality aspects, such as breathiness. The hypothesis of this thesis is that speech naturalness and voice quality can be more accurately replicated by an HMM-based speech synthesiser using an acoustic glottal source model, the Liljencrants-Fant (LF) model, to represent the source component of speech instead of the traditional impulse train. Two different analysis-synthesis methods were developed during this thesis, in order to integrate the LF-model into a baseline HMM-based speech synthesiser, which is based on the popular HTS system and uses the STRAIGHT vocoder. The first method, which is called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model signal through a glottal post-filter to obtain the source signal and then generating speech by passing this source signal through the spectral envelope filter. The system which uses the GPF method (HTS-GPF system) is similar to the baseline system, but it uses a different source signal instead of the impulse train used by STRAIGHT. The second method, called Glottal Spectral Separation (GSS), generates speech by passing the LF-model signal through the vocal tract filter. The major advantage of the synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic properties of the LF-model parameters are automatically learnt by the HMMs. In this thesis, an initial perceptual experiment was conducted to compare the LF-model to the impulse train. The results showed that the LF-model was significantly better, both in terms of speech naturalness and replication of two basic voice qualities (breathy and tense).
In a second perceptual evaluation, the HTS-LF system was better than the baseline system, although the difference between the two had been expected to be more significant. A third experiment was conducted to evaluate the HTS-GPF system and an improved HTS-LF system, in terms of speech naturalness, voice similarity and intelligibility. The results showed that the HTS-GPF system performed similarly to the baseline. However, the HTS-LF system was significantly outperformed by the baseline. Finally, acoustic measurements were performed on the synthetic speech to investigate the speech distortion in the HTS-LF system. The results indicated that a problem in replicating the rapid variations of the vocal tract filter parameters at transitions between voiced and unvoiced sounds is the most significant cause of speech distortion. This problem encourages future work to further improve the system.
en
dc.identifier.uri
http://hdl.handle.net/1842/4877
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Cabral, J. P. and Oliveira, L. C. (2005). Pitch-synchronous time-scaling for prosodic and voice quality transformations. In Proc. of INTERSPEECH, pages 1137–1140, Lisbon, Portugal.
en
dc.subject
HMM-based speech synthesis
en
dc.subject
glottal source modelling
en
dc.subject
LF-model
en
dc.title
HMM-based speech synthesis using an acoustic glottal source model
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Cabral2011.pdf
Size:
3.12 MB
Format:
Adobe Portable Document Format
