Edinburgh Research Archive

Statistical parametric speech synthesis using conversational data and phenomena

dc.contributor.advisor
King, Simon
en
dc.contributor.advisor
Wester, Mirjam
en
dc.contributor.advisor
Yamagishi, Junichi
en
dc.contributor.author
Dall, Rasmus
en
dc.contributor.sponsor
other
en
dc.date.accessioned
2018-03-28T09:56:02Z
dc.date.available
2018-03-28T09:56:02Z
dc.date.issued
2017-07-07
dc.description.abstract
Statistical parametric text-to-speech synthesis currently relies on predefined and highly controlled prompts read in a “neutral” voice. This thesis presents work on utilising recordings of free conversation for the purpose of filled pause synthesis and as an inspiration for improved general modelling of speech for text-to-speech synthesis purposes. A corpus of both standard prompts and free conversation is presented and the potential usefulness of conversational speech as the basis for text-to-speech voices is validated. Additionally, through psycholinguistic experimentation it is shown that filled pauses can have potential subconscious benefits to the listener but that current text-to-speech voices cannot replicate these effects. A method for pronunciation variant forced alignment is presented in order to obtain a more accurate automatic speech segmentation, since automatic segmentation is particularly poor for spontaneously produced speech. This pronunciation variant alignment is utilised not only to create a more accurate underlying acoustic model, but also as the driving force behind creating more natural pronunciation prediction at synthesis time. While this improves both the standard and spontaneous voices, the naturalness of spontaneous-speech-based voices still lags behind the quality of voices based on standard read prompts. Thus, the synthesis of filled pauses is investigated in relation to specific phonetic modelling of filled pauses and through techniques for the mixing of standard prompts with spontaneous utterances in order to retain the higher quality of standard-speech-based voices while still utilising the spontaneous speech for filled pause modelling. A method for predicting where to insert filled pauses in the speech stream is also developed and presented, relying on an analysis of human filled pause usage and a mix of language modelling methods. The method achieves an insertion accuracy in close agreement with human usage.
The various approaches are evaluated and their improvements documented throughout the thesis. Finally, the resulting filled pause quality is assessed through a repetition of the psycholinguistic experiments and an evaluation of the compilation of all developed methods.
en
dc.identifier.uri
http://hdl.handle.net/1842/29016
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Dall, R., Yamagishi, J., and King, S. (2014). Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation. In Proc. Speech Prosody, Dublin, Ireland.
en
dc.relation.hasversion
Dall, R., Wester, M., and Corley, M. (2014). The Effect of Filled Pauses and Speaking Rate on Speech Comprehension in Natural, Vocoded and Synthetic Speech. In Proc. Interspeech, Singapore.
en
dc.relation.hasversion
Dall, R., Tomalin, M., Wester, M., Byrne, W., and King, S. (2014). Investigating Automatic & Human Filled Pause Insertion for Speech Synthesis. In Proc. Interspeech, Singapore.
en
dc.relation.hasversion
Dall, R., Wester, M., and Corley, M. (2015). Disfluencies in Change Detection in Natural, Vocoded and Synthetic Speech. In Proc. Disfluencies in Spontaneous Speech, Edinburgh, Scotland, UK.
en
dc.relation.hasversion
Dall, R., Brognaux, S., Richmond, K., Valentini-Botinhao, C., Henter, G. E., Hirschberg, J., Yamagishi, J., and King, S. (2016). Testing the Consistency Assumption: Pronunciation Variant Forced Alignment in Read and Spontaneous Speech Synthesis. In Proc. ICASSP, Shanghai, China.
en
dc.relation.hasversion
Dall, R., and Gonzalvo, X. (2016). JNDSLAM: A SLAM extension for Speech Synthesis. In Proc. Speech Prosody, Boston, USA.
en
dc.relation.hasversion
Dall, R., Hashimoto, K., Oura, K., Nankaku, Y. and Tokuda, K. (2016). Redefining the Linguistic Context Feature Set for HMM and DNN TTS Through Position and Parsing. In Proc. Interspeech, San Francisco, USA.
en
dc.relation.hasversion
Dall, R., Tomalin, M. and Wester, M. (2016). Synthesising Filled Pauses: Representation and Datamixing. In Proc. SSW 9, Sunnyvale, USA.
en
dc.relation.hasversion
d’Alessandro, N., Tilmanne, J., Astrinaki, M., Hueber, T., Dall, R., Ravet, T., Moinet, A., Cakmak, H., Babacan, H., Barbulescu, A., Parfait, V., Huguenin, V., Kalayc, S., and Hu, Q. (2013). Reactive Statistical Mapping: Towards the Sketching of Performative Control with Data. In Innovative and Creative Developments in Multimodal Interaction Systems, Rybarczyk, Y., Cardoso, T., Rosas, J., and Camarinha-Matos, L. M. (eds.), Springer, New York.
en
dc.relation.hasversion
Aylett, M., Dall, R., Ghoshal, A., Henter, G. E., and Merritt, T. (2014). A Flexible Front-End for HTS. In Proc. Interspeech, Singapore.
en
dc.relation.hasversion
Tomalin, M., Wester, M., Dall, R., Byrne, B., and King, S. (2015). A Lattice-Based Approach to Automatic Filled Pause Insertion. In Proc. Disfluencies in Spontaneous Speech, Edinburgh, Scotland, UK.
en
dc.relation.hasversion
Wester, M., Corley, M., and Dall, R. (2015). The Temporal Delay Hypothesis: Natural, Vocoded and Synthetic Speech. In Proc. Disfluencies in Spontaneous Speech, Edinburgh, Scotland, UK.
en
dc.relation.hasversion
Wester, M., Aylett, M., Tomalin, M., and Dall, R. (2015). Artificial Personality and Disfluency. In Proc. Interspeech, Dresden, Germany.
en
dc.subject
text-to-speech synthesis
en
dc.subject
filled pause synthesis
en
dc.subject
psycholinguistic
en
dc.subject
neutral voice
en
dc.subject
pronunciation variant alignment
en
dc.subject
phonetic modelling
en
dc.title
Statistical parametric speech synthesis using conversational data and phenomena
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Dall2017.pdf
Size:
2.54 MB
Format:
Adobe Portable Document Format
