Synthesising conversational speech using found data
dc.contributor.advisor
King, Simon
dc.contributor.advisor
Lai, Catherine
dc.contributor.author
O’Mahony, Johannah
dc.date.accessioned
2025-07-16T15:01:52Z
dc.date.available
2025-07-16T15:01:52Z
dc.date.issued
2025-07-16
dc.description.abstract
End-to-end speech synthesis models perform well when trained with clean read speech
data. Modelling conversational speech, the form of speech that we use every day, however,
is more challenging. First, we lack high-quality conversational datasets that are suitable for
training speech synthesis models. Second, conversational data is highly variable, containing
challenging spontaneous phenomena, such as overlapping speech and laughter. Third,
each conversational utterance is embedded in a communicative context, and there are many
contextual factors which must be accounted for. Finally, there is a significant gap in
our understanding of both speech perception and speech production in context, in the
fields of speech technology and speech science alike.
In this work, we addressed these issues in three parts. In Part 1, we examined three
factors that potentially affect the evaluation of speech synthesis output in context, namely
the task instructions, between-sentence textual dependency and the prosodic realisation of
the utterances. We found that task instructions can affect ratings, and that presenting
speech in context narrows the gap in Mean Opinion Scores (MOS) between the
contextually appropriate utterance and the contextually inappropriate one. This suggests
that MOS might not be sufficiently sensitive to evaluate speech synthesis in context. We
conclude that more targeted evaluation is necessary to capture contextual effects.
In Part 2, we present two studies on improving conversational prosody using found
data and controllable synthesis. In the first study, we find that training a model on a
mixture of found conversational speech (questions and answers) and read speech can
improve the realisation of questions, as measured by an increase in listener preferences for
our data-mixture model over a baseline trained only on read speech. For answers, no
significant difference between the systems was found. In the second study, we use
linguistically-motivated word-level F0 representations based on Legendre polynomial
coefficients to condition a FastPitch model, allowing us to control the intonation of an
utterance. We find that conditioning a model on these representations increases the
similarity of the F0 contours between the system output and the target output, relative to
both the baseline and a categorically-conditioned model. The proposed representations can
then be used to explore patterns in conversational speech.
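The word-level Legendre representation can be sketched as follows. This is a minimal illustration using NumPy's `legendre` module; the function names, polynomial order, and time normalisation here are assumptions for exposition, and the thesis's actual feature extraction may differ:

```python
import numpy as np
from numpy.polynomial import legendre

def f0_to_legendre(f0_values, order=3):
    # Map the word-level F0 contour onto [-1, 1], the natural domain
    # of Legendre polynomials, then least-squares fit the coefficients.
    x = np.linspace(-1.0, 1.0, len(f0_values))
    return legendre.legfit(x, f0_values, order)

def legendre_to_f0(coeffs, n_points):
    # Evaluate the fitted polynomial to reconstruct a smooth contour.
    x = np.linspace(-1.0, 1.0, n_points)
    return legendre.legval(x, coeffs)

# Example: a rising F0 contour over one word (values in Hz).
contour = np.array([110.0, 115.0, 122.0, 131.0, 142.0])
coeffs = f0_to_legendre(contour, order=2)   # compact, fixed-size summary
smooth = legendre_to_f0(coeffs, len(contour))
```

The appeal of such a representation is that every word's contour is reduced to a small fixed-length vector (here, three coefficients), which a synthesis model such as FastPitch can be conditioned on regardless of word duration.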
In Part 3, we present two case studies investigating the impact of context on an
utterance’s prosodic realisation. In the first study, we used found data and our intonation
representations from Part 2 to explore prosodic variation on the discourse marker “well”.
Using clusters from the data exploration, we synthesised 20 different renditions of a
positive polarity utterance, “well yes”, and a negative polarity utterance, “well no”, and
performed a listening test to assess the degree of agreement perceived by listeners. We found
that the prosodic rendition of the utterance can affect the perceived agreement or
disagreement of the speaker, highlighting an example of the prosody-pragmatics interface.
In the second study, we used found data to explore turn-taking cues in conversation. We
found that conditioning a FastPitch speech synthesis model on turn-taking information
leads to perceptible differences in the turn-finality of an utterance as measured in subjective
listening tests. We showed that we can use speech synthesis to generate stimuli which reflect
the global trends in the training data and that this method can complement corpus research
in phonetics.
en
dc.identifier.uri
https://hdl.handle.net/1842/43687
dc.identifier.uri
http://dx.doi.org/10.7488/era/6219
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
O’Mahony, J., Oplustil Gallegos, P., Lai, C., & King, S. (2021). Factors Affecting the Evaluation of Synthetic Speech in Context. In Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11) (pp. 148-153). doi:10.21437/SSW.2021-26
en
dc.relation.hasversion
O’Mahony, J., Lai, C., & King, S. (2022). Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis. In Proceedings of Interspeech 2022 (pp. 3388-3392). doi:10.21437/Interspeech.2022-10167
en
dc.relation.hasversion
O’Mahony, J., Corkey, N., Lai, C., Klabbers, E., & King, S. (2024). Hierarchical Intonation Modelling for Speech Synthesis using Legendre Polynomial Coefficients. In Proceedings of Speech Prosody 2024 (pp. 1030-1034). doi:10.21437/SpeechProsody.2024-208
en
dc.relation.hasversion
O’Mahony, J., Lai, C., & Székely, É. (2024). “Well”, what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker “well” with found data and speech synthesis. In Proceedings of Interspeech 2024 (pp. 4084-4088). doi:10.21437/Interspeech.2024-2122
en
dc.relation.hasversion
O’Mahony, J., Lai, C., & King, S. (2023). Synthesising turn-taking cues using natural conversational data. In Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW 12) (pp. 75-80). doi:10.21437/SSW.2023-12
en
dc.relation.hasversion
Oplustil Gallegos, P., O’Mahony, J., & King, S. (2021). Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech. In Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11) (pp. 205-210). doi:10.21437/SSW.2021-36
en
dc.relation.hasversion
Stan, A., & O’Mahony, J. (2023). An analysis on the effects of speaker embedding choice in non auto-regressive TTS. In Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW 12) (pp. 134-138). doi:10.21437/SSW.2023-21
en
dc.relation.hasversion
Kakouros, S., & O’Mahony, J. (2023). What does BERT learn about prosody? In R. Skarnitzl & J. Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences (pp. 1454-1458). https://guarant.cz/icphs2023/622.pdf
en
dc.relation.hasversion
Kruyt, J., Huttner, L., & O’Mahony, J. (2023). Investigating the relationship between prosodic entrainment and interaction style. In R. Skarnitzl & J. Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences (pp. 3492-3496). https://guarant.cz/icphs2023/514.pdf
en
dc.relation.hasversion
Elmers, M., O’Mahony, J., & Székely, É. (2023). Synthesis after a couple PINTs: Investigating the role of pause-internal phonetic particles in speech synthesis and perception. In Proceedings of Interspeech 2023 (pp. 4843-4847). doi:10.21437/Interspeech.2023-2178
en
dc.relation.hasversion
Corkey, N., O’Mahony, J., & King, S. (2023). Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0. In Proceedings of Interspeech 2023 (pp. 2014-2015). ISCA.
en
dc.subject
synthetic speech
en
dc.subject
synthesising conversational speech
en
dc.subject
conversational data
en
dc.subject
dataset
en
dc.subject
quality of synthetic speech
en
dc.subject
conversational prosody
en
dc.subject
turn-taking in conversation
en
dc.title
Synthesising conversational speech using found data
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: O'Mahony2025.pdf
- Size: 2.27 MB
- Format: Adobe Portable Document Format