Synthesising conversational speech using found data
dc.contributor.advisor
King, Simon
dc.contributor.advisor
Lai, Catherine
dc.contributor.author
O’Mahony, Johannah
dc.date.accessioned
2025-07-16T15:01:52Z
dc.date.available
2025-07-16T15:01:52Z
dc.date.issued
2025-07-16
dc.description.abstract
End-to-end speech synthesis models perform well when trained with clean read speech
data. Modelling conversational speech, the form of speech that we use every day, however,
is more challenging. First, we lack high-quality conversational datasets that are suitable for
training speech synthesis models. Second, conversational data is highly variable, containing
challenging spontaneous phenomena, such as overlapping speech and laughter. Third,
each conversational utterance is embedded in a communicative context, and there are many
contextual factors which must be accounted for. Finally, there is a significant gap in
our understanding of both speech perception and speech production in context, in the
fields of speech technology and speech science alike.
In this work, we addressed these issues in three parts. In Part 1, we examined three
factors that potentially affect the evaluation of speech synthesis output in context, namely
the task instructions, between-sentence textual dependency and the prosodic realisation of
the utterances. We found that task instructions can affect ratings, and that presenting
speech in context narrows the gap in Mean Opinion Scores (MOS) between the
contextually appropriate utterance and the contextually inappropriate one. This suggests
that MOS might not be sufficiently sensitive to evaluate speech synthesis in context. We
conclude that more targeted evaluation is necessary to capture contextual effects.
In Part 2, we present two studies on improving conversational prosody using found
data and controllable synthesis. In the first study, we find that training a model on a
mixture of found conversational speech (questions and answers) and read speech can
improve the realisation of questions, as measured by an increase in listener preferences for
our data-mixture model over a baseline trained only on read speech. For answers, no
significant difference between the systems was found. In the second study, we use
linguistically-motivated word-level F0 representations based on Legendre polynomial
coefficients to condition a FastPitch model, allowing us to control the intonation of an
utterance. We find that conditioning a model on these representations increases the
similarity of the F0 contours between the system output and the target output, relative to
both the baseline and a categorically-conditioned model. The proposed representations can
then be used to explore patterns in conversational speech.
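The word-level Legendre representation can be sketched as follows. This is a minimal illustration using NumPy's `legendre` module; the function names, polynomial order, and time normalisation here are assumptions for exposition, and the thesis's actual feature extraction may differ:

```python
import numpy as np
from numpy.polynomial import legendre

def f0_to_legendre(f0_values, order=3):
    # Map the word-level F0 contour onto [-1, 1], the natural domain
    # of Legendre polynomials, then least-squares fit the coefficients.
    x = np.linspace(-1.0, 1.0, len(f0_values))
    return legendre.legfit(x, f0_values, order)

def legendre_to_f0(coeffs, n_points):
    # Evaluate the fitted polynomial to reconstruct a smooth contour.
    x = np.linspace(-1.0, 1.0, n_points)
    return legendre.legval(x, coeffs)

# Example: a rising F0 contour over one word (values in Hz).
contour = np.array([110.0, 115.0, 122.0, 131.0, 142.0])
coeffs = f0_to_legendre(contour, order=2)   # compact, fixed-size summary
smooth = legendre_to_f0(coeffs, len(contour))
```

The appeal of such a representation is that every word's contour is reduced to a small fixed-length vector (here, three coefficients), which a synthesis model such as FastPitch can be conditioned on regardless of word duration.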
In Part 3, we present two case studies investigating the impact of context on an
utterance’s prosodic realisation. In the first study, we used found data and our intonation
representations from Part 2 to explore prosodic variation on the discourse marker “well”.
Using clusters from the data exploration, we synthesised 20 different renditions of a
positive polarity utterance, “well yes”, and a negative polarity utterance, “well no”, and
performed a listening test to assess the degree of agreement perceived by listeners. We found
that the prosodic rendition of the utterance can affect the perceived agreement or
disagreement of the speaker, highlighting an example of the prosody-pragmatics interface.
In the second study, we used found data to explore turn-taking cues in conversation. We
found that conditioning a FastPitch speech synthesis model on turn-taking information
leads to perceptible differences in the turn-finality of an utterance as measured in subjective
listening tests. We showed that we can use speech synthesis to generate stimuli which reflect
the global trends in the training data and that this method can complement corpus research
in phonetics.
en
dc.identifier.uri
https://hdl.handle.net/1842/43687
dc.identifier.uri
http://dx.doi.org/10.7488/era/6219
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
O’Mahony, J., Oplustil Gallegos, P., Lai, C., & King, S. (2021). Factors Affecting the Evaluation of Synthetic Speech in Context. In Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11) (pp. 148-153). doi:10.21437/SSW.2021-26
en
dc.relation.hasversion
O’Mahony, J., Lai, C., & King, S. (2022). Combining conversational speech with read speech to improve prosody in Text-to-Speech synthesis. In Proceedings of Interspeech 2022 (pp. 3388-3392). doi:10.21437/Interspeech.2022-10167
en
dc.relation.hasversion
O’Mahony, J., Corkey, N., Lai, C., Klabbers, E., & King, S. (2024). Hierarchical Intonation Modelling for Speech Synthesis using Legendre Polynomial Coefficients. In Proceedings of Speech Prosody 2024 (pp. 1030-1034). doi:10.21437/SpeechProsody.2024-208
en
dc.relation.hasversion
O’Mahony, J., Lai, C., & Székely, É. (2024). “Well”, what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker “well” with found data and speech synthesis. In Proceedings of Interspeech 2024 (pp. 4084-4088). doi:10.21437/Interspeech.2024-2122
en
dc.relation.hasversion
O’Mahony, J., Lai, C., & King, S. (2023). Synthesising turn-taking cues using natural conversational data. In Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW 12) (pp. 75-80). doi:10.21437/SSW.2023-12
en
dc.relation.hasversion
Oplustil Gallegos, P., O’Mahony, J., & King, S. (2021). Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech. In Proceedings of the 11th ISCA Speech Synthesis Workshop (SSW 11) (pp. 205-210). doi:10.21437/SSW.2021-36
en
dc.relation.hasversion
Stan, A., & O’Mahony, J. (2023). An analysis on the effects of speaker embedding choice in non auto-regressive TTS. In Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW 12) (pp. 134-138). doi:10.21437/SSW.2023-21
en
dc.relation.hasversion
Kakouros, S., & O’Mahony, J. (2023). What does BERT learn about prosody? In R. Skarnitzl & J. Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences (pp. 1454-1458). https://guarant.cz/icphs2023/622.pdf
en
dc.relation.hasversion
Kruyt, J., Huttner, L., & O’Mahony, J. (2023). Investigating the relationship between prosodic entrainment and interaction style. In R. Skarnitzl & J. Volín (Eds.), Proceedings of the 20th International Congress of Phonetic Sciences (pp. 3492-3496). https://guarant.cz/icphs2023/514.pdf
en
dc.relation.hasversion
Elmers, M., O’Mahony, J., & Székely, É. (2023). Synthesis after a couple PINTs: Investigating the role of pause-internal phonetic particles in speech synthesis and perception. In Proceedings of Interspeech 2023 (pp. 4843-4847). doi:10.21437/Interspeech.2023-2178
en
dc.relation.hasversion
Corkey, N., O’Mahony, J., & King, S. (2023). Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0. In Proceedings of Interspeech 2023 (pp. 2014-2015). ISCA.
en
dc.subject
synthetic speech
en
dc.subject
synthesising conversational speech
en
dc.subject
conversational data
en
dc.subject
dataset
en
dc.subject
quality of synthetic speech
en
dc.subject
conversational prosody
en
dc.subject
turn-taking in conversation
en
dc.title
Synthesising conversational speech using found data
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: O'Mahony2025.pdf
- Size: 2.27 MB
- Format: Adobe Portable Document Format