Speech emotion recognition with ASR integration
dc.contributor.advisor
Lai, Catherine
dc.contributor.advisor
Bell, Peter
dc.contributor.author
Li, Yuanchao
dc.contributor.sponsor
Microsoft Research Audio and Acoustics Research Group
en
dc.date.accessioned
2025-11-11T10:54:32Z
dc.date.available
2025-11-11T10:54:32Z
dc.date.issued
2025-11-11
dc.description.abstract
Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication,
enabling emotionally intelligent systems, and serving as a fundamental component in
the development of Artificial General Intelligence (AGI). However, deploying SER in real-world,
spontaneous, and low-resource scenarios remains a significant challenge due to the complexity
of emotional expression and the limitations of current speech and language technologies.
This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER,
with the goal of enhancing the robustness, scalability, and practical applicability of emotion
recognition from spoken language.
As a starting point, we explore the interplay between ASR and SER by conducting an in-depth
analysis of speech foundation models on emotionally expressive speech, from both acoustic
and linguistic perspectives. Our findings uncover inherent limitations of these models: while
they achieve strong performance in ASR, they often fail to capture paralinguistic cues essential
for emotion recognition, and exhibit emotion-dependent biases. Additionally, we examine how
linguistic properties, such as part-of-speech distributions, affective word ratings, and utterance
length, interact with ASR performance across different emotion categories. Extensive
experiments reveal that ASR errors are not random but follow systematic patterns associated
with specific word types and emotional expressions across diverse speaking conditions.
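A minimal sketch of the kind of word-level analysis described above (Python, illustrative only and not the thesis code): it groups word error rate by emotion label to surface systematic, emotion-dependent ASR error patterns. The jiwer package is assumed to be available for WER computation, and the toy triples are hypothetical.

from collections import defaultdict
import jiwer  # third-party WER library, assumed installed (pip install jiwer)

def wer_by_emotion(samples):
    """samples: iterable of (reference_text, asr_hypothesis, emotion_label) triples."""
    grouped = defaultdict(lambda: {"refs": [], "hyps": []})
    for ref, hyp, emo in samples:
        grouped[emo]["refs"].append(ref)
        grouped[emo]["hyps"].append(hyp)
    # Corpus-level WER per emotion category
    return {emo: jiwer.wer(d["refs"], d["hyps"]) for emo, d in grouped.items()}

if __name__ == "__main__":
    toy = [
        ("i am so happy today", "i am so happy to day", "happy"),
        ("leave me alone", "leave me a loan", "angry"),
        ("nothing much happened", "nothing much happened", "neutral"),
    ]
    for emo, score in sorted(wer_by_emotion(toy).items()):
        print(f"{emo:>8s}: WER = {score:.2f}")

Such a breakdown, combined with part-of-speech tags or affective word ratings for the misrecognized words, is one way to test whether errors concentrate on specific word types within specific emotions.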
To overcome paralinguistic limitations while utilizing the hierarchical encoding capabilities of
speech foundation models, we propose a joint training framework that integrates ASR hidden
representations and lexical outputs in a hierarchical manner for SER. To mitigate transcription
quality issues, we introduce two complementary ASR error correction strategies that
require only a small amount of emotional speech data: a Large Language Model (LLM)-based
method that refines N-best hypotheses using emotion-specific prompts, and a Sequence-to-
Sequence (S2S) model that leverages discrete acoustic units to correct the 1-best hypothesis.
Both approaches substantially improve the emotional accuracy of ASR transcriptions with
low-resource speech data and enhance downstream SER performance.
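To illustrate the first strategy, the sketch below shows one way an LLM could rewrite an N-best list under an emotion-specific prompt. The prompt wording and the call_llm callable are hypothetical placeholders for any chat-completion client; this is not the thesis's actual prompt or pipeline.

def build_prompt(nbest, emotion_hint=None):
    # Hypothetical emotion-specific prompt; the wording is an assumption.
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    hint = f" The speaker sounds {emotion_hint}." if emotion_hint else ""
    return (
        "The following are N-best ASR hypotheses of one utterance."
        f"{hint} Return the single most plausible transcription, correcting "
        "recognition errors while preserving emotionally salient wording.\n"
        f"{hyps}\nCorrected transcription:"
    )

def correct_with_llm(nbest, call_llm, emotion_hint=None):
    """nbest: hypothesis strings, best first; call_llm: prompt string -> completion string."""
    return call_llm(build_prompt(nbest, emotion_hint)).strip()

# Usage with a dummy LLM that simply returns the 1-best hypothesis:
if __name__ == "__main__":
    nbest = ["i am so mad at you", "i am so bad at you", "i am so mad than you"]
    print(correct_with_llm(nbest, call_llm=lambda p: nbest[0], emotion_hint="angry"))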
Beyond transcription correction, this thesis introduces robust acoustic-lexical fusion frameworks
to handle emotion mismatch across modalities and reduce ASR error propagation. We
conduct a comprehensive investigation into cross-modal incongruity and propose incongruity-aware
fusion strategies that dynamically adapt to modality conflicts, extending solutions beyond
SER to include sarcasm and humor detection. Building on this, we develop an ASR-aware,
modality-gated fusion mechanism that integrates ASR error correction with dynamic
modality selection in a sequential manner. These fusion strategies achieve strong performance
on both controlled and real-world datasets, even under partially erroneous conditions
(i.e., ASR transcriptions containing errors).
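To make the modality-gated idea concrete, the following PyTorch sketch learns a per-utterance weight between acoustic and lexical embeddings, so the lexical stream can be down-weighted when the ASR transcript looks unreliable. The dimensions and gating design are assumptions for illustration, not the exact architecture proposed in the thesis.

import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    def __init__(self, acoustic_dim=768, lexical_dim=768, hidden_dim=256, n_classes=4):
        super().__init__()
        self.proj_a = nn.Linear(acoustic_dim, hidden_dim)
        self.proj_l = nn.Linear(lexical_dim, hidden_dim)
        # The gate sees both projected modalities and outputs a weight in (0, 1)
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, acoustic_emb, lexical_emb):
        a = torch.tanh(self.proj_a(acoustic_emb))
        l = torch.tanh(self.proj_l(lexical_emb))
        g = self.gate(torch.cat([a, l], dim=-1))  # per-utterance modality weight
        fused = g * a + (1.0 - g) * l             # convex combination of the two streams
        return self.classifier(fused)             # emotion logits

if __name__ == "__main__":
    model = ModalityGatedFusion()
    logits = model(torch.randn(8, 768), torch.randn(8, 768))
    print(logits.shape)  # torch.Size([8, 4])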
Finally, to reduce reliance on costly labeled emotion data, we present a novel semi-supervised
learning framework based on multi-view pseudo-labeling. By leveraging both acoustic similarity
metrics and LLM-based confidence estimation, this approach selects high-quality unlabeled
samples to augment training. The proposed framework not only outperforms traditional
pseudo-labeling methods for SER but also demonstrates generalizability to speech-based
Alzheimer’s dementia detection.
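As a toy sketch of the selection step in multi-view pseudo-labeling, the function below accepts an unlabeled sample only when an acoustic-similarity view and an LLM-based view agree on the label and the LLM confidence clears a threshold. The two scoring callables and the 0.8 threshold are placeholders, not the thesis's actual criteria.

def select_pseudo_labels(unlabeled_ids, acoustic_view, llm_view, conf_threshold=0.8):
    """
    unlabeled_ids: iterable of sample identifiers
    acoustic_view: id -> predicted label (e.g. nearest labeled centroid in a
                   self-supervised embedding space)
    llm_view:      id -> (predicted label, confidence in [0, 1])
    Returns (id, label) pairs accepted for augmenting the training set.
    """
    accepted = []
    for sid in unlabeled_ids:
        label_a = acoustic_view(sid)
        label_l, conf = llm_view(sid)
        if label_a == label_l and conf >= conf_threshold:
            accepted.append((sid, label_l))
    return accepted

The accepted pairs would then be mixed with the labeled data and the classifier retrained, as in standard self-training, with the agreement check acting as a filter against noisy pseudo-labels.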
en
dc.identifier.uri
https://hdl.handle.net/1842/44163
dc.identifier.uri
http://dx.doi.org/10.7488/era/6687
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Li, Y. (2021, August 31). Feeling estimation device, feeling estimation method, and storage medium. Google Patents. (US Patent 11,107,464)
en
dc.relation.hasversion
Li, Y. (2021). Semi-supervised learning for multimodal speech and emotion recognition. In Proceedings of the 2021 International Conference on Multimodal Interaction (pp. 817–821).
en
dc.relation.hasversion
Li, Y. (2022, September 13). Information processing apparatus, information processing method, and storage medium. Google Patents. (US Patent 11,443,759)
en
dc.relation.hasversion
Li, Y. (2023). Enhancing Speech Emotion Recognition for Real-World Applications via ASR Integration. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1–5)
en
dc.relation.hasversion
Li, Y. (2023, July 25). Information-processing device, vehicle, computer-readable storage medium, and information-processing method. Google Patents. (US Patent 11,710,499)
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2022). Fusing ASR outputs in joint training for speech emotion recognition. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7362–7366).
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2023). Multimodal dyadic impression recognition via listener adaptive cross-domain fusion. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5).
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2023). Transfer learning for personality perception via speech emotion recognition. In Proceedings of Interspeech 2023 (pp. 5197–5201).
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2024). Speech emotion recognition with ASR transcripts: A comprehensive study on word error rate and fusion techniques. 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE.
en
dc.relation.hasversion
Li, Y., Chen, P., Bell, P., & Lai, C. (2024). Crossmodal ASR error correction with discrete speech units. 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE.
en
dc.relation.hasversion
Li, Y., Gong, Y., Yang, C.-H. H., Bell, P., & Lai, C. (2025). Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5)
en
dc.relation.hasversion
Li, Y., Gui, A., Emmanouilidou, D., & Gamper, H. (2025). Addressing emotion bias in music emotion recognition and generation with Frechet audio distance. 2025 IEEE International Conference on Multimedia and Expo (ICME).
en
dc.relation.hasversion
Li, Y., Inoue, K., Tian, L., Fu, C., Ishi, C. T., Ishiguro, H., . . . Lai, C. (2023). I know your feelings before you do: Predicting future affective reactions in human-computer dialogue. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1–7)
en
dc.relation.hasversion
Li, Y., Ishi, C. T., Inoue, K., Nakamura, S., & Kawahara, T. (2019). Expressing reactive emotion based on multimodal emotion recognition for natural conversation in human–robot interaction. Advanced Robotics, 33(20), 1030–1041.
en
dc.relation.hasversion
Li, Y., Kollias, D., Chanel, G., Fanourakis, M., Muszynski, M., Booth, B. M., . . . Chen, H. (2025). HRAI 2025: The 1st workshop on holistic and responsible affective intelligence. In Proceedings of the 27th International Conference on Multimodal Interaction (pp. 814–817).
en
dc.relation.hasversion
Li, Y., & Lai, C. (2022). A cross-domain approach for continuous impression recognition from dyadic audio-visual-physio signals. arXiv preprint arXiv:2203.13932.
en
dc.relation.hasversion
Li, Y., & Lai, C. (2022). Robotic speech synthesis: Perspectives on interactions, scenarios, and ethics. arXiv preprint arXiv:2203.09599
en
dc.relation.hasversion
Li, Y., & Lai, C. (2023). Empowering dialogue systems with affective and adaptive interaction: Integrating social intelligence. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1–8).
en
dc.relation.hasversion
Li, Y., Lai, C., Lala, D., Inoue, K., & Kawahara, T. (2022). Alzheimer’s Dementia Detection through Spontaneous Dialogue with Proactive Robotic Listeners. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 875–879).
en
dc.relation.hasversion
Li, Y., Mohamied, Y., Bell, P., & Lai, C. (2023). Exploration of a self-supervised speech model: A study on emotional corpora. In 2022 IEEE Spoken Language Technology Workshop (SLT) (pp. 868–875).
en
dc.relation.hasversion
Li, Y., Urquhart, L., Karatas, N., Shao, S., Ishiguro, H., & Shen, X. (2024). Beyond voice assistants: Exploring advantages and risks of an in-car social robot in real driving scenarios. arXiv preprint arXiv:2402.11853.
en
dc.relation.hasversion
Li, Y., Wang, Y., & Cui, Z. (2023). Decoupled multimodal distilling for emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6631–6640).
en
dc.relation.hasversion
Li, Y., Williams, J., Feng, T., Mitra, V., Gong, Y., Shi, B., . . . Bell, P. (2023). Responsible speech foundation models. Interspeech.
en
dc.relation.hasversion
Li, Y., Zhang, Z., Han, J., Bell, P., & Lai, C. (2025). Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5)
en
dc.relation.hasversion
Li, Y., Zhao, T., Kawahara, T., et al. (2019). Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning. In Interspeech (pp. 2803–2807)
en
dc.relation.hasversion
Li, Y., Zhao, T., & Shen, X. (2020). Attention-Based Multimodal Fusion for Estimating Human Emotion in Real-World HRI. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (pp. 340–342).
en
dc.relation.hasversion
Li, Y., Zhao, Z., Klejch, O., Bell, P., & Lai, C. (2023). ASR and emotional speech: A word-level investigation of the mutual impact of speech and emotion recognition. In Proceedings of Interspeech 2023.
en
dc.relation.hasversion
Han, Z., Geng, T., Feng, H., Yuan, J., Richmond, K., & Li, Y. (2025). Cross-lingual speech emotion recognition: Humans vs. self-supervised models. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5).
en
dc.relation.hasversion
Chen, M., Zhang, H., Li, Y., Luo, J., Wu, W., Ma, Z., . . . others (2024). 1st place solution to Odyssey Emotion Recognition Challenge Task 1: Tackling class imbalance problem. In Proceedings of Odyssey 2024 (pp. 260–265).
en
dc.relation.hasversion
He, L., Li, Y., Feng, R., Han, X., Liu, Y.-L., Yang, Y., . . . Yuan, J. (2025). Exploring gender bias in Alzheimer’s disease detection: Insights from Mandarin and Greek speech perception. Proceedings of Interspeech 2025.
en
dc.relation.hasversion
Saliba, A., Li, Y., Sanabria, R., & Lai, C. (2024). Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW).
en
dc.relation.hasversion
Sanders, N., Li, Y., Richmond, K., & King, S. (2025). Segmentation-variant codebooks for preservation of paralinguistic and prosodic information. Proceedings of Interspeech 2025
en
dc.relation.hasversion
Sun, Y., Zhao, Z., Richmond, K., & Li, Y. (2025). Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5)
en
dc.relation.hasversion
Shen, X., Luo, Z., Li, Y., Ouyang, T., & Wu, Y. (2024). Chance-constrained abnormal data cleaning for robust classification with noisy labels. IEEE Transactions on Emerging Topics in Computational Intelligence
en
dc.relation.hasversion
Yang, C.-H. H., Park, T., Gong, Y., Li, Y., Chen, Z., Lin, Y.-T., . . . others (2024). Large language model based generative error correction: A challenge and baselines for speech recognition, speaker tagging, and emotion recognition. In 2024 IEEE Spoken Language Technology Workshop (SLT) (pp. 371–378).
en
dc.relation.hasversion
Liu, Y.-L., Li, Y., Feng, R., He, L., Chen, J.-X., Wang, Y.-M., . . . Ling, Z.-H. (2025). Leveraging cascaded binary classification and multimodal fusion for dementia detection through spontaneous speech. arXiv preprint arXiv:2505.19446.
en
dc.relation.hasversion
Li, Y., Gui, A., et al. (2025). Rethinking emotion bias in music via Frechet audio distance. arXiv preprint arXiv:2409.15545.
en
dc.subject
Speech Emotion Recognition
en
dc.subject
Automatic Speech Recognition
en
dc.subject
emotional tone
en
dc.subject
acoustic features of speech
en
dc.subject
linguistic features of speech
en
dc.subject
speech-to-text conversion
en
dc.subject
confidence scoring
en
dc.subject
self-training
en
dc.subject
emotion recognition
en
dc.subject
Alzheimer’s disease detection
en
dc.subject
human-computer interactions
en
dc.title
Speech emotion recognition with ASR integration
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Li2025.pdf
- Size: 6.71 MB
- Format: Adobe Portable Document Format