Speech emotion recognition with ASR integration
dc.contributor.advisor
Lai, Catherine
dc.contributor.advisor
Bell, Peter
dc.contributor.author
Li, Yuanchao
dc.contributor.sponsor
Microsoft Research Audio and Acoustics Research Group
en
dc.date.accessioned
2025-11-11T10:54:32Z
dc.date.available
2025-11-11T10:54:32Z
dc.date.issued
2025-11-11
dc.description.abstract
Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication,
enabling emotionally intelligent systems, and serving as a fundamental component in
the development of Artificial General Intelligence (AGI). However, deploying SER in real-world,
spontaneous, and low-resource scenarios remains a significant challenge due to the complexity
of emotional expression and the limitations of current speech and language technologies.
This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER,
with the goal of enhancing the robustness, scalability, and practical applicability of emotion
recognition from spoken language.
As a starting point, we explore the interplay between ASR and SER by conducting an in-depth
analysis of speech foundation models on emotionally expressive speech, from both acoustic
and linguistic perspectives. Our findings uncover inherent limitations of these models: while
they achieve strong performance in ASR, they often fail to capture paralinguistic cues essential
for emotion recognition, and exhibit emotion-dependent biases. Additionally, we examine how
linguistic properties, such as part-of-speech distributions, affective word ratings, and utterance
length, interact with ASR performance across different emotion categories. Extensive
experiments reveal that ASR errors are not random but follow systematic patterns associated
with specific word types and emotional expressions across diverse speaking conditions.
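A minimal sketch of the kind of word-level analysis described above (Python, illustrative only and not the thesis code): it groups word error rate by emotion label to surface systematic, emotion-dependent ASR error patterns. The jiwer package is assumed to be available for WER computation, and the toy triples are hypothetical.

from collections import defaultdict
import jiwer  # third-party WER library, assumed installed (pip install jiwer)

def wer_by_emotion(samples):
    """samples: iterable of (reference_text, asr_hypothesis, emotion_label) triples."""
    grouped = defaultdict(lambda: {"refs": [], "hyps": []})
    for ref, hyp, emo in samples:
        grouped[emo]["refs"].append(ref)
        grouped[emo]["hyps"].append(hyp)
    # Corpus-level WER per emotion category
    return {emo: jiwer.wer(d["refs"], d["hyps"]) for emo, d in grouped.items()}

if __name__ == "__main__":
    toy = [
        ("i am so happy today", "i am so happy to day", "happy"),
        ("leave me alone", "leave me a loan", "angry"),
        ("nothing much happened", "nothing much happened", "neutral"),
    ]
    for emo, score in sorted(wer_by_emotion(toy).items()):
        print(f"{emo:>8s}: WER = {score:.2f}")

Such a breakdown, combined with part-of-speech tags or affective word ratings for the misrecognized words, is one way to test whether errors concentrate on specific word types within specific emotions.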
To overcome paralinguistic limitations while utilizing the hierarchical encoding capabilities of
speech foundation models, we propose a joint training framework that integrates ASR hidden
representations and lexical outputs in a hierarchical manner for SER. To mitigate transcription
quality issues, we introduce two complementary ASR error correction strategies that
require only a small amount of emotional speech data: a Large Language Model (LLM)-based
method that refines N-best hypotheses using emotion-specific prompts, and a Sequence-to-
Sequence (S2S) model that leverages discrete acoustic units to correct the 1-best hypothesis.
Both approaches substantially improve the emotional accuracy of ASR transcriptions with
low-resource speech data and enhance downstream SER performance.
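To illustrate the first strategy, the sketch below shows one way an LLM could rewrite an N-best list under an emotion-specific prompt. The prompt wording and the call_llm callable are hypothetical placeholders for any chat-completion client; this is not the thesis's actual prompt or pipeline.

def build_prompt(nbest, emotion_hint=None):
    # Hypothetical emotion-specific prompt; the wording is an assumption.
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    hint = f" The speaker sounds {emotion_hint}." if emotion_hint else ""
    return (
        "The following are N-best ASR hypotheses of one utterance."
        f"{hint} Return the single most plausible transcription, correcting "
        "recognition errors while preserving emotionally salient wording.\n"
        f"{hyps}\nCorrected transcription:"
    )

def correct_with_llm(nbest, call_llm, emotion_hint=None):
    """nbest: hypothesis strings, best first; call_llm: prompt string -> completion string."""
    return call_llm(build_prompt(nbest, emotion_hint)).strip()

# Usage with a dummy LLM that simply returns the 1-best hypothesis:
if __name__ == "__main__":
    nbest = ["i am so mad at you", "i am so bad at you", "i am so mad than you"]
    print(correct_with_llm(nbest, call_llm=lambda p: nbest[0], emotion_hint="angry"))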
Beyond transcription correction, this thesis introduces robust acoustic-lexical fusion frameworks
to handle emotion mismatch across modalities and reduce ASR error propagation. We
conduct a comprehensive investigation into cross-modal incongruity and propose incongruity-aware
fusion strategies that dynamically adapt to modality conflicts, extending solutions beyond
SER to include sarcasm and humor detection. Building on this, we develop an ASR-aware,
modality-gated fusion mechanism that integrates ASR error correction with dynamic
modality selection in a sequential manner. These fusion strategies achieve strong performance
on both controlled and real-world datasets, even under partially erroneous conditions
(i.e., ASR transcriptions containing errors).
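To make the modality-gated idea concrete, the following PyTorch sketch learns a per-utterance weight between acoustic and lexical embeddings, so the lexical stream can be down-weighted when the ASR transcript looks unreliable. The dimensions and gating design are assumptions for illustration, not the exact architecture proposed in the thesis.

import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    def __init__(self, acoustic_dim=768, lexical_dim=768, hidden_dim=256, n_classes=4):
        super().__init__()
        self.proj_a = nn.Linear(acoustic_dim, hidden_dim)
        self.proj_l = nn.Linear(lexical_dim, hidden_dim)
        # The gate sees both projected modalities and outputs a weight in (0, 1)
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, acoustic_emb, lexical_emb):
        a = torch.tanh(self.proj_a(acoustic_emb))
        l = torch.tanh(self.proj_l(lexical_emb))
        g = self.gate(torch.cat([a, l], dim=-1))  # per-utterance modality weight
        fused = g * a + (1.0 - g) * l             # convex combination of the two streams
        return self.classifier(fused)             # emotion logits

if __name__ == "__main__":
    model = ModalityGatedFusion()
    logits = model(torch.randn(8, 768), torch.randn(8, 768))
    print(logits.shape)  # torch.Size([8, 4])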
Finally, to reduce reliance on costly labeled emotion data, we present a novel semi-supervised
learning framework based on multi-view pseudo-labeling. By leveraging both acoustic similarity
metrics and LLM-based confidence estimation, this approach selects high-quality unlabeled
samples to augment training. The proposed framework not only outperforms traditional
pseudo-labeling methods for SER but also demonstrates generalizability to speech-based
Alzheimer’s dementia detection.
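As a toy sketch of the selection step in multi-view pseudo-labeling, the function below accepts an unlabeled sample only when an acoustic-similarity view and an LLM-based view agree on the label and the LLM confidence clears a threshold. The two scoring callables and the 0.8 threshold are placeholders, not the thesis's actual criteria.

def select_pseudo_labels(unlabeled_ids, acoustic_view, llm_view, conf_threshold=0.8):
    """
    unlabeled_ids: iterable of sample identifiers
    acoustic_view: id -> predicted label (e.g. nearest labeled centroid in a
                   self-supervised embedding space)
    llm_view:      id -> (predicted label, confidence in [0, 1])
    Returns (id, label) pairs accepted for augmenting the training set.
    """
    accepted = []
    for sid in unlabeled_ids:
        label_a = acoustic_view(sid)
        label_l, conf = llm_view(sid)
        if label_a == label_l and conf >= conf_threshold:
            accepted.append((sid, label_l))
    return accepted

The accepted pairs would then be mixed with the labeled data and the classifier retrained, as in standard self-training, with the agreement check acting as a filter against noisy pseudo-labels.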
en
dc.identifier.uri
https://hdl.handle.net/1842/44163
dc.identifier.uri
http://dx.doi.org/10.7488/era/6687
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Li, Y. (2021, August 31). Feeling estimation device, feeling estimation method, and storage medium. Google Patents. (US Patent 11,107,464)
en
dc.relation.hasversion
Li, Y. (2021). Semi-supervised learning for multimodal speech and emotion recognition. In Proceedings of the 2021 International Conference on Multimodal Interaction (pp. 817–821).
en
dc.relation.hasversion
Li, Y. (2022, September 13). Information processing apparatus, information processing method, and storage medium. Google Patents. (US Patent 11,443,759)
en
dc.relation.hasversion
Li, Y. (2023). Enhancing Speech Emotion Recognition for Real-World Applications via ASR Integration. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1–5)
en
dc.relation.hasversion
Li, Y. (2023, July 25). Information-processing device, vehicle, computer-readable storage medium, and information-processing method. Google Patents. (US Patent 11,710,499)
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2022). Fusing ASR outputs in joint training for speech emotion recognition. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7362–7366).
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2023). Multimodal dyadic impression recognition via listener adaptive cross-domain fusion. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5).
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2023). Transfer learning for personality perception via speech emotion recognition. In Proceedings of Interspeech 2023 (pp. 5197–5201).
en
dc.relation.hasversion
Li, Y., Bell, P., & Lai, C. (2024). Speech emotion recognition with ASR transcripts: A comprehensive study on word error rate and fusion techniques. 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE.
en
dc.relation.hasversion
Li, Y., Chen, P., Bell, P., & Lai, C. (2024). Crossmodal ASR error correction with discrete speech units. 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE.
en
dc.relation.hasversion
Li, Y., Gong, Y., Yang, C.-H. H., Bell, P., & Lai, C. (2025). Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5)
en
dc.relation.hasversion
Li, Y., Gui, A., Emmanouilidou, D., & Gamper, H. (2025). Addressing emotion bias in music emotion recognition and generation with Frechet audio distance. 2025 IEEE International Conference on Multimedia and Expo (ICME).
en
dc.relation.hasversion
Li, Y., Inoue, K., Tian, L., Fu, C., Ishi, C. T., Ishiguro, H., . . . Lai, C. (2023). I know your feelings before you do: Predicting future affective reactions in human-computer dialogue. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (pp. 1–7)
en
dc.relation.hasversion
Li, Y., Ishi, C. T., Inoue, K., Nakamura, S., & Kawahara, T. (2019). Expressing reactive emotion based on multimodal emotion recognition for natural conversation in human–robot interaction. Advanced Robotics, 33(20), 1030–1041.
en
dc.relation.hasversion
Li, Y., Kollias, D., Chanel, G., Fanourakis, M., Muszynski, M., Booth, B. M., . . . Chen, H. (2025). HRAI 2025: The 1st workshop on holistic and responsible affective intelligence. In Proceedings of the 27th International Conference on Multimodal Interaction (pp. 814–817).
en
dc.relation.hasversion
Li, Y., & Lai, C. (2022). A cross-domain approach for continuous impression recognition from dyadic audio-visual-physio signals. arXiv preprint arXiv:2203.13932.
en
dc.relation.hasversion
Li, Y., & Lai, C. (2022). Robotic speech synthesis: Perspectives on interactions, scenarios, and ethics. arXiv preprint arXiv:2203.09599
en
dc.relation.hasversion
Li, Y., & Lai, C. (2023). Empowering dialogue systems with affective and adaptive interaction: Integrating social intelligence. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (pp. 1–8).
en
dc.relation.hasversion
Li, Y., Lai, C., Lala, D., Inoue, K., & Kawahara, T. (2022). Alzheimer’s Dementia Detection through Spontaneous Dialogue with Proactive Robotic Listeners. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 875–879).
en
dc.relation.hasversion
Li, Y., Mohamied, Y., Bell, P., & Lai, C. (2023). Exploration of a self-supervised speech model: A study on emotional corpora. In 2022 IEEE Spoken Language Technology Workshop (SLT) (pp. 868–875).
en
dc.relation.hasversion
Li, Y., Urquhart, L., Karatas, N., Shao, S., Ishiguro, H., & Shen, X. (2024). Beyond voice assistants: Exploring advantages and risks of an in-car social robot in real driving scenarios. arXiv preprint arXiv:2402.11853.
en
dc.relation.hasversion
Li, Y., Wang, Y., & Cui, Z. (2023). Decoupled multimodal distilling for emotion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6631–6640).
en
dc.relation.hasversion
Li, Y., Williams, J., Feng, T., Mitra, V., Gong, Y., Shi, B., . . . Bell, P. (2023). Responsible speech foundation models. Interspeech.
en
dc.relation.hasversion
Li, Y., Zhang, Z., Han, J., Bell, P., & Lai, C. (2025). Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5)
en
dc.relation.hasversion
Li, Y., Zhao, T., Kawahara, T., et al. (2019). Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning. In Interspeech (pp. 2803–2807)
en
dc.relation.hasversion
Li, Y., Zhao, T., & Shen, X. (2020). Attention-Based Multimodal Fusion for Estimating Human Emotion in Real-World HRI. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (pp. 340–342).
en
dc.relation.hasversion
Li, Y., Zhao, Z., Klejch, O., Bell, P., & Lai, C. (2023). ASR and emotional speech: A word-level investigation of the mutual impact of speech and emotion recognition. In Proceedings of Interspeech 2023.
en
dc.relation.hasversion
Han, Z., Geng, T., Feng, H., Yuan, J., Richmond, K., & Li, Y. (2025). Cross-lingual speech emotion recognition: Humans vs. self-supervised models. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5).
en
dc.relation.hasversion
Chen, M., Zhang, H., Li, Y., Luo, J., Wu, W., Ma, Z., . . . others (2024). 1st place solution to Odyssey Emotion Recognition Challenge Task 1: Tackling class imbalance problem. In Proceedings of Odyssey 2024 (pp. 260–265).
en
dc.relation.hasversion
He, L., Li, Y., Feng, R., Han, X., Liu, Y.-L., Yang, Y., . . . Yuan, J. (2025). Exploring gender bias in Alzheimer’s disease detection: Insights from Mandarin and Greek speech perception. Proceedings of Interspeech 2025.
en
dc.relation.hasversion
Saliba, A., Li, Y., Sanabria, R., & Lai, C. (2024). Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study on Speech Emotion Recognition. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW).
en
dc.relation.hasversion
Sanders, N., Li, Y., Richmond, K., & King, S. (2025). Segmentation-variant codebooks for preservation of paralinguistic and prosodic information. Proceedings of Interspeech 2025
en
dc.relation.hasversion
Sun, Y., Zhao, Z., Richmond, K., & Li, Y. (2025). Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5)
en
dc.relation.hasversion
Shen, X., Luo, Z., Li, Y., Ouyang, T., & Wu, Y. (2024). Chance-constrained abnormal data cleaning for robust classification with noisy labels. IEEE Transactions on Emerging Topics in Computational Intelligence
en
dc.relation.hasversion
Yang, C.-H. H., Park, T., Gong, Y., Li, Y., Chen, Z., Lin, Y.-T., . . . others (2024). Large language model based generative error correction: A challenge and baselines for speech recognition, speaker tagging, and emotion recognition. In 2024 IEEE Spoken Language Technology Workshop (SLT) (pp. 371–378).
en
dc.relation.hasversion
Liu, Y.-L., Li, Y., Feng, R., He, L., Chen, J.-X., Wang, Y.-M., . . . Ling, Z.-H. (2025). Leveraging cascaded binary classification and multimodal fusion for dementia detection through spontaneous speech. arXiv preprint arXiv:2505.19446.
en
dc.relation.hasversion
Li, Y., Gui, A., et al. (2025). Rethinking emotion bias in music via Frechet audio distance. arXiv preprint arXiv:2409.15545.
en
dc.subject
Speech Emotion Recognition
en
dc.subject
Automatic Speech Recognition
en
dc.subject
emotional tone
en
dc.subject
acoustic features of speech
en
dc.subject
linguistic features of speech
en
dc.subject
speech-to-text conversion
en
dc.subject
confidence scoring
en
dc.subject
self-training
en
dc.subject
emotion recognition
en
dc.subject
Alzheimer’s disease detection
en
dc.subject
human-computer interactions
en
dc.title
Speech emotion recognition with ASR integration
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Li2025.pdf
- Size: 6.71 MB
- Format: Adobe Portable Document Format