Edinburgh Research Archive

Speech emotion recognition with ASR integration

Authors

Li, Yuanchao

Abstract

Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication, enabling emotionally intelligent systems, and serving as a fundamental component in the development of Artificial General Intelligence (AGI). However, deploying SER in real-world, spontaneous, and low-resource scenarios remains a significant challenge due to the complexity of emotional expression and the limitations of current speech and language technologies. This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER, with the goal of enhancing the robustness, scalability, and practical applicability of emotion recognition from spoken language.

As a starting point, we explore the interplay between ASR and SER by conducting an in-depth analysis of speech foundation models on emotionally expressive speech, from both acoustic and linguistic perspectives. Our findings uncover inherent limitations of these models: while they achieve strong performance in ASR, they often fail to capture paralinguistic cues essential for emotion recognition and exhibit emotion-dependent biases. Additionally, we examine how linguistic properties, such as part-of-speech distributions, affective word ratings, and utterance length, interact with ASR performance across different emotion categories. Extensive experiments reveal that ASR errors are not random but follow systematic patterns associated with specific word types and emotional expressions across diverse speaking conditions.

To overcome these paralinguistic limitations while exploiting the hierarchical encoding capabilities of speech foundation models, we propose a joint training framework that integrates ASR hidden representations and lexical outputs in a hierarchical manner for SER. To mitigate transcription quality issues, we introduce two complementary ASR error correction strategies that require only a small amount of emotional speech data: a Large Language Model (LLM)-based method that refines N-best hypotheses using emotion-specific prompts, and a Sequence-to-Sequence (S2S) model that leverages discrete acoustic units to correct the 1-best hypothesis. Both approaches substantially improve the emotional accuracy of ASR transcriptions with low-resource speech data and enhance downstream SER performance.

Beyond transcription correction, this thesis introduces robust acoustic-lexical fusion frameworks to handle emotion mismatch across modalities and reduce ASR error propagation. We conduct a comprehensive investigation into cross-modal incongruity and propose incongruity-aware fusion strategies that dynamically adapt to modality conflicts, extending solutions beyond SER to sarcasm and humor detection. Building on this, we develop an ASR-aware, modality-gated fusion mechanism that integrates ASR error correction with dynamic modality selection in a sequential manner. These fusion strategies achieve strong performance on both controlled and real-world datasets, even when the lexical input comes from partially erroneous ASR transcriptions.

Finally, to reduce reliance on costly labeled emotion data, we present a novel semi-supervised learning framework based on multi-view pseudo-labeling. By leveraging both acoustic similarity metrics and LLM-based confidence estimation, this approach selects high-quality unlabeled samples to augment training. The proposed framework not only outperforms traditional pseudo-labeling methods for SER but also demonstrates generalizability to speech-based Alzheimer's dementia detection.
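
To make the modality-gated fusion idea concrete, the following is a minimal, illustrative PyTorch sketch of a layer that learns how much to trust a lexical embedding (e.g. from ASR output) relative to an acoustic embedding. All module names, dimensions, and the number of emotion classes are assumptions for exposition only, not the architecture described in the thesis.

# Illustrative sketch only: a minimal modality-gated fusion layer.
# Names, dimensions, and the 4-class output are assumptions, not the thesis's design.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, acoustic_dim: int, lexical_dim: int, hidden_dim: int, num_classes: int = 4):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.lexical_proj = nn.Linear(lexical_dim, hidden_dim)
        # Scalar gate deciding how much weight the lexical stream gets,
        # e.g. down-weighting it when ASR transcriptions look unreliable.
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, acoustic_emb: torch.Tensor, lexical_emb: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.acoustic_proj(acoustic_emb))
        l = torch.tanh(self.lexical_proj(lexical_emb))
        g = self.gate(torch.cat([a, l], dim=-1))   # (batch, 1), in [0, 1]
        fused = g * l + (1.0 - g) * a              # convex combination of the two streams
        return self.classifier(fused)              # emotion logits

# Usage with dummy utterance-level embeddings:
model = GatedFusion(acoustic_dim=768, lexical_dim=768, hidden_dim=256)
logits = model(torch.randn(8, 768), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])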
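
Similarly, the multi-view pseudo-labeling step can be sketched as a simple agreement filter between an acoustic-similarity view and an LLM-confidence view. The thresholds, field names, and helper functions below are hypothetical and only indicate the selection logic at a high level, not the thesis's exact criteria.

# Illustrative sketch only: keep an unlabeled sample when an acoustic-similarity
# view and an LLM-confidence view agree on its pseudo-label.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def select_pseudo_labeled(unlabeled, class_centroids, sim_threshold=0.7, conf_threshold=0.8):
    """unlabeled: list of dicts with 'embedding' (np.ndarray),
       'llm_label' (str), and 'llm_confidence' (float in [0, 1]).
       class_centroids: dict mapping emotion label -> mean acoustic embedding
       of labeled samples in that class. Thresholds are illustrative."""
    selected = []
    for sample in unlabeled:
        # Acoustic view: nearest class centroid by cosine similarity.
        sims = {lab: cosine(sample["embedding"], c) for lab, c in class_centroids.items()}
        acoustic_label = max(sims, key=sims.get)
        # LLM view: label and confidence obtained from a prompted LLM.
        if (acoustic_label == sample["llm_label"]
                and sims[acoustic_label] >= sim_threshold
                and sample["llm_confidence"] >= conf_threshold):
            selected.append((sample, acoustic_label))
    return selected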
