Speech emotion recognition with ASR integration
Authors
Li, Yuanchao
Abstract
Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication,
enabling emotionally intelligent systems, and serving as a fundamental component in
the development of Artificial General Intelligence (AGI). However, deploying SER in real-world,
spontaneous, and low-resource scenarios remains a significant challenge due to the complexity
of emotional expression and the limitations of current speech and language technologies.
This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER,
with the goal of enhancing the robustness, scalability, and practical applicability of emotion
recognition from spoken language.
As a starting point, we explore the interplay between ASR and SER by conducting an in-depth
analysis of speech foundation models on emotionally expressive speech, from both acoustic
and linguistic perspectives. Our findings uncover inherent limitations of these models: while
they achieve strong performance in ASR, they often fail to capture paralinguistic cues essential
for emotion recognition, and exhibit emotion-dependent biases. Additionally, we examine how
linguistic properties, such as part-of-speech distributions, affective word ratings, and utterance
length, interact with ASR performance across different emotion categories. Extensive
experiments reveal that ASR errors are not random but follow systematic patterns associated
with specific word types and emotional expressions across diverse speaking conditions.
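To make that kind of analysis concrete, the sketch below (not the thesis's actual pipeline) computes per-utterance word error rate, compares its mean across emotion categories, and checks its rank correlation with utterance length; the file name and the columns reference, hypothesis, and emotion are hypothetical.

```python
# Minimal sketch: test whether ASR word error rate varies systematically with
# emotion category and with a simple linguistic property (utterance length).
# Assumes a CSV with hypothetical columns: reference, hypothesis, emotion.
import pandas as pd
import jiwer
from scipy.stats import spearmanr

df = pd.read_csv("asr_outputs.csv")  # hypothetical file name

# Per-utterance WER and utterance length (in reference words).
df["wer"] = [jiwer.wer(ref, hyp) for ref, hyp in zip(df["reference"], df["hypothesis"])]
df["length"] = df["reference"].str.split().str.len()

# Emotion-dependent bias: compare WER statistics across emotion categories.
print(df.groupby("emotion")["wer"].agg(["mean", "std", "count"]))

# Interaction with utterance length: rank correlation between length and WER.
rho, p = spearmanr(df["length"], df["wer"])
print(f"Spearman rho(length, WER) = {rho:.3f} (p = {p:.3g})")
```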
To overcome paralinguistic limitations while utilizing the hierarchical encoding capabilities of
speech foundation models, we propose a joint training framework that integrates ASR hidden
representations and lexical outputs in a hierarchical manner for SER. To mitigate transcription
quality issues, we introduce two complementary ASR error correction strategies that
require only a small amount of emotional speech data: a Large Language Model (LLM)-based
method that refines N-best hypotheses using emotion-specific prompts, and a Sequence-to-
Sequence (S2S) model that leverages discrete acoustic units to correct the 1-best hypothesis.
Both approaches substantially improve the emotional accuracy of ASR transcriptions with
low-resource speech data and enhance downstream SER performance.
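As an illustration of the LLM-based strategy, the sketch below builds an emotion-conditioned prompt over N-best hypotheses and asks a chat-style LLM for a corrected transcription; the prompt wording, the OpenAI backend, and the model name are placeholders rather than the thesis's actual setup.

```python
# Illustrative sketch of LLM-based N-best correction with an emotion-aware prompt;
# the actual prompt, model, and decoding configuration in the thesis may differ.
from openai import OpenAI  # any chat-style LLM backend would work

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def correct_nbest(nbest: list[str], emotion_hint: str) -> str:
    """Ask an LLM to fuse and repair N-best ASR hypotheses for emotional speech."""
    hypotheses = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    prompt = (
        "The following are N-best ASR hypotheses of one spontaneous utterance "
        f"spoken with a {emotion_hint} tone. Emotional speech often causes "
        "substitutions of affective words and disfluency-related errors. "
        "Return the single most plausible transcription, nothing else.\n"
        f"{hypotheses}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

print(correct_nbest(
    ["i am so mad at you write now", "i am so mad at you right now"],
    emotion_hint="angry",
))
```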
Beyond transcription correction, this thesis introduces robust acoustic-lexical fusion frameworks
to handle emotion mismatch across modalities and reduce ASR error propagation. We
conduct a comprehensive investigation into cross-modal incongruity and propose incongruity-aware
fusion strategies that dynamically adapt to modality conflicts, extending solutions beyond
SER to include sarcasm and humor detection. Building on this, we develop an ASR-aware,
modality-gated fusion mechanism that integrates ASR error correction with dynamic
modality selection in a sequential manner. These fusion strategies achieve strong performance
on both controlled and real-world datasets, even when one input modality is partially erroneous
(i.e., ASR transcriptions containing recognition errors).
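A generic modality-gated fusion layer of this kind can be sketched as follows; the dimensions, pooling, and gating form are illustrative assumptions, not the exact architecture developed in the thesis.

```python
# Minimal, generic sketch of modality-gated fusion: a learned gate, conditioned on
# both streams, decides per dimension how much to trust the acoustic versus the
# lexical (ASR-derived) representation.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int, num_emotions: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, acoustic: torch.Tensor, lexical: torch.Tensor) -> torch.Tensor:
        # gate ~ 1 -> rely on acoustic cues; gate ~ 0 -> rely on lexical cues.
        g = self.gate(torch.cat([acoustic, lexical], dim=-1))
        fused = g * acoustic + (1.0 - g) * lexical
        return self.classifier(fused)

model = GatedFusion(dim=256, num_emotions=4)
a = torch.randn(8, 256)   # e.g. pooled speech-encoder features
t = torch.randn(8, 256)   # e.g. pooled text-encoder features over ASR transcripts
logits = model(a, t)      # shape: (8, 4)
```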
Finally, to reduce reliance on costly labeled emotion data, we present a novel semi-supervised
learning framework based on multi-view pseudo-labeling. By leveraging both acoustic similarity
metrics and LLM-based confidence estimation, this approach selects high-quality unlabeled
samples to augment training. The proposed framework not only outperforms traditional
pseudo-labeling methods for SER but also demonstrates generalizability to speech-based
Alzheimer's dementia detection.
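The multi-view selection idea can be illustrated roughly as follows: an unlabeled utterance receives a pseudo-label only when an acoustic-similarity view and an LLM-based view agree and the LLM's confidence is high enough. The thresholds, field names, and agreement rule below are illustrative assumptions, not the thesis's exact criteria.

```python
# Rough sketch of multi-view pseudo-label selection for semi-supervised SER.
from dataclasses import dataclass

@dataclass
class Candidate:
    utt_id: str
    acoustic_label: str    # label of nearest labeled neighbor in acoustic-embedding space
    acoustic_sim: float    # cosine similarity to that neighbor
    llm_label: str         # label predicted by an LLM from the ASR transcript
    llm_confidence: float  # hypothetical confidence score elicited from the LLM

def select_pseudo_labels(candidates, sim_thresh=0.8, conf_thresh=0.7):
    selected = []
    for c in candidates:
        # Keep a sample only when both views agree and both views are confident.
        if (c.acoustic_label == c.llm_label
                and c.acoustic_sim >= sim_thresh
                and c.llm_confidence >= conf_thresh):
            selected.append((c.utt_id, c.llm_label))
    return selected

pool = [
    Candidate("u001", "happy", 0.91, "happy", 0.85),
    Candidate("u002", "sad", 0.65, "angry", 0.90),  # views disagree -> rejected
]
print(select_pseudo_labels(pool))  # [('u001', 'happy')]
```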