Quantifying the distributional distance between synthetic and real speech
Abstract
This thesis addresses the discrepancy between the high perceived naturalness of synthetic speech and its comparatively limited utility for training robust downstream applications, specifically Automatic Speech Recognition (ASR) systems. While recent Text-to-Speech (TTS) models can achieve subjective ratings that are close to ground truth under some listening test protocols, ASR models trained exclusively on synthetic data consistently exhibit significantly higher error rates when evaluated on real speech. We posit that this persistent synthetic-real gap arises from the inability of current TTS models to fully approximate the nuanced, high-dimensional probability distribution of real speech, particularly concerning its inherent variability.
To quantify this disparity, we introduce the Word Error Rate Ratio (WERR), a heuristic that compares the word error rate of an ASR model trained on synthetic data with that of a model trained on real data. Our empirical investigations confirm a substantial WERR, indicating that ASR models trained on synthetic speech perform considerably worse than those trained on real speech, even when the synthetic utterances are subjectively perceived as highly natural. This observation suggests that synthetic speech, while perceptually clean, often lacks the intricate acoustic and prosodic variability crucial for ASR model robustness.
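As an illustration only, the WERR described above can be sketched in a few lines of Python. The abstract does not give the exact formula, so this sketch assumes the ratio is the word error rate (WER) of the synthetic-trained model divided by that of the real-trained model, both evaluated on the same real test speech. The function names `word_error_rate` and `werr` are hypothetical and are not taken from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def werr(wer_synthetic: float, wer_real: float) -> float:
    """Assumed definition: WER after synthetic training over WER after real training."""
    if wer_real <= 0:
        raise ValueError("wer_real must be positive")
    return wer_synthetic / wer_real
```

Under this assumed definition, a value of 1 would mean that synthetic training data is as useful as real data for the ASR model, and larger values indicate a wider synthetic-real gap.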
We explore methodologies to enhance synthetic speech diversity, including explicit conditioning on speaker, prosodic, and environmental attributes, as well as post-generation data augmentation. While these techniques demonstrably reduce the WERR, thereby narrowing the synthetic-real gap, our findings indicate a plateau in performance, suggesting an inherent ceiling for current synthesis paradigms in fully capturing real-world speech complexity. Furthermore, we conduct a comprehensive study on the scaling properties of synthetic data for ASR training, comparing Mean Squared Error (MSE)-based and Denoising Diffusion Probabilistic Model (DDPM)-based TTS architectures. Our results demonstrate that DDPMs exhibit superior scalability and more effectively leverage large training datasets, leading to sustained improvements in ASR performance compared to MSE models, which rapidly plateau due to oversmoothing. However, even with the enhanced scaling of DDPMs, projections indicate that an extraordinarily large volume of synthetic data would be required to achieve parity with ASR models trained on real speech.
To provide a robust and objective evaluation framework for synthetic speech that directly addresses these distributional nuances, we propose the Text-to-Speech Distribution Score (TTSDS). This metric uses the 2-Wasserstein distance to quantify the dissimilarity between real and synthetic speech distributions across perceptually motivated factors, including generic acoustic similarity, speaker realism, prosody, and intelligibility. Through extensive validation against subjective listening test data spanning 2008 to 2024 and diverse domains and languages, TTSDS demonstrates strong and consistent correlations with human judgments. This validation establishes TTSDS as a reliable objective measure capable of predicting human perception and providing interpretable insights into specific areas of improvement for TTS systems. This work shows that while synthetic speech has reached impressive levels of subjective naturalness, it cannot yet accurately replicate the full distributional complexity of human speech.
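To make the distance underlying TTSDS concrete, the following is a minimal sketch of the 2-Wasserstein distance between two sets of one-dimensional feature values, for example a per-utterance prosodic statistic. It is not the TTSDS implementation: the abstract does not specify how features are extracted, normalised, or aggregated across factors, so the function names and the restriction to scalar features are assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev


def w2_empirical_1d(real: list[float], synthetic: list[float]) -> float:
    """2-Wasserstein distance between two equal-sized 1-D samples.

    In one dimension the optimal transport plan pairs the sorted values,
    so the distance is the root mean squared difference of order statistics.
    """
    if not real or len(real) != len(synthetic):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(real), sorted(synthetic))
    return math.sqrt(fmean([(a - b) ** 2 for a, b in pairs]))


def w2_gaussian_1d(real: list[float], synthetic: list[float]) -> float:
    """Closed form when both samples are modelled as 1-D Gaussians.

    W2^2 = (mu_1 - mu_2)^2 + (sigma_1 - sigma_2)^2
    """
    mean_gap = fmean(real) - fmean(synthetic)
    std_gap = pstdev(real) - pstdev(synthetic)
    return math.sqrt(mean_gap ** 2 + std_gap ** 2)
```

Both functions return 0 when the two samples have identical distributions and grow as the synthetic values drift from the real ones in location or spread. This sensitivity to spread is why a distributional distance can register reduced variability that a per-utterance naturalness rating may miss.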