Topology in end-to-end automatic speech recognition based on differentiable weighted finite-state transducer

Zhao, Zeyu

Files

Zhao2025.pdf (1.42 MB)

Date

2025-06-11

Authors

Zhao, Zeyu

Abstract

Traditionally, Automatic Speech Recognition (ASR) systems have been based on Hidden Markov Models (HMMs) combined with Gaussian Mixture Models (GMMs) and Deep Neural Networks (DNNs). The topology, or structure, of the HMM defines how the states are connected, how transitions are made, and how speech feature vectors align with output symbols. Since the 1980s, many discussions and experiments have refined HMM topologies, leading to the widespread adoption of the left-to-right Bakis topology. However, the topology of end-to-end ASR systems has not been studied as extensively. End-to-end (E2E) ASR methods allow us to train a neural network (NN) model from scratch with transcribed data only, which simplifies the training pipeline compared with traditional HMM-based methods. Connectionist Temporal Classification (CTC) is a popular E2E ASR method known for its one-state topology, where each state represents a specific output symbol. Despite its popularity, the unique topology of CTC is often applied without much consideration. This thesis investigates the role of topology in E2E ASR systems, especially in a framework based on the Differentiable Weighted Finite-State Transducer (DWFST). We start by modifying the topology of CTC, which results in a new loss function named MMI-CTC. Our findings show that MMI-CTC improves both overall recognition accuracy, measured by Word Error Rate (WER), and convergence speed, demonstrating the importance of topology in end-to-end ASR systems and its impact on the training process. Building on this result, we explore the role of topology in end-to-end ASR further. The common CTC implementation, based on the forward-backward algorithm, is inflexible and difficult to modify, because creating a new algorithm for each topology is time-consuming and error-prone. To overcome this, we introduce an end-to-end ASR framework based on DWFST.
This framework replaces the forward-backward algorithm, providing a flexible and general platform for experimenting with various topologies. With the DWFST-based framework, we first explore how topology affects recognition accuracy at different output frame rates. We find that different topologies perform differently depending on the output frame rate. The topology determines the number of possible alignments between the input and target sequences. By adjusting the topology, we can control these alignments and improve the model's overall accuracy. We also address how different topologies affect alignment quality, particularly the "peaky" behaviour associated with CTC. Our study shows that modifying the topology can significantly improve alignment quality without affecting WER. This improvement benefits various downstream tasks and highlights the advantages of our proposed topological changes. We further examine whether the topology affects the decoding accuracy of the Viterbi decoder. During training, all possible alignments are considered, but during decoding, the Viterbi decoder searches only for the most likely alignment. We ask whether there is a dominant path in the decoding space that explains why the Viterbi decoder works well. Our research shows that there is no single dominant path, but that the Viterbi decoder still performs accurately under certain conditions, which we discuss in depth in this thesis. Additionally, we investigate whether the choice of topology affects the model's generalisability, that is, its ability to perform well on different datasets even when trained on only one. We believe that topology affects how speech features align with output symbols, and thus influences the acoustic modelling power. Our findings reveal that different topologies lead to different levels of generalisability, mainly through their effect on the acoustic model, given that the internal language model (ILM) is relatively weak in our settings.
We also examine whether an ILM exists in CTC-based E2E ASR systems, a topic on which researchers hold mixed opinions. We design a procedure to test for the existence of ILMs in these systems and find no evidence that a strong ILM exists in a pure CTC-based ASR system. In summary, this thesis demonstrates that topology is a crucial factor in end-to-end ASR systems, and that its impact on the training process, convergence speed, recognition accuracy, generalisability, and alignment quality should not be underestimated.
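
The abstract's points that topology determines the number of possible alignments, and that training sums over all alignments while Viterbi decoding keeps only the best one, can be illustrated with a small sketch. This is not the thesis's DWFST framework or its code: the function names (`ctc_graph`, `blank_free_graph`, `run`) are hypothetical, the blank-free chain is an illustrative assumption rather than a topology confirmed to be studied in the thesis, and plain Python dictionaries stand in for a transducer library.

```python
from functools import reduce
from operator import add

BLANK = "<b>"  # assumed blank symbol, for illustration only


def ctc_graph(labels):
    """Standard CTC chain: blank, y1, blank, y2, ..., blank (labels non-empty)."""
    ext = [BLANK]
    for y in labels:
        ext += [y, BLANK]
    n = len(ext)
    arcs = {s: {s} for s in range(n)}        # self-loop on every state
    for s in range(n - 1):
        arcs[s].add(s + 1)                   # advance to the next state
    for s in range(1, n - 2, 2):
        if ext[s] != ext[s + 2]:             # skip the blank between distinct labels
            arcs[s].add(s + 2)
    return ext, arcs, {0, 1}, {n - 2, n - 1}


def blank_free_graph(labels):
    """Hypothetical chain without blanks: each label occupies one or more frames."""
    n = len(labels)
    arcs = {s: {s} for s in range(n)}
    for s in range(n - 1):
        arcs[s].add(s + 1)
    return list(labels), arcs, {0}, {n - 1}


def run(graph, num_frames, emit, plus):
    """Dynamic programme over the graph.

    emit(t, symbol) gives the weight of emitting `symbol` at frame t.
    plus merges competing paths: `add` sums over alignments, `max` keeps the best.
    """
    ext, arcs, starts, finals = graph
    alpha = {s: emit(0, ext[s]) for s in starts}
    for t in range(1, num_frames):
        nxt = {}
        for s, w in alpha.items():
            for d in arcs[s]:
                cand = w * emit(t, ext[d])
                nxt[d] = plus(nxt[d], cand) if d in nxt else cand
        alpha = nxt
    reached = [alpha[s] for s in sorted(finals) if s in alpha]
    return reduce(plus, reached) if reached else 0


def count_alignments(graph, num_frames):
    """Number of valid alignments: every emission has weight 1, paths are summed."""
    return run(graph, num_frames, lambda t, sym: 1, add)


if __name__ == "__main__":
    target, frames = "ab", 4
    print("CTC alignments:       ", count_alignments(ctc_graph(target), frames))
    print("blank-free alignments:", count_alignments(blank_free_graph(target), frames))

    # Uniform posteriors over {a, b, blank}: compare all-path and best-path scores.
    uniform = lambda t, sym: 1.0 / 3.0
    print("sum over alignments:", run(ctc_graph(target), frames, uniform, add))
    print("best alignment:     ", run(ctc_graph(target), frames, uniform, max))
```

Swapping `add` for `max` is the only difference between the all-alignment score and the best-path score here, which is the same separation of graph and semiring that makes a transducer-based formulation convenient for trying different topologies.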

URI

https://hdl.handle.net/1842/43546
http://dx.doi.org/10.7488/era/6080

This item appears in the following Collection(s)

Informatics thesis and dissertation collection