Effective attention-based sequence-to-sequence modelling for automatic speech recognition
dc.contributor.advisor
Renals, Stephen
dc.contributor.advisor
Bell, Peter
dc.contributor.advisor
Loweimi, Erfan
dc.contributor.author
Zhang, Shucong
dc.contributor.sponsor
other
en
dc.date.accessioned
2022-03-15T10:06:30Z
dc.date.available
2022-03-15T10:06:30Z
dc.date.issued
2022-03-14
dc.description.abstract
With sufficient training data, attentional encoder-decoder models have achieved outstanding ASR results. In such models, the encoder encodes the input sequence into a sequence of hidden representations. The attention mechanism generates a soft alignment between the encoder hidden states and the decoder hidden states, and the decoder produces the current output by considering this alignment and the previous outputs.
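For illustration, a minimal sketch of this soft alignment, assuming dot-product attention and PyTorch tensors (the function and variable names are illustrative, not taken from the thesis):

    import torch

    def soft_alignment(decoder_state, encoder_states):
        # decoder_state: (batch, d_model); encoder_states: (batch, T, d_model)
        # Score the current decoder state against every encoder hidden state.
        scores = torch.einsum("bd,btd->bt", decoder_state, encoder_states)
        # Softmax over time gives a soft alignment across the input frames.
        alignment = torch.softmax(scores, dim=-1)
        # The context vector is the alignment-weighted sum of encoder states.
        context = torch.einsum("bt,btd->bd", alignment, encoder_states)
        return alignment, context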
However, attentional encoder-decoder models were originally designed for machine translation tasks, where the input and output sequences are relatively short and the alignments between them are flexible. For ASR tasks, the input sequences are notably long. Further, acoustic frames (or their hidden representations) can typically be aligned with output units in a left-to-right order, and the duration of each output unit is usually small compared to the length of the entire utterance. Conventional encoder-decoder models have difficulty modelling long sequences, and the attention mechanism does not guarantee monotonic left-to-right alignments.
In this thesis, we study attention-based sequence-to-sequence ASR models and address the aforementioned issues. We investigate recurrent neural network (RNN) encoder-decoder models and self-attention encoder-decoder models. For RNN encoder-decoder models, we develop a dynamic subsampling RNN (dsRNN) encoder to shorten the input sequences. The dsRNN learns to skip redundant frames, and the skip ratio may vary at different stages of training, allowing the encoder to learn the most relevant information at each epoch. The dsRNN therefore alleviates the difficulty of encoding long sequences. We also propose a fully trainable windowed attention mechanism, in which both the window shift and the window length are learned by the model. Our windowed method forces the attention mechanism to attend to inputs within small sliding windows in a strict left-to-right order. The proposed dsRNN and windowed attention give significant performance gains over traditional encoder-decoder ASR models.
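As a rough sketch of the windowed-attention idea, assuming PyTorch and a soft Gaussian window (an illustration of the principle, not the exact parameterisation used in the thesis), the window centre advances left-to-right by a learned shift and the window width is a learned length:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WindowedAttention(nn.Module):
        # Illustrative sketch: the window shift and length are predicted from the
        # decoder state, and attention is restricted to a soft window that moves
        # monotonically left-to-right over the encoder states.
        def __init__(self, d_model):
            super().__init__()
            self.shift_proj = nn.Linear(d_model, 1)   # how far the window moves
            self.length_proj = nn.Linear(d_model, 1)  # how wide the window is

        def forward(self, decoder_state, encoder_states, prev_centre):
            # decoder_state: (batch, d); encoder_states: (batch, T, d); prev_centre: (batch,)
            shift = F.softplus(self.shift_proj(decoder_state)).squeeze(-1)
            length = F.softplus(self.length_proj(decoder_state)).squeeze(-1) + 1.0
            centre = prev_centre + shift  # non-negative shift keeps the alignment monotonic
            positions = torch.arange(encoder_states.size(1), device=encoder_states.device, dtype=torch.float32)
            # Soft Gaussian window over frame positions, centred at `centre`.
            window = torch.exp(-0.5 * ((positions.unsqueeze(0) - centre.unsqueeze(1)) / length.unsqueeze(1)) ** 2)
            scores = torch.einsum("bd,btd->bt", decoder_state, encoder_states)
            weights = torch.softmax(scores, dim=-1) * window
            weights = weights / weights.sum(dim=-1, keepdim=True)
            context = torch.einsum("bt,btd->bd", weights, encoder_states)
            return context, centre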
We next study self-attention encoder-decoder models. For RNN encoder-decoder models, we have shown that restricting attention to small windows is beneficial. However, self-attention encodes an input sequence by comparing each element of the sequence with all other elements. We therefore investigate whether the global view of self-attention is necessary for ASR. We note that the range of the learned context increases from the lower to the upper self-attention layers, and suggest that the upper encoder layers may already have seen sufficient contextual information without the need for self-attention. This would imply that the upper self-attention layers can be replaced with feed-forward layers (the feed-forward layers can be viewed as strictly local, left-to-right self-attention). In practice, we observe that replacing the upper encoder self-attention layers with feed-forward layers does not harm performance. We also observe that there are individual attention heads that attend only to local information, so the self-attention mechanism is redundant for these heads. Based on these observations, we propose randomly removing attention heads during training while keeping all heads at test time. The proposed method achieves state-of-the-art ASR results on benchmark datasets covering different ASR scenarios.
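A minimal sketch of the head-removal idea, assuming PyTorch and per-head attention outputs (the rescaling and exact schedule used in the thesis may differ; names are illustrative):

    import torch

    def drop_attention_heads(head_outputs, drop_prob, training):
        # head_outputs: (batch, n_heads, T, d_head), the per-head attention outputs.
        # During training each head is removed independently with probability
        # drop_prob; at test time all heads are kept.
        if not training or drop_prob == 0.0:
            return head_outputs
        batch, n_heads = head_outputs.shape[0], head_outputs.shape[1]
        keep = torch.rand(batch, n_heads, 1, 1, device=head_outputs.device) > drop_prob
        return head_outputs * keep.to(head_outputs.dtype)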
Finally, we investigate top-down level-wise training of sequence-to-sequence ASR models. We find that when training sequence-to-sequence ASR models on noisy data, the use of upper layers trained on clean data forces the lower layers to learn noise-invariant features, since the features that fit the clean-trained upper layers are more general. We further show that, within the same dataset, conventional joint training makes the upper layers overfit quickly. We therefore propose to freeze the upper layers and retrain the lower layers. The proposed method is a general training strategy; we use it not only to train ASR models but also to train neural networks in other domains. The proposed training method yields consistent performance gains across different tasks (e.g., language modelling, image classification).
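A minimal sketch of the freeze-upper / retrain-lower step, assuming a PyTorch module whose lower layers are referenced by attribute name (the attribute names and optimiser choice here are hypothetical):

    import torch

    def retrain_lower_layers(model, lower_layer_names, lr=1e-4):
        # Freeze every parameter, then unfreeze only the named lower layers, so the
        # already-trained upper layers constrain what the lower layers can learn.
        for p in model.parameters():
            p.requires_grad = False
        for name in lower_layer_names:
            for p in getattr(model, name).parameters():
                p.requires_grad = True
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=lr)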
In summary, we propose methods that enable attention-based sequence-to-sequence ASR systems to better model sequential data, and we demonstrate the benefits of training neural networks in a top-down, cascaded manner.
en
dc.identifier.uri
https://hdl.handle.net/1842/38711
dc.identifier.uri
http://dx.doi.org/10.7488/era/1967
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Zhang, S., Loweimi, E., Bell, P., and Renals, S. (2019). Windowed attention mechanisms for speech recognition. In ICASSP, pages 7100–7104.
en
dc.relation.hasversion
Zhang, S., Loweimi, E., Bell, P., and Renals, S. (2021). On the usefulness of self-attention for automatic speech recognition with transformers. In IEEE SLT, pages 89–96.
en
dc.subject
automatic speech recognition
en
dc.subject
ASR systems
en
dc.subject
sequence-to-sequence ASR models
en
dc.subject
neural network training
en
dc.subject
recurrent neural network
en
dc.subject
RNN encoder-decoder models
en
dc.title
Effective attention-based sequence-to-sequence modelling for automatic speech recognition
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Zhang2021.pdf
- Size: 2.62 MB
- Format: Adobe Portable Document Format