Recurrent neural network language models for automatic speech recognition
Date: 07/07/2017
Author: Gangireddy, Siva Reddy
Abstract
The goal of this thesis is to advance the use of recurrent neural network language models
(RNNLMs) for large vocabulary continuous speech recognition (LVCSR). RNNLMs
are currently state of the art and have been shown to consistently reduce the word error rates
(WERs) of LVCSR tasks compared to other language models. In this thesis we
propose several advances to RNNLMs: improved learning procedures, context
enhancement, and adaptation. We learned better parameters through a novel pre-training
approach and enhanced the context using prosody and syntactic features.
We present a pre-training method for RNNLMs, in which the output weights of a
feed-forward neural network language model (NNLM) are shared with the RNNLM.
This is accomplished by first fine-tuning the weights of the NNLM, which are then
used to initialise the output weights of an RNNLM with the same number of hidden
units. To investigate the effectiveness of the proposed pre-training method, we have
carried out text-based experiments on the Penn Treebank Wall Street Journal data, and
ASR experiments on the TED lectures data. Across the experiments, we observe small
but significant improvements in perplexity (PPL) and ASR WER.
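To make the weight-sharing step concrete, the following is a minimal sketch in PyTorch, assuming illustrative model sizes and a simple recurrent layer; it is not the thesis implementation. The feed-forward NNLM is trained first, and its hidden-to-output weights are then copied into an RNNLM with the same number of hidden units before RNNLM training starts.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN, CONTEXT = 10000, 128, 256, 3  # illustrative sizes (assumed)

class FeedForwardNNLM(nn.Module):
    """n-gram style NNLM: fixed context window -> hidden layer -> softmax."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.hidden = nn.Linear(CONTEXT * EMB, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)      # output weights to be shared

    def forward(self, ctx):                      # ctx: (batch, CONTEXT) word ids
        h = torch.tanh(self.hidden(self.emb(ctx).flatten(1)))
        return self.out(h)                       # (batch, VOCAB) logits

class RNNLM(nn.Module):
    """Simple recurrent LM with the same hidden size as the NNLM."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.RNN(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, words):                    # words: (batch, time) word ids
        h, _ = self.rnn(self.emb(words))
        return self.out(h)                       # (batch, time, VOCAB) logits

nnlm, rnnlm = FeedForwardNNLM(), RNNLM()
# ... train the NNLM on the text corpus here ...
with torch.no_grad():                            # copy NNLM output weights into the RNNLM
    rnnlm.out.weight.copy_(nnlm.out.weight)
    rnnlm.out.bias.copy_(nnlm.out.bias)
# The RNNLM is then trained as usual, starting from the pre-trained output weights.
```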
Next, we present unsupervised adaptation of RNNLMs. We adapted the RNNLMs
to a target domain (topic, genre, or television programme (show)) at test time, using
ASR transcripts from first-pass recognition. We investigated two approaches to adapting
the RNNLMs. In the first approach, the forward-propagating hidden activations are
scaled using learning hidden unit contributions (LHUC). In the second approach, we adapt
all parameters of the RNNLM. We evaluated the adapted RNNLMs by reporting the WERs
on multi-genre broadcast speech data. We observe small (on average, 0.1% absolute)
but significant improvements in WER compared to a strong unadapted RNNLM.
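For concreteness, here is a minimal sketch of LHUC-style adaptation in PyTorch; the sigmoid parameterisation of the scales, the model sizes, and the optimiser settings are assumptions for illustration, not the thesis configuration. Each hidden unit's activation is multiplied by a learnable scale, and only those scales are updated on the first-pass ASR transcripts while the remaining RNNLM parameters stay frozen.

```python
import torch
import torch.nn as nn

class LHUCRNNLM(nn.Module):
    """RNNLM with a learnable per-unit scale (LHUC) on the hidden activations."""
    def __init__(self, vocab=10000, emb=128, hidden=256):  # illustrative sizes (assumed)
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.RNN(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)
        self.lhuc = nn.Parameter(torch.zeros(hidden))  # one scale per hidden unit

    def forward(self, words):                          # words: (batch, time) word ids
        h, _ = self.rnn(self.emb(words))               # (batch, time, hidden)
        h = h * (2.0 * torch.sigmoid(self.lhuc))       # re-scale each unit; scale in (0, 2)
        return self.out(h)

def adapt_lhuc(model, adaptation_batches, lr=0.1, epochs=3):
    """Unsupervised adaptation: update only the LHUC scales on first-pass transcripts."""
    for p in model.parameters():
        p.requires_grad_(False)                        # freeze the whole model ...
    model.lhuc.requires_grad_(True)                    # ... except the LHUC scales
    opt = torch.optim.SGD([model.lhuc], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for words, targets in adaptation_batches:      # pairs built from first-pass ASR output
            opt.zero_grad()
            logits = model(words)                      # (batch, time, vocab)
            loss = loss_fn(logits.flatten(0, 1), targets.flatten())
            loss.backward()
            opt.step()
```

Adapting all parameters, the second approach described above, corresponds to dropping the freezing step and passing model.parameters() to the optimiser instead of just the LHUC scales.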
Finally, we present the context enhancement of RNNLMs using prosody and syntactic
features. The prosody features were computed from the acoustics of the context
words, and the syntactic features were derived from the surface form of the words in the context.
We trained the RNNLMs with word duration, pause duration, final phone duration, syllable
duration, syllable F0, part-of-speech tag and Combinatory Categorial Grammar
(CCG) supertag features. The proposed context-enhanced RNNLMs were evaluated
by reporting PPL and WER on two speech recognition tasks, Switchboard and TED
lectures. We observed substantial improvements in PPL (5% to 15% relative) and small
but significant improvements in WER (0.1% to 0.5% absolute).
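As an illustration of how such per-word features can be fed to the model, the sketch below (in PyTorch, with an assumed feature dimensionality and wiring; not the thesis implementation) concatenates a prosody/syntactic feature vector to the word embedding at each time step before the recurrent layer.

```python
import torch
import torch.nn as nn

class ContextEnhancedRNNLM(nn.Module):
    """RNNLM whose input is the word embedding concatenated with per-word features."""
    def __init__(self, vocab=10000, emb=128, feat_dim=16, hidden=256):  # assumed sizes
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.RNN(emb + feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words, feats):
        # words: (batch, time) word ids
        # feats: (batch, time, feat_dim), e.g. word/pause/phone/syllable durations,
        #        syllable F0, and encoded POS / CCG supertag features
        x = torch.cat([self.emb(words), feats], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                              # (batch, time, vocab) next-word logits

# Usage with random placeholder inputs, just to show the expected shapes:
model = ContextEnhancedRNNLM()
words = torch.randint(0, 10000, (4, 20))
feats = torch.randn(4, 20, 16)
logits = model(words, feats)                            # (4, 20, 10000)
```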