Learning representations for speech recognition using artificial neural networks
dc.contributor.advisor
Renals, Stephen
en
dc.contributor.advisor
Bell, Peter
en
dc.contributor.author
Swietojanski, Paweł
en
dc.contributor.sponsor
Engineering and Physical Sciences Research Council (EPSRC)
en
dc.date.accessioned
2017-07-17T13:09:27Z
dc.date.available
2017-07-17T13:09:27Z
dc.date.issued
2016-11-29
dc.description.abstract
Learning representations is a central challenge in machine learning. For speech
recognition, we are interested in learning robust representations that are stable
across different acoustic environments, recording equipment, and irrelevant
inter- and intra-speaker variability. This thesis is concerned with representation
learning for acoustic model adaptation to speakers and environments, construction
of acoustic models in low-resource settings, and learning representations from
multiple acoustic channels. The investigations primarily focus on the hybrid
approach to acoustic modelling based on hidden Markov models and artificial
neural networks (ANNs).
The first contribution concerns acoustic model adaptation. It comprises
two new adaptation transforms operating in the space of ANN parameters. Both operate
at the level of activation functions and treat a trained ANN acoustic model as
a canonical set of fixed-basis functions, from which one can later derive variants
tailored to the specific distribution present in the adaptation data. The first
technique, termed Learning Hidden Unit Contributions (LHUC), learns
distribution-dependent linear combination coefficients for hidden units. This
technique is then extended to altering groups of hidden units with parametric and
differentiable pooling operators. We find that the proposed adaptation techniques
possess several desirable properties: they are relatively low-dimensional, do not
overfit, and can work in both a supervised and an unsupervised manner. For LHUC we
also present extensions to speaker adaptive training and environment factorisation.
On average, depending on the characteristics of the test set, relative reductions
in word error rate (WER) of 5-25% are obtained in an unsupervised two-pass
adaptation setting.
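To make the adaptation transforms concrete, below is a minimal NumPy sketch of the LHUC re-parameterisation and of an L_p pooling operator, assuming a 2*sigmoid amplitude function for LHUC (one common choice); the names and dimensions are illustrative, not the thesis code.

    import numpy as np

    def lhuc(h, r):
        # Scale each hidden unit's activation by a speaker-dependent
        # amplitude a(r) = 2*sigmoid(r), constrained to (0, 2);
        # r = 0 recovers the unadapted, speaker-independent model.
        return (2.0 / (1.0 + np.exp(-r))) * h

    def lp_pool(h_group, p):
        # Differentiable L_p pooling over a group of hidden units;
        # the order p is a learnable, distribution-dependent parameter
        # (the pooling extension mentioned above).
        return np.mean(np.abs(h_group) ** p) ** (1.0 / p)

    h = np.tanh(np.random.randn(6))  # activations of a trained layer
    r = np.zeros_like(h)             # adaptation parameters, one per unit
    adapted = lhuc(h, r)             # equals h at initialisation

During adaptation only r (and, in the pooling variant, p) would be updated, which is what keeps the transforms low-dimensional.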
The second contribution concerns building acoustic models in low-resource
data scenarios. In particular, we address the case in which the amount of
transcribed acoustic material for estimating acoustic models in the target language
is insufficient, while resources such as lexicons and texts for estimating language
models are assumed to be available. First, we propose an ANN with a structured
output layer which models both context-dependent and context-independent speech
units, with the context-independent predictions used at runtime to aid the
prediction of context-dependent states (see the sketch after this paragraph).
We also propose to perform multi-task adaptation with a structured output layer.
We obtain consistent relative WER reductions of up to 6.4% in low-resource
speaker-independent acoustic modelling. Adapting those models with LHUC in a
multi-task manner reduces WER by a further 13.6% relative, compared with 12.7%
for non-multi-task LHUC. We then demonstrate that one can build better acoustic
models with unsupervised multi- and cross-lingual initialisation, and find that
pre-training is largely language-independent. Relative WER reductions of up to
14.4% are observed, depending on the amount of transcribed acoustic data
available in the target language.
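The sketch below illustrates the structured output layer under the simplifying assumption that the context-independent (CI) posteriors enter the context-dependent (CD) output through a linear connection (W_link is a hypothetical name, not necessarily the thesis parameterisation).

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def structured_output(h, W_ci, W_cd, W_link):
        # CI (e.g. monophone) posteriors are predicted first...
        y_ci = softmax(W_ci @ h)
        # ...then combined with the shared hidden representation
        # to aid the prediction of CD states.
        y_cd = softmax(W_cd @ h + W_link @ y_ci)
        return y_ci, y_cd

    h = np.random.randn(32)             # shared hidden representation
    y_ci, y_cd = structured_output(
        h,
        np.random.randn(40, 32),        # CI output layer (40 units)
        np.random.randn(2000, 32),      # CD output layer (2000 tied states)
        np.random.randn(2000, 40))      # hypothetical CI-to-CD connection

In the multi-task setting, both the CI and CD outputs contribute to the training (and adaptation) objective.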
The third contribution concerns building acoustic models from multi-channel
acoustic data. We investigate various ways of integrating and learning
multi-channel representations, in particular channel concatenation and the
applicability of convolutional layers. We propose a multi-channel convolutional
layer with cross-channel pooling, which can be seen as a data-driven,
non-parametric auditory attention mechanism. We find that for unconstrained
microphone arrays, our approach matches the performance of comparable models
trained on beamformer-enhanced signals.
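A minimal sketch of the multi-channel convolution with cross-channel pooling, assuming a 1-D filter bank shared across microphones (NumPy; names and sizes illustrative):

    import numpy as np

    def multichannel_conv_pool(channels, filters):
        # Convolve every microphone channel with a shared filter bank,
        # then max-pool across channels: each filter response is taken
        # from whichever channel responds most strongly, acting as a
        # data-driven, non-parametric attention over channels.
        acts = np.stack([[np.convolve(c, f, mode='valid') for f in filters]
                         for c in channels])  # (channels, filters, time)
        return acts.max(axis=0)               # cross-channel max pooling

    x = [np.random.randn(100) for _ in range(4)]  # 4 microphone channels
    fb = np.random.randn(8, 11)                   # 8 filters of width 11
    features = multichannel_conv_pool(x, fb)      # shape (8, 90)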
en
dc.identifier.uri
http://hdl.handle.net/1842/22835
dc.language.iso
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
P. Swietojanski and S. Renals. Differentiable Pooling for Unsupervised Acoustic Model Adaptation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2016
en
dc.relation.hasversion
P. Swietojanski, J. Li, and S. Renals. Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 2016
en
dc.relation.hasversion
P. Swietojanski and S. Renals. SAT-LHUC: Speaker Adaptive Training for Learning Hidden Unit Contributions. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016
en
dc.relation.hasversion
P. Swietojanski, P. Bell, and S. Renals. Structured output layer with auxiliary targets for context-dependent acoustic modelling. In Proc. ISCA Interspeech, Dresden, Germany, 2015
en
dc.relation.hasversion
P. Swietojanski and S. Renals. Differentiable pooling for unsupervised speaker adaptation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 2015
en
dc.relation.hasversion
P. Swietojanski and S. Renals. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models. In Proc. IEEE Spoken Language Technology Workshop (SLT), Lake Tahoe, USA, 2014
en
dc.relation.hasversion
P. Swietojanski, A. Ghoshal, and S. Renals. Convolutional neural networks for distant speech recognition. IEEE Signal Processing Letters, 21(9):1120-1124, September 2014
en
dc.relation.hasversion
S. Renals and P. Swietojanski. Neural networks for distant speech recognition. In The 4th Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA), Nancy, France, 2014
en
dc.relation.hasversion
P. Swietojanski, A. Ghoshal, and S. Renals. Hybrid acoustic models for distant and multichannel large vocabulary speech recognition. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Olomouc, Czech Republic, 2013
en
dc.relation.hasversion
P. Swietojanski, A. Ghoshal, and S. Renals. Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR. In Proc. IEEE Spoken Language Technology Workshop (SLT), pages 246-251, Miami, Florida, USA, 2012
en
dc.relation.hasversion
P. Swietojanski, J.-T. Huang, and J. Li. Investigation of maxout networks for speech recognition. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014
en
dc.relation.hasversion
P. Swietojanski, A. Ghoshal, and S. Renals. Revisiting hybrid and GMM-HMM system combination techniques. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013
en
dc.relation.hasversion
P. Bell, P. Swietojanski, and S. Renals. Multi-level adaptive networks in tandem and hybrid ASR systems. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013
en
dc.relation.hasversion
A. Ghoshal, P. Swietojanski, and S. Renals. Multilingual training of deep neural networks. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013
en
dc.relation.hasversion
Z. Wu, P. Swietojanski, C. Veaux, S. Renals, and S. King. A study of speaker adaptation for DNN-based speech synthesis. In Proc. ISCA Interspeech, Dresden, Germany, 2015
en
dc.relation.hasversion
P. Bell, P. Swietojanski, J. Driesen, M. Sinclair, F. McInnes, and S. Renals. The UEDIN ASR systems for the IWSLT 2014 evaluation. In Proc. International Workshop on Spoken Language Translation (IWSLT), South Lake Tahoe, USA, 2014
en
dc.relation.hasversion
P. Bell, H. Yamamoto, P. Swietojanski, Y. Wu, F. McInnes, C. Hori, and S. Renals. A lecture transcription system combining neural network acoustic and language models. In Proc. ISCA Interspeech, Lyon, France, 2013
en
dc.relation.hasversion
H. Christensen, M. Aniol, P. Bell, P. Green, T. Hain, S. King, and P. Swietojanski. Combining in-domain and out-of-domain speech data for automatic recognition of disordered speech. In Proc. ISCA Interspeech, Lyon, France, 2013
en
dc.relation.hasversion
P. Lanchantin, P. Bell, M. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. Seigel, P. Swietojanski, and P. Woodland. Automatic transcription of multi-genre media archives. In Proc. Workshop on Speech, Language and Audio in Multimedia, Marseille, France, 2013
en
dc.relation.hasversion
P. Bell, M. Gales, P. Lanchantin, X. Liu, Y. Long, S. Renals, P. Swietojanski, and P. Woodland. Transcription of multi-genre media archives using out-of-domain data. In Proc. IEEE Spoken Language Technology Workshop (SLT), pages 324-329, Miami, Florida, USA, 2012
en
dc.relation.hasversion
E. Hasler, P. Bell, A. Ghoshal, B. Haddow, P. Koehn, F. McInnes, S. Renals, and P. Swietojanski. The UEDIN system for the IWSLT 2012 evaluation. In Proc. International Workshop on Spoken Language Translation (IWSLT), Hong Kong, China, 2012
en
dc.subject
automatic speech recognition
en
dc.subject
deep neural networks
en
dc.subject
adaptation
en
dc.subject
distant ASR
en
dc.title
Learning representations for speech recognition using artificial neural networks
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name:
- Swietojanski2016.pdf
- Size:
- 3.41 MB
- Format:
- Adobe Portable Document Format