Show simple item record

dc.contributor.advisor: Renals, Stephen
dc.contributor.advisor: Bell, Peter
dc.contributor.author: Klejch, Ondrej
dc.date.accessioned: 2021-09-17T11:01:56Z
dc.date.available: 2021-09-17T11:01:56Z
dc.date.issued: 2020-11-30
dc.identifier.uri: https://hdl.handle.net/1842/38068
dc.identifier.uri: http://dx.doi.org/10.7488/era/1339
dc.description.abstract: The performance of automatic speech recognition systems degrades rapidly when there is a mismatch between training and testing conditions. One way to compensate for this mismatch is to adapt an acoustic model to test conditions, for example by performing speaker adaptation. In this thesis we focus on the discriminative model-based speaker adaptation approach. The success of this approach relies on having a robust speaker adaptation procedure: we need to specify which parameters should be adapted and how they should be adapted. Unfortunately, tuning the speaker adaptation procedure requires considerable manual effort. In this thesis we propose to formulate speaker adaptation as a meta-learning task. In meta-learning, learning occurs on two levels: a learner learns a task-specific model and a meta-learner learns how to train these task-specific models. In our case, the learner is a speaker-dependent model and the meta-learner learns to adapt a speaker-independent model into the speaker-dependent model. With this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In our experiments, we demonstrate that the meta-learning approach learns adaptation schedules competitive with adaptation procedures using handcrafted hyperparameters. Subsequently, we show that speaker adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker-dependent parameters for each speaker during training, we embed the gradient-based adaptation directly into the training of the acoustic model. We hypothesise that this formulation should steer the training of the acoustic model towards parameters better suited for test-time speaker adaptation.
We experimentally compare our approach with test-only adaptation of a standard baseline model and with SAT-LHUC, which represents a traditional speaker adaptive training method. We show that the meta-learning speaker-adaptive training approach achieves results comparable to SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC outperforms the baseline approach after adaptation. Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these experiments we explore multiple factors, such as the neural network architecture, normalisation technique, activation function, and optimiser. We find that SAT-LHUC interferes with batch normalisation, and that it benefits from an increased hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too; therefore, to obtain the best model it is still favourable to train a speaker-independent model with batch normalisation. As such, an effective way of training state-of-the-art SAT-LHUC models remains an open question. Finally, we show that the performance of unsupervised speaker adaptation can be further improved by using discriminative adaptation with lattices obtained from a first-pass decoding as supervision, instead of the traditionally used one-best-path transcriptions. We find that this proposed approach enables many more parameters to be adapted without overfitting being observed, and is successful even when the initial transcription has a WER in excess of 50%.
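The gradient-based meta-learning formulation described in the abstract (a learner adapts a speaker-independent model into a speaker-dependent one; a meta-learner optimises the initialisation so that this adaptation works well) can be illustrated with a toy first-order sketch. This is not the thesis's actual acoustic-model implementation; the scalar linear model, squared loss, and all names below are hypothetical, purely for illustration.

```python
# Toy first-order MAML-style sketch of gradient-based speaker adaptation.
# A scalar linear "acoustic model" y_hat = w * x stands in for the real
# network; each "speaker" is a dataset with a different target scale.
import numpy as np

def loss(w, x, y):
    # Mean squared error of the scalar model.
    return float(np.mean((w * x - y) ** 2))

def grad(w, x, y):
    # Analytic gradient of the squared error w.r.t. w.
    return float(np.mean(2 * (w * x - y) * x))

def adapt(w, x, y, lr=0.1, steps=3):
    # Inner loop (the "learner"): a few gradient steps turn the
    # speaker-independent weight into a speaker-dependent weight.
    for _ in range(steps):
        w = w - lr * grad(w, x, y)
    return w

def meta_train(speakers, w=0.0, meta_lr=0.05, epochs=50):
    # Outer loop (the "meta-learner"): update the speaker-independent
    # initialisation so that it adapts well, using a first-order
    # approximation (gradient evaluated at the adapted weight).
    for _ in range(epochs):
        for x, y in speakers:
            w_sd = adapt(w, x, y)
            w = w - meta_lr * grad(w_sd, x, y)
    return w

# Two synthetic "speakers" whose targets are scaled versions of the input.
rng = np.random.default_rng(0)
x = rng.normal(size=32)
speakers = [(x, 1.5 * x), (x, 2.5 * x)]

w0 = meta_train(speakers)
# After meta-training, a few adaptation steps reduce each speaker's loss.
for x_s, y_s in speakers:
    assert loss(adapt(w0, x_s, y_s), x_s, y_s) < loss(w0, x_s, y_s)
```

The sketch uses a first-order approximation for simplicity; the full formulation differentiates through the inner adaptation steps, which is what lets the meta-learner shape both which parameters are adapted and how.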
dc.contributor.sponsor: European Commission
dc.language.iso: en
dc.publisher: The University of Edinburgh
dc.relation.hasversion: Fainberg, J., Klejch, O., Loweimi, E., Bell, P., and Renals, S. (2019). Acoustic model adaptation from raw waveforms with SincNet. In ASRU.
dc.relation.hasversion: Fainberg, J., Klejch, O., Renals, S., and Bell, P. (2019). Lattice-based lightly-supervised acoustic model training. In Interspeech.
dc.relation.hasversion: Klejch, O., Bell, P., and Renals, S. (2016). Punctuated transcription of multi-genre broadcasts using acoustic and lexical approaches. In SLT.
dc.relation.hasversion: Klejch, O., Bell, P., and Renals, S. (2017). Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features. In ICASSP.
dc.relation.hasversion: Klejch, O., Fainberg, J., and Bell, P. (2018). Learning to adapt: a meta-learning approach for speaker adaptation. In Interspeech.
dc.relation.hasversion: Klejch, O., Fainberg, J., Bell, P., and Renals, S. (2019). Lattice-based unsupervised test-time adaptation of neural network acoustic models. arXiv preprint arXiv:1906.11521.
dc.relation.hasversion: Klejch, O., Fainberg, J., Bell, P., and Renals, S. (2019). Speaker adaptive training using model agnostic meta-learning. In ASRU.
dc.relation.hasversion: Liepins, R., Germann, U., Barzdins, G., Birch, A., Renals, S., Weber, S., van der Kreeft, P., Bourlard, H., Prieto, J., Klejch, O., et al. (2017). The SUMMA platform prototype. In Software Demonstrations ACL.
dc.relation.hasversion: Roth, J., Chaudhuri, S., Klejch, O., Marvin, R., Gallagher, A., et al. (2019). AVA ActiveSpeaker: An audio-visual dataset for active speaker detection. arXiv preprint arXiv:1901.01342.
dc.relation.hasversion: Tsunoo, E., Klejch, O., Bell, P., and Renals, S. (2017). Hierarchical recurrent neural network for story segmentation using fusion of lexical and acoustic features. In ASRU.
dc.subject: automatic speech recognition
dc.subject: speaker adaptation
dc.subject: meta-learning
dc.title: Learning to adapt: meta-learning approaches for speaker adaptation
dc.type: Thesis or Dissertation
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: PhD Doctor of Philosophy


