Edinburgh Research Archive

Leveraging deep speaker embedding variability factors for verification and diarization

dc.contributor.advisor
Bell, Peter
dc.contributor.advisor
Renals, Stephen
dc.contributor.author
Luu, Chau Van Quy
dc.date.accessioned
2023-12-21T14:12:52Z
dc.date.available
2023-12-21T14:12:52Z
dc.date.issued
2023-12-21
dc.description.abstract
The tasks of speaker verification (SV, determining whether a test utterance has the same speaker as an enrolment utterance) and speaker diarization (SD, determining ‘who spoke when?’) both fall under the umbrella of speaker recognition. Both SV and SD have become highly applicable in mainstream technology: an example application of SV is a voice assistant that activates only for a specific user, and of SD, colour-coding subtitles according to the speaker that produced them. Both SV and SD systems have found success in recent years by utilising speaker embeddings, vector representations of speaker identity extracted from segments of speech. By comparing the similarity of the embeddings extracted from different utterances, it is possible to distinguish the speaker of each utterance, and this mechanism is what many successful verification and diarization systems are based upon. Deep speaker embeddings are speaker embeddings extracted from an intermediate layer of a neural network, trained in such a way as to encode speaker identity discriminatively in that layer. For example, the training objective is often speaker classification or a variation thereof, and while this has been shown to be a very successful strategy, various sources of information can be encoded into the embedding space. Some of these sources are speaker related and intuitively make up part of speaker identity, such as speaker gender. However, some sources of information, such as channel and recording information, are not explicitly speaker related but can nonetheless be captured during training. In this work, we look at these sources of information and variability, describing them as speaker embedding variability factors, and explore how they interact with and affect the downstream tasks of SV and SD.
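The embedding-comparison mechanism described above can be sketched as a cosine-similarity trial; the threshold value and the toy 3-dimensional vectors below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two speaker embeddings: values near 1
    # suggest the utterances were likely produced by the same speaker.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(enrol_emb, test_emb, threshold=0.5):
    # Verification decision: accept the trial if the enrolment and test
    # embeddings are similar enough; the threshold is tuned on held-out data.
    return cosine_similarity(enrol_emb, test_emb) >= threshold

# Illustrative trial with toy "embeddings".
enrol = np.array([1.0, 0.2, 0.0])
matched = np.array([0.9, 0.3, 0.1])
mismatched = np.array([-0.8, 0.1, 0.6])
```

In diarization the same score can be used to cluster per-segment embeddings, grouping segments whose pairwise similarity is high.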
Specifically, our work looks at the following topics: reducing channel variability, reducing speaker variability distribution mismatch, explicitly encouraging variability for speaker identity related factors for increased robustness, and investigating the contribution of certain speaker attributes to separability. For reducing channel variability, we propose a training regime based on adversarial methods, adding an adversarial loss based on discriminating whether pairs of embeddings come from the same recording. This approach adds channel invariance to the training objective of the embedding network while being dataset agnostic, requiring no additional labels. We show that the induced channel invariance improves verification for out-of-recording pairs of utterances, while also improving diarization performance. In the pursuit of encouraging speaker identity related variability, we propose a multi-task learning framework for leveraging auxiliary labels, adding attribute-related tasks to the overall training objective. In conjunction with the standard speaker classification task, the addition of speaker age and speaker nationality classification tasks was shown to improve verification and diarization performance, particularly when fine-tuning to new domains. We suggest that this improvement is due to the robustness imparted by structuring the embedding space in a way that we already know is speaker identity related, thus decreasing the risk of fitting to other non-identity related factors. We also investigated the contribution of speaker attributes to speaker separability using disentangled representations. Here, we combined the aforementioned multi-task framework with adversarial methods to successfully isolate aspects of speaker identity in specific dimensions of the speaker embedding.
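One way to read the combined objective sketched above is as a weighted sum of task losses, with the channel-discriminator term reversed, as in gradient-reversal-style adversarial training. The weights and function name below are hypothetical illustrations, not the thesis's actual configuration.

```python
def combined_objective(speaker_loss, age_loss, nationality_loss,
                       channel_disc_loss, w_age=0.1, w_nat=0.1, w_adv=0.5):
    # Auxiliary speaker-attribute tasks (age, nationality) are added to the
    # main speaker classification loss; the channel discriminator's loss is
    # subtracted, so lowering the total means making same-recording pairs
    # harder to detect (channel invariance via an adversarial term).
    return (speaker_loss
            + w_age * age_loss
            + w_nat * nationality_loss
            - w_adv * channel_disc_loss)
```

In practice the embedding network minimises this quantity while the channel discriminator separately minimises its own loss, so the two play a minimax game over the embedding space.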
We then ablated these dimensions and found that, for different datasets, different speaker attributes were of varying importance to separability in diarization and verification tasks, with gender a particularly strong factor for in-the-wild celebrity utterances. Furthermore, by looking at the logits of the speaker classification network from which speaker embeddings are extracted, we found that for a group of unseen utterances, the predicted posterior distribution (over training set speakers) was extremely skewed. By implementing a form of iterative fine-tuning on high-probability training set speakers in combination with a form of dropout on the output layer, we showed improvements to verification performance. We suggest that this improvement stems from a distribution mismatch of speakers, relating to the aforementioned speaker attributes. Overall, we explored several different approaches to manipulating the variability factors present in deep speaker embeddings, finding that each approach had merits when applied to specific scenarios. We suggest approaches for future work that build upon the techniques outlined in this thesis, in particular for speaker attribute-related learning and disentanglement.
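A minimal sketch of the posterior-inspection step described above, assuming softmax outputs over training-set speakers; the keep fraction, function names, and array shapes are illustrative assumptions rather than the thesis's exact procedure.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def high_probability_speakers(logits, keep_fraction=0.5):
    # Average the predicted posterior over a batch of unseen utterances,
    # then keep the training-set speakers receiving the most probability
    # mass; these would be candidates for a round of iterative fine-tuning.
    mean_posterior = softmax(logits).mean(axis=0)
    k = max(1, int(keep_fraction * mean_posterior.size))
    return np.argsort(mean_posterior)[::-1][:k]

# Toy batch of two utterances scored against four training speakers;
# speaker 0 dominates the skewed posterior.
logits = np.array([[5.0, 0.0, 0.0, 0.0],
                   [4.0, 0.0, 0.0, 1.0]])
```

If the posterior over training speakers is heavily skewed for new data, restricting fine-tuning to the dominant speakers is one way of compensating for the speaker-distribution mismatch the abstract describes.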
en
dc.identifier.uri
https://hdl.handle.net/1842/41307
dc.identifier.uri
http://dx.doi.org/10.7488/era/4042
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Chau Luu et al. (May 2020a). “Channel Adversarial Training for Speaker Verification and Diarization”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7094–7098. DOI: 10.1109/ICASSP40776.2020.9053323
en
dc.relation.hasversion
Chau Luu et al. (Aug. 30, 2021). “Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization”. In: Interspeech 2021. ISCA, pp. 491–495. DOI: 10.21437/Interspeech.2021-622. URL: https://www.isca-speech.org/archive/interspeech_2021/luu21_interspeech.html
en
dc.relation.hasversion
Chau Luu et al. (Sept. 18, 2022). “Investigating the Contribution of Speaker Attributes to Speaker Separability Using Disentangled Speaker Representations”. In: Interspeech 2022. ISCA, pp. 610–614. DOI: 10.21437/Interspeech.2022-10643. URL: https://www.isca-speech.org/archive/interspeech_2022/luu22_interspeech.html
en
dc.relation.hasversion
Chau Luu et al. (Nov. 1, 2020b). “Dropping Classes for Deep Speaker Representation Learning”. In: The Speaker and Language Recognition Workshop (Odyssey 2020). ISCA, pp. 357–364. DOI: 10.21437/Odyssey.2020-50. URL: https://www.isca-speech.org/archive/odyssey_2020/luu20_odyssey.html
en
dc.subject
speaker verification
en
dc.subject
speaker diarization
en
dc.subject
vector representations
en
dc.subject
deep speaker embeddings
en
dc.subject
neural network
en
dc.subject
channel variability
en
dc.subject
speaker classification network
en
dc.subject
variability factors
en
dc.title
Leveraging deep speaker embedding variability factors for verification and diarization
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Luu2023.pdf
Size:
7.4 MB
Format:
Adobe Portable Document Format