Edinburgh Research Archive

Leveraging deep speaker embedding variability factors for verification and diarization

Authors

Luu, Chau Van Quy

Abstract

The tasks of speaker verification (SV, determining whether a test utterance has the same speaker as an enrolment utterance) and speaker diarization (SD, determining ‘who spoke when?’) both fall under the umbrella of speaker recognition tasks. Both SV and SD have become highly applicable in mainstream technology: example applications are a voice assistant that only activates for a specific user (SV) and colour-coding subtitles according to the speaker that produced them (SD). Both SV and SD systems have found success in recent years by utilising speaker embeddings, vector representations of speaker identity extracted from segments of speech. By comparing the similarity of the embeddings extracted from different utterances, it is possible to distinguish the speaker of each utterance, and this mechanism is what many successful verification and diarization systems are based upon.

Deep speaker embeddings are speaker embeddings extracted from an intermediate layer of a neural network. This network is trained so as to encode speaker identity discriminatively in the desired intermediate layer; for example, the training objective is often speaker classification or a variation thereof. While this has been shown to be a very successful strategy, various sources of information can be encoded into the embedding space. Some of these sources are speaker related and intuitively make up part of speaker identity, such as speaker gender. Others, such as channel and recording information, are not explicitly speaker related but can nonetheless be captured during training. In this work, we examine these sources of information and variability, describing them as speaker embedding variability factors, and explore how they interact with and affect the downstream tasks of SV and SD. Specifically, our work addresses the following topics: reducing channel variability, reducing speaker variability distribution mismatch, explicitly encouraging variability for speaker-identity-related factors for increased robustness, and investigating the contribution of certain speaker attributes to separability.

For reducing channel variability, we propose an adversarial training regime that adds a loss based on discriminating whether pairs of embeddings come from the same recording. This approach adds channel invariance to the training objective of the embedding network while remaining dataset agnostic, requiring no additional labels. We show that the induced channel invariance improves verification for out-of-recording pairs of utterances, while also improving diarization performance.

To encourage speaker-identity-related variability, we propose a multitask learning framework for leveraging auxiliary labels, adding attribute-related tasks to the overall training objective. In conjunction with the standard speaker classification task, the addition of speaker age and speaker nationality classification tasks was shown to improve verification and diarization performance, particularly when fine-tuning to new domains. We suggest this improvement is due to the robustness imparted by structuring the embedding space in a way that we already know is speaker-identity related, thus decreasing the risk of fitting to other non-identity-related factors.
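As a rough illustration of the adversarial channel-invariance idea described above, the following PyTorch-style sketch pairs a standard speaker classification loss with a gradient-reversed discriminator that tries to tell whether two embeddings come from the same recording. The names (`SameRecordingDiscriminator`, `grad_reverse`, `training_step`) and the loss weighting are illustrative assumptions, not the exact formulation used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class SameRecordingDiscriminator(nn.Module):
    """Predicts whether a pair of embeddings was extracted from the same recording."""
    def __init__(self, embed_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb_a, emb_b):
        return self.net(torch.cat([emb_a, emb_b], dim=-1)).squeeze(-1)

def training_step(embed_net, spk_head, disc, feats_a, feats_b,
                  spk_labels, same_rec_labels, lambda_adv=0.1):
    """Speaker classification loss plus an adversarial same-recording loss.

    feats_a / feats_b are paired speech segments; same_rec_labels is 1 when the
    two segments come from the same recording and 0 otherwise.
    """
    emb_a, emb_b = embed_net(feats_a), embed_net(feats_b)

    # Main objective: classify the speaker of the first segment.
    spk_loss = F.cross_entropy(spk_head(emb_a), spk_labels)

    # Adversarial objective: the discriminator learns to detect same-recording pairs,
    # while gradient reversal pushes the embedding network to make them indistinguishable.
    pair_logits = disc(grad_reverse(emb_a, lambda_adv), grad_reverse(emb_b, lambda_adv))
    adv_loss = F.binary_cross_entropy_with_logits(pair_logits, same_rec_labels.float())

    return spk_loss + adv_loss
```

Because the same-recording labels come for free from the data layout, this objective needs no extra annotation, which is what makes the approach dataset agnostic.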
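The multitask framework for leveraging auxiliary attribute labels can be sketched in a similarly hedged way: a shared embedding extractor feeds a main speaker classification head plus auxiliary age and nationality heads, whose losses are combined with small weights. The class and weight names below (`MultiTaskSpeakerModel`, `w_age`, `w_nat`) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSpeakerModel(nn.Module):
    """Shared embedding extractor with a main speaker head and auxiliary attribute heads."""
    def __init__(self, embed_net, embed_dim, n_speakers, n_age_groups, n_nationalities):
        super().__init__()
        self.embed_net = embed_net                      # e.g. an x-vector style encoder
        self.spk_head = nn.Linear(embed_dim, n_speakers)
        self.age_head = nn.Linear(embed_dim, n_age_groups)
        self.nat_head = nn.Linear(embed_dim, n_nationalities)

    def forward(self, feats):
        emb = self.embed_net(feats)
        return emb, self.spk_head(emb), self.age_head(emb), self.nat_head(emb)

def multitask_loss(spk_logits, age_logits, nat_logits, spk_y, age_y, nat_y,
                   w_age=0.1, w_nat=0.1):
    """Weighted sum of the main speaker objective and the auxiliary attribute objectives."""
    return (F.cross_entropy(spk_logits, spk_y)
            + w_age * F.cross_entropy(age_logits, age_y)
            + w_nat * F.cross_entropy(nat_logits, nat_y))
```

At test time only the embedding (and optionally the auxiliary heads for analysis) is used; the auxiliary tasks serve to structure the embedding space during training.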
We also investigated the contribution of speaker attributes to speaker separability using disentangled representations. Here, we combined the aforementioned multitask framework with adversarial methods to successfully isolate aspects of speaker identity in specific dimensions of the speaker embedding. We then ablated these dimensions and found that, for different datasets, different speaker attributes were of varying importance to separability in diarization and verification tasks, with gender being a particularly strong factor for in-the-wild celebrity utterances. Furthermore, by looking at the logits of the speaker classification network from which speaker embeddings are extracted, we found that for a group of unseen utterances, the predicted posterior distribution (over training-set speakers) was extremely skewed. By implementing a form of iterative fine-tuning on high-probability training-set speakers in combination with a form of dropout on the output layer, we showed improvements to verification performance. We suggest that this improvement is due to a distribution mismatch of speakers, relating to the aforementioned speaker attributes.

Overall, we explored several different approaches to manipulating the variability factors present in deep speaker embeddings, finding that each approach had merits when applied to specific scenarios. We suggest approaches for future work that build upon the techniques outlined in this thesis, in particular for speaker attribute-related learning and disentanglement.
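To make the dimension-ablation analysis described above concrete, the sketch below zeroes out embedding dimensions associated with a given attribute (for example gender) and recomputes verification scores, so the drop in separability can be measured. The helper names and the choice of cosine scoring are assumptions; the thesis may use a different scoring back-end.

```python
import torch
import torch.nn.functional as F

def ablate_dims(emb, dims):
    """Zero out the embedding dimensions tied to a given speaker attribute."""
    mask = torch.ones(emb.shape[-1], device=emb.device)
    mask[dims] = 0.0
    return emb * mask

def verification_scores(enrol, test, attr_dims=None):
    """Cosine-similarity verification scores, optionally with attribute dimensions removed."""
    if attr_dims is not None:
        enrol, test = ablate_dims(enrol, attr_dims), ablate_dims(test, attr_dims)
    return F.cosine_similarity(enrol, test, dim=-1)

# Hypothetical usage: compare scores with and without the gender-related dimensions.
# gender_dims = [0, 1, 2, 3]
# baseline = verification_scores(enrol_embs, test_embs)
# ablated  = verification_scores(enrol_embs, test_embs, attr_dims=gender_dims)
```

Comparing error rates computed from the baseline and ablated scores indicates how much a given attribute contributes to separability on a particular dataset.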
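Finally, the observation about skewed posteriors over training-set speakers suggests a procedure along the following lines: accumulate softmax posteriors for unseen data, keep the most strongly predicted training speakers for further fine-tuning, and apply dropout immediately before the output layer. This is a loose sketch under those assumptions; the exact iterative fine-tuning and dropout scheme used in the thesis may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def select_high_probability_speakers(clf_model, unseen_loader, top_k=500):
    """Accumulate softmax posteriors over training-set speakers for unseen utterances
    and return the indices of the most strongly predicted speakers."""
    clf_model.eval()
    totals = None
    with torch.no_grad():
        for feats in unseen_loader:
            probs = F.softmax(clf_model(feats), dim=-1).sum(dim=0)
            totals = probs if totals is None else totals + probs
    return torch.topk(totals, top_k).indices

class DropoutOutputLayer(nn.Module):
    """Speaker classification output layer with dropout applied just before the logits,
    discouraging reliance on any single output unit during fine-tuning."""
    def __init__(self, embed_dim, n_speakers, p=0.5):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.linear = nn.Linear(embed_dim, n_speakers)

    def forward(self, emb):
        return self.linear(self.drop(emb))

# A fine-tuning round could then restrict the speaker classification loss to the
# selected high-probability speakers and repeat the selection, iterating as needed.
```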
