Leveraging deep speaker embedding variability factors for verification and diarization
Authors
Luu, Chau Van Quy
Abstract
The tasks of speaker verification (SV, determining whether a test utterance has the same
speaker as an enrolment utterance) and speaker diarization (SD, determining ‘who
spoke when?’) both fall under the umbrella of speaker recognition tasks. Both SV
and SD have become widely applicable in mainstream technology. Example applications include a voice assistant that activates only for a specific user (SV) and colour-coding subtitles according to the speaker who produced them (SD).
Both SV and SD systems have found success in recent years by utilising speaker embeddings, vector representations of speaker identity extracted from segments of speech.
By comparing the similarity of the embeddings extracted from different utterances, it
is possible to distinguish the speaker of each utterance, and this mechanism is what
many successful verification and diarization systems are based upon.
Deep speaker embeddings are speaker embeddings extracted from an intermediate layer of a neural network that is trained to encode speaker identity discriminatively at that layer. The training objective is often speaker classification or a variation thereof. While this has proven a very successful strategy, various other sources of information can also be encoded into the embedding space. Some of these sources are speaker related and intuitively form part of speaker identity, such as speaker gender. However, other sources of information, such as channel and recording conditions, are not explicitly speaker related, but can nonetheless be captured during training.
In this work, we look at these sources of information and variability, describing them as speaker embedding variability factors, and explore how they interact with and affect the downstream tasks of SV and SD. Specifically, our work addresses the following topics: reducing channel variability, reducing speaker variability distribution mismatch, explicitly encouraging variability for speaker identity related factors for increased robustness, and investigating the contribution of certain speaker attributes to separability.
For reducing channel variability, we propose a training regime based on adversarial
methods that adds an adversarial loss based on discriminating whether pairs of embeddings come from the same recording. This approach adds channel invariance to the
training objective of the embedding network while remaining dataset agnostic, requiring no additional labels. We show that the induced channel invariance improves verification for out-of-recording pairs of utterances, while also improving diarization performance.
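As a rough illustration of the same-recording discrimination idea, the adversarial objective can be sketched as a binary cross-entropy over pairs of embeddings. The pairing scheme, the distance-based scoring function, and the discriminator weights below are all hypothetical stand-ins, not the network architecture from the thesis:

```python
import math

def pair_discriminator_loss(embeddings, recording_ids, disc_weights):
    """Toy same-recording discriminator loss over all embedding pairs.

    Each pair is scored via a weighted squared difference of the two
    embeddings; a small distance yields a high probability that the
    pair comes from the same recording.  During embedding training,
    this loss would be *maximised* with respect to the embedding
    network (e.g. via a gradient reversal layer), pushing recording
    information out of the embedding space.
    """
    total, n = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            diff = [a - b for a, b in zip(embeddings[i], embeddings[j])]
            score = sum(w * d * d for w, d in zip(disc_weights, diff))
            p_same = 1.0 / (1.0 + math.exp(score))  # small distance -> p_same near 1
            target = 1.0 if recording_ids[i] == recording_ids[j] else 0.0
            eps = 1e-9
            total += -(target * math.log(p_same + eps)
                       + (1.0 - target) * math.log(1.0 - p_same + eps))
            n += 1
    return total / n
```

In training, the discriminator would minimise this loss while the embedding network maximises it, so that same-recording and cross-recording pairs become indistinguishable.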
In pursuit of encouraging speaker identity related variability, we propose a multi-task learning framework for leveraging auxiliary labels, adding attribute-related tasks to the overall training objective. In conjunction with the standard speaker classification task, the addition of speaker age and speaker nationality classification tasks was shown to improve verification and diarization performance, particularly when fine-tuning to new domains. We suggest this improvement
is due to the robustness imparted by structuring the embedding space in a way that
we already know is speaker identity related, thus decreasing the risk of fitting to other
non-identity related factors.
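The multi-task objective can be sketched as a weighted sum of the main speaker-classification loss and the auxiliary attribute losses. The task weights below are illustrative hyperparameters, not values from the thesis:

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -math.log(exps[target] / z)

def multitask_loss(spk_logits, spk_y, age_logits, age_y, nat_logits, nat_y,
                   w_age=0.1, w_nat=0.1):
    """Main speaker loss plus weighted auxiliary age and nationality
    classification losses, sharing one embedding network upstream."""
    return (cross_entropy(spk_logits, spk_y)
            + w_age * cross_entropy(age_logits, age_y)
            + w_nat * cross_entropy(nat_logits, nat_y))
```

Each auxiliary head only adds a small classification layer on top of the shared embedding, so the extra supervision shapes the embedding space at little cost.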
We also investigated the contribution of speaker attributes to speaker separability using disentangled representations. Here, we combined the aforementioned multi-task framework with adversarial methods to isolate aspects of speaker
identity in specific dimensions of the speaker embedding. We then ablated these dimensions and found that for different datasets, different speaker attributes were of
varying importance to separability in diarization and verification tasks, with gender a
particularly strong factor for in-the-wild celebrity utterances.
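The ablation step amounts to zeroing the embedding dimensions assigned to one attribute and re-scoring the pair. In this minimal sketch, the choice of dimensions 0-1 as a "gender" subspace and the example embeddings are purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity, a common scoring function for embedding pairs."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def ablate(embedding, attribute_dims):
    """Zero out the dimensions associated with one speaker attribute."""
    return [0.0 if i in attribute_dims else x for i, x in enumerate(embedding)]

# Two hypothetical same-speaker embeddings; dims 0-1 stand in for a
# disentangled "gender" subspace (the dimension assignment is illustrative).
a = [0.9, 0.8, 0.1, 0.2]
b = [0.9, 0.7, -0.3, 0.4]
full_score = cosine(a, b)
ablated_score = cosine(ablate(a, {0, 1}), ablate(b, {0, 1}))
```

Comparing `full_score` against `ablated_score` across a dataset indicates how much that attribute subspace contributes to separability.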
Furthermore, by examining the logits of the speaker classification network from which the speaker embeddings are extracted, we found that for a group of unseen utterances, the predicted posterior distribution (over training set speakers) was extremely skewed. By implementing a form of iterative fine-tuning on high probability training set speakers in
combination with a form of dropout on the output layer, we showed improvements to
verification performance. We suggest that the cause of this improvement is due to a
distribution mismatch of speakers, relating to the aforementioned speaker attributes.
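The skew observation and the output-layer dropout can be sketched as follows; the logits, the masking scheme (setting a dropped logit to a large negative value), and the dropout rate are illustrative assumptions, and the iterative fine-tuning loop is not modelled here:

```python
import math
import random

def softmax(logits):
    """Posterior over training-set speakers from the classifier logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def output_dropout(logits, p, rng):
    """Mask each output-layer logit with probability p by setting it to a
    large negative value, so its posterior mass vanishes -- a rough
    stand-in for dropout on the classification head."""
    return [-1e9 if rng.random() < p else x for x in logits]

# Hypothetical logits for an unseen utterance over 4 training speakers:
# the posterior concentrates almost entirely on speaker 0.
logits = [8.0, 0.5, 0.3, 0.1]
posterior = softmax(logits)
dropped = output_dropout(logits, 0.5, random.Random(0))
```

Measuring how concentrated `posterior` is over a held-out set is one way to expose the speaker distribution mismatch described above.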
Overall, we explored several different approaches to manipulating the variability factors present in deep speaker embeddings, finding that each approach had merits when
applied to specific scenarios. We suggest approaches for future work that build upon
the techniques outlined in this thesis, in particular for speaker attribute-related learning
and disentanglement.