Edinburgh Research Archive

Leveraging deep speaker embedding variability factors for verification and diarization

dc.contributor.advisor
Bell, Peter
dc.contributor.advisor
Renals, Stephen
dc.contributor.author
Luu, Chau Van Quy
dc.date.accessioned
2023-12-21T14:12:52Z
dc.date.available
2023-12-21T14:12:52Z
dc.date.issued
2023-12-21
dc.description.abstract
The tasks of speaker verification (SV, determining whether a test utterance has the same speaker as an enrolment utterance) and speaker diarization (SD, determining ‘who spoke when?’) both fall under the umbrella of speaker recognition. Both SV and SD have become highly applicable in mainstream technology: an example application of SV is a voice assistant that activates only for a specific user, and of SD, colour-coding subtitles according to the speaker that produced them. Both SV and SD systems have found success in recent years by utilising speaker embeddings, vector representations of speaker identity extracted from segments of speech. By comparing the similarity of the embeddings extracted from different utterances, it is possible to distinguish the speaker of each utterance, and this mechanism is what many successful verification and diarization systems are based upon. Deep speaker embeddings are speaker embeddings extracted from an intermediate layer of a neural network, trained in such a way as to encode speaker identity discriminatively in that layer. For example, the training objective is often speaker classification or a variation thereof, and while this has been shown to be a very successful strategy, various sources of information can be encoded into the embedding space. Some of these sources are speaker related and intuitively make up part of speaker identity, such as speaker gender. However, some sources of information, such as channel and recording information, are not explicitly speaker related but can nonetheless be captured during training. In this work, we look at these sources of information and variability, describing them as speaker embedding variability factors, and explore how they interact with and affect the downstream tasks of SV and SD.
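The embedding-comparison mechanism described above can be sketched as a cosine-similarity trial; the threshold value and the toy 3-dimensional vectors below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two speaker embeddings: values near 1
    # suggest the utterances were likely produced by the same speaker.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(enrol_emb, test_emb, threshold=0.5):
    # Verification decision: accept the trial if the enrolment and test
    # embeddings are similar enough; the threshold is tuned on held-out data.
    return cosine_similarity(enrol_emb, test_emb) >= threshold

# Illustrative trial with toy "embeddings".
enrol = np.array([1.0, 0.2, 0.0])
matched = np.array([0.9, 0.3, 0.1])
mismatched = np.array([-0.8, 0.1, 0.6])
```

In diarization the same score can be used to cluster per-segment embeddings, grouping segments whose pairwise similarity is high.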
Specifically, our work looks at the following topics: reducing channel variability, reducing speaker variability distribution mismatch, explicitly encouraging variability for speaker identity related factors for increased robustness, and investigating the contribution of certain speaker attributes to separability. For reducing channel variability, we propose a training regime based on adversarial methods, adding an adversarial loss based on discriminating whether pairs of embeddings come from the same recording. This approach adds channel invariance to the training objective of the embedding network while being dataset agnostic, requiring no additional labels. We show that the induced channel invariance improves verification for out-of-recording pairs of utterances, while also improving diarization performance. In the pursuit of encouraging speaker identity related variability, we propose a multi-task learning framework for leveraging auxiliary labels, adding attribute-related tasks to the overall training objective. In conjunction with the standard speaker classification task, the addition of speaker age and speaker nationality classification tasks was shown to improve verification and diarization performance, particularly when fine-tuning to new domains. We suggest that this improvement is due to the robustness imparted by structuring the embedding space in a way that we already know is speaker identity related, thus decreasing the risk of fitting to other non-identity related factors. We also investigated the contribution of speaker attributes to speaker separability using disentangled representations. Here, we combined the aforementioned multi-task framework with adversarial methods to successfully isolate aspects of speaker identity in specific dimensions of the speaker embedding.
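One way to read the combined objective sketched above is as a weighted sum of task losses, with the channel-discriminator term reversed, as in gradient-reversal-style adversarial training. The weights and function name below are hypothetical illustrations, not the thesis's actual configuration.

```python
def combined_objective(speaker_loss, age_loss, nationality_loss,
                       channel_disc_loss, w_age=0.1, w_nat=0.1, w_adv=0.5):
    # Auxiliary speaker-attribute tasks (age, nationality) are added to the
    # main speaker classification loss; the channel discriminator's loss is
    # subtracted, so lowering the total means making same-recording pairs
    # harder to detect (channel invariance via an adversarial term).
    return (speaker_loss
            + w_age * age_loss
            + w_nat * nationality_loss
            - w_adv * channel_disc_loss)
```

In practice the embedding network minimises this quantity while the channel discriminator separately minimises its own loss, so the two play a minimax game over the embedding space.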
We then ablated these dimensions and found that, for different datasets, different speaker attributes were of varying importance to separability in diarization and verification tasks, with gender a particularly strong factor for in-the-wild celebrity utterances. Furthermore, by looking at the logits of the speaker classification network from which speaker embeddings are extracted, we found that for a group of unseen utterances, the predicted posterior distribution (over training set speakers) was extremely skewed. By implementing a form of iterative fine-tuning on high-probability training set speakers in combination with a form of dropout on the output layer, we showed improvements to verification performance. We suggest that this improvement stems from a distribution mismatch of speakers, relating to the aforementioned speaker attributes. Overall, we explored several different approaches to manipulating the variability factors present in deep speaker embeddings, finding that each approach had merits when applied to specific scenarios. We suggest approaches for future work that build upon the techniques outlined in this thesis, in particular for speaker attribute-related learning and disentanglement.
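A minimal sketch of the posterior-inspection step described above, assuming softmax outputs over training-set speakers; the keep fraction, function names, and array shapes are illustrative assumptions rather than the thesis's exact procedure.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def high_probability_speakers(logits, keep_fraction=0.5):
    # Average the predicted posterior over a batch of unseen utterances,
    # then keep the training-set speakers receiving the most probability
    # mass; these would be candidates for a round of iterative fine-tuning.
    mean_posterior = softmax(logits).mean(axis=0)
    k = max(1, int(keep_fraction * mean_posterior.size))
    return np.argsort(mean_posterior)[::-1][:k]

# Toy batch of two utterances scored against four training speakers;
# speaker 0 dominates the skewed posterior.
logits = np.array([[5.0, 0.0, 0.0, 0.0],
                   [4.0, 0.0, 0.0, 1.0]])
```

If the posterior over training speakers is heavily skewed for new data, restricting fine-tuning to the dominant speakers is one way of compensating for the speaker-distribution mismatch the abstract describes.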
en
dc.identifier.uri
https://hdl.handle.net/1842/41307
dc.identifier.uri
http://dx.doi.org/10.7488/era/4042
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Chau Luu et al. (May 2020a). “Channel Adversarial Training for Speaker Verification and Diarization”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7094–7098. DOI: 10.1109/ICASSP40776.2020.9053323
en
dc.relation.hasversion
Chau Luu et al. (Aug. 30, 2021). “Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization”. In: Interspeech 2021. ISCA, pp. 491–495. DOI: 10.21437/Interspeech.2021-622. URL: https://www.isca-speech.org/archive/interspeech_2021/luu21_interspeech.html
en
dc.relation.hasversion
Chau Luu et al. (Sept. 18, 2022). “Investigating the Contribution of Speaker Attributes to Speaker Separability Using Disentangled Speaker Representations”. In: Interspeech 2022. ISCA, pp. 610–614. DOI: 10.21437/Interspeech.2022-10643. URL: https://www.isca-speech.org/archive/interspeech_2022/luu22_interspeech.html
en
dc.relation.hasversion
Chau Luu et al. (Nov. 1, 2020b). “Dropping Classes for Deep Speaker Representation Learning”. In: The Speaker and Language Recognition Workshop (Odyssey 2020). ISCA, pp. 357–364. DOI: 10.21437/Odyssey.2020-50. URL: https://www.isca-speech.org/archive/odyssey_2020/luu20_odyssey.html
en
dc.subject
speaker verification
en
dc.subject
speaker diarization
en
dc.subject
vector representations
en
dc.subject
deep speaker embeddings
en
dc.subject
neural network
en
dc.subject
channel variability
en
dc.subject
speaker classification network
en
dc.subject
variability factors
en
dc.title
Leveraging deep speaker embedding variability factors for verification and diarization
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Luu2023.pdf
Size:
7.4 MB
Format:
Adobe Portable Document Format