Leveraging deep speaker embedding variability factors for verification and diarization
dc.contributor.advisor
Bell, Peter
dc.contributor.advisor
Renals, Stephen
dc.contributor.author
Luu, Chau Van Quy
dc.date.accessioned
2023-12-21T14:12:52Z
dc.date.available
2023-12-21T14:12:52Z
dc.date.issued
2023-12-21
dc.description.abstract
The tasks of speaker verification (SV, determining whether a test utterance has the same
speaker as an enrolment utterance) and speaker diarization (SD, determining ‘who
spoke when?’) both fall under the umbrella of speaker recognition. Both SV
and SD have become highly applicable in mainstream technology. Example
applications of SV and SD are a voice assistant that only activates for a specific user,
and colour-coding subtitles according to the speaker that produced them, respectively.
Both SV and SD systems have found success in recent years by utilising speaker embeddings, vector representations of speaker identity extracted from segments of speech.
By comparing the similarity of the embeddings extracted from different utterances, it
is possible to distinguish the speaker of each utterance, and this mechanism is what
many successful verification and diarization systems are based upon.
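As a toy illustration of this scoring mechanism, the sketch below compares embeddings with cosine similarity; the embedding dimension, the within-speaker perturbation model, and the random seed are illustrative assumptions, not details from the thesis.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b)
                 / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

# Toy embeddings: an enrolment utterance, a test utterance from the
# same speaker (small within-speaker perturbation), and an utterance
# from a different speaker.
rng = np.random.default_rng(0)
enrol = rng.normal(size=256)
same = enrol + 0.1 * rng.normal(size=256)
other = rng.normal(size=256)

same_score = cosine_score(enrol, same)    # near 1: accept the trial
other_score = cosine_score(enrol, other)  # near 0: reject the trial
```

A verification system would accept or reject the trial by comparing such a score against a tuned threshold; a diarization system would cluster segments by the same similarity.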
Deep speaker embeddings are speaker embeddings extracted from an intermediate
layer of a neural network. This neural network is trained in such a way as to encode
speaker identity in a discriminative fashion in the desired intermediate layer. For example, the training objective is often speaker classification or a variation thereof.
While this has been shown to be a very successful strategy, various sources of information can be encoded into this embedding space. Some of these sources are speaker
related, and intuitively make up part of speaker identity, such as speaker gender. However, some sources of information, such as channel and recording information, are not
explicitly speaker related, but can nonetheless be captured during training.
In this work, we look at these sources of information and variability, describing them
as speaker embedding variability factors, and explore how they interact with and affect
the downstream tasks of SV and SD. Specifically, our work looks at the following topics: reducing channel variability, reducing speaker variability distribution mismatch,
explicitly encouraging variability for speaker identity related factors for increased robustness, and investigating the contribution of certain speaker attributes to separability.
For reducing channel variability, we propose a training regime based on adversarial
methods that adds an adversarial loss based on discriminating whether pairs of embeddings come from the same recording. This approach adds channel invariance to the
training objective of the embedding network while being dataset agnostic, requiring
no additional labels. We show that the induced channel invariance improves
verification for out-of-recording pairs of utterances, while also improving diarization
performance.
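The combined objective can be sketched as follows; the pairwise same-recording discriminator, the loss weighting, and the probabilities below are a schematic reading of a gradient-reversal style adversarial setup, not the exact formulation from the thesis.

```python
import numpy as np

def nll(p):
    """Negative log-likelihood of the correct outcome."""
    return -np.log(p + 1e-9)

def embedding_net_loss(p_speaker, p_same_rec_correct, lam=0.5):
    """Loss seen by the embedding network: minimise speaker
    classification loss, but *reward* fooling the same-recording
    discriminator (hence the minus sign on the channel term)."""
    return nll(p_speaker) - lam * nll(p_same_rec_correct)

# If the discriminator easily tells that a pair of embeddings shares
# a recording (p = 0.95), the embedding network is penalised more
# than when the discriminator is at chance (p = 0.5).
confident = embedding_net_loss(p_speaker=0.7, p_same_rec_correct=0.95)
fooled = embedding_net_loss(p_speaker=0.7, p_same_rec_correct=0.5)
```

Minimising this loss pushes the embeddings towards carrying no recording-identifying information, while the speaker term preserves discriminability.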
In the pursuit of encouraging speaker identity related variability, we propose a multi-task learning training framework for leveraging auxiliary labels, adding additional attribute related tasks to the overall training objective. In conjunction with the standard
speaker classification task, the addition of speaker age and speaker nationality classification tasks was shown to improve verification and diarization performance, particularly when fine-tuning to new domains. We suggest the reason for this improvement
is due to the robustness imparted by structuring the embedding space in a way that
we already know is speaker identity related, thus decreasing the risk of fitting to other
non-identity related factors.
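A minimal sketch of such a multi-task objective is below; the particular heads, class counts, and auxiliary weights are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def ce(logits, label):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def multitask_loss(spk_logits, spk_y,
                   age_logits, age_y,
                   nat_logits, nat_y,
                   w_age=0.1, w_nat=0.1):
    """Speaker classification loss plus weighted auxiliary attribute
    losses, all computed from heads sharing the same embedding."""
    return (ce(spk_logits, spk_y)
            + w_age * ce(age_logits, age_y)
            + w_nat * ce(nat_logits, nat_y))

spk = np.array([2.0, 0.5, -1.0])  # logits from the speaker head
age = np.array([0.3, 1.2])        # e.g. binned age classes
nat = np.array([0.1, 0.4, 0.2])   # nationality classes
total = multitask_loss(spk, 0, age, 1, nat, 1)
```

Because every head backpropagates into the shared embedding, the auxiliary attribute tasks shape the embedding space along dimensions known to be identity related.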
We also investigated the contribution of speaker attributes to speaker separability using disentangled representations. Here, we combined the aforementioned multi-task
framework along with adversarial methods to successfully isolate aspects of speaker
identity in specific dimensions of the speaker embedding. We then ablated these dimensions and found that for different datasets, different speaker attributes were of
varying importance to separability in diarization and verification tasks, with gender a
particularly strong factor for in-the-wild celebrity utterances.
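The ablation step can be illustrated as follows; the embedding layout (two hand-picked "gender" dimensions), the toy vectors, and the similarity metric are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ablate(emb, dims):
    """Zero the embedding dimensions tied to one attribute."""
    out = emb.copy()
    out[dims] = 0.0
    return out

# Toy 8-dim disentangled embeddings: dims 0-1 carry a gender factor,
# dims 2-7 carry the remaining speaker information.
GENDER_DIMS = [0, 1]
spk_m = np.array([1.0, 0.0, 0.9, 0.1, 0.2, 0.0, 0.3, 0.1])  # male speaker
spk_f = np.array([0.0, 1.0, 0.8, 0.2, 0.1, 0.1, 0.4, 0.0])  # female speaker

before = cosine(spk_m, spk_f)
after = cosine(ablate(spk_m, GENDER_DIMS), ablate(spk_f, GENDER_DIMS))
# Removing the gender dimensions makes this cross-gender pair more
# similar, i.e. harder to separate.
```

The size of such similarity shifts, aggregated over a dataset, indicates how much an attribute contributes to separability.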
Furthermore, by looking at the logits of the speaker classification network from which speaker
embeddings are extracted, we found that for a group of unseen utterances, the predicted posterior distribution (over training set speakers) was extremely skewed. By implementing a form of iterative fine-tuning on high probability training set speakers in
combination with a form of dropout on the output layer, we showed improvements to
verification performance. We suggest that the cause of this improvement is due to a
distribution mismatch of speakers, relating to the aforementioned speaker attributes.
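The class-selection step can be sketched as below; the averaging rule, keep fraction, and toy logits are illustrative assumptions rather than the thesis's exact procedure.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def keep_high_probability_speakers(logits_per_utt, keep_fraction=0.5):
    """Average the posterior over training-set speakers across the
    unseen utterances, then keep only the most probable speakers.
    Fine-tuning would proceed with the output layer restricted to
    (a form of dropout over) these classes."""
    mean_post = np.mean([softmax(z) for z in logits_per_utt], axis=0)
    k = max(1, int(len(mean_post) * keep_fraction))
    return np.argsort(mean_post)[::-1][:k]

# Skewed toy posteriors: training speakers 0 and 1 dominate on the
# unseen data, so only they are retained for fine-tuning.
logits = [np.array([4.0, 3.5, 0.1, -0.2]),
          np.array([3.8, 4.1, 0.0, -0.1])]
kept = keep_high_probability_speakers(logits, keep_fraction=0.5)
```

Restricting the output layer to the speakers that actually resemble the target population is one way to reduce the speaker distribution mismatch described above.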
Overall, we explored several different approaches to manipulating the variability factors present in deep speaker embeddings, finding that each approach had merits when
applied to specific scenarios. We suggest approaches for future work that build upon
the techniques outlined in this thesis, in particular for speaker attribute-related learning
and disentanglement.
en
dc.identifier.uri
https://hdl.handle.net/1842/41307
dc.identifier.uri
http://dx.doi.org/10.7488/era/4042
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Chau Luu et al. (May 2020a). “Channel Adversarial Training for Speaker Verification and Diarization”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7094–7098. DOI: 10.1109/ICASSP40776.2020.9053323
en
dc.relation.hasversion
Chau Luu et al. (Aug. 30, 2021). “Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization”. In: Interspeech 2021. ISCA, pp. 491–495. DOI: 10.21437/Interspeech.2021-622. URL: https://www.isca-speech.org/archive/interspeech_2021/luu21_interspeech.html
en
dc.relation.hasversion
Chau Luu et al. (Sept. 18, 2022). “Investigating the Contribution of Speaker Attributes to Speaker Separability Using Disentangled Speaker Representations”. In: Interspeech 2022. ISCA, pp. 610–614. DOI: 10.21437/Interspeech.2022-10643. URL: https://www.isca-speech.org/archive/interspeech_2022/luu22_interspeech.html
en
dc.relation.hasversion
Chau Luu et al. (Nov. 1, 2020b). “Dropping Classes for Deep Speaker Representation Learning”. In: The Speaker and Language Recognition Workshop (Odyssey 2020). ISCA, pp. 357–364. DOI: 10.21437/Odyssey.2020-50. URL: https://www.isca-speech.org/archive/odyssey_2020/luu20_odyssey.html
en
dc.subject
speaker verification
en
dc.subject
speaker diarization
en
dc.subject
vector representations
en
dc.subject
deep speaker embeddings
en
dc.subject
neural network
en
dc.subject
channel variability
en
dc.subject
speaker classification network
en
dc.subject
variability factors
en
dc.title
Leveraging deep speaker embedding variability factors for verification and diarization
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name: Luu2023.pdf
- Size: 7.4 MB
- Format: Adobe Portable Document Format