Speech segmentation and speaker diarisation for transcription and translation
Date: 27/06/2016
Author: Sinclair, Mark
Abstract
This dissertation outlines work related to Speech Segmentation – segmenting an audio
recording into regions of speech and non-speech, and Speaker Diarization – further
segmenting those regions into those pertaining to homogeneous speakers.
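For concreteness, diarization output is often expressed in NIST's RTTM format, with one
labelled speech segment per line giving onset and duration in seconds; the file name,
times and speaker labels below are invented purely for illustration:

```
SPEAKER meeting1 1  0.52  3.41 <NA> <NA> spk01 <NA> <NA>
SPEAKER meeting1 1  4.10  1.87 <NA> <NA> spk02 <NA> <NA>
SPEAKER meeting1 1  6.25  5.03 <NA> <NA> spk01 <NA> <NA>
```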
Knowing not only what was said, but also who said it and when, has many useful
applications. As well as providing a richer level of transcription for speech, we will
show how such knowledge can improve Automatic Speech Recognition (ASR) system
performance and can also benefit downstream Natural Language Processing (NLP)
tasks such as machine translation and punctuation restoration.
While segmentation and diarization may appear to be relatively simple tasks to
describe, in practice we find that they are very challenging and are, in general, ill-defined
problems. Therefore, we first provide a formalisation of each of the problems
as the sub-division of speech within acoustic space and time. Here, we see that the
task can become very difficult when we want to partition this domain into our target
classes of speakers, whilst avoiding other classes that reside in the same space, such as
phonemes. We present a theoretical framework for describing and discussing the tasks
as well as introducing existing state-of-the-art methods and research.
Current Speaker Diarization systems are notoriously sensitive to hyper-parameters
and lack robustness across datasets. Therefore, we present a method which uses a
series of oracle experiments to expose the limitations of current systems and to
identify the system components to which these limitations can be attributed. We also
demonstrate that Diarization Error Rate (DER), the dominant error metric in the
literature, is not a comprehensive or reliable indicator of overall performance or of
error propagation to downstream tasks. These results inform our subsequent research.
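For reference, DER aggregates three error types over the scored speech time. A minimal
sketch of the standard computation (the argument names are ours, not the thesis's):

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER as conventionally defined: the fraction of scored speech time that
    is missed, falsely detected as speech, or attributed to the wrong speaker.
    All arguments are durations in seconds."""
    return (missed + false_alarm + speaker_error) / total_speech

# e.g. 30 s missed + 12 s false alarm + 45 s speaker confusion over 1000 s
# of scored speech gives a DER of 8.7%:
print(diarization_error_rate(30.0, 12.0, 45.0, 1000.0))  # 0.087
```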
We find that Speech Segmentation, the precursor to Speaker Diarization, is a crucial
first step in the system chain. Current methods typically do not account for the
inherent structure of spoken discourse, so we explore a novel method which exploits
an utterance-duration prior to better model the segment distribution
of speech. We show how this method improves not only segmentation, but also
the performance of subsequent speech recognition, machine translation and speaker
diarization systems.
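To illustrate the general idea (not the thesis's exact formulation), a segmentation
hypothesis can be rescored with a prior over utterance lengths. The log-normal form,
parameter values and interpolation weight below are all assumptions for the sketch:

```python
import math

def lognormal_log_prior(duration, mu=0.9, sigma=0.6):
    # Log-density of a log-normal prior over utterance duration (seconds).
    # mu and sigma are illustrative; in practice they would be fitted to
    # utterance lengths observed in training data.
    return (-math.log(duration * sigma * math.sqrt(2.0 * math.pi))
            - (math.log(duration) - mu) ** 2 / (2.0 * sigma ** 2))

def rescore(segments, acoustic_log_score, prior_weight=0.1):
    # Combine an acoustic segmentation score with the duration prior,
    # applied to each hypothesised utterance (start, end) in seconds.
    prior = sum(lognormal_log_prior(end - start) for start, end in segments)
    return acoustic_log_score + prior_weight * prior
```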
Typical ASR transcriptions do not include punctuation and the task of enriching
transcriptions with this information is known as ‘punctuation restoration’. The benefit
is not only improved readability but also better compatibility with NLP systems
that expect sentence-like units, as in conventional machine translation. We show
how segmentation and diarization are related tasks that are able to contribute acoustic
information that complements existing linguistically-based punctuation approaches.
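To make the acoustic contribution concrete, here is a hedged sketch of the kind of
features that segmentation and diarization output can supply to a punctuation model;
the feature set and quantisation are illustrative, not taken from the thesis:

```python
def punctuation_features(word, pause_after, speaker_changed):
    # Pair a lexical feature with two acoustic cues available from the
    # segmentation/diarization output: the pause following the word and
    # whether the speaker changes at this point.
    return {
        "word": word.lower(),
        "pause_bin": min(int(pause_after * 10), 10),  # pause in 100 ms bins
        "speaker_change": int(speaker_changed),
    }

# A long pause plus a speaker change is strong evidence for a sentence
# boundary, and hence for a full stop:
print(punctuation_features("tomorrow", pause_after=0.8, speaker_changed=True))
```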
There is a growing demand for speech technology applications in the broadcast media
domain. This domain presents many new challenges, including diverse noise and
recording conditions. We show that the capacity of existing GMM-HMM-based speech
segmentation systems is limited in such scenarios and present a Deep Neural Network
(DNN) based approach that offers more robust speech segmentation, resulting in
improved speech recognition performance on a television broadcast dataset.
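The abstract does not detail the architecture, so the following is a generic sketch of
a DNN frame classifier for speech activity; the layer sizes and input dimensionality
(e.g. filterbank frames stacked with context) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpeechActivityDNN(nn.Module):
    # Frame-level speech/non-speech classifier: a feed-forward network over
    # acoustic feature vectors. All dimensions here are illustrative.
    def __init__(self, input_dim=440, hidden_dim=512, num_hidden=3):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers.append(nn.Linear(dim, 2))  # logits: [non-speech, speech]
        self.net = nn.Sequential(*layers)

    def forward(self, frames):  # frames: (batch, input_dim)
        return self.net(frames)

# Per-frame posteriors are typically smoothed (e.g. by an HMM or a median
# filter) before being converted into speech segments.
model = SpeechActivityDNN()
posteriors = torch.softmax(model(torch.randn(100, 440)), dim=-1)
```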
Ultimately, we are able to show that speech segmentation is an inherently ill-defined
problem whose solution is highly dependent on the downstream task for which it is
intended.