Dependence and independence in automatic speech recognition and synthesis.
Abstract
When automatically recognising or synthesising speech by computer, we are
forced to make a number of assumptions of statistical independence in order
to make certain problems tractable. This paper gives a few examples of how
phonetic knowledge is already usefully informing these decisions about independence, and a few examples of where it isn't, yet. Temporal integration (how information from a region of speech is related, and is gathered together during perception) is an important aspect of this.
Automatic speech recognition (ASR) and synthesis by computer usually involve various statistical models. For ASR, these are models of how acoustic patterns group into larger units (usually phones), how those phones group into words, and how words form sentences. For speech synthesis, we
need to predict various things, such as: pronunciations for words; durations
for phones; where to place phrase breaks and pitch accents; and so on.
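To make the role of independence assumptions concrete, the sketch below shows one familiar case: scoring a word sequence once we assume each word depends only on its immediate predecessor (a bigram assumption). It is purely illustrative; the variable names and toy probabilities are assumptions of this sketch, not taken from any particular system described here.

import math

# Toy bigram language model: log P(w_i | w_{i-1}).
# The probabilities are invented purely for illustration.
bigram_logprob = {
    ("<s>", "the"): math.log(0.4),
    ("the", "cat"): math.log(0.1),
    ("cat", "sat"): math.log(0.3),
    ("sat", "</s>"): math.log(0.2),
}

def sentence_logprob(words):
    """Score a sentence under the bigram independence assumption:
    each word is conditionally independent of all earlier words
    given its immediate predecessor."""
    total = 0.0
    for prev, curr in zip(["<s>"] + words, words + ["</s>"]):
        # Small floor value for word pairs never seen in training.
        total += bigram_logprob.get((prev, curr), math.log(1e-6))
    return total

print(sentence_logprob(["the", "cat", "sat"]))

Dropping the assumption (conditioning each word on its full history) would make the model far richer, but also far more expensive to estimate, which is exactly the trade-off discussed next.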
In building such models, deciding which factors are dependent and which are independent (in a statistical sense) is crucial. Modelling a dependency means a model with more parameters, so the more things we can assume to be independent, the better (so long as those assumptions are close enough to the truth).
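A generic counting argument makes the cost of modelling dependency explicit (the numbers below are an illustration of the sketch's own choosing, not figures from this paper): a full joint distribution over several categorical variables needs exponentially many free parameters, whereas assuming the variables independent needs only a linear number.

def joint_params(n_vars, n_values):
    """Free parameters of a full joint distribution over n_vars
    categorical variables, each taking n_values values."""
    return n_values ** n_vars - 1

def independent_params(n_vars, n_values):
    """Free parameters if all variables are assumed independent."""
    return n_vars * (n_values - 1)

# With 10 values per variable: 2 variables -> 99 vs 18 parameters,
# 5 variables -> 99,999 vs 45, 10 variables -> ~10^10 vs 90.
for n in (2, 5, 10):
    print(n, joint_params(n, 10), independent_params(n, 10))

The gap between the two counts is what makes well-chosen independence assumptions so valuable, provided they do not distort the phenomena being modelled.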