Dependence and independence in automatic speech recognition and synthesis.
When automatically recognising or synthesising speech by computer, we are forced to make a number of assumptions of statistical independence in order to make certain problems tractable. This paper gives a few examples of how phonetic knowledge is already usefully informing these decisions about independence, and a few examples of where it is not yet doing so. Temporal integration (how information from a region of speech is related, and how it is gathered together during perception) is an important aspect of this.

Automatic speech recognition (ASR) and speech synthesis by computer usually involve the use of various statistical models. For ASR, these are models of how acoustic patterns group into larger units (usually phones), how those phones group into words, and how words form sentences. For speech synthesis, we need to predict various things, such as pronunciations for words, durations for phones, and where to place phrase breaks and pitch accents. In building such models, deciding which factors are dependent and which are independent (in a statistical sense) is crucial. Modelling a dependency between factors requires a model with more parameters, and more parameters are harder to estimate reliably from limited data. So the more factors we can assume to be independent, the better, so long as those assumptions are close enough to the truth.
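To make the parameter cost of modelling dependency concrete, the following is a minimal sketch, assuming a set of discrete factors that each take the same number of values. The function names and the example numbers are illustrative assumptions and do not come from any particular ASR or synthesis system. It compares the number of free parameters in a full joint probability table against the number needed when every factor is assumed independent.

```python
def joint_param_count(num_values: int, num_factors: int) -> int:
    """Free parameters in a full joint table over discrete factors.

    The table has num_values ** num_factors cells, and the probabilities
    must sum to one, which removes one degree of freedom.
    """
    return num_values ** num_factors - 1


def independent_param_count(num_values: int, num_factors: int) -> int:
    """Free parameters when every factor is assumed independent.

    Each factor needs its own distribution with num_values - 1 free
    parameters, and the joint is just the product of these.
    """
    return num_factors * (num_values - 1)


# Hypothetical example: 3 factors, each with 10 possible values.
full = joint_param_count(10, 3)
indep = independent_param_count(10, 3)
print(f"full joint: {full} parameters, fully independent: {indep} parameters")
```

The joint count grows exponentially with the number of factors, while the independent count grows only linearly, which is why independence assumptions are attractive whenever they are close enough to the truth.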