Precise Estimation of Vocal Tract and Voice Source Characteristics
This thesis addresses the problem of quality degradation in speech produced by parameter-based speech synthesis, within the framework of an articulatory-acoustic forward mapping. I first investigate current problems in speech parameterisation, and point out the fact that conventional parameterisation inaccurately extracts the vocal tract response due to interference from the harmonic structure of voiced speech. To overcome this problem, I introduce a method for estimating filter responses more precisely from periodic signals. The method achieves such estimation in the frequency domain by approximating all the harmonics observed in several frames based on a least squares criterion. It is shown that the proposed method is capable of estimating the response more accurately than widely-used frame-by-frame parameterisation, for simulations using synthetic speech and for an articulatory-acoustic mapping using actual speech. I also deal with the source-filter separation problem and independent control of the voice source characteristic during speech synthesis. I propose a statistical approach to separating out the vocal-tract filter response from the voice source characteristic using a large articulatory database. The approach realises such separation for voiced speech using an iterative approximation procedure under the assumption that the speech production process is a linear system composed of a voice source and a vocal-tract filter, and that each of the components is controlled independently by different sets of factors. Experimental results show that controlling the source characteristic greatly improves the accuracy of the articulatory-acoustic mapping, and that the spectral variation of the source characteristic is evidently influenced by the fundamental frequency or the power of speech. The thesis provides more accurate acoustical approximation of the vocal tract response, which will be beneficial in a wide range of speech technologies, and lays the groundwork in speech science for a new type of corpus-based statistical solution to the source-filter separation problem.