An Investigation of Nonlinear Speech Synthesis and Pitch Modification Techniques
Abstract
Speech synthesis technology plays an important role in many aspects of man-machine interaction,
particularly in telephony applications. To be widely accepted, synthesised speech should sound
as human-like as possible. This thesis investigates novel techniques for the speech signal
generation stage of a speech synthesiser, based on concepts from nonlinear dynamical systems
theory. It focuses on natural-sounding synthesis of voiced speech, coupled with the ability to
generate the sound at the required pitch.
The one-dimensional voiced speech time-domain signals are embedded into an appropriate
higher-dimensional space using Takens' method of delays. These reconstructed state space
representations have approximately the same dynamical properties as the original speech-generating
system and are thus effective models.
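As an illustration of the embedding step (not code from the thesis; the function name, the synthetic signal, and the particular dimension and delay are all illustrative choices), a delay reconstruction can be sketched as:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Embed a 1-D signal x into a dim-dimensional state space
    using Takens' method of delays with delay tau (in samples)."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# Synthetic stand-in for a voiced segment: a fundamental plus one harmonic
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 80) + 0.3 * np.sin(4 * np.pi * t / 80)
X = delay_embed(x, dim=3, tau=10)
print(X.shape)  # (1980, 3): each row is one reconstructed state
```

In practice the embedding dimension and delay would be chosen from the data (e.g. via false nearest neighbours and mutual information), not fixed by hand as here.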
A new technique is proposed for marking epoch points in voiced speech, operating in the state
space domain. Since one revolution of the state space representation corresponds to one pitch
period, pitch-synchronous points can be located using a Poincaré map. The epoch pulses are
themselves pitch-synchronous and can therefore be marked.
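A minimal sketch of the Poincaré-section idea (a generic illustration, not the thesis algorithm: the section here is simply a plane on the first embedding coordinate, and the signal is synthetic):

```python
import numpy as np

def poincare_crossings(X, plane_value):
    """Indices where the trajectory crosses the section x0 = plane_value
    in the positive direction -- one crossing per revolution of the
    reconstructed orbit, hence one per pitch period."""
    x0 = X[:, 0]
    return np.where((x0[:-1] < plane_value) & (x0[1:] >= plane_value))[0] + 1

# Synthetic periodic signal with an 80-sample pitch period, embedded in 3-D
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 80)
X = np.column_stack([x[:-20], x[10:-10], x[20:]])  # delay embedding, tau = 10
marks = poincare_crossings(X, plane_value=0.5)
periods = np.diff(marks)  # all 80 samples for this ideal signal
```

On real speech the section would be placed through a well-separated region of the orbit, and the crossing points then serve as pitch-synchronous anchors from which the epochs can be marked.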
The same state space representation is also used in a locally-linear speech synthesiser. This
models the nonlinear dynamics of the speech signal by a series of local approximations, using
the original signal as a template. The synthesised speech is natural-sounding because, rather
than simply copying the original data, the technique makes use of the local dynamics to create
a new, unique signal trajectory. Pitch modification within this synthesis structure is also investigated,
with an attempt made to exploit the Šilnikov-type orbit of voiced speech state space
reconstructions. However, this approach is found to be incompatible with the locally-linear
modelling technique, leaving the pitch modification issue unresolved.
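The flavour of locally-linear free-running synthesis can be sketched as follows. This is a generic nearest-neighbour local affine predictor under assumed settings (unit delay, hand-picked dimension, a synthetic template), not the thesis implementation:

```python
import numpy as np

def make_template(x, dim):
    """Embed the template signal with unit delay (tau = 1 keeps the
    feedback update simple) and pair each state with its next sample."""
    n = len(x) - dim
    X = np.column_stack([x[i : i + n] for i in range(dim)])
    y = x[dim : dim + n]
    return X, y

def synthesise(X_tmpl, y_tmpl, state, n_samples, k=8):
    """Locally-linear free run: each new sample comes from an affine fit
    over the k nearest template states; the state is then shifted."""
    out = []
    state = np.asarray(state, dtype=float)
    for _ in range(n_samples):
        d = np.linalg.norm(X_tmpl - state, axis=1)
        idx = np.argsort(d)[:k]                      # k nearest template states
        A = np.column_stack([X_tmpl[idx], np.ones(k)])  # affine design matrix
        coef, *_ = np.linalg.lstsq(A, y_tmpl[idx], rcond=None)
        nxt = np.append(state, 1.0) @ coef
        out.append(nxt)
        state = np.append(state[1:], nxt)            # feed prediction back
    return np.array(out)

# Synthetic "voiced" template; synthesis continues its local dynamics
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 80) + 0.3 * np.sin(4 * np.pi * t / 80)
X_tmpl, y_tmpl = make_template(x, dim=5)
y = synthesise(X_tmpl, y_tmpl, X_tmpl[0], n_samples=400)
```

Because each step interpolates among neighbouring template states rather than replaying the template, the output trajectory follows the same dynamics without being a copy of the original data.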
A different modelling strategy, using a radial basis function neural network to model the state
space dynamics, is then considered. This produces a parametric model of the speech sound.
Synthesised speech is obtained by connecting a delayed version of the network output back to
the input via a global feedback loop, so that the network synthesises speech in a free-running
manner. Stability of the output is ensured by applying regularisation theory when learning the
weights. Complexity is also kept to a minimum because the network centres are fixed on a
data-independent hyper-lattice, so only the linear-in-the-parameters weights need to be learnt
for each vowel realisation. Pitch modification is again investigated, based on the idea of
interpolating the weight vector between different realisations of the same vowel at differing
pitch values. However, modelling the inter-pitch weight vector variations proves very difficult,
indicating that further study of pitch modification techniques is required before a complete
nonlinear synthesiser can be implemented.
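The RBF scheme above can be sketched as follows. This is a minimal illustration under assumed settings (Gaussian basis functions, a 5-point-per-axis lattice, ridge regularisation as a simple stand-in for the regularisation theory used, and a synthetic training signal), not the thesis model:

```python
import numpy as np

def rbf_design(X, centres, width):
    """Gaussian RBF design matrix: one column per fixed lattice centre."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

# Fixed, data-independent hyper-lattice of centres over the embedding cube
dim = 3
grid = np.linspace(-1.5, 1.5, 5)
centres = np.array(np.meshgrid(*[grid] * dim)).reshape(dim, -1).T  # 125 centres

# Training data: embedded vowel segment (synthetic here), unit delay
t = np.arange(1000)
x = np.sin(2 * np.pi * t / 80)
n = len(x) - dim
X = np.column_stack([x[i : i + n] for i in range(dim)])
y = x[dim : dim + n]

# Regularised (ridge) solution for the linear-in-the-parameters weights --
# the only quantities learnt per vowel realisation, since centres are fixed
Phi = rbf_design(X, centres, width=0.5)
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Free-running synthesis: delayed network output fed back to the input
state = X[0].copy()
out = []
for _ in range(300):
    nxt = rbf_design(state[None, :], centres, width=0.5) @ w
    out.append(nxt.item())
    state = np.append(state[1:], nxt)
```

Fixing the centres on a lattice makes training a single linear solve per vowel; the regularisation term both conditions that solve and smooths the learnt map, which is what keeps the free-running feedback loop from diverging.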