Duration, Pitch and Diphones in the CSTR TTS System
This paper describes the prosodic processing and waveform generation components of the text-to-speech system being developed at Edinburgh University's Centre for Speech Technology Research. Intonation is specified as a sequence of minimal descriptors whose locations are given in terms of syntactically determined prosodic domains. A pitch contour is computed by converting the descriptors into a sequence of abstract targets whose absolute values depend on a specific speaker model. Duration is determined first at the level of the syllable by a neural network, then accommodated at the segment level according to the distributions observed in a phonetically balanced database. The output waveform is generated by LPC resynthesis of diphone units. Three methods of diphone segmentation are discussed.
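
As an illustration of the two-stage duration scheme summarised above, the following minimal sketch shows one way a syllable duration could be shared out among its segments using per-segment distribution statistics. It is not taken from the system itself: the names `SegmentStats` and `fit_segments_to_syllable`, the linear stretch rule (each segment moved the same number of standard deviations from its mean), and all numeric values are assumptions made for the example. The syllable duration, which the system obtains from a neural network, is simply passed in as a number here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentStats:
    """Hypothetical per-segment duration statistics (milliseconds)."""
    mean_ms: float
    std_ms: float


def fit_segments_to_syllable(syllable_ms: float,
                             stats: List[SegmentStats]) -> List[float]:
    """Distribute a syllable duration over its segments.

    Assumed rule: every segment is shifted by the same factor k of its
    own standard deviation, with k chosen so the durations sum to the
    syllable duration. Very short syllables can yield implausibly small
    values, which a real system would need to guard against.
    """
    total_mean = sum(s.mean_ms for s in stats)
    total_std = sum(s.std_ms for s in stats)
    if total_std == 0:
        # No variance information: fall back to proportional scaling.
        scale = syllable_ms / total_mean
        return [s.mean_ms * scale for s in stats]
    k = (syllable_ms - total_mean) / total_std
    return [s.mean_ms + k * s.std_ms for s in stats]


# Illustrative values only; not measurements from any database.
example_stats = [SegmentStats(60.0, 10.0),
                 SegmentStats(100.0, 30.0),
                 SegmentStats(70.0, 20.0)]
example_durations = fit_segments_to_syllable(260.0, example_stats)
```

Under this assumed rule, segments with wider duration distributions absorb more of any lengthening or shortening, while segments with narrow distributions stay close to their mean.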