Edinburgh Research Archive

Bayesian networks for predicting duration of phones

Abstract


In a concatenative text-to-speech (TTS) system, the duration of a phonetic segment (phone) is predicted by a duration model, which is usually trained on a database of feature vectors consisting of the values of a set of linguistic factors (attributes) that describe a phone in a particular context. In general, the databases used to train phone duration models are unbalanced. However, it has been shown that the probability of a rare feature vector occurring even in a small sample of text is quite high. Furthermore, the factors affecting a phone's duration interact: a set of two or more factors may amplify or attenuate the effect of other factors. A robust model for predicting phone duration must generalise well in order to successfully predict the durations of phones with these rare feature vectors, and since the linguistic factors affecting segment duration interact, we would expect a model that captures these interactions to perform better.

A number of models have been developed for predicting a phone's duration, ranging from rule-based models to neural networks, classification and regression trees (CART) and sums-of-products (SoP) models. In the CART model, a phone's duration is predicted by a decision tree. The tree is built by recursively clustering the training data into subsets that share common values for certain attributes of the feature vectors; the duration of a phone is then predicted by using the tree to find the data cluster that matches as many of the feature vector's attributes as possible. The CART model is easy to build and robust to errors in the data, but performs poorly when too much data is missing. In the SoP model, the log of a phone's duration is predicted as a sum of products of factor terms. The SoP model predicts phone duration with high accuracy, even in cases of hidden or missing data, but at the cost of substantial data pre-processing.
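The two baseline predictors can be sketched as follows. This is a minimal illustration with invented features, durations and scales (none of these numbers come from the thesis): a toy CART that recursively clusters feature vectors and predicts the matched cluster's mean duration, and a one-term sum-of-products that is additive in the log domain.

```python
import math
from statistics import mean, pvariance

# Toy training data: (feature dict, duration in ms). All values invented.
DATA = [
    ({"stress": "stressed",   "position": "final"},  160),
    ({"stress": "stressed",   "position": "medial"}, 120),
    ({"stress": "unstressed", "position": "final"},  110),
    ({"stress": "unstressed", "position": "medial"},  70),
]

def build_tree(data, attrs):
    """CART-style tree: recursively split on the attribute whose clusters
    most reduce duration variance; leaves hold the cluster's mean duration."""
    durations = [d for _, d in data]
    if not attrs or len(set(durations)) == 1:
        return mean(durations)
    def split_cost(a):
        groups = {}
        for f, d in data:
            groups.setdefault(f[a], []).append(d)
        return sum(len(g) * pvariance(g) for g in groups.values() if len(g) > 1)
    best = min(attrs, key=split_cost)
    branches = {}
    for f, d in data:
        branches.setdefault(f[best], []).append((f, d))
    rest = [a for a in attrs if a != best]
    return (best, {v: build_tree(rows, rest) for v, rows in branches.items()})

def cart_predict(tree, features):
    """Descend the tree, matching as many of the feature vector's attributes
    as possible; an unseen value falls back to an arbitrary branch."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(features.get(attr), next(iter(branches.values())))
    return tree

# Sum-of-products with a single product term: log-duration is then a sum
# of per-factor log scales (scales invented for illustration).
SCALES = {
    "base":     {"aa": 90.0, "t": 50.0},              # base duration, ms
    "stress":   {"stressed": 1.3, "unstressed": 0.85},
    "position": {"initial": 1.1, "medial": 1.0, "final": 1.4},
}

def sop_predict(phone, stress, position):
    log_d = (math.log(SCALES["base"][phone])
             + math.log(SCALES["stress"][stress])
             + math.log(SCALES["position"][position]))
    return math.exp(log_d)

tree = build_tree(DATA, ["stress", "position"])
print(cart_predict(tree, {"stress": "stressed", "position": "final"}))  # → 160
print(round(sop_predict("aa", "stressed", "final"), 1))                 # → 163.8
```

On this toy data the CART simply memorises each cluster's mean, while the SoP prediction is a product of factor scales (90 × 1.3 × 1.4 ms), which is where its additivity in the log domain comes from.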
In addition, the number of distinct sums-of-products models grows hyper-exponentially with the number of factors, so heuristic search techniques must be used to find the model that best fits the data.

In our work, we use a Bayesian belief network (BN) consisting of discrete nodes for the linguistic factors and a single continuous node for the phone's duration. Interactions between factors are represented as conditional dependency relations in this graphical model. During training, the parameters of the belief network are learned via the Expectation Maximisation (EM) algorithm. The duration of each phone in the test set is then predicted via Bayesian inference: given the parameters of the belief network, we calculate the probability of a phone taking on a particular duration given the observations of the linguistic variables, and choose the duration value with the maximum probability.

We contrasted the results of the belief network model with those of the sums-of-products and CART models, training and testing all three models on the same data. In terms of RMS error, our BN model performs better than both the CART and SoP models; in terms of the correlation coefficient, it performs better than the SoP model and no worse than the CART model. We believe our Bayesian model has many advantages over the CART and SoP models. For instance, it captures the factors' interactions concisely through causal relationships among the variables in the graphical model, and it makes robust predictions of phone duration in cases of missing or hidden data.
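The prediction step can be sketched for the simplest possible network: one discrete parent and a conditional-Gaussian duration node. The prior and conditional parameters below are invented, hand-set stand-ins for the EM-learned values; the point of the sketch is only the inference pattern, where an observed factor selects its conditional mean and a missing factor is marginalised over its prior, which is what makes the model robust to hidden data.

```python
# One discrete parent ("stress") with a prior, and a conditional-Gaussian
# duration node holding one (mean, variance) pair per parent value.
# All numbers are illustrative, not learned parameters from the thesis.
PRIOR = {"stressed": 0.4, "unstressed": 0.6}
CPD = {"stressed": (140.0, 400.0), "unstressed": (90.0, 400.0)}  # ms, ms^2

def predict_duration(stress=None):
    """Posterior-mean duration: with the parent observed, the most probable
    duration of the Gaussian is its mean; with the parent missing, the mean
    is marginalised over the parent's prior distribution."""
    if stress is not None:
        return CPD[stress][0]
    return sum(p * CPD[v][0] for v, p in PRIOR.items())

print(predict_duration("stressed"))  # → 140.0
print(predict_duration())            # missing factor: 0.4*140 + 0.6*90 ≈ 110
```

With more factors the same pattern holds, except that the sum runs over the joint configurations of all unobserved parents, weighted by their (conditional) probabilities.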
