Statistical Tree-Based Modelling of Phonetic Segment Durations

27 February 1989

Segmental durations are affected by many factors: phonetic context, speaking rate, stress, word and phrasal position, etc. Regression trees [L. Breiman et al., Classification and Regression Trees (Wadsworth, Monterey, CA, 1984)] are well suited to capturing these effects, since they (1) statistically select the most significant features, (2) permit both categorical and continuous factors to be considered, (3) provide "honest" estimates of their performance, and (4) allow human interpretation and exploration of their results. Transcribed databases of 400 utterances from a single speaker and 4000 utterances from 400 speakers of American English were used to build optimal decision trees that predict segment durations from such factors. Over 70% of the durational variance for the single speaker and over 60% for the multiple speakers were accounted for by this method when using information only at the word level and below. These trees were used to derive durations for a text-to-speech synthesizer and were found to give more faithful results than the existing heuristically derived duration rules. Since tree building and evaluation are rapid once the data are collected and the candidate features specified, the technique can be readily applied to other feature sets and to other languages.
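The general idea can be illustrated with a minimal sketch of a regression tree that mixes categorical and continuous factors and reports the fraction of variance accounted for on held-out data. This is not the procedure used in the study: the feature names (phone, stress, rate_sps), the toy_duration formula, and all numbers are invented for illustration, and the sketch omits the pruning and cross-validation that a full treatment in the style of Breiman et al. would include.

```python
from statistics import mean

TARGET = "duration_ms"

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def goes_left(row, split):
    """Categorical splits test equality; continuous splits test <= threshold."""
    feature, value = split
    x = row[feature]
    return x == value if isinstance(value, str) else x <= value

def best_split(rows):
    """Return (gain, split) with the largest reduction in squared error, or None."""
    best = None
    parent = sse([r[TARGET] for r in rows])
    for feature in rows[0]:
        if feature == TARGET:
            continue
        distinct = sorted({r[feature] for r in rows})
        if isinstance(distinct[0], str):
            candidates = distinct  # one-vs-rest splits on each category
        else:
            # thresholds halfway between neighbouring observed values
            candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
        for value in candidates:
            split = (feature, value)
            left = [r[TARGET] for r in rows if goes_left(r, split)]
            right = [r[TARGET] for r in rows if not goes_left(r, split)]
            if not left or not right:
                continue
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, split)
    return best

def build(rows, depth=0, max_depth=3):
    """A node is (split, left_subtree, right_subtree); a leaf is a mean duration."""
    found = best_split(rows) if depth < max_depth and len(rows) > 1 else None
    if found is None or found[0] <= 1e-12:
        return mean(r[TARGET] for r in rows)
    split = found[1]
    left = [r for r in rows if goes_left(r, split)]
    right = [r for r in rows if not goes_left(r, split)]
    return (split, build(left, depth + 1, max_depth), build(right, depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        split, left, right = tree
        tree = left if goes_left(row, split) else right
    return tree

def variance_accounted_for(tree, rows):
    """1 - (residual sum of squares / total sum of squares) on the given rows."""
    residual = sum((r[TARGET] - predict(tree, r)) ** 2 for r in rows)
    return 1.0 - residual / sse([r[TARGET] for r in rows])

def toy_duration(phone, stress, rate_sps):
    """Hypothetical duration formula, used only to generate synthetic data."""
    base = {"aa": 120.0, "t": 60.0}[phone]
    return base + (30.0 if stress == "yes" else 0.0) - 5.0 * (rate_sps - 3.0)

def make_rows(rates):
    return [
        {"phone": p, "stress": s, "rate_sps": r, TARGET: toy_duration(p, s, r)}
        for p in ("aa", "t") for s in ("no", "yes") for r in rates
    ]

training = make_rows([3.0, 5.0])   # synthetic training segments
held_out = make_rows([4.0, 6.0])   # synthetic segments not seen in training

tree = build(training)
score = variance_accounted_for(tree, held_out)
print("first split:", tree[0])
print("variance accounted for on held-out rows:", round(score, 3))
```

Scoring on rows that were not used to grow the tree is one simple way to obtain a performance estimate that is not inflated by fitting; the nested tuples can also be printed and read directly, which is the kind of inspection the text refers to.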