A Unified Approach of Incorporating General Features in Decision Tree Based Acoustic Modeling

15 March 1999

In this paper, a unified maximum likelihood framework for incorporating phonetic and non-phonetic features in acoustic modeling based on decision tree state tying is proposed. Unlike phonetic features, non-phonetic features cannot be determined from the Viterbi alignment with the word or phoneme strings. Although non-phonetic features are used in speech recognition, they are often treated separately and handled with various heuristics. In our approach, non-phonetic features are included as additional tags in the clustering of equivalent states. They are used consistently with the features describing phonetic context, under the same maximum likelihood criterion that drives decision tree clustering. Moreover, the proposed tagged decision tree is built on the full training data and therefore alleviates the training data depletion that arises when separate acoustic models are built for each specific feature. Experimental results indicate that a word error rate reduction of up to 10% can be achieved with the proposed approach on a large vocabulary (Wall Street Journal) speech recognition task.
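The idea of treating a non-phonetic tag as just another question in maximum likelihood state clustering can be illustrated with a small sketch. The abstract does not give the paper's formulas, so the code below rests on stated assumptions: each state is modeled by a single one-dimensional Gaussian, the split criterion is the usual log-likelihood gain computed from pooled sufficient statistics, and all names (`StateStats`, `cluster_loglik`, `best_split`, the `tag` field, the question names, and the toy data) are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StateStats:
    """Sufficient statistics of one context-dependent state (1-D features)."""
    left: str    # left phonetic context
    right: str   # right phonetic context
    tag: str     # non-phonetic tag, e.g. speaker gender (hypothetical)
    n: float     # number of frames
    s: float     # sum of observations
    ss: float    # sum of squared observations


def stats_from(left, right, tag, samples):
    """Accumulate sufficient statistics from raw samples."""
    return StateStats(left, right, tag, float(len(samples)),
                      sum(samples), sum(x * x for x in samples))


def cluster_loglik(states, var_floor=1e-4):
    """Log-likelihood of the pooled data under one ML-estimated Gaussian.

    The closed form -n/2 * (log(2*pi*var) + 1) is exact when var is the
    ML variance; it is only approximate if the variance floor is applied.
    """
    n = sum(st.n for st in states)
    if n <= 0:
        return 0.0
    mean = sum(st.s for st in states) / n
    var = max(sum(st.ss for st in states) / n - mean * mean, var_floor)
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def best_split(states, questions, min_count=1.0):
    """Return (question name, gain) with the largest likelihood gain."""
    parent = cluster_loglik(states)
    best = None
    for name, pred in questions.items():
        yes = [st for st in states if pred(st)]
        no = [st for st in states if not pred(st)]
        # Skip questions that leave one side with too little data.
        if (sum(st.n for st in yes) < min_count
                or sum(st.n for st in no) < min_count):
            continue
        gain = cluster_loglik(yes) + cluster_loglik(no) - parent
        if best is None or gain > best[1]:
            best = (name, gain)
    return best


# Toy data: the tag separates the observations far better than the context.
states = [
    stats_from("b", "t", "male", [0.0, 0.2, -0.2, 0.1]),
    stats_from("m", "t", "male", [0.1, -0.1, 0.0, 0.2]),
    stats_from("b", "t", "female", [2.0, 2.2, 1.8, 2.1]),
    stats_from("m", "t", "female", [1.9, 2.1, 2.0, 2.2]),
]

# Phonetic and tag questions share one pool and one selection criterion.
questions = {
    "L_nasal": lambda st: st.left in {"m", "n"},
    "Tag_female": lambda st: st.tag == "female",
}

name, gain = best_split(states, questions)
print(name, round(gain, 2))
```

Because the tag question competes with the phonetic questions at every node, the tree splits on the tag only where doing so raises the likelihood, and every split is evaluated on the pooled statistics of all the data rather than on a pre-partitioned subset.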