On the Use of Energy in LPC-Based Recognition of Isolated Words

01 December 1982

In the last few years it has become common to use LPC coding techniques for speech recognition. The speech to be represented is modeled by a linear digital filter with coefficients chosen so that the transfer function of the filter approximates the spectrum of the speech over some short interval of time. Typically, a speech recognition system performs its task by comparing the unknown utterance, or test, with a number of previously stored reference patterns. Both the test and reference are characterized by a set of linear predictive coefficients. This is accomplished by digitizing the speech at some suitable rate and breaking the utterance into time-windowed regions, or "frames," upon which LPC analysis is performed. The frames of speech generally overlap and are typically spaced 10 to 20 ms apart in time. Thus, a typical 0.6-second utterance is represented by about 40 frames of linear predictive coefficients.

It is well known in the area of speech recognition that optimal time alignment of reference patterns to test patterns substantially reduces recognition errors for a vocabulary with polysyllabic words [1]. The most commonly used time alignment procedures for the speech recognition problem are the class of algorithms referred to as dynamic programming (DP) or dynamic time warping (DTW) methods [2-5].

Let us assume that we are given a characterization of an isolated word that consists of a set of N vectors of LPC coefficients. The test pattern, T, is represented as:

$$T = \{T(1), T(2), \ldots, T(N)\}, \tag{1}$$

where the vector T(i) is a spectral (LPC) representation of the ith frame of the test word.
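As an illustration of how a pattern such as T might be built, the following is a minimal sketch of frame-based LPC analysis using the autocorrelation method with the Levinson-Durbin recursion. It is not taken from the paper: the function names, the Hamming window, the 45-ms frame length, the 15-ms frame spacing, and the analysis order of 8 are all assumptions chosen for illustration, with only the frame spacing constrained by the 10 to 20 ms range quoted above.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping, Hamming-windowed frames (assumed window)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames


def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err): a[0] == 1 and a[1..order] are the coefficients of the
    prediction-error filter A(z) = sum_k a[k] z^-k; err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop the recursion early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lpc_pattern(samples, sample_rate, frame_ms=45.0, hop_ms=15.0, order=8):
    """Return a list of LPC vectors, one per frame: [T(1), ..., T(N)]."""
    frame_len = int(round(sample_rate * frame_ms / 1000))  # hypothetical default
    hop = int(round(sample_rate * hop_ms / 1000))          # hypothetical default
    pattern = []
    for frame in frame_signal(samples, frame_len, hop):
        r = autocorrelation(frame, order)
        a, _ = levinson_durbin(r, order)
        pattern.append(a)
    return pattern
```

Under these assumed settings, a 0.6-second utterance sampled at 8000 Hz (4800 samples) produces 38 frames, in line with the figure of about 40 frames quoted above; the count falls slightly short of 40 because only complete frames are kept at the end of the utterance.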