Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

01 January 2002

When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to both systems. In low signal-to-noise ratio (SNR) conditions, conventional approaches to endpoint detection and energy normalization often fail, and ASR performance usually degrades dramatically. The purpose of this paper is to address this endpoint detection problem. For ASR, we propose a real-time approach that uses an optimal filter plus a 3-state decision logic for endpoint detection. The filter is designed against several criteria to ensure accuracy and robustness, and its response is almost invariant across background noise levels. The detected endpoints are then applied sequentially to energy normalization. Evaluation results show that the proposed algorithm significantly reduces the string error rate in 7 of the 12 databases tested, and the reduction exceeds 50% in two of them. For SV, we propose a batch-mode approach that uses the optimal filter plus a two-mixture energy model for endpoint detection. The experiments show that the batch-mode algorithm detects endpoints as accurately as HMM forced alignment, at much lower computational complexity.
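To make the real-time scheme concrete, the following is a minimal Python sketch of a filter followed by a 3-state decision logic, operating on a per-frame log-energy contour. It is an illustration under stated assumptions, not the paper's method: the abstract does not give the optimal filter, so a simple antisymmetric moving-difference filter stands in for it, and the state names, thresholds (`upper`, `lower`), and hangover length (`hang`) are hypothetical. Because the stand-in filter only takes differences of log energy, a constant shift in background level leaves its output unchanged, which loosely mirrors the noise-level invariance described above.

```python
import math

def log_energy(frames):
    """Per-frame log energy in dB; `frames` is a list of lists of samples."""
    return [10.0 * math.log10(sum(s * s for s in f) / len(f) + 1e-10)
            for f in frames]

def edge_filter(energy, half_width=3):
    """Antisymmetric moving-difference filter (a stand-in, NOT the paper's
    optimal filter). Positive on rising edges, negative on falling edges,
    zero on flat regions. Looks `half_width` frames ahead, so a streaming
    version would run with that much latency."""
    n = len(energy)
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(1, half_width + 1):
            ahead = energy[min(t + k, n - 1)]   # clamp at the last frame
            behind = energy[max(t - k, 0)]      # clamp at the first frame
            acc += ahead - behind
        out.append(acc / half_width)
    return out

# Hypothetical state names for the 3-state logic.
SILENCE, IN_SPEECH, LEAVING = range(3)

def detect_endpoints(filtered, upper=15.0, lower=-15.0, hang=10):
    """Return a list of (begin_frame, end_frame) pairs.
    Thresholds and hangover length are assumed values for illustration."""
    state = SILENCE
    segments = []
    begin = end = None
    count = 0
    for t, v in enumerate(filtered):
        if state == SILENCE:
            if v >= upper:                 # rising edge: speech begins
                state, begin = IN_SPEECH, t
        elif state == IN_SPEECH:
            if v <= lower:                 # falling edge: candidate end point
                state, end, count = LEAVING, t, 0
        else:                              # LEAVING
            if v >= upper:                 # energy rose again: a pause, not the end
                state = IN_SPEECH
            else:
                count += 1
                if count >= hang:          # stayed low long enough: confirm end
                    segments.append((begin, end))
                    state = SILENCE
    if state != SILENCE:                   # close a segment still open at end of input
        segments.append((begin, end if state == LEAVING else len(filtered) - 1))
    return segments

# Usage on a synthetic contour: 20 quiet frames, 30 loud frames, 20 quiet frames.
contour = [30.0] * 20 + [60.0] * 30 + [30.0] * 20
print(detect_endpoints(edge_filter(contour)))
```

The third state is what distinguishes this from a plain two-threshold detector: a falling edge only proposes an end point, and the segment is closed only if no new rising edge arrives within the hangover window, so short pauses inside an utterance do not split it.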