Latency in Speech Feature Analysis for Telepresence Event Coding

23 August 2010


For videoconferencing, network bandwidth and screen real-estate constraints limit the number of user channels. We propose an intermediate transmission mode that transmits only at "events", where these are detected by both audio and video changes from the short-term signal average. Our objective in this paper is to determine the latency until the audio portion of a single telepresence channel stabilizes; it is from this stable signal that we detect events. We describe a recursive-filter approach to feature determination and experiments on the Switchboard telephone-call database. Results show a latency to stable signal of up to 10 seconds. Although events can be detected much more quickly (about 1 second) once the signal is stable, this latency must be accounted for at the start of a conversation or at changes in the short-term average.

Full video transmission of every participant is impractical for several reasons: 1) network bandwidth cannot support full video of even a single channel, let alone the many channels required for a multi-person conference; 2) monitors have finite screen space, limiting the number of video channels that can be displayed simultaneously; 3) we do not need to see full video of all users when much less frequent event notification is sufficient; and 4) many users do not want to undergo constant video "surveillance".

For computing audio-waveform envelope estimates, a class of filters that yields near real-time output is the IIR (infinite impulse response) filter. These are recursive linear filters, in which each output sample is in part a function of past outputs. IIR filters have disadvantages (such as potential instability at large filter orders and nonlinear phase), but the economy of this type of filter makes it the processing method of choice for many real-time applications.
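The paper does not specify its filter coefficients, so the following is only a minimal sketch of the recursive-filter idea: a first-order IIR envelope follower applied to the rectified waveform, with an assumed smoothing constant `alpha` (not taken from the paper). A larger `alpha` tracks the signal quickly; a smaller `alpha` approximates the short-term average from which event-causing deviations would be measured.

```python
import numpy as np

def envelope_iir(x, alpha=0.01):
    """First-order IIR (exponential) envelope follower.

    Implements the recursion y[n] = (1 - alpha) * y[n-1] + alpha * |x[n]|,
    so each output sample depends in part on the past output -- the
    defining property of a recursive (IIR) filter.  `alpha` is an
    assumed smoothing constant, not a value from the paper.
    """
    y = np.empty(len(x), dtype=float)
    acc = 0.0  # filter state (past output), initialized to silence
    for n, sample in enumerate(x):
        acc = (1.0 - alpha) * acc + alpha * abs(sample)
        y[n] = acc
    return y

# Hypothetical usage: a silent stretch followed by an active stretch.
x = np.concatenate([np.zeros(1000), np.ones(1000)])
env = envelope_iir(x, alpha=0.01)
```

Because each output reuses only the single previous output, the filter needs one multiply-add and one state variable per sample, which illustrates the "economy" that makes IIR filters attractive for real-time use. The trade-off noted in the text also shows up here: the envelope converges toward the new signal level only gradually, so a slowly adapting short-term average contributes to the stabilization latency measured in the paper.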