Spoken Dialogue Evaluation for the Bell Labs Communicator System
25 March 2002
In the first part of this paper, we expand on some of the analyses in [3], using the dialogues collected by the Bell Labs Communicator System [1,2] in the DARPA 2001 Communicator Evaluation. Since task completion is generally considered one of the most important factors in deploying a dialogue system, we investigate the effects of task completion on other objective and subjective measures. Dialogue-level metrics, however, do little to help system designers pinpoint "hot spots" in the dialogue where the system needs improvement. We therefore initiated a second study, in which naive raters evaluated the system responses on a turn-by-turn basis. The findings presented in the second part of the paper suggest that subjective metrics of dialogue quality may be difficult to reproduce.