Integrating geometrical and linguistic analysis for email signature block parsing

01 October 1999

The signature block is a common structured component found in email messages. Accurate identification and analysis of signature blocks are important in many multimedia messaging and information retrieval applications such as email text-to-speech rendering, automatic construction of personal address databases, and interactive message retrieval. It is also a very challenging task, because signature blocks often appear in complex two-dimensional layouts which are guided only by loose conventions. Traditional text analysis methods designed to deal with sequential text cannot handle two-dimensional structures, while the highly unconstrained nature of signature blocks makes the application of two-dimensional grammars very difficult. In this article, we describe an algorithm for signature block analysis which combines two-dimensional structural segmentation with one-dimensional grammatical constraints. The information obtained from both layout and linguistic analysis is integrated in the form of weighted finite-state transducers. The algorithm is currently implemented as a component in a preprocessing system for email text-to-speech rendering.
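
As a rough illustration of how layout evidence and grammatical constraints can be combined as weights, the sketch below scores label sequences for the segments of a signature block. It is not the algorithm described in the article. The label set, every cost value, and the function name `best_labeling` are hypothetical, and a plain dynamic-programming search over a small weighted automaton stands in for a full weighted finite-state transducer library. Costs are added along a path and the cheapest path wins, as in a tropical semiring.

```python
import math

# Hypothetical grammar over field labels: cost of moving from one label to the
# next (lower is more plausible). Pairs not listed are treated as disallowed.
# None of these values come from the article.
TRANSITIONS = {
    ("<s>", "name"): 0.0,
    ("name", "title"): 0.5,
    ("name", "phone"): 1.0,
    ("name", "email"): 1.0,
    ("title", "phone"): 0.5,
    ("title", "email"): 0.5,
    ("phone", "email"): 0.5,
    ("email", "phone"): 0.5,
}


def best_labeling(segment_costs):
    """Return (total_cost, labels) for the cheapest label sequence.

    segment_costs: one dict per segment, in reading order, mapping a
    candidate label to a cost assumed to come from layout and content cues.
    """
    # best[label] = (cost, path) of the cheapest path that ends in `label`.
    best = {"<s>": (0.0, [])}
    for costs in segment_costs:
        nxt = {}
        for prev, (prev_cost, path) in best.items():
            for label, seg_cost in costs.items():
                step = TRANSITIONS.get((prev, label), math.inf)
                total = prev_cost + step + seg_cost
                if total < nxt.get(label, (math.inf, None))[0]:
                    nxt[label] = (total, path + [label])
        best = nxt
    if not best:
        # No label sequence is accepted by the grammar.
        return (math.inf, [])
    return min(best.values(), key=lambda item: item[0])


# Hypothetical per-segment costs for a four-segment block.
segments = [
    {"name": 0.2, "title": 1.5},
    {"title": 0.3, "name": 1.0},
    {"phone": 0.1, "email": 3.0},
    {"email": 0.1, "phone": 3.0},
]
cost, labels = best_labeling(segments)
print(labels, round(cost, 2))
```

The sketch assumes the two-dimensional step has already happened: the block has been split into segments and the segments placed in reading order. Only the one-dimensional scoring is shown.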