Page Decomposition and Signature Finding via Shape Classification and Geometric Layout
20 September 1999
The decomposition of a page image into text and various types of non-text elements is a challenging problem that is important to document analysis tasks such as OCR, storage and retrieval, and identifying the sender and recipient of a FAX. A fast classifier based on a skeletonization of the image attempts to classify groups of related line segments as text, ruling lines, signatures, other line art, or miscellaneous. Everything classified as text is then processed by Baird's language-free layout analysis system so that a post-processor can use the geometric layout to refine the decisions about what is text and what is non-text. The refined result could then be further processed to identify complex objects such as tables, signature blocks, and line drawings. In order to recognize signatures and separate them from ruling lines and components of line drawings, line segments from the skeletonization need to be strung together by a curve-fitting process. After finding long, fairly straight lines and setting them aside, a more lenient criterion is used for stringing together pairs of segments to form the groups on which to run the fast classifier.
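
The two-pass stringing idea described above (first set aside long, fairly straight lines, then group what remains under a more lenient criterion) can be illustrated with a minimal sketch. This is not the curve-fitting process of the report: it substitutes a simple endpoint-gap and turning-angle test for curve fitting. The function names and all threshold values are hypothetical choices made for illustration, and segments are assumed to be oriented pairs of (x, y) endpoints.

```python
import math

def direction(seg):
    """Heading of an oriented segment ((x0, y0), (x1, y1)) in radians."""
    (x0, y0), (x1, y1) = seg
    return math.atan2(y1 - y0, x1 - x0)

def turn_angle(a, b):
    """Absolute change in heading from segment a to segment b, in [0, pi]."""
    d = abs(direction(b) - direction(a)) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def can_join(a, b, max_gap, max_turn):
    """Assumed joining test: b starts near the end of a and turns little."""
    return math.dist(a[1], b[0]) <= max_gap and turn_angle(a, b) <= max_turn

def string_segments(segments, max_gap, max_turn):
    """Greedily chain segments; each chain is extended from its tail only."""
    remaining = list(segments)
    chains = []
    while remaining:
        chain = [remaining.pop(0)]
        extended = True
        while extended:
            extended = False
            for i, seg in enumerate(remaining):
                if can_join(chain[-1], seg, max_gap, max_turn):
                    chain.append(remaining.pop(i))
                    extended = True
                    break
        chains.append(chain)
    return chains

def chain_length(chain):
    """Total length of the segments in a chain."""
    return sum(math.dist(p, q) for p, q in chain)

def two_pass_grouping(segments, max_gap=2.0,
                      strict_turn=math.radians(5),    # hypothetical threshold
                      lenient_turn=math.radians(60),  # hypothetical threshold
                      min_line_length=50.0):          # hypothetical threshold
    """Pass 1: strict stringing, keeping only long chains as straight lines.
    Pass 2: lenient stringing of every segment not used in a long line."""
    strict = string_segments(segments, max_gap, strict_turn)
    long_lines = [c for c in strict if chain_length(c) >= min_line_length]
    leftovers = [s for c in strict
                 if chain_length(c) < min_line_length for s in c]
    groups = string_segments(leftovers, max_gap, lenient_turn)
    return long_lines, groups
```

Calling `two_pass_grouping(segments)` returns the long straight chains and the leniently strung groups separately; under the report's scheme, only the latter would be passed on to the fast classifier. A real implementation would also need to handle unoriented segments and branching in the skeleton, both of which this sketch ignores.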