Document Image Layout Comparison and Classification

20 September 1999

This paper describes methods for comparing and classifying document images at the level of spatial layout. The goal is to develop fast algorithms that perform an initial classification of document type without OCR; the result can then be verified using more elaborate methods based on more detailed geometric and syntactic models. The approach rests on the assumption that different types of printed documents often have fairly distinct spatial layout styles. A novel feature set called interval encoding is introduced to capture elements of spatial layout. It encodes region layout information in fixed-length vectors by capturing structural characteristics of the image.

These fixed-length vectors are then compared with one another using the Manhattan distance, which allows fast page layout comparison. The paper describes experiments and results on rank-ordering a set of document pages by their layout similarity to a test document. We also demonstrate the usefulness of features derived from interval encoding in a hidden Markov model-based page layout classification system that is trainable and extensible. The methods described in the paper can be used in various document processing tasks, including document retrieval, understanding, and routing.
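
The ranking step can be illustrated with a minimal sketch. It assumes each page has already been reduced to a fixed-length numeric vector; how interval encoding produces those vectors is not shown here, and the function names `manhattan_distance` and `rank_by_layout_similarity`, as well as the example vectors, are hypothetical placeholders rather than the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(a) != len(b):
        raise ValueError("layout vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_layout_similarity(
    query: Sequence[float],
    pages: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (page name, distance) pairs, most similar page first."""
    scored = [(name, manhattan_distance(query, vec)) for name, vec in pages.items()]
    # Smaller distance means more similar layout under this measure.
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Made-up vectors standing in for encoded page layouts (assumption).
    test_page = [4, 0, 2, 7]
    collection = {
        "letter": [4, 1, 2, 6],
        "journal_page": [0, 5, 9, 1],
        "memo": [3, 0, 2, 7],
    }
    for name, dist in rank_by_layout_similarity(test_page, collection):
        print(name, dist)
```

Because the distance is a single pass over two equal-length vectors, the cost of ranking grows linearly with both the vector length and the number of pages in the collection, which fits the stated goal of a fast first-pass comparison.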