Exploration of Contextual Constraints for Character Pre-Classification

01 January 2001


(PREVIOUS TITLE: Identification of Case, Digits and Special Symbols Using a Context Window)

We present strategies and results for identifying the symbol type of every character in a text document. Assuming reasonable word and character segmentation for shape clustering, we designed several type recognition methods that depend on cluster n-grams, characteristics of neighbors, and within-word context. On an ASCII test corpus of 925 articles, these methods represent a substantial improvement over default assignment of all characters to lower case.
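To make the baseline concrete, the following is a minimal sketch of how the "all characters are lower case" default could be scored on ASCII text. It is an illustration only, not the authors' method: the four-way label set (upper, lower, digit, special) is inferred from the previous title, and the function names `symbol_type` and `lowercase_baseline_accuracy`, as well as the choice to ignore whitespace, are assumptions introduced here.

```python
from collections import Counter


def symbol_type(ch: str) -> str:
    """Map one character to a coarse symbol type.

    Assumed four-way scheme (not taken from the paper): upper, lower,
    digit, special. Intended for ASCII input, as in the test corpus.
    """
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    return "special"


def lowercase_baseline_accuracy(text: str) -> float:
    """Fraction of non-whitespace characters whose true type is 'lower'.

    This equals the accuracy of a classifier that labels every character
    as lower case. Skipping whitespace is an assumption of this sketch.
    """
    labels = [symbol_type(ch) for ch in text if not ch.isspace()]
    if not labels:
        return 0.0
    return Counter(labels)["lower"] / len(labels)


if __name__ == "__main__":
    # One character of each type: only 'b' is lower case.
    print(lowercase_baseline_accuracy("Ab1!"))  # 0.25
```

Any context-based method of the kind the abstract describes would be compared against this figure. The more capitals, digits, and punctuation a corpus contains, the lower the baseline and the more room contextual cues have to help.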