OCR with No Shape Training

01 January 2000

We present a document-specific OCR system and apply it to a corpus of faxed business letters. The pages are segmented into text lines, words, and characters. Unsupervised classification of the character bitmaps on each page, using a "clump" metric, typically yields several hundred clusters with highly skewed populations. Letter identities are assigned to each cluster by maximizing the number of matches with a lexicon of English words. The resulting transcription is evaluated by comparing it with word-level ground truth. We found that for two-thirds of the pages, we can identify almost 80% of the words that are included in the lexicon, without any shape-based training. As expected, most of the residual errors are caused by mis-segmentation, including missed lines, and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in three respects: 1) it uses real data; 2) it uses a more appropriate clustering algorithm; and 3) it decodes a many-to-many instead of a one-to-one mapping between clusters and letters.
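The idea of assigning letters to clusters by maximizing lexicon matches can be illustrated with a toy sketch. This is not the decoding procedure of the system described above: it assumes error-free segmentation, a tiny lexicon, and a many-to-one mapping (several clusters may share one letter) rather than a many-to-many mapping, and it uses plain exhaustive search with pruning. The names `extend`, `decode_clusters`, and `transcribe`, and the integer cluster IDs, are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Mapping = Dict[int, str]  # hypothetical representation: cluster ID -> letter


def extend(word: Sequence[int], candidate: str, mapping: Mapping) -> Optional[Mapping]:
    """Return `mapping` extended so that `word` spells `candidate`, or None on conflict."""
    if len(word) != len(candidate):
        return None
    extended = dict(mapping)
    for cluster_id, letter in zip(word, candidate):
        # setdefault assigns the letter to a new cluster, else returns the existing letter.
        if extended.setdefault(cluster_id, letter) != letter:
            return None
    return extended


def decode_clusters(words: List[Sequence[int]], lexicon: Sequence[str]) -> Tuple[int, Mapping]:
    """Search exhaustively for the mapping under which the most words match the lexicon."""
    candidates = sorted(set(lexicon))  # fixed order keeps the search deterministic
    best_count = -1
    best_mapping: Mapping = {}

    def search(i: int, mapping: Mapping, matched: int) -> None:
        nonlocal best_count, best_mapping
        # Prune: even matching every remaining word cannot beat the best so far.
        if matched + (len(words) - i) <= best_count:
            return
        if i == len(words):
            best_count, best_mapping = matched, mapping
            return
        for candidate in candidates:
            extended = extend(words[i], candidate, mapping)
            if extended is not None:
                search(i + 1, extended, matched + 1)
        search(i + 1, mapping, matched)  # leave this word unmatched

    search(0, {}, 0)
    return best_count, best_mapping


def transcribe(words: List[Sequence[int]], mapping: Mapping) -> str:
    """Spell out each word, using '?' for clusters that received no letter."""
    return " ".join("".join(mapping.get(c, "?") for c in word) for word in words)


# Toy page: each word is a sequence of cluster IDs. Clusters 0 and 5 are two
# different shapes of the same letter, so the mapping is not one-to-one.
words = [(0, 1, 2), (3, 4, 5), (4, 0, 2), (5, 1, 2), (0, 2, 4)]
lexicon = ["ate", "cat", "tea", "the"]

count, mapping = decode_clusters(words, lexicon)
print(count, transcribe(words, mapping))
```

Exhaustive search is only practical at this toy scale. With several hundred clusters per page, as reported above, some heuristic or incremental search would be needed.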