Fast Identification of Stop Words for Font Learning and Keyword Spotting
20 September 1999
A recently proposed adaptive strategy for text recognition attempts to derive knowledge about the dominant font on a given page. The strategy relies on the linguistic observation that over half of all words in a typical English passage belong to a small set of fewer than 150 stop words. The small size of the set makes word shape analysis practical for recognizing those words. Such analysis yields word identities, with which character prototypes can be extracted from the word images. Ideally, shape analysis should be applied only to the most likely stop words. This paper describes a fast procedure for locating the most likely candidates for those words. The procedure uses width statistics of individual words and their immediate neighbors, and is shown to be feasible and reliable on both simulated and real images. In an experiment using 400 page images, the method eliminated 63% of the words from consideration as stop words. The discrimination between stop and non-stop words also assists keyword spotting for information retrieval.
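The abstract does not specify which width statistics the procedure computes, so the following is only a minimal illustrative sketch of a width-based pre-filter and not the paper's method. It assumes that word bounding-box widths (in pixels, in reading order) are already available, and it keeps a word as a stop-word candidate when it is narrow relative to the median width of its immediate neighbors. The function name `candidate_stop_words` and the parameters `ratio` and `window` are hypothetical choices made for this example.

```python
from statistics import median


def candidate_stop_words(widths, ratio=0.8, window=2):
    """Return indices of words that are narrow relative to their neighbors.

    widths: word image widths in pixels, in reading order (assumed input).
    ratio:  hypothetical threshold; a word is kept when its width is at most
            ratio * median(neighbor widths).
    window: hypothetical number of neighbors considered on each side.
    """
    candidates = []
    for i, w in enumerate(widths):
        lo = max(0, i - window)
        hi = min(len(widths), i + window + 1)
        # Neighbor widths, excluding the word itself.
        neighbors = widths[lo:i] + widths[i + 1:hi]
        if not neighbors:
            continue  # a lone word has no context to compare against
        if w <= ratio * median(neighbors):
            candidates.append(i)
    return candidates


if __name__ == "__main__":
    # Made-up widths for a seven-word line; not data from the paper.
    widths = [30, 120, 25, 140, 110, 28, 130]
    kept = candidate_stop_words(widths)
    rejected = len(widths) - len(kept)
    print("candidate indices:", kept)
    print("words removed from consideration:", rejected, "of", len(widths))
```

Only the words that survive such a filter would then be passed to the more expensive word shape analysis. The threshold and window size here are arbitrary and would need tuning on real page images.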