UNIX Time-Sharing System: Statistical Text Processing

01 July 1978

By L. E. McMAHON, L. L. CHERRY, and R. MORRIS

(Manuscript received December 5, 1977)

Several studies of the statistical properties of English text have used the UNIX* system and UNIX programming tools. This paper describes several of the useful UNIX facilities for statistical studies and summarizes some studies that have been made at the character level, the character-string level, and the level of English words. The descriptions give a sample of the results obtained and constitute a short introduction, by case-study, on how to use UNIX tools for studying the statistics of English.

* UNIX is a trademark of Bell Laboratories.

I. INTRODUCTION

The UNIX system is an especially friendly environment in which to do statistical studies of English text. The file system does not impose arbitrary limits on what can be done with different kinds of files and allows tools to be written to apply to files of text, files of text statistics, etc. Pipes and filters allow small steps of processing to be combined and recombined to effect very diverse purposes, almost as English words can be recombined to express very diverse thoughts. The C language, native to the UNIX system, is especially convenient for programs which manipulate characters. Finally, an accidental but important fact is that many UNIX systems are heavily used for document preparation, thus ensuring the ready availability of text for practicing techniques and sharpening tools.

This paper gives short reports on several different statistical