What Use is Statistics for Massive Data?

01 January 2000

Statistics in the broad sense is about extracting information from data. The common view of statistics is much narrower, though: it is often seen only as a set of cookbook methods designed for small data sets obtained according to a known design or sampling plan. Massive, dynamic data sets of tens or hundreds of gigabytes, or even terabytes, are increasingly common in business, manufacturing, environmental sciences, astronomy, data networking and many other areas, and they are often felt to be beyond the domain of statistics. Moreover, the most visible challenges with massive data involve computing, which leads to the view that data mining and other branches of computer science are more appropriate for massive data than statistics is. This paper argues, however, that statistics in the broad sense is essential for extracting information from really big sets of data, and shows how statistics intersects with and complements both mathematics and computer science. An application to real-time fraud detection illustrates the ideas.