Distinct Counting with a Self-Learning Bitmap
29 March 2009
Estimating the number of distinct elements (cardinality) is a fundamental problem in database that has attracted recurrent interest, due to its extensive applications in modern computer networks. There have been several probabilistic approaches, proposed in the computer science literature to address this problem in online applications that only require limited computing and memory resources. However, an undesirable property of these methods is that their performances are not scale-invariant, in the sense that their relative root mean square estimation errors depend on the unknown cardinalities. In this paper, we develop a novel approach, called self-learning bitmap (S-bitmap) that is scale-invariant for cardinalities in a wide range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale-invariant. We also demonstrate with extensive simulation and experimental studies that, in order to achieve the same accuracy as state-of-the-art methods, our approach requires significantly less memory and consumes similar or less operations for many practical cardinality scales.