Distinct Counting via a Self-Learning Bitmap
01 January 2009
Estimating the number of distinct elements (cardinality) is a fundamental problem in database that has attracted active research in the past two decades, due to its extensive applications (especially in the Internet). Several algorithms have been proposed via sampling or sketching for obtaining statistical estimates that only require limited computing and memory resources. Although some of these estimates have desirable statistical properties such as asymptotically unbiasedness, their performance in terms of relative estimation accuracy usually depends on the unknown cardinalities. For example, with limited memory, linear counting cite{whang.etc.1990} works best with smaller cardinalities while the LogLog counting algorithm cite{durand.flajolet.2003} works best with large cardinalities. In this paper, we address the following question: can a distinct counting algorithm have uniformly reliable performance, i.e. constant relative estimation errors for a unknown cardinality in a wide range, say from tens to millions? We propose a self-learning bitmap algorithm (S-bitmap) to answer this question. The S-bitmap is a bitmap obtained via a novel adaptive sampling process, where the bits corresponding to the sampled items are set to 1, and the sampling rates are learned from the number of distinct items already passed and reduced sequentially as more bits are set to 1. A unique property of S-bitmap is that its relative estimation error is truly stabilized, i.e. invariant to the unknown cardinality in a prescribed range. We demonstrate through both theoretical and empirical studies that with a given memory requirement, S-bitmap is more accurate and reliable than state-of-the-art algorithms such as multi-resolution Bitmap, LogLog counting and Hyper-LogLog counting algorithms.