Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data

01 June 2001

New Image

Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization, query profiling, and approximate query answering. Existing work has proposed two distinct approaches for tackling this difficult approximation problem. The first approach uses simple one-dimensional histograms on the marginal distributions of individual attributes in conjunction with a simplistic attribute-independence assumption. The second approach uses a multi- dimensional histogram for directly approximating the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes.