ApproxJoin: Approximate Distributed Multi-way Joins
11 October 2018
The join operation is a fundamental building block of parallel data processing. Unfortunately, computing an equi-join across massive datasets is very resource-intensive. The approximate computing paradigm allows users to trade accuracy for lower latency in expensive data processing operations. The equi-join operator is thus a natural candidate for optimization using approximation techniques. Although sampling-based approaches are widely used for approximation, sampling over joins is compelling but challenging with respect to output quality. Naive approaches, which perform joins over samples of the input datasets, do not preserve the statistical properties of the join output. To realize this potential, we combine Bloom filter sketching and stratified sampling into a novel approximate join operator. The operator uses Bloom filters to avoid shuffling non-joinable tuples across the network and then applies stratified sampling to obtain an unbiased, representative sample of the join output. A distinctive property of our design is that, instead of decomposing a multi-way join into a sequence of binary joins, it joins multiple datasets in a single processing step. Our analysis shows that the technique scales well and significantly reduces data movement without sacrificing tight error bounds on the accuracy of the final output. We implemented ApproxJoin in Apache Spark and evaluated it using micro-benchmarks and real-world case studies. The evaluation shows that, with a sampling fraction of 10%, ApproxJoin achieves a speedup of 6-9 times over unmodified Spark-based joins. Furthermore, the speedup is accompanied by a significant reduction in shuffled data: 5-82 times less than unmodified Spark-based joins.
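The two-stage idea described above (filter non-joinable tuples with Bloom filters, then sample the join output per join key) can be illustrated with a minimal single-process sketch. This is not the ApproxJoin Spark implementation: the names `BloomFilter` and `approx_join_sum`, the filter size, the use of sampling with replacement, and the choice of a SUM aggregate over the summed tuple values are all assumptions made for illustration.

```python
import hashlib
import math
import random
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        # Bitwise AND keeps every key present in both inputs
        # (false positives are possible, false negatives are not).
        out = BloomFilter(self.num_bits, self.num_hashes)
        out.bits = self.bits & other.bits
        return out


def approx_join_sum(datasets, fraction, rng):
    """Estimate SUM(v1 + ... + vn) over the n-way equi-join of `datasets`.

    Each dataset is a list of (key, numeric_value) pairs.
    """
    # Stage 1: one filter per input, AND-ed into a single join filter.
    join_filter = None
    for data in datasets:
        bf = BloomFilter()
        for key, _ in data:
            bf.add(key)
        join_filter = bf if join_filter is None else join_filter.intersect(bf)

    # Drop tuples that cannot join, then group the survivors by key.
    strata = []
    for data in datasets:
        groups = defaultdict(list)
        for key, value in data:
            if key in join_filter:
                groups[key].append(value)
        strata.append(groups)

    # Keys that slipped through as false positives are removed here.
    common = set(strata[0]).intersection(*strata[1:])

    # Stage 2: treat each join key as a stratum and sample its cross product
    # in one step across all inputs, without materialising the full join.
    estimate = 0.0
    for key in common:
        sides = [groups[key] for groups in strata]
        population = math.prod(len(side) for side in sides)
        sample_size = max(1, math.ceil(fraction * population))
        sampled = sum(
            sum(rng.choice(side) for side in sides) for _ in range(sample_size)
        )
        # Scale the sample total up to the size of the stratum.
        estimate += population / sample_size * sampled
    return estimate
```

Calling `approx_join_sum([r1, r2, r3], fraction=0.1, rng=random.Random(0))` on three lists of `(key, value)` pairs returns an estimate of the aggregate over the three-way join. Sampling inside each key's cross product, rather than sampling the inputs before joining, is what keeps every join key represented in the sample.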