The Marriage of Approximate Computing and Privacy-Preserving Data Analytics

17 May 2016

Online advertising is a major economic force for modern online services, where users' private data is continuously collected for real-time data analytics. In the current advertising ecosystem, the goals of users and data analysts are at odds: users seek stronger privacy, while analysts strive for high-utility data analytics in near real time. In this paper, we aim to design a pragmatic privacy-preserving data analytics system that resolves this tension. More specifically, we answer the following research question: how can we preserve users' privacy while supporting high-utility data analytics for low-latency stream processing? Our design builds on the marriage of two existing computing paradigms: privacy-preserving data analytics and approximate computing. Privacy-preserving data analytics techniques, such as differential privacy, produce noisy output to protect individual users' privacy. Approximate computing returns an approximate output instead of the exact output to achieve low-latency execution (and also efficient utilization of computing resources). We observe that these two computing paradigms are complementary and can be married together. Both paradigms strive for approximation, but they differ in their means of computing the approximate output: privacy-preserving analytics adds explicit noise to the final aggregate query output, whereas approximate computing relies on representative sampling of the entire dataset to compute over only a subset of data items. To realize this marriage, we designed an online sampling algorithm that achieves zero-knowledge privacy and produces an approximate output with bounded error in real time. We implemented our algorithm in a data analytics system called PrivApprox, based on Apache Flink Streaming. Our evaluation using micro-benchmarks and real-world case studies shows that PrivApprox achieves the benefits of both approximate computing and privacy-preserving data analytics.
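To make the contrast between the two means of approximation concrete, the following is a minimal Python sketch. It is not the PrivApprox algorithm. It only illustrates a counting query answered (a) by adding Laplace noise to the exact aggregate, in the style of differential privacy, and (b) by computing over a random sample and scaling the result up, in the style of approximate computing. The function names, the `epsilon` and `fraction` parameters, and the synthetic data are assumptions made for illustration.

```python
import random


def noisy_count(values, epsilon, rng):
    """Exact count plus Laplace noise (illustrative; assumes sensitivity 1)."""
    true_count = sum(values)
    rate = epsilon  # Laplace scale is 1/epsilon, so the exponential rate is epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


def sampled_count(values, fraction, rng):
    """Count over a Bernoulli sample, scaled up by the sampling fraction."""
    sampled = [v for v in values if rng.random() < fraction]
    return sum(sampled) / fraction


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical data: 1 means the user matched the query, 0 means not.
    data = [1] * 3000 + [0] * 7000
    print("exact count:   ", sum(data))
    print("noisy count:   ", noisy_count(data, epsilon=1.0, rng=rng))
    print("sampled count: ", sampled_count(data, fraction=0.1, rng=rng))
```

Both functions return an estimate rather than the exact count. The first touches every item and perturbs only the final aggregate, while the second touches only a subset of items, which is where the latency savings come from.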