Location K-Anonymity under a Streaming Model

New Image

In this paper we study how to achieve location anonymization under a streaming model. K-anonymity is a popular notion of privacy for which every recorded data point is identical to at least $k-1$ other data points. For location based data, k-anonymity refers to recording a coarse region that contains at least $k$ data points as the rough location of each of the data points in the region. One primary goal is to make sure the anoynimized location data is as accurate as possible, namely the size the recorded coarse region with $k+$ data points are as small as possible. Specifically, under our model, location data arrive online and are initially stored in a temporary memory a fixed size $m$. When or before this temporary memory becomes full, the data has to move to a permanent memory k-anonymized. Our problem formulation is motivated by applications such as location based advertising and disease tracking, in combination of certain legal requirement on privacy. We first show that in the {em competitive analysis framework} when an adversary specifies location data, any online algorithm can be arbitrarily bad in terms of the recorded region size. We therefore resort to assume the location distribution is known a priori. If the distribution is uniform, we consider a simple and natural algorithm {sc PickMax}, in which we prepartition the region into $m/k$ subregions uniform in size and pick the subregion with the largest occupancy whenever the buffer is full. We give a detailed analysis and show that the largest occupancy converges to $2k$. We therefore discover, perhaps somewhat unintuitively that it is sufficient to achieve k-anonymity if we partition the region into $2m/k$ subregions each of which is half the size. We also use simulation to compare the performance of {sc PickMax} and a variant {sc PickK} which records a subregion the moment it reaches occupancy $k$. Finally we apply simulation to a real wireless trace for which the distribution is not uniform. We partition the region in a quadtree form of nonuniform sizes and apply variants of {sc PickMax} and {sc PickK}.