Profiling user activities with minimal traffic traces
01 January 2015
Understanding user behavior is essential to personalizing and enriching a user's online experience. While personalized services based on fine-grained behavioral analysis offer significant benefits, care must be taken to address user privacy concerns. In this paper, we consider the use of web traces with truncated URLs, in which each URL is trimmed to contain only the web domain, for this purpose. While such truncation removes fine-grained sensitive information (e.g., search queries, purchased products, location), it also strips the data of many features (such as the file name suffix, which is a good indicator of file content) that are crucial to profiling user activity. We show how to overcome this severe handicap and, with high accuracy, separate the URLs that represent user activity from the rest of a noisy network traffic trace (advertisements, spam, analytics, web scripts). This enables the correct identification of user activities even with truncated URLs, which in turn enables network operators to provide personalized services while mitigating privacy concerns by storing and sharing only truncated traffic traces.
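
To make the truncation step concrete, the following is a minimal sketch of trimming each URL in a trace to its web domain. It is an illustration only, not the procedure used in the paper: the helper name `truncate_url`, the sample trace, and the choice to treat "web domain" as the URL's hostname are all assumptions.

```python
from urllib.parse import urlsplit


def truncate_url(url: str) -> str:
    """Return only the host part of a URL (hypothetical helper).

    Assumption: "web domain" is taken to mean the hostname; the paper
    may use a different definition (e.g., the registered domain).
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        # Bare entries such as "example.com/path" have no "//" marker,
        # so urlsplit finds no host; retry with the marker prepended.
        parts = urlsplit("//" + url)
    return parts.hostname or ""


# Made-up trace entries, for illustration only.
trace = [
    "https://www.example.com/search?q=flu+symptoms",
    "http://shop.example.org/cart/item123.html",
    "cdn.example.net/lib/tracker.js",
]

truncated = [truncate_url(u) for u in trace]
print(truncated)
```

After truncation, the search query in the first entry is gone, but so are the `.html` and `.js` suffixes that would otherwise help distinguish a page the user visited from a script loaded in the background. That loss of features is the handicap the abstract refers to.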