Management of graphs by estimating anonymization parameters using machine learning techniques (PhD Midterm Evaluation)
24 June 2013
Since the arrival of Web 2.0, the user has become the central actor on the Internet. No longer a passive consumer of published information, the user is now the focus of the Web, with their relationships and interactions at the forefront. New technologies have given each user the ability to produce and publish their own data. As a result, the amount of data available on the Internet has become huge, and the information sources a user has access to are very diverse. The data available for analysis is therefore more vulnerable than before. In parallel, the tools that analyze data extracted from the Internet are becoming more and more powerful. We address the problem of privacy for data that must, for various reasons, be made publicly available. We intend to propose a new approach to data anonymization based on machine learning techniques. Substantial research has already been carried out in the data anonymization field, as described in the state-of-the-art section of this document. Existing anonymization algorithms and tools depend on the settings of a number of parameters. Choosing those parameters is an essential step of the anonymization process, particularly when dealing with complex data and with multiple privacy risks or utility measures related to the anonymized data. The first contribution of my thesis is a new method that uses machine learning techniques to determine, for a given family of data, a given privacy attack, and a given utility evaluation, the best-suited parameters for a given anonymization strategy, so as to preserve the balance between utility loss and privacy risk. The second contribution deals with finding anonymization parameters for complex structures (such as graphs resulting from communication logs with a temporal component).
Most work on the anonymization of relational data has addressed data represented as a simple graph or a bipartite graph, in which the relations are in most cases single and undirected. Applying the parameter-learning method to the more complex structures considered here led to the proposal and implementation of new techniques for privacy risk detection and utility measurement specific to those kinds of structures.
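
To make the trade-off between utility loss and privacy risk concrete, the following is a minimal Python sketch and not the method developed in the thesis. Everything in it is an assumption chosen for illustration: the anonymization strategy (random edge deletion with probability p), the privacy attack (re-identification of nodes whose degree is unique), the utility measure (fraction of edges lost), the weight alpha, and all function names. A plain grid search over candidate values of p stands in for the machine learning step.

```python
import random
from collections import Counter


def degrees(nodes, edges):
    """Degree of every node in an undirected graph given as an edge list."""
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def privacy_risk(nodes, edges):
    """Toy attack (assumption): a node is at risk if its degree is unique."""
    deg = degrees(nodes, edges)
    freq = Counter(deg.values())
    return sum(1 for n in nodes if freq[deg[n]] == 1) / len(nodes)


def anonymize(edges, p, rng):
    """Toy strategy (assumption): drop each edge independently with probability p."""
    return [e for e in edges if rng.random() >= p]


def utility_loss(original_edges, anonymized_edges):
    """Toy utility measure (assumption): fraction of original edges lost."""
    return 1 - len(anonymized_edges) / len(original_edges)


def best_parameter(nodes, edges, candidates, alpha=0.5, trials=20, seed=0):
    """Grid search: pick the p minimizing alpha * risk + (1 - alpha) * loss,
    averaged over several random anonymizations."""
    rng = random.Random(seed)
    scores = {}
    for p in candidates:
        total = 0.0
        for _ in range(trials):
            anon = anonymize(edges, p, rng)
            total += (alpha * privacy_risk(nodes, anon)
                      + (1 - alpha) * utility_loss(edges, anon))
        scores[p] = total / trials
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Small hypothetical graph: a star centred on node 0 plus one extra edge.
    nodes = list(range(6))
    edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]
    p, scores = best_parameter(nodes, edges, [0.0, 0.25, 0.5, 0.75, 1.0])
    print("selected p:", p)
    print("scores:", scores)
```

In this sketch the parameter is selected for one graph at a time; a learning-based approach would instead aim to predict suitable parameters for a whole family of data without repeating the search for each new instance.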