If scientists are to study massive datasets such as mobile phone records, search queries and movie ratings, the owners of these datasets need to find a way to anonymize the data before releasing it.
The high-profile cracking of datasets such as the Netflix Prize dataset and the AOL search query logs means that people would be wise not to trust these kinds of releases until the anonymization problem has been solved.
The general approach to anonymization is to change the data in some significant but subtle way so that no individual can be identified from it. One way of doing this is to ensure that every record in the set is identical to at least one other record.
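This "at least one identical record" criterion is the core idea of k-anonymity. Here is a minimal Python sketch, assuming a hypothetical record layout of (age, ZIP code) pairs; the bucketing rules are purely illustrative, not taken from the papers:

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: bucket age into decades, truncate ZIP."""
    age, zipcode = record
    return (age // 10 * 10, zipcode[:3] + "**")

def is_k_anonymous(records, k=2):
    """True if every generalized record is shared by at least k people."""
    counts = Counter(generalize(r) for r in records)
    return all(c >= k for c in counts.values())

records = [(34, "94305"), (37, "94309"), (52, "94110"), (58, "94114")]
print(is_k_anonymous(records, k=2))  # True: each bucket holds two people
```

Coarsening continues until every record shares its generalized form with at least k - 1 others, so an attacker who knows someone's age and ZIP can narrow them down only to a group, never to one row.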
That’s sensible but not always easy, point out Rajeev Motwani and Shubha Nabar at Stanford University in Palo Alto. For example, a set of search queries can be huge, covering the search habits of millions of people over many months. The variety of searches people make over such a period makes it hard to imagine that two entries would be identical. And analyzing and changing such a huge dataset in a reasonable period of time is tricky too.
Motwani and Nabar make a number of suggestions. Why not break the dataset into smaller, more manageable clusters, they say, and why not widen the criteria for what counts as identical, so that similar searches can be replaced with a common term? For example, a search for “organic milk” could become a search for “dairy product”. These ideas seem eminently sensible.
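The generalization step can be sketched as a lookup into a term hierarchy. The `HYPERNYMS` map below is entirely hypothetical, standing in for whatever taxonomy a real system would draw on:

```python
# Hypothetical hypernym map; a real system would use a large taxonomy
# (e.g. a product or topic hierarchy) rather than a hand-written dict.
HYPERNYMS = {
    "organic milk": "dairy product",
    "skim milk": "dairy product",
    "cheddar cheese": "dairy product",
}

def generalize_query(query):
    """Replace a specific search with its broader category, if known."""
    return HYPERNYMS.get(query, query)

queries = ["organic milk", "skim milk", "flights to oslo"]
print([generalize_query(q) for q in queries])
# ['dairy product', 'dairy product', 'flights to oslo']
```

After this rewriting, two users with different but related searches produce identical entries, which makes the "at least one identical record" condition far easier to satisfy on a sprawling query log.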
The problem becomes even more difficult when the data is in graph form, as it might be for mobile phone records or web chat statistics. So Nabar suggests a similar anonymizing technique: ensure that every node in the graph shares some number of its neighbors with a certain number of other nodes.
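As a rough sketch of that neighbor-sharing condition (the threshold names `m` and `k` here are my own illustrative parameters, not the paper's notation):

```python
def neighbor_overlap_ok(adj, m=1, k=1):
    """True if every node shares >= m neighbors with >= k other nodes.

    adj maps each node to the set of its neighbors.
    """
    for v in adj:
        sharers = sum(
            1 for u in adj
            if u != v and len(adj[v] & adj[u]) >= m
        )
        if sharers < k:
            return False
    return True

# Tiny call graph: an edge means two phone numbers were in contact.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}
print(neighbor_overlap_ok(adj, m=2, k=1))  # True: a/b and c/d are twins
```

The intuition mirrors the tabular case: if every node's neighborhood looks like at least a few other nodes' neighborhoods, an attacker who knows someone's calling pattern cannot pick out a unique node in the released graph.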
The trouble is that the anonymization technique can destroy the very patterns that you are looking for in the data, for example in the way mobile phones are used. And at present, there’s no way of knowing what has been lost.
So what these guys need to do next is find some measure of the data loss their proposed changes cause, to give us a sense of how much damage is being done to a dataset during anonymization.
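One crude possibility, purely as illustration and not anything proposed in the papers, is to count the fraction of field values that anonymization altered:

```python
def distortion(original, anonymized):
    """Crude utility-loss measure: fraction of field values changed."""
    changed = total = 0
    for orig_rec, anon_rec in zip(original, anonymized):
        for orig_field, anon_field in zip(orig_rec, anon_rec):
            total += 1
            changed += orig_field != anon_field
    return changed / total

orig = [(34, "94305"), (37, "94309")]
anon = [(30, "943**"), (30, "943**")]
print(distortion(orig, anon))  # 1.0: every field was coarsened
```

A metric like this says how much was changed, but not which patterns were destroyed, which is exactly the harder question the researchers still face.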
In the meantime, dataset owners should show some caution over how, why and to whom they release their data.
arxiv.org/abs/0810.5582: Anonymizing Unstructured Data
arxiv.org/abs/0810.5578: Anonymizing Graphs