Over the holiday period, the physics arxiv blog is re-running the most popular blog posts (by page views) of 2007.
Breaking the Netflix prize dataset
Hell, this is good work. In October last year, Netflix released over 100 million movie ratings made by 500,000 subscribers to their online DVD rental service. The company then offered a prize of $1 million to anyone who could better the company’s system of DVD recommendation by 10 per cent or more.
Of course, Netflix assured everybody that the data had been anonymized by removing any personal details.
That turns out to have been a tad optimistic. Arvind Narayanan and Vitaly Shmatikov at the University of Texas at Austin have just de-anonymized it.
Here’s how: turns out that an individual’s set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it’s straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb).
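To make the idea concrete, here's a toy sketch of that kind of matching in Python. It is a simplified illustration, not the authors' actual algorithm: the function names, the tolerances (`max_stars`, `max_days`), the rarity weighting and the `min_gap` threshold are all assumptions made up for this example, as is the sample data. The gist: score every anonymized record against someone's public ratings, count rarely rated films for more, and only call it a match if the winner clearly beats the runner-up.

```python
import math
from collections import Counter
from datetime import date


def match_score(public, candidate, popularity, max_stars=1, max_days=14):
    """Sum rarity weights over films where rating and date roughly agree."""
    score = 0.0
    for movie, (stars, day) in public.items():
        if movie not in candidate:
            continue
        c_stars, c_day = candidate[movie]
        close_rating = abs(stars - c_stars) <= max_stars
        close_date = abs((day - c_day).days) <= max_days
        if close_rating and close_date:
            # Assumed weighting: films rated by few people identify you better.
            score += 1.0 / math.log(1 + popularity[movie])
    return score


def best_match(public, anonymized, min_gap=1.5):
    """Return the id of the record that clearly wins, or None if ambiguous."""
    # How many anonymized records rated each film.
    popularity = Counter(m for rec in anonymized.values() for m in rec)
    scores = sorted(
        ((match_score(public, rec, popularity), uid)
         for uid, rec in anonymized.items()),
        reverse=True,
    )
    if not scores or scores[0][0] == 0:
        return None
    # Require the best score to stand out from the second best.
    if len(scores) > 1 and scores[0][0] < min_gap * scores[1][0]:
        return None
    return scores[0][1]


# Made-up data: record id -> {film: (stars, date rated)}
anonymized = {
    "u1": {"Bent": (1, date(2005, 3, 1)), "Titanic": (4, date(2005, 3, 2))},
    "u2": {"Titanic": (4, date(2005, 3, 3)), "Shrek": (5, date(2005, 4, 1))},
    "u3": {"Titanic": (5, date(2005, 6, 1))},
}
public = {"Bent": (1, date(2005, 3, 3)), "Titanic": (5, date(2005, 3, 4))}

print(best_match(public, anonymized))
```

Note that a blockbuster alone gets you nowhere in this toy version: a public profile containing only a “Titanic” rating ties between two records and is rejected. It's the obscure film that pins the record down.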
That’s exactly what Narayanan and Shmatikov have done. And get this: once the match is made, it immediately links the user to any private ratings on the Netflix database.
“Given a user’s public IMDb ratings, which the user posted voluntarily to selectively reveal some of his (or her; but we’ll use the male pronoun without loss of generality) movie likes and dislikes, we discover all the ratings that he entered privately into the Netflix system, presumably expecting that they will remain private.”
So what, I hear ya ask.
Here’s what the dynamic duo have to say about one person whose data they outed:
“First, we can immediately find his political orientation based on his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11.” Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”. He did not like “Super Size Me” at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, “Bent” and “Queer as folk” were rated one star out of five. He is a cultish follower of “Mystery Science Theater 3000”. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.”
So Netflix may have inadvertently revealed the political affiliation, sexual orientation, BMI and God-knows-what else of 500,000 of their subscribers. Way to go!
Next up: the mobile phone datasets we talked about a coupla weeks back.
Ref: arxiv.org/abs/cs/0610105 : Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)