Breaking the Netflix Prize dataset

Hell, this is good work. In October last year, Netflix released over 100 million movie ratings made by 500,000 subscribers to their online DVD rental service. The company then offered a prize of $1 million to anyone who could better the company’s system of DVD recommendation by 10 per cent or more.

Of course, Netflix assured everybody that the data had been anonymized by removing any personal details.

That turns out to have been a tad optimistic. Arvind Narayanan and Vitaly Shmatikov at the University of Texas at Austin have just de-anonymized it.

Here’s how: turns out that an individual’s set of ratings and the dates on which they were made are pretty unique, particularly if the ratings involve films outside the most popular 100 movies. So it’s straightforward to find a match by comparing the anonymized data against publicly available ratings on the Internet Movie Database (IMDb).
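For the curious, here’s a rough idea of what that kind of matching could look like in code. This is a minimal sketch, not the authors’ actual algorithm: the function names (`similarity`, `best_match`), the tolerance values, the rarity weighting and all of the toy data are assumptions made up for illustration. The idea is to score each anonymized record against a public profile, give rarely rated films more weight, and accept the top candidate only if it clearly beats the runner-up.

```python
from datetime import date
from math import log


def similarity(public, candidate, popularity, max_days=14, max_stars=1):
    """Score how well a candidate record matches a public profile.

    Both records map movie title -> (stars, rating date). A movie counts as a
    hit when stars and dates roughly agree; rarer movies count for more.
    The tolerances and the 1/log weighting are illustrative assumptions.
    """
    score = 0.0
    for movie, (stars, when) in public.items():
        if movie not in candidate:
            continue
        c_stars, c_when = candidate[movie]
        close_stars = abs(stars - c_stars) <= max_stars
        close_dates = abs((when - c_when).days) <= max_days
        if close_stars and close_dates:
            # popularity = number of subscribers who rated the movie
            score += 1.0 / log(1 + popularity.get(movie, 1))
    return score


def best_match(public, anonymized, popularity, min_gap=1.5):
    """Return the id of the best-matching record, or None if ambiguous."""
    scored = sorted(
        ((similarity(public, rec, popularity), rid) for rid, rec in anonymized.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scored or scored[0][0] == 0:
        return None
    # Require the winner to stand out from the runner-up (hypothetical rule).
    if len(scored) > 1 and scored[0][0] < min_gap * scored[1][0]:
        return None
    return scored[0][1]


# Toy data, entirely made up.
popularity = {"Blockbuster": 50000, "Obscure Documentary": 12, "Cult Classic": 40}
anonymized = {
    101: {
        "Blockbuster": (5, date(2005, 3, 1)),
        "Obscure Documentary": (1, date(2005, 3, 4)),
        "Cult Classic": (5, date(2005, 4, 10)),
        "Private Pick": (2, date(2005, 5, 1)),
    },
    102: {"Blockbuster": (5, date(2005, 3, 2))},
}
public_profile = {
    "Blockbuster": (5, date(2005, 3, 3)),
    "Obscure Documentary": (1, date(2005, 3, 5)),
    "Cult Classic": (4, date(2005, 4, 12)),
}

match = best_match(public_profile, anonymized, popularity)
# Anything in the matched record that was never posted publicly is now exposed.
private = {}
if match is not None:
    private = {m: r for m, r in anonymized[match].items() if m not in public_profile}
```

In this toy run the two obscure titles do nearly all the work: a profile containing only the blockbuster matches both records equally well and gets rejected as ambiguous.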

That’s exactly what Narayanan and Shmatikov have done. And get this: once the match is made, it immediately links the user to any private ratings on the Netflix database.

“Given a user’s public IMDb ratings, which the user posted voluntarily to selectively reveal some of his (or her; but we’ll use the male pronoun without loss of generality) movie likes and dislikes, we discover all the ratings that he entered privately into the Netflix system, presumably expecting that they will remain private.”

So what, I hear ya ask.

Here’s what the dynamic duo have to say about one person whose data they outed:

“First, we can immediately find his political orientation based on his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11.” Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”. He did not like “Super Size Me” at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, “Bent” and “Queer as folk” were rated one star out of five. He is a cultish follower of “Mystery Science Theater 3000”. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.”

So Netflix may have inadvertently revealed the political affiliation, sexual orientation, BMI and God-knows-what else of 500,000 of their subscribers. Way to go!

Next up: the mobile phone datasets we talked about a coupla weeks back.

Ref: arxiv.org/abs/cs/0610105 : Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)

35 Responses to “Breaking the Netflix Prize dataset”

  1. [...] Check it out! While looking through the blogosphere we stumbled on an interesting post today. Here’s a quick excerpt: In October last year, Netflix released over 100 million movie ratings made by 500000 subscribers to their online DVD rental service. The company then offered a prize of $1million to anyone who could better the company’s system of DVD … [...]

  2. [...] Oh yes, I know that’s not in your Facebook profile, you don’t think that’s the only way to find out information about you… do you? The Internet’s an elephant, it remembers [...]

  3. S. says:

    Please don’t start promoting that recriminatory attitude towards Netflix. Enough damage was done with the AOL dataset. An excellent research team, with an open vision (open enough to release an important dataset), was simply fired. Research on query logs suffered a strong cut.

    Bad things only happen to people who do things.
    To be safe do nothing.

  4. Yehuda says:

    We should all realize we aren’t dealing with a cryptographic system but just with movie ratings. So even *if*, under some very speculative circumstances, that leads to revealing the ratings of one user or two, the damage is minimal, if any.
    However, the very substantial damage is if this overcautious attitude prevents other companies from releasing valuable datasets to the research community in the future, thus slowing innovation and scientific progress.
    I doubt the motivation of that research, I think that they over-emphasize problems that don’t really matter, and I’m mostly concerned with how this research will be perceived.

    It is always easier to be over-cautious, but the praise should go to those who dare…

  5. [...] Prize Data Blogged in Virginia Tech, Academia, Digital Privacy, Data Mining on 2007-11-27 Here’s a nice blog entry and here’s the actual [...]

  6. Stuart says:

    The part you’re glossing over here is that they can only find the intersection of people who have accounts on both Netflix and IMDB.

    I’d be willing to bet that if a Venn diagram of the userbases were made, you’d have very little overlap.

    Way to Go on your fear mongering and sensationalism!

  7. none says:

    horribly explained.

  8. John Q. Public says:

    So, you managed to get a match between information in a list of entries in one database and get a fuzzy match to people in another database, but only where people had marked their comments as public in the second database?

    And you’re claiming that the first database didn’t strip the anonymity out?

    Are you guys really that nuts?

    http://www.urbandictionary.com/define.php?term=troll

  9. Mike Green says:

    This was a waste of time to read.

  10. DR says:

    There are too many IFs in that research approach:
    ** IF a person has accounts and rated movies on both Netflix and IMDB (how much overlap can there be?)
    ** AND IF IMDB releases the identity of its members (which it does not)

    Also – suppose you did somehow get the identity of IMDB users – why do you need to correlate that to Netflix? – you can get all your privacy breach satisfaction from the IMDB data that you already have! If you can deduce the Netflix ratings from IMDB ratings – then you already have the same information from IMDB as you would have from Netflix.
    This is stupid, just like saying: I can easily calculate how much is 2+2 by adding 4-2+4-2…

  11. Kit says:

    The results are not direct and no clear conclusions can be made on just a rating. As an example, John Doe may be an atheist but he may like the movie “Last Temptation of Christ” as a movie and may have given a high rating.

  12. Shii says:

    Congratulations, you matched someone who took out movies to someone who you already know rated those movies.

  13. [...] reports that a pair of computer scientists have figured out how to de-anonymize the "anonymous" data set that Netflix released as part of its million-dollar contest to improve its recommendation [...]

  14. [...] that would improve their movie recommendation system — a worthy goal. However, this week researchers announced that they successfully re-identified the data using publicly available [...]

  15. Ali says:

    I read this paper a few weeks ago and was rather unsurprised by the conclusion. While a valid research topic for UT CS, this is hardly newsworthy. I’m disappointed it made Slash and even more disappointed it’s being compared to the AOL blunder. I for one applaud Netflix for releasing the ratings data.

  16. [...] No Such Thing As An Anonymized Dataset Slashdot reports that a pair of computer scientists have figured out how to de-anonymize the “anonymous” data set that Netflix released as part of its million-dollar contest to improve its recommendation [...]

  17. This story shows that seemingly harmless anonymized commercial information can be easily re-identified to build political, sexual, and even psychological profiles of Netflix users.

    What if your future employer uses data from Netflix and others to create not just a voting and sexual profile, but a profile of your risk for expensive diseases?

    Today Americans have no control over ANY electronic prescription, genetic, or health records. Employers, insurers, banks, and schools can all data mine our health records without consent.

    Tell Congress to restore our right of control over personal health information and to start by ending prescription data mining. Sign our petition now at: http://www.patientprivacyrights.org/site/PageServer?pagename=Prescription_Privacy_Video

    Only Congress can restore our rights to control access to personal health records, or grant us the right to control our financial and commercial information—including control over access to our Netflix movie ratings.

    Why should Netflix be able to reveal our movie ratings data for any reason without consent?

  18. Bond says:

    Agree with the various angles in the “Responses” section. Just to add to it, the person whose information is extracted hopefully doesn’t live as a hermit, in which case people around him already know much more about him than the fuzzy comparisons can conclude. That said, marketeers salivate over this kind of data, but they would also have to jump through some hoops at least. As for the common man who doesn’t know the person whose information is extracted, he probably doesn’t care anyway.

  19. [...] research community. Along with last year’s AOL data release debacle, Soghoian points to a more recent case where researchers were able to de-anonymize a data set released by Netflix, comprising of 100 [...]

  20. [...] is some interesting discussion about the research at the physics arXiv blog and [...]

  21. [...] researchers at the University of Texas have de-anonymised (re-nymised? nymified?) the NetFlix Prize [...]

  22. Mr. Anonymized says:

    Isn’t the main intent of this data to create a profile, with the intent of giving the user more of what he seems to enjoy? The contest is to make this even better. So I say by getting rid of anonymity and correlating with even bigger datasets these two scientists have gone in the right direction to win the prize.
    As Yehuda already noted, it is important to take into context the type of data made public. I also believe it is not possible to deduce a personality from the ratings of films someone has seen. Sounds as absurd as the Rorschach inkblot tests. Maybe a culture change is needed. Or a law which makes conclusions based on indirect personal data invalid.

  23. [...] Another hot topic out there seems to be finding new ways to learn from and leverage user traffic and search data. Through Seb Chan’s post on the Powerhouse museum blog I discovered the New York Times’ awesome (i.e. incredibly useful) blog Open. Yesterday on this blog the NYT announced a new feature using their search data to cluster queries. It’s called Also Try. Of course, Seb is interested in this because of his own work with metrics on the Museum front. And then there was the study by some folks at UTAustin to de-anonymize a sub-set of Netflix data. Yikes! This is scary stuff. This study was referenced by quite a few posts that I found through my wordpress technology tag feed such as “Breaking the Netflix Prize Dataset”. [...]

  24. Netflickr says:

    Sounds like an interesting project. Thanks for the info. Anyone who can predict renting/buying trends from this limited data should go into market research or stock picking. No harm done in this web 2.0 world. For those who are pissed at Netflix try going to http://www.intelius.com for a real breach in privacy!

  25. Good site. Thank you!!!

  26. an actual ML researcher says:

    I believe this story is a tad bit irresponsibly reported. While the concept that people’s personal data can be linked in previously unforeseen ways is a valid one, there is no conclusive evidence here that anyone’s privacy is being violated. The release of the Netflix data, and the contest, has been tremendously valuable to the machine learning community. I think the real violation here is to claim that a Netflix user and an IMDB user are one and the same because you found an intersection between their ratings. If you carefully studied the Netflix prize you would know that Netflix did actually add random noise to its ratings (increased and decreased some random ratings). Thus some correlations that you would find could easily be completely false. Now, I would wonder why, if the authors did find many links between IMDB users and Netflix users, the authors are not at the top of the leaderboard for the prize. Presumably they would have more rating data than other competitors – if this technique actually found a substantial amount of extra information about users. Seems like a lot of hocus pocus here at the expense of an incredibly well organized and executed research experiment.

  27. Good site. Thank you!

  28. Useful site. Thank you!!

  29. Very good site. Thank you.

  30. Useful site. Thank you!!

  31. jimmyjot says:

    It’s been suggested by others, but my biggest concern with the article (and, it seems, with the paper as well) is that it presumes to make opinions about people’s beliefs based on their viewing habits and ratings. They may very well be evaluating movies on their merits, rather than on how much they agree with their content.

    Choice of movies also does not tell a whole lot. I would classify myself as a bleeding heart liberal, yet I would like to read stuff by Ann Coulter or Rush Limbaugh, if only to learn how they make their arguments and hopefully shred them.

    This kind of conclusion-leaping is what led to McCarthyism in the 1950s. Let’s not let it happen again.

  32. jadedindece says:

    I try to get advice from someone about what to do.
    http://www.google.com

  33. [...] An investigation will discover what we already know: NebuAD’s cookie-based opt-out process doesn’t work: it stops targeted ad delivery, but it doesn’t actually opt you out of having your information gathered and sold. But it would be useful to see independent confirmation of NebuAD’s claims that personal identifiers are "anonymized" (such claims are made quite often but frequently aren’t true). [...]

  34. [...] deep profiles that they can be traced back to individuals, as researchers armed with the AOL and Netflix data releases [...]