Can data overload protect our privacy?

If you were chatting on MSN Messenger in June 2006, your conversation was being recorded and the details (but not the content) passed to Eric Horvitz and Jure Leskovec at Microsoft Research in Redmond, Washington. Using this data, these scientists have created “the largest social network constructed and analyzed to date”.

They’ve now published their results, which show the habits of people who use Messenger and the scale on which they use it. But this study is noteworthy for another reason: it gives a curious insight into the limitations of this kind of analysis. The Microsoft team says it had too much data and this affected its ability to crunch it effectively.

Here’s what they did. The researchers used data such as IP address and log in and out times as well as self-reported information such as age, sex, and zip code (which are obviously highly accurate) to carry out their analysis.

The bald details are that 30 billion IM conversations took place between 180 million people all over the world in June 2006.

The researchers found that people tend to chat to individuals who share the same language, age group and geographical location (in other words, to people like themselves). They also chat more often and for longer with members of the opposite sex.

Each account had on average 50 buddies and, in the IM world, people are separated by “7 degrees of separation”.
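For anyone wondering what a “degrees of separation” figure actually measures: it is the length of the shortest chain of buddies linking two accounts. Here is a minimal sketch of that calculation using a breadth-first search. The toy graph, the account names and the function name are all made up for illustration; this is not the researchers’ code, and a 180-million-node network would need something far more heavy-duty.

```python
from collections import deque

def degrees_of_separation(graph, start, goal):
    """Return the number of hops on the shortest path from start to goal,
    or None if the two accounts are not connected.

    `graph` is a hypothetical adjacency mapping: account -> list of buddies.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for buddy in graph.get(node, ()):
            if buddy == goal:
                return dist + 1
            if buddy not in seen:
                seen.add(buddy)
                queue.append((buddy, dist + 1))
    return None  # no chain of buddies links the two accounts

# Toy buddy lists (invented for this example).
toy_graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol"],
    "erin": [],
}

print(degrees_of_separation(toy_graph, "alice", "dave"))
print(degrees_of_separation(toy_graph, "alice", "erin"))
```

Averaging this number over many pairs of accounts gives a figure of the kind quoted above.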

That’s about the strength of it and I’m underwhelmed. No fascinating insights into the correlation between chatting spikes and news broadcasts/ad breaks/episodes of Friends; or the patterns of chat in the workplace versus home using IP location changes; or how IM users travel the world. Just straightforward count ’em ‘n’ weep numbers.

But there’s a good reason for the lack of more detailed insight. The problem, say Horvitz and Leskovec, is the size of the database: 4.5 terabytes, which took 12 hours to copy to a dedicated eight-processor server. “The sheer size of the data limits the kinds of analyses one can perform,” they say.

So will data overload always protect us from Big Brother’s prying eyes? Perhaps in some circumstances like these, but otherwise I wouldn’t count on it. It’s straightforward to sample big datasets like this (although that can introduce problems of its own).
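To show how little effort sampling takes, here is a rough sketch of reservoir sampling, a standard way to draw a fixed-size uniform random sample from a stream too big to hold in memory. The function name and the stand-in stream of record numbers are my own inventions for illustration; nothing here comes from the paper.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw k items uniformly at random from an iterable of unknown length,
    reading it once and keeping only k items in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a randomly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Stand-in for a huge log: a million hypothetical record numbers.
records = range(1_000_000)
subset = reservoir_sample(records, 1000, seed=42)
print(len(subset))
```

As for the problems of its own: picking conversations uniformly at random thins out everybody’s buddy list, so numbers such as buddies per account measured on the sample will not match the full network.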

I wouldn’t mind betting that with a little more effort, it would be possible to identify individuals from their travel and chatting patterns, perhaps by correlating the data with local telephone and business directories, in much the same way as has been done with search data. However, it looks as if Horvitz and Leskovec have steered carefully around this issue.

Of course, Microsoft doesn’t need to do this, since it can store a much fuller set of data anyway, including the full text of the conversations and whatever data it has on the identity of the account owners.

And you can be sure that more shadowy organisations with access to much greater computing resources will also have this full data set and be happily chewing through it as you read this.

Ref: arxiv.org/abs/0803.0939: Planetary-Scale Views on an Instant-Messaging Network