The AOL data mess

by Eszter Hargittai on August 7, 2006

Not surprisingly this is the kind of topic that spreads like wildfire across blogland.
[Image: AOL search data snippet]

AOL Research released (link to Google cache page) the search queries of hundreds of thousands of its users over a three-month period. While user IDs are not included in the data set, all the search terms have been left untouched. Needless to say, lots of searches could include all sorts of private information that could identify a user.

The problems in the realm of privacy are obvious and have been discussed by many others so I won’t bother with that part. (See the blog posts linked above.) By not focusing on that aspect I do not mean to diminish its importance. I think it’s very grave. But many others are talking about it so I’ll focus on another aspect of this fiasco.

As someone who has research interests in this area and has been trying to get search companies to release some data for purely academic purposes, I find an incident like this extremely unfortunate. Not that search companies have been particularly cooperative so far (not surprisingly, given this case), but chances for future cooperation in this realm have just taken a nosedive.

To some extent I understand. No company wants to end up with this kind of a mess on their hands. And it would take way too much work on their part to remove all identifying information from a data set of this sort. I still wonder if there are possible work-arounds though, such as allowing access on the premises or some such solution. But again, that’s a lot of trouble, and why would they want to bother? Researchers like me would like to think we can bring something new to the table, but that may not be worth the risk.

Note, however, that dealing with sensitive data is nothing new in academic research. People are given access to very detailed Census data, for example, and confidentiality is preserved. From what I can tell the problem here did not stem from researchers; it was someone at AOL who was careless with the information. But the outcome will likely be less access to data for all sorts of researchers.

Another question of interest: Now that these data have been made public what are the chances for approval from a university’s institutional review board for work on this data set? (Alex raises related questions as well.) Would an approval be granted? These users did not consent to their data being used for such purposes. But the data have been made public and theoretically do not contain any identifying information. Even if they do, the researcher could promise that results would only be reported in the aggregate leaving out any potentially identifying information. Hmm…

For sure, this will be a great example in class when I teach about the privacy implications of online behavior.

Not surprisingly, people are already crunching the data set; here are some tidbits from it.

A propos the little snippet I grabbed from the data (see image above), see this paper of mine for an exploration of spelling mistakes made while using search engines and browsing the Web. About a third of that sample was AOL users.

The image above is from data in the xxx-01.txt file.




joel turnipseed 08.08.06 at 1:40 am

Fascinating… as a writer, I’d love to know what 500,000 people were thinking over a three month period–a sort of “mind of God” snapshot. Still, at 500MB for the zipped file–that’s a HUGE file. I don’t think the vast majority of home PC users could even open the unzipped file, could they (it must be several gig)?

Several people have noted the “kill my wife” search, but you wonder: could this have been a wannabe crime novelist? someone whose wife was already murdered by someone else? or an actual would-be murderer? I’m reminded of the Pete Townshend case (interesting that he was acquitted, no? would a non-famous, say, elementary school teacher have been?). There are just so many things to consider w/r/t these things even after they’re out of the bag.

Meantime, you do wonder, at some point, whether any notion of “privacy” should go out the door altogether? Or at least certain forms of it? Wouldn’t airing out what we’re all REALLY like, en masse, do some social good? I’m thinking in particular of the bullshit campaign launched (and ongoing) by the Swift Boat Veterans for Truth, even in the face of obvious, and widespread, war crimes in Vietnam–and even where there were no war crimes, damned obvious horrors of war. Not only was Kerry defeated in no small part from this obvious lack of widespread insight into human nature, but a lot of the bullshit in Iraq would’ve been obvious to anyone with the slightest self-awareness and a little honest human experience. That is, if you’re really open to what humans are like (Goethe on the “crime imagined” is good here), you just shouldn’t be surprised at rapes, mutilations, mass murders, beheadings, etcetera: they’re part of who we are as a species & as such, unavoidable–especially in times/places of war. Is it better that we know that (if not, per AOL’s scrambling of user names, which) people search/think certain things with whatever degree of frequency?

I know, I know… there’s the Scylla of the Scarlet Letter that inheres in a real global village… and the Charybdis of Big Brother which is actually the Big Evil Twins of States and Corporations. But don’t we all already have a kind of social punchcard consisting of our education levels (and grades), credit reports, police records, property records, etcetera? If the job/house/school/life you can lead is already nearly inexorably determined for you in the private confines of admissions committees, HR departments, and banks… what’s a dirty thought or two aired to the public?


Seth Finkelstein 08.08.06 at 6:38 am

Yes indeed – this issue was something I wondered about, during the huge Google subpoena panic. I’m bad at politics, but I thought there was a lot of fear-mongering and political expediency going on, sensationalism about the goals of the specific request. It was reported as if it was a scary Big Brother fishing expedition, but was in fact basically an extremely tame academic-type study. Though the general tracking issue is quite real.

I did the system administration for one of the early studies of user behaviour using log files, back in 1999. I don’t know if such a study could be done, from the viewpoint of the attacks it would get for using the data.

I suppose the only advice I can give on that point is to say think carefully about the career implications, though I believe going ahead with such studies would help establish they can produce useful results (though this is not advice to anyone to hurt themselves, pioneers are the ones with arrows in their backs). Think: “So that the data have not been released in vain …”


Maria 08.08.06 at 7:57 am

exactly, seth. this is what john battelle has in mind vis a vis the ‘database of intentions’, and is precisely the data that google tried to keep private earlier in the year. also agree with you on the expediency and pr efforts surrounding google’s response to said request.

brief point from the data protection point of view – this does not implicate European (and most other) DP laws as it is not personally identifiable data. It’s interesting, nonetheless.


Eszter 08.08.06 at 8:26 am

Maria, the data are not personally identifiable, because the user IDs have been removed? Based on someone’s searches, in some cases a good guess could be made about the user. But it’s true that there could be no guarantee it’s one particular person. After all, someone could be running lots of searches on his/her own name, address, interests, etc, or someone else could be trying to find out information about said person.

My recollection is that the other companies (Yahoo!, MSN, AOL) that had been approached by the US government also weren’t going to hand over any data with identifying information. (If I wasn’t rushing out the door I’d look up some references, sorry.)


engels 08.08.06 at 8:39 am

“After all, someone could be running lots of searches on his/her own name, address, interests, etc, or someone else could be trying to find out information about said person.”

Well yes but from a small number of searches you could triangulate. Only one person has connections with exactly that set of five people or institutions, etc…


Maria 08.08.06 at 9:30 am

Yep, the data do not include name and physical address that are generally held to be personally identifiable data. (it’s a legal term under the EU data protection directive, meaning information that pertains to an individual.)

Search data is arguably ‘traffic data’, i.e. data about network activities that isn’t itself directly tied to an identified (n.b. not identifiable) individual, even though it may be possible to infer connections. E.g. it’s the difference between an IP number that may or may not be shared by you and other customers of your ISP and your name and postal address which are uniquely yours.

Even in the EU, traffic data are afforded a much lower level of protection (even when they are tied directly to personally identifiable data such as a phone number and billing data) than are personal data.

Though there have been some recent developments on the treatment of traffic data in the EU which I really must get around to writing about…


Eszter 08.08.06 at 10:16 am

Interesting and helpful, Maria, thanks!

BTW, I agree with the earlier comments about how people shouldn’t jump to conclusions based on some searches. In some cases we really don’t have enough information to make much sense of just a few queries without much context.


abb1 08.08.06 at 11:13 am

How is the search “how to kill my wife” (my wife?) indicative of an intent to commit a murder? That’s just foolishness, absurdity, ain’t it?


Matt Weiner 08.08.06 at 11:57 am

Someone was recently convicted of murdering his ex-wife in part because he searched for “how to kill someone and not get caught.” Though there was other evidence.

via the comment from “Abbie Normal”


agm 08.08.06 at 12:14 pm

Funny, Seth, cause on my end of the Net the google subpoena brouhaha was deemed to be exactly that, a fishing expedition with no justifiable reason to be done. It just ain’t all that hard to figure out who is who given a sufficiently connected set of searches and link trail and such.


Alex R 08.09.06 at 4:42 am

And here’s the followup… The NY Times reporters track down one of the searchers based on her searches, and interview her. Will the EU still consider this kind of data not to be “personally identifiable” given this kind of example? It’s one thing to look at the data and say “Gee, you could probably identify some of these searchers” and another thing to actually do it…


head 08.12.06 at 6:53 pm

Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..
