Almost every semester, I use the AOL Breach data as a point of departure for something in at least one of my classes. The data is fascinating. Most data is fascinating, but this data is particularly so: at once shocking, funny, creepy, poignant, sad, frightening, noble, ignoble, shrewd, and lewd. It’s also rich in the way data can be rich. Its completeness makes it ripe for discovering all sorts of patterns of human thought and behavior: for a sample of several thousand AOL accounts, it includes the complete search history during March, April, and May of 2006, with timestamped search strings and the result rank and destination of each click-through.
It’s AOL data week in one of my classes now. This morning, I proposed several nontrivial questions about the data that could be answered with SQL queries. We looked at the results and discussed what they might say about the unwitting study subjects. Then I asked my students to suggest some questions of their own. What are the typical time-of-day and day-of-week patterns of an individual AOL customer’s searches? Are there identifiable differences in the patterns (and by extension in the sleep, social, and perhaps employment or school behavior) of people whose searches included, say, “britney”? For what kinds of searches do users most often click through several pages of results? And so on.
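The first of those questions, the time-of-day and day-of-week pattern of one user’s searches, could be answered with something like the sketch below. It is only an illustration under stated assumptions: the table name `searches`, the function name `activity_profile`, and the handful of rows are made up, and the timestamps are assumed to be stored as `YYYY-MM-DD HH:MM:SS` text so that SQLite’s `strftime` can read them.

```python
import sqlite3

# Hypothetical miniature of the search log: table name and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE searches (AnonID INTEGER, Query TEXT, QueryTime TEXT)")
conn.executemany(
    "INSERT INTO searches VALUES (?, ?, ?)",
    [
        (1, "britney", "2006-03-01 23:10:00"),         # a Wednesday
        (1, "britney spears", "2006-03-01 23:40:00"),
        (1, "weather", "2006-03-04 09:05:00"),         # a Saturday
        (2, "how to tie a tie", "2006-03-04 09:30:00"),
    ],
)

def activity_profile(conn, anon_id):
    """Count one user's searches per (weekday, hour); weekday 0 is Sunday."""
    sql = """
        SELECT CAST(strftime('%w', QueryTime) AS INTEGER) AS weekday,
               CAST(strftime('%H', QueryTime) AS INTEGER) AS hour,
               COUNT(*) AS searches
        FROM searches
        WHERE AnonID = ?
        GROUP BY weekday, hour
        ORDER BY weekday, hour
    """
    return conn.execute(sql, (anon_id,)).fetchall()

profile = activity_profile(conn, 1)
for weekday, hour, searches in profile:
    print(weekday, hour, searches)
```

Comparing such profiles between users who did and did not search for a given term would be one way to approach the second question.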
One of my students suggested an excellent simple question. What are the most common searches of the form “how to …”? Out of millions of queries in the AOL data, there were many thousands of “how to …” searches. The most frequent was “how to tie a tie,” requested 92 times by a total of 47 distinct users. The rest of the top ten (ranked by the number of distinct users asking the question) were how to write a resume, gain weight, have sex, get pregnant, write a book, write a bibliography, start a business, lose weight, and make money, each sought by a dozen or more different people. AOL converted the queries to lower case and removed much of the punctuation, but they didn’t correct spelling. Consequently, “how to masterbate” and “how to masturbate” appear separately, at ranks 49 and 51 respectively. Had the two spellings been counted together, the question would have nearly made the top ten.
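A query along these lines can be sketched with Python’s built-in `sqlite3` module. This is a minimal sketch, not the query we ran in class: the column names follow the released AOL files (AnonID, Query, QueryTime, ItemRank, ClickURL), but the table name `searches`, the function name `top_how_to`, and the sample rows are hypothetical. It also assumes that every row counts as one request, which may overcount if the log stores a separate row for each click.

```python
import sqlite3

# Hypothetical table and rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE searches (AnonID INTEGER, Query TEXT, QueryTime TEXT, "
    "ItemRank INTEGER, ClickURL TEXT)"
)
rows = [
    (1, "how to tie a tie", "2006-03-01 08:15:00", 1, "http://example.com/a"),
    (1, "how to tie a tie", "2006-03-01 08:16:00", 2, "http://example.com/b"),
    (2, "how to tie a tie", "2006-04-10 19:02:00", None, None),
    (2, "how to write a resume", "2006-04-11 21:30:00", 1, "http://example.com/c"),
    (3, "britney", "2006-05-02 23:45:00", None, None),
]
conn.executemany("INSERT INTO searches VALUES (?, ?, ?, ?, ?)", rows)

def top_how_to(conn, limit=10):
    """Rank 'how to ...' queries by distinct users, then by total requests."""
    sql = """
        SELECT Query,
               COUNT(DISTINCT AnonID) AS users,
               COUNT(*) AS requests
        FROM searches
        WHERE Query LIKE 'how to %'
        GROUP BY Query
        ORDER BY users DESC, requests DESC, Query
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()

result = top_how_to(conn)
for query, users, requests in result:
    print(query, users, requests)
```

Because the grouping is on the exact query string, misspelled variants land in separate groups, which is the effect described above.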
Here’s a PDF file of the top 1000 “how to” queries submitted through AOL Explorer by a sample of AOL users in the spring of 2006. You can probably guess that it’s not safe for work. Although there are no pictures, plenty of sex, drugs, and gambling are spelled out, and there are more than a few questions likely to offend in one way or another. Have a look.
July 2nd, 2010 at 5:29 pm
In case anyone wonders about the red dots under “America”: it’s Microsoft Word 2003 trying to be helpful with a Smart Tag: