10 Mar 2011 0:16
I spent a good chunk of the last 24 hours at one of my favorite hangouts, Language Log. My reason for lingering was to pore over (in good company) some interesting graphs Mark Liberman had put up about the ever-controversial adverb “literally.” [Link: Two Breakfast Experiments™: Literally]
The graphs purported to show, inter alia, “a remarkably lawful relationship between the frequency of a verb and the probability of its being modified by literally, as revealed by counts from the 410-million-word COCA corpus.” [Aside: Visit COCA some time. It’s beautiful, it’s open 24/7, and admission is free.]
Sadly (for the researcher whose graphs Mark posted), there was no linguistic revelation; happily (for me and other mathophiles) the graphs highlighted a very interesting statistical artifact. Good stuff was learned.
Instead of rehashing what you can find in the comment thread at Language Log, what I’ll do here is give a non-linguistic example of this statistical artifact. First, a very few general remarks about statistics.
Much of statistics is about making observations or drawing inferences from selected data. In a nutshell, statistical analysis often goes like this: look at some data (such as the COCA corpus), find something interesting (such as an inverse relationship between two measurements), and draw a conclusion (in this case, a general inference about American English, of which COCA is one of the largest samples available in usable form).
Easy as a, b, c. One, two, three. Do, re, mi.
Sometimes. The mathematical underpinnings of statistics often make it possible, given certain assumptions, to draw inferences from selected data with a measurable degree of confidence. Unfortunately, it’s easy to focus so hard on measuring the confidence (Yay, p < 0.05! I might get tenure!) that you forget the assumptions, or get careless about how you state an inference or calculation.
When bad statistics happens, there’s often a scary headline; I can’t think up a good one at the moment, so I’ll go straight to the (artifactual) graph.
This graph shows that for not-too-small cities, there’s a modest negative relationship between city size and homicide rate: on average, smaller cities tend to have higher homicide rates.
But the truth is that among not-too-small cities, smaller cities don’t tend to have higher homicide rates than larger ones. Here’s a better graph:
This graph shows almost no relationship between city size and homicide rate.
What’s going on, and what’s wrong with the relationship that shows up (and is real) in the first graph? The titles hold a clue (but don’t count on such clear titles when you see or read about graphs in the news). The first graph only shows cities that had at least 10 homicides in 2009. For that scatterplot, cities were selected for analysis according to a criterion related to the variable under investigation, homicide rate. That’s a no-no.
The 10-homicide cutoff biased the selection of cities used in the analysis. Most very large cities show up simply because they’re large enough to have 10 or more homicides, but the smallest cities that appear are only there because their homicide rates were high enough, despite their relatively small populations, to reach the 10-homicide threshold. For the first graph, I (pretend-)unwittingly chose all the large cities together with only some smaller cities, specifically smaller cities with unusually high homicide rates for their size. Then I “discovered” that smaller cities had higher homicide rates.
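If you’d like to see the artifact conjured out of thin air, here’s a minimal simulation (a sketch of my own, not the data behind the graphs above; the 5-per-100,000 rate, the population range, and all the variable names are made up for illustration): give every city the same underlying homicide rate, draw the counts at random, and let the 10-homicide cutoff do the rest.

```python
# Illustrative sketch: every city has the SAME underlying homicide rate,
# so any relationship between size and observed rate is a selection artifact.
import numpy as np

rng = np.random.default_rng(2011)

n_cities = 2000
# Populations spread over a few orders of magnitude (roughly 20k to 2 million).
population = np.exp(rng.uniform(np.log(20_000), np.log(2_000_000), n_cities))

# Identical underlying rate everywhere: 5 homicides per 100,000 people.
true_rate = 5.0 / 100_000
homicides = rng.poisson(true_rate * population)
observed_rate = homicides / population * 100_000  # per 100,000

def corr(pop, rate):
    """Pearson correlation of log(population) with observed homicide rate."""
    return np.corrcoef(np.log(pop), rate)[0, 1]

# All cities: essentially no relationship between size and rate.
print("all cities:         r =", round(corr(population, observed_rate), 3))

# Only cities with at least 10 homicides: a clear negative relationship appears,
# even though no city is truly more dangerous than any other.
keep = homicides >= 10
print("cities with >= 10:  r =", round(corr(population[keep], observed_rate[keep]), 3))
```

On a typical run, the correlation is essentially zero for the full set of cities and distinctly negative once only the cities with at least 10 homicides are kept: small cities can only clear the cutoff by having an unusually high observed rate, so the cutoff, not the cities, manufactures the pattern.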
Oops. It’s an easy mistake to make, and it wouldn’t surprise me if it happens often. I can easily imagine medical studies that compare the rates of some disease among cities and exclude any city that has “too few” cases of the disease to analyze.
Statistics is a powerful tool. Follow the instructions.
March 10th, 2011 at 6:53 am
Your statistical analysis is, as far as I can tell, spot on, but I think you’re being a little harsh on the original Language Log post. While you’re right that it’s important to realise that the data *does not* show a blanket negative correlation between frequency of use of a word and its frequency of combination with “literally”, it *does* show such a correlation for the specific subset of words analysed (words which are frequently combined with literally).
Similarly, the correlation shown in your first graph of murder rate vs city size is actually a perfectly legitimate one, as long as you’re clear about what you’re actually looking at. If for some reason I was forced to live in a city which had at least ten murders per year, then I would absolutely want that city to be as large as possible, because otherwise I’d find myself living in a small town with a disproportionately large murder rate.
To put it another way, I think there *are* legitimate reasons to be interested in studying the specific subset of words that are frequently combined with a particular modifier, it’s just very important not to overgeneralise from the specific case.