Since the dawn of time, man has wondered: what are p-values?
Fast-forwarding to the present day: I’m touching on the so-called replication crisis in psychology in my intro philosophy class. Specifically, I want to bounce off something Andrew Gelman wrote:
Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B.

Ergo, I could do with an intuitive, informal account of p-values for non-statisticians (such as myself!). As people have been joking, the ASA’s statement leaves something to be desired in the A-ha! department:
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

This thing is non-intuitive. People gloss it wrongly: ‘the p-value tells you the likelihood that your result happened just by chance’ (and variations on that thought).
Let’s start with a simple case that shows how and why this wrong gloss just has to be wrong; then, my improved, patent-pending informal gloss on the ASA’s informal gloss.
What is the simplest case in which we, the plain people of the internet, might arrive at a p < .05 experimental result in the comfort of our own homes?
Flipping a coin, getting heads 5 times in a row. We know how to calculate the likelihood of that: 2×2×2×2×2 = 32 equally likely sequences, only one of them all heads, so 1 in 32.
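Just to check the arithmetic, a few lines of Python (nothing in the argument depends on this; it's only the back-of-envelope calculation above, done exactly):

```python
from fractions import Fraction

# Probability of 5 heads in a row with a fair coin:
# each flip halves the probability, so (1/2)**5 = 1/32.
p_five_heads = Fraction(1, 2) ** 5

print(p_five_heads)                 # 1/32
print(float(p_five_heads) < 0.05)   # True: clears the p < .05 bar
```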
1 in 32 is < 5% so we publish!
No. Not even if you preregistered your 5-heads hypothesis. (Hey, it would be worth laying random longshot bets on flips if it might get you into Science!)
In calculating odds concerning a 5-head streak, you obviously aren’t calculating the chance that your coin is fair. But if you were calculating ‘the likelihood of this having happened just by chance,’ it sounds like that’s just what you would be doing. What’s the likelihood this happened due to a (longshot) chance with a fair coin, vs. a (rigged) chance with a trick coin? And then you would be concluding, apparently, that since a fair coin would only do that 1 in 32 times, 31 out of 32 times when this happens, someone has slipped you a trick coin that always comes up heads. Crazy. So what is it really, this mystery thing?
Without further ado, Holbo’s informal gloss on the ASA’s informal gloss on p-value. Specifically, what p-value < .05 basically comes to. (It helps to focus on that case, since p-value < .05 is a bit of a fetish, and the point is to demystify it.) Any such statement will be analogous to the following:
1) If this coin is fair, odds are less than 1 in 20 that you could match or beat that 5-heads run I just got!
Tying this to the ASA thing (a bit loosely):
“under a specified statistical model” = If this coin is fair
“the probability that … a statistical summary of the data … would be equal to or more extreme than” = odds are less than 1 in 20 that you could match or beat
“its observed value” = that 5-heads run I just got!
Now, to go with, an informal gloss on what your average scientific paper reports/asserts.
No such thing as the prestigious science journal Fluke, so when a striking regularity of coin flips presents itself, you hope you’ve uncovered a trick coin. Scientific papers say:
2) Probably this is a trick coin!
(I am not recommending 2) as one-size-fits-all philosophy of science, or as template for all scientific claims or even hypotheses. Just trying to prime the intuition pump for more local purposes.)
Now we can trade in the rather confusing question—‘how does that p-value < .05 thing relate to the substantive take-away we really care about?’— for a less confusing question.
What’s the relation between 1) and 2)?
Kind of looks like they are heading in opposite directions. Since we care about trick coins, and the p-value claim concerns fair ones, 1) doesn’t speak to what we care about: 2).
Strictly, 1) isn’t evidence for 2). 1) is five flips, wrapped in an elementary calculation. The flips might be evidence. We see the flips through the probability packaging. This may fool us into thinking packaging has added extra nutrition or flavor to contents. But that’s not how packaging works.
5 heads in a row is evidence your coin is a trick coin, or not, depending on background conditions. It could be weak evidence – so weak as to be none – or actually quite strong. Let’s talk through it.
We are immediately inclined to say it’s weak evidence because we assume we are talking about our world, or one like it, in which trick coins are (I dunno) 1 in 10 million? In which trick coins probably aren’t so tricky. Maybe they come up heads 70%? (What do I know of trick coins?) Trick coins are waaaaaaaaay more unlikely than plain old flipping 5 heads. Ergo a 5-head run is vastly more likely to have been a fluke.
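To put rough numbers on that intuition, here's a quick Bayes'-rule sketch using the made-up figures above (a 1-in-10-million prior for trick coins, and trick coins that come up heads 70% of the time; both numbers are just the ones I pulled out of the air):

```python
# Bayes' rule with the made-up numbers from the text:
# trick coins are 1 in 10 million, and a trick coin lands heads 70% of the time.
prior_trick = 1 / 10_000_000
p_5h_given_trick = 0.7 ** 5      # ≈ 0.168
p_5h_given_fair = 0.5 ** 5       # = 1/32 ≈ 0.031

# P(trick | 5 heads), by Bayes' rule.
posterior_trick = (p_5h_given_trick * prior_trick) / (
    p_5h_given_trick * prior_trick + p_5h_given_fair * (1 - prior_trick)
)
print(posterior_trick)  # ≈ 5.4e-07: still overwhelmingly likely a fair coin
```

So the 5-head run moves the needle by a factor of five or so, but against a 1-in-10-million prior, that's nothing: the fluke explanation wins in a walk.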
But, obviously, if the world is different things change. Suppose you are running to the bank with your brimming mason jar of quarters, and you collide with Mysterioso the Mysterious, carrying his equally large, equally full jar of trick quarters to the theater, where he has been wowing the rubes all week with his ‘all-heads, all-the-time’ coin tricks. (Well, not ALL the time. His coins have the tricky property that if you flip them 5 times, they come up straight heads 31 out of 32 times! Pretty good, as tricks go.)
Oh no! The coins are mixed up! What to do? Flipping each 5 times is a decent method (if you and the magician agree p-value < 0.05 is acceptable error, before you go your separate ways). Indeed, this is a situation in which that simple 1 in 32 (2×2×2×2×2) calculation is actually descriptive. That is, this is that rare situation in which the wrong thing people want to say about p-values — ‘the likelihood that this happened just by chance’ — is kind of right.
To review: we’re on the street, coins everywhere, magician swearing, jars rolling. From an even mix of fair and trick coins (per above) you pick a coin (any coin!) and flip – 5-heads. What to conclude?
There is a 1 in 32 likelihood that this happened just by (longshot) chance. That is, given 5 heads, there is a 1-in-32 chance that you happened to pick a fair coin (antecedently as likely as the alternative) and then (flukily) flipped 5 heads with it. On the other hand, there is a 31 out of 32 likelihood that this didn’t happen (just) by chance. Rather, you picked a trick coin (which was quite to be expected, in the circumstances), and Mysterioso’s coins are rigged (ergo don’t land ‘just by chance’).
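The same Bayes'-rule sketch, run on the Mysterioso collision as stipulated in the story (50/50 mix of coins, trick coins yielding 5 straight heads 31 times out of 32), shows why the naive gloss comes out right here:

```python
from fractions import Fraction

# The Mysterioso collision, as stipulated: fair and trick coins mixed 50/50,
# and a trick coin gives 5 straight heads 31 times out of 32.
prior_trick = Fraction(1, 2)
p_5h_given_trick = Fraction(31, 32)
p_5h_given_fair = Fraction(1, 32)

# P(fair | 5 heads), by Bayes' rule.
posterior_fair = (p_5h_given_fair * (1 - prior_trick)) / (
    p_5h_given_fair * (1 - prior_trick) + p_5h_given_trick * prior_trick
)
print(posterior_fair)  # 1/32 — the naive 'just by chance' reading, vindicated
```

Only because the priors are even and the trick coin exactly inverts the fair coin's behavior does the posterior equal the p-value. Change either and they come apart.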
So if you want to explain to someone why their ‘likelihood that this thing happened just by chance’ intuition about p-values is wrong, flip it and tell them what they are thinking could be right, but only if they just collided with Mysterioso, as it were. So you gotta ask yourself: do you have reason to believe you just collided with Mysterioso? (Well do ya? Punk!?)
OK, I promised intuitive. This Mysterioso biz is baroque. Go back to the point that 1) and 2) are, kind of, headed off in opposite directions. Nevertheless, since 1) contains evidence, you may be able to (as Wittgenstein might say) climb up the ladder of 1) and throw it away. (Adapting my other metaphor: you eat the evidence but toss the wrapper when you realize the 2 in 2×2×2×2×2 was not the right number, after all.)
What people would like – which they can actually get only in a cosmic coincidence, Mysterioso-type case – is for the rejected null hypothesis to do double-duty as a characterization of what holds in the non-null. The null hypothesis needs to be, not merely the rejected alternative to what you conclude, but a (reverse) mirror of it. But it isn’t every day you collide with a magician carrying a jar of trick coins that are, as it were, the opposite of your jar of fair coins.
(For good measure, it may be helpful to think about how weird Mysterioso’s coins are if they generally invert probabilities. With a fair coin, there are infinitely many increasingly, vanishingly unlikely series (5 heads, 500 heads, 500,000 heads, 500,000 tails, 500,000 alternations of heads-tails-heads-tails, on and on). It can’t be that Mysterioso’s coins are probabilistic inverts, all the way down that line, because no coin can be dead certain to do an infinite number of incompatible things. That would be … mysterious.)
Couple more points. Someone might object that Mysterioso cases aren’t cosmically coincidental, if you just loosen a bit. That’s right. Informally, a ‘collision with Mysterioso’ case can be glossed as:
1) The alternatives are each equally likely. (Fair coins roughly = trick in number, on the ground.)
2) The alternatives are each pretty likely. (If there are 20 different kinds of differently-behaved trick coins, scattered in equal numbers, flipping one 5 times can’t give you confidence as to which kind you’ve got.)
3) The alternatives are each quite different. (If trick behavior is subtle, 5 flips won’t cut it.)
The world does present you, from time to time, with situations you can reasonably believe meet conditions 1-3. In any such case, misusing 1) as a reverse mirror, to infer 2), will not be wildly off. But be aware this is a heuristic way to live the life of the mind. Very sketchy.
Let’s illustrate with a realistic case where 1-3 don’t hold, but people are in fact likely to reason, wrongly, as if they do.
I tell you formula XYZ was administered to 5 cancer patients and they all recovered soon after. Would you say formula XYZ sounds likely to be an effective cancer treatment? Many would say yes. But now I add that formula XYZ is water and everyone immediately sees the problem. They were assuming it was independently even-odds XYZ was curative, or not. But it’s obviously not.
A cure for cancer is like a trick coin. You don’t find one every day. They’re 1 in 10 million. But if you are reasoning as if you just collided with Mysterioso, you may trick yourself into thinking maybe you just cured cancer. Intuitive?
Let me conclude by quoting Andrew Gelman again:
One of my favorite blogged phrases comes from political scientist Daniel Drezner, when he decried “piss-poor monocausal social science.”
By analogy, I would characterize a lot of these unreplicable studies in social and evolutionary psychology as “piss-poor omnicausal social science.” Piss-poor because of all the statistical problems mentioned above—which arise from the toxic combination of open-ended theories, noisy data, and huge incentives to obtain “p less than .05,” over and over again. Omnicausal because of the purportedly huge effects of, well, just about everything. During some times of the month you’re three times more likely to wear red or pink—depending on the weather. You’re 20 percentage points more likely to vote Republican during those days—unless you’re single, in which case you’re that much more likely to vote for a Democrat. If you’re a man, your political attitudes are determined in large part by the circumference of your arms. An intervention when you’re 4 years old will increase your earnings by 40%, twenty years down the road. The sex of your baby depends on your attractiveness, on your occupation, on how big and tall you are. How you vote in November is decided by a college football game at the end of October. A few words buried in a long list will change how fast you walk—or not, depending on some other factors. Put this together, and every moment of your life you’re being buffeted by irrelevant stimuli that have huge effects on decisions ranging from how you dress, to how you vote, to where you choose to live, your career, even your success at that career (if you happen to be a baseball player). It’s an omnicausal world in which there are thousands of butterflies flapping their wings in your neighborhood, and each one is capable of changing you profoundly. A world that, if it truly existed, would be much different from the world we live in.
A reporter asked me if I found the replication rate of various studies in psychology to be “disappointingly low.” I responded that yes it’s low, but is it disappointing? Maybe not. I would not like to live in a world in which all those studies are true, a world in which the way women vote depends on their time of the month, a world in which men’s political attitudes were determined by how fat their arms are, a world in which subliminal messages can cause large changes in attitudes and behavior, a world in which there are large ESP effects just waiting to be discovered. I’m glad that this fad in social psychology may be coming to an end, so in that sense, it’s encouraging, not disappointing, that the replication rate is low. If the replication rate were high, then that would be cause to worry, because it would imply that much of what we know about the world would be wrong. Meanwhile, statistical analysis (of the sort done by Simonsohn and others), and lots of real-world examples (as discussed on this blog and elsewhere) have shown us how it is that researchers could continue to find “p less than .05” over and over again, even in the absence of any real and persistent effects.
I like the way he is connecting up misunderstanding of p-value with, as it were, ideology of mind.
Extending my coin case: it’s like social psychology convinced itself the field had collided with Mysterioso, so these trick things are as independently likely as anything. Bias thick on the mental ground, so any strong hint of bias is likely to indicate something real, not a fluke.
Which is great, if trick coins are what pays, for you.
Here I have to tread carefully. My Upton Sinclair-inspired subtitle is crass: It is difficult to get a man to intuit p-values when his h-index depends upon his not intuiting them. (But I couldn’t resist.) I am, as I said, no statistician, so I’m not going to lecture people about making p-value errors. But I do like to think of myself as a student of the history of different ways and styles of theorizing about the nature of the mind.
Here we have a case of at least some technical/intellectual confusion, due to the unintuitive character of p-values, dovetailing with motivated reasoning – you want the world to be a place that exhibits features you can get professionally promoted for publishing! – and with a certain style of thinking about the mind.
There are basically two philosophies of mind.
1) Aristotle: Man is the rational animal.
2) Puck: What fools these mortals be!
Gelman is basically saying: it would suck if we had to go Puck. But psychologists delight in 2), which is an honorable tradition, let’s be fair.
The more Puckish the mind, the more Mysterioso the situation, the more plausible the sense that p-value < .05, for alleged bias, is like a mirror in which we see our foolish face. But who’s more right? Aristotle or Puck? “Methought I was — there is no man can tell what. Methought I was, and methought I had …” There’s no easy answer. But trying to get to the bottom of Bottom’s Dream by calculating p-values would be distinctly ass-backwards. (I’m not saying anyone was really such a fool.)
I’ll sign off by saying why this stuff is coming up for me. I’m teaching Plato to first years, per usual (buy the book! or get it for free!) and a spot of social psychology to go with. I have students read a few chapters from Jonathan Haidt’s The Happiness Hypothesis. But, in his pop psych way (nothing wrong with that!) he passes along stuff that has, in the last few years, been challenged, refuted, not replicated, debunked (not sure how unkind to be about it in each case): the priming stuff. John Bargh’s work. Now Roy Baumeister’s ego depletion stuff is getting its cookies burnt. Maybe you read the NY Times article saying it isn’t so bad? Well, I’m no expert, but it looks to my inexpert eye as though the anti-replication skeptics are getting the best of it.
Haidt is of Puck’s school. The frame for his book, starting in Chapter 1: Why do people keep doing such stupid things? Hence the smart follow-up: just how replicable is it that people keep doing such stupid things?
In short, it’s time to save my syllabus by ‘teaching the controversies’. Hence my desire for an informal gloss on p-values. How’d I do, do you think?