Colliding with Mystery – or – It is difficult to get a man to intuit p-values when his h-index depends upon his not intuiting them

by John Holbo on March 11, 2016

Since the dawn of time, man has wondered: what are p-values?

Fast-forwarding to the present day: I’m touching on the so-called replication crisis in psychology in my intro philosophy class. Specifically, I want to bounce off something Andrew Gelman wrote:

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B.

Ergo, I could do with an intuitive, informal account of p-values for non-statisticians (such as myself!) As people have been joking, the ASA’s statement leaves something to be desired in the A-ha! department:

Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

This thing is non-intuitive. People gloss it wrongly: ‘the p-value tells you the likelihood that your result happened just by chance’ (and variations on that thought.)

Let’s start with a simple case that shows how and why this wrong gloss just has to be wrong; then, my improved, patent-pending informal gloss on the ASA’s informal gloss.

What is the simplest case in which we, the plain people of the internet, might arrive at a p < .05 experimental result in the comfort of our own homes?

Flipping a coin, getting heads 5 times in a row. We know how to calculate the likelihood of that: 1 in 2 x 2 x 2 x 2 x 2 = 1 in 32.

1 in 32 is < 5% so we publish!

No. Not even if you preregistered your 5-heads hypothesis. (Hey, it would be worth laying random longshot bets on flips if it might get you into Science!)

In calculating odds concerning a 5-head streak, you obviously aren’t calculating the chance that your coin is fair. But if you were calculating ‘the likelihood of this having happened just by chance,’ it sounds like that’s just what you would be doing. What’s the likelihood this happened due to a (longshot) chance with a fair coin, vs. a (rigged) chance with a trick coin? And then you would be concluding, apparently, that since a fair coin would only do that 1 in 32 times, 31 out of 32 times when this happens, someone has slipped you a trick coin that always comes up heads. Crazy. So what is it really, this mystery thing?

Without further ado, Holbo’s informal gloss on the ASA’s informal gloss on p-value. Specifically, what p-value < .05 basically comes to. (It helps to add that, since p-value < .05 is a bit of a fetish, and the point is to demystify it.) Any such statement will be analogous to the following:

1) If this coin is fair, odds are less than 1 in 20 that you could match or beat that 5-heads run I just got!

Tying this to the ASA thing (bit loosely):

“under a specified statistical model” = If this coin is fair
“the probability that … a statistical summary of the data … would be equal to or more extreme than” = odds are less than 1 in 20 that you could match or beat
“its observed value” = that 5-heads run I just got!
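For anyone who would rather see the arithmetic than take it on trust, here is a minimal Python sketch of gloss 1). The function name `tail_probability` and its parameters are my own illustration, not anything from the ASA statement. It assumes independent flips and reads "match or beat" as "at least that many heads".

```python
from math import comb

def tail_probability(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Chance of at least `heads` heads in `flips` independent flips."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# "If this coin is fair": p_heads = 0.5. "That 5-heads run": 5 heads in 5 flips.
p_value = tail_probability(heads=5, flips=5)
print(p_value)         # 0.03125, i.e. 1 in 32
print(p_value < 0.05)  # True
```

Note that the only coin appearing anywhere in the calculation is the fair one.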

Now, to go with, an informal gloss on what your average scientific paper reports/asserts.

No such thing as the prestigious science journal Fluke, so when a striking regularity of coin flips presents itself, you hope you’ve uncovered a trick coin. Scientific papers say:

2) Probably this is a trick coin!

(I am not recommending 2) as one-size-fits-all philosophy of science, or as template for all scientific claims or even hypotheses. Just trying to prime the intuition pump for more local purposes.)

Now we can trade in the rather confusing question—‘how does that p-value < .05 thing relate to the substantive take-away we really care about?’— for a less confusing question.

What’s the relation between 1 and 2?

Kind of looks like they are heading in opposite directions. Since we care about trick coins, and the p-value claim concerns fair ones, 1) doesn’t speak to what we care about: 2).

Strictly, 1) isn’t evidence for 2). 1) is five flips, wrapped in an elementary calculation. The flips might be evidence. We see the flips through the probability packaging. This may fool us into thinking packaging has added extra nutrition or flavor to contents. But that’s not how packaging works.

5-heads in a row is evidence your coin is trick, or not, depending on background conditions. It could be weak evidence – so weak as to be none – or actually quite strong. Let’s talk through it.

We are immediately inclined to say it’s weak evidence because we assume we are talking about our world, or one like it, in which trick coins are (I dunno) 1 in 10 million? In which trick coins probably aren’t so tricky. Maybe they come up heads 70%? (What do I know of trick coins?) Trick coins are waaaaaaaaay more unlikely than plain old flipping 5 heads. Ergo a 5-head run is vastly more likely to have been a fluke.
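A rough sketch of that reasoning in Python, using Bayes' rule. The numbers are the guesses from the paragraph above (1 in 10 million coins is a trick coin, and a trick coin lands heads 70% of the time); the function name `posterior_trick` is purely illustrative.

```python
def posterior_trick(prior_trick: float, p_heads_trick: float, n_heads: int = 5) -> float:
    """P(trick coin | n_heads heads in a row), by Bayes' rule."""
    like_trick = p_heads_trick ** n_heads  # chance a trick coin produces the run
    like_fair = 0.5 ** n_heads             # chance a fair coin does (1 in 32 for 5)
    joint_trick = prior_trick * like_trick
    joint_fair = (1 - prior_trick) * like_fair
    return joint_trick / (joint_trick + joint_fair)

# Assumed inputs: prior of 1 in 10 million, trick coins come up heads 70% of the time.
print(posterior_trick(prior_trick=1e-7, p_heads_trick=0.7))
```

With those guesses the answer comes out at roughly 5 in 10 million. The five heads multiply the odds of a trick coin by about five, which is nowhere near enough to overcome a 1-in-10-million starting point.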

But, obviously, if the world is different things change. Suppose you are running to the bank with your brimming mason jar of quarters, and you collide with Mysterioso the Mysterious, carrying his equally large, equally full jar of trick quarters to the theater, where he has been wowing the rubes all week with his ‘all-heads, all-the-time’ coin tricks. (Well, not ALL the time. His coins have the tricky property that if you flip them 5 times, they come up straight heads 31 out of 32 times! Pretty good, as tricks go.)

Oh no! The coins are mixed up! What to do? Flipping each 5 times is a decent method (if you and the magician agree p-value < 0.05 is acceptable error, before you go your separate ways.) Indeed, this is a situation in which that simple 1 in 32 (2 x 2 x 2 x 2 x 2) calculation is even descriptive. That is, this is that rare situation in which the wrong thing people want to say about p-values — ‘the likelihood that this happened just by chance’— is kind of right.

To review: we’re on the street, coins everywhere, magician swearing, jars rolling. From an even mix of fair and trick coins (per above) you pick a coin (any coin!) and flip – 5-heads. What to conclude?

There is a 1 in 32 likelihood that this happened just by (longshot) chance. That is, given 5-heads, there is a 1-in-32 chance that you happen to have picked a fair coin (as likely as the alternative); then (flukily) you flipped 5 heads with it. On the other hand, there is a 31 out of 32 likelihood that this didn’t happen (just) by chance. Rather, you picked a trick coin (which was quite to be expected, in the circumstances), and Mysterioso’s coins are rigged (ergo don’t land ‘just by chance’.)
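If you don't trust the arithmetic, you can replay the collision many times over. This is a small simulation sketch, assuming an even mix of coins and modelling a trick coin only by the property stated above (straight heads on 31 out of 32 five-flip runs). The name `simulate_jar`, the trial count, and the seed are arbitrary choices of mine.

```python
import random

def simulate_jar(trials: int = 200_000, seed: int = 0) -> float:
    """Among 5-heads results, the fraction that came from a fair coin."""
    rng = random.Random(seed)
    fair_hits = trick_hits = 0
    for _ in range(trials):
        if rng.random() < 0.5:
            # Picked a fair coin: five independent 50/50 flips.
            if all(rng.random() < 0.5 for _ in range(5)):
                fair_hits += 1
        else:
            # Picked a trick coin: straight heads 31 times out of 32.
            if rng.random() < 31 / 32:
                trick_hits += 1
    return fair_hits / (fair_hits + trick_hits)

print(simulate_jar())  # should land near 1/32 = 0.03125
```

Change the `0.5` in the coin-picking line to something like `0.9999999` (almost all coins fair) and the fraction of flukes among 5-heads runs climbs toward 1, which is the ordinary world rather than the Mysterioso one.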

So if you want to explain to someone why their ‘likelihood that this thing happened just by chance’ intuition about p-values is wrong, flip it and tell them what they are thinking could be right, but only if they just collided with Mysterioso, as it were. So you gotta ask yourself: do you have reason to believe you just collided with Mysterioso? (Well do ya? Punk!?)

OK, I promised intuitive. This Mysterioso biz is baroque. Go back to the point that 1) and 2) are, kind of, headed off in opposite directions. Nevertheless, since 1) contains evidence, you may be able to (as Wittgenstein might say) climb up the ladder of 1) and throw it away. (Adapting my other metaphor: you eat the evidence but toss the wrapper when you realize the 2 in 2x2x2x2x2 was not the right number, after all.)

What people would like – which they can actually get only in a cosmic coincidence, Mysterioso-type case – is for the rejected null hypothesis to do double-duty as a characterization of what holds in the non-null. The null hypothesis needs to be, not merely the rejected alternative to what you conclude, but a (reverse) mirror of it. But it isn’t every day you collide with a magician carrying a jar of trick coins that are, as it were, the opposite of your jar of fair coins.

(For good measure, it may be helpful to think about how weird Mysterioso’s coins are if they generally invert probabilities. With a fair coin, there are an infinite number of increasingly vanishingly unlikely series (5 heads, 500, 500000, 500000 tails, 500000 alternations of heads-tails-heads-tails, on and on.) It can’t be that Mysterioso’s coins are probabilistic inverts, down that line, because no coin can be veritably dead certain to do an infinite number of incompatible things. That would be … mysterious.)

Couple more points. Someone might object that Mysterioso cases aren’t cosmically coincidental, if you just loosen a bit. That’s right. Informally, a ‘collision with Mysterioso’ case can be glossed as:

1) The alternatives are each equally likely. (Fair coins roughly = trick in number, on the ground.)

2) The alternatives are each pretty likely. (If there are 20 different kinds of differently-behaved trick coins, scattered in equal numbers, flipping one 5 times can’t give you confidence as to which kind you’ve got.)

3) The alternatives are each quite different. (If trick behavior is subtle, 5 flips won’t cut it.)

The world does present you, from time to time, with situations you can reasonably believe meet conditions 1-3. In any such case, misusing statement 1) as a reverse mirror, to say what is true if statement 2) holds, will not be wildly off. But be aware this is a heuristic way to live the life of the mind. Very sketchy.

Let’s illustrate with a realistic case where 1-3 don’t hold, but people are in fact likely to reason, wrongly, as if they do.

I tell you formula XYZ was administered to 5 cancer patients and they all recovered soon after. Would you say formula XYZ sounds likely to be an effective cancer treatment? Many would say yes. But now I add that formula XYZ is water and everyone immediately sees the problem. They were assuming it was independently even-odds XYZ was curative, or not. But it’s obviously not.

A cure for cancer is like a trick coin. You don’t find one every day. They’re 1 in 10 million. But if you are reasoning as if you just collided with Mysterioso, you may trick yourself into thinking maybe you just cured cancer. Intuitive?

Let me conclude by quoting Andrew Gelman again:

One of my favorite blogged phrases comes from political scientist Daniel Drezner, when he decried “piss-poor monocausal social science.”

By analogy, I would characterize a lot of these unreplicable studies in social and evolutionary psychology as “piss-poor omnicausal social science.” Piss-poor because of all the statistical problems mentioned above—which arise from the toxic combination of open-ended theories, noisy data, and huge incentives to obtain “p less than .05,” over and over again. Omnicausal because of the purportedly huge effects of, well, just about everything. During some times of the month you’re three times more likely to wear red or pink—depending on the weather. You’re 20 percentage points more likely to vote Republican during those days—unless you’re single, in which case you’re that much more likely to vote for a Democrat. If you’re a man, your political attitudes are determined in large part by the circumference of your arms. An intervention when you’re 4 years old will increase your earnings by 40%, twenty years down the road. The sex of your baby depends on your attractiveness, on your occupation, on how big and tall you are. How you vote in November is decided by a college football game at the end of October. A few words buried in a long list will change how fast you walk—or not, depending on some other factors. Put this together, and every moment of your life you’re being buffeted by irrelevant stimuli that have huge effects on decisions ranging from how you dress, to how you vote, to where you choose to live, your career, even your success at that career (if you happen to be a baseball player). It’s an omnicausal world in which there are thousands of butterflies flapping their wings in your neighborhood, and each one is capable of changing you profoundly. A world if, it truly existed, would be much different from the world we live in.

A reporter asked me if I found the replication rate of various studies in psychology to be “disappointingly low.” I responded that yes it’s low, but is it disappointing? Maybe not. I would not like to live in a world in which all those studies are true, a world in which the way women vote depends on their time of the month, a world in which men’s political attitudes were determined by how fat their arms are, a world in which subliminal messages can cause large changes in attitudes and behavior, a world in which there are large ESP effects just waiting to be discovered. I’m glad that this fad in social psychology may be coming to an end, so in that sense, it’s encouraging, not disappointing, that the replication rate is low. If the replication rate were high, then that would be cause to worry, because it would imply that much of what we know about the world would be wrong. Meanwhile, statistical analysis (of the sort done by Simonsohn and others), and lots of real-world examples (as discussed on this blog and elsewhere) have shown us how it is that researchers could continue to find “p less than .05” over and over again, even in the absence of any real and persistent effects.

I like the way he is connecting up misunderstanding of p-value with, as it were, ideology of mind.

Extending my coin case: it’s like social psychology convinced itself the field had collided with Mysterioso, so these trick things are as independently likely as anything. Bias thick on the mental ground, so any strong hint of bias is likely to indicate something real, not a fluke.

Which is great, if trick coins are what pays, for you.

Here I have to tread carefully. My Upton Sinclair-inspired subtitle is crass: It is difficult to get a man to intuit p-values when his h-index depends upon his not intuiting them. (But I couldn’t resist.) I am, as I said, no statistician, so I’m not going to lecture people about making p-value errors. But I do like to think of myself as a student of the history of different ways and styles of theorizing about the nature of the mind.

Here we have a case of at least some technical/intellectual confusion, due to the unintuitive character of p-values, dovetailing with motivated reasoning – you want the world to be a place that exhibits features you can get professionally promoted for publishing! – and with a certain style of thinking about the mind.

There are basically two philosophies of mind.

1) Aristotle: Man is the rational animal.
2) Puck: What fools these mortals be!

Gelman is basically saying: it would suck if we had to go Puck. But psychologists delight in 2), which is an honorable tradition, let’s be fair.

The more Puckish the mind, the more Mysterioso the situation, the more plausible the sense that p-value < .05, for alleged bias, is like a mirror in which we see our foolish face. But who's more right? Aristotle or Puck? "Methought I was — there is no man can tell what. Methought I was, and methought I had ..." There's no easy answer. But trying to get to the bottom of Bottom's Dream by calculating p-values would be distinctly ass-backwards. (I'm not saying anyone was really such a fool.)

I'll sign off by saying why this stuff is coming up for me. I’m teaching Plato to first years, per usual (buy the book! or get it for free!) and a spot of social psychology to go with. I have students read a few chapters from Jonathan Haidt’s The Happiness Hypothesis. But, in his pop psych way (nothing wrong with that!) he passes along stuff that has, in the last few years, been challenged, refuted, not replicated, debunked (not sure how unkind to be about it in each case): the priming stuff. John Bargh’s work. Now Roy Baumeister’s ego depletion stuff is getting its cookies burnt. Maybe you read the NY Times article saying it isn’t so bad? Well, I’m no expert, but it looks to my inexpert eye as though the anti-replication skeptics are getting the best of it.

Haidt is of Puck’s school. The frame for his book, starting in Chapter 1: Why do people keep doing such stupid things? Hence the smart follow-up: just how replicable is it that people keep doing such stupid things?

In short, it’s time to save my syllabus by ‘teaching the controversies’. Hence my desire for an informal gloss on p-values. How’d I do, do you think?



John Holbo 03.11.16 at 8:50 am

Obviously the solution is to flip bitcoins only. (Just wanted to beat everyone to that one!)

Also, I am ashamed to say: I have never taken a single stats course. I really am not proud of that.


Colin 03.11.16 at 10:16 am

Perhaps this is why particle physicists aim for a ‘five sigma’ result (p < 3×10^{-7} or so): they generally believe in a rational, orderly universe, so are more reluctant to reject the null hypothesis. Or it could just be based on the strength of evidence available to particle physicists versus social scientists. What psychological phenomena do we have five sigma or better evidence for?
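(A quick way to check that five-sigma figure, as a sketch: it assumes the one-sided tail of a normal distribution is the convention being used, and `one_sided_p` is just an illustrative name.)

```python
from math import erfc, sqrt

def one_sided_p(sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond `sigma`."""
    return 0.5 * erfc(sigma / sqrt(2))

print(one_sided_p(5.0))    # about 2.9e-07
print(one_sided_p(1.645))  # about 0.05, the social-science threshold
```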


Sean Carroll 03.11.16 at 10:46 am

I’d say overall, you did quite well! The most obvious next step is to just admit that we should all be Bayesian in our treatment of credences, and that one’s prior probabilities actually matter quite a bit for theory choice. (Which is almost what you said, without the jargon — if your prior for a coin being unfair is 1 in ten million, the fact that the likelihood of five heads in a row is greater under that hypothesis than under the fair-coin hypothesis won’t budge your credences all that much.)

Particle physicists have the luxury of waiting for five sigma since their data is very clean and they know how to collect more and more of it. But being honest Bayesians is becoming more popular even there.


Zamfir 03.11.16 at 11:18 am

@Colin, particle physicists aim at 5 sigmas because they have seen subtle systematic errors with a 3 to 4 sigma effect in earlier experiments. And of course, because they have enough data to aim that high. The LHC records many millions of collisions every second; that must be comparable with all the data from psychological surveys ever.

So the high sigma is not to rule out spurious effects due to chance alone. It’s to rule out alternative hypotheses that they might have overlooked. The argument is basically: a truly alternative explanation never matches the data this precisely, unless it’s only a subtle variation on our hypothesis.

Basically, the sheer amount of data allows a brute force approach to hypothesis testing.


John Quiggin 03.11.16 at 11:24 am

I had a go at this a while back. My conclusion:

what has been the dominant response in practice is to disregard the “95 per cent” number associated with classical hypothesis testing theory and to treat research findings as a kind of Bayesian update on our beliefs about the issue in question. If we have no prior beliefs one way or the other, a rough estimate is that a finding reported with “95 per cent” confidence is about 50 per cent likely to be right.


Another lurker 03.11.16 at 11:27 am

@Colin, well in many cases where physicists use 5-sigmas they are actually trying to disprove the null hypothesis, for example the established theory that, say, neutrinos are massless.


oldster 03.11.16 at 12:45 pm

…Daniel Drezner, when he decried “piss-poor monocausal social science.”

He’s exactly right, though. This is really the source of all of our problems today.


John Holbo 03.11.16 at 12:50 pm

Hi John, thanks, I must have missed your original post about it somehow. Last year I wasn’t really focused on the replication crisis so much.


TM 03.11.16 at 12:58 pm

I don’t know what it means to “intuit p-values” but I would suggest that “It is difficult for a scientist to report p-values greater than .05 when their H-index depends upon rejecting the null hypothesis.”

It should be mentioned that the appropriate choice of p-value depends on how many hypothesis tests are (or can be) conducted. Observing a 5-head run is unremarkable if hundreds of teams are conducting coin-flip experiments. One reason for a very high statistical threshold in physics is that they do have virtually unlimited data. You would always expect to find some irregularities due to random effects. It may be appropriate to have a lower threshold in other sciences but a p-value is never in itself proof of a theory.
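To put a rough number on the "hundreds of teams" point, here is a small sketch. It assumes every team flips a fair coin five times, independently of the others; `chance_of_some_fluke` is an illustrative name.

```python
def chance_of_some_fluke(teams: int, p_single: float = 1 / 32) -> float:
    """Chance that at least one of `teams` independent teams sees the fluke."""
    return 1 - (1 - p_single) ** teams

for teams in (1, 20, 100):
    print(teams, round(chance_of_some_fluke(teams), 3))
```

With a hundred teams, the chance that somebody, somewhere, gets a 5-head run from a perfectly fair coin is better than 95%.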


John Holbo 03.11.16 at 1:16 pm

“intuit p-values”

I’m just getting at the idea that people like to feel what they are doing makes sense. The concern is that p-values < .05 have become a bit of a fetish. This fetish value would be diminished if the limits, per Gelman (and other critics), were more intuitive.

"It should be mentioned that the appropriate choice of p-value depends on how many hypothesis tests are (or can be) conducted."

This is certainly right.


Zamfir 03.11.16 at 1:26 pm

TM says: “One reason for a very high statistical threshold in physics is that they do have virtually unlimited data. You would always expect to find some irregularities due to random effects.”

I don’t think this is correct. The statistics of particle physics treat the entire LHC as 1 experiment, or at best as a few campaigns. As such, the effect of the large dataset is already taken into account in the sigma-values.


JoB 03.11.16 at 1:41 pm

“If we have no prior beliefs one way or the other”

That will be both hard and a non-starter for Bayesian updating.


bianca steele 03.11.16 at 1:50 pm

Instead of flipping coins, what about drawing different-colored balls from an urn? (Urns are nicely Grecian.) You draw one ball and it’s black. What confidence do you have that all the balls are white?

The math of how you know a given number of observations gives you a particular p is yet another thing, however.


Ronan(rf) 03.11.16 at 1:51 pm

A genuine, though layman’s (perhaps wrongheaded), question … Taking the Mysterioso situation literally, and sticking rigidly to the rules of coin verification, who would be more likely to come out with more coins at the end of the flipping? Or would the 1 in 32 probabilities and 31 in 32 probabilities cancel each other out? Or is this verification scheme weighted in Mysterioso’s favour?


bianca steele 03.11.16 at 1:54 pm

Argh–flip one of those colors to the other one. The math may well be different for coin-flipping and urn-drawing. I’m used to seeing urn-drawing in this kind of situation and feel like it’s more intuitive.


Kiwanda 03.11.16 at 1:58 pm

Speaking of physics, I’m still looking for an explanation for how an atom’s width of displacement of the path of a laser beam traveling a couple miles can only be explained by gravity waves, and in particular the collision of two black holes a few billion years ago. It’s not obvious that it would take much to make such a displacement: maybe a truck coming to an abrupt halt, a dog barking on the other side of the world, a whale sounding. How did *that* statistical analysis go?


TM 03.11.16 at 2:12 pm

11: Let’s try again. The *power* of a significance test (to detect an effect) depends on the number of observations. There is always a trade-off between power and significance or Type I and Type II error. However, if you have an immense number of observations, you can afford to set the significance level very low and still conduct a powerful test. Most experimenters aren’t in that position, they have to choose a relatively high significance level, otherwise they would never detect an effect even if it is real. That means of course that they can’t just be content to report a barely significant result: more studies are needed to confirm the result.

The coin flip is a good example. It is easy to devise a very powerful test and choose a very low significance level – just throw the coin a thousand times instead of just five. That strategy just isn’t usually available to real world experimenters.
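A sketch of that trade-off, under assumptions that are mine rather than anything fixed by the discussion: the trick coin lands heads 70% of the time, the test is one-sided ("reject fairness if heads >= some cutoff"), and the function names are illustrative.

```python
from math import exp, lgamma, log

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of exactly k heads in n flips, via log space."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log(1 - p))

def power(n: int, alpha: float, p_true: float) -> float:
    """Power of the one-sided test of fairness at significance level alpha."""
    tail, cutoff = 0.0, n + 1  # start with an empty rejection region
    # Lower the cutoff while the false-alarm rate under a fair coin stays <= alpha.
    while cutoff > 0 and tail + binom_pmf(cutoff - 1, n, 0.5) <= alpha:
        cutoff -= 1
        tail += binom_pmf(cutoff, n, 0.5)
    # Power: chance the biased coin lands in the rejection region.
    return sum(binom_pmf(k, n, p_true) for k in range(cutoff, n + 1))

print(power(n=5, alpha=0.05, p_true=0.7))
print(power(n=1000, alpha=1e-6, p_true=0.7))
```

With 5 flips and a .05 cutoff, a 70%-heads coin gets caught only about 17% of the time; with 1000 flips, even a one-in-a-million cutoff catches it almost always.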


erk 03.11.16 at 2:32 pm

if it takes you this many words to explain something, either you don’t understand it, you haven’t edited your writing, or it is so complex that, like advanced statistical mechanics, the only proper response is, without 4 semesters of study, you are not gonna get it

imo, if after 200 years people don’t get stats, it is the fault of the teachers

also, that whole null hypothesis – anytime you introduce dependent conditions, people zone out
when you understand that, the stat profession will start to learn how to do pedagogy


erk 03.11.16 at 2:34 pm

kiwanda @16
it is called money and time: LIGO spent a lot of money on a lot of smart people over a long time
you do enough of that, you can measure really really reaaaaaalllly small things


Lee A. Arnold 03.11.16 at 2:39 pm

Is it fair to define the p-value as the probability of a probability, given a possibility?


Scott P. 03.11.16 at 2:42 pm

I think the coin examples are a bit confusing, as we are rarely interested in the behavior of coins, fair or not. (If we want to test whether a coin is fair, we’d likely use a chi-square test).

Here’s a better example: You poll 100 Republican voters in Montana, 75 say they will vote for Trump, 25 for Cruz. Let’s assume just for the sake of argument that everybody votes, so we don’t have to worry about likely voters.

The null hypothesis would be: This was just a statistical quirk, Trump is no more likely to win Montana than Cruz.

The alternate hypothesis would be: It is not a quirk, Trump is favored in Montana.


John Holbo 03.11.16 at 2:44 pm

it’s a fair coin = null hypothesis
it’s a trick coin = surprising regularity that is not a fluke but has some physical cause

That’s supposed to be the analogy. You probably got it, but just to be clear.

It helps in the psychology case that we are so frequently detecting bias, and trick coins are biased. But the point doesn’t concern only psychology.


bianca steele 03.11.16 at 2:59 pm

Perhaps the willingness to engage in reductionism contributes to a difference that would weaken the analogy? We’re inferring invisible causes from observations of something else. The medical researcher, on the contrary, doesn’t have to infer that the efficacy of a test is reducible to lower-order biochemistry and so on. We dislike reductionism in social science. And for good reason.


JW Mason 03.11.16 at 3:08 pm

If you haven’t seen it, you might be interested in Why Most Published Research Findings Are False, which has a very nice discussion of these issues in the context of biomedical research.


JW Mason 03.11.16 at 3:09 pm

Oops, I screwed up the link. Should be: Why Most Published Research Findings Are False


JW Mason 03.11.16 at 3:11 pm


BenK 03.11.16 at 3:13 pm

People do tend to favor their own survival. If productivity is measured in p-values, then p-values will be produced. The alternatives which have been suggested – that scientists, say, should all be given endless amounts of money to produce nothing in particular – should be silly prima facie, but have been given surprising amounts of attention, even supported by arguments that _not_ doing so amounts to censorship and anti-scientific behavior.

The real problem is that curiosity-driven science wasn’t putting food in peoples’ mouths much of the time. All scientists were suspicious of applied science publications for good reason; but pure science was published by people who made their money as teachers, physicians, or something else. They could afford, if you will, to have high standards.


Scott P. 03.11.16 at 3:49 pm

“That’s supposed to be the analogy. You probably got it, but just to be clear.”

Yes, but I have a hard time conceptually mapping it onto my election poll example, which is a lot closer to the kinds of problems I actually deal with. In my example, we have a sample (those responding to the poll) and a population (the total voting population). The question is whether the sample accurately reflects a specific aspect of the overall population (who has the most support).

With a coin, it’s clear the sample is a series of flips. But what is the population? I don’t know. Something like the universe of all possible flips? Is that even a thing? That’s my issue.


mbw 03.11.16 at 4:04 pm

Mostly, just like Sean said @3.

One point that hasn’t been emphasized here is that the dopey p-value convention not only produces a lot of false positives but also in important cases produces a lot of false negatives. If, for example, the manufacturer of some chlorinated hydrocarbon that looks just like a known carcinogen, but with an added methyl group somewhere, is allowed to pick a “null hypothesis” that the new substance is safe, an unreasonable amount of evidence will be required before concluding that it isn’t. Bayesian methods have limits, but they at least don’t hide all the subjectivity in an occult process by which null hypotheses are chosen. Sander Greenland has written extensively on this issue.


mbw 03.11.16 at 4:10 pm

p.s. On particle physics and null hypotheses:
We all know the sorts of things that qualify as null hypotheses. Say you’re about to do an experiment at CERN searching for new particles in a particle energy range. Which is the legitimate null hypothesis?
1. Nothing will show up beyond the typical random background events.
2. What you see will require overturning much of the framework of modern physics.

Obviously (1), right? (2) is the opposite of a “dull null”.

Except if the experiment was the final search for the Higgs boson, (1) and (2) were in fact logically equivalent. They’re the same result.


Kiwanda 03.11.16 at 4:14 pm

erk @19: I’ve heard “proof by money and authority” before; it doesn’t answer the question. There are a lot of really really reaaaaaalllly small things around, including the ones I mentioned; how do they know that the one they detected must have been a gravitational wave? I’m sure all those smart people (well, smart for physicists) had ways to eliminate all other explanations. What were they?


mbw 03.11.16 at 4:18 pm

@27 Do you have any idea what the actual topic is here? It’s not remotely close to pure vs. applied science. Roughly speaking, it’s whether to
1. Misuse a particular crude version of frequentist statistics (p-values), as is the custom.
2. Use that crude version, but more carefully.
3. Use more informative frequentist descriptions (e.g. confidence intervals)
4. Use likelihood functions
5. Use Bayesian inference (likelihoods*priors).


Dipper 03.11.16 at 4:20 pm

I see the gravity wave deniers are here (Kiwanda @16) …

With no specific knowledge of this: firstly, the wave was detected in Washington state and Louisiana, so there is a record of a wave passing through being detected in two places. If it was a truck passing by or a dog barking, then events such as this would happen quite a lot, so they should be detectable; and again, two trucks would have to pass, or two dogs bark, by the two detectors at the right time separation, and be unlike other trucks passing or dogs barking at other times …


mbw 03.11.16 at 4:22 pm

@31 Well, if you think that exactly the same sort of huge truck drove by the detectors in WA and LA at times differing by less than the time it takes light to go from one place to the other, well sure it’s a problem. Also apparently those truck drivers had been studying their general relativity simulation outputs to know exactly what sort of chirp pattern to make as they rumbled on past.


Patrick 03.11.16 at 4:23 pm

It’s not just bad stats.

If you read social science papers, you’ll start to notice certain tropes. Like this:

“Here are our results. We studied twenty things. Some had statistically significant correlations with our variable and some didn’t. Some replicated prior studies and some didn’t and some prior studies were themselves inconsistent with each other. Here are ten paragraphs interpreting those results via cultivation theory. Here are conclusive statements about how our results support cultivation theory, except for the ones that weren’t statistically significant. Here are explanations minimizing the importance of the statistically insignificant parts and the parts that don’t replicate prior studies. Here are three sentences conceding that there are other explanations and cultivation theory might not be real, but please don’t read this far.”

Swap out cultivation theory for whatever pet interpretive framework is at issue.


bianca steele 03.11.16 at 4:25 pm

I think I’m going to do what I think Rich would do, and ask what you’re trying to do with this post, John.

Are you trying to learn about the issue? Then I think you’ve got a lot of good suggestions here.

Are you trying to find a good introduction to the basics of p-values for your students so you can introduce the problem of unreplicated research? Then something like Scott P.’s example (maybe with my Grecian urns added in!), with some verbiage about psychological research being a lot more complicated than that, may be all you can do in the time you have.

If you’re trying to keep Haidt in your syllabus and you want a foolproof way to keep from misleading your students with it? Why do you want to keep it in your syllabus if you’ve learned it’s pretty much 100% misleading, because it uses unsound methods?


mbw 03.11.16 at 4:28 pm

@33 Sorry Dipper, posted before yours came up on my page.


Kiwanda 03.11.16 at 5:10 pm

@34: Finally, something! So, probably, not a dog barking on the other side of the world, something from farther away. That’s good. And of all the known and unknown phenomena in all the universe, only gravitational waves could possibly yield the observed chirp pattern. So that’s settled then.


RNB 03.11.16 at 5:21 pm

Look forward to reading this. Jordan Ellenberg has a delightful discussion of p-values in How Not to Be Wrong.


eric titus 03.11.16 at 6:33 pm

An example that is perhaps closer to the real world application in the social sciences. (also, in my experience people have trouble moving from coins/urns to regression-type situations):

You’re watching two statistically minded friends having a frisbee throwing competition. On 100 throws each, one averages 30 meters and the other averages 25. You crunch some numbers (looking at variation) and find that there is a >95% chance that friend 1 is a better frisbee-er than friend 2.

But wait! It turns out that one frisbee was more aerodynamic than the other. You kept track of which frisbee was being used, and find that Friend 1 used this frisbee on 60 throws but Friend 2 only on 40 throws. Your regression model finds that the special frisbee appears to add 3 meters to a throw, and there is still a >95% chance that friend 1 is the better frisbee thrower. End of the analysis?

Perhaps. But you are still making plenty of assumptions. For example, you estimate that the special frisbee will add a fixed distance to the throw, but perhaps, upon rerunning, a model where the frisbee adds 10% is a better fit. You then notice that friend 1 got better over time while friend 2 stayed consistent–maybe you were actually measuring differences in adaptation to that day’s environment? And you spot that friend 2 made a few very weak throws in a row–perhaps they were distracted during this period? The story is more complicated than it first appeared and you file away these observations for follow-up papers (no need to confuse the reviewers!). A disconcerting thought pops into your head: who knows what else you may have missed!


JoB 03.11.16 at 6:47 pm

Frisbee beats urn beats coin. Sure thing.


Dipper 03.11.16 at 7:39 pm

@Kiwanda 38. Yes it’s possible we may be wrong. We might be imagining everything we see. Or there might be magic turtles creating these things. But a man had a theory that predicts that these things should occur. A machine was built to detect them, and it found them exactly how he said they would occur within our ability to measure it. We don’t have any other theory that predicts they should occur. So we say that this evidence supports this theory.

Over time, some people may find an alternative source for these signals not currently considered, or detect something the theory doesn’t predict, or else someone may come up with a different theory that explains this and more. But right now, we have a theory that predicts exactly what was seen, so we go with the theory.


F 03.11.16 at 7:58 pm


Come on. Do you really think they haven’t thought about this? The LIGO people put tremendous effort into imagining as many possible alternate explanations as they could. What they know is that

a) it’s not any of the (very large) number of possible alternate explanations they considered
b) it is exactly consistent with the predictions of gravity waves.

Could it be some other random unexplained thing? Sure. But if you have an observation that matches one explanation and you have ruled out as many other possible explanations as possible, Occam’s razor says you go with the explanation that matches until you have evidence that contradicts it.


Peter Dorman 03.11.16 at 8:24 pm

This is a nice, intuitive (if somewhat wordy) explanation of the Bayesian critique of frequentist p-value testing. I like it. I’ll use it.

But Gelman has another critique too, which I’m too lazy at the moment to look up and quote. It’s that what the p-value actually tells you is whether you have enough observations or not. If N is sufficiently large you would be likely to reject the straw man null in most instances.

Let’s stay with your coin flipping. Suppose we have two coins, and the question is whether they are equally fair. We flip each coin five times and record the difference in the number of heads we get with each. Then we do it again and again. We do it 20 times. The null hypothesis is that the coins are equally fair, so the question is whether the mean of these 20 differences is zero. To get the p-value, you begin by assuming that the coins actually are equally fair and then ask, given the variability of this difference each time we do the flipping exercise, how likely is it you would get a given non-zero mean over 20 runs? If the probability is less than .05 you reject the null; otherwise you accept it.

Now let’s say that, in nature, no two coins are exactly the same. There is a tiny, tiny difference in the likelihood that a flip with one coin will give you a heads compared to another. That’s often how it is in the real world, whether we are thinking about natural or social science. This means that if we sample long enough—increase our number of runs from 20 to 2 million or billion—eventually we will get a p < .05 and have to reject the null.

But how interesting is this? Surely for any practical question, like whether you’re going to get ripped off in a coin-tossing game for stakes, such a minuscule difference in fairness doesn’t matter. That’s the sense in which the null hypothesis is a straw man. In an ideal world you might know exactly what the threshold is for a meaningful difference in fairness and that threshold would become your null. But in real life we seldom know this, and in fact people will often disagree. The one thing we do know is that setting the null at zero is arbitrary and, as N increases, increasingly distracts us from what we ought to be looking at instead. To summarize, under null hypothesis statistical testing more N means more uninformative “significant” results.

This is the effect size critique of NHST.
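A rough sketch of the point about N, in Python. It simplifies the setup above: instead of runs of five flips, each coin is simply flipped n times, and the observed head rates are assumed to land exactly on the true rates. The 0.500 vs 0.501 rates and the function name `two_sided_p` are made up for illustration.

```python
import math

def two_sided_p(p1, p2, n):
    """Two-sided z-test p-value for a difference in head rates,
    assuming each coin is flipped n times and the observed rates
    come out exactly at p1 and p2 (an idealization)."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z = abs(p1 - p2) / se
    # erfc(z / sqrt(2)) is the two-tailed normal probability beyond |z|
    return math.erfc(z / math.sqrt(2))

# Same tiny difference in fairness, ever larger samples
for n in (100, 10_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}  p-value = {two_sided_p(0.500, 0.501, n):.4f}")
```

The difference between the coins never changes; only n does. With enough flips the p-value drops below .05 even though the difference stays practically irrelevant.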


nnyhav 03.11.16 at 8:40 pm

Gelman’s good (and I follow), but Cosma Shalizi’s my go-to for this sort of thing: f’rinstance,
Any P-Value Distinguishable from Zero is Insufficiently Informative
(or search for other such) (also via, see Jordan Ellenberg, How Not to Be Wrong: The Power of Mathematical Thinking)

A-and, Pynchon’s Proverbs for Paranoids #3, ” If they can get you asking the wrong questions, they don’t have to worry about answers,” is invertible: “If you worry about answers, be sure to ask your questions wrong.”


Zamfir 03.11.16 at 8:52 pm

Are gravity wave deniers really a thing? I can see how someone would want to deny the holocaust, or global warming, or even the moonlandings. But aren’t gravity waves a tad obscure? What’s next, deniers of the Riemann-Zeta function?


anon/portly 03.11.16 at 8:54 pm

1) If this coin is fair, odds are less than 1 in 20 that you could match or beat that 5-heads run I just got!

2) Probably this is a trick coin!

I am probably just confused, but “if this coin is fair” (and the “1/32 < 5%" thing) seems relevant only if your null hypothesis is p = .5, i.e. that the coin is fair. "This is a trick coin" only seems relevant if your null hypothesis is p = 0, i.e. that "someone has slipped you a trick coin that always comes up heads."

If your null is p = 0, you obviously can't reject it if the coin always comes up heads. You can only reject it if it comes up tails at least once.

If your null is p = 0.5, then if your alternative is p not= 0.5, then actually I think "1/32 < 5%" isn't quite right, you have a 2/32 chance of getting a result that "extreme," as they say, because you could also get 5 straight tails. If your alternative is p < .5 then the "1/32 < 5%" I think works, so it seems to me that implicitly that is what is really being tested for, that it is a trick coin that comes up tails less than 50% of the time, rather than a trick coin that always comes up heads. And of course if all of the coins in the world are fair, 1/32 of the time you'll prove the coin is not fair, and 1/32 of the time you'll be making a type-1 error.
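The 1/32 versus 2/32 arithmetic can be checked with a short exact calculation. This is only a sketch: the function names are invented here, and the two-sided rule used (add up every outcome that is no more likely than the observed one) is one common convention, not the only one.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with heads probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_one_sided(heads, n, p=0.5):
    """P(at least `heads` heads in n flips) if the null value p is true."""
    return sum(binom_pmf(k, n, p) for k in range(heads, n + 1))

def p_two_sided(heads, n, p=0.5):
    """Total probability of all outcomes no more likely than the observed one."""
    observed = binom_pmf(heads, n, p)
    return sum(
        binom_pmf(k, n, p)
        for k in range(n + 1)
        if binom_pmf(k, n, p) <= observed * (1 + 1e-9)  # tolerance for float ties
    )

print(p_one_sided(5, 5))  # 1/32
print(p_two_sided(5, 5))  # 2/32, because five straight tails is just as extreme
```

So five heads clears the .05 bar only if the alternative is one-sided.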

Like I said, I could be very confused. I await correction….

People gloss it wrongly: ‘the p-value tells you the likelihood that your result happened just by chance’ (and variations on that thought.)

This seems not wrong, just incomplete. How about: the p-value tells you the likelihood that, if the null hypothesis is true, your result (i.e. sample), or a result (sample) even more favorable for the alternative hypothesis, happened just by chance.


Yankee 03.11.16 at 9:35 pm

Kiwanda, there were 2 widely separated detectors, so comparing the signals eliminated everything that didn’t come from much further away than that baseline. And the signal wasn’t just a blink, it was a waveform that was analyzable in terms of General Relativity and a whole lot of very postdoc math. Sure these experiments are very hard to do, no doubt there will be technical papers published, if you have the chops to read them. I would also like to understand how the genetic anthropology people are doing what they claim to be doing.

I mean, one OP point is that uneducated intuition isn’t trustworthy, right?


peterv 03.11.16 at 9:35 pm

It is not surprising this stuff is difficult and counter-intuitive. We’ve had an understanding of the syntax of probability statements and their mathematical manipulation since the 1660s, but we still have no agreement on what they mean (their semantics), nor even on whether probability is the best formalism for representing uncertainty.


RNB 03.11.16 at 11:28 pm

@44 implies correctly, I believe, that an intuitive explanation of p-values should clarify that statistical significance does not imply anything about importance or size. Ellenberg notes that Chinese statisticians use the word xianzhu for significance in a statistical sense, which is closer to ‘notable’ but may still carry a connotation of importance.


RNB 03.11.16 at 11:29 pm

Naomi Oreskes and Erik Conway imagining Chinese scientists three hundred or so years from now reflecting on inaction on climate change:

“Yet overwhelming evidence suggests that 20th c scientists believed that a claim [linking e.g. intensifying hurricanes and warmer sea surface temperatures] could be accepted only if, by the standards of Fisherian statistics, the possibility that an observed event could have happened by chance was less than 1 in 20. Many phenomena whose causal mechanisms were physically, chemically or biologically linked were dismissed as ‘unproven’ because they did not adhere to this standard of demonstration. Historians have long argued about why this standard was accepted, given that it had neither epistemological nor substantive mathematical basis. We have come to understand the 95% confidence limit as a social convention rooted in scientists’ desire to demonstrate their disciplinary severity.”


Kiwanda 03.11.16 at 11:32 pm

I don’t know if gravity-wave deniers are a thing, but interested members of the public who want to know a tiny bit about how the reasoning was done are a thing.

Again, it’s not a question of whether the predicted phenomenon could show up on your really really reaaaaaalllly sensitive device, it’s how you know that of all the things affecting your really really reaaaaaalllly sensitive device, everything that is not your predicted phenomenon looks nothing like it.

Simultaneous measurement at widely separated locations, as mbw mentioned, is definitely a good and readily understandable clue worth mentioning, and one that’s more interesting to hear about than “billions of dollars and lots of smart people”; so is finding *exactly* the predicted phenomenon, for some not-entirely specified value of *exactly*. So is the “injection of fake signals” that apparently was used, although the nature of those fake signals, and why they were needed in light of the rock solid evidence just mentioned, would be interesting to hear about.

Actually: looking a bit at an announcement and an earlier report, there was quite a considerable amount of signal processing and statistical analysis involved. This seems a bit inconsistent with F’s “exactly consistent with the predictions of gravity waves”, but OK.

The search against a “template bank” of 250K chirp patterns, generated by coverage of the configuration parameters of the two hypothesized black holes, is also a bit more complicated than “exactly consistent”, but OK.

The sub-light-transit time lag mentioned by mbw doesn’t seem consistent with ” The 15ms window is determined by the 10 ms inter-site propagation time plus 5 ms for uncertainty in arrival time of weak signals,” in the second reference, but OK.

So: they used seismic isolation, environmental sensors, and mass suspensions to reduce and detect noise. They looked for near-simultaneous arrival. They looked for matches among a broad array of black-hole-collision chirp patterns, against various measured assumed-random noise conditions, and *with those assumptions*, came up with a very small false alarm rate. The specific distinctiveness of the chirp patterns, in any way other than against a collection of noise patterns based on observation, is not discussed, but OK.


John Holbo 03.12.16 at 12:13 am

Thanks for comments, everyone. This is generally helpful, and it is gratifying to have actual experts tell me I did ok, despite total lack of training. I need to give myself a quick self-study course on Bayes vs. frequentist. But I didn’t want to teach anything in this area before getting some expert confirmation that I wasn’t just laughably turned around on some elementary point. It seems, if I am wrong, I am at least respectably so. Good enough. What philosopher ever does better than that, in philosophy 101?


Dean C. Rowan 03.12.16 at 12:27 am

Like Scott P., but for squishier reasons (IANAS), I’m uncomfortable with the coin-flipping illustration. So I turn to Babbie, where I read:

“The fundamental logic of tests of statistical significance, then, is this: Faced with any discrepancy between the assumed independence of variables in a population and the observed distribution of sample elements, we may explain that discrepancy in either of two ways: (1) we may attribute it to an unrepresentative sample, or (2) we may reject the assumption of independence…. Most simply put, there is a *high* probability of a small degree of unrepresentativeness and a *low* probability of a large degree of unrepresentativeness.

“The statistical significance of a relationship observed in a set of sample data, then, is always expressed in terms of probabilities. ‘Significant at the .05 level (p<=.05)' simply means that the probability that a relationship as strong as the observed one can be attributed to sampling error alone is no more than 5 in 100."

Back to Scott P.'s complaint. What population is being represented in the coin-flip exercise? Of which population is it a sample?


John Holbo 03.12.16 at 12:35 am

“What population is being represented in the coin-flip exercise?”

The fair coin population.

“‘Significant at the .05 level (p<=.05)' simply means that the probability that a relationship as strong as the observed one can be attributed to sampling error alone is no more than 5 in 100." Your statement is equivalent (roughly) to my IF this is a fair coin statement 1). That's what makes getting to: Trick coin! problematic.


John Holbo 03.12.16 at 12:37 am

To put it slightly differently: my 1) represents the fair coin population. But we are trying to move to a discussion of the whole coin population (fair + trick). And that’s why using 1) as evidence for 2) is tricky.


John Holbo 03.12.16 at 12:39 am

To clarify further, it depends what you mean by ‘the coin-flip exercise’. If by that you mean the whole thought-process, then you might say that we are moving, in the course of it, from representing one population to representing another population. And that’s why it’s confusing.


Dean C. Rowan 03.12.16 at 12:42 am

Flipping a single coin five times is in no way a sample of the fair coin population. Flipping a sufficiently large and well-chosen random sample of coins five times, and then comparing the resulting results of H or T, is a sample of the fair coin population.


John Holbo 03.12.16 at 12:43 am

So when I said ‘they are heading in different directions’ it would have been more helpful to say ‘dealing with different populations’. I hope I got that right. (I could be confused.)


John Holbo 03.12.16 at 12:44 am

“Flipping a single coin five times is in no way a sample of the fair coin population.”

Yes, but statement 1) concerns the fair coin population. IF this is a fair coin …


RNB 03.12.16 at 12:46 am

@58 So say you flip a coin five times on 100 different occasions. On how many of those occasions would you have to get a run of five heads to reject the null hypothesis that the coin is fair?
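One way to put a number on this, as a sketch only: treat the 100 occasions as independent, so that the count of five-head occasions is Binomial(100, 1/32) if the coin is fair, and look for the smallest count whose upper-tail probability falls below .05. The one-sided test, the .05 cutoff and the function names are assumptions added here.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def smallest_rejecting_count(n=100, p=1 / 32, alpha=0.05):
    """Smallest number of five-head occasions (out of n) whose one-sided
    p-value under the fair-coin null is below alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p) < alpha:
            return k
    return None

k = smallest_rejecting_count()
print(k, upper_tail(k, 100, 1 / 32))
```

A fair coin should give about three five-head occasions in 100 (100/32), and under these assumptions the cutoff comes out at 7.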


Dean C. Rowan 03.12.16 at 12:57 am

Again, IANAS[tatistician by any stretch], but “trying to move to a discussion of the whole coin population” implies to me sampling from the population and generalizing from results across that sample. I recognize that you are positing in 1) that the coin is fair, that we know it’s fair. (The flipping of that coin five times is what I mean by “the coin-flip exercise.”) If that’s so, then which variables are you testing with the five flips? What relationship are we trying to establish? We aren’t “moving, in the course of [this process] from one population [fair coins] to representing another population [biased coins].” That is not how generalizations work.

It seems to me that we can’t posit the coin is fair when we’re running actual tests. We want a random assortment of fair and biased coins (established through survey design measures), and then we want to sample from that assortment to the population of coins. If we’re testing only fair coins, then we need a way to know with some degree of certainty which coins are fair. A sample won’t accomplish that for us.


Dean C. Rowan 03.12.16 at 12:58 am

Loving this, but must dash away. Will return later, but without Babbie!


John Holbo 03.12.16 at 1:03 am

“It seems to me that we can’t posit the coin is fair when we’re running actual tests.”

This is right in a sense. But look at it this way: you can posit anything, while flipping coins. You can posit that your coin is a man from Mars in disguise. It doesn’t interfere with the flips. You just can’t prove your posit, or significantly lend support to it, BY flipping coins. 1) really does say JUST what is the case IF it’s fair. It posits that, if you like. And we really are wanting to move to ‘it’s trick’, i.e. a form of denial of 1). This makes the evidential status of 1) a bit non-intuitive. That’s the point of the post. To try to make that less slippery. But I guess it’s still slippery!


John Holbo 03.12.16 at 1:07 am

Correction: 2) doesn’t deny 1), per se. Compare:

1) If this is aspirin, it will probably cure your headache.

That isn’t a denial of:

2) This is just a sugar pill.

Both 1 and 2 can be true. 1) isn’t evidence for 2). 1) only talks about the aspirin population, and 2 takes us out of that. But, obviously, the truth of 1, plus taking the pill and having it not cure your headache could carry you to 2.


John Holbo 03.12.16 at 1:41 am

Here’s one last, alternative angle on the problem.

The advantage of the coin flip case is that 2 x 2 x 2 x 2 x 2 is about as simple as it’s going to get. That is obviously how you get the solution to the ‘how likely to get 5-heads’ question. It is also obvious that it is, essentially, five posits that this coin is fair, multiplied. That’s what the 2 means. Conversely, if this coin is trick, then 2 is necessarily the wrong value. That’s all it is to be a trick coin. So it should be obvious that just multiplying 2 x 2 x 2 x 2 x 2, by itself, can’t automatically bear on the question of whether 2 was the wrong number to have plugged in, from the start.
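A tiny sketch of how the answer depends on the number you plug in. The per-flip heads probabilities 0.75 and 1.0 are made-up stand-ins for trick coins, and the function name is invented here.

```python
def prob_all_heads(p_heads, flips=5):
    """Chance of heads on every flip, given an assumed per-flip heads probability."""
    return p_heads ** flips

# 0.5 is the fair-coin posit (the '2' in 2 x 2 x 2 x 2 x 2);
# the other values are hypothetical trick coins.
for p in (0.5, 0.75, 1.0):
    print(p, prob_all_heads(p))
```

The 1/32 only ever comes out of the first line. The calculation takes fairness as an input and so cannot, on its own, report back on it.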


John Holbo 03.12.16 at 1:43 am

2 x 2 x 2 x 2 x 2. No hidden parts, hence no suspicion that some statistical secret sauce has miraculously gotten us something more.


Dean C. Rowan 03.12.16 at 2:14 am

Back at it.

RNB@61: The question begs the question, inasmuch as we can compare the frequency of five head occurrences over the 100 flips of the one coin with that of flips gleaned from a population comprised of a normal distribution of fair and biased coins, but then we must choose arbitrarily the threshold for making the call. That’s the “probable” part of probability.

By “posit” I mean that we are doing our think-piece based on a premise that the coin is fair. We are stating that we know it is fair and wondering what a five-heads result with a KNOWN fair coin tells us about results from tests with biased and fair coins combined (the population). I can’t fake an answer to that query. But testing coins we know to be fair won’t tell us anything about coins we know not to be fair.

I do not mean that we are hypothesizing a fair coin (which would be the null hypothesis) when we flip coins to determine the distribution of five-flip results. We could posit that the coin is a man from Mars in disguise, too, but we would have to design our experiment to test for that hypothesis. The hypothesis doesn’t interfere with the flips, but it most definitely interferes with our experimental design!

I want to address other issues here, but for now I’ll focus on the last comments, 66 and 67. It is entirely possible I’m not getting the premise you’re stating, but the value of 2 serves one purpose: it reflects the true number of possible results of a flip (ignoring the infinitesimally small likelihood a coin will land on its edge). Nothing more or less, nothing “wrong” about it even if we’re working with a bad coin. We would only adjust that factor if we knew that the coin(s) with which we’re working are biased in such a way that their flips would produce a different (and predictable) frequency of H and T.


JimV 03.12.16 at 2:22 am

Kiwanda, here is the NASA press conference announcing the discovery of gravitational waves:

I found the evidence presented there very convincing, and I expect more confirming detections of similar events will be found with greater precision as more LIGO installations go on-line. I realize however that no amount of evidence need convince someone who does not want to be convinced.


Dean C. Rowan 03.12.16 at 2:34 am

I fear I am not getting the premise. We can’t, nor should we, prove the posited “this coin is fair” by flipping the coin. If we want to test outcomes with a KNOWN fair coin, there is no point to “proving” it’s fair. We must work with a KNOWN fair coin. The 1/20 figure is arbitrary, a conventional threshold for querying whether or not the results we obtain likely reflect the truth or other unknown, uncontrolled variables that contribute to the result. Again, I don’t understand how the p-value applies to this coin-flipping exercise.


Paul Stankus 03.12.16 at 3:08 am

@JimV #69 — Quick note: the press conference announcing the observation of gravitational waves was at the NSF, not NASA. This may be OT and minor to some, but important to others; the NSF funded LIGO consistently for decades and deserves the credit for the scientific success.


John Holbo 03.12.16 at 4:26 am

OK, Dean, let me try to explain the connection between coins and psych cases by coming up with a different example. A psych case that will be like the coin case, and like the suspect psych cases in the news.

Suppose I am testing the hypothesis that cuticle thickness is positively correlated with voting Republican. (Silly, I know. That’s the point.) Now this could be quite complicated. Is thickness correlated with degrees of Republicanism? That is, the thicker the skin around your nails, the more right-wing your voting? Are Democrats all thin-cuticled? Are we predicting that ALL Republicans (or most?) have thick cuticles? Strength, frequency. But let’s keep it simple and binary (because coins are binary and we want to see the analogy there.)

The hypothesis is: greater than average cuticle thickness is positively correlated with higher than average rates of voting Republican.

Now, the test. It’s probably complicated, sure, but let’s keep it simple. Your first 5 test subjects all happen to be Republicans (vote Republican at greater than the national average); they all have thicker than average cuticles (you measured, and you have the data to know what average is.) Congratulations! Let’s call it! You have already achieved the gold standard: p-value .05.

How so?

Your null hypothesis is: cuticle thickness has nothing to damn do with Republicanism. (This is the fair coin assumption you are hoping to overthrow.)

Let’s rewrite my 1) so it’s about this experiment.

In a world in which cuticle thickness and Republicanism have nothing to damn do with each other, I give you less than a 1 in 20 chance of replicating the result I just got, or a stronger one: 5 Republicans in a row like that, all with thick cuticles.

Answering your ‘what is the population?’ question. The population is, you might say, the set of possible worlds in which cuticle thickness and Republicanism have nothing to do with each other. They are not positively correlated.

But, of course, we want to conclude something like 2): cuticle thickness and Republicanism are positively correlated. The world turns out to exhibit a weird, unexpected bias. (Trick coin! Biased coin! That’s why trick coins are a good metaphor.)

So, here again, we want to use a claim about one population (the set of possible worlds in which cuticle thickness and Republicanism are uncorrelated) as evidence about another population (some set of possible worlds in which they go together). Specifically, we want to conclude our world is part of the latter set, not the former.

So how do we get from 1) to 2). Talk of one population to talk of the other?

Well, maybe you have some posit. What might that be? Let’s say you suspect that heightened levels of enzyme XYZ in some elements of population produce two effects: neurological changes that manifest as conservatism, hence Republican voting; increased cuticle production. (Crazy, I know. But the body is weird.) What is your confidence in this posit? Let’s say it’s 50% (just to keep it simple.) You are on the edge. That’s why you are doing the experiment.

Now, intuitively, if you are independently 50% confident you will find this correlation, then 5 Republicans walking in the door and all having thick cuticles is going to quite significantly increase your confidence. You are basically in a Mysterioso situation, per the post. And you write up and publish your paper on cuticle thickness and Republican voting and enzyme XYZ. And you report your gold standard p-value < .05 results, vis a vis the null hypothesis. But it is still the case that all your p-value calculation says is something about the population of possible worlds in which there is no correlation between greater than average cuticle thickness and greater than average Republicanism. And it is still the case that your conclusion concerns a different population: worlds in which there is such a correlation. Does that make sense?
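The move from the 1-in-32 figure to an actual confidence level can be sketched with Bayes’ rule. Everything numeric here beyond the 50% prior and the 1/32 is an assumption added for illustration: in particular, it assumes the effect, if real, would produce the 5-in-a-row result every time (like a coin that always lands heads), and the 1-in-1000 prior is a made-up contrast case.

```python
def posterior(prior, p_data_if_effect, p_data_if_null):
    """Probability the effect is real after seeing the data (Bayes' rule)."""
    numerator = prior * p_data_if_effect
    return numerator / (numerator + (1 - prior) * p_data_if_null)

# Assumed: the effect, if real, yields the observed streak with probability 1.
# The null (no correlation, or a fair coin) yields it with probability 1/32.
print(posterior(0.5, 1.0, 1 / 32))    # started out on the fence
print(posterior(0.001, 1.0, 1 / 32))  # started out very skeptical
```

The same 1/32 moves a 50% prior to roughly 97% but leaves a 1-in-1000 prior at about 3%, which is why the p-value alone does not settle how confident to be.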


Kiwanda 03.12.16 at 4:28 am

JimV, thanks, but I doubt if the press conference had more detail than the PhysRevLett and arXiv papers I linked. I apologize for being curious about the question of eliminating false positives, which was not covered in the popular press articles I saw.


Dr. Hilarius 03.12.16 at 4:29 am

Peter Dorman @44 hits on an important point about the limitations of p-values. Given a sufficiently large sample size it is possible to detect statistically significant differences (contra the null hypothesis) of no practical significance. This is why some journals are demanding descriptive statistics including effect size.

I ran into this issue in the context of psychologists attempting to predict future sexual violence. A large meta-analysis found having a prior male victim to be a statistically significant variable. This variable, however, accounted for less than 1% of sample variation. The psychologist, an expert for the state, maintained that it was still “significant” when assessing the risk presented by a particular individual. When I tell real statisticians about what courts will accept they get glassy-eyed and start head banging.


Plarry 03.12.16 at 5:30 am

I have read all the comments and I don’t understand the coin flipping example either. You want your null hypothesis, which you are trying to reject, to be “this coin is fair” or something like that. But that misses the essence of what statistics is about. Given a particular coin (“this coin”), we can simply test whether it is fair or not. The point is that statistics is about populations where you can’t test every member. That’s why reasoning about them is hard.


Dean C. Rowan 03.12.16 at 6:01 am

“Answering your ‘what is the population?’ question. The population is, you might say, the set of possible worlds in which cuticle thickness and Republicanism have nothing to do with each other. They are not positively correlated.”

No. This is deeply mistaken. The population has nothing to do with “possible worlds,” and here I think you are utterly wrong about how this work proceeds. The population must be a definite set: Americans, male Americans, everybody on the planet, Berkeley voters, etc. You must first determine your target population before you run an experiment to determine their characteristics. You make a judgment about that population: knowing nothing else, we should assume that half of our voters are Republican.

“But, of course, we want to conclude something like 2): cuticle thickness and Republicanism are positively correlated. The world turns out to exhibit a weird, unexpected bias.”

No. We don’t “want to conclude” anything if we’re running our experiment properly. We want to test a hypothesis. We are agnostic about the outcome.

When you go here: “Let’s say you suspect that heightened levels of enzyme XYZ in some elements of population produce two effects…” … you’re on the right track. One vulnerability of our study might be that we failed to consider OTHER VARIABLES that contribute to the dependent variable we are studying. This is basic experimental design. If we do it right, we have anticipated these “heightened levels” and controlled for them, so that variations in levels of enzymes do not skew the results of our cuticle study.

Put another way, we are not getting from 1) to 2) by discovering facts about one population and assigning them to another, as you put it. We’re discovering facts about one population (coins, voters) about whom we reasonably assume a distribution based upon mathematical fact or sociological data (the average number of biased coins, the typical array of cuticle sizes), and we’re observing how the members of this population differ in certain terms (where they land when flipped, how thick their cuticles are). The ways in which they differ might permit us to abandon an assumption–cuticles have no correlation to political alignment, or the center of gravity of every coin is roughly identical–but they don’t allow us to make a *positive* determination that thick cuticles=Republican or anomalous flip distribution=biased coin. But notice how the coin example seems to contradict what I’ve just said. This is because we treat the coin test as purely binary. A coin’s center of gravity, and nothing else, determines how it inclines to land. If we accept that binary analysis, then yes, we can make a positive determination that a coin flipped X number of times and landing heads each of those times is biased.


Dean C. Rowan 03.12.16 at 6:20 am

Reviewing my comment @68, I want to clarify my last paragraph there. There are always only two possible outcomes of any coin flip, H or T, regardless of whether or not the coin is fair or biased. Therefore, we always use 2 as our factor of possible outcomes of a coin flip. We never want to “adjust” that factor unless we are working with a population of biased coins whose distribution of H and T flips we know, and unless we want to study outcomes of flips among the population of those coins.


faustusnotes 03.12.16 at 6:20 am

I have been thinking about a small challenge for people who think that there is a distinction between bayesian and frequentist approaches to statistical inference. I’m not sure if it’s good so I’ll try it here…

Given you have a data set of global average temperature taken over a long period (e.g. HADCRUT, RSS, whatever), can you conclude that the earth is warming using statistics on that dataset without reverting to a null/alternative hypothesis framework?

I think you can’t. I think whether or not you use p-values, confidence intervals, Bayesian credible intervals, or eyeballing, you’ll end up reverting to a null/alternative hypothesis framework. I think this framework is fundamental to the experimental process and there is no way to conduct experiments without using it.


Charles Peterson 03.12.16 at 8:02 am

In practical use p values deal with two fundamentally different things. One is to deal with rejecting the null hypothesis given a mass of data (and therefore potential hypotheses) which may be small or large. For an enormous mass of observations, you need a correspondingly low p value for a null hypothesis to be soundly rejected. This explains the astonishingly low p values required in physics and genetic SNP experiments, for example. The requirements here can be established in a fairly straightforward fashion given the other thing…

The other thing is to deal with the null hypothesis given a particular number of experimenters or long running experiments or publications. This is really where the p < 0.05 convention plays. It may be tradition (and little else) to accept that 1/20 experiments may be rejecting the null hypothesis falsely. Back when there were far fewer experimenters doing only long running experiments for which papers were rarely published this made a lot of sense; I doubt this is still true.

I fear if you look carefully at the traditional requirements in specific fields, they boil down to p < 0.05 adjusted appropriately to typical numbers of potential hypotheses in experiments in those fields. But they do little to negate the problem of 1/20 experimenters/experiments/papers rejecting the null hypothesis falsely.

Because of the above, p value alone can't be taken as proof of anything. p < 0.05 is really only a starting point, to weed out the garbage and give us a somewhat smaller set of hypotheses to consider. Proof isn't in individual p values–it's in endless replication, true forward prediction, and intuitive explanation–which helps to drive the pursuit of endless replication.

Then you have vast areas of human knowledge where controlled experiments are not typically done or would be impossible. That is how things are in MOST areas of human knowledge or whatever you want to call it.
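The point above about needing a lower threshold as the number of tested hypotheses grows can be put in numbers. What follows is only a rough sketch, assuming the tests are independent and every null is true; the function names are made up for illustration, and the Bonferroni division is one common adjustment, not necessarily the one any particular field uses.

```python
def familywise_error(alpha: float, m: int) -> float:
    """Chance of at least one false rejection among m independent true nulls."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the overall false-rejection chance near alpha."""
    return alpha / m


for m in (1, 20, 1000):
    per_test = bonferroni_threshold(0.05, m)
    print(m, familywise_error(0.05, m), per_test, familywise_error(per_test, m))
```

With 20 hypotheses tested at p < 0.05 the chance of at least one false rejection is already well over half, which is why the per-test threshold has to shrink as the mass of data grows.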


TM 03.12.16 at 8:48 am

28: “With a coin, it’s clear the sample is a series of flips. But what is the population? I don’t know.”

In a simplified example, the overall population consists of k supporters of candidate A and n-k supporters of candidate B. So if you ask a random person which candidate they support, the chance is k/n that the answer is A. Drawing a random sample of N individuals (assuming n<<N) therefore is nothing other than a binomial experiment with p=n/k and N trials. It is now easy to calculate the probability distribution for the hypothesis say that A and B have equal support, and use your observed sample result to test that hypothesis.

Conceptually, election polls are binomial experiments just like coin flips and drawing balls from urns. I think your confusion is shared by many students and text books don't explain this connection really well. Almost all interesting applications consist of experiments of that kind – selecting a random sample from a population with certain characteristics, which is modeled as a binomial experiment.


TM 03.12.16 at 9:07 am

61 and others, the key here is understanding what a binomial experiment is. When you throw a fair coin say 1000 times, you can calculate precisely the probability of observing 600 heads. That probability is tiny (it does in fact constitute a 6-sigma discrepancy from the expectation value, since sigma=sqrt(npq) for a binomial experiment). So if you do observe 600 heads, you can conclude with a very high confidence that the coin is not fair.
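For anyone who wants to check the 600-heads figure, here is a small sketch using only the standard library. It computes sigma = sqrt(npq), the distance of 600 from the expected 500 in units of sigma, and the exact chance of 600 or more heads from a fair coin. The variable names are just illustrative.

```python
from math import comb, sqrt

n, k, p = 1000, 600, 0.5

mean = n * p                      # expected number of heads: 500
sigma = sqrt(n * p * (1 - p))     # standard deviation of the head count
z = (k - mean) / sigma            # how many sigmas 600 lies above 500

# Exact probability of at least k heads in n fair flips.
tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(f"sigma = {sigma:.2f}, z = {z:.2f}, P(X >= {k}) = {tail:.3g}")
```

The discrepancy comes out a little over six sigma, and the tail probability is far below any conventional threshold.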

It really shows that the basics of statistics are poorly taught. Too much emphasis is on advanced techniques (even in introductory courses) before students have had time to grasp the basics.


TM 03.12.16 at 9:14 am

62: “It seems to me that we can’t posit the coin is fair when we’re running actual tests.”

Of course you CAN posit a hypothesis. That’s what logic is for: you start from an assumption and explore the logical consequences. If what you observe is at odds with what you ought to observe if your assumption were true, well that casts doubt on your assumption. In most cases, “at odds” cannot be defined in absolute terms but in terms of probability, which is why we need statistical methods.


TM 03.12.16 at 9:20 am

79 corrections: Should be assuming N<<n, and p=k/n. Regret the mistakes.


John Holbo 03.12.16 at 11:38 am

“I think you are utterly wrong about how this work proceeds.”

The thing to see is that I’m not describing how the work proceeds. I’m describing what p-value claims say, in effect. Which is different. The point is that what they say is not as descriptive of the work, and how it proceeds, as we might have intuitively supposed it should be.

Of course if we are investigating trick coins – or an alleged link between cuticles and Republicanism – we are concerned with a population that includes these things. We are concerned about things in the actual world, anyway. But it doesn’t follow that p-value claims are correctly read as being claims about the population that interests us, experimentally. And, indeed, my point is that they don’t.

I agree with TM, which is something new and pleasant. Good!


SusanC 03.12.16 at 12:19 pm

On the gravity wave question above: I have no connection with gravity wave experiments, but use stats a lot in my day job. A question you might want to ask: given that I have two detectors, each of which produces an output with a fair amount of random noise, how likely is it that both detectors will see random noise that (a) looks approximately the same at both detectors, delayed by speed of light between them and (b) looks approximately like the kind of chirp you’re looking for. “approximately” because the detector output is noisy, and outputs are not expected to match exactly even when you see a genuine gravitational wave. I presume the people working on that experiment have done the statistics. But, re. the discussion above, in experiments of that form it isn’t always obvious that the false positive rate will be small, and you’d better actually work it out in case it’s embarrassingly large. (Looking for an event of duration about 1 second with a detector running for months/years is a lot of parallel testing).

This is in principle the same as John’s coin flip example, just more complicated to work out the details.


John Holbo 03.12.16 at 12:43 pm

OK, that last comment was a bit rough. Let me try this. Dean, suppose you suspect a trick coin. You flip it 5 times, to investigate your trick hypothesis. All heads. You calculate your p-value. Obviously you only use the number 2 to calculate it: 2 x 2 x 2 x 2 x 2. Only a 5 percent chance, if this were a fair coin. Ergo, your trick hypothesis is strengthened, I assume. Equally obviously, you suspect 2 is not the right number. That is, you suspect a trick. So this 2 calculation concerns the world you suspect you are NOT in. So its relation to the world you suspect you ARE in is not directly descriptive. Make sense?


TM 03.12.16 at 1:03 pm

You can only test a specific hypothesis. In statistics, a null hypothesis is always expressed as a mathematical equality and the alternative as an inequality. The hypothesis “the coin is fair” is p=.5 (where p is the probability of throwing a head). You can also test the hypothesis p=.6 or whatever, if there’s a good reason for doing so. It is NOT possible however to test the hypothesis “this coin is rigged”, because a coin can be rigged in an infinite number of ways and you cannot test each of them simultaneously.

That is also, I believe, the real reason why it’s important to insist that rejecting the null hypothesis is NOT the same as accepting the alternative hypothesis. This is another complication that students have a hard time understanding. Rejecting p equals .5 seems to be logically equivalent with accepting p unequal .5 but the latter isn’t in itself a testable hypothesis. It’s an infinity of distinct hypotheses and you don’t know which of them is true, even if you have strong evidence that one of them is probably true.
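To make the "one specific hypothesis at a time" point concrete, here is a hedged sketch that computes the chance of seeing at least k heads in n flips under whatever single value of p you choose to test. The function name is invented for illustration, and the one-sided (upper tail) form is an assumption made to match the 5-heads example.

```python
from math import comb


def upper_tail_p_value(k: int, n: int, p0: float) -> float:
    """P(at least k heads in n flips) if each flip lands heads with probability p0."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Same data (5 heads in 5 flips), two different specific hypotheses.
print(upper_tail_p_value(5, 5, 0.5))  # the "fair coin" hypothesis
print(upper_tail_p_value(5, 5, 0.6))  # one particular "rigged" hypothesis
```

The same five heads fall below .05 when tested against p=.5 but not when tested against p=.6. Each value of p needs its own calculation, and "rigged" by itself names no value to calculate with.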


John Holbo 03.12.16 at 1:10 pm

What TM said.


John Holbo 03.12.16 at 1:12 pm

Also, I wrongly said that 2 x 2 x 2 x 2 x 2 = a 5% chance. That’s obviously wrong. It’s a 1 in 32 chance. Less than 5%. We regret the error.


TM 03.12.16 at 1:41 pm

What Holbo @ 88 said … ;-)


steven johnson 03.12.16 at 2:59 pm

Gelman is cited as writing “Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B.”

The thing is, it is not at all clear in what sense null-hypothesis testing is a parody of falsificationism. (In fact, it is not at all clear to me that “the” null-hypothesis is always the same thing, or at least, that it should always be. Is Ronald Fisher the seal of the prophets?) In the case of evolutionary psychology, it really is difficult to tell how, for instance, statistical significance testing of the hypothesis that men evolved to prefer specific ratios in female body measurements doesn’t falsify the hypothesis when p-values indicate no statistically significant difference in outcomes of an experiment. Then, it is difficult to tell why, if the p-values do demonstrate the existence of a statistically significant difference, the hypothesis should not merit further testing.

If, like Gelman (and the OP too?) I use my intuition drawn from seeing variations in body type from cursory familiarity with other cultures, plus visible changes in old movies, inexplicable as an evolutionary change, to conclude that something’s wrong with this notion? That’s not falsification. As I understand it, that sort of thing is verification, something falsificationism explicitly rejects. (By the way, as I understand it philosophers of science among themselves will claim Popper, the great proponent of falsification, is somewhat passe. But they more or less keep this news to themselves, and Popperianism tends to dominate everywhere but philosophy journals.) Falsificationism may be a broken hammer, but people will keep using a broken hammer when good new hammers are not on sale.

I’m not sure that Bayesian reasoning isn’t a good deal like Moliere’s prose. I suspect there is frequentist Bayesian reasoning, though I may be misled by a persistent tendency to think of Bayes’ theorem when anyone says “Bayesian.” It seems to me that Bayes’ theorem is easily understood in frequentist terms. On the other hand, it’s not at all clear to me that propositions/hypotheses have probabilities in any natural sense.

It’s not clear how Bayesian reasoning creates a non-straw man falsificationism. It is especially not clear how Gelman’s preferred improvement of measurements will do that. Skipping around I see Gelman talks about “noisy data,” which certainly seems to have a natural meaning, but it’s not clear what that means in falsification. He also complains about “open ended theories,” but Googling just finds a contrast between open ended and stage theories of human development. The Stanford Encyclopedia of Philosophy doesn’t seem to know that either. I assume it’s not really that obscure but somehow I’ve missed this.

At any rate it’s certainly depressing to find that only the most sophisticated statistics will lead us to true falsificationism. I guess that means most scientists are not really scientific? And of course most fields of study are not scientific either?


steven johnson 03.12.16 at 3:02 pm

PS It was Mario Bunge who impressed on me the observation about propositions not having probabilities. Perhaps I’ve misunderstood him?


Dean C. Rowan 03.12.16 at 4:59 pm

TM’s comments are helpful here, including this @82: “Of course you CAN posit a hypothesis.” But I was responding to what may very well be my misunderstanding that John was setting a condition, for the sake of his illustration about his 1) and 2), that the coin is fair. I thought he was positing that the coin is fair for the sake of his argument, not as a hypothesis to be tested. It’s as if he were saying, “I’m flipping what I know to be a fair coin. It lands five heads. The odds of this outcome are 1/32. If I didn’t know this were a fair coin, a p-value<.05 would lead me to guess this is a trick coin." My point was that it makes no sense to test a known fair coin for fairness.

John@86: "Equally obviously, you suspect 2 is not the right number." This is not obvious to me. That was the gist of my comment @77. Setting aside the slim possibility that a coin will land on its edge, there are only two possible outcomes of a coin flip. That goes for fair and trick coins. But on reflection I see where you're heading. (I've had my coffee and am now capable of crafting puns.) I was aiming that direction in @77, too. If we were working with coins biased to land heads 1/3 of the time, we would calculate the odds of five heads as the reciprocal of 3x3x3x3x3 or 1/243. Then we would flip random coins five times. If we encounter five heads we would strongly suspect we had flipped a coin biased differently or not at all. In any event, the world we are in remains one in which fair coins will land heads 50% of the time and our special known biased coins will land heads 33% of the time.


Dean C. Rowan 03.12.16 at 5:24 pm

@84: “But it doesn’t follow that p-value claims are correctly read as being claims about the population that interests us, experimentally. And, indeed, my point is that they don’t.”

Per Babbie, p-value claims are not about the population. They are claims about our sample from the population. We are willing to tolerate a 5% chance that the sample is not a fair representation of the population and that our remarkable result is due to an unrepresentative sample. Hence your “Ergo a 5-head run is vastly more likely to have been a fluke.” Yes, a fluke, and not necessarily “chance,” i.e., a five-heads run doesn’t necessarily mean we stumbled randomly across the rare trick coin. It means our sample is messed up: We didn’t flip properly. A gust of wind intervened. We incorrectly recorded two of the results as heads when in fact they were tails. Or, perhaps, we did stumble across a rare trick coin. (I remain uncomfortable referring to a flip of a single coin five times as a “sample,” but I’ll deal with it.)


anon/portly 03.12.16 at 6:17 pm

There is a 1 in 32 likelihood that this happened just by (longshot) chance. That is, given 5-heads, there is a 1-in-32 chance that you happen to have picked a fair coin (as likely as the alternative); then (flukily) you flipped 5 heads with it. On the other hand, there is a 31 out of 32 likelihood that this didn’t happen (just) by chance. Rather, you picked a trick coin….

Now that I am beginning to see the point (maybe), I thought I’d point out that you meant “1 in 33” and “32 in 33” here. (Pick up 64 coins, 32 Mysterioso, 32 yours, on average 33 times you should get 5 heads in 5 flips….).

If 31/63 of the coins are Mysterioso’s and 32/63 are fair then the p-value and the actual probability will coincide.

So if you want to explain to someone why their ‘likelihood that this thing happened just by chance’ intuition about p-values is wrong, flip it and tell them what they are thinking could be right, but only if they just collided with Mysterioso, as it were.

So I think what’s really being said is that in Mysterioso world, where P(true null) = .5, the p-value (1/32) is very close to the actual probability (1/33) of the event “happening just by chance;” whereas in the real world, where P(true null) = 1 or else P(true null) = 1, the actual probability of the event “happening just by chance” is either 100% (if P(true null) = 1) or some probability not likely to equal the p-value (if P(true null) = 0).

The way I would think of this is that if you pick a coin out of the fair coin jar, the probability of 5 heads occurring “just by chance” is 100%. If you pick it out of the Mysterioso always-Heads jar, it’s 0%. If you pick it out of a jar with 31 Mysterioso always-Heads coins and 32 fair coins, it’s 1/32.
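The jar arithmetic can be checked with exact fractions. This is a minimal sketch, assuming (as in this comment) that Mysterioso coins always land heads; the function name is made up for illustration.

```python
from fractions import Fraction


def prob_fair_given_five_heads(fair_coins: int, trick_coins: int) -> Fraction:
    """Chance the coin you drew was fair, given that it just landed heads 5 times.

    Assumes trick coins always land heads and fair coins do so 1 time in 32.
    """
    fair_mass = fair_coins * Fraction(1, 32)   # fair coins that fluke 5 heads
    trick_mass = trick_coins * Fraction(1)     # trick coins always give 5 heads
    return fair_mass / (fair_mass + trick_mass)


print(prob_fair_given_five_heads(32, 32))  # even mix
print(prob_fair_given_five_heads(32, 31))  # the 31/63 vs 32/63 mix
print(prob_fair_given_five_heads(1, 0))    # fair-coin jar only
print(prob_fair_given_five_heads(0, 1))    # Mysterioso jar only
```

Only the 31-to-32 mix makes the answer equal the p-value of 1/32; the even mix gives 1/33, and the pure jars give 1 and 0.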


anon/portly 03.12.16 at 6:21 pm

typo, sorry, “or else P(true null) = 1” should be “or else P(true null) = 0.”


TM 03.12.16 at 9:49 pm

94: “p-value claims are not about the population. They are claims about our sample from the population. We are willing to tolerate a 5% chance that the sample is not a fair representation of the population and that our remarkable result is due to an unrepresentative sample.”

No, this is backwards. We are testing a hypothesis about the population, not about the sample. And we do assume – we have to assume – that the sample is “fair”, i.e. it is a true random sample. A 5-head run isn’t an “unfair representation” of flipping a fair coin, it’s just a somewhat rare outcome. Such an outcome was nevertheless to be expected, with probability 1/32; it is NOT evidence of anything going wrong (gust of wind, coin not flipped properly). Of course it could well be that something went wrong with the experiment, but that is an issue outside of statistics. If the experiment wasn’t conducted properly, statistics can’t fix it.


TM 03.12.16 at 9:57 pm

Ok I guess one could say that a significance test is a statement about the sample, more precisely about the likelihood of observing the sample if the population conforms to the hypothesis. But the hypothesis is about the population, we don’t hypothesize about the sample.


Dean C. Rowan 03.12.16 at 10:15 pm

TM @97 & 98: Taken together, I agree with these statements, both in deference to your authority and because it’s what I intended. A significance test is about the sample, and a hypothesis is about the population. We generalize from our observation of the sample to the population when and if the significance test gives us the confidence to do so. Statistics CAN give us a clue as to whether or not the experiment was properly conducted. Thus, a five-head result could be evidence that something is amiss, but not evidence of anything in particular being amiss. Yet if we review our tally, say, and realize we mistakenly coded two flips as heads when we remember observing two tails, then we credit the significance test for alerting us to the sampling error. Granted, a coding error isn’t precisely a sampling error in the sense that we failed to assure a random sample, but the illustration with which we’re working–a single coin flipped five times–renders the question of sampling error moot. The population is the single coin, hence it isn’t really a sample at all.


Bob 03.12.16 at 10:22 pm

I think that John’s original post gets the concepts right, but I feel that the whole thing could be put more succinctly, without being too mathematically economical.

A p-value is what is known as a conditional probability–the probability of an event occurring GIVEN that some other event has occurred, or is a fact, in the world that we are sampling from. With respect to John’s coin tossing example, the p-value answers the following question: “Given that this coin we are tossing (NOTE: not coins in general, but THIS coin) is fair, what is the probability of getting the result that we did?” A fair coin will produce 10 heads in 10 tosses, just by chance, in only (1/2)^10 of all trials. So if we get 10 heads in 10 tosses, we can conclude that either (i) something very unusual has occurred; or (ii) this is not a fair coin.

This is a frequentist approach to hypothesis testing. It proceeds by ASSUMING a state of the world (the coin we are tossing is fair) and then asking questions about the probability of getting the evidence that we did (10 heads in 10 tosses) IF that assumption is correct. If that probability is low–often, but by no means necessarily, less than 5%, it depends on your purposes–then the procedure rejects the assumption about the world. And 95% of the time we will make the right decision—i.e., the right inference from our data–if we follow this procedure. But we can say nothing about this particular case or the probability that this particular coin is fair. It is either fair or it is not; there is no probability about it. This may, after all, be one of the rare cases when a fair coin produces an extreme result. We will never know. But if we conduct all of our experiments in this way, we will be correct in our conclusions 95% of the time. Of this much we can be sure.
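The long-run side of this procedure can be worked out exactly for a small case. The sketch below assumes a two-sided test on a coin that really is fair, and counts how often the "reject when p is at most 5%" rule would wrongly reject it. The function names and the particular two-sided definition (add up all outcomes no more likely than the one observed) are choices made for illustration. It covers only the runs in which the assumption of fairness is in fact true.

```python
from fractions import Fraction
from math import comb


def outcome_probs(n: int) -> list[Fraction]:
    """Probability of each head count 0..n for a fair coin flipped n times."""
    return [Fraction(comb(n, k), 2**n) for k in range(n + 1)]


def two_sided_p_value(k: int, n: int) -> Fraction:
    """Total probability of outcomes no more likely than k heads, under fairness."""
    probs = outcome_probs(n)
    return sum((q for q in probs if q <= probs[k]), Fraction(0))


def false_rejection_rate(n: int, alpha: Fraction = Fraction(1, 20)) -> Fraction:
    """How often a fair coin gets rejected when we reject whenever p <= alpha."""
    probs = outcome_probs(n)
    rejected = (probs[k] for k in range(n + 1) if two_sided_p_value(k, n) <= alpha)
    return sum(rejected, Fraction(0))


print(two_sided_p_value(10, 10))   # 10 heads in 10 tosses
print(false_rejection_rate(10))    # share of fair-coin runs that get rejected
```

For 10 tosses the rule rejects a fair coin in 11 out of 512 runs, comfortably under 5%, because head counts come in lumps and the threshold cannot be hit exactly.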

What is counter-intuitive about this approach is that it seems “upside down.” Everything we understand about science, or the way that detectives work, suggests that the EVIDENCE (10 heads in 10 tosses) should be the given, or the condition, or the “state of the world” (after all, it is what we observed!) and that we should be asking the following, inverse question: “Given the evidence of our experiment (10 heads in 10 tosses), what is the probability that our hypothesis that the coin is fair is true?” (The accused was found in possession of a knife with the victim’s blood on it, was seen running from the scene of the crime, with said blood dripping from said knife, etc. etc. In the light of/given all of this evidence, the jury asks itself, what is the probability that the accused is guilty? And is that probability high enough to be “beyond a reasonable doubt”?)

In fact, this frequentist procedure is SO counter-intuitive, we SO want to make probability statements about the hypothesis GIVEN the evidence, that many people, even trained scientists, often get the meaning of the p-value wrong. They speak of the probability of the hypothesis, conditional on the evidence, instead of the probability of the evidence, conditional on the hypothesis.

Now, there is a way, the Bayesian approach, to work from the evidence to the probability of the hypothesis. But to do so we need more information. First we need to make some assumption, or have some prior evidence, about the probability of coins IN GENERAL (i.e., not just this particular coin) being fair. As John rightly points out, absent any encounter with Mysterioso, the probability that a coin found at random is fair is extremely high. So high in fact that in the equations that we use to apply Bayes’ Rule the rarity of 10 heads in 10 tosses is going to be heavily discounted. Second, we need to say something about the probability of getting the evidence that we did given each of the possible cases of a coin NOT being fair. This is hard, because there are an infinite number of ways for a coin to not be fair, and only one way for it to be fair. There are ways to address this difficulty, but I just want to make the point here that putting things in an intuitive, “upside right” fashion, while possible, is by no means straightforward and does involve some subjective assumption making; whereas the frequentist approach does not have this requirement.


PaulB 03.12.16 at 11:48 pm

I wrote something once about coin tosses and p-values.


John Holbo 03.13.16 at 12:50 am

“Pick up 64 coins, 32 Mysterioso, 32 yours, on average 33 times you should get 5 heads in 5 flips”

OK, obviously it doesn’t really matter if I screwed it up – the details of one case, that is – but I got it right, didn’t I?

Mysterioso’s coins succeed in getting 5-heads 31 out of 32 and fail 1 out of 32. That’s the stipulation. Fair coins get 5-heads one in 32. That’s science fact! (or whatever it is.)

So this is the vivid way to think through it!

Imagine a grid of 64 squares that are all your chances in this ‘all our coins are mixed and I just picked one randomly and flipped 5-heads’ sitch. First, divide in 2. 32 fair coin chances. 32 trick coin chances. Even odds which (same number of each sort of coins rolling on the road). Next, you color in 1 square on the fair side (that’s the 5-heads square.) Next you color in 31 squares on the trick side. Now you have 32 colored-in squares. These are your 5-heads options. So, in flipping 5-heads, you got one of these colored-in ones. But which? Well, only 1 is a fair coin one and 31 are trick coin ones. So, 31 out of 32, you just flipped a trick coin. Ta-DA!
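The grid can be counted out mechanically. This is just a sketch of the coloring-in exercise under the stipulation in this comment (trick coins give 5-heads 31 times in 32, fair coins 1 time in 32, equal numbers of each); the function name and defaults are illustrative.

```python
from fractions import Fraction


def color_grid(squares: int = 64) -> tuple[int, int]:
    """Return (fair colored squares, trick colored squares) for the 5-heads grid."""
    fair_side = trick_side = squares // 2      # even odds of either sort of coin
    fair_colored = fair_side * 1 // 32         # fair coins: 5-heads 1 time in 32
    trick_colored = trick_side * 31 // 32      # trick coins: 5-heads 31 times in 32
    return fair_colored, trick_colored


fair_colored, trick_colored = color_grid()
share_trick = Fraction(trick_colored, fair_colored + trick_colored)
print(fair_colored, trick_colored, share_trick)
```

The count gives 1 fair square and 31 trick squares among the 32 colored ones, matching the 31 out of 32 figure.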

I’ll just repeat the completion of the original thought: the point is to construct a case in which the crazy parody of reason I started with (if there’s less than a 5% likelihood that I could flip 5-heads, there must be a greater than 95% chance it’s a trick coin) is right. By coincidence. Which is the only way it could be right.

“Everything we understand about science, or the way that detectives work, suggests that the EVIDENCE (10 heads in 10 tosses) should be the given, or the condition, or the “state of the world” (after all, it is what we observed!) and that we should be asking the following, inverse question: “Given the evidence of our experiment (10 heads in 10 tosses), what is the probability that our hypothesis that the coin is fair is true?”

Thanks, Bob, that was helpful.


Ronan(rf) 03.13.16 at 12:57 am

hang on, i asked this at 14


John Holbo 03.13.16 at 12:58 am

Sorry, Ronan, missed that. But I got it right, right? Or am I the confused one about my own case?


Ronan(rf) 03.13.16 at 12:59 am

Sorry, mea culpa, perhaps i didnt


Ronan(rf) 03.13.16 at 1:06 am

John Holbo. Honestly I dont know. I’m confused by it to be honest. But the threads been interesting none the less.


John Holbo 03.13.16 at 1:34 am

To sum up: the reason p-values statements are unintuitive is that they are backwards (to what we intuitively figure science should be about) in two ways.

1) Rather than saying ‘given the evidence, what’s the likelihood of the hypothesis’ they are saying ‘given the hypothesis, what’s the likelihood of this evidence’.

2) The hypothesis it speaks to is the null-hypothesis, not the ‘interesting’ one (publication-worthy one).

So we are saying something backwards about something we feel we shouldn’t be talking about. Hence the temptation to say that maybe we just flip this thing and it’s the right answer concerning what interests us? But no.


LFC 03.13.16 at 1:50 am

I have read neither the entirety of the OP nor every word of the comments, but I have a question. (If it’s already been asked and answered, my apologies.)

Why are you teaching “a spot of” contemporary social psychology along with Plato? What is the point of that? (I ask this as someone who has read some Plato, albeit not very recently, but no Jonathan Haidt.)


John Holbo 03.13.16 at 1:59 am

“Why are you teaching “a spot of” contemporary social psychology along with Plato? What is the point of that?”

Thanks for asking. My module is “Reason and Persuasion”, same as the book. But the course is different from the book. Over the years, I always do some version of the following: 6 weeks on 3 Plato dialogues. Then 6 weeks on some very contemporary, non-philosophy (not written by academic philosophers) readings in which the issues and ideas and problems in the three dialogues are rearing their ugly/lovely heads again. Over the years I’ve done public policy stuff and some popular science writing. In recent years I’ve focused on psychology. Haidt is useful because he starts with ‘divided minds’. And, explicitly with Plato. He wants to say that Plato is right that ‘the soul is divided’ but wrong about how the divisions go. Basically, the idea is to think hard about whether Plato is out of date, or still wise after all these years, and how much the templates for thinking that Plato lays down are still applicable after all these years. For me, that last one is a pretty big selling point. I think that if you have some intellectual standard gambits and counters you’ve internalized, with labels that go back to Plato, you can read contemporary stuff with a bit more comprehension. ‘OK, in Plato terms, Haidt is basically doing this and denying that.’

So actually I’m doing more than a spot of psychology with my students. But I’m only doing a spot of discussion on the replication crisis in psychology, per se. Because I don’t want to have them read pop psych books that are enthusiastically trumpeting results that we now know don’t replicate.


John Holbo 03.13.16 at 2:02 am

In case you are wondering why I insist on 6-weeks of non-philosophy in a philo module it’s because, admin wise, I’m teaching what my institution calls a GE – general ed – module. So I’m supposed to be making philo interesting for non philo majors and not just doing the usual in-house thing.


Bill Benzon 03.13.16 at 2:08 am

I like 107. This null hypothesis stuff always struck me as being a bit dodgy.


John Holbo 03.13.16 at 2:11 am

I like teaching pop books of various sorts. We academics may disdain them because they play fast and loose. But I tend to think students get a lot out of them and we academics should therefore swallow our finickyness about little ways they may get it wrong. (I’m wrecking their brains enough by making them read Plato. They oughta be able to relax with something easier.)


js. 03.13.16 at 2:14 am

I’m wrecking their brains enough by making them read Plato. They oughta be able to relax with something easier.

The Nicomachean Ethics is the obvious answer here.


John Holbo 03.13.16 at 2:20 am

Fair enough. Aristotle is a sensible guy. Glad you liked it, Bill!


js. 03.13.16 at 2:27 am

That joke was irresistible (probably only to me!) But seriously, I don’t get p values at all and this post (and thread!) were totally helpful. Thanks all!


Dean C. Rowan 03.13.16 at 2:30 am

This is all much clearer to me now, your goal, I mean, in addressing these questions. So now I have an intuitive itch to scratch. Why not have your students read just a bit of Feyerabend? It’s non-pop with a pop vibe. You can dance to it.


LFC 03.13.16 at 2:35 am

Thanks for the reply, JH. (And I also found parts of the post/thread helpful.)


RNB 03.13.16 at 3:07 am

Here’s part of Jordan Ellenberg’s discussion of the p value problem (pages 145-46) :

“Imagine yourself a haruspex; that is, your profession is to make predictions about future events by sacrificing sheep and then examining the features of their entrails…You do not, of course, consider your predictions to be reliable merely because you follow the practices commanded by the Etruscan deities. That would be ridiculous. You require evidence. And so you and your colleagues submit all your work to the peer-reviewed International Journal of Haruspicy, which demands without exception that all published results clear the bar of statistical significance.

Haruspicy, especially rigorous evidence-based haruspicy, is not an easy gig. For one thing, you spend a lot of your time spattered with blood and bile. For another, a lot of your experiments don’t work. You try to use sheep guts to predict the price of Apple stock, and you fail; you try to model Democratic vote share among Hispanics, and you fail…The gods are very picky and it’s not always clear precisely which arrangement of the internal organs and which precise incantations will reliably unlock the future. Sometimes different haruspices run the same experiment and it works for one but not the other — who knows why? It’s frustrating…

But it’s all worth it for those moments of discovery, where everything works, and you find that the texture and protrusions of the liver really do predict the severity of the following year’s flu season, and, with a silent thank-you to the gods, you publish.

You might find this happens about one time in twenty.

That’s what I’d expect, anyway. Because I, unlike you, don’t believe in haruspicy. I think the sheep’s guts don’t know anything about the flu data, and when they match up it’s just luck. In other words, in every matter concerning divination from entrails, I’m a proponent of the null hypothesis [that there is no connection between the sheep entrails and the future]. So in my world, it’s pretty unlikely that any given haruspectic experiment will succeed.

How unlikely? The standard threshold for statistical significance, and thus for publication in IJoH, is fixed by convention to be a p-value of .05, or 1 in 20… If the null hypothesis is always true — that is, if haruspicy is undiluted hocus-pocus —then only one in twenty experiments will be publishable.

And yet there are hundreds of haruspices, and thousands of ripped-open sheep, and even one in twenty divinations provides plenty of material to fill each issue of the journal with novel results, demonstrating the efficacy of the methods and the wisdom of the gods. A protocol that worked in one case and gets published usually fails when another haruspex tries it, but experiments without statistically significant results do not get published, so no one ever finds out about the failure to replicate. And even if word starts getting around, there are always small differences the experts can point to that explain why the follow-up study didn’t succeed.”
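A quick sketch of the one-in-twenty arithmetic in that passage, assuming every experiment is independent and the null hypothesis is true in all of them. The function names are made up for illustration; nothing here comes from the book.

```python
def expected_significant(experiments: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when every null is true."""
    return experiments * alpha


def chance_of_any_significant(experiments: int, alpha: float = 0.05) -> float:
    """Chance that at least one of several independent null experiments clears the bar."""
    return 1 - (1 - alpha) ** experiments


if __name__ == "__main__":
    # One haruspex, twenty haruspices, a whole profession.
    for n in (1, 20, 1000):
        print(n, expected_significant(n), round(chance_of_any_significant(n), 4))
```

On these assumptions, a thousand pure-noise divinations would be expected to yield about fifty publishable ones, and twenty attempts give a better-than-even chance of at least one.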

Here’s a kind of executive summary of a couple of chapters from Ellenberg’s book.

“1. It’s not enough that the data be consistent with your theory; they have to be inconsistent with the negation of your theory, the dreaded null hypothesis.

2. Here’s the procedure for ruling out the null hypothesis, in executive bullet-point form:

a. Run an experiment.

b. Suppose the null hypothesis is true, and let p be the probability (under that hypothesis) of getting results as extreme as those observed.

c. The number p is called the p-value. If it is very small, rejoice; you get to say your results are statistically significant. If it is large, concede that the null hypothesis has not been ruled out.

3. So : significance. In common language it means something like “important” or “meaningful.” But the significance test that scientists use doesn’t measure importance. When we’re testing the effect of a new drug, the null hypothesis is that there is no effect at all; so to reject the null hypothesis is merely to make a judgment that the effect of the drug is not zero. But the effect could still be very small—so small that the drug isn’t effective in any sense that an ordinary non-mathematical Anglophone would call significant.

4. Twice a tiny number is a tiny number. How good or bad it is to double something depends on how big that something is. Risk ratios are much easier for the brain to grasp than tiny splinters of probability like 1 in 7,000. But risk ratios applied to small probabilities can easily mislead you.

5. A significance test is a scientific instrument, and like any other instrument, it has a certain degree of precision. If you make the test more sensitive—by increasing the size of the studied population, for example—you enable yourself to see ever-smaller effects. That’s the power of the method, but also its danger. The truth is, the null hypothesis, if we take it literally, is probably just about always false.

6. If only we could go back in time to the dawn of statistical nomenclature and declare that a result passing Fisher’s test with a p-value of less than 0.05 was “statistically noticeable” or “statistically detectable” instead of “statistically significant”! That would be truer to the meaning of the method, which merely counsels us about the existence of an effect but is silent about its size or importance.”
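Point 5 can be made concrete with a rough sketch. It assumes a two-sided z-test with a known standard deviation of 1, and it assumes the observed sample mean lands exactly on a tiny true effect of 0.01 standard deviations. The function name and the numbers are illustrative assumptions, not anything from Ellenberg.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_in_sd: float, n: int) -> float:
    """Two-sided z-test p-value when the sample mean sits `effect_in_sd`
    standard deviations away from the null value, based on n observations."""
    z = abs(effect_in_sd) * sqrt(n)
    return 2 * (1 - NormalDist().cdf(z))


if __name__ == "__main__":
    # Same tiny effect, ever larger samples.
    for n in (100, 10_000, 40_000, 1_000_000):
        print(n, two_sided_p(0.01, n))
```

The effect never changes size; only the sample does. With a hundred observations it is nowhere near significant, and with forty thousand it slips under .05, which is the sense in which "significant" says nothing about how big or important the effect is.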


John Holbo 03.13.16 at 5:18 am

Thanks for that. Funny, I’m actually friends with Jordan, and I know of his book, so I should have just gone and checked it out. Instead, I reinvented the wheel. But probably that was more educational for me, anyway.

My proudest mathematical moment was when he actually cited me, sort of, as the source of some mathematical thinking. A math prodigy citing me. I’m a math idiot. But, once, it happened. Never again, I’m sure.


Ronan(rf) 03.13.16 at 5:29 am

I actually (honestly) read Ellenberg’s book. But I disagreed with half of it. As Ray Liotta says. The good half.


Ronan(rf) 03.13.16 at 5:31 am

Sorry. I did actually read it, but that joke makes no sense.


RNB 03.13.16 at 6:29 am

I cited Naomi Oreskes above, but I fear that there may be a statistical error here, as persuasive as I find what she says (perhaps you’ll find it!)

I don’t have Ellenberg’s book with me, but looking over this quickly, it sure reminds me of what I read there.


TM 03.13.16 at 10:19 am

100: “And 95% of the time we will make the right decision—i.e., the right inference from our data–if we follow this procedure.”

Important quibble: the chance of a false positive is 5% or less. The chance of a false negative however is usually not known and could be quite high.

That is essentially what Oreskes has criticized: scientific conservatism dictates minimizing false positives, but maybe we should be more concerned about false negatives? The example of toxicity testing was brought up by mbw @29. Under the precautionary principle, one should balance the risk of not banning a toxic substance against the risk of banning the substance when in fact it is not toxic. Or the risk of not acting against global warming against the risk of acting prematurely, or some such. In the case of Climate Change, I think it’s a red herring. Political inaction can’t be blamed on p-values. Scientific evidence is strong, and remaining uncertainties don’t justify inaction. Deniers have exploited the fact that warming between 2000 and 2012 wasn’t statistically significant, but that was always a transparent abuse of statistical methods. Climate Change denialism can’t be construed as a failure of scientific method, it’s squarely a political failure. And of course, changing the scientific methodology will do nothing to change the politics.


rwschnetler 03.13.16 at 11:27 am

One of the best threads on Crookedtimber ever.


faustusnotes 03.13.16 at 11:52 am

John Holbo, I think this is exactly wrong:

Rather than saying ‘given the evidence, what’s the likelihood of the hypothesis’ they are saying ‘given the hypothesis, what’s the likelihood of this evidence’.

As an example, here is Rutherford:

It were as likely as shooting a cannon ball at a piece of tissue and having the cannon ball bounce back.

This null hypothesis vs. alternative hypothesis approach is the main way that we do experimental reasoning.

To use my example above, how do we decide the globe is warming? We check to see if the trend in the temperature is greater than 0. It doesn’t matter what your specific mathematical process for doing that check is, the reasoning is always the same: set up a null hypothesis (no warming) and test to see if what you observe fits it (a slope in the temperature series).

There is no other way. You can use Bayesian stats or p values or whatever but ultimately you’ll be doing a Rutherford: positing a null hypothesis under an existing theory and breaking it.

If it was good enough for Michelson and Morley, it should be good enough for you!


Karl 03.13.16 at 12:08 pm

I would make a small modification to TM@87, which is that a null doesn’t have to be an equality – it can be an inequality. If p is the probability of heads, you could test the null that p < 0.5. Now, 5 heads in a row would lead you to reject the null: if 5 heads is too many for p = 0.5, then it must also be too many for p < 0.5. On the other hand, 5 tails in a row would lead you to reject p = 0.5, but there's no amount of tails in a row that would cause you to reject p < 0.5, because your null includes p = 0.

But if your null is p < 0.3, then even 3 heads in a row is too many, since it only comes up with a 2.7% chance.
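A minimal sketch of the streak arithmetic so far, assuming independent flips. For a one-sided null like "heads-probability below some bound", the most favourable case for the null is right at the bound, so that is the value used. Both function names are hypothetical.

```python
def heads_streak_p_value(streak: int, null_upper_bound: float) -> float:
    """Largest chance of `streak` heads in a row under any heads-probability
    allowed by a null of the form 'p below null_upper_bound' (taken at the bound)."""
    return null_upper_bound ** streak


def tails_streak_p_value(streak: int, null_lower_bound: float = 0.0) -> float:
    """Largest chance of `streak` tails in a row when the null allows
    heads-probabilities all the way down to `null_lower_bound`."""
    return (1 - null_lower_bound) ** streak


if __name__ == "__main__":
    print(heads_streak_p_value(5, 0.5))   # five heads against 'p below 0.5'
    print(heads_streak_p_value(3, 0.3))   # three heads against 'p below 0.3'
    print(tails_streak_p_value(5))        # five tails against 'p below 0.5'
```

The last call shows why no run of tails can reject that null: it includes p = 0, under which tails every time is a certainty.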

The alternative is not, strictly speaking, an inequality. It's the opposite of the null. If the null is p = 0.3, then the alternative is p != 0.3, with no information about whether it's larger or smaller. If the null is p = 0.3. And if the null is p 0.3


John Holbo 03.13.16 at 12:38 pm

But the likelihood of a cannonball bouncing off a piece of tissue is pushing nowhere near the upper limit of that p< 0.05 line, faustus. He's rhetorically invoking (I dunno) 500,000 flips coming up heads, by chance, not just 5. Suppose Rutherford had said: it were as likely to toss a fair coin five times, and have heads show face up each time?


John Holbo 03.13.16 at 1:16 pm

More seriously, the point isn’t that this form of reasoning is valueless, faustus, just that the nature and strength of the inference can get confusing, hence exaggerated.

(But it would be great if Sherlock Holmes’ motto were: my dear Watson, once you have eliminated something less than 5% likely to be the truth, whatever remains, however implausible, must be the truth!)


John Holbo 03.13.16 at 1:17 pm

“One of the best threads on Crookedtimber ever.”

Thanks! I agree!


faustusnotes 03.13.16 at 1:22 pm

But objections to p-values are always presented (and yours was too!) as a problem with the fundamental philosophy, not the choice of threshold. You yourself said that the problem is arse backwards (I quoted you stating what the problem is). Your objection is not the p-value, it’s the logical process. You said it here:

the reason p-values statements are unintuitive is that they are backwards (to what we intuitively figure science should be about) in two ways

I don’t see anything about the magnitude of the threshold!

The thing is, you can’t propose an alternative that uses a different logic. Can you?


faustusnotes 03.13.16 at 1:23 pm

I should say, your objection is not the size of the p value, it’s the logical process.


John Holbo 03.13.16 at 1:31 pm

OK, quadruple-posting is bad form but what the hell. The Rutherford example is good for highlighting a different angle: you observe something incredibly unlikely, if the null-hypothesis is true. Bye-bye null. But there are, perhaps, all the same, many forms that denial of the null-hypothesis can take. So you can’t just flip null unlikeliness into likely truth of some alternative. We don’t accept either the null hypothesis or Rutherford’s alternative to it these days, after all.

This is not to deny the value of this sort of reasoning, as I keep saying. (It’s not like I have a better idea about how to reason.)


John Holbo 03.13.16 at 1:38 pm

OK, my 132 crossed with 130. I will simply repeat myself: if I have at any point erroneously implied that I think this is not a valuable way to reason, as far as it goes, I obviously did not intend that. For example, the quote faustus seizes on:

“the reason p-values statements are unintuitive is that they are backwards (to what we intuitively figure science should be about) in two ways ”

I could be saying something completely mad, yes. But there’s also a sensible reading, which I would encourage the reader to consider attributing to me.

Just because a piece of equipment works differently than we tend to assume, doesn’t mean it doesn’t work at all. But, if a piece of equipment works differently than we think it works, we are more likely to misuse it, other things equal. It’s good to be clear how it works.


John Holbo 03.13.16 at 1:39 pm

“I should say, your objection is not the size of the p value, it’s the logical process.”

I should say, that would be madness. If I have failed to sufficiently guard against this reading, it is only because the notion of objecting to the logical process itself – rather than seeking to understand it and its limits – literally never crossed my mind.


faustusnotes 03.13.16 at 2:17 pm

You have failed to guard against this reading, John. You said p-values are “unintuitive” because they get the logical process backwards, and identified exactly where. This is obviously a criticism of the structure of the reasoning, not the fact that you think they only work in rare cases. And why should they? We do comparisons that depend on not so great differences all the time. For example, differences between the sexes, or between occupational groups – these may not be so great as to drive huge p values. In the case of Rutherford and Michelson Morley they were comparing one theory with another theory, so we can expect them to fail huge. But that is not what stats is for – stats is for comparing differences between groups that share a common theory.

To use the example I put up at Gelman’s, suppose you’re trying to identify whether there are different rates of female genital mutilation (FGM) between muslims and non-muslims in Nigeria. Many different religious and ethnic groups practice this to some extent, and you’re not comparing two radically different physical theories – you’re simply trying to identify if there’s a difference. If that comparison annoys you, think of a difference in malaria risk between wealth quintiles, which could be due to differences in uptake of insecticide treated nets or could be due to environmental/structural differences (more sources of stagnant water, etc). This is a case of small differences that arise from realistic application of a theory of social determinants of health, not two conflicting theories and a carefully designed experiment that can distinguish between them. But it’s an important difference – who to target and how. Your experiment is going to come down to relatively small differences.

The advocates of alternatives to p-values typically don’t think of how to handle these problems. But I can bet you that whatever approach they take will end up using the same logical reasoning (null vs. alternative). The only difference will be the particular mathematics they use to compare their two hypotheses. Do you have a suggestion about some radical alternative that would change the structure of the reasoning? Or is your argument purely that we shouldn’t boil it down to a single number that is formally equivalent to one of the major alternatives?


Karl 03.13.16 at 2:19 pm

My comment @126 got mangled, possibly for my ignoring the role of inequality signs in html. Trying again, the last paragraph should be:

The alternative is not, strictly speaking, an inequality. It’s the opposite of the null. If the null is p = 0.3, then the alternative is p != 0.3, with no information about whether it’s larger or smaller. If the null is p < 0.3, then the alternative is p >= 0.3. And if the null is p <= 0.3, then the alternative is p > 0.3.


John Holbo 03.13.16 at 2:53 pm

“You have failed to guard against this reading, John. You said p-values are “unintuitive” because they get the logical process backwards, and identified exactly where. This is obviously a criticism of the structure of the reasoning, not the fact that you think they only work in rare cases.”

Well, here again I can apparently be read in a mad way (that’s good to know) but also in a sensible way and I throw myself on the mercy of the court as to which I meant. (Let the null hypothesis be: Holbo is completely mad. What is the likelihood of my post? Eh, maybe that reasoning could go either way …)

“p-values are “unintuitive” because they get the logical process backwards”

No, read my comment again. I don’t say the logical process is backwards. I say it is backwards with respect to how people expect p-values are going to work (or figure into the work.) But that’s their fault – the people – not the process’. (See OP for details.) I don’t fault p-values themselves for people’s failure to understand what they are and how they work. They are what they are, and they are valuable, but they aren’t necessarily quite what people think.


John Holbo 03.13.16 at 3:04 pm

“Do you have a suggestion about some radical alternative that would change the structure of the reasoning?”

When I wrote “It’s not like I have a better idea about how to reason,” I sincerely meant to deny that I have a better idea about how to reason. Seriously. There is nothing wrong with the structure of the reasoning, inherently. We just need to see it for what it is, the better to appreciate its limits in cases where people may mistakenly think they can use it in ways they can’t, or shouldn’t. (As to whether there is some other way to get those people where they want to go, in such cases? That’s a case-by-case basis, surely.)


Bob 03.13.16 at 6:02 pm

“Do you have a suggestion about some radical alternative that would change the structure of the reasoning?”

First, I agree with John, there is nothing wrong with the reasoning per se. But it can be misunderstood, due to its counter-intuitive nature, and it can also be abused—perform enough experiments and inevitably 5% of them will turn up something “significant.”

But there are also a couple of things that can be done to address the problems inherent in p-values—one frequentist, one Bayesian.

Within the frequentist school, you can introduce the concept of the “power” of the test, as a companion to “significance.” As various commenters have noted, the p-value is good if you primarily want to avoid false positives. (TM’s “important quibble” @123 above is quite correct.) The null hypothesis is like the accused in Anglo-Saxon criminal justice—innocent until proven guilty. And the bar for “guilty” is set very high. Users of p-values, whether they realise it or not, would prefer to see hundreds of “guilty” null hypotheses go free in order to prevent even one “innocent” null hypothesis from being convicted.

The concept of “power” introduces trade-offs between avoiding false positives and avoiding false negatives—it says that avoiding false positives comes at a price that we need to recognize. Specifically the “power” of a test is the probability that the test will lead to acceptance of the alternative to the null hypothesis when the alternative hypothesis is in fact correct. It is equal to 1 minus the false negative rate. It asks the following question: “Conditional on, or given that, (i) the alternative hypothesis is true, AND (ii) the experimenter will only reject the null based on the p-value being less than 5% (i.e., conditional on the null only being rejected if, conditional on the null being true, it would only have a 5% chance of producing the evidence that we saw), what is the probability that the alternative hypothesis will be accepted?” Put in different terms, it asks what is the probability that, if the alternative hypothesis is true, it would produce evidence (e.g., a sample mean) that lies in the zone where we would accept the null IF IT WERE TRUE. If we minimise the latter probability, then we maximise the power. But, all other things equal, greater power only comes at the expense of less significance (i.e., a higher false positive rate). (This is much easier to illustrate with a diagram of overlapping normal distributions–one centered on the null, one centered on the alternative–but cumbersome, I see, to express in words.)
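Since the diagram is hard to put into words, here is a small numerical sketch of the same trade-off using the thread's coin instead of normal curves. It assumes 5 independent flips, a null of a fair coin, and an alternative of a coin that lands heads 75% of the time; the rule is "reject the null if at least `min_heads` of the flips are heads". The names and the 75% alternative are assumptions for illustration.

```python
from math import comb


def tail_probability(flips: int, min_heads: int, p_heads: float) -> float:
    """Chance of at least `min_heads` heads in `flips` independent flips."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(min_heads, flips + 1)
    )


def size_and_power(flips: int, min_heads: int, p_null: float, p_alt: float):
    """False positive rate under the null, and power under the alternative."""
    return (
        tail_probability(flips, min_heads, p_null),
        tail_probability(flips, min_heads, p_alt),
    )


if __name__ == "__main__":
    for min_heads in (5, 4):
        alpha, power = size_and_power(5, min_heads, p_null=0.5, p_alt=0.75)
        print(min_heads, round(alpha, 4), round(power, 4))
```

Demanding all five heads keeps the false positive rate at 1 in 32 but catches the 75% coin less than a quarter of the time. Loosening the rule to four or more heads raises the power to roughly 63%, at the cost of a false positive rate of 6 in 32.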

Note that, as was the case with the frequentist concept of significance, the concept of power relies on ASSUMING the truth of a hypothesis, in this case the alternative to the null, and then asking what is the probability of the alternative producing evidence in a certain range of values that would lead to the alternative being accepted. In other words it is the same “backwards” idea of looking at evidence conditional on a hypothesis being true, instead of the more intuitive working from the evidence to the probable truth of an hypothesis. Specifying the alternative hypothesis correctly is also a problem for this approach.

For a description of the Bayesian approach, see the last paragraph to my comment 100 above. As discussed there, this approach allows us to “invert” the logic and make a statement about the probability of the hypothesis GIVEN the evidence.

Some dogmatic frequentists DO believe this to be a radical, and even illegitimate, alternative to the frequentist significance/power paradigm. The charge of illegitimacy gets into some very philosophical, and to me very interesting questions, regarding probability as “degree of belief” versus probability as “relative frequency over time.” The frequentists object to the idea of attaching a probability to a hypothesis. In their view an hypothesis is either true or it is not, and it is only the evidence—the sample mean—that takes on values according to chance. But we speak of probable guilt in criminal cases all the time. And the Bayesian approach gives us, I think, a legitimate way to talk about the probability of hypotheses. Where both Bayesians and frequentists agree, however, is that it is totally wrong to use a p-value to make an inference about the probability of a hypothesis.


anon/portly 03.13.16 at 6:44 pm

OK, obviously it doesn’t really matter if I screwed it up – the details of one case, that is – but I got it right, didn’t I?

Mysterioso’s coins succeed in getting 5-heads 31 out of 32 and fail 1 out of 32. That’s the stipulation. Fair coins get 5-heads one in 32. That’s science fact! (or whatever it is.)

You’re right, I missed that. However, if Mysterioso’s coins (call them 99.37 coins, since (31/32)^.2 is close to that percentage) only get 5 heads 31 out of 32 times, now every time you get 5 heads, whether it’s a fair coin or a Mysterioso 99.37 coin, you do get 5 heads “just by chance.” Sure the “chance” is a lot higher with a Mysterioso 99.37 coin, but it’s still just by chance.

Consider a Mysterioso 75 coin, where it lands Heads 75% of the time. A Mysterioso 75 coin will only come up Heads 5 straight times with a probability of about 24%, which is (more clearly than with a Mysterioso 99.37 coin) “just by chance.”

If I calculated it correctly, which is no sure thing, then if you put 992 of these coins plus 243 fair coins into a jar, and selected one at random and flipped it 5 times and got 5 Heads, once again you would have a 1/32 chance that this would be a fair coin. So once again the p-value would coincide with the actual probability of getting a fair coin, i.e. the actual probability of falsely rejecting the null.

Is the Mysterioso 75 example less insightful, more insightful, or equally insightful in comparison to the Mysterioso 99.37 example? I haven’t the faintest idea….

It seems like there is a continuum of Mysterioso coins from 99.99 to 50.01, where, as long as you have the right mix of Mysterioso coins and fair coins, when you get 5 Heads there is a 1/32 chance this happened “just by chance” in the sense of being a fair coin (so the null is falsely rejected) and a 31/32 chance it happened “just by chance” because it was a Mysterioso coin (and the null is correctly rejected). But as you move along this continuum from 99.99 to 50.01, the “chanciness” of getting 5 Heads with the Mysterioso coin increases. This is why I think the Mysterioso example, which is pretty great with the Mysterioso 99.37 coins, is even better with Mysterioso 100 coins, the ones which always come up Heads.

If nothing else, with a Mysterioso 100 coin you can’t make a type-2 error (failing to reject the null when you should) whereas with a Mysterioso 99.37 coin you can. (If I understand any of these things correctly).

[Just to go off the rails even further, consider Mysterioso 25 coins, which land Heads 25% of the time – put 992 of these plus 1 fair coin into a jar, and once again if you pick one at random and get 5 Heads there will be (assuming correct calculation) a 1/32 chance it was a fair coin and a 31/32 chance it was a Mysterioso 25 coin – of course now you falsely reject the null either way].
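The jar arithmetic in this comment can be checked with exact fractions. This is only a sketch, assuming a coin is drawn uniformly at random from the jar and flipped independently; `posterior_fair` is a made-up name.

```python
from fractions import Fraction


def posterior_fair(n_fair: int, n_trick: int, trick_heads: Fraction, flips: int = 5) -> Fraction:
    """Chance the drawn coin is fair, given that it came up heads on every flip.

    Plain Bayes' rule: weight each kind of coin by how many are in the jar
    times how likely that kind is to produce the all-heads streak.
    """
    weight_fair = n_fair * Fraction(1, 2) ** flips
    weight_trick = n_trick * Fraction(trick_heads) ** flips
    return weight_fair / (weight_fair + weight_trick)


if __name__ == "__main__":
    # 992 Mysterioso 75 coins plus 243 fair coins.
    print(posterior_fair(243, 992, Fraction(3, 4)))
    # 992 Mysterioso 25 coins plus 1 fair coin.
    print(posterior_fair(1, 992, Fraction(1, 4)))
```

Both mixes come out at exactly 1/32, so the calculation above holds up. The match with the p-value depends entirely on the jar being stocked in just those proportions; change the counts and the two numbers come apart.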


John Holbo 03.13.16 at 11:28 pm

“you do get 5 heads “just by chance.” Sure the “chance” is a lot higher with a Mysterioso 99.37 coin, but it’s still just by chance.”

Yeah, I thought about writing about this. Obviously, in a sense, everything happens ‘by chance’, even dead certain things. 100% is a perfectly respectable chance, after all! What we see here is an example of the pragmatic unclarity of ‘likelihood that it happened by chance’. People use ‘by chance’ to mean ‘significantly beat the odds’. Or ‘there is no underlying mechanism that would tend in this direction’. He only hit the target ‘by chance’. He was shooting blindly. Suppose you are blindfolded and shooting, trying to hit a barn door right next to you and you succeed. Did you succeed by chance? Suppose you are taking a hail mary 3-point shot with 1 second to go and it goes in, and you are actually a pretty good 3-point shooter. Did you succeed by chance? Eh.

Maybe we use ‘by chance’, informally, to refer to things less than 50% likely to happen, happening. But that’s obviously not a viable formal definition.

I tried to make the Mysterioso case match our ordinary pragmatic usage. That probably means I infected it with some confusion, by proxy.


John Holbo 03.13.16 at 11:32 pm

“Is the Mysterioso 75 example less insightful, more insightful, or equally insightful in comparison to the Mysterioso 99.37 example? I haven’t the faintest idea….”

I think it’s the same in that it shows how a thing we may wrongly suppose is the normal situation – one in which values neatly invert – is really not normal in the least. You need to rig it just so, and the world won’t do that, probably.

But it’s fun to think about it!


faustusnotes 03.14.16 at 1:34 am

I’d like to seize on this as an example of why I think the criticism is wrong, from Bob above:

In other words it is the same “backwards” idea of looking at evidence conditional on a hypothesis being true, instead of the more intuitive working from the evidence to the probable truth of an hypothesis.

We can’t do the latter with observational data; we have to do the former as a surrogate for the latter.

I’m a gamer, I spend a lot of time rolling dice and I hang out with a lot of non-scientists who roll dice. The issue of loaded dice comes up a lot (every die is loaded one way or another so we always seek the die or set of dice that loads in our favour). The mechanism for identifying this loadedness is very simple: assume a hypothesis, roll the dice a bunch, and make a judgment about whether the numbers fit the hypothesis. This is exactly the method of hypothesis testing.

What you can’t do in this instance is design an experiment to work back to your theory. You can do that with, say, the Michelson Morley experiment, because you’re comparing two fundamentally different theories, so in one theory something will happen and in the other it won’t. But often even then, identifying the thing that happened often relies on hypothesis testing (e.g. in Millikan’s oil drop experiment you still need an average charge and its 95% CI).

This idea of careful experimentation to get to a hypothesis is largely a fallacy. What happens in practice is you infer your hypothesis from the data you can construct – which is usually done through a hypothesis test.

Also, isn’t it the case that if you’re doing a test of the dice that I described above, or that has been mentioned above with the coin toss, it can also be seen as a Bayesian experiment with a Dirichlet prior on the probability of the event, p? This has to be the case surely, since in the large sample limit any Bayesian credible interval will converge to the frequentist one. Given the large sample similarity of the two frameworks, there surely has to be some relationship between the testing philosophies in the small sample situation.


John Holbo 03.14.16 at 2:08 am

I am undertaking a bit of self-study in frequentist vs. Bayesian theory. I have not progressed to the point where my knowledge is worth sharing with others, even in a comment box. But damn did I spend 100’s of hours rolling dice and calculating chances in tabletop games. Those were the days.


Dean C. Rowan 03.14.16 at 3:56 am

Time to read our Wesley Salmon. When we’re acknowledging that 100% is “by chance,” we are using language to miss a point.


John Holbo 03.14.16 at 4:46 am

If it weren’t for 0% chance, lots of people would have no chance at all. Have a heart.


faustusnotes 03.14.16 at 5:16 am

Whyever did you stop John? They just get better with age!


John Holbo 03.14.16 at 5:36 am

I had kids. They get better with age, too. Bigger, anyway.


TM 03.14.16 at 8:57 am

Karl 136: A null hypothesis should always be specific (i.e. an equality) because that is what allows us to calculate the sampling distribution for that *specific* hypothesis and determine the likelihood of the observed sample within that distribution.

It is important to understand the concept of a statistically testable hypothesis and not mix it up with intuitive notions (which I know is hard). Null hypothesis and alternative hypothesis are not interchangeable. In my experience, formulating correct null hypotheses is among students’ biggest difficulties. Again, in my view statistics instruction is a failure if students don’t really understand these fundamental concepts, even if they somehow manage to do the calculations.


JoB 03.14.16 at 9:58 am

I found instructive at both a theoretical, philosophical and practical level. Anyway, it does not hurt to read about Henry Kyburg, Jr. His is an attempt to create a framework where frequentist, logical and Bayesian intuitions on the matter can come together (in a very frequentist way).

With 149 I believe the problem with statistics is not particularly the math but the basics of formulating the probabilistic inference. I believe humans are very good at that informally but in formalizing the models to cope with bigger-than-life data sets people tend to get lost in the models. They apply the formal models in ways which, if they would think it through, clash outright with their intuitions.


Manta 03.14.16 at 2:24 pm

First of all, a wonderful thread.

Second, my 2 cents.
(If I understood things correctly) we are in the following situation: you are, in your life, asked many questions; to each of them you can answer “yes”, “no”, or “I don’t know”.
An answer “I don’t know” is *always* correct; however, you want to answer as many questions as possible with a “yes” or “no”, subject to the condition that you get only 5% of your answers wrong. If you follow the rule “give an answer only when p < 0.05" you will be wrong less than 5% of the times in your life. For instance, if you live in a universe where ALL coins are fair, if you answer "yes, the coin is rigged" to Holbo's problem, you will be wrong: but that will happen only 5% of the times in your life, even in this extreme situation where all coins are fair (or, equivalently, when you KNOW that the coin is fair).

In other words, I think the "p < 0.05" should not be read as a likelihood of being wrong in a particular given case ("is this coin rigged?", even if we know a priori that the coin is fair!), but as a move in a general strategy on how to minimize errors and still give useful answers.


mdc 03.14.16 at 2:40 pm

“Obviously, in a sense, everything happens ‘by chance’, even dead certain things.”

Wow! This would be a profound- and to me shocking- metaphysical insight, if it were true.


Manta 03.14.16 at 10:36 pm

A different analogy.

Suppose you are playing the following game against the Devil.
The game is in many (ideally, infinitely many) rounds.
In each round, first you choose a strategy for how to answer questions, and tell this strategy to the Devil.
Then the Devil gives you an urn with a large number of balls, some of them black and some white. You pick 5 balls, look at them, and then guess whether or not at least half the balls in the urn are black; you can either answer “yes” or “I don’t know”.

At the end of the game
1) you must be wrong at most 1/32 of the time (that is, have answered “yes” when the correct answer was “no”),
and 2) you should guess correctly as many urns as possible (that is, answer “I don’t know” as few times as possible).
Notice that in this example the Bayesian analysis breaks down: you cannot give any a priori probability to how many black balls are in each urn.

But it still makes sense to use the “p <= 1/32” strategy (that is, to answer “yes” when all 5 balls are black, and “I don’t know” in all other cases).
(The “p <= 1/32” strategy is the only deterministic strategy that satisfies the above conditions: if in a given round you choose, e.g., to answer “yes” also when you get 4 out of 5 black balls, the Devil would give you an urn with equal numbers of black and white balls, and you would guess wrongly too many times; on the other hand, if you decide that in a given round you answer “I don’t know” even when you pick 5 black balls, the Devil will give you an urn with only black balls, and you will waste a round.)
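
The arithmetic behind that parenthesis can be checked exactly. This is a minimal sketch that assumes the urn is large enough for the five draws to be treated as independent (a binomial approximation, which is my assumption and not something stated in the game); `prob_yes` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

def prob_yes(black_fraction, min_black, draws=5):
    """P(at least min_black of the draws are black), with independent draws."""
    q = Fraction(black_fraction)
    return sum(comb(draws, k) * q**k * (1 - q)**(draws - k)
               for k in range(min_black, draws + 1))

half = Fraction(1, 2)
print(prob_yes(half, 5))  # "yes" only on 5 black: 1/32
print(prob_yes(half, 4))  # "yes" on 4 or 5 black: 3/16
```

Against an evenly split urn, the stricter rule says "yes" 1/32 of the time, while the looser rule says "yes" 6/32 of the time, which is over the budget.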

On the other hand, I don’t see the justification for “p < 0.05” as a criterion for which papers to publish and which to reject (that is, the strategy above ensures that you will answer wrongly at most 1/32 of ALL questions, not of all questions where you say “yes”: in fact, the Devil could give you only urns with equal quantities of black and white balls, and 100% of the papers you publish would be wrong).

I hope I didn’t make mistakes in the above analysis…
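
The difference between the two denominators can be made concrete with a small calculation. The sketch assumes the Devil mixes two kinds of urn: evenly split ones, where a "yes" counts as wrong (as in the publishing remark above), and mostly black ones with a hypothetical black fraction of 0.9. Draws are treated as independent, and `error_rates` is an illustrative name.

```python
def error_rates(share_even, black_fraction_other=0.9, draws=5):
    """Return (wrong / all rounds, wrong / "yes" answers) for a mix of urns.

    share_even: fraction of rounds in which the urn is evenly split.
    """
    p_yes_even = 0.5 ** draws                     # wrong "yes" on an even urn
    p_yes_other = black_fraction_other ** draws   # correct "yes" otherwise
    wrong = share_even * p_yes_even
    yes = wrong + (1 - share_even) * p_yes_other
    return wrong, (wrong / yes if yes > 0 else 0.0)

for share in (0.1, 0.5, 0.9, 1.0):
    over_all, over_yes = error_rates(share)
    print(f"even urns {share:.0%}: wrong/all = {over_all:.4f}, wrong/yes = {over_yes:.4f}")
```

The first number never exceeds 1/32, whatever the Devil does; the second depends entirely on the mix and reaches 1 when every urn is evenly split.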


John Holbo 03.14.16 at 11:35 pm


faustusnotes 03.15.16 at 12:49 am

I’d like to add to the coin toss example a little, though I guess this thread is dead now. If you were to try a Bayesian approach to the coin toss example you would end up using exactly the same logic as a frequentist approach, just with different numbers.

Under a Bayesian approach with some prior you would get an estimate of the posterior distribution of the probability of a head, say p=0.5 with 95% credible interval 0.42 to 0.58. You would then have to make a judgement about whether this is fair. 0.5 seems fair but what about that interval? Is 0.44 fair? What about 0.42? How can you tell? Perhaps you present this interval to a committee of referees before the next world cup and they want to be sure that their coin tosses will be fair?

Fortunately you can check: you calculate the posterior probability distributions of observing 1 head in a row, two heads in a row, etc. You present this to the committee of refs and they say: Look, the 95% credible interval for the probability of getting 5 heads in a row is below our preferred threshold for an unfair coin. This coin is fair!

And what would the threshold be? Around about 0.05, I bet …
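
For anyone who wants to try the calculation, here is a minimal Monte Carlo sketch. It assumes a uniform Beta(1, 1) prior and hypothetical data of 75 heads in 150 flips, neither of which comes from the comment; with those numbers the interval for the probability of a head comes out at roughly 0.42 to 0.58. The interval for five heads in a row is obtained by raising each posterior draw to the fifth power.

```python
import random

def credible_interval(samples, mass=0.95):
    """Equal-tailed interval read off the sorted Monte Carlo samples."""
    s = sorted(samples)
    lo = s[int(len(s) * (1 - mass) / 2)]
    hi = s[int(len(s) * (1 + mass) / 2) - 1]
    return lo, hi

rng = random.Random(0)
heads, tails = 75, 75  # hypothetical data, chosen for illustration

# A Beta(1, 1) prior with binomial data gives a Beta(1 + heads, 1 + tails) posterior.
draws = [rng.betavariate(1 + heads, 1 + tails) for _ in range(20_000)]

p_lo, p_hi = credible_interval(draws)
s_lo, s_hi = credible_interval([p ** 5 for p in draws])
print(f"P(head): {p_lo:.3f} to {p_hi:.3f}")
print(f"P(5 heads in a row): {s_lo:.4f} to {s_hi:.4f}")
```

Whether the second interval counts as "fair" still depends on a threshold the committee has to pick, which is the point of the comment.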


TM 03.15.16 at 12:14 pm

Manta: “general strategy on how to minimize errors and still give useful answers” Correct, but applies only to Type I errors. Type II errors are not minimized, see 123.


RNB 03.15.16 at 7:21 pm

Running out the door, and I am wondering whether there is any connection between Popper’s falsificationism and Ellenberg’s idea that the data have to be inconsistent with the null hypothesis. At any rate, I have always stumbled on Popper’s falsificationism. Can it be related in any way to the discussion we are having? No idea–just asking.


mbw 03.18.16 at 12:04 am

@52 As it happened, this signal wasn’t weak so there was very little uncertainty in its arrival times. The delay was, IIRC, 7 msec. That determines a cone of possible directions away from the straight line between the detectors.

I think it’s great that you’re asking these questions and wish more people would do the same. The less we have to fall back on authority the better.


mbw 03.18.16 at 12:10 am

@ RNB Yes, I believe that the p-value approach has been seen as a sort of implementation of Popperian falsificationism. Perhaps Bayesian approaches are a better fit with more realistic versions of that philosophy, e.g. Lakatos.


mbw 03.18.16 at 12:13 am

@TM One problem with that approach is that many problems have no natural null hypothesis at all. Trying to cram general parameter estimation into that canned framework often leads to absurdities. I think we’ve discussed some of these cases on previous posts here.

Comments on this entry are closed.